UCSD Researchers Evaluate GPT-4’s Performance in a Turing Test: Unveiling the Dynamics of Human-like Deception and Communication Strategies

A group of researchers from UCSD tested GPT-4 in a public Turing test on the web. The best-performing GPT-4 prompt passed in 41% of games, outperforming the baselines set by ELIZA (27%) and GPT-3.5 (14%), but it fell short of both chance and the baseline set by human participants (63%). The results showed that participants based their judgments mainly on linguistic style (35% of the total) and social-emotional traits (27%). Neither participants' education nor their prior experience with LLMs predicted their ability to detect the deception, suggesting that even people who are well-versed in such matters may be vulnerable to being fooled. While the Turing Test has been widely criticized for its shortcomings as a measure of intelligence, the two researchers from the University of California, San Diego maintain that it remains useful as a gauge of spontaneous communication and deception. AI models that can pass as humans could have far-reaching social effects. Thus, they examine the efficacy of various methodologies and criteria for determining human likeness.

The Turing Test is interesting for reasons unrelated to the controversy over its sensitivity to intelligence. First, it is important to examine the ability being tested: whether a system can fool an interlocutor into thinking it is human. The Turing Test offers a reliable way to track the development of such abilities in models. In addition, it lets us learn about the various factors that play a role in deception, such as model size and performance, prompting strategies, ancillary infrastructure like access to real-time information, and the interrogator's level of experience.

Second, the Turing Test can be used as a research tool to probe the common perception of what it means for a machine to seem human. It assesses machines' performance, but it also investigates human participants' cultural, ethical, and psychological presuppositions. As interrogators craft and refine their questions, they reveal their underlying beliefs about which attributes constitute being human and which of those qualities would be hardest for a machine to possess.

The researchers used a modified version of the Turing Test in which a single interrogator questions a single witness to determine whether the latter is human or artificial intelligence. This contrasts with Turing's original three-person formulation, but it has become fairly standard because it eliminates the need to have multiple human participants online for every game.

Using the OpenAI API, they developed 25 LLM witnesses that could answer questions posed by users. Each LLM witness had a temperature setting (0.2, 0.5, or 1.0), a model (GPT-3.5 or GPT-4), and a prompt. The prompt explained the situation and gave the agent specific guidance on how to proceed.
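To make the witness setup concrete, here is a minimal sketch of how such configurations might be represented. The `WitnessConfig` class, the `build_witnesses` helper, and the prompt names are hypothetical and are not taken from the paper. The study's 25 witnesses were not necessarily a full grid of every combination, so the enumeration below is purely illustrative.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class WitnessConfig:
    """One LLM witness: a model, a sampling temperature, and a prompt."""
    model: str
    temperature: float
    prompt_name: str


# Models and temperatures are the ones named in the article.
MODELS = ["GPT-3.5", "GPT-4"]
TEMPERATURES = [0.2, 0.5, 1.0]
# Prompt names are placeholders (assumption); the real prompts are in the paper.
PROMPTS = ["prompt_a", "prompt_b"]


def build_witnesses(models, temperatures, prompts):
    """Enumerate every (model, temperature, prompt) combination."""
    return [
        WitnessConfig(model, temperature, prompt)
        for model, temperature, prompt in product(models, temperatures, prompts)
    ]


witnesses = build_witnesses(MODELS, TEMPERATURES, PROMPTS)
```

With two models, three temperatures, and two placeholder prompts, this yields twelve configurations. A real setup would list only the combinations actually deployed.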

The chat app's user interface was designed to resemble a conventional messaging service. The interrogator sent the first message, and each user could only send one message at a time. Each message could be at most 300 characters long, and the total conversation time was 5 minutes. Users were prevented from pasting into the text box, and a content filter was applied to block abusive messages. The upper right corner of the screen held a countdown timer, buttons that revealed instruction overlays, and a form for reporting abusive users.
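The chat constraints described above can be sketched as a small validation routine. This is an illustration rather than the study's actual code: the `Game` class and its method names are hypothetical, and "one message at a time" is interpreted here as strict turn-taking, which is an assumption.

```python
from dataclasses import dataclass, field

MAX_MESSAGE_CHARS = 300          # per-message limit from the article
GAME_DURATION_SECONDS = 5 * 60   # total conversation time from the article


@dataclass
class Game:
    started_at: float
    # Starting with "witness" as the last sender forces the
    # interrogator to send the first message.
    last_sender: str = "witness"
    messages: list = field(default_factory=list)

    def can_send(self, sender: str, text: str, now: float) -> bool:
        """Check the time limit, length limit, and turn order."""
        if now - self.started_at > GAME_DURATION_SECONDS:
            return False
        if not text or len(text) > MAX_MESSAGE_CHARS:
            return False
        # Assumption: users alternate strictly, one message per turn.
        return sender != self.last_sender

    def send(self, sender: str, text: str, now: float) -> bool:
        """Record the message if allowed; return whether it was accepted."""
        if not self.can_send(sender, text, now):
            return False
        self.messages.append((sender, text))
        self.last_sender = sender
        return True
```

For example, a witness trying to open the conversation, or an interrogator sending two messages in a row, would be rejected by `send`.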

The interrogator used the decision interface to provide the following:

i) a verdict (machine or human witness)

ii) how confident they were that they had made the right call (0-100).

iii) a justification (free-form text) for their decision.
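The three fields collected by the decision interface could be stored as a simple record. The sketch below is hypothetical: the `Verdict` class and its field names are illustrative and do not come from the study's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    judged_human: bool   # i) the verdict: True for human, False for machine
    confidence: int      # ii) confidence in the verdict, from 0 to 100
    reason: str          # iii) free-form justification

    def __post_init__(self):
        # Reject confidence values outside the 0-100 scale.
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")


example = Verdict(judged_human=True, confidence=80, reason="Casual tone and typos")
```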

The IRB at UC San Diego approved the study. The experiment was hosted on a live website. Participants were recruited through social media platforms. They read instructions that outlined the rules of the game and consented to their data being used for scientific research.

They recruited 650 participants, who played 1,400 “games” in which they had short conversations with either a human or a GPT model. Participants were assigned a conversation partner and asked to identify whether it was a human or a machine.

Study participants were fooled 41% of the time by the best GPT-4 witness but only 5% to 14% of the time by GPT-3.5 witnesses. Human witnesses persuaded interrogators that they were not machines in only 63% of games.
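Percentages like these are success rates: the share of a witness's games in which the interrogator judged it to be human. A minimal sketch of that calculation follows. The `success_rates` function and the toy records are hypothetical and do not reproduce the study's data.

```python
from collections import defaultdict


def success_rates(games):
    """Return, per witness, the fraction of games judged 'human'.

    `games` is an iterable of (witness_label, judged_human) pairs.
    """
    totals = defaultdict(int)
    judged_human_counts = defaultdict(int)
    for witness, judged_human in games:
        totals[witness] += 1
        if judged_human:
            judged_human_counts[witness] += 1
    return {w: judged_human_counts[w] / totals[w] for w in totals}


# Toy records for illustration only (not the study's data).
toy_games = [
    ("witness_a", True),
    ("witness_a", False),
    ("witness_b", True),
    ("witness_b", True),
]
rates = success_rates(toy_games)
```

On the toy records, `witness_a` is judged human in one of two games and `witness_b` in both.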


Because it was an open, online experiment, this study has several caveats that may make its findings less than solid. The first problem is that the sample may not be representative of the general population because participants were recruited through social media. Second, participants were given no incentives, so interrogators and witnesses may not have done their best. Some human witnesses 'trolled' by acting as though they were artificial intelligence, and some interrogators cited this kind of behavior as justification for human verdicts as well. Consequently, the results may understate human performance while overestimating AI's. Third, some of the interrogators admitted that they had prior knowledge about the witness. To the extent that this occurred and interrogators did not reveal it, the researchers may have overestimated human performance. Fourth, as there was often just one user online at a time, participants were frequently paired with the same artificial intelligence witness. As a result, people had a preconceived notion that any given witness was an AI, which could have resulted in lower success rates (SR) across the board. This bias likely affected the results despite efforts to counteract it by removing games where an interrogator had played against an AI more than 3 times in succession. Finally, they only employed a small subset of the available prompts, which were developed without knowing how real people would interact with the game. The results likely understate GPT-4's potential performance on the Turing Test because more effective prompts probably exist.

To sum it up:

The Turing Test has often been criticized as an imperfect measure of intelligence. Even so, the researchers maintain that it remains a useful gauge of spontaneous communication and deception.

Check out the Paper. All credit for this research goes to the researchers of this project.


Dhanshree Shenwai is a Computer Science Engineer with experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.
