Home Community This AI Research Uncovers the Mechanics of Dishonesty in Large Language Models: A Deep Dive into Prompt Engineering and Neural Network Evaluation

This AI Research Uncovers the Mechanics of Dishonesty in Large Language Models: A Deep Dive into Prompt Engineering and Neural Network Evaluation

0
This AI Research Uncovers the Mechanics of Dishonesty in Large Language Models: A Deep Dive into Prompt Engineering and Neural Network Evaluation

Understanding large language models (LLMs) and promoting their honest conduct has change into increasingly crucial as these models have demonstrated growing capabilities and commenced widely adopted by society. Researchers contend that latest risks, resembling scalable disinformation, manipulation, fraud, election tampering, or the speculative risk of lack of control, arise from the potential for models to be deceptive (which they define as “the systematic inducement of false beliefs within the pursuit of some end result apart from the reality”). Research indicates that even while the models’ activations have the mandatory information, they might need greater than misalignment to supply the proper result. 

Previous studies have distinguished between truthfulness and honesty, saying that the previous refrains from making false claims, while the latter refrains from making claims it doesn’t “imagine.” This distinction helps to make sense of it. Subsequently, a model may generate misleading assertions owing to misalignment in the shape of dishonesty reasonably than an absence of skill. Since then, several studies have tried to deal with LLM honesty by delving right into a model’s internal state to search out truthful representations. Proposals for recent black box techniques have also been made to discover and provoke massive language model lying. Notably, previous work demonstrates that improving the extraction of internal model representations could also be achieved by forcing models to contemplate a notion actively. 

Moreover, models include a “critical” intermediary layer in context-following environments, beyond which representations of true or incorrect responses in context-following are inclined to diverge a phenomenon often called “overthinking.” Motivated by previous studies, the researchers broadened the main focus from incorrectly labeled in-context learning to deliberate dishonesty, during which they gave the model explicit instructions to lie. Using probing and mechanical interpretability methodologies, the research team from Cornell University, the University of Pennsylvania, and the University of Maryland hopes to discover and comprehend which layers and a focus heads within the model are accountable for dishonesty on this context. 

The next are their contributions: 

1. The research team shows that, as determined by considerably below-chance accuracy on true/false questions, LLaMA-2-70b-chat could be trained to lie. In keeping with the study team, this could be quite delicate and must be fastidiously and quickly engineered. 

2. Using activation patching and probing, the research team finds independent evidence for five model layers critical to dishonest conduct. 

3. Only 46 attention heads, or 0.9% of all heads within the network, were effectively subjected to causal interventions by the study team, which forced deceptive models to reply truthfully. These treatments are resilient over several dataset splits and prompts. 

In a nutshell the research team looks at a simple case of lying, where they supply LLM instructions on whether to inform the reality or not. Their findings reveal that massive models can display dishonest behaviour, producing right answers when asked to be honest and erroneous responses if pushed to lie. These findings construct on earlier research that implies activation probing can generalize out-of-distribution when prompted. Nevertheless, the research team does discover that this will necessitate lengthy prompt engineering attributable to problems just like the model’s tendency to output the “False” token sooner within the sequence than the “True” token. 

By utilizing prefix injection, the research team can consistently induce lying. Subsequently, the team compares the activations of the dishonest and honest models, localizing the layers and a focus heads involved in lying. By employing linear probes to research this lying behavior, the research team discovers that early-to-middle layers see comparable model representations for honest and liar prompts before diverging drastically to change into anti-parallel. This might show that prior layers must have a context-invariant representation of truth, as desired by a body of literature. Activation patching is one other tool the research team uses to grasp more in regards to the workings of specific layers and heads. The researchers discovered that localized interventions could completely address the mismatch between the honest-prompted and liar models in either direction. 

Significantly, these interventions on a mere 46 attention heads reveal a solid degree of cross-dataset and cross-prompt resilience. The research team focuses on lying by utilizing an accessible dataset and specifically telling the model to lie, in contrast to earlier work that has largely examined the accuracy and integrity of models which are honest by default. Because of this context, researchers have learned an important deal in regards to the subtleties of encouraging dishonest conduct and the methods by which big models engage in dishonest behavior. To ensure the moral and secure application of LLMs in the true world, the research team hopes that more work on this context will result in latest approaches to stopping LLM lying.


Try the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to affix our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the newest AI research news, cool AI projects, and more.

If you happen to like our work, you’ll love our newsletter..


Aneesh Tickoo is a consulting intern at MarktechPost. He’s currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed toward harnessing the facility of machine learning. His research interest is image processing and is obsessed with constructing solutions around it. He loves to attach with people and collaborate on interesting projects.


✅ [Featured AI Model] Try LLMWare and It’s RAG- specialized 7B Parameter LLMs

LEAVE A REPLY

Please enter your comment!
Please enter your name here