
This AI Paper Evaluates LLMs’ Ability to Adapt to Recent Variants of Existing Tasks


The remarkable performance of language models (LMs) suggests that large-scale next-word prediction can effectively distill the knowledge in text corpora into interactive agents. LMs have achieved impressive results on a variety of natural language processing benchmarks, surpassing state-of-the-art methods and even outperforming humans on tasks that require complex reasoning. Nevertheless, it is crucial to determine whether their success stems from task-general reasoning skills or from recognizing and recalling the specific tasks encountered during pre-training.

Prior research has mainly focused on instance-level generalization, which data-contamination issues can complicate. In this study, the researchers instead investigate the generalizability of LMs to new task variants by altering the conditions or rules under which well-performing tasks are carried out. The general reasoning procedure for these tasks remains unchanged, but the specific input-output mappings are modified. These new tasks, termed counterfactual tasks, deviate from the default conditions and measure the model's task-level generalizability.

The researchers propose a suite of 11 counterfactual evaluation tasks spanning multiple categories and domains, including deductive reasoning, code generation, drawing, and spatial reasoning. While the reasoning procedure remains consistent across the original tasks and their counterfactual variants, the input-output mappings differ. The evaluation thus measures the flexibility of LMs in adapting to new task variations.
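To make the setup concrete, below is a minimal sketch of what a default/counterfactual pair could look like for an arithmetic-style task, where the rule being varied is the number base. The prompt wording and answer checking here are illustrative assumptions, not the authors' actual evaluation harness.

```python
# Illustrative sketch of a default vs. counterfactual arithmetic probe.
# Same reasoning procedure (addition), different rule (number base).
# NOT the paper's exact prompts or parsing logic.

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (up to base 16)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append("0123456789abcdef"[n % base])
        n //= base
    return "".join(reversed(digits))

def make_prompt(a: int, b: int, base: int) -> str:
    # Default condition: base 10. Counterfactual condition: e.g., base 9.
    return (
        f"Assume all numbers are written in base-{base}. "
        f"What is {to_base(a, base)} + {to_base(b, base)}?"
    )

def is_correct(model_answer: str, a: int, b: int, base: int) -> bool:
    # The answer must be expressed in the same (possibly counterfactual) base.
    return model_answer.strip().lower() == to_base(a + b, base)

print(make_prompt(57, 24, 10))  # default task: "What is 57 + 24?"
print(make_prompt(57, 24, 9))   # counterfactual variant: "What is 63 + 26?"
```

A model relying on genuine, rule-following addition should handle both prompts; a model that has merely memorized base-10 patterns will degrade on the counterfactual one.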


The performance of GPT-4, GPT-3.5, Claude, and PaLM-2 is evaluated on both the default and counterfactual conditions of the tasks. The results indicate that while LMs show above-random counterfactual performance, it consistently degrades relative to the default settings; this suggests that the models' success on these tasks can be attributed in part to default-condition-specific behaviors rather than abstract, generalizable reasoning skills.

The findings also reveal intriguing relationships between model behavior on default and counterfactual tasks: correlations between default and counterfactual performance, the effectiveness of zero-shot chain-of-thought prompting, and interactions between task-level and instance-level frequency effects. Overall, even slight variations on the default instantiations of tasks present challenges for LMs, indicating that the success of existing models should not be attributed solely to a general capability for the target task.
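As a rough illustration of the correlation analysis mentioned above, the sketch below computes a Pearson correlation between per-task default and counterfactual accuracies. The accuracy values are placeholders for illustration only, not numbers from the paper.

```python
from statistics import mean

# Placeholder per-task accuracies (illustrative only, NOT the paper's results):
default_acc = [0.95, 0.90, 0.85, 0.80, 0.75]
counterfactual_acc = [0.65, 0.60, 0.50, 0.45, 0.40]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# A high correlation would mean models that do better under the default
# condition also tend to do better under its counterfactual variant,
# even though absolute performance drops.
print(f"r = {pearson(default_acc, counterfactual_acc):.3f}")
```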


Check out the Paper. Don't forget to join our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com



Niharika


Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.


