Researchers from FAIR Meta, HuggingFace, AutoGPT, and GenAI Meta address the challenge of testing the capabilities of general AI assistants on real-world questions that require fundamental skills such as reasoning and multi-modality handling, which remain difficult even for advanced AI systems that produce human-like responses. The development of GAIA aims to move toward Artificial General Intelligence by targeting human-level robustness.
Focusing on real-world questions that require reasoning and multi-modality skills, GAIA diverges from the current trend of devising tasks that are ever harder for humans, instead emphasizing questions that are conceptually simple for humans yet difficult for advanced AIs. Unlike closed or synthetic settings, GAIA mirrors realistic AI assistant use cases. It features rigorously curated, non-gameable questions, prioritizing quality and demonstrating human superiority over GPT-4 with plugins. The benchmark also aims to guide question design, ensuring multi-step completion and preventing data contamination.
As LLMs surpass existing benchmarks, evaluating their abilities becomes increasingly difficult. Despite the emphasis on ever more complex tasks, the researchers argue that tasks difficult for humans do not necessarily challenge LLMs. To address this, a new benchmark called GAIA has been introduced: a benchmark for General AI Assistants that focuses on real-world questions and avoids common pitfalls of LLM evaluation. With human-crafted questions that reflect everyday AI assistant use cases, GAIA ensures practicality. By targeting open-ended generation in NLP, GAIA aims to redefine evaluation benchmarks and advance the next generation of AI systems.
The proposed method uses the GAIA benchmark to test general AI assistants. The benchmark consists of real-world questions prioritizing reasoning and practical skills, designed by humans to prevent data contamination and allow efficient, factual evaluation. The evaluation process employs a quasi-exact match to align model answers with the ground truth, with the required answer format enforced through a system prompt. A developer set and 300 additional questions have been released to establish a leaderboard. The methodology behind GAIA aims to evaluate open-ended generation in NLP and provide insights to advance the next generation of AI systems.
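As an illustration only, not GAIA's official scoring code, a quasi-exact match between a model answer and the ground truth could look like the following minimal sketch, which normalizes strings and compares numeric answers as numbers (the function names and normalization rules here are assumptions):

```python
import re


def normalize(text: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop surrounding punctuation."""
    text = text.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip(" .,!?\"'")


def quasi_exact_match(model_answer: str, ground_truth: str) -> bool:
    """Return True if the model answer matches the ground truth after light
    normalization; answers that parse as numbers are compared numerically."""
    pred, gold = normalize(model_answer), normalize(ground_truth)
    try:
        # Compare numerically when both sides parse as numbers (e.g. "1,000" vs "1000").
        return float(pred.replace(",", "")) == float(gold.replace(",", ""))
    except ValueError:
        return pred == gold


# Example usage
print(quasi_exact_match("  1,000 ", "1000"))        # True
print(quasi_exact_match("Paris.", "paris"))         # True
print(quasi_exact_match("Paris, France", "Paris"))  # False: stricter than fuzzy matching
```

Because expected answers are short strings or numbers, this kind of check stays cheap and unambiguous, which is what makes the leaderboard comparisons factual rather than judged.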
The GAIA evaluation revealed a substantial performance gap between humans and GPT-4 on real-world questions: humans achieved a success rate of 92%, while GPT-4 with plugins scored only 15%. However, the evaluation also showed that LLMs' accuracy and range of use cases can be improved by augmenting them with tool APIs or web access. This opens possibilities for collaborative human-AI models and for advances in next-generation AI systems. Overall, the benchmark provides a clear ranking of AI assistants and highlights the need for further improvements in the performance of general AI assistants.
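To make the idea of tool augmentation concrete, here is a minimal, hypothetical sketch of an assistant loop that lets an LLM request a web search before answering; `call_llm` and `search_web` are placeholders standing in for whatever model API and search tool are available, not APIs from the paper or any specific library:

```python
from typing import Callable


def answer_with_tools(
    question: str,
    call_llm: Callable[[str], str],    # placeholder: wraps any chat/completion API
    search_web: Callable[[str], str],  # placeholder: wraps any web-search tool
) -> str:
    """Minimal tool-augmented loop: ask the model whether it needs a search,
    optionally fetch results, then ask for a short, factual answer."""
    # Step 1: let the model decide whether external information is needed.
    decision = call_llm(
        f"Question: {question}\n"
        "If you need a web search to answer, reply exactly: SEARCH: <query>. "
        "Otherwise reply exactly: ANSWER."
    )

    context = ""
    if decision.startswith("SEARCH:"):
        # Step 2: run the requested search and pass the results back as context.
        query = decision.removeprefix("SEARCH:").strip()
        context = search_web(query)

    # Step 3: request a concise, unambiguous answer, matching the short-answer
    # format that GAIA-style scoring expects.
    return call_llm(
        f"Question: {question}\n"
        f"Context: {context}\n"
        "Answer with a single short string or number, no explanation."
    )
```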
In conclusion, GAIA's benchmark for evaluating General AI Assistants on real-world questions shows that humans outperform GPT-4 with plugins. It highlights the need for AI systems to exhibit human-like robustness on conceptually simple yet complex questions. The benchmark methodology's simplicity, non-gameability, and interpretability make it an effective tool for tracking progress toward Artificial General Intelligence. Moreover, the release of annotated questions and a leaderboard aims to address open-ended generation evaluation challenges in NLP and beyond.
Check out the Paper and Code. All credit for this research goes to the researchers of this project. Also, don't forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
If you like our work, you will love our newsletter.
Hello, my name is Adnan Hassan. I'm a consulting intern at Marktechpost and soon to be a management trainee at American Express. I'm currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I'm passionate about technology and want to create new products that make a difference.