Mathematical reasoning, a cornerstone of advanced human cognition, reveals the complexity of human intelligence. It involves logical thinking and specialized knowledge, expressed not only in words but also in images, and it is crucial for assessing understanding capabilities. This has practical uses in AI. However, current AI datasets often focus narrowly, lacking a full exploration of combining visual language understanding with mathematics.
While Large Language Models (LLMs) and Large Multimodal Models (LMMs) demonstrate remarkable problem-solving abilities across diverse tasks, their aptitude for mathematical reasoning in visual contexts remains understudied. To address this gap, researchers from UCLA, the University of Washington, and Microsoft introduce MATHVISTA, a benchmark that amalgamates challenges from various mathematical and visual tasks. The benchmark comprises 6,141 examples sourced from 28 existing multimodal datasets related to mathematics and three newly developed datasets (IQTest, FunctionQA, and PaperQA). Successfully completing these tasks requires nuanced visual understanding and complex compositional reasoning, posing difficulties even for the most advanced foundation models.
In the paper, the authors introduce MATHVISTA, a comprehensive benchmark for mathematical reasoning in visual contexts. They propose a task taxonomy to guide its development, identifying seven types of mathematical reasoning and focusing on five primary tasks: figure question answering (FQA), geometry problem solving (GPS), math word problem (MWP), textbook question answering (TQA), and visual question answering (VQA). The benchmark encompasses a diverse range of visual contexts, such as natural images, geometry diagrams, abstract scenes, synthetic scenes, figures, charts, and plots. MATHVISTA incorporates 28 existing multimodal datasets, comprising 9 math-targeted question-answering (MathQA) datasets and 19 VQA datasets.
The researchers extensively tested 12 leading foundation models: three Large Language Models (ChatGPT, GPT-4, and Claude-2), two proprietary Large Multimodal Models (GPT-4V and Bard), and seven open-source LMMs. They evaluated these models on MATHVISTA in zero-shot and few-shot settings with chain-of-thought (CoT) and program-of-thought (PoT) prompting strategies. The figure above shows examples from the newly annotated datasets: IQTest, FunctionQA, and PaperQA.
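In broad strokes, the two prompting strategies differ in what the model is asked to produce: CoT elicits a natural-language reasoning chain, while PoT elicits an executable program whose output is taken as the answer. The sketch below uses hypothetical prompt templates (not the paper's exact wording) to illustrate the distinction:

```python
# Minimal sketch of chain-of-thought (CoT) vs. program-of-thought (PoT)
# prompting for a visual math question. Templates are illustrative only,
# not the exact prompts used in the MATHVISTA evaluation.

def build_cot_prompt(question: str, image_context: str) -> str:
    """CoT: ask the model to reason step by step in natural language."""
    return (
        f"Image context: {image_context}\n"
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer."
    )

def build_pot_prompt(question: str, image_context: str) -> str:
    """PoT: ask the model to write code; running it yields the answer."""
    return (
        f"Image context: {image_context}\n"
        f"Question: {question}\n"
        "Write a Python program that computes the answer and prints it."
    )

if __name__ == "__main__":
    q = "What is the total height of the two stacked bars?"
    ctx = "Bar chart: bar A = 3, bar B = 5."
    print(build_cot_prompt(q, ctx))
    print(build_pot_prompt(q, ctx))
```

For text-only LLMs such as GPT-4, the `image_context` slot would be filled with externally generated captions or OCR text, since the model cannot see the image itself.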
The findings reveal that CoT GPT-4, the best-performing text-based model without visual enhancements, achieves an overall accuracy of 29.2%. In comparison, the best-performing multimodal model, Bard, achieves 34.8%, representing 58% of human performance (34.8% vs. 60.3%). When PoT GPT-4 is augmented with Bard-generated captions and OCR text, it reaches 33.9%, closely matching Multimodal Bard.
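The "58% of human performance" figure is simply the ratio of the two overall accuracies, as a quick check confirms:

```python
# Relative performance of Multimodal Bard vs. humans on MATHVISTA,
# using the overall accuracies reported above.
bard_acc = 34.8   # Multimodal Bard overall accuracy (%)
human_acc = 60.3  # human overall accuracy (%)

relative = bard_acc / human_acc * 100
print(f"{relative:.1f}% of human performance")  # 57.7%, reported as ~58%
```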
Further analysis suggests that Bard's shortcomings stem from incorrect calculations and hallucinations driven by failures in visual perception and textual reasoning. Notably, GPT-4V, the most recent multimodal version of GPT-4, achieves a state-of-the-art accuracy of 49.9%, a significant 15.1% improvement over Multimodal Bard, as reported in the first comprehensive evaluation using MATHVISTA. As the field continues to advance, this work contributes valuable insights for further refining mathematical reasoning in multimodal AI systems.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Janhavi Lande is an Engineering Physics graduate from IIT Guwahati, class of 2023. She is an aspiring data scientist and has been working in the world of ML/AI research for the past two years. She is most fascinated by this ever-changing world and its constant demand for humans to keep up with it. In her spare time she enjoys traveling, reading, and writing poems.