Within the realm of artificial intelligence, Large Multimodal Models (LMMs) have exhibited remarkable problem-solving capabilities across diverse tasks, such as zero-shot image/video classification, zero-shot image/video-text retrieval, and multimodal question answering (QA). Nonetheless, recent studies highlight a considerable gap between powerful LMMs and expert-level artificial intelligence, particularly in tasks involving complex perception and reasoning with domain-specific knowledge. This paper aims to bridge that gap by introducing CMMMU, a pioneering Chinese benchmark meticulously designed to evaluate LMMs' performance on an extensive array of multi-discipline tasks, guiding the development of bilingual LMMs toward expert-level artificial intelligence.
CMMMU (Chinese Massive Multi-discipline Multimodal Understanding) stands out as one of the most comprehensive benchmarks of its kind, comprising 12,000 manually collected Chinese multimodal questions sourced from college exams, quizzes, and textbooks. These questions span six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. Other statistics are shown in Table 2 of the paper. The benchmark not only evaluates LMMs on complex reasoning and perception tasks but also annotates each question with detailed subfields and image types, providing valuable insight into the kinds of questions that pose challenges for LMMs. An illustrative sketch of such an annotated record follows below.
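To make the annotation scheme concrete, here is a minimal sketch of what a single CMMMU question record might contain. The field names and the example question are purely illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative record layout; the real CMMMU data format may differ.
@dataclass
class CMMMUQuestion:
    subject: str            # one of the six core disciplines
    subfield: str           # fine-grained subfield annotation
    image_type: str         # e.g. chart, table, diagram, photograph
    question: str           # Chinese question text
    options: list[str] = field(default_factory=list)  # empty for fill-in-the-blank
    answer: str = ""        # gold answer

# Hypothetical example record for illustration only.
example = CMMMUQuestion(
    subject="Science",
    subfield="Physics",
    image_type="diagram",
    question="如图所示，求物体所受合力的大小。",  # "As shown in the figure, find the magnitude of the net force."
    options=["A. 10 N", "B. 20 N", "C. 30 N", "D. 40 N"],
    answer="B",
)
```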
A three-stage data collection process ensures the richness and variety of CMMMU. In the first stage, annotation organizers, mainly the authors, collect sources adhering to license requirements. In the second stage, crowdsourcing annotators, consisting of undergraduate students and individuals with higher degrees, further annotate the collected sources, strictly following key principles to filter out unqualified questions with images. The third stage involves supplementing questions for subjects that need more representation, ensuring a balanced dataset across disciplines.
A rigorous data quality control protocol further enhances data quality. At least one of the paper's authors manually verifies each question, filtering out questions with answers that are too difficult for LMMs to extract. Moreover, questions not meeting college-level examination standards are removed. To address data contamination concerns, questions that can be correctly solved by multiple advanced LMMs simultaneously without OCR assistance are filtered out, as sketched below.
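The contamination check can be pictured as a simple rule: if several strong LMMs all answer a question correctly without seeing the image (no OCR), the question likely leaks its answer through text alone and is dropped. A minimal sketch under that reading, assuming a hypothetical `query_model` helper and illustrative field names:

```python
# Sketch of a contamination-style filter; query_model(model, text, options) is a
# hypothetical helper that returns a model's answer WITHOUT showing it the image.

ADVANCED_LMMS = ["gpt-4v", "qwen-vl-plus"]  # illustrative model names

def text_only_solvable(question: dict, query_model) -> bool:
    """True if every advanced LMM answers correctly from the text alone."""
    return all(
        query_model(m, question["text"], question["options"]) == question["answer"]
        for m in ADVANCED_LMMS
    )

def filter_contaminated(questions: list[dict], query_model) -> list[dict]:
    # Keep only questions that are not trivially solvable from text alone
    # by all strong models at once.
    return [q for q in questions if not text_only_solvable(q, query_model)]
```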
The evaluation covers both large language models (LLMs) and large multimodal models (LMMs), considering both closed-source and open-source implementations. A zero-shot evaluation setting is used instead of fine-tuning or few-shot settings because it provides a raw assessment of a model's ability to generate accurate answers on multimodal tasks. A systematic, rule-based evaluation pipeline, incorporating robust regular expressions and specific rules for different question types, ensures a comprehensive evaluation. Finally, micro-average accuracy is adopted as the evaluation metric.
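The following is a minimal sketch of what such a rule-based pipeline can look like for multiple-choice questions: a regular expression pulls the chosen option letter out of a free-form response, and micro-average accuracy pools correct answers over all questions. The regex patterns and helper names are assumptions for illustration, not the paper's actual implementation.

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull a multiple-choice letter (A-D) out of a free-form model response."""
    # Prefer explicit patterns such as "Answer: B" or "答案是 C".
    match = re.search(r"(?:answer|答案)\s*(?:is|是|为|[::])?\s*([A-D])", response, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fall back to the last standalone option letter in the response.
    letters = re.findall(r"\b([A-D])\b", response.upper())
    return letters[-1] if letters else None

def micro_average_accuracy(predictions: list[str | None], answers: list[str]) -> float:
    """Micro-average accuracy: correct predictions over all questions, pooled across subjects."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Toy usage example.
responses = ["The correct answer is B.", "答案是 C", "I am not sure."]
gold = ["B", "C", "A"]
preds = [extract_choice(r) for r in responses]
print(micro_average_accuracy(preds, gold))  # 2 of 3 correct ≈ 0.667
```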
In addition, the paper presents an extensive error analysis of 300 samples, showcasing instances where even top-performing LMMs, such as Qwen-VL-Plus and GPT-4V, answer incorrectly. The analysis, distributed across 30 subjects, highlights the challenges that lead advanced LMMs astray and underscores the long journey ahead toward expert-level bilingual LMMs. Even the most advanced closed-source LMMs, GPT-4V and Qwen-VL-Plus, achieve only 42% and 36% accuracy, respectively, indicating significant room for improvement.
Interestingly, the study reveals a smaller performance gap between open-source and closed-source LMMs in a Chinese context than in English. While the most powerful open-source LMM, Qwen-VL-Chat, achieves an accuracy of 28%, a 14% gap compared to GPT-4V, the corresponding gap in English is 21%. Notably, Yi-VL-6B, Yi-VL-34B, and Qwen-VL-Chat outperform other open-source LMMs on CMMMU, emphasizing their potential in the Chinese language domain. Yi-VL-34B even narrows the performance gap between open-source LMMs and GPT-4V on CMMMU to 7%.
In conclusion, the CMMMU benchmark represents a significant advance in the quest for Artificial General Intelligence (AGI). It serves as a meticulous evaluator of the latest Large Multimodal Models (LMMs), gauging their fundamental perceptual skills, intricate logical reasoning, and deep domain-specific expertise. By comparing LMMs' performance on CMMMU and MMMU, this research offers insight into the reasoning capability of bilingual LMMs in Chinese and English contexts, paving the way for AGI that rivals seasoned professionals across diverse fields.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS at the Indian Institute of Technology (IIT), Kanpur. He is a Machine Learning enthusiast and is passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.