This AI Paper Unveils ‘Vary’: A Novel Approach to Expand Vision Vocabulary in Large Vision-Language Models for Advanced Multilingual Perception Tasks


Large Vision-Language Models (LVLMs) combine computer vision and natural language processing to generate text descriptions of visual content. These models have shown remarkable progress in various applications, including image captioning, visual question answering, and image retrieval. Nonetheless, despite their impressive performance, LVLMs still face challenges, particularly in specialized tasks that require dense and fine-grained perception. The issue addressed by the Vary method is the limited vision vocabulary of LVLMs for tasks that demand a more nuanced understanding of visual content.

Researchers from Huazhong University of Science and Technology, MEGVII Technology, and the University of Chinese Academy of Sciences introduced Vary, a method for enhancing LVLMs on specialized tasks requiring dense perception. It enables LVLMs to acquire new features efficiently, improving fine-grained perception. Experimental results demonstrate Vary's effectiveness across applications. Acknowledging the scope for improvement, the researchers propose Vary as a platform for further exploration. The paper notes the use of GPT-4 for generating training data and highlights Vary's applicability to numerous downstream visual tasks, expanding LVLM capabilities while maintaining the original ones.

The study addresses the constraints of common vision vocabularies, such as CLIP-ViT, in dense and fine-grained vision perception scenarios, motivating the need to scale up visual vocabularies in LVLMs. It introduces Vary, a method inspired by the way text vocabularies in LLMs are expanded for foreign languages. Vary generates a new vision vocabulary using a vocabulary network and integrates it with the original, aiming to improve encoding efficiency and model performance on diverse tasks such as non-English OCR and chart understanding. The authors anticipate that Vary's design will stimulate further research in this direction.
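The text-vocabulary analogy is worth making concrete. Expanding a text vocabulary means appending new token IDs and freshly initialized rows to the embedding table while leaving the original rows untouched; Vary applies the same idea to visual "tokens". The sketch below is illustrative only (toy vocabulary, assumed embedding size), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original text vocabulary and its embedding table (sizes are illustrative).
vocab = {"hello": 0, "world": 1}
embeddings = rng.standard_normal((len(vocab), 8))

def add_tokens(vocab, embeddings, new_tokens):
    """Append new tokens and freshly initialized embedding rows.

    Existing token IDs and their embedding rows are left untouched,
    mirroring how Vary keeps the original vision vocabulary intact.
    """
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            new_row = rng.standard_normal((1, embeddings.shape[1]))
            embeddings = np.vstack([embeddings, new_row])
    return vocab, embeddings

# Add two "foreign-language" tokens; the table grows from 2 rows to 4.
vocab, embeddings = add_tokens(vocab, embeddings, ["größe", "世界"])
print(len(vocab), embeddings.shape)
```

The point of the analogy: just as the old rows keep their meaning after new tokens are appended, Vary's original vision vocabulary keeps its behavior after the new one is added.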

The research introduces two configurations of Vary: Vary-tiny and Vary-base. Vary-tiny, specializing in fine-grained perception, lacks a text input branch and employs a tiny OPT-125M model. It is trained using document and chart data as positive samples and natural images as negatives. The vocabulary network in Vary-tiny generates a new vision vocabulary, which is integrated with the original in Vary-base. During Vary-base training, both vocabulary networks are used with their weights frozen, while the LVLM parameters and input embedding layer are optimized. Implementation details include AdamW optimization, a cosine annealing scheduler, and task-specific learning rates. Synthetic data is created for document and chart understanding.
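The Vary-base integration step can be sketched roughly as follows: an image passes through both the frozen original encoder and the frozen new one, their per-token features are concatenated, and a trainable projection maps the result into the LLM embedding space. This is a minimal numpy sketch under assumed shapes (16×16 patch grid, 1024-dim features, 4096-dim LLM embeddings); the real encoders are deep networks, and the stand-in below is just a per-patch linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(image, out_dim):
    """Stand-in for a frozen vision encoder: maps an image to 256 patch tokens.

    Patchifies a (3, 224, 224) image into a 16x16 grid of 14x14 patches and
    embeds each patch linearly. A real encoder (CLIP-ViT or SAM-style) is a
    deep network; here only the output shape matters.
    """
    c, h, w = image.shape
    p = 14
    patches = (image.reshape(c, 16, p, 16, p)
                    .transpose(1, 3, 0, 2, 4)
                    .reshape(256, c * p * p))
    weights = rng.standard_normal((c * p * p, out_dim)) * 0.01  # frozen
    return patches @ weights

image = rng.standard_normal((3, 224, 224))

original_tokens = frozen_encoder(image, 1024)  # original vision vocabulary
new_tokens = frozen_encoder(image, 1024)       # new, Vary-trained vocabulary

# Concatenate per-token features from both vocabularies, then project into
# the LLM embedding space; in Vary-base this projection is trainable while
# both encoders stay frozen.
merged = np.concatenate([original_tokens, new_tokens], axis=-1)  # (256, 2048)
proj = rng.standard_normal((2048, 4096)) * 0.01
llm_inputs = merged @ proj  # (256, 4096) token embeddings fed to the LLM

print(llm_inputs.shape)
```

The design choice worth noting is that freezing both encoders means new perception skills are added without degrading what the original vocabulary already encodes.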

Vary demonstrates promising performance across multiple tasks, excelling in document-level OCR, chart understanding, and MMVet tasks. Specifically, it achieves an ANLS of 78.2% on DocVQA and a score of 36.2% on MMVet, showcasing its competence in new document parsing capabilities. Vary-tiny and Vary-base show strong results on document OCR tasks, with Vary-base outperforming other LVLMs. While the study acknowledges Vary's success, it emphasizes the continuing need for more effective ways to scale up the visual vocabulary.

In conclusion, the study's key takeaways can be summarized in a few points:

  • Proposal: An efficient method for scaling up the vision vocabulary in LVLMs.
  • Methodology: The proposed method introduces a new vision vocabulary, generated by a vocabulary network and integrated with the original vocabulary.
  • Capabilities: The method enhances fine-grained perception, especially in document-level OCR and chart understanding tasks. The original capabilities of LVLMs are maintained while new features are acquired quickly.
  • Performance: Promising scores are demonstrated across various tasks, with the method outperforming other LVLMs on document parsing tasks.

Take a look at the Paper and Project. All credit for this research goes to the researchers of this project. Also, don't forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.


Hello, my name is Adnan Hassan. I'm a consulting intern at Marktechpost and soon to be a management trainee at American Express. I'm currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I'm passionate about technology and want to create new products that make a difference.


