The domain of Artificial Intelligence (AI) is evolving and advancing with the release of each new model and solution. Large Language Models (LLMs), which have recently gained popularity as a consequence of their incredible capabilities, are a major driver of this rise in AI. The subdomains of AI, be it Natural Language Processing, Natural Language Understanding, or Computer Vision, are all progressing, and for good reason. One research area that has recently garnered significant interest from the AI and deep learning communities is Visual Question Answering (VQA). VQA is the task of answering open-ended text-based questions about an image.
Systems for Visual Question Answering try to correctly answer natural-language questions about an input image; these systems are designed to understand the contents of an image much like humans do and to communicate their findings effectively. Recently, a team of researchers from UC Berkeley and Google Research proposed an approach called CodeVQA that addresses visual question answering using modular code generation. CodeVQA formulates VQA as a program synthesis problem and utilizes code-writing language models that take questions as input and generate code as output.
This framework’s major goal is to create Python programs that can call pre-trained visual models and combine their outputs to produce answers. The generated programs manipulate the visual model outputs and derive an answer using arithmetic and conditional logic. In contrast to previous approaches, this framework relies only on pre-trained language models, pre-trained visual models based on image-caption pairings, and a small number of VQA examples used to support in-context learning.
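To make this concrete, here is a minimal, hypothetical sketch of the kind of Python program a code-writing language model might emit for a counting-comparison question. The primitive `query` stands in for a call to a pre-trained visual model; it is stubbed with canned answers here purely so the sketch runs, and none of the names below are taken from CodeVQA's actual interface.

```python
def query(image, question: str) -> str:
    # Stub: a real implementation would run a pre-trained visual
    # question-answering model on `image`. Canned answers are used
    # here only so the example is executable.
    canned = {
        "How many cats are there?": "3",
        "How many dogs are there?": "1",
    }
    return canned[question]

def answer(image) -> str:
    # The generated program decomposes the question into simple
    # sub-queries, then combines the results with arithmetic and
    # conditional logic, as the article describes.
    num_cats = int(query(image, "How many cats are there?"))
    num_dogs = int(query(image, "How many dogs are there?"))
    return "yes" if num_cats > num_dogs else "no"

print(answer(image=None))  # prints: yes (given the stubbed answers)
```

The point is that the language model never sees the pixels; it only writes the glue logic, while the visual model answers each primitive sub-question.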
To extract specific visual information from the image, such as captions, pixel locations of objects, or image-text similarity scores, CodeVQA uses primitive visual APIs wrapped around visual language models. The generated code coordinates the various APIs to collect the essential data, then uses the full expressiveness of Python, including math, logical structures, and loops, to analyze that data and reason about it to arrive at an answer.
For evaluation, the team compared the performance of this new technique to a few-shot baseline that does not use code generation. GQA and COVR were the two benchmark datasets used in the evaluation: the GQA dataset includes multi-hop questions created from human-annotated scene graphs of individual Visual Genome photos, and the COVR dataset contains multi-hop questions about sets of images drawn from the Visual Genome and imSitu datasets. The results showed that CodeVQA outperformed the baseline on both datasets, improving accuracy by at least 3% on the COVR dataset and by about 2% on the GQA dataset.
The team has noted that CodeVQA is easy to deploy and use since it does not require any additional training. It makes use of pre-trained models and a limited number of VQA samples for in-context learning, which helps tailor the generated programs to particular question-answer patterns. To sum up, this framework is powerful: it leverages the strengths of pre-trained LMs and visual models, providing a modular, code-based approach to VQA.
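The in-context learning step amounts to prompt construction: a handful of (question, program) exemplars are prepended to the new question, and the code-writing model completes the program. The sketch below shows one plausible way to assemble such a prompt; the exemplars and the prompt format are invented for illustration and are not taken from the paper.

```python
# Hypothetical few-shot exemplars pairing a question with the program
# a code-writing model should emit for it.
EXEMPLARS = [
    ("How many birds are in the photo?",
     'return query(image, "How many birds are there?")'),
    ("Is the mug to the left of the laptop?",
     'mug_x, _ = get_pos(image, "mug")\n'
     'laptop_x, _ = get_pos(image, "laptop")\n'
     'return "yes" if mug_x < laptop_x else "no"'),
]

def build_prompt(question: str) -> str:
    """Concatenate exemplars and the new question into one prompt;
    the language model is expected to complete the final program."""
    parts = []
    for q, program in EXEMPLARS:
        parts.append(f"# Question: {q}\n{program}\n")
    parts.append(f"# Question: {question}\n")
    return "\n".join(parts)

prompt = build_prompt("Are there more chairs than tables?")
```

Since only the exemplars change, adapting the system to a new question style is a matter of swapping prompt text rather than retraining anything.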
Check Out The Paper and GitHub link.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.