Large Language Models (LLMs) have transformed natural language understanding in recent years, demonstrating remarkable abilities in semantic comprehension, question answering, and text generation, particularly in zero-shot and few-shot settings. As shown in Fig. 1(a), several methods have been proposed for applying LLMs to vision tasks. One approach trains a visual encoder to represent each image as a sequence of continuous embeddings that the LLM can interpret. Another uses a contrastively trained frozen vision encoder while adding new layers to the frozen LLM that are then learned from scratch.
Yet another method proposes training a lightweight transformer to align a frozen visual encoder (pretrained contrastively) with a frozen LLM. Despite the progress made by the research above, the computational cost of the additional pretraining stage(s) remains hard to justify. In addition, massive databases of text, images, and video are required to align the visual and linguistic modalities with an existing LLM. Flamingo, for example, inserts new cross-attention layers into a pretrained LLM to incorporate visual features.
Its multimodal pretraining stage requires a staggering 2 billion image-text pairs and 43 million web pages, and can take up to 15 days, even when starting from a pretrained image encoder and a pretrained frozen LLM. Instead, as shown in Fig. 1(b), a variety of "vision modules" can extract information from visual inputs and produce detailed textual representations (such as tags, attributes, actions, and relationships, among other things), which are then fed directly to the LLM, avoiding the need for additional multimodal pretraining. Building on this idea, researchers from Contextual AI and Stanford University introduce LENS (Large Language Models ENhanced to See), a modular strategy that uses an LLM as the "reasoning module" operating over separate "vision modules."
In the LENS approach, they first extract rich textual information using pretrained vision modules, such as contrastive models and image-captioning models. The text is then passed to the LLM, enabling it to perform tasks including object recognition and vision-and-language (V&L) reasoning. LENS bridges the gap between the modalities at no extra cost by eliminating the need for additional multimodal pretraining stages or data. This integration yields a model that works across domains out of the box, without any cross-domain pretraining, and makes it possible to immediately use the most recent advances in computer vision and natural language processing, maximizing the benefits of both fields.
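The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the vision modules (a contrastive tagger and a captioner, which in LENS would be pretrained models) are replaced by stubs returning fixed placeholder outputs, to show how visual evidence is rendered as plain text for a frozen LLM.

```python
def get_tags(image):
    # Stand-in for a contrastive vision module that scores the image
    # against a vocabulary of tags; here it returns fixed placeholders.
    return ["dog", "frisbee", "park"]


def get_caption(image):
    # Stand-in for a pretrained image-captioning module.
    return "A dog catches a frisbee in a park."


def build_prompt(image, question):
    # The key idea: visual evidence becomes ordinary text, so any
    # off-the-shelf LLM can reason over it without multimodal
    # pretraining or alignment layers.
    tags = ", ".join(get_tags(image))
    caption = get_caption(image)
    return (
        f"Tags: {tags}\n"
        f"Caption: {caption}\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(image=None, question="What is the dog doing?")
print(prompt)
```

The resulting prompt would then be sent to a frozen LLM as an ordinary text completion request; swapping in stronger vision modules or a stronger LLM requires no retraining of either.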
They make the following contributions:
• They present LENS, a modular method that tackles computer vision challenges by leveraging the few-shot, in-context learning capabilities of language models through natural language descriptions of visual inputs.
• LENS gives any off-the-shelf LLM the ability to see without further training or data.
• They use frozen LLMs to handle object recognition and visual reasoning tasks without additional vision-and-language alignment or multimodal data. Experimental results show that their approach achieves zero-shot performance that is competitive with or superior to end-to-end jointly pretrained models such as Kosmos and Flamingo. A partial implementation of their paper is available on GitHub.
Check out the Paper, Demo, GitHub link, and Blog.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.