Make ChatGPT See Again: This AI Approach Explores Link-Context Learning to Enable Multimodal Learning

Language models have revolutionized the way we communicate with computers through their ability to generate coherent and contextually relevant text. Large Language Models (LLMs) have been at the forefront of this progress, trained on massive amounts of text data to learn the patterns and nuances of human language. ChatGPT, the pioneer of the LLM revolution, is incredibly popular among people across numerous disciplines.

LLMs have made various tasks easier to tackle thanks to their impressive capabilities. We use them to summarize texts, help us write emails, automate coding tasks, explain documents, and more. All these tasks were quite time-consuming only a year ago, but nowadays they take just a few minutes to finish.

However, with the increasing demand for multimodal understanding, where models must process and generate content across different modalities like text, images, and even videos, the need for Multimodal Large Language Models (MLLMs) has emerged. MLLMs combine the capabilities of language models with visual understanding, enabling machines to grasp and generate content in a more comprehensive and contextually aware manner.

Once the ChatGPT craze settled down a bit, MLLMs took the AI world by storm, enabling machines to grasp and generate content across different modalities like text and images. These models have shown remarkable performance in tasks like image recognition, visual grounding, and instruction understanding. However, training these models effectively remains a challenge. The biggest difficulty arises when an MLLM encounters entirely novel scenarios where both the image and the label are unseen.

Moreover, MLLMs tend to get “lost in the middle” when processing longer contexts. These models rely heavily on the beginning and middle positions of the input, which explains the plateau in accuracy as the number of shots increases. Consequently, MLLMs struggle with longer inputs.

Enter link-context learning (LCL), which tackles these challenges in MLLMs.

In MLLMs, there are two key training strategies: Multimodal Prompt Tuning (M-PT) and Multimodal Instruction Tuning (M-IT). M-PT involves fine-tuning only a small portion of the model’s parameters while keeping the rest frozen. This approach achieves results similar to full fine-tuning while minimizing computational resources. M-IT, on the other hand, enhances the zero-shot capability of MLLMs by fine-tuning them on datasets that include instruction descriptions. This strategy improves the model’s ability to understand and respond to new tasks without prior training. Both work well, but each sacrifices certain aspects.
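The core idea of prompt tuning, fine-tuning a small set of parameters while freezing the backbone, can be sketched as follows. This is a minimal, framework-agnostic illustration, not the paper’s actual implementation; the parameter names and the `prompt.` prefix convention are assumptions for the example.

```python
# Hypothetical sketch of multimodal prompt tuning (M-PT): only a small
# set of "prompt" parameters is marked trainable; the backbone stays frozen.
def split_parameters(named_params, trainable_prefixes=("prompt.",)):
    """Return (trainable, frozen) lists of parameter names.

    In a real framework (e.g. PyTorch), the frozen group would have
    gradients disabled before the optimizer is built.
    """
    trainable, frozen = [], []
    for name in named_params:
        # str.startswith accepts a tuple, so several prefixes can be trainable.
        (trainable if name.startswith(trainable_prefixes) else frozen).append(name)
    return trainable, frozen


# Illustrative parameter names; a real MLLM would have far more.
params = [
    "prompt.visual_tokens",
    "prompt.text_tokens",
    "backbone.layer0.attn.weight",
    "backbone.layer0.mlp.weight",
    "lm_head.weight",
]
trainable, frozen = split_parameters(params)
print(trainable)     # ['prompt.visual_tokens', 'prompt.text_tokens']
print(len(frozen))   # 3
```

The point of the split is the compute saving the paragraph describes: only two of five parameter groups here would receive gradient updates.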

Instead, LCL explores different training strategies: the mixed strategy, the 2-way strategy, 2-way-random, and 2-way-weight. The mixed strategy stands out by significantly boosting zero-shot accuracy and achieving impressive results at 6-shot. However, its performance slightly decreases at 16-shot. In contrast, the 2-way strategy shows a gradual increase in accuracy from 2-shot to 16-shot, indicating a closer alignment with the trained pattern.

Unlike traditional in-context learning, LCL goes a step further by empowering the model to establish a mapping between source and target, enhancing its overall performance. By providing demonstrations with causal links, LCL enables MLLMs to discern not only analogies but also the underlying causal associations between data points, allowing them to recognize unseen images and understand novel concepts more effectively.
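The demonstration format can be sketched roughly as below: each shot is an (image, label) pair whose link the model must carry over to the query. This is a simplified illustration of the prompting idea, not the paper’s exact prompt template; the `[IMG:...]` placeholder syntax and the fabricated concept names (in the spirit of ISEKAI’s invented categories) are assumptions.

```python
# Hypothetical sketch of a k-shot link-context prompt: interleaved
# image/label demonstrations followed by a query image.
def build_lcl_prompt(demos, query_image):
    """Assemble a text rendering of a link-context prompt.

    demos: list of (image_reference, label) pairs; the model is expected
    to infer the image-to-label link and apply it to the query.
    """
    lines = []
    for image, label in demos:
        lines.append(f"[IMG:{image}] This is a {label}.")
    lines.append(f"[IMG:{query_image}] What is this?")
    return "\n".join(lines)


# Fabricated concept names, echoing ISEKAI-style never-seen categories.
demos = [
    ("cabrani_1.png", "cabrani"),
    ("cabrani_2.png", "cabrani"),
    ("mopher_1.png", "mopher"),
    ("mopher_2.png", "mopher"),
]
prompt = build_lcl_prompt(demos, "query.png")
print(prompt)
```

A 2-way setup like this one alternates between two classes, which is what lets accuracy keep climbing as shots are added: every extra pair reinforces the same source-to-target link rather than adding unrelated context.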

Furthermore, LCL introduces the ISEKAI dataset, a novel and comprehensive dataset specifically designed to evaluate the capabilities of MLLMs. The ISEKAI dataset comprises entirely generated images and fabricated concepts. It challenges MLLMs to assimilate new concepts from ongoing conversations and retain this knowledge for accurate question-answering.

In conclusion, LCL provides valuable insights into the training strategies employed for multimodal language models. The mixed strategy and 2-way strategy offer different approaches to boosting the performance of MLLMs, each with its own strengths and limitations. The contextual analysis sheds light on the challenges MLLMs face when processing longer inputs, emphasizing the importance of further research in this area.

Check out the Paper and Code. All credit for this research goes to the researchers on this project. Also, don’t forget to join our 30k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.

Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, along with his dissertation titled “Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning.” His research interests include deep learning, computer vision, video encoding, and multimedia networking.


