
Generative Artificial Intelligence has become increasingly popular in the past few months. As a subset of AI, it enables Large Language Models (LLMs) to generate new data by learning from massive amounts of available textual data. LLMs understand and follow user intentions and instructions through text-based conversations. These models imitate humans to produce new and creative content, summarize long passages of text, answer questions precisely, and so on. However, LLMs are limited to text-based conversations, which is a limitation, as text-only interaction between a human and a computer is not the most effective form of communication for a powerful AI assistant or chatbot.
Researchers have been attempting to integrate visual understanding capabilities into LLMs, such as the BLIP-2 framework, which performs vision-language pre-training using frozen pre-trained image encoders and language decoders. Though efforts have been made to add vision to LLMs, integrating video, which makes up a huge part of the content on social media, remains a challenge. This is because it can be difficult to comprehend non-static visual scenes in videos effectively, and it is harder to close the modal gap between video and text than between images and text, since video requires processing both visual and audio inputs.
To address these challenges, a team of researchers from DAMO Academy, Alibaba Group, has introduced Video-LLaMA, an instruction-tuned audio-visual language model for video understanding. This multi-modal framework equips language models with the ability to understand both visual and auditory content in videos. Unlike prior vision-LLMs that focus solely on static image understanding, Video-LLaMA explicitly addresses the difficulties of integrating audio-visual information and of capturing temporal changes in visual scenes.
The team has also introduced a Video Q-former that captures the temporal changes in visual scenes. This component assembles the pre-trained image encoder into the video encoder and enables the model to process video frames. Through a video-to-text generation task, the model learns the correspondence between videos and their textual descriptions. ImageBind, a universal embedding model that aligns multiple modalities and is known for its ability to handle different kinds of input and generate unified embeddings, has been used as the pre-trained audio encoder to integrate audio-visual signals. An Audio Q-former is placed on top of ImageBind to learn reasonable auditory query embeddings for the LLM module.
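To make this concrete, here is a minimal, hypothetical PyTorch sketch of a Video Q-Former-style module, not the authors' implementation: learnable query tokens cross-attend to per-frame features that carry temporal position embeddings, and the resulting summary tokens are projected into the LLM's embedding space. The dimensions, the use of a generic transformer decoder as a stand-in for the Q-Former blocks, and all module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VideoQFormerSketch(nn.Module):
    """Simplified sketch of a Video Q-Former-style module (not the official
    Video-LLaMA code): learnable queries cross-attend to frame features with
    temporal position embeddings, then project into the LLM embedding space."""

    def __init__(self, frame_dim=768, llm_dim=4096, num_queries=32,
                 num_layers=2, max_frames=64):
        super().__init__()
        # Learnable query tokens that will summarize the whole video clip.
        self.queries = nn.Parameter(torch.randn(num_queries, frame_dim))
        # Temporal position embeddings, one per sampled video frame.
        self.frame_pos = nn.Embedding(max_frames, frame_dim)
        # Cross-attention stack (generic stand-in for the Q-Former blocks).
        layer = nn.TransformerDecoderLayer(d_model=frame_dim, nhead=8,
                                           batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Linear projection into the frozen LLM's embedding space.
        self.to_llm = nn.Linear(frame_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, frame_dim), e.g. pooled per-frame
        # features from a frozen pre-trained image encoder.
        b, t, _ = frame_feats.shape
        pos = self.frame_pos(torch.arange(t, device=frame_feats.device))
        frame_feats = frame_feats + pos              # inject temporal order
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        video_tokens = self.qformer(tgt=q, memory=frame_feats)
        return self.to_llm(video_tokens)             # (batch, num_queries, llm_dim)


# Example: 2 videos, 8 sampled frames each, 768-d frame features.
feats = torch.randn(2, 8, 768)
print(VideoQFormerSketch()(feats).shape)             # torch.Size([2, 32, 4096])
```

An analogous Audio Q-former would apply the same idea to ImageBind's audio embeddings instead of image-encoder frame features.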
Video-LLaMA has been trained on large-scale video- and image-caption pairs to align the output of both the visual and audio encoders with the LLM's embedding space. This training data allows the model to learn the correspondence between visual and textual information. Video-LLaMA is then fine-tuned on visual-instruction-tuning datasets, which provide higher-quality data for training the model to generate responses grounded in visual and auditory information.
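The alignment stage can be pictured with a similarly hedged sketch, assuming the setup above: the projected video tokens are prepended to the caption embeddings, a frozen language model predicts the caption tokens, and only the Q-Former/projection receives gradient updates. The toy `llm`, `llm_embed`, vocabulary size, and reuse of `VideoQFormerSketch` from the previous snippet are illustrative stand-ins, not Video-LLaMA's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for illustration; the real setup uses a frozen pre-trained
# decoder-only LLM and its tokenizer, not these tiny modules.
VOCAB, LLM_DIM = 1000, 4096
llm_embed = nn.Embedding(VOCAB, LLM_DIM)
llm = nn.Sequential(nn.Linear(LLM_DIM, LLM_DIM), nn.GELU(),
                    nn.Linear(LLM_DIM, VOCAB))
for p in list(llm.parameters()) + list(llm_embed.parameters()):
    p.requires_grad_(False)                      # the language model stays frozen

video_qformer = VideoQFormerSketch()             # sketch from the previous snippet
optimizer = torch.optim.AdamW(video_qformer.parameters(), lr=1e-4)

frame_feats = torch.randn(2, 8, 768)             # frozen image-encoder output
caption_ids = torch.randint(0, VOCAB, (2, 12))   # tokenized video captions

# Video-to-text generation objective: prepend the projected video tokens to the
# caption embeddings and train only the Q-Former so the frozen LLM can predict
# the caption.
optimizer.zero_grad()
video_tokens = video_qformer(frame_feats)                    # (2, 32, 4096)
caption_embeds = llm_embed(caption_ids)                      # (2, 12, 4096)
inputs = torch.cat([video_tokens, caption_embeds], dim=1)
logits = llm(inputs)                                         # (2, 44, VOCAB)

# Next-token prediction, scored only on the caption positions.
n_vid = video_tokens.size(1)
pred = logits[:, n_vid - 1:-1, :]                            # shift by one position
loss = F.cross_entropy(pred.reshape(-1, VOCAB), caption_ids.reshape(-1))
loss.backward()
optimizer.step()                                             # updates Q-Former only
```

The instruction-tuning stage follows the same pattern, only with higher-quality instruction-response data replacing the raw captions.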
Upon evaluation, experiments have shown that Video-LLaMA can perceive and understand video content, producing insightful replies that are grounded in the audio-visual information presented in the videos. In conclusion, Video-LLaMA has a lot of potential as an audio-visual AI assistant prototype that can react to both visual and audio inputs in videos and can empower LLMs with audio and video understanding capabilities.
Check Out The Paper and Github. Don’t forget to join our 23k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.