Across the globe, people create countless videos every day, including user-generated live streams, video-game live streams, short clips, movies, sports broadcasts, and advertising. As a versatile medium, videos convey information and content through multiple modalities, such as text, visuals, and audio. Developing methods capable of learning from these diverse modalities is crucial for designing cognitive machines that can analyze uncurated real-world videos, transcending the limitations of hand-curated datasets.
However, the richness of this representation introduces numerous challenges for video understanding, particularly when dealing with long videos. Grasping the nuances of long videos, especially those exceeding an hour, requires sophisticated methods for analyzing image and audio sequences across multiple episodes. This complexity grows with the need to extract information from diverse sources, distinguish speakers, identify characters, and maintain narrative coherence. Moreover, answering questions based on video evidence demands a deep comprehension of the content, context, and subtitles.
In live streaming and gaming videos, additional challenges emerge in processing dynamic environments in real time, requiring semantic understanding and the ability to engage in long-term strategic planning.
In recent years, considerable progress has been made on large pre-trained models and video-language models, which have demonstrated proficient reasoning capabilities for video content. However, these models are typically trained on short clips (e.g., 10-second videos) or predefined action classes. Consequently, they may fall short of providing a nuanced understanding of intricate real-world videos.
Understanding real-world videos involves identifying the individuals in the scene and discerning their actions. Furthermore, pinpointing these actions is necessary, specifying when and how they occur. It also requires recognizing subtle nuances and visual cues across different scenes. The primary objective of this work is to confront these challenges and explore methodologies directly applicable to real-world video understanding. The approach involves deconstructing long video content into coherent narratives and then employing these generated stories for video analysis.
Recent strides in Large Multimodal Models (LMMs), such as GPT-4V(ision), have marked significant breakthroughs in processing both input images and text for multimodal understanding. This has spurred interest in extending the application of LMMs to the video domain. The study reported in this article introduces MM-VID, a system that integrates specialized tools with GPT-4V for video understanding. An overview of the system is illustrated in the figure below.
Upon receiving an input video, MM-VID performs multimodal pre-processing, including scene detection and automatic speech recognition (ASR), to gather crucial information from the video. The input video is then segmented into multiple clips according to the scene detection algorithm. Next, GPT-4V takes clip-level video frames as input and generates detailed descriptions for each video clip. Finally, GPT-4 produces a coherent script for the entire video, conditioned on the clip-level descriptions, the ASR transcript, and any available video metadata. The generated script enables MM-VID to execute a diverse array of video tasks.
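The four stages described above can be sketched as a simple pipeline. This is a minimal, hedged illustration, not the authors' implementation: `detect_scenes`, `transcribe_audio`, `describe_clip_with_gpt4v`, and `generate_script_with_gpt4` are hypothetical stand-ins for a scene-detection tool, an ASR model, and the GPT-4V / GPT-4 API calls, stubbed here with placeholder outputs so the control flow is runnable.

```python
from typing import List, Tuple

# NOTE: all four helpers below are illustrative stubs, not the real MM-VID tools.

def detect_scenes(video_path: str) -> List[Tuple[float, float]]:
    """Scene detection: return (start, end) clip boundaries in seconds (stubbed)."""
    return [(0.0, 12.5), (12.5, 40.0), (40.0, 75.0)]

def transcribe_audio(video_path: str) -> str:
    """ASR: return a transcript of the spoken audio (stubbed)."""
    return "placeholder transcript of the spoken audio"

def describe_clip_with_gpt4v(video_path: str, span: Tuple[float, float]) -> str:
    """GPT-4V step: describe the frames of one clip (stubbed)."""
    start, end = span
    return f"Clip {start:.1f}-{end:.1f}s: description of the visual content."

def generate_script_with_gpt4(descriptions: List[str], transcript: str,
                              metadata: dict) -> str:
    """GPT-4 step: fuse clip descriptions, ASR, and metadata into one script."""
    header = f"Title: {metadata.get('title', 'unknown')}"
    return "\n".join([header, *descriptions, f"Transcript: {transcript}"])

def mm_vid_pipeline(video_path: str, metadata: dict) -> str:
    scenes = detect_scenes(video_path)                # 1. scene detection
    transcript = transcribe_audio(video_path)         # 2. ASR
    descriptions = [describe_clip_with_gpt4v(video_path, s)
                    for s in scenes]                  # 3. clip-level GPT-4V descriptions
    return generate_script_with_gpt4(descriptions, transcript, metadata)  # 4. full script

script = mm_vid_pipeline("movie.mp4", {"title": "Example"})
print(script)
```

The key design point is that the final script is generated conditioned on all three information streams at once, which is what lets downstream question answering operate on a single coherent text representation of the video.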
Some examples from the study are shown below.
This was a summary of MM-VID, a novel AI system that integrates specialized tools with GPT-4V for video understanding. If you are interested and want to learn more, please feel free to refer to the links cited below.
Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
If you like our work, you will love our newsletter.
We’re also on Telegram and WhatsApp.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.