A team of researchers from the University of Wisconsin-Madison, NVIDIA, the University of Michigan, and Stanford University has developed a new vision-language model (VLM) called Dolphins. It is a conversational driving assistant that can process multimodal inputs to provide informed driving instructions. Dolphins is designed to handle the complex driving scenarios faced by autonomous vehicles (AVs) and exhibits human-like capabilities such as rapid learning, adaptation, error recovery, and interpretability during interactive conversations.
LLM-based approaches like DriveLikeHuman and GPT-Driver lack rich visual features for autonomous driving. Dolphins combines LLM reasoning with visual understanding, excelling at in-context learning and handling varied video inputs. Inspired by Flamingo's multimodal in-context learning, Dolphins aligns with prior work that enhances instruction comprehension in multimodal language models through text-image interleaved datasets.
The study addresses the challenge of achieving full autonomy in vehicular systems, aiming to design AVs with human-like understanding and responsiveness in complex scenarios. Current data-driven and modular autonomous driving systems face various integration and performance issues. Dolphins, a VLM tailored for AVs, demonstrates advanced understanding, fast learning, and error recovery. By emphasizing interpretability for trust and transparency, Dolphins narrows the gap between existing autonomous systems and human-like driving capabilities.
Dolphins builds on OpenFlamingo and uses a grounded chain-of-thought (GCoT) process to strengthen reasoning. The researchers ground the VLM in the AV context and develop fine-grained capabilities using real and synthetic AV datasets. They also create a multimodal in-context instruction tuning dataset for detailed conversation tasks.
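To make the idea of multimodal in-context instruction tuning concrete, here is a minimal sketch of how an interleaved image-text few-shot prompt might be assembled in the Flamingo/OpenFlamingo style. All names (`Demonstration`, `build_prompt`, the `<image>` placeholder tokens) are illustrative assumptions, not the actual Dolphins API or data format.

```python
# Hypothetical sketch of interleaved multimodal few-shot prompting,
# in the spirit of Flamingo-style in-context learning.
# Names and token formats are illustrative, not the Dolphins codebase.

from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    frame_token: str   # placeholder standing in for encoded video frames
    instruction: str   # driving-related instruction
    response: str      # grounded answer, optionally with chain-of-thought


def build_prompt(demos: List[Demonstration], query_instruction: str) -> str:
    """Interleave (frames, instruction, response) demonstrations,
    then append the query so the model completes it in context."""
    parts = []
    for d in demos:
        parts.append(
            f"<image>{d.frame_token}</image>\n"
            f"User: {d.instruction}\n"
            f"Assistant: {d.response}"
        )
    # The query ends with an open "Assistant:" turn for the model to fill in.
    parts.append(
        f"<image><query_frames></image>\n"
        f"User: {query_instruction}\n"
        f"Assistant:"
    )
    return "\n\n".join(parts)


demos = [
    Demonstration(
        "<frames_1>",
        "What is the state of the traffic light ahead?",
        "The light ahead is red, so the vehicle should come to a stop.",
    ),
]
prompt = build_prompt(demos, "Is it safe to merge into the left lane?")
```

The key design point this sketch illustrates is the interleaving: each demonstration pairs visual context with an instruction and a grounded response, so the model can adapt to a new query from a handful of examples without any weight updates.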
Dolphins excels at diverse autonomous vehicle tasks with human-like capabilities such as fast adaptation and error recovery. It can pinpoint precise driving locations, assess traffic status, and understand the behavior of road agents. The model's fine-grained capabilities result from being grounded on a general image dataset and fine-tuned within the specific context of autonomous driving. A multimodal in-context instruction tuning dataset supports its training and evaluation.
Dolphins showcases impressive holistic understanding and human-like reasoning in intricate driving scenarios. As a conversational driving assistant, it handles a variety of AV tasks, excelling in interpretability and rapid adaptation. The study acknowledges computational challenges, particularly achieving high frame rates on edge devices and managing power consumption. Proposing customized and distilled model versions suggests a promising direction for balancing computational demands with power efficiency. Continued exploration and innovation are deemed essential for unlocking the full potential of AVs empowered by advanced AI capabilities like Dolphins.
For future work, the authors recommend improving computational efficiency, particularly achieving high frame rates on edge devices and reducing the power consumption of running advanced models in vehicles. They propose the development of customized and distilled versions of VLMs such as Dolphins as a possible way to balance computational demands with power efficiency, and they emphasize the critical role of VLMs in enabling autonomous driving and unlocking the full potential of AI in AVs.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Hello, my name is Adnan Hassan. I am a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.