
Overcoming Hallucinations in AI: How Factually Augmented RLHF Optimizes Vision-Language Alignment in Large Multimodal Models


Large Language Models can be extended into the multimodal domain, either through additional pre-training on image-text pairs or through fine-tuning on specialized visual instruction tuning datasets, giving rise to powerful Large Multimodal Models (LMMs). Nonetheless, building LMMs faces obstacles, chief among them the gap in quantity and quality between multimodal data and text-only datasets. Take the LLaVA model, which is initialized from a pre-trained visual encoder and an instruction-tuned language model: it is trained on only about 150K synthetic image-based conversations, far fewer instances than the 100M+ examples spanning over 1,800 tasks used by text-only models. Because of such data limitations, the visual and language modalities may not be well aligned.

As a result, LMMs can generate hallucinated outputs that are not accurately grounded in the context provided by the image. To address the problems caused by the lack of high-quality visual instruction tuning data for LMM training, researchers from UC Berkeley, CMU, UIUC, UW-Madison, UMass Amherst, Microsoft Research, and the MIT-IBM Watson AI Lab present LLaVA-RLHF, a vision-language model trained for improved multimodal alignment. One of their main contributions is adapting Reinforcement Learning from Human Feedback (RLHF), a general and scalable alignment paradigm that has proved remarkably effective for text-based AI agents, to multimodal alignment for LMMs. The approach collects human preferences focused on identifying hallucinations and uses those preferences to fine-tune the LMM with reinforcement learning.

This strategy can improve multimodal alignment at a relatively low annotation cost, such as $3,000 for collecting 10K human preferences over image-based conversations. As far as the authors know, this is the first effective use of RLHF for multimodal alignment. A potential problem with the current RLHF paradigm is reward hacking: obtaining high scores from the reward model does not always translate into better human judgments. Previous research suggested iteratively collecting "fresh" human feedback to prevent reward hacking, but this approach is usually expensive and cannot properly exploit existing human preference data. This study proposes a more data-efficient alternative, attempting to make the reward model capable of using existing human-annotated data and the knowledge already present in larger language models.
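To make the preference-learning step concrete, below is a minimal sketch of the pairwise reward-model objective commonly used in RLHF: given a preferred and a rejected response to the same image and prompt, the reward model is trained to score the preferred one higher. This is an illustration, not the authors' code; `reward_model` is a hypothetical scorer that maps (image, prompt, response) to a scalar.

```python
# Sketch of a Bradley-Terry style preference loss for an RLHF reward model.
# Assumes `reward_model(image, prompt, response)` returns a scalar per example.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, image, prompt, chosen, rejected):
    """Loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    r_chosen = reward_model(image, prompt, chosen)      # shape: (batch,)
    r_rejected = reward_model(image, prompt, rejected)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then provides the scalar signal that the policy (the LMM) is optimized against with a reinforcement learning algorithm such as PPO.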

Figure 1: A diagram illustrating the possibility of hallucinations during the Supervised Fine-Tuning (SFT) phase of LMM training, and how Factually Augmented RLHF addresses the low capability of the reward model, which is initialized from the SFT model.

First, they use a superior visual encoder with higher resolution and a larger language model to improve the reward model's overall capability. Second, they present the Factually Augmented RLHF algorithm, which, as shown in Fig. 1, calibrates the reward signals by supplementing them with additional information such as image captions or a ground-truth multiple-choice option. To strengthen the general capabilities of LMMs during the Supervised Fine-Tuning stage, they further augment the synthetic visual instruction tuning data with existing high-quality, human-annotated multimodal data in conversation format. Specifically, they convert Flickr30k into a Spotting Captioning task and VQA-v2 and A-OKVQA into multi-round QA tasks, and train the LLaVA-SFT+ models on the new data mixture.
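The sketch below illustrates the core idea of factual augmentation: the reward model is shown ground-truth facts about the image (for example, human-written captions or the correct multiple-choice answer) alongside the conversation, so hallucinated details can be penalized. The prompt template and the `reward_model` interface are assumptions for illustration, not the paper's exact format.

```python
# Illustrative sketch of a factually augmented reward query.
# `reward_model(image, prompt, response)` is a hypothetical scorer.
def factually_augmented_reward(reward_model, image, prompt, response, facts):
    """Condition the reward on ground-truth information about the image.

    facts: list of strings, e.g., human-annotated captions or the
    ground-truth option of a multiple-choice question.
    """
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    augmented_prompt = (
        f"{prompt}\n\n"
        f"Ground-truth information about the image:\n{fact_block}"
    )
    return reward_model(image, augmented_prompt, response)
```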

Finally, they consider how to evaluate the multimodal alignment of LMMs in real-world generation settings, paying particular attention to penalizing any hallucinations. The benchmark questions they develop, MMHAL-BENCH, cover all 12 of COCO's key object categories and comprise eight task types. According to their evaluation, this benchmark closely matches human assessments, especially when scores are considered for anti-hallucination. As the first LMM trained with RLHF, LLaVA-RLHF performs admirably in their experimental evaluation. They report an improvement of 94% on LLaVA-Bench and a 60% improvement on MMHAL-BENCH, and they set new performance records for LLaVA with 52.4% on MMBench and 82.7% F1 on POPE. They have made their code, model, and data publicly available on GitHub.
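As a rough illustration of how such a hallucination benchmark can be summarized, the sketch below aggregates per-question judge ratings into an average score and a hallucination rate. The 0-6 rating scale, the threshold, and the function names are assumptions for illustration, not the official MMHAL-BENCH protocol.

```python
# Hypothetical aggregation of per-question ratings (e.g., assigned by a GPT-4 judge).
from statistics import mean

def summarize_ratings(ratings, hallucination_threshold=3):
    """ratings: list of per-question scores on an assumed 0-6 scale."""
    avg_score = mean(ratings)
    # Count responses rated below the threshold as containing hallucinations.
    hallucination_rate = sum(r < hallucination_threshold for r in ratings) / len(ratings)
    return avg_score, hallucination_rate
```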


Check out the Paper and Project. All credit for this research goes to the researchers on this project.



Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.


