Are CLIP Models ‘Parroting’ Text in Images? This Paper Explores the Text Spotting Bias in Vision-Language Systems

In recent research, a team of researchers has examined CLIP (Contrastive Language-Image Pretraining), a well-known neural network that learns visual concepts effectively from natural language supervision. CLIP, which predicts the most relevant text snippet for a given image, has helped advance vision-language modeling tasks. Although its effectiveness has established it as a foundation model for a wide range of applications, CLIP models exhibit biases related to visual text, color, gender, and more.

A team of researchers from Shanghai AI Laboratory, Show Lab, National University of Singapore, and Sun Yat-Sen University has examined CLIP’s visual text bias, particularly its tendency to spot text in images. The team has studied the LAION-2B dataset in detail and found that estimating this bias accurately is difficult given the sheer volume of image-text data.

Image clustering has been applied to the entire dataset to tackle the issue, ranking each cluster according to CLIP scores. This analysis aims to determine which kinds of image-text pairs are most favored by CLIP score measures. Many of the examples with the highest CLIP scores contain dense concurrent text that appears at the pixel level in both the captions and the images.
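To make the idea of CLIP-score-based ranking concrete, here is a minimal sketch (not the paper’s actual pipeline) of how image-text pairs could be scored with an off-the-shelf CLIP model and grouped by image embedding. The checkpoint name, cluster count, and file paths are illustrative assumptions.

```python
# Hedged sketch: score image-text pairs with a public CLIP checkpoint and cluster
# the image embeddings, so each cluster can be ranked by its mean CLIP score.
# Model name, file paths, and n_clusters are illustrative, not the paper's setup.
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores_and_embeddings(images, captions):
    """Return per-pair cosine similarities and L2-normalized image embeddings."""
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1), img_emb

# Placeholder data: in practice this would run over the whole LAION-2B dataset.
images = [Image.open(p).convert("RGB") for p in ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]]
captions = ["caption a", "caption b", "caption c", "caption d"]
scores, embeddings = clip_scores_and_embeddings(images, captions)

# Cluster the image embeddings and rank clusters by their mean CLIP score.
labels = torch.tensor(KMeans(n_clusters=2, n_init=10).fit_predict(embeddings.numpy()))
for k in labels.unique().tolist():
    print(f"cluster {k}: mean CLIP score = {scores[labels == k].mean().item():.3f}")
```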

The captions accompanying these samples have been dubbed ‘parrot captions,’ since they appear to give CLIP another way to achieve its objective: they teach it to recognize text without necessarily grasping the underlying visual concepts. The team has studied the impact of parrot captions from three angles, i.e., the dataset itself, popular released models, and the model-training procedure.

The team has discovered a notable bias in how visual text embedded in images is described in LAION-2B captions. By thoroughly profiling the LAION-2B dataset using commercial text detection methods, they found that over 50% of the images contain visual text. Their analysis of the paired image-text data has shown that more than 90% of captions contain at least one word that also appears in the image, with the caption and the spotted text sharing a word overlap of about 30%. This implies that when trained with LAION-style data, CLIP deviates significantly from the fundamental assumption of semantic congruence between image and text.
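As a rough illustration of the overlap statistic described above, the sketch below computes the fraction of caption words that also appear in the text spotted in an image. The OCR step itself is abstracted away, so `spotted_text` simply stands in for a text detector’s output, and this is not the paper’s exact metric definition.

```python
# Hedged sketch of a caption/spotted-text overlap measure: the share of unique
# caption words that also occur in the OCR output for the image.
import re

def word_overlap(caption: str, spotted_text: str) -> float:
    """Fraction of unique caption words that co-occur in the spotted text."""
    caption_words = set(re.findall(r"[a-z0-9']+", caption.lower()))
    spotted_words = set(re.findall(r"[a-z0-9']+", spotted_text.lower()))
    if not caption_words:
        return 0.0
    return len(caption_words & spotted_words) / len(caption_words)

# A "parrot caption" largely repeats the text rendered in the image ...
print(word_overlap("Best Coffee In Town Open 24 Hours",
                   "BEST COFFEE IN TOWN OPEN 24 HOURS"))   # 1.0
# ... while a descriptive caption shares little or nothing with it.
print(word_overlap("A dog playing fetch in the park", "SALE 50% OFF"))  # 0.0
```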

The study has also looked into biases in released CLIP models, namely a strong bias in favor of text spotting across several types of web images. The team has compared alignment scores before and after text removal to examine how OpenAI’s publicly available CLIP model behaves on the LAION-2B dataset. The findings show a strong association between the visual text embedded in images, the corresponding parrot captions, and CLIP model predictions.
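A minimal sketch of that before/after comparison might look like the following: the same caption is scored against the original image and against a copy whose rendered text has been removed (e.g., by inpainting), and the drop in similarity is inspected. The checkpoint and file names are assumptions, and the text-removal step itself is left to an external tool.

```python
# Hedged sketch: compare CLIP alignment for an image before and after its rendered
# text has been removed. The checkpoint and file names are placeholders; the
# text-removed image is assumed to come from a separate inpainting/removal tool.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between one image and one caption."""
    inputs = processor(text=[caption], images=[image], return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()

caption = "Best Coffee In Town Open 24 Hours"
original = Image.open("sample.jpg").convert("RGB")                   # image with embedded text
text_removed = Image.open("sample_text_removed.jpg").convert("RGB")  # same image after text removal

before, after = clip_score(original, caption), clip_score(text_removed, caption)
print(f"before: {before:.3f}  after: {after:.3f}  drop: {before - after:.3f}")
```

A large drop after text removal suggests the model’s alignment relied on reading the embedded text rather than on the visual content.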

The team has also examined the text spotting abilities of CLIP and OpenCLIP models, finding that OpenCLIP, which was trained on LAION-2B, shows a greater bias in favor of text spotting than CLIP, which was trained on WIT-400M. The research has focused on how CLIP models can quickly pick up text recognition skills from parrot captions, yet they have trouble connecting vision and language semantics.

Based on text-oriented parameters, such as the embedded text ratio, co-occurring word ratios, and relative CLIP scores from text removal, several LAION-2B subsets have been sampled. The findings show that CLIP models gain good text detection abilities when trained on parrot-caption data, but they lose most of their zero-shot generalization ability on image-text downstream tasks.
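For illustration, a subset-sampling filter along those lines could be expressed as a simple predicate over precomputed per-pair statistics; the field names and thresholds below are assumptions, not the paper’s actual criteria.

```python
# Hedged sketch of a subset-sampling filter over precomputed per-pair statistics.
# Field names and thresholds are illustrative assumptions, not the paper's criteria.
def keep_pair(stats: dict,
              max_text_ratio: float = 0.05,      # fraction of image area covered by text
              max_cooccur_ratio: float = 0.0,    # share of caption words spotted in the image
              min_relative_score: float = 0.0) -> bool:
    """Return True if an image-text pair belongs to the sampled subset."""
    return (stats["embedded_text_ratio"] <= max_text_ratio
            and stats["cooccurring_word_ratio"] <= max_cooccur_ratio
            and stats["relative_clip_score"] >= min_relative_score)

# Example: drop a pair whose caption heavily parrots the rendered text.
print(keep_pair({"embedded_text_ratio": 0.30,
                 "cooccurring_word_ratio": 0.90,
                 "relative_clip_score": -0.12}))  # False
```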

In conclusion, this study has focused on the consequences of parrot captions on CLIP model learning. It has shed light on biases related to visual text in LAION-2B captions and has emphasized the text spotting bias in published CLIP models.


Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 35k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, LinkedIn Group, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.


Tanya Malhotra is a final-year undergrad at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.


