A Recent AI Research Introduces Recognize Anything Model (RAM): A Robust Base Model For Image Tagging

In natural language processing (NLP), large language models (LLMs) trained on massive online datasets perform exceptionally well. In computer vision (CV), the Segment Anything Model (SAM) has shown outstanding zero-shot localization abilities by scaling up data.

Unfortunately, SAM cannot produce semantic labels, a fundamental task on par with localization. Recognizing multiple labels for a single image is the goal of multi-label image recognition, often called image tagging. Since images contain many kinds of labels, including objects, scenes, attributes, and actions, image tagging is a crucial and practical computer vision problem.
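To make the task concrete, here is a minimal sketch of the standard multi-label formulation: instead of a softmax over mutually exclusive classes, each tag receives an independent sigmoid score, and every tag above a threshold is kept. The logits and tag vocabulary below are illustrative placeholders, not RAM's actual outputs.

```python
# Minimal sketch of multi-label image tagging (hypothetical model outputs):
# unlike single-label classification (softmax + argmax), each tag gets an
# independent sigmoid score, and every tag above a threshold is kept.
import torch

labels = ["dog", "beach", "running", "sunny"]   # assumed tag vocabulary
logits = torch.tensor([2.1, 1.4, 0.3, -1.7])    # per-tag logits from some model
probs = torch.sigmoid(logits)                   # independent probabilities per tag

threshold = 0.5
predicted_tags = [l for l, p in zip(labels, probs) if p > threshold]
print(predicted_tags)  # e.g. ['dog', 'beach']
```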

Two essential factors hinder image tagging:

  1. The difficulty of collecting high-quality data at scale. A standardized and comprehensive labeling system is still lacking, as is an efficient data annotation engine that can semi-automatically or automatically annotate massive numbers of images across various categories.
  2. The absence of powerful open-vocabulary models built on an efficient and flexible model design that takes advantage of large-scale, weakly supervised data.

The Recognize Anything Model (RAM) is a strong base model for image tagging, recently introduced by researchers at the OPPO Research Institute, the International Digital Economy Academy (IDEA), and AI2 Robotics. RAM addresses problems such as inadequate labeling systems, insufficient datasets, inefficient data engines, and architectural constraints.

The researchers start by creating a standardized, universal labeling system. They draw on academic datasets (classification, detection, and segmentation) and commercial taggers (Google, Microsoft, and Apple) to build out their tag vocabulary. By combining all available public tags with common text-based tags, the labeling system yields 6,449 labels that collectively cover the overwhelming majority of use cases. The researchers note that the remaining open-vocabulary labels can be recognized via open-set recognition.
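A minimal sketch of how such a unification step might look, assuming a simple synonym map and a keep-if-seen-in-multiple-sources rule (both are illustrative; the paper's actual merging criteria differ):

```python
# Hypothetical sketch of unifying tags from several public sources into one
# label system: normalize case, map synonyms to a canonical form, and keep
# tags that appear in enough sources. Names and thresholds are illustrative.
from collections import Counter

sources = {
    "classification": ["Dog", "cat", "sea shore"],
    "detection":      ["dog", "person", "seashore"],
    "captions":       ["dog", "beach", "seashore"],
}
synonyms = {"sea shore": "seashore", "beach": "seashore"}  # assumed synonym map

counts = Counter()
for tags in sources.values():
    for tag in tags:
        canonical = synonyms.get(tag.lower(), tag.lower())
        counts[canonical] += 1

# Keep tags appearing in at least two sources (RAM's actual criteria differ).
label_system = sorted(t for t, c in counts.items() if c >= 2)
print(label_system)  # ['dog', 'seashore']
```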

Automatically annotating images at scale with this label system is a difficult task. The proposed approach is inspired by previous work in the field that uses large-scale public image-text pairs to train robust visual models. To put these massive amounts of image-text data to good use for tagging, the team employed automatic text semantic parsing to extract image tags. With this method, they could obtain a large set of image tags from image-text pairs without relying on manual annotations.
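As a rough illustration of mining tags from captions, the snippet below matches caption n-grams against a tag vocabulary. This keyword-matching variant is a simplified stand-in for the semantic parser the team uses; the vocabulary and caption are made up.

```python
# Simplified sketch of extracting tags from image captions without manual
# labels: match caption unigrams and bigrams against the label system.
import re

label_system = {"dog", "frisbee", "beach", "running"}  # assumed vocabulary

def caption_to_tags(caption: str) -> set[str]:
    words = re.findall(r"[a-z]+", caption.lower())
    # include bigrams so multi-word tags like "traffic light" could match
    ngrams = set(words) | {" ".join(p) for p in zip(words, words[1:])}
    return ngrams & label_system

print(caption_to_tags("A dog running after a frisbee on the beach"))
# {'dog', 'running', 'frisbee', 'beach'}
```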

Web-sourced image-text pairs tend to be imprecise due to random noise, so the team built a data tagging engine to improve annotation accuracy. To address missing labels, they adopt existing models to supply supplementary tags. To deal with mislabeled regions, they pinpoint the specific sections within the image that correspond to distinct labels, then use region clustering to find and eliminate outliers within the same category. Labels that yield inconsistent predictions are also removed, producing more precise annotations.
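The outlier-removal idea can be sketched as follows: for a given tag, embed every image region assigned that tag and drop regions that sit far from the cluster centroid. The embeddings below are random placeholders, and the simple distance cutoff stands in for the paper's clustering procedure.

```python
# Hedged sketch of cleaning mislabeled regions for one tag: embed all regions
# carrying the tag, then discard regions far from the centroid. Embeddings
# here are random placeholders; RAM's actual engine uses real features.
import numpy as np

rng = np.random.default_rng(0)
region_embeddings = rng.normal(size=(100, 64))  # assumed features for tag "dog"
region_embeddings[:5] += 6.0                    # simulate a few mislabeled outliers

centroid = region_embeddings.mean(axis=0)
dists = np.linalg.norm(region_embeddings - centroid, axis=1)
keep = dists < dists.mean() + 2 * dists.std()   # simple distance cutoff
clean = region_embeddings[keep]
print(f"kept {keep.sum()} of {len(keep)} regions for this tag")
```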

RAM generalizes to novel categories by incorporating semantic context into its label queries, so its recognition abilities can be applied to any visual dataset, demonstrating its versatility. By showing that a general model trained on noisy, annotation-free data can outperform fully supervised models, RAM introduces a new paradigm for image tagging. RAM requires only free, publicly available data with no manual annotations, and the most powerful version of RAM needs only three days of training on eight A100 GPUs.
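A hedged sketch of the open-set idea: embed novel label names with a text encoder (CLIP-style) and score them against the image embedding by cosine similarity, tagging labels whose score clears a threshold. The tensors below are placeholders for real encoder outputs, and RAM's architecture adds more machinery on top of this.

```python
# Sketch of open-set tagging via semantic label queries: compare an image
# embedding against text embeddings of label names that were never seen in
# training. Random tensors stand in for real encoder outputs.
import torch
import torch.nn.functional as F

novel_labels = ["capybara", "paddleboard"]       # tags unseen during training
image_emb = torch.randn(512)                     # stand-in for image encoder output
text_embs = torch.randn(len(novel_labels), 512)  # stand-in for text encoder output

scores = F.cosine_similarity(image_emb.unsqueeze(0), text_embs)
for label, s in zip(novel_labels, scores):
    print(f"{label}: {s:.3f}")  # tag the image if the score clears a tuned threshold
```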

According to the team, RAM can still be improved by running more iterations of the data engine, increasing the backbone parameters to boost the model's capacity, and expanding the training dataset beyond 14 million images to better cover diverse domains.


Check out the Paper, Project, and GitHub. Don't forget to join our 23k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com.




Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.


