
Helping computer vision and language models understand what they see

Powerful machine-learning algorithms known as vision and language models, which learn to match text with images, have shown remarkable results when asked to generate captions or summarize videos.

While these models excel at identifying objects, they often struggle to grasp concepts like object attributes or the arrangement of items in a scene. For instance, a vision and language model might recognize the cup and table in an image, but fail to understand that the cup is sitting on the table.

Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere have demonstrated a new technique that utilizes computer-generated data to help vision and language models overcome this shortcoming.

The researchers created a synthetic dataset of images that depict a wide range of scenarios, object arrangements, and human actions, coupled with detailed text descriptions. They used this annotated dataset to “fix” vision and language models so they can learn concepts more effectively. Their technique ensures these models can still make accurate predictions when they see real images.

When they tested models on concept understanding, the researchers found that their technique boosted accuracy by up to 10 percent. This could improve systems that automatically caption videos or enhance models that provide natural language answers to questions about images, with applications in fields like e-commerce or health care.

“With this work, we are going beyond nouns in the sense that we are going beyond just the names of objects to more of the semantic concept of an object and everything around it. Our idea was that, when a machine-learning model sees objects in many different arrangements, it will have a better idea of how arrangement matters in a scene,” says Khaled Shehada, a graduate student in the Department of Electrical Engineering and Computer Science and co-author of a paper on this technique.

Shehada wrote the paper with lead author Paola Cascante-Bonilla, a computer science graduate student at Rice University; Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); senior author Leonid Karlinsky, a research staff member in the MIT-IBM Watson AI Lab; and others at MIT, the MIT-IBM Watson AI Lab, Georgia Tech, Rice University, École des Ponts, Weizmann Institute of Science, and IBM Research. The paper will be presented at the International Conference on Computer Vision.

Focusing on objects

Vision and language models typically learn to identify objects in a scene, and can end up ignoring object attributes, such as color and size, or positional relationships, such as which object is on top of another.

This is due to the method with which these models are often trained, known as contrastive learning. This training method involves forcing a model to predict the correspondence between images and text. When comparing natural images, the objects in each scene tend to cause the most striking differences. (Perhaps one image shows a horse in a field while the second shows a sailboat on the water.)

“Every image can be uniquely defined by the objects in the image. So, when you do contrastive learning, just focusing on the nouns and objects would solve the problem. Why would the model do anything differently?” says Karlinsky.
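The contrastive training objective described above can be sketched in a few lines. This is a minimal, generic CLIP-style loss, not the code from the paper: each image embedding is pushed toward its paired caption embedding and away from every other caption in the batch, which is why object identity alone is often enough to minimize it.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style contrastive loss: row i of image_emb is paired with
    row i of text_emb; all other rows in the batch act as negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) cosine-similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    loss_img = F.cross_entropy(logits, targets)      # image -> text matching
    loss_txt = F.cross_entropy(logits.t(), targets)  # text -> image matching
    return (loss_img + loss_txt) / 2
```

Because any embedding that separates "horse" from "sailboat" already drives this loss down, nothing forces the model to encode attributes or spatial relations.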

The researchers sought to mitigate this problem by using synthetic data to fine-tune a vision and language model. The fine-tuning process involves tweaking a model that has already been trained to improve its performance on a specific task.

They used a computer to automatically create synthetic videos with diverse 3D environments and objects, such as furniture and luggage, and added human avatars that interacted with the objects.

Using individual frames of these videos, they generated nearly 800,000 photorealistic images, and then paired each with a detailed caption. The researchers developed a methodology for annotating every aspect of the image to capture object attributes, positional relationships, and human-object interactions clearly and consistently in dense captions.

Because the researchers created the images, they could control the appearance and position of objects, as well as the gender, clothing, poses, and actions of the human avatars.

“Synthetic data allows a lot of diversity. With real images, you might not have a lot of elephants in a room, but with synthetic data, you could actually have a pink elephant in a room with a human, if you want,” Cascante-Bonilla says.

Synthetic data have other benefits, too. They are cheaper to generate than real data, yet the images are highly photorealistic. They also preserve privacy because no real humans are shown in the images. And, because data are produced automatically by a computer, they can be generated quickly in massive quantities.

By using different camera viewpoints, or slightly changing the positions or attributes of objects, the researchers created a dataset with a far wider variety of scenarios than one would find in a natural dataset.

Fine-tune, but don’t forget

However, when one fine-tunes a model with synthetic data, there is a risk that the model might “forget” what it learned when it was originally trained with real data.

The researchers employed a few techniques to prevent this problem, such as adjusting the synthetic data so colors, lighting, and shadows more closely match those found in natural images. They also made adjustments to the model’s inner workings after fine-tuning to further reduce any forgetting.
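The article does not specify what the post-fine-tuning adjustment is, but one common way to reduce forgetting after fine-tuning is to interpolate in weight space between the original and fine-tuned checkpoints (as in WiSE-FT-style weight ensembling). The sketch below is an illustration of that general idea, not the paper's actual mechanism; the function name and signature are hypothetical.

```python
import torch

def interpolate_weights(pretrained_state: dict, finetuned_state: dict,
                        alpha: float = 0.5) -> dict:
    """Blend two checkpoints parameter-by-parameter.

    alpha = 0.0 recovers the original pretrained model (no forgetting,
    no new skill); alpha = 1.0 keeps only the fine-tuned weights.
    Intermediate values trade off new concept knowledge against
    retention of what the model learned from real data.
    """
    return {
        name: (1.0 - alpha) * pretrained_state[name] + alpha * finetuned_state[name]
        for name in pretrained_state
    }
```

In practice the blended state dict would be loaded back into the model with `model.load_state_dict(...)`, and alpha tuned on held-out real data.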

Their synthetic dataset and fine-tuning strategy improved the ability of popular vision and language models to accurately recognize concepts by up to 10 percent. At the same time, the models did not forget what they had already learned.

Now that they have shown how synthetic data can be used to solve this problem, the researchers want to identify ways to improve the visual quality and variety of these data, as well as the underlying physics that makes synthetic scenes look realistic. In addition, they plan to test the limits of scalability, and investigate whether model improvement starts to plateau with larger and more diverse synthetic datasets.

This research is funded, in part, by the U.S. Defense Advanced Research Projects Agency, the National Science Foundation, and the MIT-IBM Watson AI Lab.
