Image recognition accuracy: An unseen challenge confounding today’s AI

Imagine you’re scrolling through the photos on your phone and you come across an image that at first you can’t recognize. It looks like maybe something fuzzy on the couch; could it be a pillow or a coat? After a couple of seconds it clicks: of course! That ball of fluff is your friend’s cat, Mocha. While some of your photos can be understood in an instant, why was this cat photo so much more difficult?

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers were surprised to find that despite the critical importance of understanding visual data in pivotal areas ranging from health care to transportation to household devices, the notion of an image’s recognition difficulty for humans has been almost entirely ignored. Datasets have been one of the major drivers of progress in deep learning-based AI, yet we know little about how data drives progress in large-scale deep learning beyond the fact that bigger is better.

In real-world applications that require understanding visual data, humans outperform object recognition models despite the fact that models perform well on current datasets, including those explicitly designed to challenge machines with debiased images or distribution shifts. This problem persists, in part, because we have no guidance on the absolute difficulty of an image or dataset. Without controlling for the difficulty of images used for evaluation, it’s hard to objectively assess progress toward human-level performance, to cover the range of human abilities, and to increase the challenge posed by a dataset.

To fill in this knowledge gap, David Mayo, an MIT PhD student in electrical engineering and computer science and a CSAIL affiliate, delved into the deep world of image datasets, exploring why certain images are more difficult for humans and machines to recognize than others. “Some images inherently take longer to recognize, and it’s essential to understand the brain’s activity during this process and its relation to machine learning models. Perhaps there are complex neural circuits or unique mechanisms missing in our current models, visible only when tested with challenging visual stimuli. This exploration is crucial for comprehending and enhancing machine vision models,” says Mayo, a lead author of a new paper on the work.

This led to the development of a new metric, the “minimum viewing time” (MVT), which quantifies the difficulty of recognizing an image based on how long a person needs to view it before making a correct identification. Using a subset of ImageNet, a popular dataset in machine learning, and ObjectNet, a dataset designed to test object recognition robustness, the team showed images to participants for varying durations, from as short as 17 milliseconds to as long as 10 seconds, and asked them to choose the correct object from a set of 50 options. After over 200,000 image presentation trials, the team found that existing test sets, including ObjectNet, appeared skewed toward easier, shorter-MVT images, with the vast majority of benchmark performance derived from images that are easy for humans.
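To make the idea concrete, here is a minimal sketch of how an MVT-style score could be computed from trial records. It is not the team’s released tooling: the record layout `(image_id, duration_ms, was_correct)`, the function name `minimum_viewing_time`, and the 0.5 majority threshold are all assumptions made for illustration, and the trials are toy data.

```python
from collections import defaultdict


def minimum_viewing_time(trials, threshold=0.5):
    """Return {image_id: shortest duration in ms at which the share of
    correct responses reaches `threshold`, or None if it never does}.

    `trials` is an iterable of (image_id, duration_ms, was_correct).
    The threshold is an illustrative assumption, not a value from the article.
    """
    # counts[image_id][duration_ms] = [num_correct, num_total]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for image_id, duration_ms, was_correct in trials:
        cell = counts[image_id][duration_ms]
        cell[0] += int(was_correct)
        cell[1] += 1

    mvt = {}
    for image_id, by_duration in counts.items():
        mvt[image_id] = None  # stays None if no duration is long enough
        for duration_ms in sorted(by_duration):
            correct, total = by_duration[duration_ms]
            if correct / total >= threshold:
                mvt[image_id] = duration_ms
                break
    return mvt


# Toy trials (made up): two responses per image per duration.
trials = [
    ("cat_on_couch", 17, False), ("cat_on_couch", 17, False),
    ("cat_on_couch", 150, True), ("cat_on_couch", 150, True),
    ("cat_on_couch", 10000, True), ("cat_on_couch", 10000, True),
    ("mug", 17, True), ("mug", 17, True),
    ("blur", 10000, False), ("blur", 10000, False),
]
print(minimum_viewing_time(trials))
# {'cat_on_couch': 150, 'mug': 17, 'blur': None}
```

With 50 answer options, random guessing would be right about 2 percent of the time, so any reasonable threshold sits far above chance.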

The project identified interesting trends in model performance, particularly in relation to scaling. Larger models showed considerable improvement on simpler images but made less progress on more challenging images. The CLIP models, which incorporate both language and vision, stood out as they moved in the direction of more human-like recognition.

“Traditionally, object recognition datasets have been skewed towards less-complex images, a practice that has led to an inflation in model performance metrics, not truly reflective of a model’s robustness or its ability to tackle complex visual tasks. Our research reveals that harder images pose a more acute challenge, causing a distribution shift that is often not accounted for in standard evaluations,” says Mayo. “We released image sets tagged by difficulty along with tools to automatically compute MVT, enabling MVT to be added to existing benchmarks and extended to various applications. These include measuring test set difficulty before deploying real-world systems, discovering neural correlates of image difficulty, and advancing object recognition techniques to close the gap between benchmark and real-world performance.”

“One of my biggest takeaways is that we now have another dimension to evaluate models on. We want models that are able to recognize any image even if, perhaps especially if, it’s hard for a human to recognize. We’re the first to quantify what this would mean. Our results show that not only is this not the case with today’s state of the art, but also that our current evaluation methods don’t have the ability to tell us when it is the case, because standard datasets are so skewed toward easy images,” says Jesse Cummings, an MIT graduate student in electrical engineering and computer science and co-first author with Mayo on the paper.

From ObjectNet to MVT

A few years ago, the team behind this project identified a significant challenge in the field of machine learning: Models were struggling with out-of-distribution images, or images that were not well-represented in the training data. Enter ObjectNet, a dataset composed of images collected from real-life settings. The dataset helped illuminate the performance gap between machine learning models and human recognition abilities by eliminating spurious correlations present in other benchmarks, for example, between an object and its background. ObjectNet illuminated the gap between the performance of machine vision models on datasets and in real-world applications, encouraging its use by many researchers and developers, which subsequently improved model performance.

Fast forward to the present, and the team has taken their research a step further with MVT. Unlike traditional methods that focus on absolute performance, this new approach assesses how models perform by contrasting their responses to the easiest and hardest images. The study further explored how image difficulty could be explained and tested for similarity to human visual processing. Using metrics like c-score, prediction depth, and adversarial robustness, the team found that harder images are processed differently by networks. “While there are observable trends, such as easier images being more prototypical, a comprehensive semantic explanation of image difficulty continues to elude the scientific community,” says Mayo.
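The contrast between easy and hard images can be illustrated with a short sketch. This is an assumption-laden toy, not the paper’s evaluation code: the function name `accuracy_gap`, the input dictionaries, and the cutoffs of 50 ms and 1,000 ms for “easy” and “hard” are hypothetical choices made for the example.

```python
def accuracy_gap(mvt_ms, model_correct, easy_max_ms=50, hard_min_ms=1000):
    """Return (easy_accuracy, hard_accuracy, gap) for one model.

    mvt_ms:        {image_id: minimum viewing time in ms}
    model_correct: {image_id: True if the model labeled the image correctly}
    The two cutoffs are illustrative assumptions, not values from the study.
    """
    easy = [model_correct[i] for i, t in mvt_ms.items() if t <= easy_max_ms]
    hard = [model_correct[i] for i, t in mvt_ms.items() if t >= hard_min_ms]
    if not easy or not hard:
        raise ValueError("need at least one easy and one hard image")
    easy_acc = sum(easy) / len(easy)
    hard_acc = sum(hard) / len(hard)
    return easy_acc, hard_acc, easy_acc - hard_acc


# Toy inputs (made up): two easy images, one in between, two hard images.
mvt_ms = {"a": 17, "b": 17, "c": 150, "d": 10000, "e": 10000}
model_correct = {"a": True, "b": True, "c": True, "d": True, "e": False}
print(accuracy_gap(mvt_ms, model_correct))
# (1.0, 0.5, 0.5)
```

A model whose overall accuracy looks strong can still show a large gap here, which is the kind of difference an aggregate score on an easy-skewed test set would hide.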

In the realm of health care, for example, the pertinence of understanding visual complexity becomes even more pronounced. The ability of AI models to interpret medical images, such as X-rays, is subject to the diversity and difficulty distribution of the images. The researchers advocate for a meticulous analysis of difficulty distribution tailored for professionals, ensuring AI systems are evaluated based on expert standards, rather than layperson interpretations.

Mayo and Cummings are currently looking at the neurological underpinnings of visual recognition as well, probing into whether the brain exhibits differential activity when processing easy versus challenging images. The study aims to unravel whether complex images recruit additional brain areas not typically associated with visual processing, hopefully helping demystify how our brains accurately and efficiently decode the visual world.

Toward human-level performance

Looking ahead, the researchers are not only focused on exploring ways to enhance AI’s predictive capabilities regarding image difficulty. The team is also working on identifying correlations with viewing-time difficulty in order to generate harder or easier versions of images.

Despite the study’s significant strides, the researchers acknowledge limitations, particularly in terms of the separation of object recognition from visual search tasks. The current methodology concentrates on recognizing objects, leaving out the complexities introduced by cluttered images.

“This comprehensive approach addresses the long-standing challenge of objectively assessing progress towards human-level performance in object recognition and opens new avenues for understanding and advancing the field,” says Mayo. “With the potential to adapt the Minimum Viewing Time difficulty metric for a variety of visual tasks, this work paves the way for more robust, human-like performance in object recognition, ensuring that models are truly put to the test and are ready for the complexities of real-world visual understanding.”

“This is a fascinating study of how human perception can be used to identify weaknesses in the ways AI vision models are typically benchmarked, which overestimate AI performance by concentrating on easy images,” says Alan L. Yuille, Bloomberg Distinguished Professor of Cognitive Science and Computer Science at Johns Hopkins University, who was not involved in the paper. “This will help develop more realistic benchmarks leading not only to improvements to AI but also make fairer comparisons between AI and human perception.”

“It’s widely claimed that computer vision systems now outperform humans, and on some benchmark datasets, that’s true,” says Anthropic technical staff member Simon Kornblith PhD ’17, who was also not involved in this work. “However, a lot of the difficulty in those benchmarks comes from the obscurity of what’s in the images; the average person just doesn’t know enough to classify different breeds of dogs. This work instead focuses on images that people can only get right if given enough time. These images are generally much harder for computer vision systems, but the best systems are only a bit worse than humans.”

Mayo, Cummings, and Xinyu Lin MEng ’22 wrote the paper alongside CSAIL Research Scientist Andrei Barbu, CSAIL Principal Research Scientist Boris Katz, and MIT-IBM Watson AI Lab Principal Researcher Dan Gutfreund. The researchers are affiliates of the MIT Center for Brains, Minds, and Machines.

The team is presenting their work at the 2023 Conference on Neural Information Processing Systems (NeurIPS).

