Natural language processing (NLP) has entered a transformational period with the introduction of Large Language Models (LLMs), such as the GPT series, which have set new performance standards across a wide range of linguistic tasks. Autoregressive pretraining, which teaches models to forecast the most likely next token in a sequence, is one of the main factors behind this remarkable success. Thanks to this fundamental technique, models can absorb the complex interplay between syntax and semantics, contributing to their exceptional, human-like language understanding. Beyond NLP, autoregressive pretraining has also contributed substantially to computer vision.
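To make the objective concrete, below is a minimal sketch of next-token (autoregressive) pretraining; the model interface, tokenized inputs, and tensor shapes are illustrative placeholders, not anything from the GPT or iGPT codebases.

```python
import torch.nn.functional as F

def autoregressive_loss(model, token_ids):
    """Next-token prediction: each position is trained to predict the token
    that follows it, using only the tokens to its left (causal masking is
    assumed to happen inside `model`).

    token_ids: LongTensor of shape (batch, seq_len)
    model:     maps (batch, seq_len) -> logits of shape (batch, seq_len, vocab)
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                        # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),      # flatten all positions
        targets.reshape(-1),                      # each position's "next" token
    )
```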
In computer vision, autoregressive pretraining was initially successful, but subsequent developments have revealed a sharp paradigm shift in favor of BERT-style pretraining. This shift is noteworthy, especially in light of the initial results from iGPT, which showed that autoregressive and BERT-style pretraining performed similarly across various tasks. However, owing to its greater effectiveness in visual representation learning, subsequent research has come to prefer BERT-style pretraining. For example, MAE shows that a scalable approach to visual representation learning can be as simple as predicting the values of randomly masked pixels.
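To make the contrast with the autoregressive objective concrete, here is a schematic of the MAE-style idea of reconstructing randomly masked patches; the mask ratio, patch layout, and encoder/decoder interfaces are assumptions for illustration, not MAE's actual implementation.

```python
import torch

def mae_style_loss(encoder, decoder, patches, mask_ratio=0.75):
    """BERT-style (MAE) objective: hide a random subset of image patches,
    encode only the visible ones, and regress the pixels of the hidden ones.

    patches: (batch, num_patches, patch_dim) flattened image patches
    """
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))

    # Random permutation per sample; the first `num_keep` indices stay visible.
    noise = torch.rand(b, n, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :num_keep], ids_shuffle[:, num_keep:]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    latent = encoder(visible)                     # encode visible patches only
    pred = decoder(latent, ids_mask)              # predict pixels of masked patches

    target = torch.gather(patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, d))
    return ((pred - target) ** 2).mean()          # per-pixel MSE on masked patches
```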
In this work, a research team from Johns Hopkins University and UC Santa Cruz reexamined iGPT and asked whether autoregressive pretraining can produce highly proficient vision learners, particularly when applied at scale. Their approach incorporates two essential changes. First, following BEiT, the team "tokenizes" images into semantic tokens, given that images are naturally noisy and redundant. This modification shifts the focus of autoregressive prediction from pixels to semantic tokens, allowing a more sophisticated understanding of the interactions between different image regions. Second, the team adds a discriminative decoder alongside the generative decoder, which autoregressively predicts the next semantic token.
This extra component is responsible for predicting the semantic tokens of the visible pixels. Interestingly, discriminatively trained models such as CLIP provide the semantic visual tokens best suited to this pretraining pathway. The research team calls this improved method D-iGPT. Extensive experiments across various datasets and tasks confirm the effectiveness of D-iGPT. Using ImageNet-1K as the only pretraining dataset, their base-size model achieves 86.2% top-1 classification accuracy, outperforming the previous state of the art by 0.6%.
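A rough sketch of how the two objectives could be combined is shown below; the encoder, the two decoders, the frozen CLIP-like tokenizer, and the equal loss weighting are placeholders meant only to illustrate pairing a generative (next-token) head with a discriminative (visible-token) head, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def d_igpt_style_loss(encoder, gen_decoder, disc_decoder, tokenizer, images):
    """Combine a generative and a discriminative head on top of semantic tokens.

    tokenizer(images)   -> semantic token ids, shape (batch, seq_len), frozen
    encoder(images)     -> per-position features, shape (batch, seq_len, dim)
    gen_decoder(feats)  -> logits over the token vocabulary (batch, seq_len, vocab)
    disc_decoder(feats) -> logits over the token vocabulary (batch, seq_len, vocab)
    """
    with torch.no_grad():
        targets = tokenizer(images)               # semantic tokens, not raw pixels

    feats = encoder(images)

    # Generative branch: position t predicts the semantic token at t + 1.
    gen_logits = gen_decoder(feats)[:, :-1]
    gen_loss = F.cross_entropy(
        gen_logits.reshape(-1, gen_logits.size(-1)), targets[:, 1:].reshape(-1)
    )

    # Discriminative branch: position t predicts its own (visible) semantic token.
    disc_logits = disc_decoder(feats)
    disc_loss = F.cross_entropy(
        disc_logits.reshape(-1, disc_logits.size(-1)), targets.reshape(-1)
    )

    return gen_loss + disc_loss                   # equal weighting assumed here
```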
Moreover, their large-scale model reaches 89.5% top-1 classification accuracy when trained on 36 million publicly available images. D-iGPT achieves performance comparable to prior state-of-the-art models trained on public datasets, despite using far less training data and a smaller model. Using the same pretraining and fine-tuning datasets, the team also evaluated D-iGPT on semantic segmentation, finding that it performs better than its MAE counterparts.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.