
The Rise of Vision Transformers

And so it seems the answer is not a fight to the death between CNNs and Transformers (see the many overindulgent eulogies for LSTMs), but rather something a bit more romantic. The adoption of 2D convolutions in hierarchical transformers like CvT and PVTv2 not only conveniently creates multiscale features, reduces the complexity of self-attention, and simplifies the architecture by removing the need for positional encoding; these models also employ residual connections, another trait inherited from their progenitors. The complementary strengths of transformers and CNNs have been brought together in viable offspring.
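To make these inherited traits concrete, here is a minimal PyTorch sketch, not the official CvT or PVTv2 code; the module names, kernel sizes, and the sr_ratio reduction factor are illustrative. It shows an overlapping convolutional patch embedding that tokenizes and downsamples an image in one step, and a spatial-reduction attention layer that shrinks the key/value sequence with a strided convolution to tame the quadratic cost of self-attention, finishing with a CNN-style residual connection.

```python
# Minimal, illustrative sketch of two ideas discussed above (not official code):
# a convolutional patch embedding and spatial-reduction self-attention.
import torch
import torch.nn as nn


class ConvPatchEmbed(nn.Module):
    """Overlapping 2D-conv patch embedding: downsamples and tokenizes in one step."""
    def __init__(self, in_ch, dim, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=stride + 3,
                              stride=stride, padding=(stride + 3) // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, dim, H/stride, W/stride)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, dim)
        return self.norm(tokens), H, W


class SpatialReductionAttention(nn.Module):
    """Self-attention whose keys/values come from a spatially reduced feature map."""
    def __init__(self, dim, num_heads=2, sr_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, H, W):           # tokens: (B, H*W, dim)
        B, N, C = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(B, C, H, W)
        reduced = self.sr(fmap).flatten(2).transpose(1, 2)  # (B, N/sr^2, dim)
        reduced = self.norm(reduced)
        out, _ = self.attn(tokens, reduced, reduced)         # queries keep full resolution
        return tokens + out                                  # residual connection, as in CNNs


if __name__ == "__main__":
    img = torch.randn(1, 3, 224, 224)
    embed = ConvPatchEmbed(3, 64, stride=4)
    attn = SpatialReductionAttention(64, num_heads=2, sr_ratio=4)
    tokens, H, W = embed(img)
    print(attn(tokens, H, W).shape)  # torch.Size([1, 3136, 64])
```

Stacking stages like this at progressively larger strides (4, 8, 16, 32) is what yields the multiscale feature pyramid these models hand to downstream detection and segmentation heads.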

So is the era of ResNet over? It would certainly seem so, although papers will need to include this indefatigable backbone for comparison for some time to come. It is important to remember, however, that there are no losers here, just a new generation of powerful and transferable feature extractors for all to enjoy, if they know where to look. Parameter-efficient models like PVTv2 democratize research into more complex architectures by offering powerful feature extraction with a small memory footprint, and they deserve a place on the list of standard backbones for benchmarking new architectures.

Future Work

This article has focused on how the cross-pollination of convolutional operations and self-attention has driven the evolution of hierarchical feature transformers. These models have shown dominant performance and parameter efficiency at small scales, making them ideal feature extraction backbones (especially in parameter-constrained environments). However, there has been little exploration of whether the efficiencies and inductive biases these models capitalize on at smaller scales can transfer to large-scale success and threaten the dominance of pure ViTs at much higher parameter counts.

Large Multimodal Models (LMMs) like the Large Language and Vision Assistant (LLaVA) and other applications that require a natural language understanding of visual data depend on Contrastive Language–Image Pretraining (CLIP) embeddings generated from ViT-L features, and therefore inherit the strengths and weaknesses of ViT. If research into scaling hierarchical transformers shows that their advantages, such as multiscale features that enhance fine-grained understanding, enable them to achieve better or comparable performance with greater parameter efficiency than ViT-L, it could have widespread and immediate practical impact on anything using CLIP: LMMs, robotics, assistive technologies, augmented/virtual reality, content moderation, education, research, and many other applications affecting society and industry could be improved and made more efficient, lowering the barrier to developing and deploying these technologies.
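For context, this is roughly how CLIP ViT-L image features are obtained today with the Hugging Face transformers library; the checkpoint name and shapes come from the public openai/clip-vit-large-patch14 release, the image path is illustrative, and LLaVA-style models typically consume patch-level hidden states from the vision tower rather than the pooled vector. A parameter-efficient hierarchical backbone would need to slot in as a drop-in replacement for this encoder.

```python
# Sketch: extracting CLIP ViT-L image embeddings with Hugging Face transformers,
# the kind of vision features LMM pipelines build on.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("example.jpg")  # illustrative path; any RGB image works
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # Pooled, projected embedding used by CLIP's contrastive objective.
    image_features = model.get_image_features(**inputs)
    # Patch-level features (what LLaVA-style models typically consume) come
    # from the vision tower's hidden states instead.
    vision_out = model.vision_model(**inputs, output_hidden_states=True)
    patch_tokens = vision_out.hidden_states[-2][:, 1:, :]  # drop the [CLS] token

print(image_features.shape, patch_tokens.shape)
```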
