And so, it seems the answer isn’t a fight to the death between CNNs and Transformers (see the numerous overindulgent eulogies for LSTMs), but rather something a bit more romantic. Not only does the adoption of 2D convolutions in hierarchical transformers like CvT and PVTv2 conveniently create multiscale features, reduce the complexity of self-attention, and simplify the architecture by removing the need for positional encoding, but these models also employ residual connections, another inherited trait of their progenitors. The complementary strengths of transformers and CNNs have been brought together in viable offspring.
So is the era of ResNet over? It would certainly seem so, although any paper will still need to include this indefatigable backbone for comparison for a while to come. It is important to remember, however, that there are no losers here, just a new generation of powerful and transferable feature extractors for everyone to enjoy, if they know where to look. Parameter-efficient models like PVTv2 democratize research on more complex architectures by offering powerful feature extraction with a small memory footprint, and should be added to the list of standard backbones for benchmarking new architectures.
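As a minimal sketch of what using such a backbone looks like in practice, the snippet below pulls PVTv2 from the timm library and extracts its hierarchical feature pyramid. The model name (pvt_v2_b0) and the features_only interface are assumptions about the reader's timm version, not something prescribed by this article.

```python
# Minimal sketch: PVTv2 as a multiscale feature-extraction backbone via timm.
# Assumes timm ships 'pvt_v2_b0' weights and supports features_only for it.
import timm
import torch

# features_only=True returns the feature maps from each hierarchical stage
backbone = timm.create_model("pvt_v2_b0", pretrained=True, features_only=True)
backbone.eval()

dummy = torch.randn(1, 3, 224, 224)  # a single RGB image placeholder
with torch.no_grad():
    feature_maps = backbone(dummy)

# Each stage reduces spatial resolution and increases channel depth,
# yielding a multiscale pyramid suitable for detection/segmentation heads.
for i, fmap in enumerate(feature_maps):
    print(f"stage {i}: {tuple(fmap.shape)}")
```

The multiscale outputs are what make these backbones drop-in replacements for ResNet in frameworks that expect a feature pyramid.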
Future Work
This article has focused on how the cross-pollination of convolutional operations and self-attention has given us the evolution of hierarchical feature transformers. These models have shown dominant performance and parameter efficiency at small scales, making them ideal feature extraction backbones (especially in parameter-constrained environments). However, there is a lack of exploration into whether the efficiencies and inductive biases that these models capitalize on at smaller scales can transfer to large-scale success and threaten the dominance of pure ViTs at much higher parameter counts.
Large Multimodal Models (LMMs) like the Large Language and Vision Assistant (LLaVA) and other applications that require a natural language understanding of visual data depend on Contrastive Language-Image Pretraining (CLIP) embeddings generated from ViT-L features, and therefore inherit the strengths and weaknesses of ViT. If research into scaling hierarchical transformers shows that their advantages, such as multiscale features that enhance fine-grained understanding, enable them to achieve similar or better performance with greater parameter efficiency than ViT-L, it could have widespread and immediate practical impact on anything using CLIP: LMMs, robotics, assistive technologies, augmented/virtual reality, content moderation, education, research, and many more applications affecting society and industry could be improved and made more efficient, lowering the barrier for the development and deployment of these technologies.
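To make the dependency concrete, here is a minimal sketch of extracting the CLIP ViT-L/14 image features that LLaVA-style LMMs consume, using the Hugging Face transformers library. The checkpoint name, the placeholder image path, and the choice of patch-level features are illustrative assumptions, not details taken from this article.

```python
# Minimal sketch: extract CLIP ViT-L/14 image features (the kind of
# embeddings LLaVA-style LMMs project into their language model).
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_name = "openai/clip-vit-large-patch14"  # ViT-L/14 vision tower
processor = CLIPImageProcessor.from_pretrained(model_name)
vision_tower = CLIPVisionModel.from_pretrained(model_name)
vision_tower.eval()

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs)

# One token per 14x14 patch plus a [CLS] token; for a 224px input this is
# a (1, 257, 1024) tensor whose quality is bounded by the ViT-L backbone.
patch_features = outputs.last_hidden_state
print(patch_features.shape)
```

Any gain in the quality or efficiency of this vision tower propagates directly to every downstream system built on these embeddings.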