Unveiling EVA-CLIP-18B: A Leap Forward in Open-Source Vision and Multimodal AI Models


Lately, large multimodal models (LMMs) have expanded rapidly, leveraging CLIP as a foundational vision encoder for robust visual representations and LLMs as versatile tools for reasoning across modalities. Yet while LLMs have grown to over 100 billion parameters, the vision models they depend on have remained far smaller, hindering their potential. Scaling up contrastive language-image pretraining (CLIP) is crucial to strengthen both vision and multimodal models, bridging this gap and enabling more effective handling of diverse data types.

Researchers from the Beijing Academy of Artificial Intelligence and Tsinghua University have unveiled EVA-CLIP-18B, the largest open-source CLIP model to date, with 18 billion parameters. Despite seeing only 6 billion training samples, it achieves an impressive 80.7% zero-shot top-1 accuracy averaged across 27 image classification benchmarks, surpassing prior models such as EVA-CLIP. Notably, this advance is achieved with a modest, openly available dataset of two billion image-text pairs, smaller than the datasets used by other state-of-the-art models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling, and the authors hope it will foster further research into vision and multimodal foundation models.

EVA-CLIP-18B is the largest and strongest open-source CLIP model to date, with 18 billion parameters. It outperforms its predecessor EVA-CLIP (5 billion parameters) and other open-source CLIP models by a large margin in zero-shot top-1 accuracy on 27 image classification benchmarks. The principles of EVA and EVA-CLIP guide the scaling-up procedure. The EVA philosophy follows a weak-to-strong paradigm, in which a small EVA-CLIP model serves as the vision encoder initialization for a larger EVA-CLIP model. This iterative scaling process stabilizes and accelerates the training of larger models.

EVA-CLIP-18B, an 18-billion-parameter CLIP model, is trained on a dataset of 2 billion image-text pairs drawn from LAION-2B and COYO-700M. Following the EVA and EVA-CLIP principles, it employs a weak-to-strong paradigm in which a smaller EVA-CLIP model initializes a larger one, stabilizing and expediting training. Evaluation across 33 datasets, covering image classification, video classification, and image-text retrieval, demonstrates its efficacy. The scaling process distills knowledge from a small EVA-CLIP model into a larger one, with the training dataset kept mostly fixed to isolate the effect of the scaling philosophy. Notably, the approach yields sustained performance gains, exemplifying the effectiveness of progressive weak-to-strong scaling.
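The weak-to-strong idea above can be sketched in miniature: a smaller, already-trained model seeds the matching parameters of a larger one, and only the genuinely new parameters start from a fresh random initialization. This is a loose illustration, not the paper's actual procedure (which also involves feature distillation and EVA-style masked-image pretraining); the function and layer names here are hypothetical.

```python
import random

def weak_to_strong_init(small_weights, big_layer_names, init_scale=0.02):
    """Build a larger model's weight dict from a smaller trained model:
    layers whose names match are copied over (the "weak" model bootstraps
    the "strong" one); layers unique to the big model get a small random
    Gaussian initialization. Weights are scalars here purely for illustration."""
    big_weights = {}
    for name in big_layer_names:
        if name in small_weights:
            big_weights[name] = small_weights[name]            # transfer learned weight
        else:
            big_weights[name] = random.gauss(0.0, init_scale)  # fresh init for new layer
    return big_weights

# Toy example: a "small" 2-layer encoder seeding a "big" 4-layer one.
small = {"layer0.w": 0.5, "layer1.w": -0.3}
big = weak_to_strong_init(small, ["layer0.w", "layer1.w", "layer2.w", "layer3.w"])
```

Because the copied layers start from an already-good solution, the larger model's early training is more stable and converges faster, which is the practical payoff of the iterative scaling described above.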

EVA-CLIP-18B, with its 18 billion parameters, delivers outstanding performance across a range of image-related tasks. It achieves an impressive 80.7% zero-shot top-1 accuracy averaged across 27 image classification benchmarks, surpassing its predecessor and other CLIP models by a wide margin. Furthermore, in linear probing on ImageNet-1K it outperforms competitors such as InternVL-C, with an average top-1 accuracy of 88.9%. Zero-shot image-text retrieval on the Flickr30K and COCO datasets achieves an average recall of 87.8, significantly surpassing competitors. EVA-CLIP-18B also exhibits robustness across different ImageNet variants, demonstrating its versatility and high performance across 33 widely used datasets.
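The zero-shot results above all rest on the same CLIP mechanism: embed the image and a text prompt per class into a shared space, L2-normalize, and rank classes by cosine similarity. A minimal sketch with toy hand-made vectors (real embeddings would come from the model's image and text towers) looks like this:

```python
import math

def normalize(v):
    """L2-normalize a vector so dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_classify(image_emb, class_text_embs):
    """CLIP-style zero-shot classification: return the class label whose
    (normalized) text embedding has the highest cosine similarity with
    the (normalized) image embedding. No task-specific training needed —
    only the class names themselves, embedded as text."""
    img = normalize(image_emb)
    scores = []
    for label, emb in class_text_embs.items():
        txt = normalize(emb)
        scores.append((sum(i * t for i, t in zip(img, txt)), label))
    return max(scores)[1]  # label of the highest-scoring class

# Toy demo: the image embedding lies closest to the "cat" text embedding.
prompts = {"cat": [0.9, 0.1, 0.0], "dog": [0.1, 0.9, 0.0], "car": [0.0, 0.0, 1.0]}
predicted = zero_shot_classify([0.8, 0.2, 0.05], prompts)  # → "cat"
```

Zero-shot image-text retrieval works the same way in reverse: rank candidate captions (or images) by the same cosine score, which is what the Flickr30K and COCO recall numbers measure.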

In conclusion, EVA-CLIP-18B is the largest and highest-performing open-source CLIP model, with 18 billion parameters. Applying EVA's weak-to-strong vision scaling principle, it achieves exceptional zero-shot top-1 accuracy across 27 image classification benchmarks. This scaling approach consistently improves performance without showing signs of saturation, pushing the boundaries of vision model capabilities. Notably, EVA-CLIP-18B exhibits robust visual representations, maintaining performance across various ImageNet variants, including adversarial ones. Its versatility and effectiveness are demonstrated across multiple datasets spanning image classification, image-text retrieval, and video classification, marking a significant advancement in CLIP model capabilities.

Check out the Paper. All credit for this research goes to the researchers of this project.



Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is enthusiastic about applying technology and AI to handle real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


