BAAI, with assistance from researchers at the University of Science and Technology of China, introduces BGE M3-Embedding. The M3 stands for three novel properties of the text embedding model: Multi-Linguality, Multi-Functionality, and Multi-Granularity. The work first identifies key challenges in prevailing embedding models, such as the inability to support multiple languages, restriction to a single retrieval functionality, and difficulty handling varied input granularities.
Existing embedding models, such as Contriever, GTR, and E5, have brought notable progress to the field, but they lack broad language support, multiple retrieval functionalities, or the ability to handle long input texts. These models are mainly trained for English and support only one retrieval functionality. The proposed solution, BGE M3-Embedding, supports over 100 languages, accommodates diverse retrieval functionalities (dense, sparse, and multi-vector retrieval), and processes inputs ranging from short sentences to long documents of up to 8192 tokens.
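To make the three retrieval modes concrete, here is a minimal sketch of how their scoring functions differ, using toy NumPy vectors. This is an illustration of the general techniques (single-vector inner product, term-weight overlap, and ColBERT-style late interaction), not the model's actual implementation; the function names and weights are hypothetical.

```python
import numpy as np

def dense_score(q_cls, p_cls):
    # Dense retrieval: inner product between single sentence-level
    # embeddings (e.g. the [CLS] vectors) of query and passage.
    return float(q_cls @ p_cls)

def lexical_score(q_weights, p_weights):
    # Sparse/lexical retrieval: sum the products of learned term weights
    # over the terms that query and passage share.
    return sum(w * p_weights[t] for t, w in q_weights.items() if t in p_weights)

def multi_vector_score(q_vecs, p_vecs):
    # Multi-vector (late-interaction) retrieval: for each query token
    # vector, take its maximum similarity over all passage token vectors,
    # then average across query tokens.
    sim = q_vecs @ p_vecs.T            # (num_q_tokens, num_p_tokens)
    return float(sim.max(axis=1).mean())

# Toy example (hypothetical values):
q_cls, p_cls = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(dense_score(q_cls, p_cls))                       # single-vector score
print(lexical_score({"cat": 2.0, "sat": 1.0},
                    {"cat": 1.5, "mat": 1.0}))         # overlap on "cat" only
q_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
p_vecs = np.array([[1.0, 0.0], [0.0, 0.0]])
print(multi_vector_score(q_vecs, p_vecs))
```

In practice the three scores can also be combined into a single ranking signal, which is what the self-knowledge distillation described below exploits.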
M3-Embedding involves a novel self-knowledge distillation approach and optimized batching strategies for large input lengths, for which the researchers used large-scale, diverse multi-lingual datasets from sources such as Wikipedia and S2ORC. It supports three common retrieval functionalities: dense retrieval, lexical retrieval, and multi-vector retrieval. The distillation process combines the relevance scores from these retrieval functionalities into a teacher signal, which lets the model learn to perform all of the retrieval tasks efficiently.
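The core idea of self-knowledge distillation can be sketched in a few lines: the (weighted) sum of the three heads' relevance scores over a query's candidate passages serves as the teacher, and each individual head is trained toward the teacher's soft label distribution. The sketch below is a simplified illustration under that assumption; the combination weights and loss form are illustrative, not the paper's exact formulation.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over a 1-D score vector.
    x = np.asarray(x, dtype=float)
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def self_distillation_loss(s_dense, s_lex, s_mul, w=(1.0, 0.3, 1.0)):
    # Inputs: one query's relevance scores over its candidate passages,
    # from the dense, lexical, and multi-vector heads respectively.
    # The weighted sum acts as the teacher signal (weights `w` are
    # hypothetical here, not the paper's exact values).
    s_dense, s_lex, s_mul = (np.asarray(s, dtype=float)
                             for s in (s_dense, s_lex, s_mul))
    teacher = w[0] * s_dense + w[1] * s_lex + w[2] * s_mul
    p_teacher = np.exp(log_softmax(teacher))   # teacher's soft labels

    # Cross-entropy of each student head against the teacher distribution.
    heads = (s_dense, s_lex, s_mul)
    loss = -sum((p_teacher * log_softmax(s)).sum() for s in heads)
    return float(loss / len(heads))

# Toy example: all heads agree that passage 0 beats passage 1,
# so the distillation loss is small but non-zero.
print(self_distillation_loss([2.0, 0.0], [2.0, 0.0], [2.0, 0.0]))
```

When the heads agree on a sharp ranking, each student already matches the teacher closely, so the loss is lower than for uninformative (uniform) scores; training pushes the weaker heads toward the integrated signal.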
The model is evaluated for its performance on multilingual text retrieval, long-document retrieval (MLDR), varied sequence lengths, and narrative QA. The evaluation metric is nDCG@10 (normalized discounted cumulative gain). The experiments demonstrate that M3-Embedding outperforms existing models in more than 10 languages while delivering comparable results in English. With shorter inputs its performance is similar to that of the other models, but it shows clearly improved results on longer texts.
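For readers unfamiliar with the metric, nDCG@10 rewards rankings that place highly relevant documents near the top, discounting gains logarithmically by rank and normalizing by the best possible ordering. A minimal implementation of one common formulation (linear gains; some variants use 2^rel − 1 instead):

```python
import math

def ndcg_at_k(ranked_relevances, k=10):
    # `ranked_relevances`: graded relevance labels of the retrieved
    # documents, in the order the system ranked them.
    def dcg(rels):
        # Discounted cumulative gain: rank i (0-based) is discounted
        # by log2(i + 2), so position 1 has discount log2(2) = 1.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; demoting the only relevant
# document below an irrelevant one scores less.
print(ndcg_at_k([3, 2, 1, 0]))   # ideal order
print(ndcg_at_k([0, 3]))         # relevant doc ranked second
```

Scores are averaged over all test queries to produce the reported nDCG@10 figures.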
In conclusion, M3-Embedding is a significant advancement in text embedding models. It is a versatile solution that supports multiple languages, varied retrieval functionalities, and different input granularities. The proposed model addresses crucial limitations in existing methods, marking a considerable step forward in information retrieval. It outperforms baseline methods such as BM25, mDPR, and E5, demonstrating its effectiveness on the identified challenges.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in software and data science applications, and she keeps up with developments across different fields of AI and ML.