
Researchers from Google Research, Google DeepMind, and the University of Waterloo introduce SWIM-IR, a synthetic retrieval training dataset spanning 33 languages, addressing the scarcity of human-labeled training pairs in multilingual retrieval. Built with the SAP (summarize-then-ask prompting) method, SWIM-IR enables synthetic fine-tuning of multilingual dense retrieval models without human supervision. SWIM-X models, trained on SWIM-IR, are competitive with human-supervised dense retrieval models across various benchmarks, including XOR-Retrieve, XTREME-UP, and MIRACL.
The study addresses limitations of existing multilingual dense retrieval models, which face challenges due to scarce or unevenly distributed training data. SWIM-IR employs SAP to help LLMs generate informative queries in the target language: the model is first prompted to summarize a passage, then to produce a query grounded in that summary. SWIM-X models, trained on SWIM-IR, exhibit competitive performance with human-supervised models across various benchmarks, highlighting the potential of synthetic datasets as a cost-effective alternative to human-labeled training data for multilingual dense retrieval models.
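To make the two-step prompting concrete, here is a minimal sketch of an SAP-style generation call. The `generate` helper and the prompt wording are illustrative assumptions, not the paper's exact templates (the paper uses PaLM 2 Small as the generator):

```python
def generate(prompt: str) -> str:
    # Placeholder for an LLM call (the paper uses PaLM 2 Small);
    # swap in your model's API here.
    return "<LLM output>"

def sap_query(passage: str, target_language: str) -> str:
    """Summarize-then-ask prompting: summarize first, then ask for a query."""
    # Step 1: summarize the passage so generation focuses on its key facts.
    summary = generate(
        f"Summarize the following passage in a few sentences.\n\n"
        f"Passage: {passage}\n\nSummary:"
    )
    # Step 2: ask for a question in the target language that the passage answers.
    query = generate(
        f"Passage: {passage}\n"
        f"Summary: {summary}\n\n"
        f"Write a question in {target_language} that this passage answers.\n"
        f"Question:"
    )
    return query
```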
The research attributes the limited success of multilingual dense retrieval models to insufficient supervised training data for non-English languages. The synthetic dataset enables fine-tuning of multilingual dense retrieval models, evaluated on benchmarks such as XOR-Retrieve, XTREME-UP, and MIRACL. Results show that SWIM-IR can substitute for expensive human-labeled training data, with models trained on it performing competitively against human-supervised counterparts.
SWIM-IR, a synthetic retrieval training dataset spanning 33 languages, was generated with the SAP technique. Using SWIM-IR, the study explores synthetic fine-tuning of multilingual dense retrieval models, adapting the Dense Passage Retrieval (DPR) architecture. With the T5X Retrieval framework, the authors replicate the mContriever and mDPR zero-shot baselines by initializing from a multilingual T5-base checkpoint and fine-tuning on the English MS MARCO dataset. The retrievers are pretrained on the mC4 dataset and trained with a contrastive loss over in-batch negatives, and the PaLM 2 Small model is used for cross-lingual query generation.
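As a rough illustration of the contrastive objective over in-batch negatives, here is a minimal PyTorch sketch; the paper's implementation uses the T5X Retrieval framework, so the tensor names and shapes here are assumptions for clarity, not the authors' code:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              passage_emb: torch.Tensor) -> torch.Tensor:
    """DPR-style contrastive loss with in-batch negatives.

    query_emb:   [B, d] embeddings of B queries.
    passage_emb: [B, d] embeddings of their gold passages; passage j acts
                 as a negative for query i whenever i != j.
    """
    scores = query_emb @ passage_emb.T                       # [B, B] similarities
    labels = torch.arange(scores.size(0), device=scores.device)  # gold on diagonal
    return F.cross_entropy(scores, labels)

# Example: a batch of 4 query/passage pairs with 128-dim embeddings.
q = torch.randn(4, 128)
p = torch.randn(4, 128)
loss = in_batch_contrastive_loss(q, p)
```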
Fine-tuned on synthetic training data from SWIM-IR, SWIM-X models exhibit competitive performance on multilingual dense retrieval tasks. SWIM-X (7M) outperforms mContriever-X, the best fine-tuned model, by 7.1 points on Recall@5kt on the XOR-Retrieve benchmark. Even the limited-budget baseline, SWIM-X (500K), surpasses mContriever-X by 3.6 points. SWIM-X (180K) is competitive on the MIRACL benchmark, outperforming the best zero-shot model by 6.6 points on nDCG@10, although it falls short of mContriever-X, which benefits from human-labeled training pairs with hard negatives. The synthetic baselines SWIM-X (120K) and SWIM-X (120K)MT show promising results against cross-lingual supervised baselines, outperforming existing models on Recall@5kt. The study emphasizes the importance of optimized training techniques, including better sampling of hard negatives with SWIM-IR, to further improve the performance of synthetic models.
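The paper does not spell out a hard-negative mining recipe for SWIM-IR, but a common approach is to retrieve top-ranked non-gold passages with the current model and reuse them as negatives in the next round. A minimal sketch under that assumption (all names, including `mine_hard_negatives`, are hypothetical):

```python
import torch

def mine_hard_negatives(query_emb: torch.Tensor,
                        corpus_emb: torch.Tensor,
                        gold_ids: torch.Tensor,
                        k: int = 5) -> list[list[int]]:
    """For each query, return up to k top-scoring passages that are NOT its
    gold passage; these serve as hard negatives for further training.

    query_emb: [Q, d], corpus_emb: [N, d], gold_ids: [Q] gold passage indices.
    """
    scores = query_emb @ corpus_emb.T              # [Q, N] similarities
    top = scores.topk(k + 1, dim=1).indices        # +1 in case the gold is in top-k
    negatives = []
    for qi in range(top.size(0)):
        negs = [int(pid) for pid in top[qi] if int(pid) != int(gold_ids[qi])]
        negatives.append(negs[:k])                 # drop the gold, keep k negatives
    return negatives
```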
The SWIM-IR dataset employed in the study exhibits limitations, including decontextualization, code-switching, passage quality and length issues, and factual inconsistencies in LLM generation. The study acknowledges that LLMs may generate text insufficiently grounded in knowledge sources, posing risks of misinformation and hallucination in generated outputs. While these limitations may impact the quality and accuracy of the generated queries, they do not directly affect the downstream multilingual retrieval task. The paper, however, does not extensively discuss limitations of the methods themselves, such as the SAP approach or the fine-tuning process.
SWIM-IR is a synthetic multilingual retrieval training dataset created using the SAP approach to generate informative queries in multiple languages. With 28 million query-passage training pairs across 33 languages, SWIM-IR facilitates fine-tuning of multilingual dense retrieval models without requiring human-labeled training data. The resulting SWIM-X models exhibit competitive performance on multilingual retrieval tasks, outperforming existing models on recall and mean reciprocal rank on both cross-lingual and monolingual benchmarks. This underscores SWIM-IR's potential as a cost-effective substitute for expensive human-labeled retrieval training data, enabling the development of robust multilingual dense retrieval models.
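To illustrate the shape of such query-passage training pairs, here is a hypothetical example record; the field names are assumptions for illustration, not SWIM-IR's actual schema:

```python
# Hypothetical SWIM-IR-style training pair (field names are illustrative,
# not the dataset's actual schema).
training_pair = {
    "lang": "bn",                                     # target query language
    "query": "<LLM-generated query in the target language>",
    "passage": "<source passage, e.g. a Wikipedia paragraph>",
    "passage_title": "<source article title>",
}
```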
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.