Can We Transfer the Capabilities of LLMs like LLaMA from English to Non-English Languages? A Deep Dive into Multilingual Model Proficiency

Significant achievements have been made with LLMs, exemplified by ChatGPT, which excel at complex language-processing tasks. However, most mainstream LLMs, such as LLaMA, are pre-trained on English-dominated corpora. Another example is LaMDA, proposed by Google, which is pre-trained on text that is over 90% English. This limits the performance of LLMs in non-English languages, which is a matter of concern for non-English users.

Recent strides in LLMs like ChatGPT, PaLM, and LLaMA showcase advanced reasoning, planning, and experiential learning capabilities. While many LLMs comprehend diverse languages, imbalanced language resources pose challenges: BLOOM's pretraining covers 46 languages but still lacks diversity, and LLaMA struggles with non-English languages. Investigations into vocabulary extension and the transfer process reveal that language capabilities can be transferred efficiently at minimal cost.

The researchers at the School of Computer Science, Fudan University, have focused on effectively transferring language generation and instruction-following capabilities to non-English languages. To this end, they analyzed the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. Evaluation involves four standardized benchmarks.
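To make the vocabulary-extension step concrete, here is a minimal sketch of how a LLaMA-style tokenizer and embedding matrix can be extended with target-language tokens before further pretraining. The checkpoint name and the token list are illustrative placeholders, not the paper's exact setup.

```python
# Sketch of vocabulary extension with Hugging Face Transformers.
# Assumptions: access to a LLaMA-style checkpoint and a list of frequent
# target-language tokens (here, a few Chinese examples).
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical checkpoint choice
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name)

# Hypothetical extra tokens covering frequent Chinese words/characters.
new_tokens = ["你好", "世界", "语言模型"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so the new token IDs get trainable vectors;
# these new rows are then learned during further pretraining on Chinese text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```

Notably, the study finds that this step is not required for effective transfer; the sketch only illustrates what "vocabulary extension" refers to in the comparison.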

The research explores transferring language generation and instruction-following capabilities to non-English languages using LLaMA. Because of its rich linguistic resources, the study uses Chinese as the starting point and then extends its findings to more than ten low-resource languages. Models include LLaMA, LLaMA2, Chinese LLaMA, Chinese LLaMA2, and Open Chinese LLaMA, each with different pretraining scales. Evaluation involves benchmarks such as LLM-Eval, C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Response quality is assessed on accuracy, fluency, informativeness, logical coherence, and harmlessness. The study achieves state-of-the-art performance with minimal pretraining data, providing insights for non-English LLM development.
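For intuition on how the five response-quality dimensions could roll up into a single score (as in the AVG. column of Table 1 below), here is a small sketch that takes an unweighted mean over the dimensions. The example scores and the unweighted-mean assumption are illustrative, not figures from the paper.

```python
# Sketch: aggregating LLM-Eval-style dimension scores into one average.
# Assumption: the average is an unweighted mean over the five dimensions.
from statistics import mean

DIMENSIONS = ["accuracy", "fluency", "logical_coherence", "harmlessness", "informativeness"]

def average_score(scores: dict) -> float:
    """Return the unweighted mean over the five quality dimensions."""
    return mean(scores[d] for d in DIMENSIONS)

# Made-up scores for a single hypothetical language, for illustration only.
example = {
    "accuracy": 0.62,
    "fluency": 0.88,
    "logical_coherence": 0.81,
    "harmlessness": 0.95,
    "informativeness": 0.70,
}
print(f"AVG. = {average_score(example):.2f}")
```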

The study investigates language transfer to non-English languages using LLaMA, focusing on vocabulary extension, the impact of training scale, and multilingual proficiency. Surprisingly, extending the vocabulary diminishes performance in Chinese. While increased pretraining scale initially improves response quality, it plateaus, emphasizing language generation over knowledge acquisition. English proficiency suffers under exclusive Chinese training. Evaluations across 13 low-resource languages show that SFT data boosts response quality, with Arabic, Indonesian, and Vietnamese excelling. Code-switching samples suggest LLaMA learns cross-lingual semantic alignment during pretraining, enhancing transferability. The study emphasizes nuanced approaches for effective non-English LLM development.
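Since SFT (instruction tuning) is the main lever for response quality in the low-resource-language evaluations, here is a minimal sketch of how a single non-English SFT example might be formatted. The Alpaca-style template and the example texts are assumptions for illustration; the paper's actual template and data may differ.

```python
# Sketch: formatting one supervised fine-tuning (SFT) example for a
# non-English target language. Template and field names are illustrative.
def format_sft_example(instruction: str, response: str) -> str:
    """Render an instruction/response pair with a simple Alpaca-style prompt."""
    return (
        "Below is an instruction. Write a response that completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{response}"
    )

sample = format_sft_example(
    # "Explain in one sentence what a language model is."
    instruction="请用一句话解释什么是语言模型。",
    # "A language model models the probability distribution of word sequences in text."
    response="语言模型是对文本中词语序列的概率分布进行建模的模型。",
)
print(sample)
```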

Table 1: Evaluation results of model response quality for 13 low-resource languages on LLM-Eval. ACC., F., LC., H., INFO., and AVG. respectively denote accuracy, fluency, logical coherence, harmlessness, informativeness, and average.
Researchers have focused on effectively transferring language generation and instruction-following capabilities to non-English languages. Specifically, they conducted a comprehensive empirical study to investigate the necessity of vocabulary extension and the training scale required for effective transfer. They found that vocabulary extension is unnecessary and that transfer performance comparable to state-of-the-art models can be achieved with less than 1% of the further pretraining data. Similar results are observed in the extension experiments on the 13 low-resource languages.


Take a look at the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our 35k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.

If you like our work, you will love our newsletter.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.

