Have you ever tried asking ChatGPT a question in a language other than English? You may get a strange, unrelated answer because these models are often biased toward English. Wouldn't it be easier if LLMs worked in any language?
Researchers at the National Key Laboratory for Novel Software Technology propose a method for extending pre-trained LLMs to non-English languages. LLMs typically perform poorly in non-English languages because both the pre-training corpus and the instruction-tuning data are predominantly English. One way to improve this is continued pre-training on large-scale monolingual data in the target language.
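To make this concrete, here is a minimal sketch of what continued pre-training on monolingual data could look like using the Hugging Face transformers and datasets libraries. The checkpoint name, data file, and hyperparameters below are illustrative placeholders, not the paper's actual setup:

```python
# Minimal sketch of continued pre-training on monolingual data.
# Checkpoint id, data file, and hyperparameters are placeholders,
# not the configuration used in the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "huggyllama/llama-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load a monolingual corpus in the target language (placeholder path).
corpus = load_dataset("text", data_files={"train": "monolingual_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Causal LM collator: labels are the input ids, shifted inside the model.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cpt-checkpoint",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```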
The researchers perform instruction tuning on LLMs with translation tasks to strengthen the correspondence between the two languages, and they use cross-lingual general-task data to improve instruction-following ability. They use LLaMA-7B as their pre-trained LLM and evaluate six non-English languages. LLaMA stands for Large Language Model Meta AI.
An x-LLaMA model is obtained for each language by tuning on language-specific data, and these models are then compared with baseline LLMs. Language modeling requires predicting the next token given the prefix sequence, so the LLM must be trained on a large-scale corpus together with translation data. Translation data is one of the most useful resources for learning semantic alignment, and an LLM's translation performance can be enhanced by instruction tuning on human expert-annotated translation data.
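For reference, this next-token prediction corresponds to the standard causal language modeling objective (a textbook formula, not notation taken from the paper):

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

where $x_1, \dots, x_T$ is the token sequence and $\theta$ are the model parameters; minimizing this loss trains the model to predict each token from its prefix.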
The researchers use publicly available sentence-level translation datasets to construct translation-task instruction data, which makes their method scalable, reproducible, and extendable to more languages. They find that placing the non-English text on the target side of the translation data boosts the LLM's performance on non-English tasks more than placing it on the source side.
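The sketch below shows one way to turn parallel sentences into instruction data with the non-English text on the target side. The record schema and prompt wording are hypothetical, Alpaca-style illustrations, not the paper's exact format:

```python
# Hypothetical sketch: building translation instruction data with the
# non-English text on the target (output) side. The record schema and
# prompt template are illustrative, not the paper's exact format.
import json

parallel_pairs = [
    # (English source, non-English target) -- toy examples
    ("The weather is nice today.", "今天天气很好。"),
    ("I like reading books.", "我喜欢读书。"),
]

records = []
for src, tgt in parallel_pairs:
    records.append({
        "instruction": "Translate the following sentence into Chinese.",
        "input": src,   # English stays on the source side
        "output": tgt,  # non-English goes on the target side
    })

with open("translation_instructions.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```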
The researchers used bilingual translation performance as a proxy for measuring semantic alignment. They found that the scale of the translation-task instruction data also greatly impacts alignment, and they derived an expression relating translation performance to data scale that has a logarithmic dependence in exponential form. They also find that a language less similar to English requires more translation data to build semantic alignment than a language closer to English.
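The paper's exact formula is not reproduced here, but a dependence that is "logarithmic in exponential form" can be illustrated by a hypothetical fit of the shape

$$P(D) \approx \exp\!\big(\alpha \log D + \beta\big) = e^{\beta}\, D^{\alpha},$$

where $P$ would be translation performance, $D$ the scale of the translation data, and $\alpha, \beta$ fitted constants; under this shape, performance grows as a power law in data size. These symbols are assumptions for illustration only, not the paper's notation.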
To evaluate x-LLaMA, the researchers compared it against Alpaca-7B (a LLaMA model tuned with English instructions), Parrot-7B (tuned with human-annotated translation data), and Bayling-7B (tuned with human interactive translations). They find that x-LLaMA outperforms Alpaca-7B by an average of 42.50% across the six non-English languages, and that x-LLaMA's accuracy on non-English tasks matches Alpaca-7B's accuracy on English tasks.
These results demonstrate that cross-lingual instruction tuning is an effective approach. The researchers' method and findings highlight the potential for developing stronger LLMs for non-English languages.
Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don't forget to join our 28k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Arshad is an intern at MarktechPost. He is currently pursuing his Int. MSc Physics from the Indian Institute of Technology Kharagpur. Understanding things at a fundamental level leads to new discoveries, which lead to advancements in technology. He is passionate about understanding nature at a fundamental level with the help of tools like mathematical models, ML models, and AI.