Researchers from the University of Oregon and Adobe Introduce CulturaX: A Multilingual Dataset with 6.3T Tokens in 167 Languages Tailored for Large Language Model (LLM) Development

Large language models (LLMs) have profoundly impacted NLP research and applications by dramatically improving state-of-the-art performance across a wide range of tasks and revealing new emergent abilities. Three architectures have been studied: encoder-only models, which encode input text into representation vectors; decoder-only models, which generate text; and encoder-decoder models, which perform sequence-to-sequence generation. The exponential growth in model sizes and training datasets, both of which the scaling laws require for optimal performance, has been the primary force behind the remarkable capabilities of LLMs. For instance, while the BERT model contained only a few hundred million parameters, more recent GPT-based models contain hundreds of billions of parameters.

Massive model sizes and large training datasets are the primary factors in advancing LLMs with remarkable learning capabilities. As NLP has advanced, LLMs have increasingly been made available to the public to encourage further study and practical applications. However, the training datasets for these LLMs are typically only partially released, especially for the most recent state-of-the-art models. Creating high-quality training data for LLMs requires extensive data cleaning and deduplication. As a result, the lack of transparency around training data has hampered efforts to replicate findings and to advance research on hallucination and bias in LLMs. These difficulties are compounded in multilingual settings, where multilingual text collections are typically insufficiently collected and cleaned. Consequently, there has been no good open-source dataset for training LLMs across languages.

To address this problem, researchers at the University of Oregon and Adobe Research developed CulturaX, a large multilingual dataset with 6.3 trillion tokens in 167 languages. To ensure the best quality for model training, the dataset goes through a rigorous pipeline with multiple stages of cleaning and deduplication. These stages include language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication.

CulturaX undergoes thorough document-level cleaning and deduplication to ensure high-quality data for training LLMs across languages. The data cleaning procedure uses a complete pipeline to remove inaccurate information. This means eliminating noise such as incorrect language identification, toxic data, and non-linguistic material.
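To make the cleaning stages more concrete, here is a minimal Python sketch of what document-level filtering of this kind could look like. It is not the CulturaX implementation: the blocklist, the thresholds, the `Document` fields, and the choice of metrics (word count and symbol ratio) are all hypothetical, and the language label is assumed to come from an upstream language identifier.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# All names and thresholds below are hypothetical, chosen for illustration only.
BLOCKED_DOMAINS = {"spam.example", "adult.example"}
MIN_LANG_CONFIDENCE = 0.5   # drop documents with uncertain language labels
MIN_WORDS = 5               # drop very short documents
MAX_SYMBOL_RATIO = 0.3      # drop documents dominated by non-linguistic symbols


@dataclass
class Document:
    url: str
    text: str
    lang: str                # label assigned by an upstream language identifier
    lang_confidence: float   # confidence score for that label


def url_allowed(doc):
    """URL-based filtering: reject blocked domains and their subdomains."""
    host = urlparse(doc.url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def language_ok(doc, expected_lang):
    """Language filtering: keep confident matches for the expected language."""
    return doc.lang == expected_lang and doc.lang_confidence >= MIN_LANG_CONFIDENCE


def symbol_ratio(text):
    """Fraction of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def metrics_ok(doc):
    """Metric-based cleaning: simple length and symbol-ratio checks."""
    return (len(doc.text.split()) >= MIN_WORDS
            and symbol_ratio(doc.text) <= MAX_SYMBOL_RATIO)


def clean(docs, expected_lang):
    """Apply the three filters in sequence and return the surviving documents."""
    return [d for d in docs
            if url_allowed(d) and language_ok(d, expected_lang) and metrics_ok(d)]
```

Calling `clean(docs, "en")` on a list of `Document` objects returns only those that pass all three checks. A real pipeline would use many more metrics and per-language thresholds; the point here is only the shape of the filtering.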

Key Features

  • CulturaX is the largest open-source, multilingual dataset to date that has been thoroughly cleaned and deduplicated for use in LLM and NLP applications.
  • CulturaX provides a large, open-source, multilingual dataset with high-quality, immediately usable data for training LLMs, solving many problems with current datasets.
  • While multilingual open-source datasets with text in various languages exist, such as mC4, their quality and scale do not meet the requirements for efficiently training LLMs, especially generative models such as GPT. For instance, neither mC4 nor OSCAR provides document-level fuzzy deduplication. Another drawback is that mC4 relies on cld3, which leads to inferior language identification. In addition, CC100 does not contain data past 2018, and BigScience ROOTS only provides a sample of the data for 46 languages.
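For readers unfamiliar with the term, document-level fuzzy deduplication means removing documents that are nearly, not just exactly, identical. One common technique for this is MinHash, sketched below in plain Python. The article does not say which method CulturaX uses, so treat the technique, the signature length, the shingle size, and the similarity threshold as assumptions made for illustration.

```python
import hashlib

# Hypothetical parameters, chosen for illustration only.
NUM_HASHES = 64     # length of each MinHash signature
SHINGLE_SIZE = 3    # documents are compared as sets of word 3-grams
THRESHOLD = 0.7     # estimated Jaccard similarity at or above this is a duplicate


def shingles(text, n=SHINGLE_SIZE):
    """Split text into a set of lowercased word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """Signature: for each seed, the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES)]


def similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts, threshold=THRESHOLD):
    """Keep each text only if it is not too similar to one already kept."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash(text)
        if all(similarity(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept
```

This version compares every new document against every kept one, which is only practical for small collections. At web scale, signatures are usually grouped with locality-sensitive hashing so that only likely duplicates are compared.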

The full public release of CulturaX on HuggingFace will help further the study of multilingual LLMs and their applications. Check it out here: https://huggingface.co/datasets/uonlp/CulturaX

CulturaX is a new multilingual dataset with text data for 167 languages. A thorough workflow cleans and deduplicates the dataset, resulting in 6.3 trillion tokens. As a large, high-quality dataset, CulturaX can readily be used to train effective LLMs in various languages. The data is freely available to the public, and the researchers hope it will encourage further studies and practical applications of multilingual LLMs.

Check out the Paper and Dataset. All credit for this research goes to the researchers on this project.


Dhanshree Shenwai is a Computer Science Engineer with experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world that make everyone’s life easier.
