Salesforce Introduces XGen-7B: A New 7B LLM Trained on up to 8K Sequence Length for 1.5T Tokens

With recent technological breakthroughs in artificial intelligence, Large Language Models (LLMs) have become increasingly prevalent. Over the past few years, researchers have made rapid advances in solving several complex language-related tasks by training these models on vast amounts of data so that they can comprehend intricate language patterns, generate coherent responses, and more. One area of research that has particularly gained the interest of researchers and developers is the application of LLMs to long-form content that requires incorporating broader context. Examples of such tasks range from relatively simple ones like text summarization and code generation to more complex problems like protein structure prediction and information retrieval. Long textual sequences contain information in diverse forms, such as paragraphs, tables, and images; thus, LLMs need to be trained to process and understand such elements. Furthermore, by effectively accounting for long-distance structural dependencies, LLMs can identify the connections between different parts of the text and extract the most relevant information. Exposure to a broader range of information therefore allows LLMs to provide more accurate and contextually relevant answers to user queries.

Yet, despite the many potential use cases, most available open-source LLMs, ranging from Meta’s LLaMA to MosaicML’s MPT models, have been trained on sequences of at most 2K tokens. This limitation presents a significant challenge for modeling longer sequences. Moreover, previous research on model scaling has shown that, for a fixed computational budget, smaller models trained on more tokens outperform larger models. Motivated by this problem and by recent advances, Salesforce Research introduced XGen-7B, a series of 7B LLMs trained on 8K sequence length for 1.5 trillion tokens. The series includes XGen-7B-4K-Base (supporting 4K sequence length), XGen-7B-8K-Base (supporting 8K sequence length), and XGen-7B-8K-Inst, which has been fine-tuned on public-domain instructional data (released for research purposes only). A striking characteristic of these LLMs is that, on standard NLP benchmarks, XGen achieves comparable or better results than other state-of-the-art LLMs of similar size, such as MPT, Falcon, and LLaMA.
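For readers who want to try the released checkpoints, the sketch below shows how one of them could be loaded and prompted through Hugging Face Transformers. The repository name Salesforce/xgen-7b-8k-base and the trust_remote_code flag (XGen ships a custom tokenizer) are assumptions based on how the checkpoints appeared at release; check the official model cards before running.

```python
# Minimal sketch: loading and prompting the XGen-7B 8K base checkpoint via
# Hugging Face Transformers. The repo name and trust_remote_code requirement
# are assumptions; verify against the official model card.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "Salesforce/xgen-7b-8k-base"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# With an 8K context window, prompts of up to roughly 8,000 tokens fit in a single pass.
prompt = "Summarize the key idea behind training language models on longer sequences:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```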

The XGen-7B models in this study were trained using Salesforce’s proprietary library JaxFormer, which enables efficient training of LLMs with data and model parallelism specifically optimized for TPU-v4 hardware. The training process followed the guidelines of LLaMA, augmented with two additional investigations. The first exploration focused on understanding “loss spikes,” where the loss suddenly and temporarily increases during training without a clear underlying cause. Although the root cause of these spikes remains unknown, the researchers identified factors such as “sequential over parallel circuits,” “swish-GLU over GeLU,” and “RMS-Norm over Layer-norm” as potential contributors to training instability. The second aspect addressed was sequence length. Since training with longer sequences incurs significantly higher computational costs due to the quadratic complexity of self-attention, a staged training approach was adopted: the first 800B tokens were trained with a sequence length of 2k tokens, followed by 400B tokens at 4k, and finally 300B tokens at 8k.
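The staged schedule is easy to sanity-check: the three stages add up to the advertised 1.5T tokens, and because self-attention cost grows quadratically with sequence length, keeping most of the token budget at the short end keeps training cheaper. The snippet below simply restates the schedule from the paragraph above; the exact sequence lengths (powers of two) and the relative-cost column are illustrative back-of-the-envelope assumptions, not reported measurements.

```python
# Back-of-the-envelope view of XGen-7B's staged training schedule.
# Token counts come from the article; sequence lengths are assumed to be exact
# powers of two, and "attention cost per token" assumes the O(L^2) per-sequence
# self-attention term dominates, which is a simplification.
stages = [
    {"seq_len": 2048, "tokens": 800e9},
    {"seq_len": 4096, "tokens": 400e9},
    {"seq_len": 8192, "tokens": 300e9},
]

total_tokens = sum(s["tokens"] for s in stages)
print(f"Total tokens: {total_tokens / 1e12:.1f}T")  # -> 1.5T, matching the headline figure

base = stages[0]["seq_len"]
for s in stages:
    # Per token, attention work scales roughly linearly with L (L^2 work spread over L tokens).
    rel_cost_per_token = s["seq_len"] / base
    print(f"{s['seq_len']:>5}-token stage: {s['tokens'] / 1e9:.0f}B tokens, "
          f"~{rel_cost_per_token:.0f}x attention cost per token vs. the 2k stage")
```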


To evaluate the XGen-7B 8k model’s ability to comprehend longer contexts, the researchers conducted evaluations on three primary tasks: long-form dialogue generation, text summarization, and question-answering. Given the difficulty of these tasks, they used the instruction-tuned model for their evaluations. For long-form dialogue generation, three tasks were used for assessment: AMI meeting summarization, ForeverDreaming, and TVMegaSite screenplay summarization. Across all metrics, the XGen-7B-Inst model achieved the highest scores compared with several other instruction-tuned models, demonstrating its superior performance.

For long-form question-answering, the researchers used ChatGPT to generate questions from Wikipedia documents covering diverse topics such as Physics, Engineering, History, and Entertainment, along with their corresponding summaries. The LLM-generated answers, which were 256 tokens long, were evaluated with GPT-4 based on their structure, organization, and relevance to the question and source document. In this setting, the XGen-7B-8K-Inst model outperformed the baseline models, which are limited to 2k tokens, again showcasing superior performance. For text summarization, the researchers used two datasets from different domains, namely meeting conversations and government reports, to evaluate the XGen-7B model. The results showed that XGen-7B significantly outperformed other baseline models on these tasks, indicating strong performance in text summarization as well.
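The long-form question-answering evaluation follows an LLM-as-judge pattern: GPT-4 scores each generated answer against the question and source document. The snippet below is a minimal sketch of such a judging loop using the OpenAI Python client; the rubric wording, the 1-10 scale, and the client usage are illustrative assumptions, not the researchers’ actual evaluation script.

```python
# Minimal LLM-as-judge sketch in the spirit of the evaluation described above.
# The rubric, scoring scale, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_answer(question: str, source_doc: str, answer: str) -> str:
    """Ask GPT-4 to rate an answer for structure, organization, and relevance."""
    rubric = (
        "Rate the following answer on a 1-10 scale for structure, organization, "
        "and relevance to the question and the source document. "
        "Reply with the score followed by a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a careful evaluator of long-form answers."},
            {"role": "user", "content": f"{rubric}\n\nQuestion: {question}\n\n"
                                        f"Source document: {source_doc}\n\nAnswer: {answer}"},
        ],
    )
    return response.choices[0].message.content

# Example usage (placeholders for the actual evaluation data):
# print(judge_answer("What causes aurorae?", wikipedia_excerpt, model_answer))
```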

The evaluations demonstrated that the XGen-7B model excels at understanding longer contexts across various tasks, including long-form dialogue generation, question-answering, and text summarization. Its performance surpassed that of other instruction-tuned and baseline models, showcasing its effectiveness in comprehending and generating coherent responses over extensive text contexts. Nevertheless, despite its efficacy, the researchers acknowledge a limitation of the XGen model: it is not exempt from biases and can generate toxic responses, a characteristic it shares with many other AI models. Salesforce Research has also open-sourced its code to allow the community to explore its work.

Check out the SF Blog and GitHub Link. Don’t forget to join our 25k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com




Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing, and Web Development. She enjoys learning more about the technical field by participating in several challenges.


