MosaicML is a generative AI company that provides AI deployment and scalability solutions. Its latest large language model (LLM), MPT-30B, is making waves across the AI community.
MosaicML’s LLM journey began with the release of MPT-7B (Mosaic Pretrained Transformer) in May 2023, which came with three variants:
- MPT-7B-StoryWriter-65k+ (for long-form story generation)
- MPT-7B-Instruct (for short-form instruction following)
- MPT-7B-Chat (for dialogue generation)
The models saw massive success in the ML community due to their open-source nature, commercial usability, and exceptional ability to handle long context windows.
Most significantly, the models were on par with, and in some cases outperformed, other comparable models (LLaMA-7B, StableLM-7B, etc.). By June, the MPT-7B series had been downloaded over 3 million times. On June 22nd, MosaicML released MPT-30B, which raised the bar even further for open-source foundation models.
MPT-30B: A Powerful LLM That Exceeds GPT-3
MPT-30B is an open-source, commercially licensed, decoder-only LLM that is more powerful than GPT-3-175B despite having only 17% of GPT-3’s parameters, i.e., 30B. It outperforms GPT-3 on several tasks. Here’s a comparison between MPT-30B and GPT-3.
MPT-30B builds upon the previous MPT-7B model. It is computationally efficient to train compared with models of similar size. For instance, LLaMA-30B used roughly 1.44 times more FLOPs than MPT-30B, while Falcon-40B had a 1.27 times higher FLOPs budget. Here’s an illustration of MPT-30B’s improvement over its predecessor on various tasks.
Some special features of MPT-30B are as follows:
8k Token Context Window
The context window of an LLM refers to the range of tokens the model can consider before generating its output. MPT-30B had a context window of 8,000 tokens at training time. It was first trained on 1 trillion tokens in 2k-token sequences, then on an additional 50B tokens in 8k-token sequences (roughly 6,000 words).
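In practical terms, anything beyond the context window is invisible to the model: a prompt longer than the window must be truncated before inference. A minimal illustration, with integer token IDs standing in for real tokenizer output:

```python
def fit_to_context(token_ids, context_window=8000):
    """Keep only the most recent tokens that fit in the model's
    context window; anything earlier cannot influence the output."""
    return token_ids[-context_window:]

prompt = list(range(10_000))      # a hypothetical 10k-token prompt
visible = fit_to_context(prompt)  # only the last 8,000 tokens survive
```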
ALiBi Support
To explain this feature, let’s consider a question:
How can MPT-30B understand and make predictions for sequences longer than those it was trained on?
MPT-30B uses the Attention with Linear Biases (ALiBi) technique to handle longer sequences and extend the context window beyond 8k tokens during fine-tuning or inference.
Instead of computing positional embeddings, in which a vector is assigned to each position in the sequence, ALiBi adds a distance-based penalty to the attention scores between key and query tokens: when the key and query are close together, the penalty is small, and it grows as they move apart. As a result, the underlying transformer architecture can extrapolate to long-form inputs.
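The mechanism can be sketched in a few lines of NumPy. The head-slope formula below follows the published ALiBi recipe for a head count that is a power of two; this is an illustrative sketch, not MosaicML’s actual implementation:

```python
import numpy as np

def alibi_bias(n_tokens: int, n_heads: int) -> np.ndarray:
    """Build the ALiBi bias tensor: a per-head linear penalty that
    grows with the distance between query and key positions."""
    # Head-specific slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(n_tokens)
    # distance[i, j] = j - i (negative for past keys, positive for future)
    distance = pos[None, :] - pos[:, None]
    # Causal mask: future tokens get -inf so softmax zeroes them out
    distance = np.where(distance > 0, -np.inf, distance).astype(float)
    # bias[h, i, j] = -slope_h * (i - j): nearby keys are penalized less
    return slopes[:, None, None] * distance

# The bias is added to raw attention scores before the softmax, so no
# learned positional embeddings are needed and the model can extrapolate
# to sequences longer than those seen during training.
bias = alibi_bias(n_tokens=5, n_heads=8)
```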
Efficient Inference & Training Performance via FlashAttention
Attention, i.e., focusing on relevant parts of the input sequence, is a critical component of transformers, but it can be slow and memory-intensive, especially when processing long text sequences.
MPT-30B addresses this problem with FlashAttention, an approach proposed by researchers at Stanford University. Using a technique called tiling, FlashAttention reduces the number of times the model must read from or write to GPU memory, speeding up processing. The model pairs the state-of-the-art FlashAttention technique with NVIDIA’s FasterTransformer optimization library for efficient training and inference.
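The tiling idea can be illustrated with a simplified NumPy sketch: keys and values are processed in blocks while only running softmax statistics are kept, so the full n×n score matrix is never materialized at once. This is a toy single-head version for illustration, not the real fused GPU kernel:

```python
import numpy as np

def naive_attention(q, k, v):
    """Standard attention: materializes the full n x n score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=4):
    """FlashAttention-style tiling: visit keys/values block by block,
    keeping only running softmax statistics per query row."""
    n, d = q.shape
    out = np.zeros_like(v, dtype=np.float64)
    row_max = np.full(n, -np.inf)   # running max of scores per query
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)               # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=-1))
        scale = np.exp(row_max - new_max)       # rescale earlier partials
        p = np.exp(s - new_max[:, None])
        out = out * scale[:, None] + p @ vb
        row_sum = row_sum * scale + p.sum(axis=-1)
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 8)) for _ in range(3))
```

Because the rescaling keeps the running statistics exact, the tiled result matches the naive computation while touching memory one block at a time.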
Ease of Training & Deployment
Developers can train MPT-30B from scratch or use MosaicML’s checkpoints for quicker deployment. It can also be fine-tuned on a specific dataset for domain-specific use cases.
The model’s size was chosen to enable effortless deployment on a single GPU, specifically 1×A100-80GB in 16-bit precision or 1×A100-40GB in 8-bit precision. In other words, the model was designed to fit within the memory limits of these GPUs.
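The arithmetic behind these GPU choices is straightforward: weights alone take about 2 bytes per parameter at 16-bit precision, and 8-bit quantization halves that. A back-of-envelope estimate (weights only, ignoring activations and KV-cache overhead, which is why the GPUs need headroom beyond these numbers):

```python
def model_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Rough weight-memory footprint in GB, ignoring runtime overhead."""
    return n_params * bytes_per_param / 1e9

fp16_gb = model_memory_gb(30e9, 2)  # ~60 GB -> fits a single A100-80GB
int8_gb = model_memory_gb(30e9, 1)  # ~30 GB -> fits a single A100-40GB
```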
Coding Capabilities
MPT-30B provides exceptional coding capabilities as well. HumanEval is a dataset released by OpenAI that contains 164 handcrafted programming problems. On the HumanEval dataset, the model surpasses purpose-built code LLMs, such as the StarCoder series.
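HumanEval scores models by functional correctness: a generated completion counts as correct only if it passes the problem’s unit tests. A simplified sketch of that evaluation loop (the toy problem below is illustrative, not taken from the actual dataset):

```python
def evaluate_candidate(candidate_src: str, test_src: str) -> bool:
    """HumanEval-style functional check: a candidate solution passes
    only if the problem's unit tests run without raising."""
    env: dict = {}
    try:
        exec(candidate_src, env)  # define the model's candidate function
        exec(test_src, env)       # run the assertions against it
        return True
    except Exception:
        return False

# A toy problem in the spirit of HumanEval's handcrafted tasks:
candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
```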
Fine-Tuned Variants: MPT-30B-Instruct & MPT-30B-Chat
MPT-30B-Instruct
LLMs are primarily used for instruction tasks such as question answering, text summarization, and language translation. MPT-30B-Instruct is a commercially usable (CC-By-SA-3.0 licensed) variant of MPT-30B, fine-tuned specifically for instruction following. The following datasets were used for fine-tuning:
- FLAN
- P3
- Alpaca
- Dolly-15k
The Dolly dataset was further augmented with Anthropic’s Helpful and Harmless dataset for instruction fine-tuning. Moreover, a diverse range of datasets was used for data augmentation, as follows:
- CompetitionMath
- GradeSchoolMath
- DialogSum
- DuoRC
- QASPER
- QuALITY
- SummScreen
- Spider
MPT-30B-Chat
MPT-30B-Chat is a fine-tuned version of MPT-30B for dialogue generation. It’s a research artifact released under the CC-By-NC-SA-4.0 license, allowing only non-commercial use. The model was fine-tuned using various language datasets, including:
- Airoboros/GPT4-1.2
- Baize
- Camel
- GPTeacher
- Guanaco
- LongConversations
- ShareGPT
- WizardLM
LLMs account for a huge share of the multi-billion-dollar generative AI market, which has grown tremendously in the short time since ChatGPT revolutionized the landscape last year. The MPT family is a foundational part of this revolution. In the near future, we can expect to see commercially available open-source models that are even more powerful and efficient than the MPT family.
For the latest AI news, visit unite.ai.