OLMo: Enhancing the Science of Language Models

The development and progress of language models over the past few years has made their presence felt almost everywhere, not only in NLP research but also in commercial offerings and real-world applications. However, the surge in commercial demand for language models has, to a certain extent, hindered the growth of the community. This is because a majority of state-of-the-art and capable models are gated behind proprietary interfaces, making it impossible for the research community to access vital details of their training architecture, data, and development processes. It is now undeniable that these training and structural details are crucial for research studies, including analysis of their potential risks and biases, which creates a need for the research community to have access to a truly open and powerful language model.

To meet this need, developers have created OLMo, a state-of-the-art, truly open language model framework. This framework allows researchers to use OLMo to build and study language models. Unlike most state-of-the-art language models, which have only released inference code and model weights, the OLMo framework is truly open source, with publicly accessible evaluation code, training methods, and training data. OLMo's primary aim is to empower and strengthen the open research community and the continuous development of language models.

In this article, we will discuss the OLMo framework in detail, examining its architecture, methodology, and performance in comparison with current state-of-the-art frameworks. So, let's get started.

OLMo: Enhancing the Science of Language Models

Language models have arguably been the hottest trend of the past few years, not only within the AI and ML community but across the tech industry as a whole, thanks to their remarkable ability to perform real-world tasks with human-like performance. ChatGPT is a prime example of the potential language models hold, with major players in the tech industry exploring ways to integrate language models with their products.

NLP, or Natural Language Processing, is one of the fields that has employed language models most extensively over the past few years. However, ever since the industry began using human annotation for alignment and large-scale pre-training, language models have seen a rapid rise in commercial viability. As a result, a majority of state-of-the-art language and NLP frameworks are restricted behind proprietary interfaces, and the development community has no access to vital details.

To ensure the continued progress of language models, OLMo, a state-of-the-art, truly open language model, offers developers a framework to build, study, and advance the development of language models. It also provides researchers with access to its training and evaluation code, training methodology, training data, training logs, and intermediate model checkpoints. Existing state-of-the-art models offer varying degrees of openness, whereas OLMo releases the whole framework, from training to data to evaluation tools, while narrowing the performance gap compared to state-of-the-art models such as LLaMA2.

For modeling and training, the OLMo framework includes the training code, full model weights, ablations, training logs, and training metrics, in the form of inference code and Weights & Biases logs. For evaluation and dataset construction, the OLMo framework includes the full training data used by AI2's Dolma and WIMBD projects, together with the code that produces that training data. For evaluation purposes, the OLMo framework includes AI2's Catwalk framework for downstream evaluation and the Paloma benchmark for perplexity-based evaluation.

OLMo: Model and Architecture

The OLMo model adopts the decoder-only transformer architecture introduced by Vaswani et al., and delivers two models with 1 billion and 7 billion parameters respectively, with a 65-billion-parameter model currently under development.

The OLMo architecture incorporates several improvements over the vanilla transformer, following recent state-of-the-art large language models such as OpenLM, Falcon, LLaMA, and PaLM. The following figure compares the 7-billion-parameter OLMo model against recent LLMs with a roughly equal number of parameters.

The OLMo framework selects its hyperparameters by optimizing the model for training throughput on the available hardware while minimizing the risk of slow divergence and loss spikes. The primary changes that distinguish the OLMo framework from the vanilla transformer architecture are as follows:

No Biases

Unlike Falcon, PaLM, LLaMA, and other language models, the OLMo framework does not include any bias terms in its architecture, in order to improve training stability.

Non-Parametric Layer Norm

The OLMo framework implements the non-parametric formulation of layer norm in its architecture. A non-parametric layer norm applies no affine transformation within the norm, i.e., no adaptive gain or bias. Non-parametric layer norm is not only considered the safest option but is also faster than parametric layer norms.
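
The following minimal PyTorch sketch illustrates both of the above choices, bias-free linear projections and a non-parametric layer norm; the layer sizes are illustrative, not the published OLMo configuration.

```python
import torch
import torch.nn as nn

hidden_size = 4096  # illustrative size, not the published OLMo configuration

# Non-parametric layer norm: elementwise_affine=False removes the learnable
# gain and bias, so the norm applies no affine transformation at all.
norm = nn.LayerNorm(hidden_size, elementwise_affine=False)

# Bias-free projection, matching the "no biases" design choice above.
proj = nn.Linear(hidden_size, hidden_size, bias=False)

x = torch.randn(2, 16, hidden_size)
print(proj(norm(x)).shape)  # torch.Size([2, 16, 4096])
```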

SwiGLU Activation Function

Like PaLM, LLaMA, and a majority of recent language models, the OLMo framework uses the SwiGLU activation function in its architecture instead of the ReLU activation function, and increases the hidden activation size to the closest multiple of 128 to improve throughput.
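
A minimal sketch of a SwiGLU feed-forward block is shown below; the `round_to_multiple` helper and the layer sizes are illustrative assumptions, included only to show the gating mechanism and the multiple-of-128 rounding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def round_to_multiple(x: int, multiple: int = 128) -> int:
    """Round up to the closest multiple (e.g., 128) to improve throughput."""
    return ((x + multiple - 1) // multiple) * multiple

class SwiGLU(nn.Module):
    """SiLU-gated feed-forward block, used in place of a plain ReLU MLP."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        d_hidden = round_to_multiple(d_hidden)
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLU(d_model=4096, d_hidden=11008)  # illustrative sizes; 11008 is already a multiple of 128
```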

RoPE or Rotary Positional Embeddings

Following the LLaMA and PaLM models, the OLMo models replace absolute positional embeddings with RoPE, or rotary positional embeddings.
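
Below is a minimal sketch of rotary positional embeddings in the common rotate-half formulation; details such as the dimension-pairing convention vary between implementations, so treat this as an illustration rather than OLMo's exact code.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to a (batch, seq, heads, head_dim) tensor."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of dimensions.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each pair by a position-dependent angle so attention scores
    # depend on relative positions.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 128, 8, 64)
print(apply_rope(q).shape)  # torch.Size([1, 128, 8, 64])
```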

Pre-Training with Dolma

Although the development community now has broader access to model parameters, the doors to pre-training datasets largely remain shut, because pre-training data is released neither alongside closed models nor alongside most open models. Moreover, the technical documentation covering such data often lacks the details required to fully understand and replicate the model. This roadblock makes it difficult to carry forward certain threads of language model research, including understanding how the training data affects the capabilities and limitations of the model. The OLMo framework built and released its pre-training dataset, Dolma, to facilitate open research on language model pre-training. The Dolma dataset is a diverse, multi-source collection of over 3 trillion tokens across 5 billion documents, collected from 7 different sources that are commonly used by powerful large-scale LLMs for pre-training and are accessible to the general public. The composition of the Dolma dataset is summarized in the following table.

The Dolma dataset is built using a pipeline of six components: language filtering, quality filtering, content filtering, deduplication, multi-source mixing, and tokenization. OLMo has also released the Dolma report, which provides more insight into the design principles and construction details along with a more detailed content summary. The project also open sources its high-performance data curation tools to enable easy and fast curation of pre-training data corpora. Evaluation of the model follows a two-stage strategy, starting with online evaluation for decision-making during model training and ending with a final offline evaluation that aggregates results from model checkpoints. For offline evaluation, OLMo uses Catwalk, a publicly available evaluation tool that supports a broad diversity of datasets and task formats. The framework uses Catwalk for downstream evaluation as well as for intrinsic language modeling evaluation on the new perplexity benchmark, Paloma. OLMo is then compared against several public models using a fixed evaluation pipeline, for both downstream and perplexity evaluation.
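
To make the order of the six curation stages concrete, here is a toy, heavily simplified sketch; the filter heuristics below are stand-ins invented for illustration, and the real Dolma toolkit provides its own high-performance tooling for each stage.

```python
from typing import Iterable, List

def curate(documents: Iterable[str]) -> List[List[str]]:
    """Toy pipeline mirroring the six Dolma stages with stand-in heuristics."""
    docs = [d for d in documents if d.isascii()]            # 1. language filtering (stand-in check)
    docs = [d for d in docs if len(d.split()) >= 5]         # 2. quality filtering (stand-in threshold)
    docs = [d for d in docs if "spam" not in d.lower()]     # 3. content filtering (stand-in rule)
    docs = list(dict.fromkeys(docs))                        # 4. deduplication (exact match only)
    docs = sorted(docs, key=len)                            # 5. multi-source mixing (stand-in ordering)
    return [d.split() for d in docs]                        # 6. tokenization (whitespace stand-in)

sample = ["A short sample document with enough words to pass the filter.",
          "A short sample document with enough words to pass the filter.",
          "too short"]
print(len(curate(sample)))  # 1 -- the duplicate and the short document are dropped
```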

OLMo runs several evaluations regarding the model architecture, initialization, optimizers, learning rate schedule, and data mixtures during the training of the model. The developers call this OLMo's "online evaluation": an in-loop evaluation run every 1000 training steps (or ∼4B training tokens) that provides an early and continuous signal on the quality of the model being trained. The setup of these evaluations follows many of the core tasks and experiment settings used for the offline evaluation. OLMo aims not only to compare OLMo-7B against other models on raw performance but also to show how it enables fuller and more controlled scientific evaluation. OLMo-7B is the largest language model with explicit decontamination for perplexity evaluation.

OLMo Training

It is important to note that the OLMo framework models are trained using the ZeRO optimizer strategy, provided by PyTorch's FSDP framework, which substantially reduces GPU memory consumption by sharding model weights across GPUs. With this, at the 7B scale, training can be done with a micro-batch size of 4096 tokens per GPU on the hardware used. The training framework for the OLMo-1B and -7B models uses a globally constant batch size of about 4M tokens (2048 instances, each with a sequence length of 2048 tokens). For the OLMo-65B model (currently in training), the developers use a batch size warmup that starts at about 2M tokens (1024 instances) and doubles every 100B tokens until it reaches about 16M tokens (8192 instances).
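
A minimal sketch of wrapping a model with PyTorch FSDP for ZeRO-style sharding is shown below; it assumes a `torchrun` launch so the process group can be initialized, and it uses a small stand-in module in place of the actual OLMo decoder.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes launch via `torchrun`, which sets the environment variables FSDP needs.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Small stand-in module; in practice this would be the full transformer.
model = nn.Sequential(nn.Linear(4096, 11008), nn.SiLU(), nn.Linear(11008, 4096)).cuda()

# FULL_SHARD shards parameters, gradients, and optimizer state across GPUs.
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
```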

To improve throughput, the developers employ mixed-precision training (Micikevicius et al., 2017) through FSDP's built-in settings and PyTorch's amp module. The latter ensures that certain operations like the softmax always run in full precision to improve stability, while all other operations run in half precision with the bfloat16 format. Under these settings, the sharded model weights and optimizer state local to each GPU are kept in full precision. The weights within each transformer block are only cast to the bfloat16 format when the full-sized parameters are materialized on each GPU during the forward and backward passes. Gradients are reduced across GPUs in full precision.
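
One way to express this kind of policy with FSDP's built-in mixed-precision settings is sketched below; this is an illustrative configuration in the spirit of the description above, not OLMo's exact training configuration.

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# Parameters are cast to bfloat16 when materialized for forward/backward passes,
# while gradient reduction across GPUs stays in full precision.
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    buffer_dtype=torch.bfloat16,
)

# Passed to the FSDP wrapper, e.g. FSDP(model, mixed_precision=bf16_policy, ...);
# numerically sensitive ops such as softmax can additionally be kept in full
# precision via torch.autocast / the amp machinery.
```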

Optimizer

The OLMo framework makes use of the AdamW optimizer with the following hyperparameters.

For all model sizes, the learning rate warms up linearly over the first 5000 steps (∼21B tokens) to its maximum value, and then decays with the inverse square root of the step number down to the specified minimum learning rate. After the warm-up period, the model clips gradients so that the total L2 norm of the parameter gradients does not exceed 1.0. The following table compares the optimizer settings at the 7B scale with those of other recent LMs that also use AdamW.
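
A sketch of an AdamW setup with linear warmup, inverse-square-root decay to a minimum learning rate, and gradient clipping at an L2 norm of 1.0 is shown below; the peak and minimum learning rates, betas, and weight decay here are illustrative placeholders, since the actual values come from the table referenced above.

```python
import math
import torch
import torch.nn as nn

warmup_steps, peak_lr, min_lr = 5000, 3e-4, 3e-5  # peak/min LR are placeholders

model = nn.Linear(16, 16)  # stand-in for the actual model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1  # illustrative values
)

def lr_factor(step: int) -> float:
    if step < warmup_steps:
        return (step + 1) / warmup_steps                 # linear warmup to the peak LR
    decayed = math.sqrt(warmup_steps / (step + 1))       # inverse-square-root decay
    return max(decayed, min_lr / peak_lr)                # floor at the minimum LR

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

# Inside the training loop, after backward() and before optimizer.step():
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```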

Training Data

Training data is prepared by tokenizing documents with the framework's BPE tokenizer, appending a special EOS token to the end of each document, and then grouping consecutive chunks of 2048 tokens to form training instances. Training instances are shuffled in exactly the same way for every training run, and the data order and exact composition of every training batch can be reconstructed from the released artifacts. All the released OLMo models have been trained to at least 2T tokens (a single epoch over their training data), and some were trained beyond that by starting a second epoch over the data with a different shuffling order. Given the small amount of data that this repeats, it should have a negligible effect.
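
The chunking step described above can be sketched as follows; the EOS token id and the example token lists are placeholders used purely to show the grouping into 2048-token instances.

```python
from typing import Iterable, List

SEQ_LEN = 2048
EOS_ID = 0  # placeholder id for the special EOS token

def build_training_instances(tokenized_docs: Iterable[List[int]]) -> List[List[int]]:
    """Append EOS to each document, concatenate, and slice into 2048-token chunks."""
    stream: List[int] = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(EOS_ID)
    # Keep only full-length chunks so every training instance has exactly SEQ_LEN tokens.
    return [stream[i:i + SEQ_LEN] for i in range(0, len(stream) - SEQ_LEN + 1, SEQ_LEN)]

instances = build_training_instances([[5, 8, 13] * 700, [21, 34] * 600])
print(len(instances), len(instances[0]))  # 1 2048
```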

Results

The checkpoint used for the evaluation of OLMo-7B is trained up to 2.46T tokens on the Dolma dataset with the learning rate decay schedule mentioned above. Further tuning this checkpoint on the Dolma dataset for 1000 steps, with the learning rate linearly decayed to 0, further improves model performance on the perplexity and end-task evaluation suites described earlier. For the final evaluation, the developers compared OLMo with other publicly available models: LLaMA-7B, LLaMA2-7B, Pythia-6.9B, Falcon-7B, and RPJ-INCITE-7B.

Downstream Evaluation

The core downstream evaluation suite is summarized in the following table.

Zero-shot evaluation is conducted using a rank classification approach in all cases. In this approach, the candidate text completions (e.g., the different multiple-choice options) are ranked by likelihood (usually normalized by some factor), and prediction accuracy is reported.

While Catwalk uses several typical likelihood normalization methods, such as per-token normalization and per-character normalization, the normalization strategy is chosen individually for each dataset and can include the answer's unconditional likelihood. More concretely, this involved no normalization for the arc and openbookqa tasks; per-token normalization for the hellaswag, piqa, and winogrande tasks; and no normalization for the boolq, copa, and sciq tasks (i.e., tasks in a formulation close to a single-token prediction task).
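
A toy sketch of rank classification under different normalization schemes is shown below; the candidate strings and their per-token log-likelihoods are made up for illustration.

```python
from typing import List

def score(token_logprobs: List[float], text: str, norm: str = "none") -> float:
    """Score a candidate completion from its token log-likelihoods."""
    total = sum(token_logprobs)
    if norm == "per_token":
        return total / len(token_logprobs)
    if norm == "per_char":
        return total / len(text)
    return total  # no normalization

# Made-up candidates with hypothetical per-token log-likelihoods.
candidates = {
    "Paris": [-0.2, -0.1],
    "The city of Lyon": [-0.3, -0.4, -0.5, -0.6],
}
prediction = max(candidates, key=lambda c: score(candidates[c], c, norm="per_token"))
print(prediction)  # "Paris" wins under per-token normalization
```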

The following figure shows the accuracy progression for the nine core end-tasks. A generally increasing trend in accuracy can be seen for all tasks except OBQA as OLMo-7B is trained on more tokens. A sharp upward tick in accuracy on many tasks between the second-to-last and last step shows the benefit of linearly reducing the learning rate to 0 over the final 1000 training steps. For intrinsic evaluations, Paloma supports a range of analyses, from inspecting performance in each domain individually up to more summarized results over combinations of domains. Results are reported at two levels of granularity: aggregate performance over 11 of the 18 sources in Paloma, as well as more fine-grained results over each of those sources individually.

Final Thoughts

In this article, we have talked about OLMo, a state-of-the-art, truly open language model that offers developers a framework to build, study, and advance the development of language models, while providing researchers access to its training and evaluation code, training methodology, training data, training logs, and intermediate model checkpoints. Existing state-of-the-art models offer varying degrees of openness, whereas the OLMo model releases the whole framework, from training to data to evaluation tools, thus narrowing the performance gap compared to state-of-the-art models like LLaMA2.
