Large Language Models: RoBERTa — A Robustly Optimized BERT Approach
Introduction
1. Dynamic masking
2. Next sentence prediction
3. Increasing batch size
4. Byte text encoding
Pretraining
RoBERTa versions
Conclusion
Resources

Study key techniques used for BERT optimisation


The appearance of the BERT model led to significant progress in NLP. Deriving its architecture from the Transformer, BERT achieves state-of-the-art results on various downstream tasks: language modeling, next sentence prediction, question answering, NER tagging, etc.

Despite BERT's excellent performance, researchers continued experimenting with its configuration in the hope of achieving even better metrics. They succeeded and presented a new model called RoBERTa (Robustly Optimized BERT Approach).

Throughout this article, we will refer to the official RoBERTa paper, which contains in-depth information about the model. In simple terms, RoBERTa consists of several independent improvements over the original BERT model; all other principles, including the architecture, stay the same. All of these advancements will be covered and explained in this article.

1. Dynamic masking

Recall from BERT's architecture that during pretraining BERT performs language modeling by trying to predict a certain percentage of masked tokens. The problem with the original implementation is that the tokens chosen for masking in a given text sequence are often the same across training epochs.

More precisely, the training dataset is duplicated 10 times, so each sequence is masked in only 10 different ways. Since BERT runs 40 training epochs, each sequence with the same masking is passed to BERT 4 times. As the researchers found, it is slightly better to use dynamic masking, meaning that the mask is generated anew each time a sequence is passed to BERT. Overall, this results in less duplicated data during training, giving the model the opportunity to work with more varied data and masking patterns.

Static masking vs Dynamic masking
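Below is a minimal PyTorch sketch of the dynamic masking idea, assuming the input is already tokenized into IDs. The function name and its arguments are illustrative, and BERT's 80/10/10 replacement rule is omitted for brevity; this is not the authors' exact implementation.

```python
import torch

def dynamic_mask(token_ids, mask_token_id, special_token_ids, mlm_prob=0.15):
    """Select a fresh random 15% of non-special tokens to mask on every call."""
    token_ids = token_ids.clone()
    # Candidate positions: everything except special tokens ([CLS], [SEP], padding, ...)
    candidates = ~torch.isin(token_ids, torch.tensor(list(special_token_ids)))
    # A new Bernoulli draw on every call means a new mask on every pass over the data
    mask = torch.bernoulli(torch.full(token_ids.shape, mlm_prob)).bool() & candidates
    labels = torch.full_like(token_ids, -100)  # -100 is ignored by the cross-entropy loss
    labels[mask] = token_ids[mask]
    token_ids[mask] = mask_token_id
    return token_ids, labels
```

Calling this function twice on the same sequence yields two different masks, which is exactly what happens across epochs with dynamic masking, whereas static masking fixes the mask once during preprocessing.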

2. Next sentence prediction

The authors of the paper conducted research to find an optimal way of modeling the next sentence prediction task. As a result, they came across several valuable insights:

  • Removing the next sentence prediction loss results in slightly better performance.
  • Passing single natural sentences as BERT input hurts the performance, compared to passing sequences consisting of several sentences. One of the most likely hypotheses explaining this phenomenon is the difficulty for the model to learn long-range dependencies when relying only on single sentences.
  • It is more beneficial to construct input sequences by sampling contiguous sentences from a single document rather than from multiple documents. Normally, sequences are always constructed from contiguous full sentences of a single document so that the total length is at most 512 tokens. The problem arises when we reach the end of a document. In this respect, the researchers compared whether it was worth stopping the sampling of sentences for such sequences or additionally sampling the first several sentences of the next document (and adding a corresponding separator token between documents). The results showed that the first option is better.

Ultimately, for the final RoBERTa implementation, the authors chose to keep the first two aspects and omit the third. Despite the observed improvement behind the third insight, the researchers did not proceed with it because otherwise it would have made comparisons with previous implementations more problematic. The reason is that reaching a document boundary and stopping there means that an input sequence contains fewer than 512 tokens. To keep a similar number of tokens across all batches, the batch size in such cases would have to be augmented. This leads to a variable batch size and more complex comparisons, which the researchers wanted to avoid.
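The sketch below illustrates the packing strategy RoBERTa keeps: contiguous sentences are accumulated into sequences of at most 512 tokens, and when the sampling crosses into the next document an extra separator token marks the boundary. The function and its arguments are illustrative names under simplified assumptions (sentences are already tokenized, and sentences longer than the limit are not handled).

```python
def pack_sequences(documents, sep_token, max_len=512):
    """Greedily pack contiguous, in-order sentences into sequences of at most max_len tokens.

    documents: list of documents, each a list of tokenized sentences (lists of tokens).
    """
    sequences, current = [], []
    for doc_index, sentences in enumerate(documents):
        for sentence in sentences:
            if current and len(current) + len(sentence) > max_len:
                sequences.append(current)      # sequence is full, start a new one
                current = []
            current.extend(sentence)
        # Crossing into the next document: mark the boundary with a separator token
        if doc_index < len(documents) - 1 and current and len(current) < max_len:
            current.append(sep_token)
    if current:
        sequences.append(current)
    return sequences
```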

3. Increasing batch size

Recent advancements in NLP showed that increasing the batch size, with an appropriate adjustment of the learning rate and the number of training steps, usually tends to improve the model's performance.

As a reminder, the BERT base model was trained on a batch size of 256 sequences for one million steps. The authors tried training BERT on batch sizes of 2K and 8K, and the latter value was chosen for training RoBERTa. The corresponding number of training steps and the learning rate became 31K and 1e-3, respectively.

It is also important to keep in mind that increasing the batch size makes parallelization easier, and a large effective batch can be obtained through a special technique called "gradient accumulation".
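Here is a minimal PyTorch sketch of gradient accumulation; the tiny linear model and the random micro-batches are stand-ins used purely for illustration. Gradients from several small micro-batches are summed before a single optimizer step, simulating one large batch.

```python
import torch

# Toy stand-ins (illustrative only): a tiny linear "model" and random micro-batches
model = torch.nn.Linear(768, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
micro_batches = [(torch.randn(8, 768), torch.randint(0, 2, (8,))) for _ in range(64)]

accumulation_steps = 32   # 32 micro-batches are accumulated into one "virtual" large batch

optimizer.zero_grad()
for step, (features, labels) in enumerate(micro_batches):
    loss = torch.nn.functional.cross_entropy(model(features), labels)
    # Scale the loss so the accumulated gradient matches a single large-batch update
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # one parameter update per accumulated batch
        optimizer.zero_grad()
```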

4. Byte text encoding

In NLP, there exist three main types of text tokenization:

  • Character-level tokenization
  • Subword-level tokenization
  • Word-level tokenization

The original BERT uses subword-level tokenization with a vocabulary size of 30K, which is learned after input preprocessing and with the help of several heuristics. RoBERTa instead uses bytes rather than Unicode characters as the base for subwords and expands the vocabulary size up to 50K without any preprocessing or input tokenization. This results in 15M and 20M additional parameters for the BERT base and BERT large models, respectively. The introduced encoding version in RoBERTa demonstrates slightly worse results than before.

Nevertheless, the vocabulary growth in RoBERTa allows it to encode almost any word or subword without resorting to the unknown token, unlike BERT. This gives RoBERTa a considerable advantage, as the model can now more fully understand complex texts containing rare words.
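The difference can be observed with the Hugging Face transformers library (an assumption of this sketch; the pretrained tokenizers are downloaded on first use): BERT's WordPiece tokenizer falls back to [UNK] for characters outside its vocabulary, while RoBERTa's byte-level BPE decomposes any input into known byte-level units.

```python
from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")

text = "supercalifragilisticexpialidocious 🤖"

# WordPiece: rare words split into subwords, unseen characters typically map to [UNK]
print(bert_tok.tokenize(text))

# Byte-level BPE: every character is representable, so no unknown token is needed
print(roberta_tok.tokenize(text))
```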

Apart from that, RoBERTa applies all four of the aspects described above with the same architecture parameters as BERT large. The total number of parameters of RoBERTa is 355M.

Pretraining

RoBERTa is pretrained on a combination of five massive datasets, resulting in a total of 160 GB of text data. In comparison, BERT large is pretrained on only 13 GB of data. Finally, the authors increase the number of training steps from 100K to 500K.

As a result, RoBERTa outperforms both BERT large and XLNet large on the most popular benchmarks.

RoBERTa versions

Analogously to BERT, the researchers developed two versions of RoBERTa. Most of the hyperparameters in the base and large versions are the same. The figure below demonstrates the main differences:

[Figure: hyperparameter differences between RoBERTa base and RoBERTa large]

The fine-tuning process in RoBERTa is analogous to BERT's.
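As a rough illustration of what fine-tuning looks like in practice, here is a minimal sketch for binary sentence classification, assuming the Hugging Face transformers library is available (the toy texts and labels are placeholders, and a real setup would loop over a full dataset).

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy labelled examples for a binary classification task
texts = ["a great movie", "a boring movie"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)   # the classification head computes a cross-entropy loss
outputs.loss.backward()
optimizer.step()
```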

Conclusion

In this article, we have examined an improved version of BERT which modifies the original training procedure by introducing the following aspects:

  • dynamic masking
  • omitting the next sentence prediction objective
  • training on longer sequences
  • increasing the vocabulary size
  • training for longer with larger batches over more data

The resulting RoBERTa model appears to be superior to its predecessors on top benchmarks. Despite its more complex configuration, RoBERTa adds only 15M additional parameters while maintaining inference speed comparable to BERT's.

All images unless otherwise noted are by the author.
