Discover how mirror augmentation generates data and boosts BERT performance on semantic similarity tasks
It is no secret that BERT-like models play a fundamental role in modern NLP applications. Despite their phenomenal performance on downstream tasks, most of these models are far from perfect on specific problems without fine-tuning: embeddings constructed from raw pretrained models often yield metrics far removed from state-of-the-art results. At the same time, fine-tuning is a heavy procedure and usually requires at least hundreds of annotated data samples for the model to better understand the domain data. In some cases, this becomes problematic when we cannot simply collect already annotated data or can obtain it only at a high price.
MirrorBERT was designed to overcome this issue. Instead of the standard fine-tuning algorithm, MirrorBERT relies on self-supervision, smartly augmenting the initial data without any external knowledge. This approach allows MirrorBERT to show comparable performance on semantic similarity problems. Moreover, by using its contrastive learning technique, MirrorBERT can transform pretrained models like BERT or RoBERTa into universal lexical encoders in less than a minute!
With the help of the official MirrorBERT paper, we will dive into its crucial details to understand how it works under the hood. The knowledge gained here is broadly applicable, as the discussed techniques can also be used for other NLP models dealing with similarity tasks.
To put it simply, MirrorBERT is the same BERT model except for several steps introduced in its learning process. Let us discuss each of them.
1. Self-duplication
As the name suggests, MirrorBERT simply duplicates the initial data.
This duplicated data is then used to construct two different embedding representations of the same strings.
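As a tiny illustrative sketch (the variable names are mine, not the paper's), self-duplication amounts to nothing more than pairing every string with an identical copy of itself:

```python
# Self-duplication: every training string is paired with an identical copy.
# The two copies will later be encoded separately, producing two views of the same text.
texts = ["the quick brown fox", "machine translation", "contrastive learning"]

pairs = [(text, text) for text in texts]  # each element is a (xᵢ, x̄ᵢ) pair before augmentation
print(pairs[0])  # ('the quick brown fox', 'the quick brown fox')
```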
2. Data augmentation
The authors of the paper propose two intuitive techniques that slightly modify dataset texts. According to them, in the overwhelming majority of cases, these text corruptions do not change their meaning.
2.1. Input augmentation
Given a pair of strings (xᵢ, x̄ᵢ), the algorithm randomly chooses one of them and applies random span masking, which substitutes a random substring of fixed length k in the text with the [MASK] token.
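Below is a minimal sketch of this idea. It is a deliberate simplification: it masks whitespace-split words rather than the model's own subword tokens and replaces the chosen span with a single [MASK] token, so treat it as an illustration rather than the exact procedure from the paper.

```python
import random

def random_span_mask(text: str, k: int = 5) -> str:
    """Replace a random span of k consecutive tokens with a single [MASK] token.

    Illustrative simplification: "tokens" here are whitespace-split words,
    while the actual method operates on the model's own tokenization.
    """
    tokens = text.split()
    if len(tokens) < k:
        return "[MASK]"
    start = random.randrange(len(tokens) - k + 1)
    return " ".join(tokens[:start] + ["[MASK]"] + tokens[start + k:])

# Randomly pick which copy of the duplicated pair gets masked, as described above.
pair = ("mirror augmentation builds two views of the same input string",) * 2
masked_idx = random.randrange(2)
augmented = tuple(random_span_mask(s) if i == masked_idx else s for i, s in enumerate(pair))
print(augmented)
```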
2.2. Feature augmentation
Random span masking operates at the sentence / phrase level. To make the model work well on word-level tasks too, another augmentation mechanism operating on shorter text fragments is needed. Feature augmentation solves this problem by using dropout.
Dropout refers to turning off a given percentage p of neurons in a network layer. It can be viewed as zeroing out the corresponding neurons of the network.
The authors of the paper propose using dropout for data augmentation: when a pair of strings (xᵢ, x̄ᵢ) is passed through a network with dropout layers, their output representations will be slightly different because the dropout layers disable a different random subset of neurons on each forward pass.
The great aspect of using dropout for feature augmentation is that dropout layers are already included in the BERT / RoBERTa architecture, meaning no additional implementation is required!
While random span masking is applied to only every second object in the dataset (one string of every pair), dropout is applied to all of them.
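Since the dropout layers are already there, the effect is easy to observe: encode the same sentence twice while the model stays in training mode, and the two embeddings come out slightly different. The sketch below uses the Hugging Face transformers library; the checkpoint name and the use of the [CLS] vector are illustrative choices, not a prescription from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-uncased" is used purely as an example checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.train()  # keep dropout active so that each forward pass drops different neurons

inputs = tokenizer("mirror augmentation in action", return_tensors="pt")

with torch.no_grad():
    # Two forward passes over the same input give slightly different [CLS] embeddings
    # because dropout disables a different random subset of neurons each time.
    emb_1 = model(**inputs).last_hidden_state[:, 0]
    emb_2 = model(**inputs).last_hidden_state[:, 0]

print(torch.nn.functional.cosine_similarity(emb_1, emb_2))  # close to, but not exactly, 1
```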
3. Contrastive learning
Contrastive learning is a machine learning technique for learning data representations in such a way that similar objects lie close to each other in the embedding space while dissimilar ones lie far apart.
One way to implement contrastive learning is to use a contrastive loss function. The one chosen for MirrorBERT is the InfoNCE loss. Let us understand how it works.
InfoNCE loss
At first sight, the formula for the InfoNCE loss might look intimidating, so let us build it up step by step.
1. The cosine similarity between two vectors measures how closely they align, taking values in the range from -1 to 1, with greater values indicating higher similarity.
2. To better understand the next steps, it is essential to know that the InfoNCE loss uses a softmax transformation with a temperature parameter T controlling the smoothness of the output distribution. That is why the similarities are divided by T.
For more details about softmax temperature, refer to this article.
3. As in the standard softmax formula, each prediction (similarity) is then transformed into its exponential form.
4. In the standard softmax formula, the numerator contains the exponent of a single class score whereas the denominator sums the exponents of the scores of all classes. With similarities, the InfoNCE loss follows the same logic:
- The numerator contains the exponential similarity of the two slightly modified identical strings (xᵢ, x̄ᵢ), which can be considered a positive example.
- The denominator consists of the sum of exponential similarities between xᵢ and all other dataset strings xⱼ, which can be seen as the set of all negative examples.
5. In the ideal scenario, we want the similarity between the identical strings (xᵢ, x̄ᵢ) to be high while the similarity of xᵢ with the other strings xⱼ stays low. If that is the case, the numerator of the expression above increases while the denominator decreases, making the whole expression larger.
Loss functions work inversely: in ideal cases they take smaller values, and in worse situations they heavily penalise the model. To make the expression compatible with this principle, let us add a negative logarithm in front of it.
6. The expression from the previous step already corresponds to the loss value for a single string xᵢ. Since the dataset consists of many strings, we need to take all of them into account. For that, let us sum this expression over all the strings.
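Putting the six steps together, with f denoting the encoder, cos the cosine similarity, T the temperature and N the number of strings, the expression takes the following form (written here in a simplified notation where the negatives are only the other strings xⱼ; in the paper, their augmented views also act as negatives):

```latex
\mathcal{L}
  = -\sum_{i=1}^{N}
    \log
    \frac{\exp\big(\cos(f(x_i), f(\bar{x}_i)) / T\big)}
         {\sum_{j \neq i} \exp\big(\cos(f(x_i), f(x_j)) / T\big)}
```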
The obtained formula is exactly the InfoNCE loss!
The InfoNCE loss tries to group similar objects close to each other while pushing dissimilar ones away from each other in the embedding space.
Triplet loss, used in SBERT, is another example of a contrastive learning loss.
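For readers who prefer code, here is a compact PyTorch sketch of such a loss. It assumes that z1 and z2 hold the embeddings of the xᵢ and x̄ᵢ views respectively and, as a simplification, uses only the opposite views of the other in-batch strings as negatives.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.04) -> torch.Tensor:
    """Simplified InfoNCE: z1[i] and z2[i] are embeddings of the pair (xᵢ, x̄ᵢ).

    For every anchor z1[i], z2[i] is the positive and all z2[j] with j != i act
    as negatives (the paper additionally uses the remaining z1 vectors).
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Cosine similarities between every anchor and every candidate, scaled by the temperature.
    logits = z1 @ z2.T / temperature                       # shape: (batch, batch)
    targets = torch.arange(z1.size(0), device=z1.device)   # positives sit on the diagonal
    # Cross-entropy with diagonal targets equals -log softmax of the positive similarity.
    return F.cross_entropy(logits, targets)

# Usage with random embeddings of batch size 8 and dimension 768:
print(info_nce_loss(torch.randn(8, 768), torch.randn(8, 768)).item())
```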
A surprising fact about MirrorBERT is that it does not require a lot of data to be fine-tuned. Moreover, this data does not have to be external, as the whole training process is self-supervised.
The researchers report that for fine-tuning lexical representations they use only the 10k most frequent words in each language. For sentence-level tasks, 10k sentences are used.
The details of MirrorBERT training are listed below (a small configuration sketch follows the list):
- The temperature is set to T = 0.04 in sentence-level tasks and to T = 0.2 in word-level tasks.
- In random span masking, k is set to 5.
- Dropout is set to p = 0.1.
- AdamW optimizer is used with a learning rate of 2e-5.
- The batch size is set to 200 (or 400 with duplicates).
- Lexical models are trained for two epochs and sentence-level models are trained for a single epoch.
- Instead of mean pooling over all output token representations, the [CLS] token representation is used.
A single MirrorBERT training epoch takes only 10–20 seconds.
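To make these settings concrete, below is a hedged sketch of a single sentence-level training epoch. It reuses the random_span_mask and info_nce_loss helpers from the earlier sketches, load_training_sentences is a hypothetical loader for the 10k training sentences, and the whole snippet illustrates the listed hyperparameters rather than reproducing the authors' code.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # any BERT / RoBERTa checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.train()  # keeps dropout (p = 0.1) active for feature augmentation

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
temperature = 0.04  # 0.2 for word-level tasks

sentences = load_training_sentences()  # hypothetical loader of the 10k training sentences

for batch in DataLoader(sentences, batch_size=200, shuffle=True):
    # Duplicate the batch and corrupt one copy with random span masking (k = 5).
    # (The paper randomly picks which copy of each pair to mask; masking the
    # second copy keeps this sketch short.)
    view_1 = list(batch)
    view_2 = [random_span_mask(s, k=5) for s in batch]
    enc_1 = tokenizer(view_1, padding=True, truncation=True, return_tensors="pt")
    enc_2 = tokenizer(view_2, padding=True, truncation=True, return_tensors="pt")
    z1 = model(**enc_1).last_hidden_state[:, 0]  # [CLS] representations
    z2 = model(**enc_2).last_hidden_state[:, 0]

    loss = info_nce_loss(z1, z2, temperature)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```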
The authors evaluated the metrics on a set of benchmarks by applying mirror fine-tuning. The results were reported on three types of tasks: lexical, sentence-level and cross-lingual. In each of them, MirrorBERT demonstrated performance comparable to other fine-tuned BERT-like models.
The results also showed that the range between 10k and 20k training examples is the most optimal for fine-tuning. The performance of the model gradually decreases with more training examples.
Mirror fine-tuning literally acts like a magic spell: instead of heavy fine-tuning procedures, the mirror framework requires much less time and no external data while staying on par with other fine-tuned models like BERT, SBERT or RoBERTa on semantic similarity tasks.
As a result, MirrorBERT can transform BERT-like pretrained models into universal encoders that capture linguistic knowledge with high efficiency.
All images unless otherwise noted are by the author.