data2vec: A Milestone in Self-Supervised Learning

Machine learning models have traditionally relied heavily on labeled data, and training on labeled data generally yields accurate results. The major downside, however, is that annotation costs rise with the scale of the training data. High annotation costs are a huge hurdle for developers, especially on large projects with substantial amounts of training data.

To tackle the annotation problem, developers came up with the concept of Self-Supervised Learning (SSL). Self-supervised learning is a machine learning process in which the model trains itself to learn one part of the input from another part of the input. A self-supervised model aims to exploit the relationships within the data instead of relying on the supervised signals of labeled data.

Besides self-supervised learning, there are several other methods and models for training machine learning models without labeled data. However, most of these methods have two major issues:

  1. They are often specialized for a single modality, such as images or text. 
  2. They require a large amount of computational power. 

These limitations are a major reason why the average human mind can learn from a single type of data far more effectively than an AI model, which relies on separate models and training data to distinguish between images, text, and speech.

To tackle the single-modality problem, Meta AI released data2vec, a first-of-its-kind, high-performance self-supervised algorithm that learns patterns from three different modalities: images, text, and speech. With data2vec, text understanding could be applied to an image segmentation problem, or it could even be deployed in a speech recognition task.

In this article, we will discuss the data2vec model in depth. We will cover the method overview, related work, architecture, and results of the model so that you have a clear understanding of the data2vec algorithm.

Data2vec Introduction: The Core Idea

Although the fundamental concept of self-supervised learning applies across modalities, the actual objectives and algorithms differ from one another because they were designed with a single modality in mind. This is why the same self-supervised learning algorithm cannot work effectively across different kinds of training data.

To overcome the challenge posed by single-modality models and algorithms, Meta AI released data2vec, an algorithm that uses the same learning method for computer vision, NLP, and speech.

The core idea behind the data2vec algorithm is to use a masked view of the input to predict latent representations of the full input data in a self-distillation setup, using a standard Transformer architecture. So, instead of predicting modality-specific targets such as words, visual tokens, or units of speech, which are local in nature, data2vec predicts latent representations that contain information from the entire input.

Why Does the AI Industry Need the Data2Vec Algorithm?

Self-supervised learning models build representations of the training data without human-annotated labels, and this is one of the major reasons behind the advances in Natural Language Processing (NLP) and computer vision. Self-supervised representations are also the reason why tasks such as speech recognition and machine translation can be tackled with fully unsupervised learning.

Until now, self-supervised learning algorithms have focused on individual modalities, which results in modality-specific designs and learning biases. This per-modality focus creates challenges across AI applications, including computer vision and NLP.

For instance, in speech processing there is no vocabulary of speech units over which a self-supervised learning task can be defined, the way words are used in NLP. Similarly, in computer vision, developers can either regress the input, learn discrete visual tokens, or learn representations invariant to data augmentation. Although these learning biases are helpful, it is unclear whether they will generalize to other modalities.

The data2vec algorithm is a major milestone in self-supervised learning because it aims to improve multiple modalities rather than just one. Moreover, the data2vec algorithm does not rely on reconstructing the input or on contrastive learning.

So the world needs data2vec because it has the potential to accelerate progress in AI, and it contributes to developing AI models that can learn about different aspects of their surroundings seamlessly. Scientists hope that data2vec will allow them to develop more adaptable AI and ML models capable of performing highly advanced tasks beyond what today’s AI models can do.

What Is the Data2Vec Algorithm?

Data2vec is a unified framework that aims to implement self-supervised machine learning across different data modalities, including images, speech, and text.

The data2vec algorithm aims to develop ML models that learn general patterns in the environment much better by keeping the learning objective uniform across modalities. The data2vec model unifies the learning algorithm, but it still learns the representations for each modality individually.

With the introduction of data2vec, Meta AI hopes to make multimodal learning more effective and much simpler.

How Does the Data2Vec Algorithm Work?

The data2vec algorithm combines the learning of latent target representations with masked prediction, although it uses multiple network layers as targets to generalize the latent representations. Specifically, the model trains an off-the-shelf Transformer network that is used in either teacher mode or student mode.

In teacher mode, the model first builds representations of the full input data that serve as targets in the learning task. In student mode, the model encodes a masked version of the input data, which is then used to predict the full data representations.

The figure above shows how the data2vec model uses the same learning process for different modalities. In the first step, the model produces representations of the input data (teacher mode). The model then regresses these representations based on a masked version of the input.

Moreover, because the data2vec algorithm uses latent representations of the input data, it can be viewed as a simplified version of modality-specific designs such as creating suitable targets by normalizing the input or learning a fixed set of visual tokens. The crucial difference between data2vec and other algorithms is that data2vec uses self-attention to make its target representations contextualized and continuous. Other self-supervised learning models, on the other hand, use a fixed set of targets that are based on a local context.

Data2vec: Model Method

The data2vec model is trained by predicting the model’s representations of the full input data given a partial view of the input. As you can see in the given figure, the dog’s face is masked, a section of the voice note is masked, and the word “with” is masked in the text.

The model first encodes a masked version of the training sample (student mode), and then encodes the unmasked version of the input to construct training targets with the same model, but parameterized as the exponential moving average of the model weights (teacher mode). The target representations encode the information present in the training sample, and in student mode the learning task is to predict these representations given a partial view of the input.

Model Architecture

The data2vec model uses a standard Transformer architecture with modality-specific encoding of the input data. For computer vision tasks, the model uses the ViT strategy to encode an image as a sequence of patches, where each patch spans 16×16 pixels and is fed through a linear transformation.

For speech recognition, the model encodes the data using a multi-layer 1-D convolutional neural network that maps 16 kHz waveforms to 50 Hz representations. To process text, the model preprocesses the data to extract sub-word units, and then embeds them in distributional space via embedding vectors.


Once the model has embedded the input data as a sequence of tokens, it masks part of these units by replacing them with a learned mask embedding token, and then feeds the sequence to the Transformer network. For computer vision, the model uses a block-wise masking strategy. For speech, the model masks spans of latent speech representations, and for language tasks, it masks tokens.
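To make the masking step concrete, here is a minimal sketch in plain Python. The function name `apply_mask` and the toy vectors are our own and purely illustrative. In the real model the mask embedding is a learned parameter, and the masked positions come from the modality-specific strategy.

```python
def apply_mask(embeddings, masked_positions, mask_embedding):
    """Return a copy of the sequence with masked positions replaced.

    embeddings: list of token vectors (lists of floats).
    masked_positions: set of indices chosen by the masking strategy.
    mask_embedding: the single vector shared by all masked positions
                    (learned in the real model, fixed here for illustration).
    """
    return [
        list(mask_embedding) if i in masked_positions else list(vec)
        for i, vec in enumerate(embeddings)
    ]


# Toy sequence of four 2-dimensional token embeddings.
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
# The student sees positions 1 and 2 replaced by the mask embedding.
student_input = apply_mask(tokens, {1, 2}, mask_embedding=[0.0, 0.0])
```

The original `tokens` list is left untouched, which mirrors the fact that the teacher still needs the unmasked input.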

Training Targets

The data2vec model aims to predict the model’s representations of the unmasked training sample based on an encoding of the masked sample that was fed to the model. The model predicts these representations only for the masked time-steps.

The model predicts contextualized representations that encode not only the particular time-step but also other information from the sample, because it uses self-attention in the Transformer network. These contextualized targets are what distinguish data2vec from existing models such as BERT, wav2vec, BEiT, SimMIM, MAE, and MaskFeat, which predict targets that lack contextual information.

Here is how the data2vec model parameterizes the teacher mode to predict the network representations that then serve as targets.

Teacher Parameterization

The data2vec model parameterizes the encoding of the unmasked training sample using an Exponential Moving Average (EMA) of the model parameters (θ), where the weights of the model in target mode (Δ) are updated as follows:

    Δ ← τ Δ + (1 − τ) θ


Moreover, the model uses a schedule for τ that linearly increases the parameter from τ0 to τe (the target value) over the first τn updates. After these updates, the model keeps the value constant until training ends. This EMA strategy updates the teacher much more frequently at the beginning of training, when the model is random. As training proceeds and good parameters have been learned, the teacher gets updated less frequently.
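The update rule and the τ schedule can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the parameters are flat lists of floats, and the names `tau_schedule` and `ema_update` are our own, not taken from the data2vec code. The example values are the speech settings quoted later in this article.

```python
def tau_schedule(step, tau_0, tau_e, tau_n):
    """Linearly increase tau from tau_0 to tau_e over the first tau_n updates,
    then keep it constant at tau_e."""
    if step >= tau_n:
        return tau_e
    return tau_0 + (tau_e - tau_0) * step / tau_n


def ema_update(teacher, student, tau):
    """Apply delta <- tau * delta + (1 - tau) * theta to every parameter."""
    return [tau * d + (1.0 - tau) * th for d, th in zip(teacher, student)]


# Toy parameters: the teacher slowly tracks the student.
teacher = [0.0, 0.0]
student = [1.0, -1.0]
for step in range(3):
    tau = tau_schedule(step, tau_0=0.999, tau_e=0.9999, tau_n=30_000)
    teacher = ema_update(teacher, student, tau)
```

A smaller τ early in training lets the teacher move quickly away from its random initialization, while a τ close to 1 later on keeps the targets stable.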

The results show that the model is more efficient and accurate when it shares the parameters of the feature encoder and positional encoder between student and teacher mode.


The training targets are constructed from the output of the top K blocks of the teacher network for the time-steps that are masked in student mode. The output of block l at time-step t is denoted a_t^l. The model applies a normalization to each block to obtain â_t^l before averaging the top K blocks

$$y_t = \frac{1}{K} \sum_{l=L-K+1}^{L} \hat{a}_t^l$$

to obtain the training target y_t for time-step t for a network with L blocks in total.
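As a rough sketch of this target construction, the snippet below normalizes the outputs of the top K blocks and averages them per time-step. It assumes a parameter-less layer normalization over the feature dimension, and the names `layer_norm` and `build_targets` are ours. It also computes a target for every time-step, whereas the model only needs the masked ones.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Parameter-less normalization of one feature vector (no learned scale or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def build_targets(block_outputs, k):
    """Average the normalized outputs of the top k blocks.

    block_outputs[l][t] is the feature vector of block l at time-step t,
    ordered from the first block to the last (top) block.
    """
    if not 1 <= k <= len(block_outputs):
        raise ValueError("k must be between 1 and the number of blocks")
    top_k = block_outputs[-k:]
    targets = []
    for t in range(len(top_k[0])):
        normed = [layer_norm(block[t]) for block in top_k]
        dim = len(normed[0])
        targets.append([sum(n[d] for n in normed) / k for d in range(dim)])
    return targets


# Three blocks, two time-steps, feature size 2.
blocks = [
    [[1.0, 3.0], [2.0, 2.5]],
    [[1.0, 3.0], [0.0, 4.0]],
    [[3.0, 1.0], [4.0, 0.0]],
]
targets_top2 = build_targets(blocks, k=2)  # averages the last two blocks
targets_top1 = build_targets(blocks, k=1)  # uses only the top block
```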

This creates the training targets that the model regresses in student mode. In initial experiments, averaging performed as well as predicting each block individually with a dedicated projection, while being much more efficient.

Moreover, normalizing the targets prevents the data2vec model from collapsing into a constant representation for all time-steps, and it prevents layers with a high norm from dominating the target features. For speech recognition, the model uses instance normalization over the current input sample without any learned parameters. This is mainly because the stride over the input data is small, so neighboring representations are highly correlated.

Moreover, the researchers found that for computer vision and NLP, parameter-less normalization is sufficient. The collapse problem can also be addressed with Variance-Invariance-Covariance regularization, but the strategy described above performs sufficiently well and does not require any additional parameters.


Given the contextualized training targets y_t, the model uses a Smooth L1 loss to regress the targets:

$$\mathcal{L}(y_t, f_t(x)) = \begin{cases} \frac{1}{2} (y_t - f_t(x))^2 / \beta & \text{if } |y_t - f_t(x)| \le \beta \\ |y_t - f_t(x)| - \frac{1}{2} \beta & \text{otherwise} \end{cases}$$

Here, β controls the transition from a squared loss to an L1 loss, depending on the size of the gap between the target y_t and the model prediction f_t(x) at time-step t. The advantage of this loss is that it is less sensitive to outliers, although the setting of β needs to be tuned.
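For readers who prefer code, here is a small sketch of the Smooth L1 loss for scalar values. The helper names are our own, and averaging over the elements in `mean_smooth_l1` is an assumption made for illustration.

```python
def smooth_l1(target, prediction, beta):
    """Squared loss for small gaps (<= beta), L1 loss for large gaps."""
    diff = abs(target - prediction)
    if diff <= beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def mean_smooth_l1(targets, predictions, beta):
    """Average the element-wise loss (the reduction is an illustrative choice)."""
    losses = [smooth_l1(y, f, beta) for y, f in zip(targets, predictions)]
    return sum(losses) / len(losses)


small_gap = smooth_l1(0.0, 1.0, beta=2.0)  # gap 1 <= beta: 0.5 * 1 / 2 = 0.25
large_gap = smooth_l1(0.0, 4.0, beta=2.0)  # gap 4 > beta: 4 - 1 = 3.0
```

Both branches give the same value when the gap equals β, so the loss is continuous at the switch-over point.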

Experimental Setup

The data2vec model is evaluated in two sizes: data2vec Base and data2vec Large. The models contain L = 12 or L = 24 Transformer blocks with hidden dimension H = 768 or H = 1024, and for numerical stability the EMA updates are done in fp32. Let’s take a detailed look at the experimental setup for each modality.

Computer Vision

The data2vec model embeds images of 224×224 pixels as patches of 16×16 pixels. Each of these patches is linearly transformed, and the resulting sequence of 196 representations is fed to the standard Transformer.
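The sequence length of 196 follows directly from the image and patch sizes, as this short check shows (the function name `num_patches` is ours):

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side


# 224 / 16 = 14 patches per side, and 14 * 14 = 196 patches in total.
sequence_length = num_patches(224, 16)
```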

The model follows BEiT in masking blocks of adjacent patches, with each block having a minimum of 16 patches and a random aspect ratio. However, instead of masking 40% of the patches as in the original BEiT model, the data2vec model masks 60% of the patches for better accuracy.

Moreover, the model applies randomly resized image crops, horizontal flipping, and color jittering. Finally, the data2vec model uses the same modified image in both teacher and student mode.

The ViT-B models are pre-trained for 800 epochs, and the data2vec model uses a batch size of 8,192 for the ViT-L model and 2,048 for the ViT-B model. The data2vec model also uses Adam and a cosine schedule with a single cycle, warming up the learning rate for 80 epochs to 0.001 for ViT-L, and for 40 epochs to 0.001 for ViT-B.

For both ViT-B and ViT-L, the data2vec model uses β = 2, K = 6, and a constant τ = 0.9998 with no schedule. The model further uses a stochastic depth rate of 0.2.

Moreover, for ViT-L, the model trains for 1,600 epochs in total: the first 800 epochs use τ = 0.9998, after which the model resets the learning rate schedule and continues for the final 800 epochs with τ = 0.9999.

For image classification, the model mean-pools the output of the last Transformer block and feeds it to a softmax-normalized classifier. The model then fine-tunes ViT-L for 50 epochs and ViT-B for 100 epochs, using Adam and a cosine schedule with a learning rate warmup.

Speech Processing

For speech processing, the data2vec model uses fairseq, a sequence-modeling toolkit used to train custom models for summarization, translation, and text generation. The model takes a 16 kHz waveform as input, which is processed by a feature encoder containing temporal convolutions with 512 channels, kernel widths (10,3,3,3,3,2,2), and strides (5,2,2,2,2,2,2).

As a result, the output frequency of the encoder is 50 Hz, with a stride of about 20 ms between samples. The receptive field comprises 400 input samples, or 25 ms of audio. The raw waveform fed to the encoder is normalized to zero mean and unit variance.
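These numbers can be derived from the kernel widths and strides alone. The sketch below uses the standard receptive-field recurrence for stacked convolutions without dilation. The function name `encoder_geometry` is ours, and padding is ignored.

```python
def encoder_geometry(kernel_widths, strides):
    """Return (receptive_field, total_stride) in input samples for a conv stack."""
    receptive_field = 1
    total_stride = 1
    for kernel, stride in zip(kernel_widths, strides):
        # Each layer widens the receptive field by (kernel - 1) steps,
        # measured in units of the stride accumulated so far.
        receptive_field += (kernel - 1) * total_stride
        total_stride *= stride
    return receptive_field, total_stride


SAMPLE_RATE_HZ = 16_000
field, hop = encoder_geometry((10, 3, 3, 3, 3, 2, 2), (5, 2, 2, 2, 2, 2, 2))

output_rate_hz = SAMPLE_RATE_HZ / hop      # 16000 / 320 = 50 Hz
hop_ms = 1000 * hop / SAMPLE_RATE_HZ       # 320 samples = 20 ms
field_ms = 1000 * field / SAMPLE_RATE_HZ   # 400 samples = 25 ms
```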

The masking strategy used by data2vec for the Base model follows the approach of Baevski et al. for self-supervised learning in speech recognition. The model samples all time-steps with probability p = 0.065 to be starting indices, and masks the subsequent ten time-steps. For a typical training sequence, this results in approximately 49% of all time-steps being masked.
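The 49% figure can be sanity-checked with a short simulation. In this sketch, which uses our own function names, we assume each span covers ten time-steps starting at the sampled index and that overlapping spans simply merge. Under those assumptions a time-step stays unmasked only if none of the ten possible start positions covering it was sampled, which gives a masked fraction of 1 − (1 − p)^10.

```python
import random


def span_mask(num_steps, p, span, rng):
    """Sample start indices with probability p and mask `span` steps from each."""
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < p:
            for t in range(start, min(start + span, num_steps)):
                mask[t] = True
    return mask


def expected_masked_fraction(p, span):
    """Probability that a time-step far from the sequence start is masked."""
    return 1.0 - (1.0 - p) ** span


rng = random.Random(0)  # fixed seed so the run is reproducible
mask = span_mask(100_000, p=0.065, span=10, rng=rng)
empirical = sum(mask) / len(mask)
analytic = expected_masked_fraction(0.065, 10)  # roughly 0.49
```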

During training, the data2vec model linearly anneals τ using τ0 = 0.999, τe = 0.9999, and τn = 30,000. The data2vec model uses the Adam optimizer with a peak learning rate of 5×10⁻⁴ for the Base model. Moreover, the Base model uses a tri-stage scheduler that warms up the learning rate linearly for the first 3% of updates, holds it for the next 90%, and then decays it linearly for the remaining 7%.
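A tri-stage schedule of this kind can be sketched as follows. The function name `tri_stage_lr` is ours, and starting the warmup at zero and decaying all the way to zero are assumptions made for illustration, since the description above only gives the stage lengths and the peak value.

```python
def tri_stage_lr(step, total_steps, peak_lr, warmup_frac, hold_frac):
    """Linear warmup, constant hold, then linear decay of the learning rate."""
    warmup_end = warmup_frac * total_steps
    hold_end = (warmup_frac + hold_frac) * total_steps
    if step < warmup_end:
        return peak_lr * step / warmup_end   # stage 1: warm up from 0
    if step < hold_end:
        return peak_lr                       # stage 2: hold at the peak
    remaining = total_steps - hold_end
    if remaining <= 0:
        return peak_lr                       # no decay stage configured
    return peak_lr * max(0.0, (total_steps - step) / remaining)  # stage 3


# Speech Base settings from the text: 3% warmup, 90% hold, 7% decay.
schedule = [
    tri_stage_lr(step, total_steps=1000, peak_lr=5e-4,
                 warmup_frac=0.03, hold_frac=0.90)
    for step in range(1001)
]
```

The total of 1,000 updates is a toy value chosen to keep the example small.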

Natural Language Processing

The data2vec model uses byte-pair encoding with 50K types to tokenize the input, and the model then learns an embedding for each type. After the data is encoded, the model applies the BERT masking strategy to 15% of uniformly selected tokens, of which 80% are replaced by a learned mask token, 10% are replaced by random vocabulary tokens, and the remaining 10% are left unchanged.
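The 15% selection and the 80/10/10 split can be illustrated with a small sketch on string tokens. The function name `bert_mask`, the toy vocabulary, and the `<mask>` placeholder are our own. A real implementation works on token ids and learned embeddings.

```python
import random


def bert_mask(tokens, vocab, mask_token, rng, select_prob=0.15):
    """Return (corrupted_tokens, selected_indices) following the BERT recipe."""
    corrupted = list(tokens)
    selected = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue                            # not selected for prediction
        selected.append(i)
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_token           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)    # 10%: random vocabulary token
        # remaining 10%: keep the original token
    return corrupted, selected


rng = random.Random(0)  # fixed seed so the run is reproducible
vocab = ["the", "dog", "plays", "with", "a", "ball"]
sentence = ["the", "dog", "plays", "with", "a", "ball"] * 50
corrupted, selected = bert_mask(sentence, vocab, "<mask>", rng)
```

The model is then trained to predict targets only at the `selected` positions.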

During pre-training, the model uses τ0 = 0.999, τe = 0.9999, τn = 100,000, K = 10, and β = 4. The model uses the Adam optimizer with a tri-stage learning rate schedule that warms up the learning rate linearly for the first 5% of updates, holds it for the next 80%, and then decays it linearly for the remaining 15%, with a peak learning rate of 2×10⁻⁴.

Moreover, the model trains on 16 GPUs with a batch size of 256 sequences, each containing about 512 tokens. For downstream tasks, the pre-trained model is fine-tuned with four different learning rates (1×10⁻⁴, 2×10⁻⁴, 3×10⁻⁴, and 4×10⁻⁴), and the one that performs best is chosen for each NLP downstream task.


Let’s take a look at how the data2vec model performs when it applies the strategies discussed above to the different modalities.

Computer Vision

To evaluate the results for computer vision, the data2vec model is pre-trained on the images from the ImageNet-1K dataset. The resulting model is fine-tuned using the labeled data of the same benchmark. As per standard practice, the model is then evaluated in terms of top-1 accuracy on the validation data.

The results distinguish between methods based on a single self-supervised model and methods that train a separate visual tokenizer on additional data or rely on other self-supervised learning models.

The table below compares the performance of the data2vec model for computer vision with other existing models, for both ViT-B and ViT-L.

The results from the above table can be summarized as follows.

  • The data2vec model outperforms prior work with both the ViT-B and ViT-L models in the single-model setting. 
  • The masked prediction setup used in data2vec to predict contextualized latent representations performs better than methods that predict local targets such as engineered image features, input pixels, or visual tokens. 
  • The data2vec model also outperforms self-distillation methods that regress the final layer of the student network while taking two different augmented versions of an image as inputs. 

Audio & Speech Processing

For speech and audio processing, the data2vec model is trained on about 960 hours of audio from the Librispeech (LS-960) dataset. The dataset contains clean speech audio from English audiobooks, and it is treated as a standard benchmark in speech and audio processing.

To analyze the model’s performance in different resource settings, the researchers fine-tuned the data2vec model for automatic speech recognition using different amounts of labeled data (from a few minutes to several hours). data2vec is compared against HuBERT and wav2vec 2.0, two of the most popular algorithms for speech and audio representation learning, both of which rely on discrete speech units.

The above table compares the performance of data2vec with other existing models in terms of word error rate for speech recognition. LM denotes the language model used for decoding. The results can be summarized as follows.

  • The data2vec model shows improvements in most labeled data setups, with the largest gain at 10 minutes of labeled data for Base models. 
  • For Large models, data2vec performs significantly better on small labeled datasets, and the performance is comparable on the resource-rich setups with 100 and 960 hours of labeled data. This is because performance generally saturates on resource-rich labeled datasets for most models. 
  • The results suggest that when the model uses rich contextualized targets, it is not essential to learn discrete units. 
  • Learning contextualized targets during training helps improve overall performance significantly. 

Moreover, to validate data2vec’s approach for audio processing, the model is also trained on the AudioSet benchmark. Although the pre-training setup for AudioSet is similar to Librispeech, the model is trained with K = 12 for over 200K updates, where each batch contains 94.5 minutes of audio.

The model then applies the DeepNorm framework and layer normalization to the targets to help stabilize training. Moreover, the model is fine-tuned on the balanced subset with a batch size of 21.3 minutes over 13K updates. The model also uses linear softmax pooling and mixup with a probability of 0.7. The model then adds a single linear projection into the 527 unique audio classes, and sets the projection learning rate to 2e-4.

Moreover, the pre-trained parameters have a learning rate of 3e-5, and the model uses masking during fine-tuning. The table below summarizes the results, and it can be seen that the data2vec model outperforms a comparable setup with the same fine-tuning and pre-training data.

Natural Language Processing

To analyze data2vec’s performance on text, the model follows the same training setup as BERT and is pre-trained on the English Wikipedia dataset for over 1M updates with a batch size of 256 sequences. The model is evaluated on the General Language Understanding Evaluation (GLUE) benchmark, which includes natural language inference (MNLI, or Multi-Genre Natural Language Inference), sentence similarity (QQP, or Quora Question Pairs; MRPC, or Microsoft Research Paraphrase Corpus; and STS-B, or Semantic Textual Similarity Benchmark), sentiment analysis (SST-2, or Stanford Sentiment Treebank), and grammaticality (CoLA).

Moreover, to fine-tune the data2vec model, the labeled data provided by each task is used, and the average accuracy on the development sets is reported over 5 fine-tuning runs. The following table summarizes the performance of the data2vec model on Natural Language Processing tasks and compares it with other models.

  • The above data shows that the data2vec model outperforms the baseline RoBERTa model, as the strategy in data2vec does not use random targets. 
  • The data2vec model is the first successful pre-trained NLP model that does not use discrete units such as characters, words, or sub-words as training targets. Instead, the data2vec framework predicts a contextualized latent representation computed over the entire unmasked text sequence. 
  • This creates a learning task in which the model must predict targets with properties specific to the current sequence, rather than representations that are generic to every text sequence in which a particular discrete unit occurs. 
  • Moreover, the set of training targets is not fixed, so the model is free to define new targets and is not tied to a closed vocabulary. 

Data2Vec: Ablation Study

Ablation is the term for removing a component from an AI or ML system. An ablation study analyzes the performance of an AI or ML model by removing certain key components, which allows researchers to understand the contribution of each component to the overall system.

Layer Averaged Targets

A major difference between data2vec and other self-supervised learning models is that data2vec uses targets based on averaging multiple layers of the teacher network. The idea comes from the fact that the top layers of the wav2vec 2.0 model do not perform as well on downstream tasks as its middle layers.

In the following experiment, the performance on all three modalities is measured by averaging K = 1, 2, …, 12 layers, where K = 1 corresponds to predicting only the top layer. For faster turnaround, data2vec trains the Base model with 12 layers in total. For speech recognition, the model is pre-trained for over 200 thousand updates on Librispeech, and then fine-tuned on a 10-hour labeled split of Libri-light. For Natural Language Processing, the average GLUE score on the validation set is reported, and for computer vision, the model is pre-trained for 300 epochs and the top-1 accuracy on the ImageNet dataset is reported.

The above figure shows that, for all modalities, targets based on multiple layers generally improve over using only the top layer (K = 1). Using all of the available layers is a good practice because neural networks build different types of features across their many layers.

Using features from multiple layers helps in boosting accuracy, and enriches the self-supervised learning process. 

Target Feature Type

The Transformer blocks in the data2vec model have several layers that can all serve as targets. To analyze how different layers affect performance, speech models are pre-trained on Librispeech using different layers as target features.

The figure below clearly indicates that the output of the feed-forward network (FFN) works best, whereas the output of the self-attention blocks does not yield a usable model.

Target Contextualization

Teacher representations in the data2vec model use self-attention over the entire input to produce contextualized targets. This is what separates data2vec from other self-supervised learning models that construct a learning task by reconstructing or predicting local parts of the input. It naturally raises the question: does the data2vec model require contextualized targets to work well?

To answer the question, the researchers construct target representations that do not have access to the entire input sample, but only to a predetermined fraction of it. The model restricts the self-attention mechanism of the teacher so that it can access only a portion of the surrounding input. After the model has been trained, it is fine-tuned with access to the full context size.
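One simple way to picture such a restriction is a band-shaped attention mask in which every position may only attend to its nearby neighbors. The sketch below is our own illustration: the function name `local_attention_mask` and the symmetric `radius` parameter are assumptions, and the actual ablation may define the context window differently.

```python
def local_attention_mask(num_steps, radius):
    """allowed[i][j] is True when position i may attend to position j.

    Each position sees itself plus `radius` neighbors on either side.
    """
    return [
        [abs(i - j) <= radius for j in range(num_steps)]
        for i in range(num_steps)
    ]


narrow = local_attention_mask(4, radius=1)  # each step sees only direct neighbors
full = local_attention_mask(4, radius=3)    # radius covers the whole sequence
```

With a radius that spans the whole sequence, the mask allows everything and the teacher recovers full self-attention.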

The figure below indicates that larger context sizes generally lead to better performance, and the best accuracy is obtained when the entire input sample is visible. This further shows that richer target representations can yield better performance.

Modality Specific Feature Extractors and Masking

The primary objective of data2vec is to design a simple learning mechanism that can work with different modalities. However, although the approach has a unified learning regime, it still uses modality-specific feature extractors and masking strategies.

This makes sense, given that the nature of the input data varies vastly between modalities. For instance, speech recognition models use a high-resolution input (such as a 16 kHz waveform) that typically has thousands of samples. The waveform is then processed by a multilayer convolutional neural network to obtain feature sequences at 50 Hz.

Structured and Contextualized Targets

The main difference between data2vec and other masked prediction models is that in data2vec the features of the training targets are contextualized. These features are built using self-attention over the entire unmasked input in teacher mode.

Some other frameworks, such as BYOL (Bootstrap Your Own Latent) and DINO, also use latent representations as data2vec does, but their primary focus is to learn transformation-invariant representations.

Final Thoughts

Recent work in AI and ML has indicated that uniform model architectures can be an effective approach to tackling multiple modalities. The data2vec model uses a single self-supervised learning approach for three modalities: speech, images, and language.

The key concept behind data2vec is to use a partial view of the input to regress contextualized information from the full input data. The approach is effective: the model performs better than prior self-supervised learning models on the ImageNet-1K dataset for both ViT-B and ViT-L single models.

Data2vec is truly a milestone in self-supervised learning, as it demonstrates that a single learning method for multiple modalities can indeed make it easier for models to learn across modalities.

