Large Language Models, GPT-2 — Language Models Are Unsupervised Multitask Learners

Acing GPT capabilities by turning it into a robust multitask zero-shot model

Towards Data Science

GPT is a well-known series of models whose latest versions currently dominate various NLP tasks. The first GPT version was a major milestone: with roughly 120M parameters, a very large model for its time, it demonstrated state-of-the-art performance on top benchmarks. From that point on, researchers tried to improve the base version.

In 2019, researchers from OpenAI officially released GPT-2. It was about 10 times larger than GPT-1, which allowed it to improve performance even further. In addition, the authors conjectured in their work that LLMs are multitask learners, meaning that they can learn to perform several tasks at the same time. This important statement made it possible to further develop LLMs within a much more efficient framework.

In this article, we will refer to the official GPT-2 paper, going through its main aspects and improvements over GPT-1, and understand a novel approach for building LLMs.

Note. This article assumes that you are already familiar with the first version of GPT. If not, check out this article.

The importance of understanding the GPT evolution

It is no secret that with the recent introduction of powerful models like ChatGPT or GPT-4, the first GPT versions no longer attract much attention and appear obsolete.

Nevertheless, the following reasons explain the motivation behind studying the GPT evolution.

  • The first GPT versions introduced language learning concepts that are still used by the most recent models. The best example is GPT-2 introducing the multitask learning technique. Thanks to this idea, modern GPT models can accurately solve a large variety of NLP tasks.
  • From the algorithmic perspective, most LLMs already use many advanced techniques, and it is becoming harder to invent new efficient methods. That is why NLP researchers focus more on scraping and feeding more high-quality data to models. This explains why there is not much difference between the internal working mechanisms of the first GPT models and those of ChatGPT (GPT-3.5) or GPT-4. As a result, the most fundamental differences are usually the amount of data fed to them and the complexity of the neural network. By understanding how the first GPT models work, you can automatically recognize the working concepts of more advanced models.
Even though there may be some subtle differences in the training process between different GPT models, the factors contributing the most to a model’s performance are the amount of data fed to it and the neural network’s complexity.

GPT-2 is built on top of GPT-1, meaning that it shares the same core architecture. During training, GPT-1 uses the standard log-likelihood language modeling objective:

GPT’s learning objective

This expression can be viewed as optimizing a conditional probability distribution p(output | input) for a given task (in the case of GPT-1, the task consists of predicting the next token). While this approach works well for individual tasks, the model is still not able to learn to perform multiple tasks. For instance, a model trained with the aforementioned objective to predict the next token in a sequence will perform poorly on a sentiment analysis problem without proper fine-tuning.
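For reference, the objective can be written out in the notation of the GPT-1 paper. Here U = {u_1, ..., u_n} is the corpus of tokens, k is the size of the context window, and Θ denotes the model parameters:

```latex
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```

Maximizing this sum amounts to making each observed token as probable as possible given the k tokens that precede it.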

The GPT-2 authors proposed a novel approach for replacing the common pre-training + fine-tuning framework, one that allows a trained model to perform well across different tasks. The idea consists of modeling not the standard probability p(output | input) but a task-conditioned probability p(output | input, task) instead. There are several approaches to incorporating the task type into the model. Most of the previous methods handled this information by making changes at the architecture level. Though this approach worked well in the past, it turned out that there is no need to modify the model’s architecture to incorporate the task type.

The key idea is that task information can be easily incorporated into the input sequence. For example:

  • If a sentence in language A needs to be translated into language B, then the input sequence in the dataset is written as:
(translate to french, english text, french text)
Example from the paper demonstrating input adaptation for translation tasks
  • If an answer should be given to a question in a provided context, then the input sequence takes the following form:
(answer the question, document, question, answer)
Example from the paper demonstrating input adaptation for question answering tasks
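A minimal sketch of how a task description could be folded into a plain-text sequence, so that the same model input carries both the task and its data. The function name `build_sequence`, the " | " separator and the sample sentences are assumptions made for illustration; the paper does not prescribe a concrete serialization.

```python
# Illustrative only: serialize (task, field_1, ..., field_n) into one string.
# The separator and the helper name are hypothetical, not taken from the paper.

def build_sequence(task: str, *fields: str, sep: str = " | ") -> str:
    """Join a task description and its text fields into a single training string."""
    return sep.join((task, *fields))


translation = build_sequence(
    "translate to french",
    "The cat sleeps.",   # source sentence
    "Le chat dort.",     # target sentence
)

qa = build_sequence(
    "answer the question",
    "GPT-2 was released by OpenAI in 2019.",  # document (context)
    "Who released GPT-2?",                    # question
    "OpenAI",                                 # answer
)

print(translation)
print(qa)
```

Because the task is just more text at the start of the sequence, no task-specific layers or heads are needed: the usual next-token objective is applied to the whole string.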

Surprisingly, the described approach had already been shown to be competitive in previous work (e.g., the MQAN model)! Its main drawback is that learning is slow.

Zero-shot learning is a popular term that designates the ability of a model to perform a certain task without having explicitly received any training examples for it. GPT-2 is an example of a model with this ability.

To use the idea of multitask learning from the previous section, for training we would normally need a dataset whose objects contain task descriptions, text inputs and labels. However, in reality, the authors developed a robust framework that turns this supervised problem into an unsupervised one and does not even need task descriptions!

The researchers conjectured that if a model were trained on a large and diverse dataset, the text would naturally contain many demonstrations of language tasks in different domains, which would help the model learn to perform them. To validate this hypothesis, the authors built a web scraper that collected the pages behind outbound links posted on Reddit that had received at least 3 karma, using karma as a rough signal that people found a link worthwhile. Scraping everything indiscriminately would likely have led to data quality issues and would also have produced a dataset too large for the model. As a result, the final dataset version contains slightly over 8M documents amounting to 40GB of text in total.
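The filtering step can be pictured as a simple threshold over scored links (the paper describes the threshold in terms of Reddit karma on outbound links). The sketch below is hypothetical: the `Link` record, its field names and the sample URLs are invented for illustration and do not reflect the authors’ actual pipeline.

```python
from dataclasses import dataclass

MIN_KARMA = 3  # threshold mentioned in the paper


@dataclass(frozen=True)
class Link:
    """Hypothetical record: a URL together with the karma of the post sharing it."""
    url: str
    karma: int


def select_links(links, min_karma=MIN_KARMA):
    """Keep each URL once, in order, if its post reached the karma threshold."""
    seen = set()
    kept = []
    for link in links:
        if link.karma >= min_karma and link.url not in seen:
            seen.add(link.url)
            kept.append(link.url)
    return kept


sample = [
    Link("https://example.com/a", 5),
    Link("https://example.com/b", 1),   # below the threshold, dropped
    Link("https://example.com/a", 7),   # duplicate URL, dropped
    Link("https://example.com/c", 3),
]

selected = select_links(sample)
print(selected)
```

Only the selected URLs would then be downloaded and converted to text; that part is omitted here.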

Dataset fragment containing a sentence with phrases in both English and French. Such text fragments can help the model perform translation tasks. The example is taken from the paper.
A similar example to the previous one, also from the paper.

Because the collected dataset is very diverse, to better account for rare words and characters, the authors used a slightly modified version of Byte-Pair Encoding (BPE) for input representations.
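To give an intuition for BPE, here is a toy sketch of the merge-learning loop operating on UTF-8 bytes. It is a simplified illustration, not GPT-2’s tokenizer: the function names are made up, words are split on whitespace, and the paper’s extra rule that restricts merges across character categories is left out.

```python
from collections import Counter


def to_byte_symbols(word):
    """Split a word into single-byte symbols, e.g. 'low' -> (b'l', b'o', b'w')."""
    return tuple(bytes([b]) for b in word.encode("utf-8"))


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    merged_vocab = Counter()
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_vocab[tuple(out)] += freq
    return merged_vocab


def learn_bpe(corpus, num_merges):
    """Greedily merge the most frequent adjacent pair, `num_merges` times."""
    vocab = Counter(to_byte_symbols(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(vocab, best)
    return merges, vocab


merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(dict(vocab))
```

Starting from bytes means any string can be encoded without an unknown-token fallback, while frequent sequences still end up as single symbols.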

According to the paper, GPT-2 has the same architecture as GPT-1 except for several changes:

  • Layer normalization was moved to the input of each sub-block, and an additional layer normalization was added after the final self-attention block.
  • Weights of residual layers are scaled by 1/√N at initialization, where N is the number of residual layers.
  • Context size is increased from 512 to 1024 tokens.
  • Batch size is increased from 64 to 512.
  • Vocabulary size is expanded from roughly 40,000 tokens to 50,257.
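The residual-weight scaling from the list above can be sketched in a few lines. This is an illustration under stated assumptions: the helper name is hypothetical, the base standard deviation of 0.02 is assumed rather than taken from the GPT-2 paper, and plain Python lists stand in for tensors.

```python
import math
import random


def init_residual_weights(n_in, n_out, n_residual_layers, std=0.02, seed=0):
    """Draw weights from N(0, std^2), then scale them by 1/sqrt(N).

    N is the number of residual layers; `std` is an assumed base value.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_residual_layers)
    return [
        [rng.gauss(0.0, std) * scale for _ in range(n_out)]
        for _ in range(n_in)
    ]


# Same seed, so the only difference between the two matrices is the scale.
base = init_residual_weights(4, 4, n_residual_layers=1)
scaled = init_residual_weights(4, 4, n_residual_layers=4)

print(base[0][0], scaled[0][0])  # the second value is half of the first
```

The intent of such scaling is to keep the sum of many residual contributions from growing with depth: the more residual layers a model stacks, the smaller each layer’s initial contribution.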

By turning a supervised problem into an unsupervised format, multitask learning helps GPT-2 perform well on various downstream tasks (except for text summarization) without explicit fine-tuning. In fact, several years later, this learning framework is still steadily gaining popularity in machine learning.

When a training dataset is sufficiently large and diverse, it allows very large models to enrich their linguistic knowledge by simply optimizing the log-likelihood language modeling objective. GPT-2 has become a perfect example of such a model.

All images are by the author unless noted otherwise.

