Learn the core concepts behind OpenAI’s GPT models

Introduction
It was 2021 when I wrote my first few lines of code using a GPT model, and that was the moment I realized that text generation had reached an inflection point. Prior to that, I had written language models from scratch in grad school, and I had experience working with other text generation systems, so I knew just how difficult it was to get them to produce useful results. I was fortunate to get early access to GPT-3 as part of my work on the announcement of its release through the Azure OpenAI Service, and I tried it out in preparation for its launch. I asked GPT-3 to summarize a long document and experimented with few-shot prompts. I could see that the results were far more advanced than those of prior models, which made me excited about the technology and eager to learn how it’s implemented. And now that the follow-on GPT-3.5, ChatGPT, and GPT-4 models are rapidly gaining wide adoption, more people in the field are also curious about how they work. While the details of their inner workings are proprietary and complex, all the GPT models share some fundamental ideas that aren’t too hard to understand. My goal for this post is to explain the core concepts of language models in general and GPT models in particular, with the explanations geared toward data scientists and machine learning engineers.
How generative language models work
Let’s start by exploring how generative language models work. The very basic idea is the following: they take n tokens as input, and produce one token as output.
This seems like a fairly straightforward concept, but in order to really understand it, we need to know what a token is.
A token is a chunk of text. In the context of OpenAI GPT models, common and short words typically correspond to a single token, such as the word “We” in the image below. Long and less commonly used words are generally broken up into several tokens. For example, the word “anthropomorphizing” in the image below is broken up into three tokens. Abbreviations like “ChatGPT” may be represented with a single token or broken up into multiple tokens, depending on how common it is for those letters to appear together. You can go to OpenAI’s Tokenizer page, enter your text, and see how it gets split up into tokens. You can choose between “GPT-3” tokenization, which is used for text, and “Codex” tokenization, which is used for code. We’ll keep the default “GPT-3” setting.
You can also use OpenAI’s open-source tiktoken library to tokenize using Python code. OpenAI offers a few different tokenizers that each have slightly different behavior. In the code below we use the tokenizer for “davinci,” which is a GPT-3 model, to match the behavior you saw using the UI.
import tiktoken

# Get the encoding for the davinci GPT-3 model, which is the "r50k_base" encoding.
encoding = tiktoken.encoding_for_model("davinci")

text = "We need to stop anthropomorphizing ChatGPT."
print(f"text: {text}")

token_integers = encoding.encode(text)
print(f"total number of tokens: {encoding.n_vocab}")
print(f"token integers: {token_integers}")

token_strings = [encoding.decode_single_token_bytes(token) for token in token_integers]
print(f"token strings: {token_strings}")
print(f"number of tokens in text: {len(token_integers)}")

encoded_decoded_text = encoding.decode(token_integers)
print(f"encoded-decoded text: {encoded_decoded_text}")
text: We need to stop anthropomorphizing ChatGPT.
total number of tokens: 50257
token integers: [1135, 761, 284, 2245, 17911, 25831, 2890, 24101, 38, 11571, 13]
token strings: [b'We', b' need', b' to', b' stop', b' anthrop', b'omorph', b'izing', b' Chat', b'G', b'PT', b'.']
number of tokens in text: 11
encoded-decoded text: We need to stop anthropomorphizing ChatGPT.
You can see in the output of the code that this tokenizer contains 50,257 different tokens, and that each token is internally mapped to an integer index. Given a string, we can split it into integer tokens, and we can convert those integers into the sequence of characters they correspond to. Encoding and decoding a string should always give us the original string back.
This gives you a good intuition for how OpenAI’s tokenizer works, but you may be wondering why they chose those token lengths. Let’s consider some other options for tokenization. Suppose we try the simplest possible implementation, where each letter is a token. That makes it easy to break up the text into tokens, and it keeps the total number of different tokens small. However, we can’t encode nearly as much information as in OpenAI’s approach. If we used letter-based tokens in the example above, 11 tokens could only encode “We need to”, while 11 of OpenAI’s tokens can encode the entire sentence. It turns out that current language models have a limit on the maximum number of tokens they can receive. Therefore, we want to pack as much information as possible into each token.
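To see the difference concretely, here’s a quick check in Python using the same sentence as before:

text = "We need to stop anthropomorphizing ChatGPT."
# With letter-based tokens, 11 tokens cover only the first 11 characters:
print(text[:11])   # 'We need to '
# whereas encoding the whole sentence would take one token per character:
print(len(text))   # 43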
Now let’s consider the scenario where each word is a token. Compared to OpenAI’s approach, we’d only need seven tokens to represent the same sentence, which seems more efficient. And splitting by word is also straightforward to implement. However, language models need to have a complete list of the tokens they might encounter, and that’s not feasible for whole words: not only because there are so many words in the dictionary, but also because it would be difficult to keep up with domain-specific terminology and any new words that are invented.
So it’s not surprising that OpenAI settled on a solution somewhere in between those two extremes. Other companies have released tokenizers that follow a similar approach, for example SentencePiece by Google.
Now that we have a better understanding of tokens, let’s go back to our original diagram and see if we can understand it a bit better. Generative models take n tokens in, which could be a few words, a few paragraphs, or a few pages. And they produce a single token out, which could be a short word or a part of a word.
That makes a bit more sense now.
But if you’ve played with OpenAI’s ChatGPT, you know that it produces many tokens, not just a single token. That’s because this basic idea is applied in an expanding-window pattern. You give it n tokens in, it produces one token out, then it incorporates that output token as part of the input of the next iteration, produces a new token out, and so on. This pattern keeps repeating until a stopping condition is reached, indicating that it has finished generating all the text you need.
For example, if I type “We need to” as input to my model, the algorithm may produce the results shown below:
While playing with ChatGPT, you may also have noticed that the model is not deterministic: if you ask it the exact same question twice, you’ll likely get two different answers. That’s because the model doesn’t actually produce a single predicted token; instead it returns a probability distribution over all the possible tokens. In other words, it returns a vector in which each entry expresses the probability of a particular token being chosen. The model then samples from that distribution to generate the output token.
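Here’s a minimal sketch of that expanding-window loop in Python. The next_token_distribution function is a toy stand-in for a trained model (it returns a hand-crafted distribution over a tiny vocabulary), so the point is the shape of the loop and the sampling step, not the quality of the predictions:

import random

# Toy vocabulary; "<end>" plays the role of the stopping condition.
VOCAB = ["We", " need", " to", " stop", " anthropomorphizing", " ChatGPT", ".", "<end>"]

def next_token_distribution(tokens):
    # Hypothetical stand-in for a trained model: given the tokens so far,
    # return a probability distribution over the vocabulary.
    likely = min(len(tokens), len(VOCAB) - 1)
    probs = [0.05] * len(VOCAB)
    probs[likely] = 1.0
    total = sum(probs)
    return [p / total for p in probs]

def generate(prompt_tokens, max_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = next_token_distribution(tokens)
        # Sample one token from the distribution (this is why the output isn't
        # deterministic), then feed it back in as part of the next input.
        token = random.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<end>":
            break
        tokens.append(token)
    return "".join(tokens)

print(generate(["We", " need", " to"]))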
How does the model come up with that probability distribution? That’s what the training phase is for. During training, the model is exposed to a lot of text, and its weights are tuned to predict good probability distributions, given a sequence of input tokens. GPT models are trained with a large portion of the internet, so their predictions reflect a mix of the information they’ve seen.
You now have a very good understanding of the idea behind generative models. Notice that I’ve only explained the idea though; I haven’t yet given you an algorithm. It turns out that this idea has been around for many decades, and it has been implemented using several different algorithms over the years. Next we’ll look at some of those algorithms.
A brief history of generative language models
Hidden Markov Models (HMMs) became popular in the 1970s. Their internal representation encodes the grammatical structure of sentences (nouns, verbs, and so on), and they use that knowledge when predicting new words. However, because they are Markov processes, they only take into consideration the most recent token when generating a new token. So, they implement a very simple version of the “n tokens in, one token out” idea, where n = 1. As a result, they don’t generate very sophisticated output. Let’s consider the following example:
If we input “The quick brown fox jumps over the” to a language model, we would expect it to return “lazy.” However, an HMM will only see the last token, “the,” and with so little information it’s unlikely that it will give us the prediction we expect. As people experimented with HMMs, it became clear that language models need to support more than one input token in order to generate good outputs.
N-grams became popular in the 1990s because they fixed the main limitation of HMMs by taking more than one token as input. An n-gram model would probably do pretty well at predicting the word “lazy” for the previous example.
The simplest implementation of an n-gram is a bigram with character-based tokens, which, given a single character, is able to predict the next character in the sequence. You can create one of these in just a few lines of code, and I encourage you to give it a try. First, count the number of different characters in your training text (let’s call it n), and create an n x n 2D matrix initialized with zeros. Each pair of input characters can be used to locate a particular entry in this matrix, by choosing the row corresponding to the first character, and the column corresponding to the second character. As you parse your training data, for every pair of characters, you simply add one to the corresponding matrix cell. For example, if your training data contains the word “car,” you would add one to the cell in the “c” row and “a” column, and then add one to the cell in the “a” row and “r” column. Once you have accumulated the counts for all your training data, convert each row into a probability distribution by dividing each cell by the total across that row.
Then to make a prediction, you need to give it a single character to start, for example, “c”. You look up the probability distribution that corresponds to the “c” row, and sample from that distribution to produce the next character. Then you take the character you produced, and repeat the process, until you reach a stopping condition. Higher-order n-grams follow the same basic idea, but they’re able to look at a longer sequence of input tokens by using n-dimensional tensors.
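If you’d like to give it a try, here’s one way that recipe might look in Python. This is a minimal sketch with a made-up one-sentence training text (a real model would of course need far more data), so treat it as an illustration of the idea rather than a reference implementation:

import random

# Toy training data for the character-level bigram model.
training_text = "the quick brown fox jumps over the lazy dog. a red car."

# One row and one column per distinct character in the training text.
chars = sorted(set(training_text))
index = {ch: i for i, ch in enumerate(chars)}
n = len(chars)
counts = [[0] * n for _ in range(n)]

# For every pair of adjacent characters, add one to the corresponding cell:
# the row is the first character, the column is the second character.
for first, second in zip(training_text, training_text[1:]):
    counts[index[first]][index[second]] += 1

# Convert each row of counts into a probability distribution.
probabilities = []
for row in counts:
    total = sum(row)
    probabilities.append([count / total if total else 0.0 for count in row])

# To generate text: start from one character, sample the next character from
# the row that corresponds to the current character, and repeat.
def generate(start_char, length=30):
    text = start_char
    for _ in range(length):
        row = probabilities[index[text[-1]]]
        if sum(row) == 0:
            break
        text += random.choices(chars, weights=row, k=1)[0]
    return text

print(generate("c"))

The output won’t be meaningful English, but it shows the “sample, append, repeat” loop in action.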
N-grams are easy to implement. However, because the size of the matrix grows exponentially as the number of input tokens increases, they don’t scale well to a larger number of tokens. And with just a few input tokens, they’re not able to produce good results. A new technique was needed to continue making progress in this field.
In the 2000s, Recurrent Neural Networks (RNNs) became quite popular because they’re able to accept a much larger number of input tokens than previous techniques. In particular, LSTMs and GRUs, which are types of RNNs, became widely used and proved capable of generating fairly good results.
RNNs are a type of neural network, but unlike traditional feed-forward neural networks, their architecture can adapt to accept any number of inputs and produce any number of outputs. For example, if we give an RNN the input tokens “We,” “need,” and “to,” and want it to generate a few more tokens until a full stop is reached, the RNN might have the following structure:
Each of the nodes in the structure above has the same weights. You can think of it as a single node that connects to itself and executes repeatedly (hence the name “recurrent”), or you can think of it in the expanded form shown in the image above. One key capability added to LSTMs and GRUs over basic RNNs is the presence of an internal memory cell that gets passed from one node to the next. This enables later nodes to remember certain aspects of previous ones, which is essential for making good text predictions.
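To show what “a single node that executes repeatedly” means in code, here’s a minimal NumPy sketch of the recurrence in a vanilla RNN, with random numbers standing in for learned weights and token embeddings (LSTMs and GRUs add gating and a memory cell on top of this basic loop):

import numpy as np

rng = np.random.default_rng(0)
embedding_size, hidden_size = 8, 16

# The same weights are reused at every step (hence "recurrent").
W_x = rng.normal(size=(hidden_size, embedding_size))  # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))     # hidden-to-hidden weights
b = np.zeros(hidden_size)

# Toy embeddings standing in for the input tokens "We", "need", and "to".
token_embeddings = [rng.normal(size=embedding_size) for _ in range(3)]

h = np.zeros(hidden_size)  # initial hidden state
for x in token_embeddings:
    # The new state depends on the current input and on the previous state,
    # which is how information flows from earlier tokens to later ones.
    h = np.tanh(W_x @ x + W_h @ h + b)

# In a full model, the final hidden state would be used to predict the next token.
print(h.shape)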
However, RNNs have instability issues with very long sequences of text. The gradients in the model tend to grow exponentially (called “exploding gradients”) or decrease to zero (called “vanishing gradients”), preventing the model from continuing to learn from training data. LSTMs and GRUs mitigate the vanishing gradients issue, but don’t prevent it completely. So, even though in theory their architecture allows for inputs of any length, in practice there are limitations to that length. Once again, the quality of the text generation was limited by the number of input tokens supported by the algorithm, and a new breakthrough was needed.
In 2017, the paper that introduced Transformers was released by Google, and we entered a new era in text generation. The architecture used in Transformers allows a huge increase in the number of input tokens, eliminates the gradient instability issues seen in RNNs, and is highly parallelizable, which means that it’s able to take advantage of the power of GPUs. Transformers are widely used today, and they’re the technology chosen by OpenAI for their latest GPT text generation models.
Transformers are based on the “attention mechanism,” which allows the model to pay more attention to some inputs than others, regardless of where they show up in the input sequence. For example, let’s consider the following sentence:
In this scenario, when the model is predicting the verb “bought,” it needs to match the past tense of the verb “went.” In order to do that, it has to pay a lot of attention to the token “went.” In fact, it may pay more attention to the token “went” than to the token “and,” despite the fact that “went” appears much earlier in the input sequence.
This selective attention behavior in GPT models is enabled by a novel idea in the 2017 paper: the use of a “masked multi-head attention” layer. Let’s break down this term, and dive deeper into each of its sub-terms:
- Attention: An “attention” layer contains a matrix of weights representing the strength of the relationship between all pairs of token positions in the input sentence. These weights are learned during training. If the weight that corresponds to a pair of positions is large, then the two tokens in those positions greatly influence each other. This is the mechanism that enables the Transformer to pay more attention to some tokens than others, regardless of where they show up in the sentence.
- Masked: The attention layer is “masked” if the matrix is restricted to the relationship between each token position and earlier positions in the input. This is what GPT models use for text generation, because an output token can only depend on the tokens that come before it.
- Multi-head: The Transformer uses a masked “multi-head” attention layer because it contains several masked attention layers that operate in parallel.
The memory cell of LSTMs and GRUs also enables later tokens to remember some aspects of earlier tokens. However, if two related tokens are very far apart, the gradient issues can get in the way. Transformers don’t have that problem because each token has a direct connection to all other tokens that precede it.
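To make these ideas a bit more tangible, here’s a minimal NumPy sketch of a single masked attention head, with random matrices standing in for the learned query, key, and value projections. A multi-head layer runs several of these in parallel and combines their outputs:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 16  # 5 token positions, 16 dimensions per token

Q = rng.normal(size=(seq_len, d))  # queries
K = rng.normal(size=(seq_len, d))  # keys
V = rng.normal(size=(seq_len, d))  # values

# Attention scores between every pair of token positions.
scores = Q @ K.T / np.sqrt(d)

# The mask: each position may only attend to itself and to earlier positions.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax turns each row of scores into attention weights that sum to one.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each output position is a weighted mix of the value vectors of the positions
# it's allowed to attend to.
output = weights @ V

# Every entry above the diagonal is zero: that's the "masked" property.
print(weights.round(2))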
Now that you understand the main ideas of the Transformer architecture used in GPT models, let’s take a look at the distinctions between the various GPT models that are currently available.
How different GPT models are implemented
At the time of writing, the three latest text generation models released by OpenAI are GPT-3.5, ChatGPT, and GPT-4, and they are all based on the Transformer architecture. In fact, “GPT” stands for “Generative Pre-trained Transformer.”
GPT-3.5 is a transformer trained as a completion-style model, which means that if we give it a few words as input, it’s capable of generating a few more words that are likely to follow them in the training data.
ChatGPT, on the other hand, is trained as a conversation-style model, which means that it performs best when we communicate with it as if we’re having a conversation. It’s based on the same transformer base model as GPT-3.5, but it’s fine-tuned with conversation data. Then it’s further fine-tuned using Reinforcement Learning from Human Feedback (RLHF), which is a technique that OpenAI introduced in their 2022 InstructGPT paper. In this technique, we give the model the same input twice, get back two different outputs, and ask a human ranker which output they prefer. That choice is then used to improve the model through fine-tuning. This technique brings alignment between the outputs of the model and human expectations, and it’s critical to the success of OpenAI’s latest models.
GPT-4, on the other hand, can be used both for completion and conversation, and has its own entirely new base model. This base model is also fine-tuned with RLHF for better alignment with human expectations.
Writing code that uses GPT models
You have two options for writing code that uses GPT models: you can use the OpenAI API directly, or you can use the OpenAI API on Azure. Either way, you write code using the same API calls, which you can learn about in OpenAI’s API reference pages.
The main difference between the two is that Azure provides the following additional features:
- Automated responsible AI filters that mitigate unethical uses of the API
- Azure’s security features, such as private networks
- Regional availability, for the best performance when interacting with the API
If you’re writing code that uses these models, you’ll need to pick the specific version you want to use. Here’s a quick cheat-sheet with the versions that are currently available in the Azure OpenAI Service:
- GPT-3.5: text-davinci-002, text-davinci-003
- ChatGPT: gpt-35-turbo
- GPT-4: gpt-4, gpt-4-32k
The two GPT-4 versions differ mainly in the number of tokens they support: gpt-4 supports 8,000 tokens, and gpt-4-32k supports 32,000. In contrast, the GPT-3.5 models only support 4,000 tokens.
Since GPT-4 is currently the most expensive option, it’s a good idea to start with one of the other models, and upgrade only if needed. For more details about these models, check out the documentation.
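As a starting point, here’s a minimal sketch of a chat call using the openai Python package (pre-1.0 versions of the library); the API key and prompt are placeholders, and on Azure you’d additionally configure the API type, base URL, API version, and your deployment name:

import openai

# Placeholder key: replace with your own, or configure the Azure-specific
# settings (openai.api_type, openai.api_base, openai.api_version) instead.
openai.api_key = "YOUR_API_KEY"

response = openai.ChatCompletion.create(
    # "gpt-3.5-turbo" is the model name on the OpenAI API; it corresponds to
    # the "gpt-35-turbo" deployment on the Azure OpenAI Service.
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Explain what a token is in one sentence."},
    ],
)

print(response.choices[0].message.content)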
Conclusion
In this article, we’ve covered the fundamental principles common to all generative language models, and the distinctive aspects of the latest GPT models from OpenAI in particular.
Along the way, we emphasized the core idea of language models: “n tokens in, one token out.” We explored how tokens are broken up, and why they’re broken up that way. And we traced the decades-long evolution of language models from the early days of Hidden Markov Models to the recent Transformer-based models. Finally, we described the three latest Transformer-based GPT models from OpenAI, how each of them is implemented, and how you can write code that makes use of them.
By now, you should be well equipped to have informed conversations about GPT models, and to start using them in your own coding projects. I plan to write more of these explainers about language models, so please follow me and let me know which topics you’d like to see covered! Thanks for reading!
All images unless otherwise noted are by the author.