## A straightforward explanation of the model behind ChatGPT

Because I regularly work with colleagues from very different domains, I enjoy the challenge of explaining machine learning concepts to people who have little to no background in data science. Here, I try to explain how GPT is wired in simple terms, only this time in written form.

Behind ChatGPT’s popular magic, there is a less popular logic. You write a prompt to ChatGPT and it generates text that, whether or not it is accurate, resembles a human answer. How is it able to understand your prompt and generate coherent, comprehensible answers?

**Transformer Neural Networks.** This is the architecture designed to process vast amounts of unstructured data, in our case text. When we say architecture, we basically mean a series of mathematical operations carried out in several layers, many of them in parallel. This system of equations introduced several innovations that helped us overcome long-standing challenges of text generation, challenges that we were still struggling to resolve up until five years ago.

If GPT has already been around for five years (the GPT paper was indeed published in 2018), isn’t GPT old news? Why has it become immensely popular only recently? What’s the difference between GPT 1, 2, 3, 3.5 (ChatGPT) and 4?

All GPT versions were built on the same underlying architecture. However, each successive model contained more parameters and was trained on larger text datasets. The later GPT releases obviously introduced other novelties as well, especially in the training process, such as reinforcement learning through human feedback, which we will explain in the third part of this blog series.

**Vectors, matrices, tensors.** All these fancy words essentially describe units that contain chunks of numbers. Those numbers go through a series of mathematical operations (mostly multiplication and summation) until we reach optimal output values, which are the probabilities of the possible outcomes.
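To make "multiplication and summation" a bit more tangible, here is a minimal sketch in plain Python. The numbers and the names `inputs`, `weights` and `matvec` are made up for illustration; a real model performs the same kind of operation with far larger chunks of numbers.

```python
# A vector is a list of numbers; a matrix is a list of vectors (rows).
inputs = [1.0, 2.0, 3.0]

# Hypothetical weights: two rows, each as long as the input vector.
weights = [
    [0.5, -1.0, 0.25],
    [1.5, 0.0, -0.5],
]

def matvec(matrix, vector):
    """Multiply each row element-wise with the vector, then sum the products."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

outputs = matvec(weights, inputs)
print(outputs)  # [-0.75, 0.0]
```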

Output values? In this sense, that’s the text generated by the language model, right? Yes. Then what are the input values? Is it my prompt? Yes, but not entirely. So what else is behind it?

Before moving on to the different text decoding strategies, which will be the subject of the next blog post, it is useful to clear up the ambiguity. Let’s return to the fundamental question that we asked at the beginning. How does it understand human language?

**Generative Pre-trained Transformers**. These are the three words that the GPT abbreviation stands for. We touched on the Transformer part above: it is the architecture where the heavy calculations are made. But what exactly do we calculate? Where do we even get the numbers? It’s a language model, and all you do is input some text. How can you calculate with text?

Data is agnostic. All data is the same, whether in the form of text, sound or image.¹

**Tokens**. We split the text into small chunks (tokens) and assign a unique number to each one of them (token ID). Models don’t know words, images or audio recordings. They learn to represent them in huge series of numbers (parameters), which serve us as a tool to express the characteristics of things in numerical form. Tokens are the language units that convey meaning, and token IDs are the unique numbers that encode tokens.

Obviously, how we tokenise the language can vary. Tokenisation can involve splitting texts into sentences, words, parts of words (sub-words), or even individual characters.
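The different granularities can be sketched with a few lines of plain Python. This is only an illustration with a made-up example text and simple regular expressions; real tokenisers are more careful, and sub-word schemes are learned from data, so they are not shown here.

```python
import re

text = "Students celebrate. The party starts at noon."

# Sentence-level: split at whitespace that follows sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word-level: lower-case the text, then keep runs of letters as tokens.
words = re.findall(r"[a-z]+", text.lower())

# Character-level: every character (here of the first word) is a token.
characters = list(words[0])

print(sentences)   # ['Students celebrate.', 'The party starts at noon.']
print(words)       # ['students', 'celebrate', 'the', 'party', 'starts', 'at', 'noon']
print(characters)  # ['s', 't', 'u', 'd', 'e', 'n', 't', 's']
```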

Let’s consider a scenario where we have 50,000 tokens in our language corpus (similar to GPT-2, which has 50,257). How can we represent those units after tokenisation?

```
Sentence: "students celebrate the graduation with a big party"
Token labels: ['[CLS]', 'students', 'celebrate', 'the', 'graduation', 'with', 'a', 'big', 'party', '[SEP]']
Token IDs: tensor([[ 101, 2493, 8439, 1996, 7665, 2007, 1037, 2502, 2283, 102]])
```

Above is an example sentence tokenised into words. Tokenisation approaches can differ in their implementation. What’s important for us to understand right now is that we obtain numerical representations of language units (tokens) through their corresponding token IDs. So, now that we have these token IDs, can we simply feed them directly into the model where the calculations happen?
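The mapping from tokens to token IDs is, at heart, a lookup table. Below is a minimal sketch with a hypothetical nine-token vocabulary; the names `vocab`, `token_to_id` and `encode` are my own, and the IDs are simply list positions rather than the IDs of any real tokeniser.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of tokens.
vocab = ["[UNK]", "students", "celebrate", "the", "graduation",
         "with", "a", "big", "party"]
token_to_id = {token: idx for idx, token in enumerate(vocab)}

def encode(sentence):
    """Split on whitespace and map every token to its ID ([UNK] if unseen)."""
    unknown = token_to_id["[UNK]"]
    return [token_to_id.get(tok, unknown) for tok in sentence.lower().split()]

ids = encode("students celebrate the graduation with a big party")
print(ids)                                # [1, 2, 3, 4, 5, 6, 7, 8]
print(encode("students love the party"))  # [1, 0, 3, 8]
```

Note that the IDs are arbitrary: reordering `vocab` would change every number without changing the meaning of any word.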

Cardinality matters in math. Whether a token is represented by 101 or by 2493 will matter to the model. Remember, all we are doing is mainly multiplications and summations of big chunks of numbers, so multiplying a number by 101 or by 2493 makes a difference. Then how can we make sure that a token represented by the number 101 is not treated as less important than one represented by 2493, just because we happened to tokenise it arbitrarily so? How can we encode the words without causing a fictitious ordering?

**One-hot encoding.** Sparse mapping of tokens. One-hot encoding is the technique where we represent each token as a binary vector. That means only one single element in the vector is 1 (“hot”) and the rest are 0 (“cold”).

Each token is represented by a vector whose length equals the total number of tokens in our corpus. In simpler terms, if we have 50k tokens in our language, every token is represented by a vector of length 50k in which only one element is 1 and the rest are 0. Since every vector in this projection contains only one non-zero element, it is called a sparse representation. However, as you might think, this approach is very inefficient. Yes, we manage to remove the artificial cardinality between the token IDs, but we can’t extract any information about the semantics of the words. We can’t tell whether the word “party” refers to a celebration or to a political organisation by using sparse vectors. Besides, representing every token with a vector of size 50k means a total of 50k vectors of length 50k. This is very inefficient in terms of required memory and computation. Fortunately, we have better solutions.
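Here is a minimal sketch of one-hot encoding in plain Python, using a toy vocabulary of 8 tokens instead of 50k. The function name `one_hot` and the chosen token ID are assumptions for illustration.

```python
vocab_size = 8  # toy size; the text assumes roughly 50,000

def one_hot(token_id, size):
    """Return a vector of zeros with a single 1 at position token_id."""
    vector = [0] * size
    vector[token_id] = 1
    return vector

party = one_hot(7, vocab_size)
print(party)  # [0, 0, 0, 0, 0, 0, 0, 1]

# Cost of the full table for a 50,000-token vocabulary:
full_vocab = 50_000
print(full_vocab * full_vocab)  # 2500000000 numbers, nearly all of them zero
```

Every pair of distinct one-hot vectors is equally far apart, which is exactly why this representation removes the fictitious ordering but also carries no information about meaning.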

**Embeddings**. Dense representation of tokens. Tokenised units pass through an embedding layer where each token is transformed into a continuous vector representation of a fixed size. For instance, in the smallest GPT-2 and GPT-3 models, each token is represented by a vector of 768 numbers. These numbers are assigned randomly at first and are then learned by the model after seeing lots of data (training).

```
Token Label: "party"
Token ID: 2283
Embedding Vector Length: 768
Embedding Tensor Shape: ([1, 10, 768])
Embedding vector:
tensor([ 2.9950e-01, -2.3271e-01, 3.1800e-01, -1.2017e-01, -3.0701e-01,
        -6.1967e-01, 2.7525e-01, 3.4051e-01, -8.3757e-01, -1.2975e-02,
        -2.0752e-01, -2.5624e-01, 3.5545e-01, 2.1002e-01, 2.7588e-02,
        -1.2303e-01, 5.9052e-01, -1.1794e-01, 4.2682e-02, 7.9062e-01,
        2.2610e-01, 9.2405e-02, -3.2584e-01, 7.4268e-01, 4.1670e-01,
        -7.9906e-02, 3.6215e-01, 4.6919e-01, 7.8014e-02, -6.4713e-01,
        4.9873e-02, -8.9567e-02, -7.7649e-02, 3.1117e-01, -6.7861e-02,
        -9.7275e-01, 9.4126e-02, 4.4848e-01, 1.5413e-01, 3.5430e-01,
        3.6865e-02, -7.5635e-01, 5.5526e-01, 1.8341e-02, 1.3527e-01,
        -6.6653e-01, 9.7280e-01, -6.6816e-02, 1.0383e-01, 3.9125e-02,
        -2.2133e-01, 1.5785e-01, -1.8400e-01, 3.4476e-01, 1.6725e-01,
        -2.6855e-01, -6.8380e-01, -1.8720e-01, -3.5997e-01, -1.5782e-01,
        3.5001e-01, 2.4083e-01, -4.4515e-01, -7.2435e-01, -2.5413e-01,
        2.3536e-01, 2.8430e-01, 5.7878e-01, -7.4840e-01, 1.5779e-01,
        -1.7003e-01, 3.9774e-01, -1.5828e-01, -5.0969e-01, -4.7879e-01,
        -1.6672e-01, 7.3282e-01, -1.2093e-01, 6.9689e-02, -3.1715e-01,
        -7.4038e-02, 2.9851e-01, 5.7611e-01, 1.0658e+00, -1.9357e-01,
        1.3133e-01, 1.0120e-01, -5.2478e-01, 1.5248e-01, 6.2976e-01,
        -4.5310e-01, 2.9950e-01, -5.6907e-02, -2.2957e-01, -1.7587e-02,
        -1.9266e-01, 2.8820e-02, 3.9966e-03, 2.0535e-01, 3.6137e-01,
        1.7169e-01, 1.0535e-01, 1.4280e-01, 8.4879e-01, -9.0673e-01,
        …
        …
        … ])
```

Above is an example embedding vector for the word “party”.

Now we have a 50,000×768 table of numbers, which is significantly more efficient than the 50,000×50,000 one-hot encoding.

Embedding vectors will be the inputs to the model. Thanks to dense numerical representations, we will be able to capture the semantics of words: the embedding vectors of tokens that are similar will be closer to each other.
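An embedding layer can be pictured as a table with one row per token ID. The sketch below is a simplification with toy sizes and hypothetical names (`embedding_table`, `embed`); it only shows the random starting point and the lookup, not the training that later adjusts the numbers.

```python
import random

random.seed(0)                    # only to make the sketch reproducible
vocab_size, embedding_dim = 8, 4  # toy sizes; the text uses 50,000 and 768

# One row per token ID, filled with small random numbers.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up the row of the table for every token ID."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([1, 7])
print(len(vectors), len(vectors[0]))  # 2 4

# Size comparison: dense embedding table vs. one-hot table.
print(50_000 * 768)     # 38400000
print(50_000 * 50_000)  # 2500000000
```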

How can you measure the similarity of two language units in context? There are several functions that can measure the similarity between two vectors of the same size. Let’s explain it with an example.

Consider a simple example where we have the embedding vectors of the tokens “cat”, “dog”, “car” and “banana”. For simplicity, let’s use an embedding size of 4. That means there will be four learned numbers to represent each token.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Example word embeddings for "cat", "dog", "car" and "banana"
embedding_cat = np.array([0.5, 0.3, -0.1, 0.9])
embedding_dog = np.array([0.6, 0.4, -0.2, 0.8])
embedding_car = np.array([0.5, 0.3, -0.1, 0.9])
embedding_banana = np.array([0.1, -0.8, 0.2, 0.4])
```

Using the vectors above, let’s calculate the similarity scores using cosine similarity. Human logic would find the words dog and cat more related to each other than the words banana and car. Can we expect math to simulate our logic?

```python
# Calculate cosine similarity between "cat" and "dog"
similarity = cosine_similarity([embedding_cat], [embedding_dog])[0][0]
print(f"Cosine Similarity between 'cat' and 'dog': {similarity:.4f}")

# Calculate cosine similarity between "car" and "banana"
similarity_2 = cosine_similarity([embedding_car], [embedding_banana])[0][0]
print(f"Cosine Similarity between 'car' and 'banana': {similarity_2:.4f}")
```

```
Cosine Similarity between 'cat' and 'dog': 0.9832
Cosine Similarity between 'car' and 'banana': 0.1511
```

We can see that the words “cat” and “dog” have a very high similarity score, whereas the words “car” and “banana” have a very low one. Now imagine embedding vectors of length 768 instead of 4 for each of the 50,000 tokens in our language corpus. That’s how we are able to find the words that are related to each other.
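If you are curious what the similarity function does under the hood, cosine similarity is just the dot product of the two vectors divided by the product of their lengths. Here is a minimal sketch in plain Python; the function name `cosine` is my own, and the vectors are the toy values used above.

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cat = [0.5, 0.3, -0.1, 0.9]
dog = [0.6, 0.4, -0.2, 0.8]
banana = [0.1, -0.8, 0.2, 0.4]

print(f"{cosine(cat, dog):.4f}")     # 0.9832
print(f"{cosine(cat, banana):.4f}")  # 0.1511
```

A score close to 1 means the two vectors point in nearly the same direction, while a score close to 0 means they have little in common.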

Now, let’s have a look at the two sentences below, which have higher semantic complexity.

```
"students celebrate the graduation with a big party"
"deputy leader is very respected within the party"
```

The word “party” conveys different meanings in the first and the second sentence. How are large language models capable of mapping out the difference between “party” as a political organisation and “party” as a celebratory social event?

Can we distinguish the different meanings of the same token by relying on the token embeddings? The truth is, although embeddings provide us with a range of advantages, they are not adequate to disentangle the whole complexity of the semantic challenges of human language.

**Self-attention.** The solution was again offered by transformer neural networks. We generate a new set of weights, namely the query, key and value matrices. Those weights learn to represent the embedding vectors of tokens as a new set of embeddings. How? Simply by taking a weighted average of the original embeddings (after they are projected by the value matrix). Each token “attends” to every other token (including itself) in the input sentence and calculates a set of attention weights, which are then used to build the new, so-called “*contextual embeddings*”.

All it really does is map the importance of the words in the input sentence by assigning a new set of numbers (attention weights) that are calculated using the token embeddings.

The visualisation above demonstrates the “attention” of the token “party” to the rest of the tokens in the two sentences. The boldness of a connection indicates the importance or relevance of the tokens. Attention and “attending” are simply a series of numbers and their magnitudes, which we use to represent the importance of words numerically. In the first sentence the word “party” attends to the word “celebrate” the most, whereas in the second sentence the word “deputy” receives the highest attention. That’s how the model is able to incorporate the context by examining surrounding words.

As we mentioned, in the attention mechanism we derive a new set of weight matrices, namely Query, Key and Value (simply q, k, v). They are cascading matrices of the same size (usually smaller than the embedding vectors) that are introduced to the architecture to capture the complexity of the language units. Attention parameters are learned in order to untangle the relationships between words, pairs of words, pairs of pairs of words, pairs of pairs of pairs of words and so forth. Below is a visualisation of the query, key and value matrices finding the most relevant word.

The visualisation illustrates the q and k vectors as vertical bands, where the boldness of each band reflects its magnitude. The connections between tokens signify the weights determined by attention, indicating that the q vector for “party” aligns most significantly with the k vectors for “is”, “deputy” and “respected”.

To make the attention mechanism and the concepts of q, k and v less abstract, imagine that you went to a party and heard an amazing song that you fell in love with. After the party you are dying to find the song and listen to it again, but you only remember barely five words from the lyrics and a part of the melody (query). To find the song, you decide to go through the party playlist (keys) and listen (similarity function) to all the songs in the list that were played at the party. When you finally recognise the song, you note down the name of the song (value).
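For readers who prefer numbers over playlists, the query, key and value steps can be sketched in plain Python. This is a toy version under stated assumptions: three tokens, random untrained weights, a single attention head, the commonly used scaling by the square root of the head size, and no masking. The helper names (`matmul`, `softmax`, `random_matrix`) are my own.

```python
import math
import random

random.seed(0)  # reproducible toy numbers

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model, d_head = 3, 4, 2   # toy sizes
x = random_matrix(n_tokens, d_model)  # stand-in token embeddings
w_q = random_matrix(d_model, d_head)  # untrained query weights
w_k = random_matrix(d_model, d_head)  # untrained key weights
w_v = random_matrix(d_model, d_head)  # untrained value weights

q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# Similarity of every query with every key, scaled by sqrt(d_head).
k_transposed = [list(col) for col in zip(*k)]
scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_transposed)]

# One row of attention weights per token; each row sums to 1.
attention = [softmax(row) for row in scores]

# Contextual embeddings: weighted average of the value vectors.
contextual = matmul(attention, v)
print(len(contextual), len(contextual[0]))  # 3 2
```

In the song analogy, `scores` is how well each song in the playlist matches what you remember, `attention` turns those matches into shares of your attention, and `contextual` is the blend of song names you walk away with.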

One last important trick that transformers introduced is adding positional encodings to the embedding vectors, simply because we would like to capture the position information of each word. It improves our chances of predicting the next token more accurately in line with the true sentence context. This is essential information because swapping the words often changes the context entirely. For instance, the sentences *“Tim chased clouds all his life”* and *“clouds chased Tim all his life”* are completely different in essence.
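One way to build such position information is the fixed sine and cosine scheme from the original Transformer paper, sketched below in plain Python. Treat the choice as an assumption for illustration: GPT models learn their position vectors during training instead, and the embedding values and function names here are made up.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(d_model):
        # Each pair of indices (0,1), (2,3), ... shares one wavelength.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(embedding, position):
    """Add the position vector to the token embedding, element by element."""
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

tim = [0.2, -0.1, 0.4, 0.3]   # made-up embedding for "Tim"

first = add_position(tim, 0)  # "Tim" as the first word
third = add_position(tim, 2)  # "Tim" as the third word
print(first)
print(third)
```

The same word now enters the model as a different vector depending on where it stands in the sentence.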

All of the mathematical tricks that we have explored at a basic level so far have the objective of predicting the next token, given the sequence of input tokens. Indeed, GPT is trained on one simple task, which is text generation, or in other words next-token prediction. At the core of the matter, we measure the probability of a token given the sequence of tokens that appeared before it.
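As a rough sketch of that last step, imagine the model has produced one raw score per candidate token. A softmax function turns those scores into probabilities. The candidate tokens and the scores below are invented for illustration; a real model scores every token in its vocabulary.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the token that follows
# "students celebrate the graduation with a big".
candidates = ["party", "banana", "car"]
scores = [3.0, 0.5, 1.0]

probabilities = softmax(scores)
for token, p in zip(candidates, probabilities):
    print(f"P({token} | previous tokens) = {p:.3f}")

# The simplest strategy: pick the most probable token.
next_token = candidates[probabilities.index(max(probabilities))]
print(next_token)  # party
```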

You might wonder how models learn the optimal numbers starting from randomly assigned numbers. It is probably a topic for another blog post, however it is definitely fundamental to understand. Besides, it’s a great sign that you are already questioning the basics. To clear things up, we use an optimisation algorithm that adjusts the parameters based on a metric that is called the loss function. This metric is calculated by comparing the predicted values with the actual values. The model tracks the changes of the metric and, depending on how small or large the value of the loss is, it tunes the numbers. This process is repeated until the loss can’t get any smaller given the rules we set in the algorithm, which we call hyperparameters. An example hyperparameter is how frequently we want to calculate the loss and tune the weights. This is the rudimentary idea behind learning.
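The learning loop can be sketched with a single parameter. The example below is a deliberately tiny stand-in, not how GPT is actually trained: it uses made-up data, a mean squared error loss and plain gradient descent, and the names `weight`, `learning_rate` and `n_steps` are my own.

```python
# Toy data generated by the rule y = 2 * x; the model must discover the 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

weight = 0.0          # stand-in for a randomly assigned starting parameter
learning_rate = 0.01  # hyperparameter: size of each adjustment
n_steps = 200         # hyperparameter: how many times we adjust

def loss(w):
    """Mean squared error between predictions w * x and actual values y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial_loss = loss(weight)
for _ in range(n_steps):
    # Gradient of the loss with respect to the weight: which way is downhill.
    gradient = sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    weight -= learning_rate * gradient

print(round(weight, 3))  # 2.0
print(loss(weight) < initial_loss)  # True
```

A language model does the same thing with billions of parameters at once, comparing its predicted next tokens with the tokens that actually follow in the training text.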

I hope that in this short post I was able to clear up the picture at least a little bit. The second part of this blog series will focus on decoding strategies, namely on why your prompt matters. The third and last part will be dedicated to the key factor in ChatGPT’s success, which is reinforcement learning through human feedback. Many thanks for reading. Until next time.