The backbone of ChatGPT is the GPT model, which is built on the Transformer architecture. The backbone of the Transformer is the Attention mechanism. The toughest concept in Attention for many people to grok is the trio of Key, Value, and Query. In this post, I'll use an analogy of potions to internalize these concepts. Even if you already understand the math of the Transformer mechanically, I hope that by the end of this post you'll have developed a more intuitive understanding of the inner workings of GPT from end to end.
This explanation requires no math background. For the technically inclined, I add more technical explanations in [brackets]. You can safely skip the notes in [brackets] and the side notes in quote blocks like this one. Throughout my writing, I make up human-readable interpretations of the intermediate states of the Transformer model to help the explanation, but GPT doesn't think exactly like that.
[When I talk about “attention”, I exclusively mean “self-attention”, as that is what’s behind GPT. But the same analogy explains the general concept of “attention” just as well.]
The Set Up
GPT can spew out paragraphs of coherent content because it does one task superbly well: "Given a text, what word comes next?" Let's role-play as GPT: "Sarah lies still on the bed, feeling ____". Can you fill in the blank?
One reasonable answer, among many, is "tired". In the rest of the post, I'll unpack how GPT arrives at this answer. (For fun, I put this prompt into ChatGPT and it wrote a short story out of it.)
The Analogy: (Key, Value, Query), or (Tag, Potion, Recipe)
You feed the above prompt to GPT. In GPT, each word comes equipped with three things: a Key, a Value, and a Query, all of which are learned from devouring the entire web of text during the training of the GPT model. It's the interaction among these three ingredients that allows GPT to make sense of a word in the context of a text. So what do they do, really?
Let's set up our analogy of alchemy. For every word, we have:
- A potion (aka "value"): The potion contains rich information about the word. For illustrative purposes, imagine the potion of the word "lies" contains information like "tired; dishonesty; can have a positive connotation if it's a white lie; …". The word "lies" can take on multiple meanings, e.g. "tell lies" (related to dishonesty) or "lies down" (related to tiredness). You can only tell the true meaning in the context of a text. Right now, the potion contains information for both meanings, because it doesn't yet have the context of a text.
- An alchemist's recipe (aka "query"): The alchemist of a given word, e.g. "lies", goes over all of the nearby words. He finds a few of those words relevant to his own word "lies", and he's tasked with filling an empty flask with the potions of those words. The alchemist has a recipe, listing specific criteria that identify which potions he should pay attention to.
- A tag (aka "key"): Each potion (value) comes with a tag (key). If the tag (key) matches well with the alchemist's recipe (query), the alchemist will pay attention to that potion.
Attention: the Alchemist’s Potion Mixology
In the first step (attention), the alchemists of all the words each go out on their own quests to fill their flasks with potions from relevant words.
Let's take the alchemist of the word "lies" as an example. He knows from previous experience (after being pre-trained on the entire web of text) that the words that help interpret "lies" in a sentence are often of the form: "some flat surfaces, words related to dishonesty, words related to resting". He writes these criteria down in his recipe (query) and looks for tags (keys) on the potions of the other words. If a tag is very similar to his criteria, he'll pour a lot of that potion into his flask; if the tag is not similar, he'll pour little or none of it.
So he finds that the tag for "bed" says "a flat piece of furniture". That's similar to "some flat surfaces" in his recipe! He pours the potion for "bed" into his flask. The potion (value) for "bed" contains information like "tired, restful, sleepy, sick".
The alchemist for the word "lies" continues the search. He finds that the tag for the word "still" says "related to resting" (among other connotations of the word "still"). That matches his criterion "words related to resting", so he pours in part of the potion from "still", which contains information like "restful, silent, stationary".
He looks at the tags of "on", "Sarah", "the", and "feeling" and doesn't find them relevant. So he doesn't pour any of their potions into his flask.
Remember, he needs to check his own potion too. The tag of his own potion "lies" says "a verb related to resting", which matches his recipe. So he pours some of his own potion into the flask as well, which contains information like "tired; dishonest; can have a positive connotation if it's a white lie; …".
By the end of his quest through the words in the text, his flask is full.
Unlike the original potion for "lies", this mixed potion now takes into account the context of this very specific sentence. Namely, it has a lot of elements of "tired, exhausted" and only a pinch of "dishonest".
In this quest, the alchemist knows to pay attention to the right words and combines the values (potions) of those relevant words. This is a metaphorical step for "attention". We've just explained the most important equation of the Transformer, the underlying architecture of GPT:
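In standard notation, that equation is the scaled dot-product attention from the Transformer paper:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

[Here Q, K, and V are matrices stacking the queries (recipes), keys (tags), and values (potions) of all the words, and d_k is the dimension of the keys. The √d_k scaling just keeps the dot products at a manageable size before the softmax.]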
Advanced notes:
1. Each alchemist looks at every bottle, including his own [Q·K.transpose()].
2. The alchemist can match his recipe (query) against a tag (key) quickly and make a fast decision. [The similarity between query and key is determined by a dot product, which is a fast operation.] Moreover, all alchemists do their quests in parallel, which also helps speed things up. [Q·K.transpose() is a matrix multiplication, which is parallelizable. Speed is a winning feature of the Transformer, compared to its predecessor, the Recurrent Neural Network, which computes sequentially.]
3. The alchemist is picky. He only selects the top few potions, instead of blending in a little bit of everything. [We use softmax to collapse Q·K.transpose(). Softmax pulls the inputs toward more extreme values and collapses many of them to near zero.]
4. At this stage, the alchemist doesn't take the ordering of words into account. Whether it's "Sarah lies still on the bed, feeling" or "still bed the Sarah feeling on lies", the filled flask (the output of attention) will be the same. [In the absence of "positional encoding", Attention(Q, K, V) is independent of word positions.]
5. The flask always comes back 100% filled, no more, no less. [The softmax is normalized to 1.]
6. The alchemist's recipe and the potions' tags must speak the same language. [The Query and Key must have the same dimension so they can be dot-producted together to communicate. The Value can take on a different dimension if you wish.]
7. Technically astute readers may point out that we didn't do masking. I don't want to clutter the analogy with too many details, but I'll explain it here. In self-attention, each word can only see the words before it. So in the sentence "Sarah lies still on the bed, feeling", "lies" only sees "Sarah"; "still" only sees "Sarah" and "lies". The alchemist of "still" can't reach into the potions of "on", "the", "bed", and "feeling". (The masking is shown concretely in the sketch right after these notes.)
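For readers who want the mechanics spelled out, here is a minimal NumPy sketch of the attention step described above, including the causal mask from note 7. The shapes and variable names are made up for illustration; a real GPT adds batching, multiple attention heads, and learned projections that produce Q, K, and V in the first place.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K: (seq_len, d_k) -- recipes and tags share a dimension (note 6).
    V:    (seq_len, d_v) -- potions may have a different dimension.
    """
    seq_len, d_k = Q.shape
    # Every alchemist compares his recipe with every tag, all in parallel (notes 1-2).
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: each word may only look at itself and earlier words (note 7).
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax turns scores into mixing proportions that sum to 1 (notes 3 and 5).
    weights = softmax(scores, axis=-1)
    # Fill each flask: a weighted blend of the potions (values).
    return weights @ V

# Toy example: 8 words, recipes/tags of size 16, potions of size 32.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
V = rng.normal(size=(8, 32))
print(causal_self_attention(Q, K, V).shape)  # (8, 32): one mixed flask per word
```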
Feed Forward: Chemistry on the Mixed Potions
Up to this point, the alchemist has simply been pouring potions from other bottles. In other words, he pours the potion of "lies" ("tired; dishonest; …") as a uniform mixture into the flask; he can't distill out the "tired" part and discard the "dishonest" part just yet. [Attention simply sums the different V's together, weighted by the softmax.]
Now comes the real chemistry (feed forward). The alchemist mixes everything together and does some synthesis. He notices interactions between ingredients like "sleepy" and "restful", etc. He also notices that "dishonesty" is only mentioned in a single potion. He knows from past experience how to make some ingredients interact with one another and how to discard the one-off ones. [The feed forward layer is a linear (and then non-linear) transformation of the Value. The feed forward layer is the building block of neural networks. You can think of it as the "thinking" step in the Transformer, while the earlier mixology step is simply "collecting".]
The resulting potion after his processing becomes much more useful for the task of predicting the next word. Intuitively, it represents richer properties of this word in the context of its sentence, in contrast with the starting potion (value), which is out of context.
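As a rough sketch, the feed-forward step applies the same small two-layer network to each word's mixed flask independently. The layer sizes and the ReLU non-linearity below are my own illustrative choices, not GPT's actual configuration.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: a linear map, a non-linearity,
    then another linear map, applied to each word's vector independently."""
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU: the "synthesis" step
    return hidden @ W2 + b2

# Toy shapes: 8 words, model dimension 32, hidden dimension 128 (made up).
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 32))                        # the filled flasks from attention
W1, b1 = rng.normal(size=(32, 128)), np.zeros(128)
W2, b2 = rng.normal(size=(128, 32)), np.zeros(32)
print(feed_forward(x, W1, b1, W2, b2).shape)        # (8, 32): one processed potion per word
```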
The Final Linear and Softmax Layer: the Assembly of Alchemists
How do we get from here to the final output, which is to predict that the next word after "Sarah lies still on the bed, feeling ___" is "tired"?
So far, each alchemist has been working independently, tending only to his own word. Now the alchemists of all the different words assemble, stack their flasks in the original word order, and present them to the final linear and softmax layer of the Transformer. What do I mean by this? Here, we must depart from the metaphor.
This final linear layer synthesizes information across the different words. Based on the pre-training data, one plausible thing to learn is that the immediately preceding word matters a lot for predicting the next word. For instance, the linear layer might focus heavily on the last flask ("feeling"'s flask).
Then, combined with the softmax layer, this step assigns every word in our vocabulary a probability of being the next word after "Sarah lies still on the bed, feeling…". For instance, non-English words will receive probabilities near 0, while words like "tired", "sleepy", and "exhausted" will receive high probabilities. We then pick the top winner as the final answer.
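Here is one way to picture that final step in code, under the same simplification made above (focus on the last flask). The toy vocabulary, shapes, and random weights are all invented for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

vocab = ["tired", "sleepy", "exhausted", "happy", "banana"]   # toy vocabulary

rng = np.random.default_rng(2)
h_last = rng.normal(size=32)                # processed flask for the last word, "feeling"
W_out = rng.normal(size=(32, len(vocab)))   # final linear layer (weights made up here)

logits = h_last @ W_out                     # one score per vocabulary word
probs = softmax(logits)                     # probabilities that sum to 1
print(vocab[int(np.argmax(probs))])         # whichever toy word happens to score highest
```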
Recap
Now you’ve built a minimalist GPT!
To recap: in the attention step, you determine which words (including itself) each word should pay attention to, based on how well that word's query (recipe) matches the other words' keys (tags). You mix together those words' values (potions) in proportion to the attention the word pays to them. You process this mixture to do some "thinking" (feed forward). Once each word has been processed, you then combine the mixtures from all the words to do more "thinking" (linear layer) and make the final prediction of what the next word should be.
Side note: the term "decoder" is a vestige of the original paper, as the Transformer was first used for machine translation tasks. You "encode" the source language into embeddings, and "decode" from the embeddings into the target language.