The A-Z of Transformers: Everything You Need to Know
Why another tutorial on Transformers?
A little bit of History first:
The Transformer Architecture
Positional Encoding
The Attention Mechanism (Single Head)
Multi-Headed Attention
Assembling the pieces of the Transformer
Decoder
Transformer Model Details
Conclusion
References

Everything you need to know about Transformers, and how to implement them

Towards Data Science
Image by author

You’ve probably already heard of Transformers, and everyone is talking about them, so why write a new article about them?

Well, I’m a researcher, and this requires me to have a very deep understanding of the tools I use (because if you don’t understand them, how can you identify where they are flawed and how you can improve them, right?).

As I ventured deeper into the world of Transformers, I found myself buried under a mountain of resources. And yet, despite all that reading, I was left with a general sense of the architecture and a trail of lingering questions.

In this guide, I aim to bridge that knowledge gap: a guide that gives you a strong intuition on Transformers, a deep dive into the architecture, and an implementation from scratch.

I strongly advise you to follow along with the code on GitHub:

Enjoy! 🤗

Many attribute the concept of the attention mechanism to the renowned paper “Attention is All You Need” by the Google Brain team. However, this is only part of the story.

The roots of the attention mechanism can be traced back to an earlier paper titled “Neural Machine Translation by Jointly Learning to Align and Translate”, authored by Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio.

Bahdanau’s primary challenge was addressing the limitations of Recurrent Neural Networks (RNNs). Specifically, when encoding lengthy sentences into vectors using RNNs, crucial information was often lost.

Drawing parallels from translation exercises, where one often revisits the source sentence while translating, Bahdanau aimed to allocate weights to the hidden states within the RNN. This approach yielded impressive results and is depicted in the following diagram.

Image from Neural machine translation by jointly learning to align and translate

However, Bahdanau wasn’t the only one tackling this issue. Taking cues from his groundbreaking work, the Google Brain team posited a bold idea:

“Why not strip everything down and focus solely on the attention mechanism?”

They believed it wasn’t the RNN but the attention mechanism that was the primary driver behind the success.

This conviction culminated in their paper, aptly titled “Attention is All You Need”.

Fascinating, right?

1. First things first, Embeddings

This diagram represents the Transformer architecture. Don’t worry if you don’t understand anything at first, we’ll cover absolutely everything.

Embeddings, Image from article modified by author

From Text to Vectors, the Embedding Process: Imagine our input is a sequence of words, say “The cat drinks milk”. This sequence has a length termed seq_len. Our immediate task is to convert these words into a form that the model can understand, specifically vectors. That is where the Embedder comes in.

Each word undergoes a transformation to become a vector. This process is termed ‘embedding’. Each of these vectors, or ‘embeddings’, has a size of d_model = 512.

Now, what exactly is this Embedder? At its core, it is a linear mapping (a matrix), denoted by E. You can visualize it as a matrix of size (d_model, vocab_size), where vocab_size is the size of our vocabulary.

After the embedding process, we end up with a collection of vectors of size d_model each. It’s crucial to understand this format, because it’s a recurring theme: you’ll see it across various stages like encoder input, encoder output, and so on.

Let’s code this part:

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)  # lut: lookup table of shape (vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

Note: we multiply by sqrt(d_model) for normalization purposes (explained later)

Note 2: I personally wondered whether we use a pre-trained embedder, or at least start from a pre-trained one and fine-tune it. But no, the embedding is learned entirely from scratch and initialized randomly.
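To make this concrete, here is a minimal usage sketch (assuming the standard imports used throughout this article; the vocabulary size and token ids below are made up for illustration):

import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
embed = Embeddings(d_model, vocab_size)

# A batch of 2 sentences, each encoded as 4 token ids
tokens = torch.tensor([[5, 27, 318, 9], [12, 4, 0, 0]])
vectors = embed(tokens)
print(vectors.shape)  # torch.Size([2, 4, 512])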

Why Do We Need Positional Encoding?

Given our current setup, we possess a list of vectors representing words. If fed as-is to a Transformer model, there’s a key element missing: the sequential order of words. Words in natural languages often derive meaning from their position. “John loves Mary” carries a different meaning from “Mary loves John.” To make sure our model captures this order, we introduce Positional Encoding.

Now, you might wonder, “Why not just add a simple increment like +1 for the first word, +2 for the second, and so on?” There are several challenges with this approach:

  1. Multidimensionality: Each token is represented in 512 dimensions. A mere increment wouldn’t suffice to capture this complex space.
  2. Normalization Concerns: Ideally, we want our values to lie between -1 and 1. Directly adding large numbers (like +2000 for a long text) would be problematic.
  3. Sequence Length Dependency: Using direct increments is not scale-agnostic. For a long text, where a position could be +5000, that number does not truly reflect the relative position of the token in its sentence. And the meaning of a word depends more on its relative position in a sentence than on its absolute position in a text.

If you have studied mathematics, the idea of circular coordinates, specifically sine and cosine functions, should resonate with your intuition. These functions provide a unique way to encode position that meets our needs.

Given our matrix of size (seq_len, d_model), our aim is to add another matrix, the Positional Encoding, of the same size.

Here’s the core concept:

  1. For each token, the authors suggest encoding the even dimensions (2k) with a sine and the odd dimensions (2k+1) with a cosine (the exact formula is given right after the figure below).
  2. If we fix the token position and move across the dimensions, we can see that the sine/cosine waves decrease in frequency.
  3. If we look at a token that is further along in the text, this phenomenon happens more rapidly (the frequency is higher).
Image from article
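For reference, the exact formula from the original paper is:

PE(pos, 2k)   = sin(pos / 10000^(2k / d_model))
PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))

where pos is the position of the token in the sequence and 2k, 2k+1 index the embedding dimensions.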

This is summed up in the following graph (but don’t scratch your head too much over it). The key takeaway is that Positional Encoding is a mathematical function that allows the Transformer to keep a sense of the order of the tokens in a sentence. This is still a very active area of research.

Positional Embedding, Image by author
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
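Here is a small sketch of how the two modules compose (shapes only; the token ids are again made up for illustration):

d_model, vocab_size = 512, 10000
embed = Embeddings(d_model, vocab_size)
pos_enc = PositionalEncoding(d_model, dropout=0.1)

tokens = torch.tensor([[5, 27, 318, 9]])   # (batch=1, seq_len=4)
x = pos_enc(embed(tokens))                 # embeddings + positional encoding
print(x.shape)                             # torch.Size([1, 4, 512])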

Let’s dive into the core concept of Google’s paper: the Attention Mechanism

High-Level Intuition:

At its core, the attention mechanism is a communication mechanism between vectors/tokens. It allows a model to focus on specific parts of the input when producing an output. Think of it as shining a spotlight on certain parts of your input data. This “spotlight” can be brighter on more relevant parts (giving them more attention) and dimmer on less relevant parts.

For a sentence, attention helps determine the relationship between words. Some words are closely related to one another in meaning or function within a sentence, while others are not. The attention mechanism quantifies these relationships.

Example:

Consider the sentence: “She gave him her book.”

If we focus on the word “her”, the attention mechanism might determine that:

  • It has a strong connection to “book”, because “her” indicates possession of the “book”.
  • It has a medium connection to “She”, because “She” and “her” likely refer to the same entity.
  • It has a weaker connection to other words like “gave” or “him”.

Technical Dive into the Attention mechanism

Scaled Dot-Product Attention, image from article

For each token, we generate three vectors:

  1. Query (Q):

Intuition: Think of the query as a “question” that a token poses. It represents the current word and tries to find out which parts of the sequence are relevant to it.

  2. Key (K):

Intuition: The key can be thought of as an “identifier” for each word in the sequence. When the query “asks” its question, the key helps in “answering” by determining how relevant each word in the sequence is to the query.

  3. Value (V):

Intuition: Once the relevance of each word (via its key) to the query is determined, we need the actual information or content from those words to help the current token. That is where the value comes in. It represents the content of each word.

How are Q, K, V generated?

Q, K, V generation, image by author

The similarity between a query and a key is a dot product (which measures the similarity between two vectors), divided by the standard deviation of this random variable, to keep everything normalized.

Attention formula, Image from article

Let’s illustrate this with an example:

Let’s imagine we have one query, and want to figure out the result of the attention with K and V:

Q, K, V, Image by author

Now let’s compute the similarities between q1 and the keys:

Dot Product, Image by author

While the numbers 3/2 and 1/8 might seem relatively close, the softmax function’s exponential nature amplifies their difference.

Attention weights, Image by author

This differential suggests that q1 has a more pronounced connection to k1 than to k2.

Now let’s look at the result of attention, which is a combination of the values weighted by the attention weights:

Attention, Image by author

Great! Repeating this operation for every token (q1 through qn) yields a collection of n vectors.

In practice, this operation is vectorized into a matrix multiplication for efficiency.

Let’s code it:

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
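As a quick sanity check on the shapes, we can feed random tensors standing in for the projected queries, keys, and values (d_k = 64 here is just an example value):

torch.manual_seed(0)
q = torch.randn(1, 4, 64)   # (batch, seq_len, d_k)
k = torch.randn(1, 4, 64)
v = torch.randn(1, 4, 64)

out, p_attn = attention(q, k, v)
print(out.shape)     # torch.Size([1, 4, 64]): one output vector per query
print(p_attn.shape)  # torch.Size([1, 4, 4]): each row of attention weights sums to 1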

What’s the Issue with Single-Headed Attention?

With single-headed attention, every token gets to pose only one question. This generally translates into it deriving a strong relationship with only one other token, given that the softmax tends to heavily weigh one value while pushing the others close to zero. Yet, when you consider language and sentence structure, a single word often has connections to multiple other words, not just one.

To tackle this limitation, we introduce multi-headed attention. The core idea? Let each token pose multiple questions (queries) simultaneously, by running the attention process in parallel h times. The original Transformer uses 8 heads.

Multi-Headed attention, image from article

Once we get the results of the 8 heads, we concatenate them into a matrix.

Multi-Headed attention, image from article

This is also straightforward to code; we just need to be careful with the dimensions:

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        del query
        del key
        del value
        return self.linears[-1](x)
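A minimal self-attention usage sketch (note that the clones helper used inside the class is defined further down, in the Encoder section):

mha = MultiHeadedAttention(h=8, d_model=512)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = mha(x, x, x)            # self-attention: query = key = value = x
print(out.shape)              # torch.Size([2, 10, 512])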

You should start to understand why Transformers are so powerful now: they exploit parallelism to the fullest.

At a high level, a Transformer is the combination of three elements: an Encoder, a Decoder, and a Generator.

Encoder, Decoder, Generator, Image from article modified by author

1. The Encoder

  • Purpose: Convert an input sequence into a new sequence (often of smaller dimension) that captures the essence of the original data.
  • Note: If you’ve heard of the BERT model, it uses only this encoder part of the Transformer.

2. The Decoder

  • Purpose: Generate an output sequence using the encoded sequence from the Encoder.
  • Note: The decoder in the Transformer is different from a typical autoencoder’s decoder. In the Transformer, the decoder not only looks at the encoded output but also considers the tokens it has generated so far.

3. The Generator

  • Purpose: Convert a vector into a token. It does this by projecting the vector to the size of the vocabulary and then picking the most likely token with the softmax function.

Let’s code that:

from torch.nn.functional import log_softmax

class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

One remark here: “src” refers to the input sequence, and “tgt” refers to the target sequence being generated. Keep in mind that we generate the output in an autoregressive manner, token by token, so we need to keep track of the target sequence as well. The sketch below makes this concrete.
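Here is a minimal greedy-decoding sketch showing that autoregressive loop (my own illustration, not code from the article’s repository: model is assumed to be an EncoderDecoder instance, and the causal mask is built inline with torch.tril):

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    "Generate tokens one at a time, always picking the most likely next token."
    memory = model.encode(src, src_mask)
    ys = torch.full((1, 1), start_symbol, dtype=torch.long)
    for _ in range(max_len - 1):
        # Causal mask: position i may only attend to positions <= i
        tgt_mask = torch.tril(torch.ones(1, ys.size(1), ys.size(1))).bool()
        out = model.decode(memory, src_mask, ys, tgt_mask)
        log_probs = model.generator(out[:, -1])           # scores for the last position
        next_token = log_probs.argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_token], dim=1)
    return ys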

Stacking Encoders

The Transformer’s Encoder isn’t just one layer. It’s actually a stack of N layers. Specifically:

  • The Encoder in the original Transformer model consists of a stack of N=6 identical layers.

Inside the Encoder layer, we can see that there are two sublayer blocks which are very similar ((1) and (2)): a residual connection followed by a layer norm.

  • Block (1) Self-Attention Mechanism: Helps the encoder focus on different words in the input when generating the encoded representation.
  • Block (2) Feed-Forward Neural Network: A small neural network applied independently to each position.
Encoder Layer, residual connections, and Layer Norm, Image from article modified by author

Now let’s code that:

SublayerConnection first:

We follow the general architecture, and we can substitute “sublayer” with either “self-attention” or “FFN”.

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note: for code simplicity the norm is applied first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = nn.LayerNorm(size)  # Use PyTorch's LayerNorm
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

Now we can define the full Encoder layer:

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        # self attention, block 1
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # feed forward, block 2
        x = self.sublayer[1](x, self.feed_forward)
        return x

The Encoder Layer is ready; now let’s just chain N of them together to form the full Encoder:

import copy

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

The Decoder, just like the Encoder, is structured as multiple identical layers stacked on top of one another. The number of these layers is typically 6 in the original Transformer model.

How is the Decoder different from the Encoder?

A third SubLayer is added to interact with the encoder: this is Cross-Attention.

  • SubLayer (1) is the same as in the Encoder: it’s the Self-Attention mechanism, meaning that we generate everything (Q, K, V) from the tokens fed into the Decoder.
  • SubLayer (2) is the new communication mechanism: Cross-Attention. It is named that way because we use the output from (1) to generate the Queries, and we use the output from the Encoder to generate the Keys and Values (K, V). In other words, to generate a sentence we have to look both at what the Decoder has generated so far (self-attention) and at what we asked for in the first place in the Encoder (cross-attention).
  • SubLayer (3) is identical to the one in the Encoder.
Decoder Layer, self attention, cross attention, Image from article modified by author

Now let’s code the DecoderLayer. If you understood the mechanism in the EncoderLayer, this should be quite straightforward.

class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        # Block 1: masked self-attention over the tokens generated so far
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # Block 2: new sublayer (cross-attention with the encoder output)
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        # Block 3: feed-forward network
        return self.sublayer[2](x, self.feed_forward)

And now we can chain the N=6 DecoderLayers to form the Decoder:

class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

At this point, you have understood around 90% of what a Transformer is. There are still a few details:

Padding:

  • In a typical Transformer, there’s a maximum length for sequences (e.g., max_len=5000). This defines the longest sequence the model can handle.
  • However, real-world sentences vary in length. To handle shorter sentences, we use padding.
  • Padding is the addition of special “padding tokens” to make all sequences in a batch the same length (a minimal mask-building sketch follows the figure below).
Padding, image by author
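A minimal sketch of how a padding mask could be built (assuming the padding token has id 0, which is a convention chosen here, not a requirement):

pad_id = 0
src = torch.tensor([[5, 27, 318, 9],
                    [12, 4, pad_id, pad_id]])   # batch of 2 sequences, the second one padded
src_mask = (src != pad_id).unsqueeze(-2)        # (batch, 1, seq_len), broadcast over queries
# First row: [True, True, True, True]; second row: [True, True, False, False]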

Masking

Masking ensures that, during the attention computation, certain tokens are ignored.

Two scenarios for masking:

  • src_masking: Since we have added padding tokens to the sequences, we don’t want the model to pay attention to those meaningless tokens. Hence, we mask them out.
  • tgt_masking, or Look-Ahead/Causal Masking: In the decoder, when generating tokens sequentially, each token should only be influenced by previous tokens and never future ones. For instance, when generating the fifth word of a sentence, the model shouldn’t know about the sixth word. This ensures a sequential generation of tokens.
Causal Masking/Look-Ahead masking, image by author

We then use this mask to add minus infinity to the corresponding scores, so that the associated tokens are ignored by the softmax. This example should clarify things, and a short code sketch follows the figure:

Masking, a trick inside the softmax, image by author
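A minimal sketch of the causal mask, a lower-triangular matrix of ones (the helper name is my own choice; combined with masked_fill(mask == 0, -1e9) in attention(), the zeros above the diagonal turn into minus infinity before the softmax):

def subsequent_mask(size):
    "Each position may only attend to earlier positions (and itself)."
    return torch.tril(torch.ones(1, size, size)).bool()

print(subsequent_mask(4)[0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)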

FFN: Feed Forward Network

  • The “Feed Forward” layer in the Transformer’s diagram is a tad misleading. It’s not just one operation, but a sequence of them.
  • The FFN consists of two linear layers. Interestingly, the input, which is of dimension d_model=512, is first projected to a higher dimension d_ff=2048 and then mapped back to its original dimension (d_model=512).
  • This can be visualized as the data being “expanded” during the operation before being “compressed” back to its original size.
Image from article modified by author

This is easy to code:

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Expand to d_ff, apply ReLU and dropout, then project back to d_model
        return self.w_2(self.dropout(self.w_1(x).relu()))
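With every building block defined, here is a sketch of how they could be assembled into a complete model (a minimal wiring of the classes above; the function name, the Xavier initialization, and the hyperparameter defaults follow the original paper and The Annotated Transformer, but treat this as an illustration rather than the article’s exact code):

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Assemble the full Encoder-Decoder Transformer from the blocks above."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )
    # Initialize parameters with Glorot / Xavier uniform
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model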

The unparalleled success and popularity of the Transformer model can be attributed to several key factors:

  1. Flexibility: Transformers can work with any sequence of vectors. These vectors can be embeddings of words. It is easy to transpose this to Computer Vision by splitting an image into patches and unfolding each patch into a vector. Even in Audio, we can split a signal into chunks and vectorize them.
  2. Generality: With minimal inductive bias, the Transformer is free to capture intricate and nuanced patterns in data, thereby enabling it to learn and generalize better.
  3. Speed & Efficiency: Leveraging the immense computational power of GPUs, Transformers are designed for parallel processing.

Thanks for reading! Before you go:

You can run the experiments with my Transformer GitHub repository.

For more awesome tutorials, check my compilation of AI tutorials on GitHub.

You can get my articles in your inbox. Subscribe here.

If you want access to premium articles on Medium, you only need a membership for $5 a month. If you join with my link, you support me with a part of your fee at no additional cost.
