With that said, let’s dive in. To understand GPT models intimately, we must start with the transformer. The transformer employs a self-attention mechanism known as scaled dot-product attention. The following explanation is derived from this insightful article on scaled dot-product attention, which I recommend for a more in-depth understanding. Essentially, for each element of an input sequence (the *i-th* element), we want to compute a weighted average of all the elements in the sequence with respect to the *i-th* element. These weights are calculated by taking the dot product of the vector at the *i-th* element with the entire input sequence and then applying a softmax, so the weights are values between 0 and 1. In the original “Attention Is All You Need” paper, these inputs are named the **query** (the entire sequence), the **key** (the vector at the *i-th* element), and the **value** (also the entire sequence). The weights passed to the attention mechanism are initialized to random values and learned as more passes occur through the neural network.
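Before looking at the nanoGPT code, here is a minimal sketch of that computation in plain PyTorch. This is a toy example with made-up sizes, and the same tensor stands in for the query, key, and value:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# toy sequence: 4 elements, each an 8-dimensional vector
x = torch.randn(4, 8)

# dot product of every i-th element with the entire sequence
scores = x @ x.T                                   # (4, 4)

# scale, then softmax so each row of weights lies in [0, 1] and sums to 1
weights = F.softmax(scores / x.size(-1) ** 0.5, dim=-1)

# the output for the i-th element is the weighted average of the sequence
out = weights @ x                                  # (4, 8)
```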

nanoGPT implements scaled dot-product attention and extends it to multi-head attention, meaning multiple attention operations occurring at once. It also implements it as a `torch.nn.Module`, which allows it to be composed with other network layers.

```python
import math

import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
```

Let’s dissect this code further, starting with the constructor. First, we confirm that the number of attention heads (`n_head`) divides the dimensionality of the embedding (`n_embd`) evenly. This is crucial because when the embedding is split into sections for each head, we want to cover the entire embedding space without any gaps. Next, we initialize two linear layers, `c_attn` and `c_proj`: `c_attn` holds all the working space for the matrices that make up a scaled dot-product attention calculation, while `c_proj` stores the final result of the calculations. The embedding dimension is tripled in `c_attn` because we need to include space for the three major components of attention: the **query**, **key**, and **value**.
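As a toy illustration of that tripling (the sizes here are made up, not nanoGPT’s defaults), a single linear layer maps `n_embd` features to `3 * n_embd`, and the output can be sliced back into three `n_embd`-sized pieces:

```python
import torch
import torch.nn as nn

n_embd = 8
c_attn = nn.Linear(n_embd, 3 * n_embd)

x = torch.randn(2, 5, n_embd)              # (batch, time, embedding)
q, k, v = c_attn(x).split(n_embd, dim=2)   # three (2, 5, 8) tensors
```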

We also have two dropout layers, `attn_dropout` and `resid_dropout`. Dropout layers randomly zero out elements of the input matrix with a given probability. According to the PyTorch docs, this serves to reduce overfitting in the model. The value in `config.dropout` is the probability that a given sample will be dropped by a dropout layer.
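A quick sketch of that behavior with toy values; note that in training mode PyTorch also rescales the surviving elements by 1/(1-p):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)           # each element is zeroed with probability 0.5

x = torch.ones(8)
y = drop(x)                        # training mode: zeros and 2.0s (survivors scaled by 1/(1-p))

drop.eval()                        # dropout is a no-op during evaluation
z = drop(x)
```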

We finish the constructor by checking whether the user has access to PyTorch 2.0, which ships an optimized version of scaled dot-product attention. If it is available, the class uses it; otherwise we set up a bias mask. This mask is part of the optional masking feature of the attention mechanism. The `torch.tril` method yields a matrix with its upper triangular section converted to zeros. When combined with `torch.ones`, it effectively generates a mask of 1s and 0s that the attention mechanism uses to produce the expected outputs for a given sampled input.
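For instance, with a toy block size of 4 the buffer looks like this:

```python
import torch

block_size = 4
mask = torch.tril(torch.ones(block_size, block_size))
# mask is:
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
# row i has 1s only at positions <= i, so element i may attend only to the left
```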

Next, we delve into the `forward` method of the class, where the attention algorithm is applied. Initially, we determine the sizes of our input matrix and split them into three dimensions: **B**atch size, **T**ime (or number of samples), and **C**orpus (or embedding size). nanoGPT employs a batched learning process, which we’ll explore in greater detail when examining the transformer model that uses this attention layer. For now, it’s sufficient to know that we’re dealing with the data in batches. We then feed the input `x` into the linear transformation layer `c_attn`, which expands the dimensionality from `n_embd` to three times `n_embd`. The output of that transformation is split into our `q`, `k`, and `v` variables, which are our inputs to the attention algorithm. Subsequently, the `view` method is used to reorganize the data in each of these variables into the format expected by the PyTorch `scaled_dot_product_attention` function.
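A toy version of that reshaping, with made-up sizes (batch 2, sequence length 5, embedding 8, 4 heads):

```python
import torch

B, T, C, n_head = 2, 5, 8, 4
hs = C // n_head                               # head size = 2

k = torch.randn(B, T, C)
k = k.view(B, T, n_head, hs).transpose(1, 2)   # (B, nh, T, hs) = (2, 4, 5, 2)
# each head now works on its own hs-sized slice of the embedding, and the
# head axis sits next to the batch axis so attention is batched over B * nh
```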

When the optimized function isn’t available, the code falls back to a manual implementation of scaled dot-product attention. It begins by taking the dot product of the `q` and `k` matrices, with `k` transposed to fit the dot-product function, and the result is scaled by the square root of the size of `k`’s last dimension. We then mask the scaled output using the previously created bias buffer, replacing the 0s with negative infinity. Next, a softmax function is applied to the `att` matrix, converting the negative infinities back to 0s and ensuring all other values are scaled between 0 and 1. We then apply a dropout layer to avoid overfitting before taking the dot product of the `att` matrix and `v`.
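The interplay of the mask and the softmax can be seen with a toy 3×3 example (using zero scores so the surviving weights come out uniform):

```python
import torch
import torch.nn.functional as F

T = 3
bias = torch.tril(torch.ones(T, T))

att = torch.zeros(T, T)                          # stand-in for the raw q @ k scores
att = att.masked_fill(bias == 0, float('-inf'))  # hide positions to the right
att = F.softmax(att, dim=-1)                     # -inf -> 0, each row sums to 1
# att is approximately:
# 1.0000 0.0000 0.0000
# 0.5000 0.5000 0.0000
# 0.3333 0.3333 0.3333
```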

Regardless of which scaled dot-product implementation is used, the multi-head output is reassembled side by side before passing through a final dropout layer and then returning the result. That is the entire implementation of the attention layer in fewer than 50 lines of Python/PyTorch. If you don’t fully understand the above code, I recommend spending some time reviewing it before proceeding with the rest of the article.