Matrix multiplication: the holy grail of deep neural networks and modern language-understanding behemoths. As MLEs or data scientists, our fingers are too quick to type `tf.matmul` or `torch.matmul`, and we never look back. But don't tell me you've never had the millisecond infatuation of wondering what happens to that matrix when it enters the GPU! If you have, you're in the right place. Join me on a journey through the fascinating intricacies inside a GPU.

I'll explain how these compute powerhouses crunch the numbers. You'll learn three little-known, impressive things GPUs do when they come face-to-face with matrices. By the end of this blog post, you'll have a solid understanding of how matrix multiplication works inside GPUs.

GEMM, or general matrix multiplication, is the kernel that gets executed when GPUs perform matrix multiplication.

`C = a (A.B) + b C`

Here, `a` and `b` are scalars, `A` is an `MxK` matrix, `B` is a `KxN` matrix, and thus `C` is an `MxN` matrix. It's as simple as that! You may wonder why that trailing addition exists. Turns out it's a pretty common pattern in neural networks (e.g. adding bias, applying ReLU, adding residual connections).

Should you’re asked to write down a matrix multiplication algorithm from first principles, here’s what you’ll do (unless you’re gifted with a GPU in lieu of a brain — wouldn’t that get monetary savings for an MLE!).

```cpp
for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < K; ++k)
            C[i][j] += A[i][k] * B[k][j];
```

Here’s an animated visual that shows you what this does.

But did you know GPUs despise this implementation 🤔? To understand why, you first need to know a bit about the GPU memory architecture.

For all comparisons and specifications, I'll be using the Nvidia A100 GPU.

A GPU has three important memory levels:

- Global memory or HBM (what you usually refer to as GPU memory, and what you see when you run `nvidia-smi`)
- Shared memory (a local memory dedicated to a single streaming multiprocessor [or SM] and shared between the threads running on that SM)
- Registers (individually allocated to threads to perform their workload)

This is what it looks like,

The first thing to notice is that shared memory (referred to as SRAM from now on) is way smaller than the HBM (and registers are smaller still). So your matrix is not going to fit in there (in most cases). If we go back to our animation, for a single row of `A`, all columns of `B` need to be retrieved, and the process repeats for every row of `A`. This means the GPU has to do many, many reads to compute the output. And the HBM (~1.5TB/s) is more than an order of magnitude slower than SRAM (~19TB/s).

To put that in numbers: say you want to multiply a `10x20` matrix by a `20x30` matrix; you'd need to read the columns of `B` `10x30=300` times. Is there a better way we can do this?

Turns out a simple trick can go a long way here! Simply flip the order of the loops, so that `k` becomes the outermost loop. And you're done! 😮

```cpp
for (int k = 0; k < K; ++k)
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            C[i][j] += A[i][k] * B[k][j];
```

We didn't touch the actual computation, just the order of the loops, so we should get the same result as before. Here's what the matrix multiplication looks like now!

You see, we only bring in one *column* of `A` and one *row* of `B` at a time, and never look back. This requires far fewer reads than the original implementation. The only difference is that before we were computing the *inner product* between two vectors; now we're computing the *outer product*.

But we still need the entire `C` in SRAM, and it may be too big to fit. What does CUDA do then? That brings us to the second trick.

Not to worry! I'm not going to blast you with complex mathematics or Leetcode algorithms. The key thing to remember is that a matrix is a 2D layout of individual tiles. The following animation does justice to what I'm trying to explain.

The result of the green block 💚 comes from the light blue strip of A 💙 and the light yellow strip of B 💛. Taking this a step further, to compute that output you can bring in one block from A's strip and one block from B's strip at a time, compute the partial output, and accumulate the result in the green box.

This gives us a flexible framework where we can load arbitrarily sized blocks (or tiles) of A and B and still compute the final answer. And we don't have to stop there; we can keep recursively dividing the problem into smaller problems, i.e. the matrix is broken into tiles, tiles into fragments, and fragments into individual values.

And this lends itself nicely to the execution architecture of a GPU. There are three layers to a kernel execution in a GPU. For simplicity, we'll say an SM runs a single thread block (although in practice they execute several concurrently, to reduce something known as the tail effect).

- Threads
- Warps (a set of 32 threads)
- Thread blocks (a set of several warps)

The exact number of threads in a thread block depends on the specific architecture. For instance, an A100 has the following specifications.

- Maximum of 2048 threads per SM
- Maximum of 1024 threads per block
- Maximum of 32 thread blocks per SM

## Sidebar #1: The magic of powers of two

Going back to tiling, it has been found (heuristically) that a matrix tile of size `256x128` per thread block gives reasonable efficiency for most problems. Therefore, it's a common tile size used by CUDA.

You may have heard of the best practice of keeping the batch size and hidden dimension sizes as powers of two. This is where that comes from! When your matrix dimensions are powers of two, the matrix divides evenly into a set of tiles with no remainder. If not, it makes your code less efficient.

GPU computations are more efficient when your matrix dimensions are powers of two.

What happens when they're not a power of two?

## Sidebar #2: Tile quantization

What happens is an effect known as *tile quantization*. In other words, if your tiles are 128 elements wide but your matrix has 257 elements in a row, you'll need not two but three tiles per row (i.e. 256+1). This is illustrated below.

The problem with this is that a thread block does the same amount of computation regardless of how much useful data resides in it. So you're taking away your GPU's opportunity to do useful computation, leading to inefficiencies.

A similar effect is known as *wave quantization*, where the matrix is so large that its thread blocks cannot all fit on the SMs at once. Then the GPU has to do the computation in 2 "waves". However, this is less of a concern for modern GPUs, as they leverage concurrency to reduce wave quantization.

Tile quantization happens when the matrix doesn't divide evenly into a thread block's tiles; wave quantization happens when the thread blocks don't divide evenly across the SMs.

The final trick is kernel fusion. More often than not, it's faster to do all the computations in a single kernel than to have two kernels called one after the other. Why? Because one kernel has to write its data to HBM and the other has to read it back. We already talked about how slow that is. A better approach is to simply combine the two operations into one. Some examples are,

So as seen here (I'm sure PyTorch has a similar glossary), there are many fused kernels offered by TensorFlow that combine commonly co-occurring operations into a single kernel. In code, it means something like this,

```cpp
for (int m = 0; m < M; m += Mtile)
    for (int n = 0; n < N; n += Ntile)
        for (int i = 0; i < Mtile; ++i)
            for (int j = 0; j < Ntile; ++j) {
                int row = m + i;
                int col = n + j;
                float tmp = 0;
                for (int k = 0; k < K; ++k)
                    tmp += A[row][k] * B[k][col];
                // Do other things (the fused epilogue) on tmp here
                C[row][col] = tmp;
            }
```

In other words, we hold on dearly to our `tmp` variable (in a register) until we've finished all our computations. Only then do we write the result back to `C`.

That's it, folks. I hope this was an enjoyable tour through the weeds of a GPU. If you're interested in the audio-visual version, here's the link to my YouTube video.

To recap, we discussed three things that make GPUs really fast at matrix multiplication.

- GPUs abandon the friendlier inner-product implementation of matmul and embrace the more read-efficient outer-product implementation
- GPUs split the matrices into smaller blocks (and blocks into fragments) and split the compute load across thread blocks, warps, and threads
- GPUs employ kernel fusion to bring commonly co-occurring functionality together, improving GPU efficiency

If you enjoyed this story, feel free to subscribe to Medium, and you'll get notifications for fresh content from me, as well as unlock full access to thousands of quality stories from other authors.

*Unless otherwise noted, all images are by the author*