Almost all large language models (LLMs) rely on the Transformer neural architecture. While this architecture is praised for its efficiency, it has some well-known computational bottlenecks.
During decoding, one of these bottlenecks is the computation of attention, which relies on pairs of key-value tensors for each token of the input. All these tensors must be stored in memory.
Note: I won’t explain in this article what the role of these key-value pairs is. It’s one of the most complicated and interesting aspects of the Transformer architecture. If you don’t know about it, I strongly recommend reading The Illustrated Transformer by Jay Alammar.
As LLMs accept longer and longer inputs, e.g., Claude accepts 100k-token inputs, the memory consumed by these tensors can become very large.
Naively storing all these tensors in memory leads to memory over-reservation and fragmentation. Fragmentation can make memory access very inefficient, especially for long sequences of tokens. As for over-reservation, the system does it to make sure it has allocated enough memory for the tensors, even if it doesn’t consume all of it.
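To get an order of magnitude, here is a rough, back-of-the-envelope estimate for a hypothetical 13-billion-parameter model with 40 layers and a hidden size of 5,120, stored in float16 (the exact figures depend on the model):
# Rough KV cache estimate, assuming a hypothetical 13B model with
# 40 layers, a hidden size of 5,120, and float16 (2 bytes) storage
num_layers = 40
hidden_size = 5120
bytes_per_value = 2  # float16

# Keys and values are both cached, hence the factor of 2
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
print(kv_bytes_per_token / 1024)            # ~800 KB per token
print(kv_bytes_per_token * 2048 / 1024**3)  # ~1.6 GB for a 2,048-token sequence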
To alleviate these issues, UC Berkeley proposes PagedAttention.
PagedAttention is implemented in vLLM (Apache 2.0 license), which is deployed by LMSYS, an organization for open research founded by students and faculty from UC Berkeley with the help of UCSD and CMU.
In this article, I explain what PagedAttention is and why it significantly speeds up decoding. Toward the end of the article, I show how to get started with vLLM to exploit PagedAttention for inference and for serving LLMs on your computer.
Kwon et al. (2023) propose PagedAttention.
The goal is to store key-value tensors more efficiently in the non-contiguous spaces of the GPU VRAM.
In short, the idea behind PagedAttention is to create contiguous virtual blocks mapped to physical blocks in the GPU memory.
Each block is designed to store the key-value tensors of a predefined number of tokens. The blocks are virtually contiguous but are mapped to physically non-contiguous blocks, allocated on demand during inference, in the fragmented GPU memory. A simple index table is also kept in memory to associate virtual blocks with physical blocks.
The PagedAttention kernel fetches these blocks as needed. This is efficient because the system only fetches a small number of key-value tensors at a time, thanks to the limited size of the blocks.
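To make the mapping more concrete, here is a minimal Python sketch of such an index table. It is only an illustration of the idea, not vLLM’s actual implementation; the block size of 4 and the block ids are arbitrary:
# Toy block table: virtual (logical) block indices map to physical blocks
# that are allocated on demand, in any order, from a pool of free blocks.
BLOCK_SIZE = 4  # arbitrary number of tokens whose key-value tensors fit in one block

class BlockTable:
    def __init__(self, free_physical_blocks):
        self.free = list(free_physical_blocks)  # ids of free physical blocks
        self.table = []  # virtual block i -> physical block id

    def append_token(self, num_tokens_so_far):
        # A new physical block is allocated only when the current one is full
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())

table = BlockTable(free_physical_blocks=[7, 3, 12, 0])
for t in range(11):  # 11 tokens -> 3 virtual blocks
    table.append_token(t)
print(table.table)  # [0, 12, 3]: contiguous virtually, scattered physically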
Let’s take the following prompt for illustration:
the cat is sleeping in the kitchen and the dog is
We have key-value tensors for each token. With PagedAttention, we can (arbitrarily) set the block size to 4. Each block then contains 4 key-value tensors, except the last one, which contains only 3. The blocks are virtually contiguous but are not necessarily contiguous in the GPU memory, as illustrated by the figure in the introduction of this article.
To compute attention, for each query token, the system fetches the blocks one by one, as illustrated below.
By fetching the key-value tensors block by block, instead of the entire sequence of tensors at once, the computation of attention is much faster.
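The sketch below illustrates this on the CPU with NumPy for a single (hypothetical) query: the key-value tensors of the 11-token prompt are grouped into blocks of 4, fetched block by block, and the result matches attention over the full sequence. It is only an illustration, not vLLM’s CUDA kernel:
import numpy as np

d = 8           # hypothetical head dimension
block_size = 4  # tokens per block
tokens = "the cat is sleeping in the kitchen and the dog is".split()  # 11 tokens

rng = np.random.default_rng(0)
keys = rng.standard_normal((len(tokens), d))
values = rng.standard_normal((len(tokens), d))
query = rng.standard_normal(d)

# Key-value tensors grouped into blocks of 4 tokens (the last block holds 3)
key_blocks = [keys[i:i + block_size] for i in range(0, len(tokens), block_size)]
value_blocks = [values[i:i + block_size] for i in range(0, len(tokens), block_size)]

# Fetch the blocks one by one to compute the attention scores
scores = np.concatenate([block @ query for block in key_blocks]) / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
output = np.concatenate(value_blocks).T @ weights  # same result as full attention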
Another advantage of PagedAttention is that virtual blocks can be shared when sampling during inference. All the sequences generated in parallel via sampling or beam search can use the same virtual blocks, avoiding duplicates.
In their experiments, LMSYS observed a 55% reduction in memory usage for beam search decoding.
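A toy way to picture this sharing (again, a simplification and not vLLM’s actual code): each parallel sequence keeps its own block table, but the tables point to the same physical blocks for the shared prompt, and a reference count tracks when a block can be freed:
# The prompt's key-value blocks are stored once; each sequence only
# stores pointers to them in its own block table.
shared_prompt_blocks = [0, 12, 3]  # physical blocks holding the prompt's tensors
ref_count = {b: 0 for b in shared_prompt_blocks}

sequences = []
for _ in range(4):  # e.g., 4 completions sampled in parallel for the same prompt
    block_table = list(shared_prompt_blocks)  # reuse the blocks, no tensor copy
    for b in shared_prompt_blocks:
        ref_count[b] += 1
    sequences.append(block_table)

# A physical block can only be freed once no sequence references it anymore
print(ref_count)  # {0: 4, 12: 4, 3: 4}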
Before trying it ourselves, let’s have a look at the performance reported by the authors (UC Berkeley/LMSYS) when using PagedAttention implemented in vLLM, compared to the text generation inference (TGI) library developed by Hugging Face.
vLLM looks much faster according to these results, especially in the case of multiple output completions. The difference between TGI and vLLM increases with larger models. This is expected since larger models require more memory and are thus more impacted by memory fragmentation.
Overall, vLLM is up to 24x faster than the Hugging Face Transformers library.
Note: Actually, I’m also impressed by the improvement from HF to TGI. I haven’t covered TGI yet on my blog, but I’ll probably write a guide about it. TGI is used in production at Hugging Face. While it seems much slower than vLLM, TGI has other advantages, such as support for many more models and features.
Note: vLLM doesn’t support CUDA 12 yet. Use a lower version, such as 11.8.
In this section, I only go through the basics of how to set up and run vLLM on your computer. For more advanced usage, you can have a look at the vLLM documentation.
As I write this article, vLLM only supports a few types of models:
- GPT-2
- GPT-NeoX and Pythia based
- LLaMa based
- OPT based
You can add support for other models by following these instructions.
In the code below, I use Dolly V2 (MIT license). It’s a chat model based on Pythia and trained by Databricks.
I chose the smallest version, with 3 billion parameters. It can run on a consumer GPU with 24 GB of VRAM, e.g., an NVIDIA RTX 3080/3090.
The most straightforward way to install vLLM is with pip:
pip install vllm
Note: This can take up to 10 minutes.
But in my case, on both my computer and Google Colab, pip failed to install the vllm library. The authors of vLLM confirm that there is a problem with some nvcc versions and environments. Nonetheless, for most configurations, pip should install vLLM without any problem.
If you are in the same situation as me, the workaround is simply to use a Docker image. This one worked for me:
docker run --gpus all -it --rm --shm-size=8g nvcr.io/nvidia/pytorch:22.12-py3
Note: Once inside the Docker container, the authors recommend removing PyTorch before installing vLLM: pip uninstall torch. Then, pip install vllm should work.
Then, we can start writing Python.
We first need to import vllm and then load the model with it. Inference is triggered by llm.generate().
from vllm import LLM

prompts = ["Tell me about gravity"]  # You can put several prompts in this list
llm = LLM(model="databricks/dolly-v2-3b")  # Load the model
outputs = llm.generate(prompts)  # Trigger inference
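If you want more control over decoding, you can also pass sampling parameters and read the generated text from the returned objects. A short example (the parameter values below are arbitrary; adjust them to your needs):
from vllm import LLM, SamplingParams

prompts = ["Tell me about gravity"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=200)

llm = LLM(model="databricks/dolly-v2-3b")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)           # the original prompt
    print(output.outputs[0].text)  # the generated completion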
You can also use vLLM for serving LLMs. It works similarly to TGI. It’s also much simpler than running the NVIDIA Triton Inference Server that I described in a previous article.
You first need to start the server:
python -m vllm.entrypoints.openai.api_server --model databricks/dolly-v2-3b
Note: The server will listen on port 8000. Make sure this port is available or change it in the vLLM configuration.
Then, you can query the server with prompts as follows:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "databricks/dolly-v2-3b",
        "prompt": "Tell me about gravity",
        "max_tokens": 200
    }'
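If you prefer querying the server from Python, the same request can be sent with the requests library (a minimal sketch mirroring the curl command above):
import requests

# Same request as the curl command above, sent from Python
response = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": "databricks/dolly-v2-3b",
        "prompt": "Tell me about gravity",
        "max_tokens": 200,
    },
)
print(response.json())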
And that’s it! You now have a very efficient LLM server running on your computer.
PagedAttention significantly speeds up inference. It’s another step toward more cost-effective AI with LLMs.
In further experiments, I confirmed that vLLM is especially efficient with batches of prompts. To fully take advantage of vLLM, consider optimizing your batching strategy for inference.
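For instance, passing all your prompts in a single call lets vLLM batch them for you (a minimal sketch; the prompts are just examples):
from vllm import LLM

prompts = [
    "Tell me about gravity",
    "Write a short poem about the sea",
    "What is the capital of Italy?",
]
llm = LLM(model="databricks/dolly-v2-3b")
outputs = llm.generate(prompts)  # one call, batched inference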
While beam search with large beams would have been prohibitive with the standard attention computation, beam search with PagedAttention is faster and more memory efficient.
One of my next experiments will be to combine PagedAttention with QLoRa to reduce memory usage. It should be straightforward. It would make running LLMs on consumer hardware even more efficient.