Large language models (LLMs) have emerged as a groundbreaking advancement in artificial intelligence (AI). These models, such as GPT-3, have revolutionized natural language understanding. With their ability to digest vast amounts of existing data and generate human-like text, LLMs hold immense potential to shape the future of AI and open up new possibilities for human-machine interaction and communication. Nevertheless, despite their enormous success, one significant challenge associated with these models is their computational inefficiency, which leads to slow performance even on the most powerful hardware. Because these models comprise millions or even billions of parameters, training and running them demands extensive computational resources, memory, and processing power, which are not always accessible. Furthermore, such complex architectures with slow response times can make LLMs impractical for real-time or interactive applications. As a result, addressing these challenges is essential to unlocking the full potential of LLMs and making their benefits more widely accessible.
Tackling this problem, researchers from the University of California, Berkeley, have developed vLLM, an open-source library that offers a simpler, faster, and cheaper alternative for LLM inference and serving. The Large Model Systems Organization (LMSYS) currently uses the library to power its Vicuna and Chatbot Arena services. By switching to vLLM as their backend, replacing the initial HuggingFace Transformers based backend, the organization has managed to handle peak traffic efficiently (five times more than before) while using limited computational resources and reducing high operational costs. vLLM currently supports several HuggingFace models, including GPT-2, GPT BigCode, and LLaMA, to name a few. It achieves throughput levels 24 times higher than those of HuggingFace Transformers while preserving the same model architecture and without requiring any modifications.
As part of their preliminary research, the Berkeley researchers determined that memory-related issues pose the primary constraint on the performance of LLMs. LLMs use input tokens to generate attention key and value tensors, which are then cached in GPU memory for generating subsequent tokens. These dynamic key and value tensors, known as the KV cache, occupy a substantial portion of memory, and managing them is cumbersome. To address this challenge, the researchers introduced PagedAttention, a novel attention algorithm that extends the classical idea of paging in operating systems to LLM serving. PagedAttention offers a more flexible approach to managing key and value tensors by storing them in non-contiguous memory spaces, eliminating the need for long contiguous memory blocks. These blocks can be retrieved independently through a block table during attention computation, leading to more efficient memory utilization. Adopting this technique reduces memory waste to under 4%, resulting in near-optimal memory usage. Moreover, PagedAttention can batch 5x more sequences together, thereby increasing GPU utilization and throughput.
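To make the block-table idea concrete, here is a minimal, illustrative Python sketch of paged KV-cache bookkeeping. The names used here (KVBlockPool, Sequence, BLOCK_SIZE, and so on) are invented for this example and do not reflect vLLM's actual internals; the point is only that a sequence's logical blocks can map to arbitrary, non-contiguous physical blocks that are gathered through a table at attention time.

```python
# Illustrative sketch of a paged KV cache; not vLLM's real implementation.
import numpy as np

BLOCK_SIZE = 16   # tokens stored per physical block (assumed for this sketch)
HEAD_DIM = 64     # size of each key/value vector (single head, for simplicity)

class KVBlockPool:
    """A pool of fixed-size physical blocks; blocks need not be contiguous."""
    def __init__(self, num_blocks: int):
        self.keys = np.zeros((num_blocks, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
        self.values = np.zeros((num_blocks, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        # Hand out any free physical block; location does not matter.
        return self.free.pop()

class Sequence:
    """Tracks one request's logical-to-physical mapping (its block table)."""
    def __init__(self, pool: KVBlockPool):
        self.pool = pool
        self.block_table: list[int] = []  # logical block i -> physical block id
        self.num_tokens = 0

    def append_kv(self, key: np.ndarray, value: np.ndarray) -> None:
        offset = self.num_tokens % BLOCK_SIZE
        if offset == 0:  # first token, or the current block is full
            self.block_table.append(self.pool.allocate())
        phys = self.block_table[-1]
        self.pool.keys[phys, offset] = key
        self.pool.values[phys, offset] = value
        self.num_tokens += 1

    def gather_keys(self) -> np.ndarray:
        """Fetch this sequence's keys via the block table for attention."""
        blocks = [self.pool.keys[b] for b in self.block_table]
        return np.concatenate(blocks)[: self.num_tokens]
```

Because every block has the same fixed size, freeing and reusing memory reduces to returning block IDs to the pool, which is what keeps fragmentation and waste so low.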
PagedAttention offers the additional advantage of efficient memory sharing. During parallel sampling, i.e., when multiple output sequences are generated concurrently from a single prompt, PagedAttention enables the sharing of the computational resources and memory associated with that prompt. This is achieved through the block table: different sequences can share blocks by mapping their logical blocks to the same physical block. By employing this memory-sharing mechanism, PagedAttention not only minimizes memory usage but also ensures safe sharing. The researchers' experimental evaluations showed that parallel sampling could reduce memory usage by as much as 55%, translating into a 2.2x increase in throughput.
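Continuing the toy model above, the sketch below illustrates how parallel sampling can reuse a prompt's physical blocks across several output sequences. The fork_sequences helper and the reference counting shown here are purely illustrative assumptions, not vLLM's implementation.

```python
# Illustrative sketch of prompt KV-cache sharing across parallel samples.

class SharedPromptBlocks:
    """Physical blocks holding the prompt's KV cache, shared by all samples."""
    def __init__(self, block_ids: list[int]):
        self.block_ids = block_ids
        # A shared block may only be freed once no sequence references it.
        self.ref_count = 0

def fork_sequences(prompt_blocks: list[int], n_samples: int) -> list[list[int]]:
    """Create n block tables whose first logical blocks all map to the same
    physical prompt blocks, so the prompt's KV cache is stored only once."""
    shared = SharedPromptBlocks(prompt_blocks)
    tables = []
    for _ in range(n_samples):
        shared.ref_count += 1
        # Each sample starts from the shared mapping; tokens it generates
        # later would go into freshly allocated, per-sequence blocks
        # (with copy-on-write if a shared block ever needs modification).
        tables.append(list(shared.block_ids))
    return tables

# Example: four parallel samples reuse the same three physical prompt blocks.
tables = fork_sequences(prompt_blocks=[7, 2, 9], n_samples=4)
assert all(t[:3] == [7, 2, 9] for t in tables)
```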
To summarize, vLLM manages attention key and value memory through its PagedAttention mechanism, which yields exceptional throughput. Furthermore, vLLM integrates seamlessly with well-known HuggingFace models and can be used with different decoding algorithms, such as parallel sampling. The library can be installed with a simple pip command and is currently available for both offline inference and online serving.
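For reference, here is a short offline-inference example using vLLM's documented Python API after installing it with `pip install vllm`. The model name, prompts, and sampling settings are arbitrary choices for illustration.

```python
from vllm import LLM, SamplingParams

# Arbitrary example prompts and sampling settings.
prompts = ["The capital of France is", "Large language models are"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a supported HuggingFace model; PagedAttention is used under the hood.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, vLLM also ships a server entrypoint exposing an OpenAI-compatible API, so existing clients can be pointed at it with minimal changes.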
Check out the Blog Article and GitHub. Don't forget to join our 25k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing, and Web Development. She enjoys learning more about the technical field by participating in various challenges.