
Over the past 12 months, natural language processing has seen remarkable advances with the emergence of language models equipped with significantly longer contexts. Among these models are GPT-4 with a context length of 32k, MosaicML's MPT with a 65k context, and Anthropic's Claude, boasting a formidable 100k context length. As applications such as long-document querying and story writing continue to grow, the need for language models with long context becomes evident. Nonetheless, the challenge lies in scaling up the context length of Transformers, as their attention layer has compute and memory requirements that grow quadratically with the input sequence length.
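To see where that quadratic growth comes from, consider a minimal, deliberately naive PyTorch sketch of standard attention (illustrative only): the full seq_len x seq_len score matrix is materialized explicitly, so doubling the sequence length quadruples its size.

```python
import math
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.shape[-1]
    # The (seq_len x seq_len) score matrix is materialized in full,
    # so memory and compute grow quadratically with sequence length.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    probs = torch.softmax(scores, dim=-1)
    return probs @ v

# Doubling seq_len from 1024 to 2048 quadruples the size of `scores`.
q = k = v = torch.randn(1, 12, 1024, 64)
out = naive_attention(q, k, v)
print(out.shape)  # torch.Size([1, 12, 1024, 64])
```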
Addressing this challenge, FlashAttention, an algorithm released only a year ago, gained rapid adoption across various organizations and research labs. It accelerated attention computation while reducing its memory footprint, without sacrificing accuracy or approximating the output. With 2-4x faster performance than optimized baselines at its initial release, FlashAttention proved to be a groundbreaking advance. Yet it still had untapped potential: topping out at around 124 TFLOPs/s on A100 GPUs, it fell short of blazing-fast optimized matrix-multiply (GEMM) operations.
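As an illustration of how such a fused kernel is typically consumed (a usage sketch, not code from the FlashAttention release itself), PyTorch 2.x exposes a fused scaled_dot_product_attention that can dispatch to a FlashAttention kernel on supported GPUs and dtypes:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only. On a supported GPU with fp16/bf16 inputs,
# this call can dispatch to a FlashAttention-style fused kernel; on CPU
# it falls back to the reference math implementation.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q, k, v = (torch.randn(1, 12, 2048, 64, device=device, dtype=dtype) for _ in range(3))

# The fused kernel never materializes the full N x N attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 12, 2048, 64])
```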
Taking the next step, the developers of FlashAttention have now introduced FlashAttention-2, a reworked version that significantly surpasses its predecessor. Leveraging Nvidia's CUTLASS 3.x and its CuTe core library, FlashAttention-2 achieves a roughly 2x speedup, reaching up to 230 TFLOPs/s on A100 GPUs. Furthermore, in end-to-end training of GPT-style language models, FlashAttention-2 attains a training speed of up to 225 TFLOPs/s, corresponding to an impressive 72% model FLOPs utilization.
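For context, the 72% figure lines up with the A100's peak dense FP16/BF16 tensor-core throughput of 312 TFLOPs/s: 225 / 312 ≈ 0.72.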
The key enhancements of FlashAttention-2 lie in its better parallelism and work-partitioning strategies. Originally, FlashAttention parallelized over the batch size and the number of heads, which effectively utilized the GPU's compute resources. However, for long sequences with small batch sizes or few heads, FlashAttention-2 now also parallelizes over the sequence-length dimension, yielding significant speedups in these scenarios.
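One way to picture this change (a conceptual sketch only; the real implementation is a CUDA kernel, and the tile size below is just an illustrative choice) is as adding a third dimension of independent work items over blocks of query rows:

```python
# Conceptual sketch of the launch grid, not the real CUDA kernel.
# FlashAttention (v1) launched roughly one thread block per (batch, head):
#   grid = (batch_size, num_heads)
# FlashAttention-2 additionally tiles the query sequence, so long sequences
# with small batch * heads can still keep all of the GPU's SMs busy.

def fa2_work_items(batch_size, num_heads, seq_len, block_m=128):
    num_q_blocks = (seq_len + block_m - 1) // block_m
    for b in range(batch_size):
        for h in range(num_heads):
            for qb in range(num_q_blocks):
                # Each work item covers one block of query rows for one
                # (batch, head) pair and can run independently.
                yield (b, h, qb)

# batch=1, heads=8, seq_len=16384 -> 1 * 8 * 128 = 1024 independent work items,
# versus only 8 when parallelizing over batch and heads alone.
print(sum(1 for _ in fa2_work_items(1, 8, 16384)))
```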
Another improvement involves efficiently partitioning work between the different warps within each thread block. In FlashAttention, splitting K and V across 4 warps while keeping Q accessible to all warps, known as the "split-K" scheme, led to unnecessary shared-memory reads and writes that slowed down the computation. FlashAttention-2 takes the opposite approach, splitting Q across 4 warps while keeping K and V accessible to all warps. This eliminates the need for communication between warps and significantly reduces shared-memory reads/writes, further boosting performance.
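The effect of this choice can be reproduced in a few lines of NumPy, with warps stood in for by plain Python workers (an analogy, not GPU code): splitting the query rows lets each worker finish its output rows on its own, whereas splitting K and V forces a final merge of per-worker softmax statistics.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d, n_workers = 8, 4, 4
Q, K, V = rng.standard_normal((3, N, d))
ref = softmax(Q @ K.T / np.sqrt(d)) @ V  # full attention output

# "Split-Q" (FlashAttention-2): each worker owns a slice of query rows and
# produces its rows of the output independently -- nothing to combine.
out_split_q = np.concatenate(
    [softmax(Qw @ K.T / np.sqrt(d)) @ V for Qw in np.array_split(Q, n_workers)]
)
assert np.allclose(out_split_q, ref)

# "Split-K" (FlashAttention's scheme): each worker sees only a slice of K/V,
# so it can only produce partial softmax statistics and partial outputs that
# must be merged afterwards -- on the GPU, that merge is the extra
# shared-memory traffic and synchronization FlashAttention-2 avoids.
partials = []
for Kw, Vw in zip(np.array_split(K, n_workers), np.array_split(V, n_workers)):
    s = Q @ Kw.T / np.sqrt(d)                  # partial scores
    m = s.max(axis=-1, keepdims=True)          # per-worker running max
    e = np.exp(s - m)
    partials.append((m, e.sum(axis=-1, keepdims=True), e @ Vw))

m_all = np.max(np.stack([m for m, _, _ in partials]), axis=0)
num = sum(np.exp(m - m_all) * o for m, _, o in partials)
den = sum(np.exp(m - m_all) * z for m, z, _ in partials)
assert np.allclose(num / den, ref)
```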
FlashAttention-2 also introduces several new features that broaden its applicability and enhance its capabilities. It now supports head dimensions up to 256, accommodating models like GPT-J, CodeGen, CodeGen2, and StableDiffusion 1.x, opening up further speedup and memory-saving opportunities. Moreover, FlashAttention-2 supports the multi-query attention (MQA) and grouped-query attention (GQA) variants, in which multiple query heads attend to the same head of key and value, resulting in higher inference throughput and better performance.
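For readers unfamiliar with these variants, here is a small PyTorch sketch of GQA with made-up shapes (8 query heads sharing 2 key/value heads; MQA is the single-K/V-head special case). The explicit repeat_interleave replication is simply the naive way to express that sharing in plain PyTorch.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only: 8 query heads share 2 key/value heads (GQA).
batch, seq, head_dim = 2, 1024, 64
n_q_heads, n_kv_heads = 8, 2

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Naive GQA: replicate each K/V head for the group of query heads that
# attends to it, then run ordinary attention over the expanded tensors.
group = n_q_heads // n_kv_heads
k_exp = k.repeat_interleave(group, dim=1)  # (batch, 8, seq, head_dim)
v_exp = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```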
The performance of FlashAttention-2 is truly impressive. Benchmarked on an A100 80GB SXM4 GPU, it achieves around a 2x speedup compared with its predecessor and up to a 9x speedup compared with a standard attention implementation in PyTorch. Furthermore, when used for end-to-end training of GPT-style models, FlashAttention-2 reaches up to 225 TFLOPs/s on A100 GPUs, representing a 1.3x end-to-end speedup over already highly optimized models using FlashAttention.
Looking ahead, the potential applications of FlashAttention-2 are promising. With the ability to train models with 16k context for the same price as previous 8k-context models, this technology can help analyze long books, reports, high-resolution images, audio, and video. Plans are underway to broaden support to devices such as H100 GPUs and AMD GPUs and to optimize for new data types like fp8. Moreover, combining the low-level optimizations of FlashAttention-2 with high-level algorithmic changes could pave the way for training AI models with unprecedentedly long context. Collaboration with compiler researchers to improve programmability is also on the horizon, promising a bright future for the next generation of language models.
Check out the Paper and GitHub.
Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.