
Stanford University Researchers Introduce FlashFFTConv: A New Artificial Intelligence System for Optimizing FFT Convolutions for Long Sequences


Reasoning efficiently over long sequences is a significant challenge in machine learning. Recently, convolutions have emerged as a critical primitive for sequence modeling, supporting state-of-the-art performance in language modeling, time-series analysis, computer vision, DNA modeling, and more. Despite these impressive quality results and additional benefits, such as improved stability and better scalability as the sequence length increases, convolutional sequence models are still significantly slower than Transformers.

One fundamental cause is poor hardware support. Convolutions for sequence modeling frequently employ filters as long as the input sequence, in contrast to the short filters used in classical convolutions for vision applications. The Fast Fourier Transform (FFT) convolution algorithm computes the convolution between an input u and a convolution kernel k by mapping both into the frequency domain, multiplying them pointwise, and mapping the product back to the time domain.
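To make that concrete, here is a minimal PyTorch sketch of the basic algorithm (illustrative code of ours, not the paper's implementation); zero-padding to twice the sequence length turns the FFT's circular convolution into an ordinary causal one:

```python
import torch

def fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """O(N log N) causal convolution of input u with a same-length kernel k."""
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen                        # pad to avoid circular wrap-around
    u_f = torch.fft.rfft(u, n=fft_size)          # input  -> frequency domain
    k_f = torch.fft.rfft(k, n=fft_size)          # kernel -> frequency domain
    y = torch.fft.irfft(u_f * k_f, n=fft_size)   # pointwise multiply, map back
    return y[..., :seqlen]                       # keep the causal portion
```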

Despite being asymptotically efficient, the FFT convolution algorithm has poor wall-clock performance on contemporary accelerators. In contrast, systems advances have allowed Transformers to approach the limits of current accelerators, with end-to-end FLOP utilization of over 72% when using FlashAttention-v2.

To supply longer-context capabilities, new research from Stanford University investigates how to optimize the FFT convolution method on contemporary accelerators. The researchers argue that, just as systems advances like FlashAttention led to better models and new attention algorithms, optimizing the FFT convolution will lead to new and better algorithms, improving the quality of convolutional sequence models.

The FFT convolution can be easily optimized for short sequences. It is common practice to reuse kernel filters over multiple batches, which makes it possible to precompute the FFT of the filter before reusing it. Thus, the FFT convolution is parallel across batches and filters, and kernel fusion allows intermediate convolution outputs to be cached in SRAM or registers.
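In code, that precomputation amounts to hoisting the kernel's FFT out of the batch loop; this is a sketch of ours with made-up sizes:

```python
import torch

seqlen, fft_size, batch = 1024, 2048, 4
k = torch.randn(seqlen)
k_f = torch.fft.rfft(k, n=fft_size)   # filter FFT computed once, then reused

for _ in range(100):                  # many batches share the same filter
    u = torch.randn(batch, seqlen)    # rfft broadcasts over the batch dimension
    y = torch.fft.irfft(torch.fft.rfft(u, n=fft_size) * k_f, n=fft_size)[..., :seqlen]
```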

However, the team highlights two major bottlenecks that appear as the sequence length grows:

  1. On current accelerators, FFT convolutions do not optimally utilize the specialized matrix-matrix multiply units.
  2. Kernel fusion fails once sequences grow too long to fit in SRAM, and costly I/O operations are required. Padding operations for causality and conversions from real-valued inputs/outputs to complex-valued FFT intermediates can increase these I/O costs further.

In response, the researchers offer FlashFFTConv, a novel algorithm that employs a Monarch decomposition of the FFT to optimize the FFT convolution for long sequences. A Monarch decomposition of order p rewrites the FFT as a series of p matrix-matrix multiply operations, which can be mapped efficiently onto hardware. Higher values of p incur lower FLOP cost thanks to smaller matrices but require more I/O to move intermediate results, so there is a tradeoff.
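To make the matrix view concrete, the following sketch spells out an order-2 decomposition, the classic Cooley-Tukey factorization that Monarch matrices generalize. The function names and factor sizes are our own, and FlashFFTConv's actual kernels fuse these steps and run the matmuls on tensor-core units:

```python
import math
import torch

def dft_matrix(n: int) -> torch.Tensor:
    """Dense DFT matrix F[j, k] = exp(-2*pi*i*j*k / n)."""
    idx = torch.arange(n, dtype=torch.float64)
    angle = -2.0 * math.pi * torch.outer(idx, idx) / n
    return torch.polar(torch.ones_like(angle), angle)

def monarch_fft(x: torch.Tensor, n1: int, n2: int) -> torch.Tensor:
    """FFT of size N = n1 * n2 as two matrix-matrix multiplies (order p = 2)."""
    f1, f2 = dft_matrix(n1), dft_matrix(n2)
    j2 = torch.arange(n2, dtype=torch.float64)
    j1 = torch.arange(n1, dtype=torch.float64)
    # Twiddle factors that couple the two stages of the decomposition.
    tw = torch.polar(torch.ones(n2, n1, dtype=torch.float64),
                     -2.0 * math.pi * torch.outer(j2, j1) / (n1 * n2))
    X = x.reshape(n1, n2).to(f1.dtype)  # x[n2*a + b] -> X[a, b]
    step1 = f1 @ X                      # n1-point DFTs  -> indexed [k1, b]
    step2 = step1.T * tw                # twiddle fix-up -> indexed [b, k1]
    step3 = f2 @ step2                  # n2-point DFTs  -> indexed [k2, k1]
    return step3.reshape(-1)            # y[n1*k2 + k1] == FFT(x)

x = torch.randn(32, dtype=torch.float64)
assert torch.allclose(monarch_fft(x, 4, 8), torch.fft.fft(x))
```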

The study demonstrates how to choose p to balance FLOP cost against I/O cost on a GPU, using a simple cost model based on sequence length. Along with enabling kernel fusion at greater sequence lengths, this decomposition reduces the amount of the sequence that must be kept in SRAM. As a result, FlashFFTConv can easily handle sequences anywhere from 256 to 4 million in length. By using a real-valued FFT algorithm and skipping parts of the matrix-multiply operations when the input is zero-padded, FlashFFTConv can cut the length of the FFT operation by as much as half. Last but not least, the matrix view of the FFT convolution provides a straightforward interface for implementing two architectural modifications: partial convolutions, which learn with a convolution kernel that is shorter than the input sequence, and frequency-sparse convolutions, which zero out sections of the kernel in frequency space. Both approaches can be implemented simply by omitting sections of the matrix decomposition, lowering memory footprint and wall-clock runtime, and can be regarded as the convolutional analogues of sparse/approximate attention in Transformers.
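As a rough illustration of the frequency-sparse idea, here is a naive sketch of ours in the unfused FFT formulation, with an arbitrary low-frequency mask; in FlashFFTConv the zeroed sections align with blocks of the matrix decomposition, so entire matrix multiplies can be skipped:

```python
import torch

def freq_sparse_conv(u: torch.Tensor, k: torch.Tensor, keep: float = 0.25) -> torch.Tensor:
    """Convolve u with k after zeroing the top (1 - keep) fraction of the
    kernel's frequency response. Purely illustrative; the mask is arbitrary."""
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen
    k_f = torch.fft.rfft(k, n=fft_size)
    mask = torch.zeros(k_f.shape[-1])
    mask[: int(keep * mask.numel())] = 1.0   # keep only the lowest frequencies
    y_f = torch.fft.rfft(u, n=fft_size) * (k_f * mask)
    return torch.fft.irfft(y_f, n=fft_size)[..., :seqlen]
```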

The researchers show that FlashFFTConv accelerates the FFT convolution, yielding higher-quality, more efficient, and longer-sequence models.

  • FlashFFTConv improves the quality of convolutional sequence models through better efficiency: for the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity and allows M2-BERT-base to achieve up to 3.3 points higher average GLUE score, a performance gain comparable to doubling the model's parameters.
  • FlashFFTConv improves the efficiency of convolutions by up to 7.93× and saves up to 5.60× in memory compared to PyTorch, and this efficiency holds over four orders of magnitude in sequence length. Thanks to lower FLOP costs, FlashFFTConv is faster in wall-clock time than FlashAttention-v2 end-to-end for sequence lengths of 2K and longer, and it achieves up to 62.3% end-to-end FLOP utilization, only 10% lower than FlashAttention-v2.
  • FlashFFTConv makes longer-sequence models possible. It has produced the only model capable of completing the Long Range Arena benchmark's Path-512 task (sequence length 256K) for high-resolution image classification, and it is the first model to embed the longest human genes (up to 2.3M base pairs) at single-nucleotide resolution, extending HyenaDNA to a 4M sequence length via partial convolutions.

The team hopes that FlashFFTConv will pave the way for wider use of convolutional sequence models and that the lessons learned will lead to more resource-efficient computer architectures.


Check out the Paper, GitHub, and Blog Article. All credit for this research goes to the researchers of this project. Also, don't forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.



Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies spanning the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today's evolving world that make everyone's life easier.


