
The advent of Large Language Models (LLMs) built from decoder-only transformer models has played an important role in transforming the Natural Language Processing (NLP) domain, in addition to advancing diverse deep learning applications including reinforcement learning, time-series analysis, image processing, and much more. Nonetheless, despite their scalability and robust performance, LLMs built from decoder-only transformers still face significant shortcomings. Although expressive, the attention mechanism in transformer-derived LLMs requires high computational resources during both inference and training: memory that grows with the sequence length and FLOPs that grow quadratically with it. This high computational requirement limits the context length of transformer models, makes autoregressive generation increasingly expensive with scale, and hinders learning from continuous data streams and truly unbounded sequence processing.
Recently, State Space Models (SSMs) have demonstrated remarkable capabilities and performance, competing with transformer-architecture models on large-scale modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. In particular, Mamba, a recently released State Space Model, has shown outstanding performance across a range of language modeling and long-sequence processing tasks. Concurrently, Mixture of Experts (MoE) models have also shown impressive performance while significantly reducing the latency and computational cost of inference, albeit at the expense of a larger memory footprint. Building on Mamba and MoE models, this article discusses BlackMamba, a novel architecture that combines the Mamba State Space Model with MoE models to leverage the advantages offered by both frameworks. Experiments on BlackMamba have demonstrated its ability to outperform the existing Mamba framework and transformer baselines in both training FLOPs and inference. This performance shows that BlackMamba can effectively combine the abilities of the Mamba and MoE frameworks, offering fast and cost-effective inference from MoE with linear-complexity generation from Mamba.
This article aims to cover the BlackMamba framework in depth. We explore the mechanism, methodology, and architecture of the framework, along with its comparison to state-of-the-art language modeling frameworks. Let’s get started.
The progression of Large Language Models (LLMs), particularly those based on decoder-only transformer architectures, has notably influenced the Natural Language Processing (NLP) field and expanded into various deep learning applications, including reinforcement learning, time-series analysis, image processing, and beyond. Nonetheless, despite their scalability and robust performance, these decoder-only transformer-based LLMs encounter notable challenges. The attention mechanism, a key feature of transformer-based LLMs, demands extensive computational resources for both inference and training. It requires memory that grows with the sequence length and a number of computational operations (FLOPs) that grows quadratically. Such intensive computational needs restrict the models’ context length, raise the cost of autoregressive generation as the model scales, and hinder the models’ ability to learn from continuous data streams or to process sequences of unlimited length efficiently.
Significant efforts have been made in the past few years to overcome these limitations, and attention has shifted towards devising architectural alternatives to the canonical dense-attention transformer, with SSMs and MoE models being the most promising candidate architectures. The key benefit of State Space Models over transformer-architecture models is their linear computational complexity with respect to input sequence length, as opposed to the quadratic complexity of transformers. In theory, linear complexity enables State Space Models to process longer sequences than transformer-architecture models for a given FLOP (floating-point operation) budget, and makes the compute cost of each autoregressive generation step constant, without the need for a KV cache. Recently developed State Space Models, including Mamba, RetNet, and a few others, have demonstrated efficient long-sequence inference and training, together with language modeling performance competitive with transformers and similar scaling properties.
On the other hand, Mixture of Experts (MoE) architectures are gaining popularity as an alternative to dense transformers because they significantly reduce the inference and training FLOPs needed to achieve quality comparable to a dense model. MoE models operate by activating only a sparse subset of the total parameters during a single forward pass. They use a routing function to determine which ‘experts’ are called into action for the given context. This approach decouples the computational cost of inference from the total number of parameters, allowing for better performance within a fixed inference budget, albeit with a larger number of parameters and a larger memory requirement.
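To make the separation between compute cost and parameter count concrete, here is a minimal arithmetic sketch. The function name and the numbers in the usage note are hypothetical and chosen only for illustration; they are not BlackMamba's actual parameter counts.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token=1):
    """Return (total, active) parameter counts for a routed-expert model.

    shared_params: parameters every token passes through (hypothetical aggregate).
    params_per_expert: parameters of one expert, summed over all layers.
    """
    # All experts must be stored in memory.
    total = shared_params + num_experts * params_per_expert
    # Only the selected experts take part in a single token's forward pass.
    active = shared_params + experts_per_token * params_per_expert
    return total, active
```

For example, with 100 shared units, 30 units per expert, and 8 experts, `moe_param_counts(100, 30, 8)` gives a total of 340 units held in memory but only 130 units used per token, which is why the memory footprint grows while the per-token compute stays close to that of a small dense model.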
These architectural advances offer notable advantages over traditional transformers and represent an exciting direction for further development. We posit that integrating them into a combined Mamba-MoE model could significantly improve language modeling capability and efficiency beyond that of standard transformer models. The anticipated benefits of a Mamba-MoE architecture compared with a traditional dense transformer model include:
Mamba: Achieves linear computational complexity relative to the input sequence length for both training and inference. It enables each step of autoregressive generation to run in constant time and with constant memory usage.
MoE: Offers the inference speed and training compute efficiency of a smaller dense baseline model while maintaining model quality that rivals a dense model with the same total number of parameters.
With that being said, it is important to note that transformer-architecture models are still state of the art, and have demonstrated consistently and remarkably strong performance on language modeling and sequence processing tasks. At its core, the transformer architecture employs self-attention, which performs a quadratic all-to-all comparison of the dot-product similarities between the embeddings of the different tokens in a sequence and then applies a linear map to an output vector. The transformer model consists of self-attention blocks interleaved with Multi-Layer Perceptron (MLP) blocks, each of which is a two-layer MLP with a given activation function.
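The all-to-all comparison described above can be sketched in a few lines of plain Python. This is a simplified illustration, not a production implementation: it assumes a single head, a causal mask, and query, key, and value vectors that are passed in directly instead of being produced by learned projections.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Single-head causal attention over lists of vectors (one vector per token)."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Token i is compared with every token j <= i, so the work is O(n^2) overall.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out

def num_score_computations(n):
    # Number of query-key dot products needed for a causal sequence of length n.
    return n * (n + 1) // 2
```

Because `num_score_computations` grows with the square of the sequence length, doubling the context roughly quadruples the number of dot products, which is the quadratic cost the text refers to.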
BlackMamba: Architecture and Methodology
State Space Models
State Space Models belong to the family of sequence models whose complexity is linear in the length of the input sequence. Their architecture has more in common with Recurrent Neural Networks and Convolutional Neural Networks than with attention-based architectures, and is inspired by a continuous dynamical system that maps a 1-dimensional function through an implicit latent space. Because the dynamical system is linear, it can be computed efficiently in parallel using either an associative scan or a convolution. In practice, the recurrent nature of State Space Models has been the reason they have yet to be widely adopted on highly parallel AI hardware such as GPUs. However, models such as RWKV and Mamba use parallel scan kernels to map recurrent operations efficiently onto GPUs, making it possible to train novel architectures with an efficiency comparable to that of transformer models.
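The equivalence between the recurrent and convolutional views of a linear state space model can be checked with a small sketch. It assumes an already discretized system with a scalar state and hypothetical parameters `a`, `b`, and `c`; real SSM layers use vector-valued states and learned parameters.

```python
def ssm_recurrent(a, b, c, xs):
    """Run h_t = a*h_{t-1} + b*x_t, y_t = c*h_t one step at a time."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x  # constant work and constant memory per step
        ys.append(c * h)
    return ys

def ssm_convolution(a, b, c, xs):
    """Compute the same outputs with the unrolled kernel K_j = c * a**j * b."""
    n = len(xs)
    kernel = [c * (a ** j) * b for j in range(n)]
    # y_t = sum over j of K_j * x_{t-j}; all time steps are independent of each other.
    return [sum(kernel[j] * xs[t - j] for j in range(t + 1)) for t in range(n)]
```

The recurrent form is what makes generation cheap (one fixed-size state update per token), while the convolutional form exposes parallelism across time steps during training. The direct convolution above is quadratic and is written this way only for clarity; practical implementations typically rely on FFT-based convolutions or scan kernels.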
The inherent quadratic complexity of transformers with respect to sequence length is a well-known limitation that impedes reasoning and comprehension over very long contexts. Recent innovations have introduced the idea of context-length extension, which allows transformers to be trained at a feasible scale before being applied to much longer contexts during inference. Despite these advances, inference still demands a substantial amount of computational resources and memory, especially for maintaining the Key-Value (KV) cache, which makes it a resource-intensive endeavor. Recent research efforts have focused on enhancing the expressive capabilities of state-space models by incorporating input-dependent gating mechanisms, akin to the Query, Key, Value (QKV) matrices found in attention mechanisms.
These efforts aim to preserve the inherently linear nature of the state-space recursion, allowing for efficient execution through either a convolution or a selective scan. This approach significantly narrows the performance gap with transformers in practical applications. Among these advances, Mamba stands out as a state-space model that builds on the objectives of prior research and shows performance comparable to transformers at scales of up to 2.8 billion parameters. It achieves this by applying input-dependent gating to the inputs of the state-space model (SSM) recursion, while keeping computation efficient through bespoke selective scan kernels.
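The reason a gated recursion can still be evaluated with a scan is that each step is an affine map, and affine maps compose associatively. The toy sketch below illustrates this idea only; the sigmoid gate in `gates` is a hypothetical stand-in and is not Mamba's actual parameterization or kernel.

```python
import math
from functools import reduce

def gates(xs, w=1.0):
    """Hypothetical input-dependent gating: each x_t yields its own (a_t, b_t)."""
    pairs = []
    for x in xs:
        g = 1.0 / (1.0 + math.exp(-w * x))
        pairs.append((1.0 - g, g * x))  # step is h_t = a_t * h_{t-1} + b_t
    return pairs

def sequential_scan(pairs):
    """Reference implementation: apply the steps one after another."""
    h, hs = 0.0, []
    for a, b in pairs:
        h = a * h + b
        hs.append(h)
    return hs

def combine(left, right):
    """Compose two steps h -> a*h + b (left applied first, then right).

    The operation is associative, so partial results can be merged in any
    grouping, which is what a parallel scan exploits.
    """
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)
```

With a zero initial state, `reduce(combine, pairs)[1]` equals the last value returned by `sequential_scan(pairs)`, and grouping the pairs as a balanced tree gives the same result up to floating-point rounding.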
Mixture of Experts Models
Mixture of Experts (MoE) models decouple the inference cost from the total parameter count by selectively activating parameters during the forward pass. Instead of using all parameters, these models route tokens to specific Multilayer Perceptron (MLP) experts. Ideally, each expert is specialized to process a particular kind of input, with a routing mechanism, essentially a compact neural network, determining the most suitable expert for each token. This approach aims to preserve the expressive power of a dense model with an equivalent number of parameters, but with considerably reduced computational demands. Typically, the router is a linear layer that maps tokens to expert indices, and each expert is simply a standard transformer Multilayer Perceptron. However, the optimal training method for the router has yet to be determined, since the expert assignment problem is non-differentiable, and Mixture of Experts models often struggle with training stability and with balancing the load across experts, which matters for hardware efficiency.
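The routing step can be illustrated with a minimal sketch. It assumes top-1 routing (each token goes to exactly one expert) and treats experts as plain Python callables; `router_weights`, `route_top1`, and `moe_layer` are hypothetical names introduced for this example.

```python
def linear(weights, x):
    # weights is a list of rows; returns the matrix-vector product weights @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def route_top1(router_weights, token):
    """Return the index of the expert with the highest router logit."""
    logits = linear(router_weights, token)
    # The argmax is a discrete choice, which is why routing is non-differentiable.
    return max(range(len(logits)), key=logits.__getitem__)

def moe_layer(router_weights, experts, tokens):
    """Send each token to one expert; only that expert's parameters are used."""
    outputs, assignments = [], []
    for token in tokens:
        e = route_top1(router_weights, token)
        assignments.append(e)
        outputs.append(experts[e](token))
    return outputs, assignments
```

The returned `assignments` list makes the load-balancing problem visible: if the router sends most tokens to the same index, the remaining experts sit idle while one expert becomes a bottleneck.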
Architecture
BlackMamba starts from the standard transformer layout, in which MLP blocks and attention blocks are interleaved and added in sequence along a residual stream. Most Mixture of Experts models simply replace the MLP blocks with a routed expert layer. The BlackMamba framework, by contrast, not only replaces the MLP block in the transformer with a routed expert layer, but also replaces the attention layer with a Mamba State Space Model layer. The architecture of the BlackMamba framework is shown in the following figure.
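The two substitutions can be summarized structurally as follows. This is a deliberately simplified sketch: tokens are single numbers, normalization layers are omitted, the ordering of the two sub-blocks is an assumption, and `cumulative_sum_mixer` and `doubling_experts` are hypothetical stand-ins for a real Mamba layer and a real routed expert layer.

```python
def blackmamba_style_block(x, sequence_mixer, routed_experts):
    """One block on a residual stream: sequence mixing, then routed experts."""
    # Sub-block 1: where a transformer would apply attention, apply an SSM-style mixer.
    x = [xi + mi for xi, mi in zip(x, sequence_mixer(x))]
    # Sub-block 2: where a transformer would apply a dense MLP, apply routed experts.
    x = [xi + ei for xi, ei in zip(x, routed_experts(x))]
    return x

def cumulative_sum_mixer(xs):
    # Stand-in for a causal sequence mixer: each position sees only earlier inputs.
    total, out = 0.0, []
    for v in xs:
        total += v
        out.append(total)
    return out

def doubling_experts(xs):
    # Stand-in for a per-token expert layer: acts on each position independently.
    return [2.0 * v for v in xs]
```

Calling `blackmamba_style_block([1.0, 2.0], cumulative_sum_mixer, doubling_experts)` returns `[6.0, 15.0]`: the mixer moves information across positions, the expert layer transforms each position on its own, and both results are added back onto the residual stream.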
Training and Dataset
The BlackMamba model is trained on over 300 billion tokens of a custom dataset, and uses the SwiGLU activation function for the expert multilayer perceptrons. The framework trains with 8 experts, a number the developers found to strike the right trade-off between the memory footprint and the inference cost of the model. The custom dataset used to train the BlackMamba framework is a mixture of existing open-source datasets including Starcoder, SlimPajama, Pile, and more. The following table shows the weight of each dataset used for training the BlackMamba framework. Overall, there are 1.8 trillion tokens in the dataset.
BlackMamba: Results
To ensure a fair comparison between Mamba and BlackMamba, the developers trained both models with the same training parameters on the same training data. For a similar forward-pass model size, the BlackMamba framework is able to outperform both Mamba and transformer models in inference time as well as in training FLOPs. The following figure shows the time taken to generate a sequence of a given length autoregressively from an initial one-token prompt, as a function of the sequence length.
Moreover, the latency advantages of both the Mixture of Experts and Mamba models are combined in the BlackMamba framework, resulting in significantly faster inference than transformer models, pure Mamba models, and MoE models. In addition, the inference advantage of the BlackMamba framework grows in proportion to the sequence length, making BlackMamba extremely effective at long-sequence generation. The following figure illustrates the number of tokens assigned to each expert in the BlackMamba models with 340 million and 640 million parameters respectively. As can be seen, a majority of the layers show a high level of expert balance as a result of the improved Sinkhorn algorithm implemented by the BlackMamba models.
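For readers unfamiliar with Sinkhorn normalization, the sketch below shows the textbook version applied to a matrix of positive token-to-expert scores. It is not the improved variant used by BlackMamba, whose details are not reproduced here; the function name and the iteration count are assumptions made for illustration.

```python
def sinkhorn(scores, num_iters=50):
    """Alternately normalize rows and columns of a matrix of positive scores.

    Rows are tokens and columns are experts. After convergence every token
    distributes a total weight of 1, and every expert receives the same
    total weight, n_tokens / n_experts.
    """
    m = [row[:] for row in scores]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    target = n_rows / n_cols
    for _ in range(num_iters):
        # Row step: each token's weights sum to 1.
        for r in range(n_rows):
            s = sum(m[r])
            m[r] = [v / s for v in m[r]]
        # Column step: each expert's incoming weight sums to the target.
        for c in range(n_cols):
            s = sum(m[r][c] for r in range(n_rows))
            for r in range(n_rows):
                m[r][c] *= target / s
    return m
```

Given raw scores in which every token prefers the same expert, the normalized matrix still spreads the total weight evenly across experts, which is the balancing property the expert-assignment figure refers to.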
The following table covers the evaluation scores of the BlackMamba framework compared against a range of open-source pre-trained language models. As can be observed, the BlackMamba framework is able to compete with and outperform a majority of the frameworks across all baselines. Moreover, it’s worth noting that the models that outperform BlackMamba have a considerably higher number of parameters, and the gap in performance is minimal, which indicates what the BlackMamba framework can achieve with fewer parameters.
Final Thoughts
In this article, we have talked about BlackMamba, a novel architecture that combines the Mamba State Space Model with Mixture of Experts models to reap the benefits offered by both frameworks. Experiments on BlackMamba have shown that it outperforms the existing Mamba framework and transformer baselines in both training FLOPs and inference. This performance demonstrates that BlackMamba is able to inherit and combine the abilities of the Mamba and MoE frameworks exceptionally well, pairing the cheap and fast inference of MoE with the linear-complexity generation of Mamba. We have also discussed how the architecture of the BlackMamba framework is able to outperform strongly trained Large Language Models, the existing Mamba framework, and Mixture of Experts models in terms of training FLOPs and inference cost. Furthermore, the BlackMamba framework inherits the reduced training and generation FLOPs of both Mixture of Experts models and the Mamba framework simultaneously.