Language models and generative AI, renowned for their capabilities, are a hot topic in the AI industry. Researchers around the world are working to improve their efficacy and capability. These systems, typically deep learning models, are pre-trained on extensive text data and rely on neural networks with self-attention. They use various layers (feedforward, recurrent, embedding, and attention) to process input text and produce relevant outputs.

In most large language models, the feedforward layers hold the majority of the parameters. Studies show that these models use only a fraction of the available neurons to compute their output during inference.

This article introduces UltraFastBERT, a BERT-based framework that matches the efficacy of leading BERT models while using just 0.3% of its neurons during inference, specifically 12 out of 4095 in each layer. We’ll explore UltraFastBERT’s architecture, functionality, and results. Let’s begin.

Traditionally, a language model combines several components to generate content, including feedforward layers, recurrent layers, embedding layers, and attention layers. These components learn to recognize patterns during training and ultimately generate accurate output based on the input text. Each of these components has its own parameters, and in language models the bulk of those parameters sits in the feedforward layers. However, the feedforward layers do not use 100% of their available neurons to generate output for every input at inference time, which wastes resources and increases complexity, computation time, and computational cost.

At its core, UltraFastBERT is a variant of the BERT framework that builds on this observation: it replaces the feedforward layers in its architecture with fast feedforward networks. As a result, UltraFastBERT uses only 0.3% of the available neurons while delivering results comparable to BERT models of similar size and training procedure, especially on downstream tasks. As a consequence of this design, the intermediate layers in UltraFastBERT are exponentially faster.

Consider a fast feedforward (FFF) network and a feedforward (FF) network, each with n neurons. The time complexity of a forward pass is O(n) for the feedforward network but O(log2 n) for the fast feedforward network. The difference arises because the neurons of a fast feedforward network are organized into a balanced binary tree, and for a given input the network conditionally executes only one branch of that tree. Moreover, inference on a fast feedforward network amounts to CMM, or Conditional Matrix Multiplication, in which the input rows are dotted with the neuron weight columns one at a time, and the output of the previous dot product determines which weight column to proceed with. As a result, every neuron is used by only some of the inputs, and no input needs more than a handful of neurons to be handled by the network. CMM contrasts with DMM, or Dense Matrix Multiplication, which computes the dot product of all inputs with all weight columns.
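To make the contrast concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the paper's implementation: the array-style tree layout (node `i` has children `2*i + 1` and `2*i + 2`), the rule that a positive score selects the right child, and the toy sizes are all choices made for this example.

```python
import random

def dense_forward(x, weights):
    """Dense (DMM-style) pass: one dot product per neuron, n in total."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]

def conditional_forward(x, weights, depth):
    """Conditional (CMM-style) pass over neurons stored as a balanced binary tree.

    Only one root-to-leaf path is evaluated, so depth + 1 dot products
    are computed instead of n.
    """
    node, visited = 0, []
    for _ in range(depth + 1):
        score = sum(w_i * x_i for w_i, x_i in zip(weights[node], x))
        visited.append((node, score))
        # The sign of the previous dot product selects the next weight vector
        # (assumed convention: positive goes right, otherwise left).
        node = 2 * node + (2 if score > 0 else 1)
    return visited

random.seed(0)
depth, width = 3, 4                       # illustrative sizes (assumptions)
n = 2 ** (depth + 1) - 1                  # 15 neurons in a tree of depth 3
weights = [[random.uniform(-1, 1) for _ in range(width)] for _ in range(n)]
x = [random.uniform(-1, 1) for _ in range(width)]

dense = dense_forward(x, weights)
path = conditional_forward(x, weights, depth)
print(len(dense), "dot products (dense) vs", len(path), "(conditional)")
# 15 dot products (dense) vs 4 (conditional)
```

Doubling the number of neurons adds only one more level to the tree, which is where the logarithmic cost comes from.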

To sum up, UltraFastBERT is a BERT-based framework that delivers results comparable to state-of-the-art BERT language models. It:

- Utilizes only 0.3% of the available neurons during inference, engaging just 12 out of a total of 4095 neurons in each layer.
- Delivers strong performance comparable to state-of-the-art BERT models by applying fine-tuning strategies on downstream tasks.
- Provides a native implementation of CMM, or Conditional Matrix Multiplication, which forms the basis of the fast feedforward network and ultimately yields a 78x speedup over natively optimized DMM, or Dense Matrix Multiplication.

**Feed Forward Neural Networks**

A feedforward neural network is one of the most straightforward artificial neural networks: it moves information in only one direction, from the input nodes to the output nodes via hidden nodes. One of the main characteristics of a feedforward neural network is that it contains no loops or cycles, which makes it simpler to construct than an RNN (Recurrent Neural Network) or a CNN (Convolutional Neural Network). Its architecture comprises three components: input layers, hidden layers, and output layers. Each layer consists of units called neurons, and the layers are connected to one another through weights.

The neurons in the input layer receive the inputs and forward them to the next layer. The number of neurons in the input layer is determined by the dimension of the input data. Next come the hidden layers, which are exposed to neither the input nor the output and are responsible for the necessary computations. The neurons in each hidden layer take the weighted sum of the outputs of the previous layer, apply an activation function, and pass the result to the next layer, and the process repeats. Finally, the output layer produces the output for the given inputs. Each neuron in every layer of a feedforward network is connected to every neuron in the next layer, which makes feedforward neural networks fully connected. Weights represent the strength of the connections between neurons, and the network learns patterns by updating these weights based on the error in its output.

Moving forward, there are two key stages in the working of a feedforward neural network: the feedforward phase and the backpropagation phase.

**Feedforward Phase**

In the feedforward phase, the input is fed to the network and propagates forward. The hidden layers compute the weighted sum of their inputs and introduce non-linearity by passing that sum through an activation function such as ReLU, Sigmoid, or Tanh. The process repeats layer by layer until the signal reaches the output layer and the model makes a prediction.
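The feedforward phase can be sketched in a few lines of plain Python. This is a toy example under assumptions: a single hidden layer with ReLU, a linear output layer, and hand-picked weights that do not come from any real model.

```python
def relu(values):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in values]

def dense_layer(x, W, b):
    """Weighted sum of the inputs plus a bias, one value per neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    hidden = relu(dense_layer(x, W1, b1))    # hidden layer adds the non-linearity
    return dense_layer(hidden, W2, b2)        # linear output layer makes the prediction

# Toy parameters (hypothetical values chosen by hand).
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, -1.0]
W2 = [[2.0, 1.0]]
b2 = [0.5]

prediction = forward([2.0, 1.0], W1, b1, W2, b2)
print(prediction)  # [3.0]
```

Here the hidden layer produces `[1.0, 0.5]`, and the output neuron combines those values into `2.0 * 1.0 + 1.0 * 0.5 + 0.5 = 3.0`.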

**Backpropagation Phase**

Once the model makes a prediction, it computes the error between the generated output and the expected output. The error is then backpropagated through the network, and the network uses a gradient descent optimization algorithm to adjust the weights in an attempt to minimize the error.
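As a rough illustration of the two phases working together, the sketch below trains a tiny network on a single example. The architecture (one tanh hidden layer, a scalar output, squared-error loss), the initial weights, and the learning rate are all assumptions picked for this example.

```python
import math

def train_step(x, y, W1, W2, lr):
    """One forward pass, one backward pass, and one gradient-descent update.

    Network: hidden = tanh(W1 x), pred = W2 . hidden, loss = 0.5 * (pred - y)^2
    """
    # Feedforward phase
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    hidden = [math.tanh(p) for p in pre]
    pred = sum(w * h for w, h in zip(W2, hidden))
    loss = 0.5 * (pred - y) ** 2

    # Backpropagation phase (chain rule, from the output back to the input)
    d_pred = pred - y
    grad_W2 = [d_pred * h for h in hidden]
    d_hidden = [d_pred * w for w in W2]
    d_pre = [dh * (1.0 - h * h) for dh, h in zip(d_hidden, hidden)]  # tanh' = 1 - tanh^2
    grad_W1 = [[dp * xi for xi in x] for dp in d_pre]

    # Gradient-descent update
    new_W2 = [w - lr * g for w, g in zip(W2, grad_W2)]
    new_W1 = [[w - lr * g for w, g in zip(row, g_row)]
              for row, g_row in zip(W1, grad_W1)]
    return new_W1, new_W2, loss

W1 = [[0.1, -0.2], [0.3, 0.4]]   # hypothetical initial weights
W2 = [0.5, -0.5]
x, y = [1.0, 2.0], 1.0           # a single toy training example

losses = []
for _ in range(50):
    W1, W2, loss = train_step(x, y, W1, W2, lr=0.1)
    losses.append(loss)

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

The recorded loss shrinks over the iterations, which is the error minimization described above.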

**UltraFastBERT: Model Architecture and Working**

The UltraFastBERT framework is built on the crammedBERT architecture and employs all of its components except for the nature of the intermediate layers. Instead, UltraFastBERT replaces the feedforward networks contained in the intermediate layers of the crammedBERT transformer encoder with fast feedforward networks. It makes the following changes to the original fast feedforward network design.

- The framework removes the difference between leaf and non-leaf nodes by using the GeLU activation function across all nodes, equipping every node with output weights, and removing output biases entirely. It then fixes the leaf size to 1.
- Finally, the framework allows multiple fast feedforward network trees to run in parallel and computes the intermediate layer output jointly, as the sum of the outputs of the individual trees.
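The changes listed above can be pictured with a small plain-Python sketch. It is a simplified reading of the description, not the released implementation: the names `tree_output` and `fff_layer`, the array-style tree layout, the rule that a positive score selects the right child, and the toy sizes are all assumptions.

```python
import math
import random

def gelu(a):
    """Exact GELU, written with the Gaussian error function."""
    return 0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0)))

def tree_output(x, w_in, w_out, depth):
    """One tree: every visited node, leaf or not, adds gelu(score) * its output weights."""
    out = [0.0] * len(w_out[0])
    node = 0
    for _ in range(depth + 1):
        score = sum(w * xi for w, xi in zip(w_in[node], x))
        act = gelu(score)
        out = [o + act * w for o, w in zip(out, w_out[node])]
        node = 2 * node + (2 if score > 0 else 1)   # assumed branching convention
    return out

def fff_layer(x, trees, depth):
    """Intermediate-layer output: the sum of the individual tree outputs, with no output bias."""
    total = [0.0] * len(trees[0][1][0])
    for w_in, w_out in trees:
        total = [t + o for t, o in zip(total, tree_output(x, w_in, w_out, depth))]
    return total

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
hidden, depth, n_trees = 4, 2, 3                 # illustrative sizes (assumptions)
n_nodes = 2 ** (depth + 1) - 1                   # 7 single-neuron nodes per tree
trees = [(rand_matrix(n_nodes, hidden), rand_matrix(n_nodes, hidden))
         for _ in range(n_trees)]

x = [random.uniform(-1, 1) for _ in range(hidden)]
y = fff_layer(x, trees, depth)
print(len(y))  # 4
```

Because every node has the same form (input weights, GeLU, output weights) and there are no output biases, an all-zero input yields an all-zero output in this sketch.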

In training, UltraFastBERT follows the procedure used by crammedBERT, which includes disabling dropout in pretraining and using a 1-cycle triangular learning rate schedule. The model is then fine-tuned for a total of 5 epochs to maximize its performance on a wide range of tasks, mainly from the GLUE benchmark.
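For readers unfamiliar with the term, a 1-cycle triangular schedule ramps the learning rate up linearly to a peak and then back down linearly. The sketch below shows the general shape only; the function name, the peak value, the peak position, and the step count are hypothetical and are not the settings used by crammedBERT or UltraFastBERT.

```python
def triangular_lr(step, total_steps, peak_lr, peak_fraction=0.5):
    """Linear warm-up to peak_lr, then linear decay back to zero (a single cycle)."""
    peak_step = peak_fraction * total_steps
    if step <= peak_step:
        return peak_lr * step / peak_step
    return peak_lr * (total_steps - step) / (total_steps - peak_step)

# Hypothetical settings: 10 steps, peak of 1e-3 reached halfway through.
schedule = [triangular_lr(s, total_steps=10, peak_lr=1e-3) for s in range(11)]
print(schedule)
```

The schedule starts at zero, peaks at step 5, and returns to zero at step 10.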

**Inference**

Inference is the crucial part for a fast feedforward network. Feedforward networks make up a major share of large language models, and fast feedforward networks are known for their exceptional acceleration potential. To understand this potential, take one of the most advanced language models, GPT-3, in which the feedforward network in every transformer layer consists of over 49,100 neurons. If trainable, a fast feedforward network with a maximum depth of 15 could replace the original feedforward network. The introduced fast feedforward network would have over 65,000 neurons but would use only 16 of them for inference, which amounts to roughly 0.03% of the neurons available to GPT-3.
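The figures quoted in this article follow from simple tree arithmetic, which the short sketch below reproduces. The helper names are made up for this example, and the exact GPT-3 layer width of 49,152 is an assumption consistent with the "over 49,100" figure above.

```python
def tree_neurons(depth):
    """Neurons in a balanced binary tree with depth + 1 levels of single-neuron nodes."""
    return 2 ** (depth + 1) - 1

def used_per_inference(depth):
    """Neurons on one root-to-leaf path."""
    return depth + 1

# UltraFastBERT: 12 of 4095 neurons per layer.
print(tree_neurons(11), used_per_inference(11))        # 4095 12
print(f"{100 * 12 / 4095:.2f}%")                        # 0.29%

# GPT-3 thought experiment: a depth-15 tree in place of one feedforward layer.
gpt3_ff_neurons = 49_152                                # assumed exact layer width
print(tree_neurons(15), used_per_inference(15))        # 65535 16
print(f"{100 * 16 / gpt3_ff_neurons:.2f}%")             # 0.03%
```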

**Algorithm and Compatibility**

The UltraFastBERT framework uses a recursive algorithm, given as pseudocode, for fast feedforward inference; the algorithm is depicted in the image below.

Here, B represents the batch size, H represents the width of the input layers, and M represents columns. A major concern with the Conditional Matrix Multiplication approach is whether it makes fast feedforward networks incompatible with the processes already in use for Dense Matrix Multiplication and with existing deep learning frameworks. Fortunately, using CMM neither hurts performance nor introduces incompatibility, although it does increase caching complexity.
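Since the pseudocode itself is not reproduced here, the following plain-Python sketch gives one possible recursive reading of conditional matrix multiplication over a batch of B rows of width H. It is not the paper's algorithm verbatim: the function names, the array-style tree layout, and the hand-picked weights are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def descend(row, w_in, node, levels_left):
    """Evaluate one node, then recurse into the child selected by its logit."""
    logit = dot(w_in[node], row)
    if levels_left == 0:
        return [(node, logit)]
    child = 2 * node + 1 + int(logit > 0)     # assumed: positive logit goes right
    return [(node, logit)] + descend(row, w_in, child, levels_left - 1)

def cmm_batch(inputs, w_in, depth):
    """Each of the B input rows walks its own root-to-leaf path through the tree."""
    return [descend(row, w_in, 0, depth) for row in inputs]

# Hand-picked toy weights: H = 2, depth = 1, so the tree has 3 nodes.
w_in = [[1.0, 0.0],    # root
        [0.0, 1.0],    # left child
        [0.0, -1.0]]   # right child
inputs = [[1.0, 2.0],  # B = 2 rows
          [-1.0, 2.0]]

paths = cmm_batch(inputs, w_in, depth=1)
print(paths)
# [[(0, 1.0), (2, -2.0)], [(0, -1.0), (1, 2.0)]]
```

The two rows take different branches, so each of them touches only two of the three weight vectors.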

It is important to note that in a feedforward layer, single-threaded Dense Matrix Multiplication relies on executing MAC (Multiplication and Accumulation) instructions, so replacing DMM with CMM benefits CPUs because fewer MAC instructions are needed to compute the layer output per element. Moreover, despite employing conditionality, which is usually associated with branching, the “neural branching” of CMM manifests only as an addition to the memory offset of the relevant pointers. Therefore, in the UltraFastBERT framework, instruction branch prediction is rarely fully engaged to facilitate the conditionality of CMM, and only the relevant columns of the weight matrix are loaded, one at a time. Furthermore, because the framework performs row-column dot products, SIMD (single instruction, multiple data) vector parallel processing remains a good option for speeding up inference implementations on specific devices.
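A back-of-the-envelope count shows why fewer MAC instructions are needed. The sketch below counts only the input-side dot products and ignores memory and caching effects; the hidden width of 768 is an assumption, and the ratio does not depend on it.

```python
def dense_macs(n_neurons, width):
    """DMM: every neuron computes a full dot product with the input row."""
    return n_neurons * width

def conditional_macs(depth, width):
    """CMM: only the depth + 1 neurons on the selected path compute a dot product."""
    return (depth + 1) * width

width = 768                                  # assumed hidden width
dense = dense_macs(4095, width)              # all 4095 neurons
conditional = conditional_macs(11, width)    # 12 neurons on one path
print(dense, conditional, dense / conditional)
# 3144960 9216 341.25
```

This ratio is a ceiling on the MAC reduction for the layer; measured speedups, such as the 78x figure quoted in this article, are lower.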

**UltraFastBERT: Performance and Results**

We’ll discuss the performance of the UltraFastBERT framework in both fine-tuning and inference tasks to analyze how it fares against state-of-the-art language models.

**Fine-Tuning Results**

The following figure shows the performance of various models on the GLUE-dev datasets. Here, N represents the number of neurons available to the frameworks for training, and “Avg” represents the average score across all tasks.

As can be clearly seen, the UltraFastBERT framework, trained on an A6000 GPU for over 24 hours, retains almost 96% of the predictive performance of the original BERT framework on GLUE downstream tasks. It can also be seen that performance declines as the depth of the fast feedforward networks increases, although the majority of the degradation occurs only on the CoLA task. If the CoLA task is set aside, UltraFastBERT retains about 98.6% of the predictive performance.

**Inference Results**

In this section, we compare the performance of several feedforward and fast feedforward networks on inference implementations, which are spread across three levels.

- In Level 1, the implementation is built from BLAS Level 1 routines, namely scalar-vector products and vector-vector dot products.
- In Level 2, the implementations use BLAS Level 2 routines, namely batched scalar-vector products and batched matrix-vector dot products.
- In Level 3, the implementations use the non-batched BLAS Level 3 matrix-matrix multiplication. Although this is the fastest implementation available for feedforward networks, no such implementation is available for fast feedforward networks because the library does not support the vector-level sparsity of Conditional Matrix Multiplication.

Moreover, the UltraFastBERT framework provides GPU implementations using either custom CUDA kernels or PyTorch kernels.

The table above compares the performance of the UltraFastBERT framework with its BERT-based predecessors in terms of feedforward and fast feedforward layers. Every column contains the relative inference speedups of the fast feedforward implementation over the feedforward implementation when both use the same linear-algebraic routine primitives.

However, it is worth noting that the speedups reported in the table above are meant as “fair comparisons”, i.e. both the fast feedforward and feedforward implementations use identical linear-algebraic routine primitives. Moreover, at Level 1 and Level 2, the fast feedforward implementations perform inference 48x and 78x faster, respectively, than the fastest feedforward implementation.

**Final Thoughts**

In this article, we have discussed UltraFastBERT, a variant of the BERT framework. It builds on the observation that feedforward layers do not use 100% of their available neurons to generate output for every input at inference time, which wastes resources and increases complexity, computation time, and computational cost. UltraFastBERT therefore replaces the feedforward layers in its architecture with fast feedforward networks. As a result, it uses only 0.3% of the available neurons while delivering results comparable to BERT models of similar size and training procedure, especially on downstream tasks.

As a consequence of this design, the intermediate layers in the UltraFastBERT framework are exponentially faster. Moreover, the strong performance of UltraFastBERT is evidence that LLMs can deliver strong results while engaging only a fraction of their parameters for individual inferences: the framework uses only 0.3% of the available neurons during inference and still achieves a 78x speedup in inference time.