
Transformers
A broad overview of Transformers research
The pace of research in deep learning has accelerated significantly in recent years, making it increasingly difficult to keep abreast of all the latest developments. Nevertheless, one particular line of investigation has garnered significant attention due to its demonstrated success across a diverse range of domains, including natural language processing, computer vision, and audio processing, thanks largely to its highly adaptable architecture. The model is known as the Transformer, and it builds on an array of mechanisms and techniques from the field (i.e., attention mechanisms). You can read more about its building blocks and their implementation, together with multiple illustrations, in the following articles:
This article provides more details about the attention mechanisms that I will be referring to throughout this post:
A comprehensive range of models based on the vanilla Transformer has been explored so far, which can broadly be broken down into three categories:
- Architectural modifications
- Pretraining methods
- Applications
Each category above contains several sub-categories, which I will investigate thoroughly in the following sections. Fig. 2 illustrates the categories in which researchers have modified Transformers.
Self-attention plays an elemental role in the Transformer, although it suffers from two major disadvantages in practice [1].
- Complexity: For long sequences, this module becomes a bottleneck, since its computational complexity is O(T²·D).
- Structural prior: It does not address the structural bias of the inputs and requires additional mechanisms to be injected into the training data that it can later learn (e.g., learning the order information of the input sequences).
Subsequently, researchers have explored various techniques to overcome these drawbacks.
- Sparse attention: This technique tries to lower the computation time and the memory requirements of the attention mechanism by taking a smaller portion of the inputs into consideration instead of the entire input sequence, producing a sparse matrix rather than a full one.
- Linearized attention: By disentangling the attention matrix with kernel feature maps, this technique computes the attention in reverse order to reduce the resource requirements to linear complexity.
- Prototype and memory compression: This line of modification reduces the number of queries and key-value pairs to obtain a smaller attention matrix, which in turn reduces the time and space complexity.
- Low-rank self-attention: This direction tries to improve the performance of the Transformer by explicitly modeling the low-rank property of the self-attention matrix through parameterization or by replacing it with a low-rank approximation.
- Attention with prior: Leveraging a prior attention distribution from other sources, this approach combines other attention distributions with the one obtained from the inputs.
- Modified multi-head mechanism: The numerous ways to modify and improve the performance of the multi-head mechanism are grouped under this research direction.
3.1. Sparse attention
The standard self-attention mechanism in a Transformer requires every token to attend to all other tokens. However, it has been observed that in many cases the attention matrix is very sparse, meaning that only a small number of tokens actually attend to one another [2]. This suggests that the computational complexity of self-attention can be reduced by limiting the number of query-key pairs that each query attends to. By computing the similarity scores only for pre-defined patterns of query-key pairs, the amount of computation required can be reduced significantly without sacrificing performance.
In the un-normalized attention matrix Â, the −∞ entries are typically not stored in memory, in order to reduce the memory footprint of the implementation and improve the efficiency of the system.
We can map the attention matrix to a bipartite graph, where the standard attention mechanism corresponds to a complete bipartite graph: each query receives information from all the nodes in the memory and uses this information to update its representation. In this way, each query attends to every other node in the memory and incorporates its information into its own representation, allowing the model to capture complex relationships and dependencies between the nodes in the memory. The sparse attention mechanism, on the other hand, corresponds to a sparse graph: not all nodes are connected, which reduces the computational complexity of the system and improves its efficiency. By limiting the number of connections between nodes, sparse attention can still capture the important relationships and dependencies, but with less computational overhead.
There are two main classes of approaches to sparse attention, based on the metric used to determine the sparse connections between nodes [1]: position-based and content-based sparse attention.
3.1.1. Position-based sparse attention
In this type of attention, the connections in the attention matrix are limited according to predetermined patterns. These patterns can be expressed as combinations of simpler ones, which is useful for understanding and analyzing the behavior of the attention mechanism.
3.1.1.1. Atomic sparse attention: There are five basic atomic sparse attention patterns that can be used to construct a variety of different sparse attention mechanisms with different trade-offs between computational complexity and performance, as shown in Fig. 4. A minimal mask-based sketch combining two of these patterns follows the list below.
- Global attention: Global nodes act as an information hub across all other nodes; they can attend to all other nodes in the sequence and vice versa, as in Fig. 4 (a).
- Band attention (also sliding-window or local attention): The relationships and dependencies between different parts of the data are often local rather than global. In band attention, the attention matrix is a band matrix, with each query attending only to a certain number of neighboring nodes on either side, as shown in Fig. 4 (b).
- Dilated attention: Similar to how dilated convolutional neural networks (CNNs) increase the receptive field without increasing computational complexity, the same can be done with band attention by using a dilated window with gaps of dilation w_d >= 1, as shown in Fig. 4 (c). This can also be extended to strided attention, where the dilation w_d is assumed to be a large value.
- Random attention: To improve the ability of the attention mechanism to capture non-local interactions, a few edges can be randomly sampled for each query, as depicted in Fig. 4 (d).
- Block local attention: The input sequence is segmented into several non-intersecting query blocks, each of which is associated with a local memory block. The queries within each query block only attend to the keys in the corresponding memory block, as shown in Fig. 4 (e).
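To make the atomic patterns concrete, here is a minimal sketch that builds boolean masks for band (sliding-window) and global attention and applies them to scaled dot-product attention. The window size, the choice of global token indices, and the tensor shapes are illustrative assumptions rather than values from any particular paper, and a practical implementation would avoid materializing the full score matrix.

```python
import torch

def banded_global_attention(q, k, v, window=2, global_idx=(0,)):
    """Scaled dot-product attention restricted to a band + global sparse pattern."""
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                         # full (seq_len, seq_len) scores, for clarity only

    # Band mask: position i may attend to j only if |i - j| <= window.
    idx = torch.arange(seq_len)
    band = (idx[None, :] - idx[:, None]).abs() <= window

    # Global mask: the chosen tokens attend to everything and everything attends to them.
    glob = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    glob[list(global_idx), :] = True
    glob[:, list(global_idx)] = True

    mask = band | glob
    scores = scores.masked_fill(~mask, float("-inf"))   # disallowed pairs become -inf (not stored in practice)
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(8, 16)
print(banded_global_attention(q, k, v).shape)           # torch.Size([8, 16])
```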
3.1.1.2. Compound sparse attention: As illustrated in Fig. 5, many existing sparse attention mechanisms are composed of more than one of the atomic patterns described above.
3.1.1.3. Extended sparse attention: Other types of patterns have also been explored for specific data types. For example, BP-Transformer [3] uses a binary tree to capture a combination of global and local attention across the input sequence: tokens are leaf nodes, and the internal nodes are span nodes containing multiple tokens. Fig. 6 shows a variety of extended sparse attention patterns.
3.1.2. Content-based sparse attention
In this approach, a sparse graph is constructed where the sparse connections are conditioned on the inputs: for a given query, the keys that have high similarity scores with it are selected. An efficient way to construct this graph is to use Maximum Inner Product Search (MIPS), which finds the keys with the largest dot product with the query without computing all dot products.
Routing Transformer [4], as shown in Fig. 7, equips the self-attention mechanism with a sparse routing module by using online k-means clustering to assign keys and queries to the same set of centroid vectors; each query only attends to keys within its own cluster. Reformer [5] uses locality-sensitive hashing (LSH) instead of dot-product attention to select keys and values for each query, enabling queries to attend only to tokens in the same bucket, where the buckets are derived from the queries and keys using LSH. Sparse Adaptive Connection (SAC) [6] uses an LSTM edge predictor to construct a graph from the input sequence and learns attention edges that enhance task-specific performance through an adaptive sparse connection.
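As a rough illustration of content-based sparsity, the sketch below lets each query keep only its top-k highest-scoring keys and masks out the rest. This is a deliberate simplification of the methods above: the real approaches use clustering, LSH, or learned edge predictors precisely to avoid computing all query-key scores, whereas this toy version computes them and then discards most of them. The value of k and the tensor shapes are arbitrary.

```python
import torch

def topk_sparse_attention(q, k, v, k_keep=4):
    """Each query attends only to its k_keep most similar keys (content-based sparsity)."""
    scores = q @ k.T / q.shape[-1] ** 0.5           # full scores, computed here only for illustration
    topk = scores.topk(k_keep, dim=-1)              # strongest keys per query
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk.indices, topk.values)  # keep only the selected scores
    return torch.softmax(masked, dim=-1) @ v

q, k, v = torch.randn(6, 8), torch.randn(10, 8), torch.randn(10, 8)
print(topk_sparse_attention(q, k, v).shape)         # torch.Size([6, 8])
```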
3.2. Linearized attention
The computational complexity of the dot-product attention mechanism (softmax(QK^⊤)V) increases quadratically with the spatiotemporal size (length) of the input, which impedes its use on large inputs such as videos, long sequences, or high-resolution images. By disentangling softmax(QK^⊤) into Q′K′^⊤, the product Q′K′^⊤V can be computed in reverse order, i.e., as Q′(K′^⊤V), resulting in linear complexity O(T).
Assuming Â = exp(QK^⊤) denotes the un-normalized attention matrix, where exp(·) is applied element-wise, linearized attention is a technique that approximates exp(QK^⊤) with φ(Q)φ(K)^⊤, where φ is a row-wise feature map. We can then compute φ(Q)(φ(K)^⊤V), a linearized computation of the un-normalized attention matrix, as illustrated in Fig. 8.
To gain a deeper understanding of linearized attention, I will explore the formulation in vector form and examine the general form of attention to gain further insight.
In this context, sim(·, ·) is a scoring function that measures the similarity between input vectors. In the vanilla Transformer, the scoring function is the exponential of the inner product, exp(⟨·, ·⟩). A natural choice for sim(·, ·) is a kernel function K(x, y) = φ(x)φ(y)^⊤, which leads to further insight into linearized attention.
In this formulation, the outer product of vectors is denoted by ⊗. Attention can be linearized by first computing the highlighted terms, which allows autoregressive models, i.e., Transformer decoders, to run like RNNs.
Eq. 2 shows that a memory matrix is maintained by aggregating associations from outer products of feature-mapped keys and values; it is later queried by multiplying the memory matrix with the feature-mapped query, with proper normalization.
This approach consists of two foundational components:
- Feature map φ(·): the kernel feature map used by each attention implementation (e.g., φ(x) = elu(x) + 1 proposed in the Linear Transformer).
- Aggregation rule: aggregating the associations {φ(k_i) ⊗ v_i} into the memory matrix by simple summation. A minimal sketch combining both components follows.
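Here is a minimal, non-causal sketch of the two components above, using the elu(x) + 1 feature map of the Linear Transformer; the shapes, the absence of causal masking, and the small epsilon for numerical stability are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Approximates softmax(QK^T)V with phi(Q) (phi(K)^T V), phi(x) = elu(x) + 1.
    Computing phi(K)^T V first gives linear rather than quadratic cost in sequence length."""
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1       # row-wise feature maps
    kv = phi_k.T @ v                                # (d, d_v) memory: sum of phi(k_i) outer v_i
    z = phi_q @ phi_k.sum(dim=0, keepdim=True).T    # (n, 1) normalization term
    return (phi_q @ kv) / (z + eps)

q = k = v = torch.randn(128, 64)
print(linear_attention(q, k, v).shape)              # torch.Size([128, 64])
```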
3.3. Query prototyping and memory compression
Apart from using sparse attention or kernel-based linearized attention, it is also feasible to reduce the complexity of attention by decreasing the number of queries or key-value pairs, which leads to query prototyping and memory compression techniques, respectively.
3.3.1. Attention with prototype queries: Attention with prototype queries uses a set of query prototypes as the primary basis for computing attention distributions. The model either copies the computed distributions to the positions of the queries they represent, or fills those positions with discrete uniform distributions. The flow of computation in this process is depicted in Fig. 9(a).
Clustered Attention [7] aggregates queries into several clusters and computes attention distributions for the cluster centroids. All queries within a cluster are assigned the attention distribution calculated for their corresponding centroid.
Informer [8] selects query prototypes using an explicit query sparsity measurement, derived from an approximation of the Kullback-Leibler divergence between a query's attention distribution and the discrete uniform distribution. Attention distributions are then calculated only for the top-u queries under this measurement, with the remaining queries assigned discrete uniform distributions.
3.3.2. Attention with compressed key-value memory: This technique reduces the complexity of the attention mechanism in the Transformer by reducing the number of key-value pairs before applying attention, i.e., by compressing the key-value memory, as shown in Fig. 9(b). The compressed memory is then used to compute the attention scores. This can significantly reduce the computational cost of attention while maintaining good performance on various NLP tasks.
Liu et al. [9] propose Memory Compressed Attention (MCA), which uses strided convolution to reduce the number of keys and values. MCA is used alongside local attention, which is also proposed in the same paper. By reducing the number of keys and values by a factor of the kernel size, MCA is able to capture global context and process longer sequences than the standard Transformer with the same computational resources.
Set Transformer [10] and Luna [11] utilize external trainable global nodes to condense information from the inputs. The condensed representations then serve as a compressed memory that the inputs attend to, effectively reducing the quadratic complexity of self-attention to linear complexity with respect to the length of the input sequence.
Linformer [12] reduces the computational complexity of self-attention to linear by linearly projecting keys and values from the length n down to a smaller length n_k. The drawback of this approach is that the input sequence length must be assumed in advance, making it unsuitable for autoregressive attention.
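A minimal sketch of this projection idea is shown below: keys and values are projected along the sequence dimension down to a fixed length before standard attention is applied. The projection matrices here are random for illustration, whereas in Linformer they are learned, and the fixed projected length is exactly why a pre-determined sequence length is required.

```python
import torch

n, k_len, d = 512, 64, 32                  # sequence length, projected length, head dim (assumed)
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

E = torch.randn(k_len, n) / n ** 0.5       # projection for keys (learned in the actual model)
F = torch.randn(k_len, n) / n ** 0.5       # projection for values (learned in the actual model)

k_proj, v_proj = E @ k, F @ v              # (k_len, d): sequence dimension compressed
scores = torch.softmax(q @ k_proj.T / d ** 0.5, dim=-1)   # (n, k_len) instead of (n, n)
print((scores @ v_proj).shape)             # torch.Size([512, 32])
```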
Poolingformer [13] employs a two-level attention mechanism that combines sliding-window attention with compressed memory attention, where the compressed memory attention enlarges the receptive field. To reduce the number of keys and values, several pooling operations are explored, including max pooling and Dynamic Convolution-based pooling.
3.4. Low-rank self-attention
According to empirical and theoretical analyses by various researchers [14, 12], the self-attention matrix A ∈ R^(T×T) exhibits low-rank characteristics in many cases. This observation has two implications. First, the low-rank nature can be explicitly modeled with parameterization, which could lead to new models that leverage this property to improve performance. Second, instead of the full self-attention matrix, a low-rank approximation could be used in its place, enabling more efficient computation and further improving the scalability of self-attention-based models.
3.4.1. Low-rank parameterization: When the rank of the attention matrix is lower than the sequence length, over-parameterizing the model by setting D_k > T would lead to overfitting in situations where the input is typically short. It is therefore sensible to limit the dimension D_k and use the low-rank property as an inductive bias. To this end, Guo et al. [14] propose decomposing the self-attention matrix into a low-rank attention module with a small D_k that captures long-range non-local interactions, and a band attention module that captures local dependencies. This approach is helpful in scenarios where the input is short and requires effective modeling of both local and non-local dependencies.
3.4.2. Low-rank approximation: The low-rank property of the attention matrix can also be leveraged to reduce the complexity of self-attention by using a low-rank matrix approximation. This technique is closely related to the low-rank approximation of kernel matrices, and some existing works are inspired by kernel approximation. For instance, Performer, as discussed in Section 3.2, uses a random feature map originally proposed to approximate Gaussian kernels to decompose the attention distribution matrix A into C_Q G C_K, where G is a Gaussian kernel matrix and the random feature map approximates G.
An alternative approach to exploiting the low-rank property of attention matrices is to use Nyström-based methods [15, 16]. In these methods, a subset of landmark nodes is selected from the input sequence using down-sampling techniques such as strided average pooling. The selected landmarks are then used as queries and keys to approximate the attention matrix. Specifically, the attention computation involves a softmax normalization of the product of the original queries with the selected keys, followed by the product of the selected queries with the normalized result. This can be expressed as:
Note that the inverse M^(-1) = (softmax(Q̃K̃^⊤))^(-1) may not always exist, but this issue can be mitigated in various ways. For instance, CSALR [15] adds an identity matrix to M to ensure the inverse always exists, while Nyströmformer [16] uses the Moore-Penrose pseudoinverse of M to handle singular cases.
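The sketch below approximates the attention matrix with landmarks obtained by segment-wise average pooling and uses the Moore-Penrose pseudoinverse for the middle factor, in the spirit of Nyströmformer; the number of landmarks, the pooling choice, and the assumption that the sequence length is divisible by it are all illustrative.

```python
import torch

def nystrom_attention(q, k, v, n_landmarks=8):
    """Nystrom-style approximation:
    softmax(Q K~^T) pinv(softmax(Q~ K~^T)) softmax(Q~ K^T) V, with landmarks Q~, K~."""
    n, d = q.shape
    # Landmarks via average pooling over contiguous segments (one simple down-sampling choice).
    q_tilde = q.reshape(n_landmarks, n // n_landmarks, d).mean(dim=1)
    k_tilde = k.reshape(n_landmarks, n // n_landmarks, d).mean(dim=1)

    scale = d ** 0.5
    kernel_1 = torch.softmax(q @ k_tilde.T / scale, dim=-1)         # (n, m)
    kernel_2 = torch.softmax(q_tilde @ k_tilde.T / scale, dim=-1)   # (m, m), the matrix M
    kernel_3 = torch.softmax(q_tilde @ k.T / scale, dim=-1)         # (m, n)

    # The pseudoinverse handles the case where M is singular.
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ kernel_3 @ v

q = k = v = torch.randn(64, 32)
print(nystrom_attention(q, k, v).shape)   # torch.Size([64, 32])
```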
3.5. Attention with prior
The attention mechanism is a way of focusing on specific parts of an input sequence. It does this by generating a weighted sum of the vectors in the sequence, where the weights are determined by an attention distribution. The attention distribution can be generated from the inputs, or it can come from other sources, such as prior knowledge. Usually, the attention distribution from the inputs and the prior attention distribution are combined by computing a weighted sum of their scores before applying the softmax, thus allowing the network to learn from both the inputs and the prior knowledge.
3.5.1. Prior that models locality: To model the locality of certain types of data, such as text, a Gaussian distribution over positions can be used as prior attention. This amounts to multiplying the generated attention distribution with a Gaussian density and renormalizing, or equivalently adding a bias term G to the generated attention scores, where a higher G_ij indicates a higher prior probability that the i-th query attends to the j-th input.
Yang et al. [17] propose predicting a central position for each query and defining the Gaussian bias accordingly:
where σ denotes the standard deviation of the Gaussian. The bias is defined as the negative of the squared distance between the central position and the input position, divided by the standard deviation term; the standard deviation can be set as a hyperparameter or predicted from the inputs.
The Gaussian Transformer [18] assumes that the central position for each input query q_i is i, and defines the bias term G_ij for the generated attention scores as
where w is a non-negative scalar parameter controlling the deviation and b is a negative scalar parameter reducing the weight for the central position.
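The following sketch adds a Gaussian locality bias of the form G_ij = −(j − p_i)² / (2σ²) to the attention logits before the softmax. Taking the central position p_i to be the query position i itself and treating σ as a fixed hyperparameter are simplifying assumptions; in the works above these quantities can be predicted or learned.

```python
import torch

def attention_with_gaussian_prior(q, k, v, sigma=2.0):
    """Scaled dot-product attention with a Gaussian locality prior added to the logits."""
    n, d = q.shape
    scores = q @ k.T / d ** 0.5
    pos = torch.arange(n, dtype=torch.float32)
    # G_ij = -(j - i)^2 / (2 sigma^2): larger (less negative) for keys close to the query position.
    bias = -((pos[None, :] - pos[:, None]) ** 2) / (2 * sigma ** 2)
    return torch.softmax(scores + bias, dim=-1) @ v

q = k = v = torch.randn(10, 16)
print(attention_with_gaussian_prior(q, k, v).shape)   # torch.Size([10, 16])
```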
3.5.2. Prior from lower modules: In the Transformer architecture, the attention distributions of adjacent layers are often found to be similar. It is therefore reasonable to use the attention distribution from a lower layer as a prior when computing attention in a higher layer. This can be achieved by combining the attention scores of the current layer with a weighted sum of the previous layer's attention scores, passed through a translation function that maps the previous scores to the prior to be applied.
where A^(l) represents the attention scores of the l-th layer, w_1 and w_2 control the relative importance of the previous and current attention scores, and the function g: R^(n×n) → R^(n×n) translates the previous attention scores into a prior to be applied to the current attention scores.
The Predictive Attention Transformer [19] applies a 2D convolutional layer to the previous attention scores and computes the final attention scores as a convex combination of the generated attention scores and the convolved scores. In other words, the weight parameters for the generated and convolved scores are set to α and 1−α, respectively, and the function g(·) in Eq. (6) is a convolutional layer. The paper presents experiments showing that both training the model from scratch and fine-tuning it after adapting a pre-trained BERT model lead to improvements over baseline models.
The Realformer model proposed in [20] introduces a residual skip connection on attention maps by directly adding the previous attention scores to the newly generated ones, which corresponds to setting w_1 = w_2 = 1 and g(·) to the identity map in Eq. (6). The authors conduct pre-training experiments with this model and report that it outperforms the baseline BERT model on multiple datasets, even with significantly lower pre-training budgets.
Lazyformer [21] proposes an approach in which attention maps are shared between adjacent layers to reduce computational costs. This is achieved by setting g(·) to the identity and alternately switching between the settings w_1 = 0, w_2 = 1 and w_1 = 1, w_2 = 0, so that attention maps are computed only once and reused in the succeeding layers. The pre-training experiments conducted for Lazyformer show that the model is not only efficient but also effective, outperforming the baseline models with significantly lower computation budgets.
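A minimal sketch of the residual-scores idea in Eq. (6), with g(·) as the identity map: each call returns its raw logits so the next layer can reuse them as a prior. The single-head, unbatched shapes and the fixed weights w_1 and w_2 are simplifying assumptions.

```python
import torch

def attention_with_score_prior(q, k, v, prev_scores=None, w1=1.0, w2=1.0):
    """Combines the current layer's attention logits with the previous layer's logits
    (prior from lower modules), then applies softmax and aggregates the values."""
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5
    combined = w2 * scores if prev_scores is None else w1 * prev_scores + w2 * scores
    return torch.softmax(combined, dim=-1) @ v, combined

# Two stacked calls: the second "layer" receives the first layer's logits as a prior.
q = k = v = torch.randn(12, 16)
out1, logits1 = attention_with_score_prior(q, k, v)
out2, _ = attention_with_score_prior(q, k, v, prev_scores=logits1)
print(out1.shape, out2.shape)   # torch.Size([12, 16]) torch.Size([12, 16])
```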
3.5.3. Prior as multi-task adapters: This approach uses trainable attention priors that enable efficient parameter sharing across tasks [22]. The Conditionally Adaptive Multi-Task Learning (CAMTL) [23] framework is a multi-task learning technique that allows efficient sharing of pre-trained models between tasks. CAMTL uses a trainable attention prior, which depends on a task encoding, to act as an adapter for multi-task inductive knowledge transfer. Specifically, the attention prior is represented as a block-diagonal matrix that is added to the attention scores of the upper layers of pre-trained Transformers:
in which ⊕ represents the direct sum, the A_i are trainable parameters with dimensions (n/m)×(n/m), and γ_i and β_i are Feature-wise Linear Modulation functions [24] with input dimension R^(D_z) and output dimension (n/m)×(n/m). CAMTL specifies a maximum sequence length n_max in its implementation. The trainable prior, organized as a block-diagonal matrix for efficient computation, is added to the attention scores of the upper layers of pre-trained Transformers, creating an adapter that allows for parameter-efficient multi-task inductive knowledge transfer.
3.5.4. Attention with only prior: Zhang et al. [25] have developed an alternative approach to the attention distribution that does not rely on pair-wise interactions between inputs. Their method, called the "average attention network," uses a discrete uniform distribution as the sole source of the attention distribution, and the values are aggregated as a cumulative average of all values. To enhance the network's expressiveness, a feed-forward gating layer is added on top of the average attention module. The benefit of this approach is that the modified Transformer decoder can be trained in parallel and can decode like an RNN, avoiding the O(T²) complexity associated with decoding.
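Since the uniform prior makes the attention weights independent of the content, the core of the average attention network reduces to a running mean of the values, as in the minimal sketch below; the feed-forward gating layer of the original model is omitted here.

```python
import torch

def average_attention(v):
    """Average attention: position t attends uniformly to positions 1..t, so the
    output is simply the cumulative mean of the values up to each position."""
    t = torch.arange(1, v.shape[0] + 1, dtype=v.dtype).unsqueeze(-1)   # (T, 1) position counts
    return v.cumsum(dim=0) / t

v = torch.randn(6, 8)
print(average_attention(v).shape)   # torch.Size([6, 8])
```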
Similar to Yang et al. [17] and Guo et al. [18], which use a fixed local window for the attention distribution, You et al. [26] incorporate a hardcoded Gaussian distribution for the attention calculation. However, they completely ignore the calculated attention and use only the Gaussian distribution, whose mean and variance are hyperparameters. When applied to self-attention, this can produce results close to the baseline models on machine translation tasks.
Synthesizer [27] proposes a novel way of generating attention scores in Transformers. Instead of the traditional way of generating attention scores, it replaces them with two variants: (1) learnable, randomly initialized attention scores, and (2) attention scores output by a feed-forward network that is only conditioned on the input being queried. Experiments on machine translation and language modeling tasks show that these variants perform comparably to the standard Transformer. However, why these variants work is not fully explained, leaving room for further investigation.
3.6. Improved multi-head mechanism
Multi-head attention is a powerful technique because it allows a model to attend to different parts of the input simultaneously. However, there is no guarantee that each attention head will learn unique and complementary features. Consequently, some researchers have explored methods to ensure that each attention head captures distinct information.
3.6.1. Head behavior modeling: Multi-head attention is a valuable tool in natural language processing models because it enables the simultaneous processing of multiple inputs and feature representations [28]. However, the vanilla Transformer lacks a mechanism to ensure that different attention heads capture distinct and non-redundant features, and there is no provision for interaction among the heads. To address these limitations, recent research has focused on introducing novel mechanisms that either guide the behavior of attention heads or enable interaction between them.
To promote diversity among the attention heads, Li et al. [29] propose an additional regularization term in the loss function. This regularization consists of three parts: the first two aim to maximize the cosine distances between the input subspaces and between the output representations, while the third encourages dispersion of the positions attended by the different heads through element-wise multiplication of their corresponding attention matrices. By adding this auxiliary term, the model is encouraged to learn a more diverse set of attention patterns across heads, which can improve its performance on various tasks.
Numerous studies have shown that pre-trained Transformer models exhibit certain self-attention patterns that do not align well with natural language processing. Kovaleva et al. [30] identify several such patterns in BERT, including attention heads that focus exclusively on the special tokens [CLS] and [SEP]. To improve training, Deshpande and Narasimhan [31] suggest using an auxiliary loss function that measures the Frobenius norm between the attention distribution maps and predefined attention patterns, introducing constraints that encourage more meaningful attention patterns.
In the paper by Shen et al. [32], a new mechanism called Talking-Heads Attention is introduced, which aims to let the model transfer information between attention heads in a learnable manner. The generated attention scores are linearly projected across the head dimension to a new space with h_k heads, softmax is applied in this space, and the results are then projected to another space with h_v heads for value aggregation. In this way, the attention mechanism can learn to dynamically transfer information between the different attention heads, leading to improved performance on various natural language processing tasks.
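The following sketch mixes the attention logits across heads before the softmax and mixes the resulting weights again afterwards, which is the essence of the talking-heads idea; all head counts, shapes, and the use of random mixing matrices (they would be learned in practice) are illustrative assumptions.

```python
import torch

def talking_heads_attention(q, k, v, proj_logits, proj_weights):
    """Attention logits are mixed across heads before softmax (h -> h_k) and the
    normalized weights are mixed again afterwards (h_k -> h_v) for value aggregation."""
    d = q.shape[-1]
    logits = torch.einsum("hqd,hkd->hqk", q, k) / d ** 0.5        # (h, n, n)
    logits = torch.einsum("hqk,hl->lqk", logits, proj_logits)     # mix across heads -> (h_k, n, n)
    weights = torch.softmax(logits, dim=-1)
    weights = torch.einsum("lqk,lm->mqk", weights, proj_weights)  # mix again -> (h_v, n, n)
    return torch.einsum("mqk,mkd->mqd", weights, v)               # aggregate h_v value heads

h, h_k, h_v, n, d = 4, 6, 5, 10, 16
q, k = torch.randn(h, n, d), torch.randn(h, n, d)
v = torch.randn(h_v, n, d)
out = talking_heads_attention(q, k, v, torch.randn(h, h_k), torch.randn(h_k, h_v))
print(out.shape)   # torch.Size([5, 10, 16])
```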
Collaborative Multi-head Attention, proposed in [33], uses shared query and key projections, W^q and W^k, together with a mixing vector m_i that filters the projection parameters for the i-th head. The attention computation is adapted accordingly, resulting in a modified Eq. (3),
where all heads share W^q and W^k.
3.6.2. Multi-head with restricted spans:
The vanilla attention mechanism typically assumes full attention spans, allowing a query to attend to all key-value pairs. However, it has been observed that some attention heads tend to focus on local contexts, while others attend to broader contexts. Consequently, it can be advantageous to impose constraints on attention spans for specific purposes:
- Locality: Restricting attention spans explicitly imposes local constraints, which can be helpful in scenarios where locality is an important consideration.
- Efficiency: Appropriately implemented, such a model can scale to longer sequences without introducing additional memory usage or computational time.
Restricting attention spans amounts to multiplying each attention distribution value with a mask value and then re-normalizing. The mask value is determined by a non-increasing function that maps a distance to a value in the range [0, 1]. In vanilla attention, a mask value of 1 is assigned for all distances, as illustrated in Fig. 12(a).
Sukhbaatar et al. [34] propose a learnable attention span, depicted in Fig. 12(b). The mask is parameterized by a learnable scalar z combined with a hyperparameter R, which adaptively modulates the attention span. Experiments on character-level language modeling show that these adaptive-span models outperform the baselines while requiring significantly fewer FLOPS. Notably, lower layers tend to learn smaller spans while higher layers learn larger ones, suggesting that the model can autonomously learn a hierarchical composition of features.
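A minimal sketch of this masking scheme is shown below: attention weights are multiplied by the soft span mask m_z(x) = clamp((R + z − x) / R, 0, 1), where x is the query-key distance, and then re-normalized. Treating z as a single scalar shared by all positions (rather than one per head) is a simplifying assumption.

```python
import torch

def adaptive_span_mask(distances, z, R=4.0):
    """Soft span mask m_z(x) = clamp((R + z - x) / R, 0, 1): 1 inside the span,
    0 beyond it, with a linear ramp of width R in between."""
    return torch.clamp((R + z - distances) / R, min=0.0, max=1.0)

def attention_with_span(q, k, v, z, R=4.0):
    n, d = q.shape
    weights = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
    pos = torch.arange(n, dtype=torch.float32)
    dist = (pos[:, None] - pos[None, :]).abs()
    masked = weights * adaptive_span_mask(dist, z, R)
    masked = masked / (masked.sum(dim=-1, keepdim=True) + 1e-8)   # re-normalize
    return masked @ v

z = torch.tensor(6.0, requires_grad=True)     # learnable span (one per head in the original work)
q = k = v = torch.randn(16, 32)
print(attention_with_span(q, k, v, z).shape)  # torch.Size([16, 32])
```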
The Multi-Scale Transformer [35] takes a different approach to attention spans. Unlike vanilla attention, which assumes a uniform attention span across all heads, it uses a fixed attention span whose scale differs across layers. As illustrated in Fig. 12(c), the fixed attention span acts as a window that can be scaled up or down, controlled by a scale value w.
The scale values vary, with higher layers favoring larger scales for broader contextual dependencies and lower layers opting for smaller scales for more localized attention, as shown in Fig. 13. Experimental results show that the Multi-Scale Transformer outperforms baseline models on various tasks, demonstrating its potential for more efficient and effective language processing.
3.6.3. Multi-head with refined aggregation:
The vanilla multi-head attention mechanism, as proposed by Vaswani et al. [28], computes multiple attention heads in parallel to generate individual output representations. These representations are then concatenated and subjected to a linear transformation, as defined in Eq. (11), to obtain the final output representation. By combining Eqs. (10), (11), and (12), it can be observed that this concatenate-and-project formulation is equivalent to a summation over re-parameterized attention head outputs. This allows for efficient aggregation of the individual head outputs, enabling the model to capture complex dependencies and relationships in the input data.
and
where
To facilitate the aggregation, the weight matrix W^o ∈ R^(D_m×D_m) used for the linear transformation is partitioned into H blocks, where H is the number of attention heads.
The weight matrix W^o_i, with dimension D_v × D_m, is used for the linear transformation of each attention head, allowing for re-parameterized attention outputs through the concatenate-and-project formulation, as defined in Eq. (14):
Some researchers argue that this simple aggregate-by-summation approach does not fully leverage the expressive power of multi-head attention and that a more elaborate aggregation scheme could be more desirable.
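As a small sanity check on the equivalence described above, the following snippet verifies numerically that concatenating the head outputs and projecting with W^o gives the same result as summing the per-head projections with the blocks W^o_i; all shapes are arbitrary illustrative choices.

```python
import torch

# Concatenate-and-project vs. sum of per-head projections with the blocks of W^o.
H, n, D_v, D_m = 4, 5, 8, 32
heads = [torch.randn(n, D_v) for _ in range(H)]      # per-head outputs
W_o = torch.randn(H * D_v, D_m)                      # output projection

concat_project = torch.cat(heads, dim=-1) @ W_o
blocks = W_o.split(D_v, dim=0)                       # W^o partitioned into H blocks of size D_v x D_m
sum_of_heads = sum(h @ b for h, b in zip(heads, blocks))

print(torch.allclose(concat_project, sum_of_heads, atol=1e-4))   # True
```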
Gu and Feng [36] and Li et al. [37] propose employing routing methods originally conceived for capsule networks [38] to further aggregate the information derived from distinct attention heads. The outputs of the attention heads are transformed into input capsules and passed through an iterative routing procedure to obtain output capsules, which are then concatenated to serve as the final output of the multi-head attention mechanism. Notably, the dynamic routing [38] and EM routing [39] mechanisms employed in these works introduce additional parameters and computational overhead. Nevertheless, Li et al. [37] empirically show that selectively applying the routing mechanism to the lower layers of the model achieves an optimal balance between translation performance and computational efficiency.
3.6.4. Other multi-head modifications:
In addition to the aforementioned modifications, several other approaches have been proposed to enhance the performance of the multi-head attention mechanism. Shazeer [40] introduced multi-query attention, where the key-value pairs are shared among all attention heads. This reduces the memory bandwidth requirements during decoding and results in faster decoding, albeit with minor quality degradation compared to the baseline. On the other hand, Bhojanapalli et al. [41] point out that the size of the attention keys can limit their ability to represent arbitrary distributions. To address this, they propose disentangling the head size from the number of heads, contrary to the standard practice of setting the head size to D_m/h, where D_m is the model dimension and h is the number of heads.
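A minimal sketch of the multi-query idea follows: every query head attends to a single shared key/value head, which shrinks the key/value memory that has to be read at each decoding step. The shapes and the absence of causal masking and output projection are simplifying assumptions.

```python
import torch

def multi_query_attention(q, k, v):
    """Multi-query attention: h query heads share one key head and one value head.
    q: (h, n, d); k, v: (n, d) shared across all heads."""
    d = q.shape[-1]
    scores = torch.einsum("hqd,kd->hqk", q, k) / d ** 0.5   # (h, n, n)
    return torch.softmax(scores, dim=-1) @ v                # shared values broadcast over heads

h, n, d = 8, 10, 16
q, k, v = torch.randn(h, n, d), torch.randn(n, d), torch.randn(n, d)
print(multi_query_attention(q, k, v).shape)                 # torch.Size([8, 10, 16])
```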