The hunt to refine neural networks for practical applications traces its roots back to the foundational days of the sector. When Rumelhart, Hinton, and Williams first demonstrated methods to use the backpropagation algorithm to successfully train multi-layer neural networks that might learn complex, non-linear representations in 1986, the vast potential of those models became apparent. Nevertheless, the computational power available within the Eighties restricted their practical use and the complexity of problems they might solve, a situation which mirrors the challenges we face with deploying LLMs today. Although the dimensions of models and the considerations being made were very different, early discoveries in network minimization would pave the best way for large wins in model compression a long time later. On this section, we take a transient journey through the history and motivations driving pruning research, discover the comparative strengths and weaknesses of unstructured versus structured methods, and prepare ourselves to explore their use in the trendy era of LLMs.
Network pruning was originally motivated by the pursuit of higher model generalization through freezing unimportant weights at zero, somewhat akin in theory to L1/Lasso and L2/Ridge regularization in linear regression, though different in that weights are chosen and hard-set to zero (pruned) after training based on an importance criteria quite than being coaxed towards zero mathematically by the loss function during training (informed readers will know that regularization may also be achieved in neural network training using weight decay).
The common motivation behind each regularization and pruning (which might be seen as a type of regularization) is the theoretical and empirical evidence that neural networks are best at learning when overparameterized because of a higher-dimensional manifold of the loss function’s global minima and a bigger exploration space during which effective subnetworks usually tend to be initialized (see “the lottery ticket hypothesis”). Nevertheless, this overparameterization in turn results in overfitting on the training data, and ultimately ends in a network with many redundant or inactive weights. Although the theoretical mechanisms underlying the “unreasonable effectiveness” of overparameterized neural networks were less well studied on the time, researchers within the Eighties accurately hypothesized that it needs to be possible to remove a big portion of the network weights after training without significantly affecting task performance, and that performing iterative rounds of pruning and fine-tuning the remaining model weights should lead to higher generalization, enhancing the model’s ability to perform well on unseen data.
Unstructured Pruning
To pick out parameters for removal, a measure of their impact on the associated fee function, or “saliency,” is required. While the earliest works in network minimization worked under the idea that the magnitude of parameters should function an acceptable measure of their saliency, LeCun et al. made a major step forward in 1989 with “Optimal Brain Damage” (OBD), during which they proposed to make use of a theoretically justifiable measure of saliency using second-derivative information of the associated fee function with respect to the parameters, allowing them to directly discover the parameters which might be removed with the least increase in error.
Written within the era when the model of interest was a fully-connected neural network containing just 2,600 parameters, the authors of OBD were less concerned about removing weights as a consequence of computational efficiency than we’re today with our billionaire behemoths, and were more enthusiastic about improving the model’s ability to generalize to unseen data by reducing model complexity. Even operating on a tiny model like this, nevertheless, the calculation of second-derivative information (Hessian matrix) may be very expensive, and required the authors to make three convenient mathematical assumptions: 1) that the model is currently trained to an optimum, meaning the gradient of the loss with respect to each weight is currently zero and the slope of the gradient is positive in each directions, which zeroes out the first-order term of the Taylor expansion and implies the change in loss attributable to pruning any parameter is positive, 2) that the Hessian matrix is diagonal, meaning the change in loss attributable to removal of every parameter is independent, and subsequently the loss deltas might be summed over subset of weights to calculate the overall change in loss attributable to their collective removal, and three) that the loss function is sort of quadratic, meaning higher-order terms might be neglected from the Taylor expansion.
Despite this requisite list of naïve assumptions, their theoretically justified closed-form saliency metric proved itself superior to magnitude-based pruning in accurately identifying the least vital weights in a network, capable of retain more accuracy at higher rates of compression. Nevertheless, the efficacy and profound simplicity of magnitude-based pruning methods would make them the highest alternative for a lot of future research endeavors in model compression, particularly as network sizes began to scale quickly, and Hessians became exponentially more frightening. Still, this successful demonstration of using a theoretically justified saliency measure to more accurately estimate saliency and thereby enable more aggressive pruning provided an inspirational recipe for future victories in model compression, although it could be a while before those seeds bore fruit.
4 years later in 1993, Hassibi et al.’s Optimal Brain Surgeon (OBS) expanded on the concept of OBD and raised the degrees of compression possible without increasing error by eschewing the diagonality assumption of OBD and as an alternative considering the cross-terms throughout the Hessian matrix. This allowed them to find out optimal updates to the remaining weights based on the removal of a given parameter, concurrently pruning and optimizing the model, thereby avoiding the necessity for a retraining phase. Nevertheless, this meant much more complex mathematics, and OBS was thus initially of limited utility to twenty first Century researchers working with much larger networks. Nonetheless, like OBD, OBS would eventually see its legacy revived in future milestones, as we are going to see later.
The pruning methods in OBD and OBS are examples of unstructured pruning, wherein weights are pruned on a person basis based on a measure of their saliency. A contemporary exemplar of unstructured pruning techniques is Han et al. 2015, which reduced the sizes of the early workhorse convolutional neural networks (CNNs) AlexNet and VGG-16 by 9x and 13x, respectively, with no loss in accuracy, using a number of rounds of magnitude-based weight pruning and fine-tuning. Their method unfortunately requires performing sensitivity evaluation of the network layers to find out the very best pruning rate to make use of for every individual layer, and works best when retrained at the least once, which implies it could not scale well to extremely large networks. Nevertheless, it’s impressive to see the degrees of pruning which might be achieved using their unstructured approach, especially since they’re using magnitude-based pruning. As with all unstructured approach, the reduced memory footprint can only be realized by utilizing sparse matrix storage techniques which avoid storing the zeroed parameters in dense matrices. Although they don’t employ it of their study, the authors mention of their related work section that the hashing trick (as demonstrated within the 2015 HashedNets paper) is complementary to unstructured pruning, as increasing sparsity decreases the variety of unique weights within the network, thereby reducing the probability of hash collisions, which results in lower storage demands and more efficient weight retrieval by the hashing function.
While unstructured pruning has the intended regularization effect of improved generalization through reduced model complexity, and the memory footprint can then be shrunk substantially by utilizing sparse matrix storage methods, the gains in computational efficiency offered by this kind of pruning should not so readily accessed. Simply zeroing out individual weights without consideration of the network architecture will create matrices with irregular sparsity that may realize no efficiency gains when computed using dense matrix calculations on standard hardware. Only specialized hardware which is explicitly designed to use sparsity in matrix operations can unlock the computational efficiency gains offered by unstructured pruning. Fortunately, consumer hardware with these capabilities is becoming more mainstream, enabling their users to actualize performance gains from the sparse matrices created from unstructured pruning. Nevertheless, even these specialized hardware units must impose a sparsity ratio expectation on the variety of weights in each matrix row which needs to be pruned with a view to allow for the algorithmic exploitation of the resulting sparsity, often known as semi-structured pruning, and enforcing this constraint has been shown to degrade performance greater than purely unstructured pruning.
Structured Pruning
We’ve seen that unstructured pruning is a well-established regularization technique that is understood to enhance model generalization, reduce memory requirements, and offer efficiency gains on specialized hardware. Nevertheless, the more tangible advantages to computational efficiency are offered by structured pruning, which entails removing entire structural components (filters, layers) from the network quite than individual weights, which reduces the complexity of the network in ways in which align with how computations are performed on hardware, allowing for gains in computational efficiency to be easily realized without specialized kit.
A formative work in popularizing the concept of structured pruning for model compression was the 2016 Li et al. paper “Pruning Filters for Efficient ConvNets,” where, because the title suggests, the authors pruned filters and their associated feature maps from CNNs with a view to greatly improve computational efficiency, because the calculations surrounding these filters might be easily excluded by physically removing the chosen kernels from the model, directly reducing the scale of the matrices and their multiplication operations while not having to fret about exploiting sparsity. The authors used a straightforward sum of filter weights (L1 norm) for magnitude-based pruning of the filters, demonstrating that their method could reduce inferences costs of VGG-16 and ResNet-110 by 34% and 38%, respectively, without significant degradation of accuracy.
Their study also reveals some fascinating insights about how convolutional networks work by comparing the sensitivity of individual CNN layers to pruning, revealing that layers on the very starting or past halfway through the depth of the network were capable of be pruned aggressively with almost no impact on the model performance, but that layers around 1/4 of the best way into the network were very sensitive to pruning and doing so made recovering model performance difficult, even with retraining. The outcomes, shown below, reveal that the layers that are most sensitive to pruning are those containing many filters with large absolute sums, supporting the speculation of magnitude as a saliency measure, as these layers are clearly more vital to the network, since pruning them away causes pronounced negative impact on model performance which is difficult to recuperate.
Most significantly, the outcomes from Li et al. show that many layers in a CNN might be pruned of even as much as 90% of their filters without harming (and in some cases even improving) model performance. Moreover, they found that when pruning filters from the insensitive layers, iterative retraining layer-by-layer was unnecessary, and a single round of pruning and retraining (for 1/4 of the unique training time) was all that was required to recuperate model performance after pruning away significant portions of the network. That is great news by way of efficiency, since multiple rounds of retraining might be costly, and former work had reported requiring as much as 3x original training time to supply their pruned models. Below we are able to see the general results from Li et al. which reveal that the variety of floating point operations (FLOPs) might be reduced between 15 and 40 percent within the CNNs studied without harming performance, and in actual fact offering gains in lots of instances, setting a firm example of the importance of pruning models after training.
Although this study was clearly motivated by efficiency concerns, we all know from a long time of evidence linking reduced model complexity to improved generalization that these networks should perform higher on unseen data as well, a fundamental advantage which motivated pruning research in the primary place. Nevertheless, this pruning method requires a sensitivity evaluation of the network layers with a view to be done accurately, requiring additional effort and computation. Further, as LeCun and his colleagues accurately identified back in 1989: although magnitude-based pruning is a time-tested strategy, we should always expect a theoretically justified metric of salience to supply a superior pruning strategy, but with the scale of contemporary neural networks, computing the Hessian matrix required for the second-order Taylor expansions utilized in their OBD method can be too intensive. Fortunately, a blissful medium was forthcoming.
Trailing Li et al. by only a couple of months in late 2016, Molchanov and his colleagues at Nvidia reinvestigated using Taylor expansion to quantify salience for structured pruning of filters from CNNs. In contrast to OBD, they avoid the complex calculation of the second-order terms, and as an alternative extract a useful measure of saliency by considering the variance quite than the mean of the first-order Taylor expansion term. The study provides empirical comparison of several saliency measures against an “oracle” rating which was computed by exhaustively calculating the change in loss attributable to removing each filter from a fine-tuned VGG-16. In the outcomes shown below, we are able to see that the proposed Taylor expansion saliency measure most closely correlates with the oracle rankings, followed in second place by the more computationally intensive OBD, and the performance results reflect that these methods are also best at preserving accuracy, with the advantage more clearly in favor of the proposed Taylor expansion method when plotting over GFLOPs. Interestingly, the inclusion of random filter pruning of their study shows us that it performs surprisingly well in comparison with minimum weight (magnitude-based) pruning, difficult the notion that weight magnitude is a reliable measure of saliency, at the least for the CNN architectures studied.