
Speed up model inference in production

Introduction
When a Machine Learning model is deployed into production, there are often requirements to be met that are not usually considered during the prototyping phase. For instance, the model in production may have to handle plenty of requests from different users of the product, so you will want to optimize for latency and/or throughput.
- Latency: the time it takes for a task to complete, like how long it takes to load a webpage after you click a link. It is the waiting time between starting something and seeing the result.
- Throughput: how many requests a system can handle in a given amount of time (a short measurement sketch follows this list).
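To make these two metrics concrete, here is a minimal Python sketch (using a dummy `predict` function as a stand-in for a real model and made-up request data) that measures both over a batch of requests:

```python
import time

def predict(batch):
    # Placeholder for a real model call; assume it returns one prediction per input.
    time.sleep(0.01)                     # simulate 10 ms of inference work
    return [0] * len(batch)

requests = [[1.0, 2.0, 3.0]] * 100       # 100 dummy requests

start = time.perf_counter()
latencies = []
for req in requests:
    t0 = time.perf_counter()
    predict([req])
    latencies.append(time.perf_counter() - t0)
total = time.perf_counter() - start

print(f"mean latency: {1000 * sum(latencies) / len(latencies):.1f} ms")
print(f"throughput:   {len(requests) / total:.1f} requests/s")
```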
This means the Machine Learning model needs to be very fast at making its predictions, and there are numerous techniques that help increase the speed of model inference. Let's have a look at some of the most powerful ones in this article.
Some techniques aim to make models smaller, which is why they are called model compression techniques, while others focus on making models faster at inference and thus fall under the umbrella of model optimization.
But making models smaller often also helps with inference speed, so the line separating these two fields of study is quite blurred.
Low Rank Factorization
This is the first technique we will look at, and it is being studied quite actively: many papers have recently been published on it.
The fundamental idea is to replace the matrices of a neural network (the matrices representing the layers of the network) with lower-rank factorizations, although it might be more correct to speak of tensors, because we often have arrays with more than 2 dimensions. For example, an m×n weight matrix can be approximated by the product of an m×r matrix and an r×n matrix, which stores far fewer values when r is small. In this way the network has fewer parameters and faster inference.
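As a concrete illustration, here is a minimal PyTorch sketch (assuming a single `nn.Linear` layer and a manually chosen rank, not a full compression pipeline) that factorizes a weight matrix into two smaller ones with a truncated SVD:

```python
import torch
import torch.nn as nn

def low_rank_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate a Linear layer with two smaller layers via truncated SVD."""
    W = layer.weight.data                        # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                 # (out_features, rank)
    V_r = Vh[:rank, :]                           # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

original = nn.Linear(1024, 1024)
compressed = low_rank_linear(original, rank=64)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(original), "->", params(compressed))   # ~1.05M -> ~132k parameters
```

With a rank of 64, the layer shrinks from roughly 1.05M parameters to roughly 132k at the cost of some approximation error; the rank is a trade-off between speed and accuracy.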
A trivial example in a CNN is replacing 3×3 convolutions with 1×1 convolutions. Such techniques are used by networks like SqueezeNet.
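The sketch below (a simplified illustration, not the exact SqueezeNet implementation) compares the parameter counts of a 3×3 and a 1×1 convolution and shows a Fire-style module that first squeezes the channel count with 1×1 convolutions before expanding it again:

```python
import torch
import torch.nn as nn

# For the same number of input/output channels, a 1x1 convolution has
# 9x fewer parameters than a 3x3 one.
conv3x3 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
conv1x1 = nn.Conv2d(64, 64, kernel_size=1, bias=False)
count = lambda m: sum(p.numel() for p in m.parameters())
print("3x3 conv:", count(conv3x3))   # 64 * 64 * 3 * 3 = 36,864
print("1x1 conv:", count(conv1x1))   # 64 * 64 * 1 * 1 =  4,096

class Fire(nn.Module):
    """Simplified SqueezeNet-style Fire module: a 1x1 'squeeze' layer shrinks
    the channel count before cheap 1x1 and a few 3x3 'expand' convolutions."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

out = Fire(64, squeeze_ch=16, expand_ch=64)(torch.randn(1, 64, 32, 32))
print(out.shape)   # torch.Size([1, 128, 32, 32])
```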