## Learn a robust technique to effectively compress large data

S**imilarity search** is an issue where given a question the goal is to search out probably the most similar documents to it amongst all of the database documents.

In data science, similarity search often appears within the NLP domain, search engines like google or recommender systems where probably the most relevant documents or items must be retrieved for a question. There exists a big number of alternative ways to enhance search performance in massive volumes of knowledge.

In the primary a part of this text series, we checked out kNN and inverted file index structure for performing similarity search. As we learned, kNN is probably the most straightforward approach while inverted file index acts on top of it suggesting a trade-off between speed acceleration and accuracy. Nevertheless, each methods don’t use data compression techniques which could result in memory issues, especially in cases of enormous datasets and limited RAM. In this text, we are going to try to handle this issue by taking a look at one other method called **Product Quantization**.

**Product quantization** is the method where each dataset vector is converted right into a short memory-efficient representation (called **PQ code**). As an alternative of fully keeping all of the vectors, their short representations are stored. At the identical time, product quantization is a lossy-compression method which ends up in lower prediction accuracy but in practice, this algorithm works thoroughly.

Normally, quantization is the means of mapping infinite values to discrete ones.

Firstly, the algorithm divides each vector into several equal parts — **subvectors**. Each of the respective parts of all dataset vectors form independent **subspaces** and is processed individually. Then a clustering algorithm is executed for every subspace of vectors. By doing so, several centroids in each subspace are created. Each subvector is encoded with the ID of the centroid that it belongs to. Moreover, the coordinates of all centroids are stored for later use.

Subspace centroids are also called

quantized vectors.In product quantization, a cluster ID is also known as a

reproduction value.

*Note.* Within the figures below a rectangle represents a vector containing several values while a square indicates a single number.

Consequently, if an original vector is split into *n* parts, then it will probably be encoded by *n* numbers — IDs of respective centroids for every of its subvectors. Typically, the variety of created centroids *k* is often chosen as an influence of two for more efficient memory usage. This manner, the memory required to store an encoded vector is *n * log(k) *bits.

The gathering of all centroids inside a subspace is named a

codebook. Running n clustering algorithms for all subspaces produces n separate codebooks.

## Compression example

Imagine an original vector of size 1024 which stores floats (32 bits) was divided into *n = 8* subvectors where each subvector is encoded by certainly one of *k = 256* clusters. Due to this fact, encoding the ID of a single cluster would require *log(256) = 8* bits. Allow us to compare the memory sizes for the vector representation in each cases:

- Original vector: 1024 * 32 bits = 4096 bytes.
- Encoded vector: 8 * 8 bits = 8 bytes.

The ultimate compression is 512 times! That is the actual power of product quantization.

Listed here are some necessary notes:

- The algorithm could be trained on one subset of vectors (e.g., to create clusters) and be used for one more one: once the algorithm is trained, one other dataset of vectors is passed where latest vectors are encoded through the use of already constructed centroids for every subspace.
- Typically, k-means is chosen as a clustering algorithm. Certainly one of its benefits is that the variety of clusters
*k*is a hyperparameter that could be manually defined, in response to memory usage requirements.

To get a greater understanding, allow us to first have a take a look at several naive approaches and discover their downsides. This can even help us realize why they shouldn’t be normally used.

## Naive approaches

The primary naive approach consists of decompressing all vectors by concatenating the corresponding centroids of every vector. After that, the *L2* distance (or one other metric) could be calculated from a question vector to all of the dataset vectors. Obviously, this method works but it is vitally time-consuming since the brute-force search is performed and the space calculation is performed on high-dimensional decompressed vectors.

One other possible way is to separate a question vector into subvectors and compute a sum of distances from each query subvector to respective quantized vectors of a database vector, based on its PQ code. As a consequence, the brute-search technique is used again and the space calculation here still requires a linear time of the unique vectors’ dimensionality, as within the previous case.

One other possible method is to encode the query vector right into a PQ code. Then this PQ code is directly utilized to calculate distances to all other PQ codes. The dataset vector with the corresponding PQ code which has the shortest distance is then regarded as the closest neighbour to the query. This approach is quicker than the previous two because the space is all the time computed between low-dimensional PQ codes. Nonetheless, PQ codes are composed by cluster IDs which do not need lots of semantic meaning and could be regarded as a categorical variable explicitly used as an actual variable. Clearly, it is a bad practice and this method can result in poor prediction quality.

## Optimized approach

A question vector is split into subvectors. For every of its subvectors, distances to all of the centroids of the corresponding subspace are computed. Ultimately, this information is stored in table *d*.

Calculated subvector-to-centroid distances are also known as

partial distances.

Through the use of this subvector-to-centroid distance table *d*, the approximate distance from the query to any database vector could be easily obtained by its PQ codes:

- For every of subvectors of a database vector, the closest centroid
*j*is found (through the use of mapping values from PQ codes) and the partial distance*d[i][j]*from that centroid to the query subvector*i*(through the use of the calculated matrix*d*) is taken. - All of the partial distances are squared and summed up. By taking the square root of this value, the approximate euclidean distance is obtained. If you should know how one can get approximate results for other metrics as well, navigate to the section below
*“Approximation of other distance metrics”*.

Using this method for calculating approximate distances assumes that partial distances

dare very near actual distancesabetween query and database subvectors.

Nevertheless, this condition will not be satisfied, especially when the space *c* between the database subvector and its centroid is large. In such cases, calculations lead to lower accuracy.

After now we have obtained approximate distances for all database rows, we seek for vectors with the smallest values. Those vectors can be the closest neighbours to the query.

## Approximation of other distance metrics

Thus far have checked out how one can approximate euclidean distance through the use of partial distances. Allow us to generalize the rule for other metrics as well.

Imagine we would really like to calculate a distance metric between a pair of vectors. If we all know the metrics’ formula, we will directly apply it to get the result. But sometimes we will do it by parts in the next manner:

- Each vectors are divided into
*n*subvectors. - For every pair of respective subvectors, the space metric is calculated.
- Calculated
*n*metrics are then combined to supply the actual distance between the unique vectors.

Euclidean distance is an example of a metric which could be calculated by parts. Based on the figure above, we will select the aggregation functions to be *h(z) = z²* , *g(z₀, z₁, …, zₙ) = sum(z₀, z₁, …, zₙ)* and *f(z) = √z*.

Inner product is one other example of such metric with aggregation functions *h(z) = z, g(z₀, z₁, …, zₙ) = sum(z₀, z₁, …, zₙ) and f(z) = z*.

Within the context of product quantization, that is a vital property because during inference the algorithm calculates distances by parts. Which means that it will be rather more problematic to make use of metrics for product quantization that do not need this property. Cosine distance is an example of such metric.

If there continues to be a necessity to make use of a metric without this property, then additional heuristics must be applied to aggregate partial distances with some error.

## Performance

The principal advantage of the product quantization is a large compression of database vectors that are stored as short PQ codes. For some applications, such compression rate could also be even higher than 95%! Nonetheless, other than PQ codes, the matrix *d* of size *k *x* n *containing quantized vectors of every subspace must be stored.

Product quantization is a lossy-compression method, so the upper the compression is, the more likely that the prediction accuracy will decrease.

Constructing a system for efficient representation requires training several cluster algorithms. Other than it, during inference, *k * n* partial distances must be calculated in a brute-force manner and summed up for every of the database vectors which can take a while.

Faiss(Facebook AI Search Similarity) is a Python library written in C++ used for optimised similarity search. This library presents various kinds of indexes that are data structures used to efficiently store the information and perform queries.

Based on the knowledge from the Faiss documentation, we are going to see how product quantization is utilized.

Product quantization is implemented within the *IndexPQ* class. For initialisation, we want to offer it 3 parameters:

**d**: variety of dimensions in data.**M**: variety of splits for every vector (the identical parameter as*n*used above).**nbits**: variety of bits it takes to encode a single cluster ID. Which means that the variety of total clusters in a single subspace can be equal to*k = 2^nbits*.

For equal subspace dimensions splitting, the parameter *dim* have to be divisible by *M*.

The full variety of bytes required to store a single vector is the same as:

As we will see within the formula above, for more efficient memory usage the worth of M * nbits needs to be divisible by

8.

We’ve looked through a very fashionable algorithm in information retrieval systems that efficiently compresses large volumes of knowledge. Its principal downside is a slow inference speed. Despite this fact, the algorithm is widely utilized in modern Big data applications, especially together with other similarity search techniques.

In the primary a part of the article series, we described the workflow of the inverted file index. In reality, we will merge these two algorithms right into a more efficient one which is able to possess the benefits of each! That is what exactly we’re going to do in the subsequent a part of this series.

*All images unless otherwise noted are by the writer.*