
Explore how similarity information can be incorporated into hash functions

Similarity search is a problem where, given a query, the goal is to find the most similar documents to it among all of the documents in a database.
In data science, similarity search often appears in the NLP domain, search engines or recommender systems, where the most relevant documents or items need to be retrieved for a query. There is a large variety of ways to improve search performance over massive volumes of data.
In previous parts of this article series, we discussed inverted file index, product quantization and HNSW, and how they can be used together to improve search quality. In this chapter, we are going to look at a principally different approach that maintains both high search speed and quality.
Locality Sensitive Hashing (LSH) is a set of methods that is used to reduce the search scope by transforming data vectors into hash values while preserving information about their similarity.
We are going to discuss the classical approach, which consists of three steps:
- Shingling: encoding original texts into vectors.
- MinHashing: transforming vectors into a special representation called a signature which can be used to compare similarity between them.
- LSH function: hashing signature blocks into different buckets. If the signatures of a pair of vectors fall into the same bucket at least once, they are considered candidates.
We will gradually dive into the details of each of these steps throughout the article.
Shingling is the process of collecting k-grams from given texts. A k-gram is a group of k sequential tokens. Depending on the context, tokens can be words or symbols. The ultimate goal of shingling is to encode each document by using the collected k-grams. We will be using one-hot encoding for this, although other encoding methods can also be applied.
Firstly, the unique k-grams of each document are collected. Secondly, to encode each document, a vocabulary is required which represents the set of unique k-grams across all documents. Then, for each document, a vector of zeros with length equal to the size of the vocabulary is created. For every k-gram appearing in the document, its position in the vocabulary is identified and a "1" is placed at the respective position of the document vector. Even if the same k-gram appears several times in a document, it does not matter: the value in the vector is always 1.
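As a rough illustration, here is a minimal sketch of shingling and one-hot encoding with character-level k-grams; the helper names (shingle, vocab, encoded) and the sample documents are purely illustrative.

```python
def shingle(text: str, k: int = 3) -> set:
    """Collect the set of character-level k-grams of a text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

documents = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "similarity search with lsh",
]

# Unique k-grams per document
shingles = [shingle(doc) for doc in documents]

# Vocabulary: the union of k-grams over all documents, in a fixed order
vocab = sorted(set().union(*shingles))

# One-hot encode each document over the vocabulary:
# 1 if the k-gram appears in the document (no matter how many times), else 0
encoded = [[1 if gram in s else 0 for gram in vocab] for s in shingles]

print(len(vocab), encoded[0])
```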
At this stage, the initial texts have been vectorised. The similarity of vectors can be compared via the Jaccard index. Recall that the Jaccard index of two sets is defined as the number of elements common to both sets divided by the total number of elements in their union.
If a pair of encoded vectors is taken, the intersection in the formula for the Jaccard index is the number of rows that both contain 1 (i.e. the k-gram appears in both vectors) and the union is the number of rows with at least one 1 (the k-gram is present in at least one of the vectors).
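In code, this amounts to counting the corresponding rows of two one-hot vectors (a small sketch with an illustrative jaccard helper):

```python
def jaccard(v1: list, v2: list) -> float:
    """Jaccard index of two one-hot encoded vectors."""
    intersection = sum(1 for a, b in zip(v1, v2) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(v1, v2) if a == 1 or b == 1)
    return intersection / union if union else 0.0

# 2 rows contain 1 in both vectors, 4 rows contain at least one 1 -> 0.5
print(jaccard([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
```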
The current problem is the sparsity of the encoded vectors. Computing a similarity score between two one-hot encoded vectors would take a lot of time. Transforming them to a dense format would make it more efficient to operate on them later. Ultimately, the goal is to design a function that transforms these vectors to a smaller dimension while preserving the information about their similarity. The method that constructs such a function is called MinHashing.
MinHashing is a hash function that permutes the components of an input vector and then returns the first index at which the permuted vector component equals 1.
To get a dense representation of a vector consisting of n numbers, n minhash functions can be used to obtain n minhash values which form a signature.
It may not sound obvious at first, but several minhash values can be used to approximate the Jaccard similarity between vectors. In fact, the more minhash values are used, the more accurate the approximation is.
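A minimal sketch of how such signatures can be built with permutation-based minhash; the function names and the sample vector are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def minhash(vector: np.ndarray, permutation: np.ndarray) -> int:
    """Return the first index (after permutation) where the vector component equals 1."""
    permuted = vector[permutation]
    return int(np.argmax(permuted == 1))

def signature(vector: np.ndarray, permutations: list) -> np.ndarray:
    """Apply n minhash functions (one per random permutation) to get an n-value signature."""
    return np.array([minhash(vector, p) for p in permutations])

vocab_size = 12
n_hashes = 6  # number of minhash functions = signature length
permutations = [rng.permutation(vocab_size) for _ in range(n_hashes)]

v = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1])
print(signature(v, permutations))
```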
This is not just a useful empirical observation. It turns out that there is a whole theorem behind the scenes. Let us understand why the Jaccard index can be calculated by using signatures.
Proof of the statement
Let us assume that a given pair of vectors contains only rows of type 01, 10 and 11. Then a random permutation of these rows is performed. Since every row contains at least one 1, at least one of the two hash-value computations will stop at the very first row, with the corresponding hash value equal to 1 (the index of that row).
What is the probability that the second hash value will be equal to the first one? Obviously, this can only happen if the second hash value is also equal to 1. That means the first row must be of type 11. Since the permutation was taken randomly, the probability of such an event is equal to P = count(11) / (count(01) + count(10) + count(11)). This expression is exactly the same as the Jaccard index formula. Therefore:
The probability of getting equal minhash values for two binary vectors under a random row permutation equals their Jaccard index.
However, while proving the statement above, we assumed that the initial vectors did not contain rows of type 00. It is clear that rows of type 00 do not change the value of the Jaccard index. Similarly, including rows of type 00 does not affect the probability of getting the same hash values. For example, if the first permuted row is 00, then the minhash algorithm simply ignores it and switches to the next row until there is at least one 1 in the row. Of course, rows of type 00 can lead to different hash values than without them, but the probability of getting the same hash values stays the same.
We have proven an important statement. But how can the probability of getting the same minhash values be estimated? Certainly, it is possible to generate all possible permutations of the vectors and then calculate all minhash values to find the desired probability. For obvious reasons, this is not efficient because the number of possible permutations for a vector of size n equals n!. Nevertheless, the probability can be evaluated approximately: let us just use many hash functions to generate that many hash values.
The Jaccard index of two binary vectors approximately equals the proportion of positions at which their signatures agree.
It is easy to notice that using longer signatures results in more accurate estimates.
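A quick, self-contained check of this statement (the vectors and the numbers of hash functions below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def minhash_signature(vector, permutations):
    """For each permutation, the index of the first 1 in the permuted vector."""
    return np.array([int(np.argmax(vector[p] == 1)) for p in permutations])

# Two correlated sparse binary vectors over a vocabulary of size 1000
v1 = (rng.random(1000) < 0.1).astype(int)
v2 = np.where(rng.random(1000) < 0.5, v1, (rng.random(1000) < 0.1).astype(int))

exact = np.sum((v1 == 1) & (v2 == 1)) / np.sum((v1 == 1) | (v2 == 1))

for n_hashes in (10, 100, 1000):
    perms = [rng.permutation(v1.size) for _ in range(n_hashes)]
    s1, s2 = minhash_signature(v1, perms), minhash_signature(v2, perms)
    estimate = np.mean(s1 == s2)  # proportion of matching signature positions
    print(f"exact = {exact:.3f}, estimate with {n_hashes:4d} hashes = {estimate:.3f}")
```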
At this point, we can transform raw texts into dense signatures of equal length while preserving the information about similarity. Nevertheless, in practice, such dense signatures still usually have high dimensionality, and it would be inefficient to compare them directly.
Consider n = 10⁶ documents with signatures of length 100. Assuming that a single number of a signature requires 4 bytes to store, then a whole signature requires 400 bytes. Storing n = 10⁶ documents therefore requires 400 MB of space, which is feasible in reality. But comparing every document with every other document in a brute-force manner would require roughly 5 * 10¹¹ comparisons, which is far too many, especially when n is even larger.
To avoid this problem, it is possible to build a hash table to accelerate search performance, but even if two signatures are very similar and differ in only one position, they are still likely to have different hashes (because the remaining parts of the vectors are likely to differ). However, we normally want them to fall into the same bucket. This is where LSH comes to the rescue.
The LSH mechanism builds a hash table consisting of several parts, which puts a pair of signatures into the same bucket if they have at least one corresponding part.
LSH takes a signature matrix and horizontally divides it into b equal parts called bands, each containing r rows. Instead of plugging the whole signature into a single hash function, the signature is divided into b parts, and each subsignature is processed independently by its own hash function. As a consequence, each of the subsignatures falls into a separate bucket.
If there is at least one collision between corresponding subvectors of two different signatures, the signatures are considered candidates. As we can see, this condition is more flexible, since vectors do not have to be absolutely equal to be considered candidates. Nevertheless, this increases the number of false positives: a pair of different signatures can have a single corresponding part but be completely different overall. Depending on the problem, it is always better to optimize the parameters b, r and k.
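A minimal sketch of this banding mechanism (here the band values themselves are used as the bucket key instead of a real hash function, and the tiny signature matrix is illustrative):

```python
from collections import defaultdict
from itertools import combinations
import numpy as np

def lsh_candidates(signatures: np.ndarray, b: int) -> set:
    """Split each signature into b bands; two documents become candidates
    if at least one of their bands falls into the same bucket."""
    n_docs, sig_len = signatures.shape
    assert sig_len % b == 0
    r = sig_len // b  # rows per band

    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id in range(n_docs):
            # Use the band values as the bucket key (a stand-in for hashing the band)
            key = tuple(signatures[doc_id, band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for bucket in buckets.values():
            candidates.update(combinations(bucket, 2))
    return candidates

signatures = np.array([
    [3, 1, 7, 7, 2, 0],
    [3, 1, 7, 5, 2, 0],  # shares its first band with document 0
    [4, 6, 2, 9, 8, 1],
])
print(lsh_candidates(signatures, b=2))  # {(0, 1)}
```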
With LSH, it is possible to estimate the probability that two signatures with similarity s will be considered candidates, given the number of bands b and the number of rows r in each band. Let us find the formula for it in several steps.
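- The probability that the two signatures are equal at one particular position is s (this is exactly the statement proven above).
- The probability that all r rows of one band are equal is s^r.
- The probability that at least one of the r rows of a band differs is 1 - s^r.
- The probability that all b bands contain at least one differing row is (1 - s^r)^b.
- Therefore, the probability that at least one of the b bands has all of its rows equal, i.e. the probability that the two signatures become candidates, is P = 1 - (1 - s^r)^b.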
Note that the formula does not take into account collisions in which different subvectors accidentally hash into the same bucket. Because of this, the real probability of signatures being candidates may differ insignificantly.
Example
To get a better sense of the formula we have just obtained, let us consider a simple example. Consider two signatures of length 35 which are equally split into 7 bands with 5 rows each. Using the formula, the probability of getting at least one equal band can be computed for different values of their Jaccard similarity.
We notice that if two similar signatures have a Jaccard similarity of 80%, then they have a corresponding band in 93.8% of cases (true positives). In the remaining 6.2% of scenarios, such a pair of signatures is a false negative.
Now let us take two different signatures. For instance, suppose they are similar only by 20%. They are then false positive candidates in 0.224% of cases. In the other 99.776% of cases, they do not have an identical band, so they are true negatives.
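As a quick check, these numbers can be reproduced directly from the formula (a small sketch; the set of similarity values is arbitrary):

```python
def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that two signatures with similarity s share at least one equal band."""
    return 1 - (1 - s ** r) ** b

# 7 bands of 5 rows each, as in the example above
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s:.1f}  ->  P = {candidate_probability(s, b=7, r=5):.4f}")
# s = 0.8 gives P ≈ 0.938, s = 0.2 gives P ≈ 0.0022
```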
Visualisation
Let us now visualise the relationship between similarity s and the probability P of two signatures becoming candidates. Normally, with a higher signature similarity s, signatures should have a higher probability of being candidates. Ideally, this relationship would look like a step function: a probability close to 0 for pairs below a chosen similarity threshold and close to 1 above it.
Based on the probability formula obtained above, a typical curve is instead a smooth S-shaped line that rises steeply around a certain similarity threshold.
It is possible to vary the number of bands b to shift this curve to the left or to the right. Increasing b moves the curve to the left and results in more false positives; decreasing it shifts the curve to the right and results in more false negatives. It is important to find a good balance, depending on the problem.
Experiments with different numbers of bands and rows
Plotting the probability curve for several values of b and r makes this trade-off visible. It is always better to adjust these parameters based on the specific task at hand in order to successfully retrieve all pairs of similar documents and ignore those with different signatures.
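A short matplotlib sketch that produces such plots; the chosen (b, r) configurations are arbitrary examples that all keep the same signature length b * r = 100:

```python
import numpy as np
import matplotlib.pyplot as plt

s = np.linspace(0, 1, 200)

# Several (bands, rows) configurations with the same signature length b * r
configs = [(5, 20), (10, 10), (20, 5), (50, 2)]

for b, r in configs:
    plt.plot(s, 1 - (1 - s ** r) ** b, label=f"b = {b}, r = {r}")

plt.xlabel("signature similarity s")
plt.ylabel("probability of becoming candidates")
plt.legend()
plt.show()
```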
We have walked through a classical implementation of the LSH method. LSH significantly optimizes search speed by using lower-dimensional signature representations and a fast hashing mechanism to reduce the candidate search scope. At the same time, this comes at the cost of search accuracy, but in practice the difference is usually insignificant.
However, LSH is vulnerable to high-dimensional data: more dimensions require longer signatures and more computations to maintain a good search quality. In such cases, it is recommended to use another index.
In fact, there exist different implementations of LSH, but all of them are based on the same paradigm of transforming input vectors into hash values while preserving information about their similarity. Basically, other algorithms simply define other ways of obtaining these hash values.
Random projections is another LSH method that will be covered in the next chapter and which is implemented as an LSH index in the Faiss library for similarity search.
All images unless otherwise noted are by the author.