Finding Needles in a Haystack — Search Indexes for Jaccard Similarity
Set and Jaccard
Inverted Indexes for Jaccard Search
Approximate Index for Jaccard Search
Final Thoughts

In the last section, I explained two search algorithms that work on inverted indexes. These algorithms are exact, meaning that the k best candidates they return are the true k best candidates. Sounds trite? Well, this is a question we should ask ourselves every time we design search algorithms on large-scale data, because in many scenarios it is not crucial to get the true k best candidates.

Think about the Spotify example again: do you really care if the results of a search miss a few people whose taste is similar to yours? Most people understand that in everyday applications (Google, Spotify, Twitter, etc.), the search is never exhaustive or exact. These applications are not mission critical enough to justify exact search. This is why the most widely used search algorithms are all approximate.

There are two main advantages of using approximate search algorithms:

  1. Faster. You can cut many corners when you do not need exact results.
  2. Predictable resource consumption. This one is less obvious, but for several approximate algorithms, their resource usage (e.g., memory) can be configured a priori, independent of the data distribution.

In this post, I write about the most widely used approximate index for Jaccard: Minwise Locality Sensitive Hashing (MinHash LSH).

What’s LSH?

Locality Sensitive Hashing indexes are true wonders in computer science. They are algorithmic magic powered by number theory. In the machine learning literature, they are k-NN models, but unlike typical machine learning models, LSH indexes are data-agnostic, so their accuracy conditioned on similarity can be determined a priori, before ingesting new data points or changing the data distribution. So, they are more similar to inverted indexes than to a model.

An LSH index is essentially a set of hash tables, each with a different hash function. Just like a typical hash table, an LSH index's hash function takes a data point (e.g., a set, a feature vector, or an embedding vector) as input and outputs a binary hash key. Apart from this, they could not be more different.

A typical hash function outputs keys that are pseudo-randomly and uniformly distributed over the entire key space for any input data. For example, MurmurHash is a well-known hash function that outputs near-uniformly and randomly over a 32-bit key space. This means that for any two inputs, such as abcdefg and abcefg, as long as they are different, their MurmurHash keys should not be correlated and should have the same probability of being any one of the keys in the entire 32-bit key space. This is a desired property of a hash function, because you want an even distribution of keys over hash buckets to avoid chaining or constantly resizing your hash table.
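To see this in action, here is a tiny sketch (assuming the third-party mmh3 package, which provides MurmurHash3) that hashes two nearly identical strings and prints their unrelated 32-bit keys:

```python
# Quick check that a general-purpose hash function spreads even very similar
# inputs across the whole 32-bit key space. Assumes the mmh3 package is installed.
import mmh3

for key in ("abcdefg", "abcefg"):
    # signed=False returns the key as an unsigned 32-bit integer.
    print(key, mmh3.hash(key, signed=False))
# The two keys should look completely unrelated, even though the inputs
# differ by only one character.
```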

An LSH hash function does the opposite: for a pair of similar inputs, with similarity defined through some metric space measure, their hash keys should be more likely to be equal than the hash keys of a pair of dissimilar inputs.

What does this mean? It means that an LSH hash function has a higher probability of hash key collision for data points that are more similar. Effectively, we are exploiting this higher collision probability for similarity-based retrieval.

MinHash LSH

For every similarity/distance metric, there is an LSH hash function. For Jaccard, the function is called the Minwise Hash Function, or MinHash function. Given an input set, a MinHash function hashes every element with a random hash function and keeps track of the minimum hash value observed. You can build an LSH index using a single MinHash function. See the diagram below.

A MinHash LSH index with a single random hash function. Image by the author.

The mathematical theory behind the MinHash function states that the probability of two sets having the same minimum hash value (i.e., a hash key collision) is the same as their Jaccard:

Pr[ min(h(A)) = min(h(B)) ] = Jaccard(A, B)

h(A) is the set of hash values of all elements in A under the random hash function h.
min(h(A)) is the minimum hash value over all elements in A.

It is a magical result, but the proof is quite simple: the element of A ∪ B with the smallest hash value is equally likely to be any element of the union, and min(h(A)) equals min(h(B)) exactly when that element lies in A ∩ B, which happens with probability |A ∩ B| / |A ∪ B| = Jaccard(A, B).
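To make this concrete, here is a minimal sketch (my own illustration, not the datasketch implementation) that empirically estimates the collision probability of a single MinHash function and compares it with the exact Jaccard of two sets:

```python
import hashlib

def h(element, seed):
    # A seeded 32-bit hash, standing in for a randomly chosen hash function.
    data = f"{seed}:{element}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=4).digest(), "big")

def minhash(s, seed):
    # Minimum hash value over all elements of the set under hash function `seed`.
    return min(h(x, seed) for x in s)

A = {"cat", "dog", "fish", "bird", "lizard", "horse", "cow"}
B = {"cat", "dog", "fish", "bird", "snake"}
jaccard = len(A & B) / len(A | B)

# Estimate the collision probability over many independent random hash functions.
trials = 10_000
collisions = sum(minhash(A, seed) == minhash(B, seed) for seed in range(trials))
print(f"Jaccard(A, B)       = {jaccard:.3f}")
print(f"collision frequency = {collisions / trials:.3f}")  # should be close to the Jaccard
```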

A MinHash LSH index with a single MinHash function does not give you satisfactory accuracy, because the collision probability is linearly proportional to the Jaccard. See the following plot to understand why.

Collision probability of a single MinHash function over the Jaccard between the query set and an indexed set. The Y-axis is the collision probability and the X-axis is the Jaccard between the query set and an indexed set. For example, an indexed set with Jaccard = 0.8 with the query set has an 80% probability of being retrieved by the index; another indexed set with Jaccard = 0.2 with the query set has a 20% probability of being retrieved. Image by the author.

Imagine we draw a threshold at Jaccard = 0.9: results with Jaccard higher than 0.9 with the query set are relevant, whereas results with Jaccard lower than 0.9 are irrelevant. In the context of search, "false positive" means that irrelevant results are returned, whereas "false negative" means that relevant results are not returned. Based on the plot above, and looking at the area corresponding to false positives: if the index uses only a single MinHash function, it will produce false positives with very high probability.

Boosting the Accuracy of MinHash LSH

This is why we need another piece of LSH magic: a process called boosting. We can boost the index to be much more attuned to the specified relevancy threshold.

Instead of just one, we use m MinHash functions generated through a process called universal hashing: essentially m random permutations of the same hash function over 32-bit or 64-bit integers. For every indexed set, we generate m minimum hash values using universal hashing.

Imagine you list the m minimum hash values of an indexed set. We group every r consecutive hash values into a band, and we make b such bands. This requires m = b * r.

Minimum hash values of an indexed set in MinHash LSH with m = 16, b = 4 and r = 4. Image by the author.

The probability that two sets have a "band collision" (all the hash values in a band collide between the two sets, i.e., r contiguous hash collisions) is Jaccard(A, B)^r. That is a lot smaller than the collision probability of a single hash value. However, the probability of having at least one band collision between the two sets is 1 - (1 - Jaccard(A, B)^r)^b.

Why do we care about 1 - (1 - Jaccard(A, B)^r)^b? Because this function has a special shape:

Boosted probability function for retrieval over Jaccard for a MinHash LSH index using b = 32 and r = 32. Image by the author.

In the plot above, you can see that by using m MinHash functions, the "at least one band collision" probability is an S-curve with a steep rise around Jaccard = 0.9. Assuming the relevancy threshold is 0.9, the false positive probability of this index is much smaller than that of the index using only a single random hash function.
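You can reproduce the shape of the curve with a few lines of code. Here is a small sketch that evaluates the boosted probability 1 - (1 - s^r)^b at a few Jaccard values, using b = 32 and r = 32 as in the plot above:

```python
# Probability of at least one band collision as a function of the Jaccard
# similarity s, for an index with b bands of r MinHash functions each.
def retrieval_probability(s: float, b: int, r: int) -> float:
    return 1.0 - (1.0 - s ** r) ** b

for s in (0.5, 0.7, 0.8, 0.85, 0.9, 0.95):
    p = retrieval_probability(s, b=32, r=32)
    print(f"Jaccard = {s:.2f} -> P(retrieved) = {p:.4f}")
# The probability stays near zero well below the threshold and rises steeply around 0.9.
```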

For this reason, an LSH index always uses b bands of r MinHash functions to boost accuracy. Each band is a hash table storing pointers to indexed sets. During search, any indexed set that collides with the query set on any band is returned.

A MinHash LSH index using b = 4 and r = 4. Each band is a hash table whose hash key is the concatenation of the minimum hash values from 4 MinHash functions. Image by the author.
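To make the data structure concrete, here is a minimal, self-contained sketch of a banded MinHash LSH index. It is an illustration only: the class name, the affine universal hashing scheme, and the fixed base hash are my own simplifications, not the datasketch implementation.

```python
import hashlib
import random
from collections import defaultdict

_PRIME = (1 << 61) - 1       # large Mersenne prime for universal hashing
_MAX_HASH = (1 << 32) - 1    # truncate hash values to 32 bits

def _base_hash(element: str) -> int:
    # Fixed 32-bit hash of an element.
    return int.from_bytes(hashlib.blake2b(element.encode(), digest_size=4).digest(), "big")

class MinHashLSHIndex:
    """A toy MinHash LSH index with b bands of r MinHash functions (m = b * r)."""

    def __init__(self, b: int = 4, r: int = 4, seed: int = 42):
        rng = random.Random(seed)
        self.b, self.r = b, r
        # Universal hashing: m random affine transforms of the same base hash.
        self.params = [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
                       for _ in range(b * r)]
        # One hash table per band, mapping a band key to the keys of indexed sets.
        self.tables = [defaultdict(set) for _ in range(b)]

    def _signature(self, s):
        values = [_base_hash(x) for x in s]
        return [min(((a * v + c) % _PRIME) & _MAX_HASH for v in values)
                for a, c in self.params]

    def _band_keys(self, sig):
        # Concatenate every r consecutive minimum hash values into one band key.
        return [tuple(sig[i * self.r:(i + 1) * self.r]) for i in range(self.b)]

    def insert(self, key, s):
        for table, band_key in zip(self.tables, self._band_keys(self._signature(s))):
            table[band_key].add(key)

    def query(self, s):
        # Return every indexed set that collides with the query on at least one band.
        candidates = set()
        for table, band_key in zip(self.tables, self._band_keys(self._signature(s))):
            candidates |= table.get(band_key, set())
        return candidates
```

For example, a query set with Jaccard = 0.8 against an indexed set is retrieved with probability 1 - (1 - 0.8^4)^4, roughly 88%:

```python
index = MinHashLSHIndex(b=4, r=4)
index.insert("doc1", {"data", "sketch", "minhash", "lsh"})
index.insert("doc2", {"completely", "different", "tokens"})
print(index.query({"data", "sketch", "minhash", "lsh", "index"}))  # likely {'doc1'}
```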

To build a MinHash LSH index, we can specify a priori a relevancy threshold and acceptable false positive and false negative probabilities conditioned on Jaccard similarity, and calculate the optimal m, b and r before indexing any data points. This is a great advantage of using LSH over other approximate indexes.
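To illustrate the general idea of that calculation (a simplified sketch, not the exact procedure the datasketch library uses), one can numerically integrate the false positive probability below the threshold and the false negative probability above it, then pick the (b, r) pair that minimizes a weighted sum of the two:

```python
def _integrate(f, a, b, steps=1000):
    # Midpoint-rule numerical integration of f over [a, b].
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

def optimal_params(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    """Pick (b, r) with b * r <= num_perm minimizing the weighted FP/FN probability mass."""
    best, best_error = None, float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            prob = lambda s: 1.0 - (1.0 - s ** r) ** b
            fp = _integrate(prob, 0.0, threshold)                     # retrieved but irrelevant
            fn = _integrate(lambda s: 1.0 - prob(s), threshold, 1.0)  # relevant but missed
            error = fp_weight * fp + fn_weight * fn
            if error < best_error:
                best, best_error = (b, r), error
    return best

print(optimal_params(threshold=0.9, num_perm=128))
```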

You can find my implementation of MinHash LSH in the Python package datasketch. It also has other MinHash-related algorithms like LSH Forest and Weighted MinHash.
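Here is a minimal usage sketch (based on my recollection of the datasketch API; check the package documentation for the authoritative details):

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

# The index derives b and r from the threshold and num_perm at construction time.
lsh = MinHashLSH(threshold=0.9, num_perm=128)
lsh.insert("set1", minhash_of(["cat", "dog", "fish", "bird"]))

# Returns keys of indexed sets likely to have Jaccard >= 0.9 with the query.
print(lsh.query(minhash_of(["cat", "dog", "fish", "bird"])))
```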

Final Thoughts

I have covered quite a few topics in this post, but I have barely scratched the surface of search indexes for Jaccard similarity. If you are interested in reading more about these topics, here is a list of further readings for you:

  • Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman. The third chapter goes into detail about MinHash and LSH. I think it is a great chapter for gaining intuition about MinHash. Keep in mind that the application described in the chapter is focused on n-gram based text matching.
  • JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. The preliminary section of this paper explains the intuition behind the search_top_k_merge_list and search_top_k_probe_set algorithms. The main section explains how to take cost into consideration when input sets are large, such as a table column.
  • The datasketch and SetSimilaritySearch libraries implement state-of-the-art approximate and exact Jaccard similarity search indexes, respectively. The issues list of the datasketch project is a treasure trove of application scenarios and practical considerations when applying MinHash LSH.

What about Embeddings?

In recent years, due to breakthroughs in representation learning using deep neural networks like Transformers, similarity between learned embedding vectors is meaningful when the input data is part of the same domain that the embedding model was trained on. The main differences between that scenario and the search scenario described in this post are:

  • Embedding vectors are dense vectors with typically 60 to 700 dimensions, and every dimension is non-zero. In contrast, sets, when represented as one-hot vectors, are sparse: 10k to millions of dimensions, but most dimensions are zero.
  • Cosine similarity (or dot product on normalized vectors) is typically used for embedding vectors. For sets, we use Jaccard similarity.
  • It is hard to specify a relevancy threshold on the similarity between embedding vectors, because the vectors are learned black-box representations of the original data such as images or text. On the other hand, a Jaccard similarity threshold for sets is much easier to specify because sets are the original data.

Due to the above differences, it is not straightforward to compare embeddings and sets: they are distinctly different data types, even though you could classify both of them as high-dimensional. They are suitable for different application scenarios.
