## Exploring the trail to fast nearest neighbour search with Hierarchical Navigable Small Worlds

18 hours ago

Hierarchical Navigable Small World (HNSW) has change into popular as top-of-the-line performing approaches for approximate nearest neighbour search. HNSW is somewhat complex though, and descriptions often lack a whole and intuitive explanation. This post takes a journey through the history of the HNSW idea to assist explain what “hierarchical navigable small world” actually means and why it’s effective.

- Approximate Nearest Neighbour Search
- Small Worlds
- Navigable Small Worlds
- Hierarchical Navigable Small Worlds
- Summary
- Appendix

– Improved search

– HNSW search & insertion

– Improved insertion - References

A typical application of machine learning is *nearest neighbour search*, which implies finding essentially the most similar items* to a goal — for instance, to recommend items which can be just like a user’s preferences, or to look for items which can be just like a user’s search query.

The straightforward method is to calculate the similarity of each item to the goal and return the closest ones. Nevertheless, if there are a lot of items (perhaps thousands and thousands), this can be slow.

As an alternative, we are able to use a structure called an *index* to make things much faster.

There’s a trade-off, nevertheless. Unlike the straightforward method, indexes only give approximate results: we may not retrieve the entire nearest neighbours (i.e. recall could also be lower than 100%).

There are several several types of index (e.g. locality sensitive hashing; inverted file index), but HNSW has proven particularly effective on various datasets, achieving high speeds while keeping high recall.

**Typically, items are represented as *embeddings*, that are vectors produced by a machine learning model; the similarity between items corresponds to the *distance* between the embeddings. This post will normally talk of vectors and distances, though usually HNSW can handle any sort of items with some measure of similarity.*

Small worlds were famously studied in Stanley Milgram’s small-world experiment [1].

Participants got a letter containing the address and other basic details of a randomly chosen goal individual, together with the experiment’s instructions. Within the unlikely event that they personally knew the goal, they were instructed to send them the letter; otherwise, they were told to think about someone they knew who was more prone to know the goal, and send the letter on to them.

The surprising conclusion was that the letters were typically only sent around six times before reaching the goal, demonstrating the famous idea of “six degrees of separation” — any two people can normally be connected by a small chain of friends.

Within the mathematical field of graph theory, a graph is a set of points, a few of that are connected. We are able to consider a social network as a graph, with people as points and friendships as connections. The small-world experiment found that almost all pairs of points on this graph are connected by short paths which have a small variety of steps. (That is described technically because the graph having a low *diameter*.)

Having short paths isn’t that surprising in itself: most graphs have this property, including graphs created by just connecting random pairs of points. But social networks aren’t connected randomly, they’re highly *local*: friends are inclined to live close to one another, and should you know two people, it’s quite likely they know one another too. (That is described technically because the graph having a high *clustering coefficient*.) The surprising thing concerning the small-world experiment is that two distant points are only separated by a brief path despite connections typically being short-range.

In cases like these when a graph has plenty of local connections, but additionally has short paths, we are saying the graph is a **small world**.

One other good example of a small world is the worldwide airport network. Airports in the identical region are highly connected to 1 one other, however it’s possible to make a protracted journey in just a number of stops by making use of major hub airports. For instance, a journey from Manchester, UK to Osaka, Japan typically starts with a neighborhood flight from Manchester to London, then a protracted distance flight from London to Tokyo, and at last one other local flight from Tokyo to Osaka. Long-range hubs are a typical way of achieving the small world property.

A final interesting example of graphs with the small world property is biological neural networks corresponding to the human brain.

In small world graphs, we are able to quickly reach a goal in a number of steps. This means a promising idea for nearest neighbour search: perhaps if we create connections between our vectors in such a way that it forms a small world graph, we are able to quickly find the vectors near a goal by ranging from an arbitrary “entry point” vector after which navigating through the graph towards the goal.

This possibility was explored by Kleinberg [2]. He noted that the existence of short paths wasn’t the one interesting thing about Miller’s experiment: it was also surprising that folks were in a position to *find* these short paths, without using any global knowledge concerning the graph. Relatively, the people were following a straightforward greedy algorithm. At each step, they examined each of their immediate connections, and sent it to the one they thought was closest to the goal. We are able to use the same algorithm to look a graph that connects vectors.

Kleinberg desired to know whether this greedy algorithm would at all times discover a short path. He ran easy simulations of small worlds wherein the entire points were connected to their immediate neighbours, with additional longer connections created between random points. He discovered that the greedy algorithm would only discover a short path in specific conditions, depending on the lengths of the long-range connections.

If the long-range connections were too long (as was the case after they connected pairs of points in completely random locations), the greedy algorithm could follow a long-range connection to quickly reach the rough area of the goal, but after that the long-range connections were of no use, and the trail needed to step through the local connections to catch up with. Alternatively, if the long-range connections were too short, it might simply take too many steps to succeed in the realm of the goal.

If, nevertheless, the lengths of the long-range connections were good (to be precise, in the event that they were uniformly distributed, so that every one lengths were equally likely), the greedy algorithm would typically reach the neighbourhood of the goal in an especially small variety of steps (to be more specific, a number proportional to *log(n)*, where *n* is the variety of points within the graph).

In cases like this where the greedy algorithm can find the goal in a small variety of steps, we are saying the small world is a **navigable** small world (NSW).

An NSW seems like an excellent index for our vectors, but for vectors in a posh, high-dimensional space, it’s not clear learn how to actually construct one. Fortunately, Malkov et al. [3] discovered a technique: we insert one randomly chosen vector at a time to the graph, and connect it to a small number *m* of nearest neighbours that were already inserted.

This method is remarkably easy and requires no global understanding of how the vectors are distributed in space. It’s also very efficient, as we are able to use the graph built to this point to perform the closest neighbour seek for inserting each vector.

Experiments confirmed that this method produces an NSW. Since the vectors inserted early on are randomly chosen, they have a tendency to be quite far apart. They subsequently form the long-range connections needed for a small world. It’s not so obvious why the small world is navigable, but as we insert more vectors, the connections will get progressively shorter, so it’s plausible that the distribution of connection lengths can be fairly even, as required.

Navigable small worlds can work well for approximate nearest neighbours search, but further evaluation revealed areas for improvement, leading Markov et al. [4] to propose HNSW.

A typical path through an NSW from the entry point towards the goal went through two phases: a “zoom-out” phase, wherein connection lengths increase from short to long, and a “zoom-in” phase, wherein the reverse happens.

The primary easy improvement is to make use of a long-range hub (corresponding to the primary inserted vector) because the entry point. This manner, we are able to skip the zoom-out phase and go straight to the zoom-in phase.

Secondly, although the search paths were short (with a lot of steps proportional to *log(n)*), the entire search procedure wasn’t so fast. At each vector along the trail, the greedy algorithm must examine each of the connected vectors, calculating their distance to the goal with the intention to select the closest one. While a lot of the locally connected vectors had only a number of connections, most long-range hubs had many connections (again, a number proportional to *log(n)*); this is sensible as these vectors were normally inserted early on within the constructing process and had many opportunities to connect with other vectors. Because of this, the overall variety of calculations during a search was quite large (proportional to *log(n)²*).

To enhance this, we’d like to limit the variety of connections checked at each hub. This results in the most important idea of HNSW: explicitly distinguishing between short-range and long-range connections. Within the initial stage of a search, we’ll only consider the long-range connections between hubs. Once the greedy search has found a hub near the goal, we then switch to using the short-range connections.

Because the variety of hubs is comparatively small, they need to have few connections to envision. We may also explicitly impose a maximum variety of long-range and short-range connections of every vector after we construct the index. This leads to a quick search time (proportional to *log(n)*).

The concept of separate short and long connections could be generalised to incorporate several intermediate levels of connection lengths. We are able to visualise this as a **hierarchy** of layers of connected vectors, with each layer only using a fraction of the vectors within the layer below.