Similarity Search, Part 4: Hierarchical Navigable Small Worlds (HNSW)

1. Description

Similarity search is a problem in which, given a query, the goal is to find the documents most similar to it among all database documents. More generally, it is the process of finding the object most similar to a query object in a large-scale data set. The process usually involves computing a similarity score between two objects, sorting the candidates by score, and returning the top k objects most similar to the query. Similarity search is widely used in large-scale matching problems such as image retrieval, audio matching, recommendation systems, and natural language processing. Common similarity search algorithms include vector space models based on cosine similarity, locality-sensitive hashing based on Hamming distance, and nearest neighbor search based on edit distance.

2. Introduction

In data science, similarity search often appears in the fields of NLP, search engines, and recommender systems, where the most relevant documents or items need to be retrieved for a query. There are various ways to improve search performance over large amounts of data.

Hierarchical Navigable Small Worlds (HNSW) is a state-of-the-art algorithm for approximate nearest neighbor search. Under the hood, HNSW constructs an optimized graph structure that makes it very different from the other methods discussed in earlier parts of this series.

The main idea of HNSW is to construct a graph in which the path between any pair of vertices can be traversed in a small number of steps.

A well-known analogy for this approach is the famous six degrees of separation (six handshakes) rule:

All people in the world are socially connected to each other through six or fewer acquaintances.

Before moving on to the inner workings of HNSW, let's first discuss skip lists and navigable small worlds, the key data structures used in the HNSW implementation.

3. Skip list

A skip list is a probabilistic data structure that allows inserting and searching elements in a sorted list with an average time complexity of O(log n). A skip list consists of several levels of linked lists. The lowest level contains the original linked list with all elements. As we move to higher levels, more elements are skipped, so the lists become shorter and contain fewer connections.

Find element 20 in skip list

The search for a specific value starts at the highest level and compares the next element with the searched value. If the next element is less than or equal to the searched value, the algorithm moves on to it. Otherwise, the search procedure descends to the lower layer, which has more connections, and repeats the same process. Finally, the algorithm descends to the lowest level and finds the required node.

According to information from Wikipedia, a skip list has a main parameter p, which defines the probability of an element appearing in more than one list. If an element appears in list i, then the probability of it appearing in list i + 1 is p (p is usually set to 0.5 or 0.25). On average, each element is present in 1 / (1 - p) lists.

As we can see, this process is much faster than a normal linear search in a linked list. In fact, HNSW inherits the same idea, but instead of a linked list, it uses a graph.
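To make the descent concrete, here is a minimal sketch of the skip-list search described above (the node layout and the names are illustrative assumptions, not taken from any particular library):

```python
class SkipNode:
    def __init__(self, value, level):
        self.value = value
        # forward[i] is the next node on level i
        self.forward = [None] * (level + 1)


def skip_list_search(head, target, top_level):
    """Start at the highest level; move right while the next value does not
    exceed the target, otherwise drop one level down."""
    node = head  # head is a sentinel node with value -infinity
    for level in range(top_level, -1, -1):
        while node.forward[level] is not None and node.forward[level].value <= target:
            node = node.forward[level]
        # cannot advance on this level: descend and repeat
    return node if node.value == target else None
```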

4. Navigable small world

A navigable small world is a graph with polylogarithmic search complexity T = O(logᵏ n) that uses greedy routing. Routing refers to the process of starting the search from low-degree vertices and ending with high-degree vertices. Since low-degree vertices have very few connections, the algorithm can move quickly between them to efficiently navigate to the region where the nearest neighbor is likely to be located. The algorithm then gradually zooms in and switches to high-degree vertices to find the nearest neighbor among the vertices in that region.

Vertices are sometimes also called nodes.

4.1 Search

The search starts at a pre-defined entry point. To determine the next vertex (or vertices) to move to, the algorithm computes the distances from the query vector to the neighbors of the current vertex and moves to the closest one. The search terminates when the algorithm cannot find a neighbor closer to the query than the current node itself. This node is returned as the answer to the query.
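A minimal sketch of this greedy routine, assuming the graph is a dict mapping each vertex id to a list of neighbor ids and vectors is a NumPy array of embeddings (both names are illustrative):

```python
import numpy as np


def greedy_search(graph, vectors, query, entry_point):
    """Repeatedly move to the neighbor closest to the query; stop when no
    neighbor is closer than the current vertex (a local minimum)."""
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current, current_dist  # no closer neighbor found
        current, current_dist = best, best_dist
```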

A greedy search process in a navigable small world. Node A is used as the entry point. It has two neighbors, B and D. Node D is closer to the query than B, so we move to D. Node D has three neighbors, C, E and F; E is the nearest to the query, so we move to E. Finally, the search process leads to node L. Since all neighbors of L are further away from the query than L itself, we stop the algorithm and return L as the answer to the query.

This greedy strategy does not guarantee finding the exact nearest neighbor, since the method uses only local information at each step to make decisions. Early stopping is one of the known problems of the algorithm: it occurs especially at the beginning of the search, when there are no neighboring nodes better than the current one. This happens mostly in regions with too many low-degree vertices.

Early stopping. Both neighbors of the current node are further away from the query, so the algorithm returns the current node as the answer, even though nodes much closer to the query exist.

Search accuracy can be improved by using multiple entry points.

4.2 Construction

The NSW graph is constructed by shuffling the dataset points and inserting them one by one into the current graph. When a new node is inserted, it is linked by edges to the M vertices closest to it.

Sequential insertion of nodes (from left to right) with M = 2. At each iteration, a new vertex is added to the graph and linked to its M = 2 nearest neighbors. Blue lines represent the edges connected to the newly inserted node.
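A rough sketch of this sequential construction (the brute-force nearest-neighbor computation and undirected edges are simplifying assumptions made for brevity):

```python
import numpy as np


def build_nsw(points, M=2):
    """Insert (already shuffled) points one by one, connecting each new point
    to the M nearest vertices already present in the graph."""
    graph = {}  # vertex id -> set of neighbor ids
    for i in range(len(points)):
        graph[i] = set()
        if i == 0:
            continue
        dists = np.linalg.norm(points[:i] - points[i], axis=1)
        for j in np.argsort(dists)[:M]:
            graph[i].add(int(j))
            graph[int(j)].add(i)  # edges are undirected
    return graph
```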

Long-range edges will most likely be created at the beginning of graph construction. They play an important role in graph navigation.

Links to the closest neighbors of the elements inserted at the beginning of construction later become bridges between the network hubs that keep the overall graph connectivity and allow the logarithmic scaling of the number of hops during greedy routing. (Yu. A. Malkov, D. A. Yashunin)

From the example in the figure above, we can see the importance of the long-range edge AB added at the beginning of construction. Imagine a query that requires traversing a path between the relatively distant nodes A and I. The edge AB allows doing this quickly by navigating directly from one side of the graph to the other.

As the number of vertices in the graph increases, the probability grows that the edges newly connected to an inserted node will be short.

5. HNSW

HNSW is based on the same principles as the skip list and navigable small world. Its structure is a multi-layered graph with sparse connections on the top layers and denser regions on the bottom layers.

5.1 Search

The search starts at the highest layer and proceeds one layer below every time the local nearest neighbor among the layer nodes has been greedily found. Ultimately, the nearest neighbor found on the lowest layer is the answer to the query.

Search in HNSW

Similarly to NSW, the search quality of HNSW can be improved by using several entry points. Instead of finding only one nearest neighbor on each layer, the efSearch (a hyperparameter) vertices closest to the query vector are found, and each of them is used as an entry point on the next layer.
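Following the description above, here is a self-contained sketch of the layered search. The data layout is an assumption: layers is a list of per-layer adjacency dicts mapping a vertex id to its neighbor ids, and vectors is a NumPy array of embeddings; search_layer is a simplified beam search, not the paper's or Faiss's exact routine:

```python
import heapq
import numpy as np


def search_layer(layers, vectors, query, entry_points, ef, layer):
    """Beam search inside one layer: keep the ef closest vertices found so far
    and expand candidates until none of them can improve the result set."""
    dist = lambda v: float(np.linalg.norm(vectors[v] - query))
    visited = set(entry_points)
    candidates = [(dist(v), v) for v in entry_points]  # min-heap ordered by distance
    heapq.heapify(candidates)
    best = sorted(candidates)[:ef]                     # current results, ascending by distance
    while candidates:
        d, v = heapq.heappop(candidates)
        if len(best) >= ef and d > best[-1][0]:
            break                                      # closest candidate cannot improve the results
        for n in layers[layer].get(v, ()):
            if n not in visited:
                visited.add(n)
                dn = dist(n)
                if len(best) < ef or dn < best[-1][0]:
                    heapq.heappush(candidates, (dn, n))
                    best = sorted(best + [(dn, n)])[:ef]
    return [v for _, v in best]


def hnsw_search(layers, vectors, query, entry_point, max_level, ef_search, k):
    """Descend from the top layer to layer 0; the efSearch closest vertices found
    on each layer serve as entry points for the layer below."""
    entry_points = [entry_point]
    for layer in range(max_level, -1, -1):
        entry_points = search_layer(layers, vectors, query, entry_points,
                                    ef=ef_search, layer=layer)
    return entry_points[:k]  # search_layer returns vertices sorted by distance
```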

5.2 Complexity

The authors of the original paper claim that the number of operations required to find the nearest neighbor on any layer is bounded by a constant. Considering that the number of layers in the graph is logarithmic, we get a total search complexity of O(log n).

6. Construction

6.1 Select the largest layer

Nodes in HNSW are inserted sequentially, one by one. Every node is randomly assigned an integer l indicating the maximum layer on which the node can be present in the graph. For example, if l = 1, the node can only be found on layers 0 and 1. The authors select l for every node randomly with an exponentially decaying probability distribution normalized by a non-zero multiplier mL (mL = 0 results in a single layer in HNSW and non-optimized search complexity). Normally, most l values should be equal to 0, so most nodes exist only on the lowest layer. The larger the mL value, the higher the probability that a node appears on higher layers.

The number of layers l for each node is randomly selected, and the probability distribution decays exponentially.

Layer number distribution based on normalization factor mL. The horizontal axis represents uniform (0, 1) distributed values.
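The decaying distribution in the figure corresponds to the paper's assignment rule l = ⌊-ln(u) · mL⌋, where u is drawn uniformly from (0, 1). A small sketch (the value of M is illustrative):

```python
import math
import random
from collections import Counter


def random_level(m_L):
    """Sample the maximum layer l = floor(-ln(u) * mL), with u ~ Uniform(0, 1)."""
    u = 1.0 - random.random()  # in (0, 1], avoids log(0)
    return int(-math.log(u) * m_L)


M = 16
m_L = 1 / math.log(M)  # the value recommended by the authors (see section 6.1)
print(Counter(random_level(m_L) for _ in range(100_000)))
# most sampled levels are 0; higher levels become exponentially rarer
```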

To achieve the optimum performance advantage of the controllable hierarchy, the overlap between neighbors on different layers (i.e., the percentage of an element's neighbors that also belong to other layers) has to be small. (Yu. A. Malkov, D. A. Yashunin)

One of the ways to decrease the overlap is to decrease mL. But it is important to keep in mind that decreasing mL also leads, on average, to more traversals during the greedy search on each layer. That is why the value of mL has to be chosen so as to balance the overlap and the number of traversals.

The authors of the paper recommend choosing the optimal value of mL as 1 / ln(M). This value corresponds to the skip-list parameter p = 1/M, being the average single-element overlap between the layers.

6.2 Insertion

After a node is assigned a value l, its insertion proceeds in two phases:

  1. The algorithm starts from the top layer and greedily finds the nearest node. The found node is then used as an entry point on the next layer, and the search process continues. Once layer l is reached, the insertion proceeds to the second step.
  2. Starting from layer l, the algorithm inserts the new node on the current layer. It then acts the same way as in step 1, but instead of finding only one nearest neighbor, it greedily searches for efConstruction (a hyperparameter) nearest neighbors. Then M out of these efConstruction neighbors are chosen, and edges are built from the inserted node to them. After that, the algorithm descends to the next layer, where each of the found efConstruction nodes acts as an entry point. The algorithm terminates after the new node and its edges have been inserted on the lowest layer 0.

Insertion of a node (in blue) in HNSW. The maximum layer of the new node was randomly chosen as l = 2. Therefore, the node is inserted on layers 2, 1 and 0. On each of these layers, the node is connected to its M = 2 nearest neighbors.
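A compact sketch of the two-phase insertion under the same assumptions as the search sketch in section 5.1 (it reuses the search_layer helper defined there; the naive top-M selection on each layer can be replaced by the heuristic described in section 6.4):

```python
import math
import random


def hnsw_insert(layers, vectors, new_id, entry_point, max_level, M, ef_construction, m_L):
    """Two-phase insertion of a single node; returns the (possibly updated)
    top layer and entry point of the graph."""
    query = vectors[new_id]
    l = int(-math.log(1.0 - random.random()) * m_L)  # maximum layer of the new node
    while len(layers) <= max(l, max_level):
        layers.append({})                            # make sure all layers exist

    entry_points = [entry_point]
    # Phase 1: greedy descent with a single entry point down to layer l + 1
    for layer in range(max_level, l, -1):
        entry_points = search_layer(layers, vectors, query, entry_points, ef=1, layer=layer)

    # Phase 2: from layer l (or the current top layer) down to 0,
    # insert the node and connect it on every layer
    for layer in range(min(l, max_level), -1, -1):
        candidates = search_layer(layers, vectors, query, entry_points,
                                  ef=ef_construction, layer=layer)
        neighbors = candidates[:M]   # naive top-M; section 6.4 gives a better heuristic
        layers[layer].setdefault(new_id, set())
        for n in neighbors:
            layers[layer][new_id].add(n)
            layers[layer].setdefault(n, set()).add(new_id)
        entry_points = candidates    # found nodes act as entry points on the layer below

    if l > max_level:                # the new node becomes the entry point of the graph
        return l, new_id
    return max_level, entry_point
```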

6.3 Selecting values for construction parameters

The original paper provides several useful insights into how to choose hyperparameters:

  • According to simulations, good values of M lie between 5 and 48. Smaller values of M tend to be better for lower recalls or low-dimensional data, while higher values of M suit high recalls or high-dimensional data better.
  • Higher efConstruction values mean a deeper search, as more candidates are explored. However, this requires more computation. The authors recommend choosing an efConstruction value that results in recall close to 0.95-1 during training.
  • Additionally, there is another important parameter Mmₐₓ, the maximum number of edges a vertex can have, and a separate parameter Mmₐₓ₀ for the lowest layer only. It is recommended to choose a value for Mmₐₓ close to 2 * M. Values greater than 2 * M may lead to performance degradation and excessive memory usage. At the same time, Mmₐₓ = M leads to poor performance at high recall.

6.4 Candidate selection heuristics

It was pointed out above that during node insertion, M out of efConstruction candidates are chosen to build edges to. Let us discuss possible ways of choosing these M nodes.

The naive approach is to take the M closest candidates. However, this is not always the best option. Here is an example demonstrating it.

Imagine a graph with the structure shown below. As you can see, there are three regions, and two of them (on the left and on the top) are not connected to each other. As a result, getting, for example, from point A to point B requires a long path through the third region. It would be logical to somehow connect these two regions for better navigation.

Node X will be inserted into the graph. The goal is to connect it to M = 2 other points in the best possible way.

Node X is then inserted into the graph and needs to be linked to M = 2 other vertices.

In this case, the naive approach directly takes the M = 2 nearest neighbors (B and C) and connects X to them. Although X gets connected to its true nearest neighbors, this does not solve the problem. Let's look at the heuristic invented by the authors.

The heuristic takes into account not only the nearest distance between nodes, but also the connectivity of different regions on the graph.

The heuristic takes the closest candidate (B in our case) and connects the inserted node (X) to it. The algorithm then takes the next closest candidate in sorted order (C) and builds an edge to it only if the distance from this candidate to the new node (X) is smaller than the distance from this candidate to every vertex already connected to the new node (B in this case). After that, the algorithm proceeds to the next closest candidate until M edges are built.

Returning to the example, the heuristic procedure is illustrated below. The heuristic chooses B as the nearest neighbor of X and builds the edge BX. The algorithm then chooses C as the next closest candidate. However, this time BC < CX. This indicates that adding the edge CX to the graph is not optimal, because the edge BX already exists and the nodes B and C are very close to each other. The same analysis applies to nodes D and E. After that, the algorithm checks node A. This time it satisfies the condition, since BA > AX. Consequently, the new edge AX is built, and the two initially disjoint regions become connected to each other.

The example on the left uses the naive approach. The example on the right uses the selection heuristic, which results in the two initially disjoint regions being connected to each other.
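A minimal sketch of this selection heuristic (the candidates are assumed to be pre-sorted by distance to the new node, and vectors is an array of embeddings; both names are illustrative):

```python
import numpy as np


def select_neighbors_heuristic(vectors, new_vec, candidates, M):
    """Accept a candidate only if it is closer to the new node than to every
    neighbor accepted so far; stop once M neighbors have been selected."""
    selected = []
    for c in candidates:  # sorted by distance to the new node
        if len(selected) == M:
            break
        d_new = np.linalg.norm(vectors[c] - new_vec)
        if all(d_new < np.linalg.norm(vectors[c] - vectors[s]) for s in selected):
            selected.append(c)
    return selected
```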

6.5 Complexity

The insertion process works very similarly to the search procedure, without any significant differences that would require a non-constant number of operations. Therefore, the insertion of a single vertex takes O(log n) time. To estimate the total complexity, the number n of all inserted nodes in a given dataset should be considered. Ultimately, HNSW construction requires O(n * log n) time.

7. Combine HNSW with other methods

HNSW can be used together with other similarity search methods to provide better performance. One of the most popular ways of doing this is to combine it with an inverted file index and product quantization (IndexIVFPQ), which is covered elsewhere in this article series.

In this paradigm, HNSW plays the role of a coarse quantizer for IndexIVFPQ, which means it is responsible for finding the nearest Voronoi partition, so that the search scope can be narrowed. To do this, an HNSW index has to be built on all Voronoi centroids. When given a query, HNSW is used to find the nearest Voronoi centroid (instead of a brute-force search comparing distances to every centroid, as before). After that, the query vector is quantized within the corresponding Voronoi partition, and distances are computed using PQ codes.

The nearest Voronoi centroid is found as the nearest neighbor in an HNSW index built on the Voronoi centroids.

When using only an inverted file index, it is better not to set the number of Voronoi partitions too high (for example, 256 or 1024), because a brute-force search is performed to find the nearest centroid. With a small number of Voronoi partitions, the number of candidates inside each partition is relatively large. As a result, the algorithm quickly identifies the nearest centroid for a query, and most of its runtime is spent on finding the nearest neighbors within the Voronoi partition.

However, introducing HNSW into the workflow requires an adjustment. Consider running HNSW on only a small number of centroids (256 or 1024): HNSW would not bring any significant benefit, because with a small number of vectors it performs roughly the same as a naive brute-force search in terms of execution time. Moreover, HNSW would require additional memory to store the graph structure.

This is why it is recommended to set the number of Voronoi centroids much larger than usual when combining HNSW and inverted file indexes. By doing this, the number of candidates within each Voronoi partition is much smaller.

This shift in paradigm resulted in the following setup:

  • HNSW quickly identifies the nearest Voronoi centroid in logarithmic time.
  • Afterwards, an exhaustive search is performed within the respective Voronoi partition. This shouldn't be a hassle since the number of potential candidates is small.

8. Faiss implementation

Faiss (Facebook AI Similarity Search) is a C++ library with Python bindings for optimized similarity search. The library provides different types of indexes, which are data structures used to store data efficiently and perform queries.

Based on the information in the Faiss documentation, we will see how to leverage HNSW and combine it with an inverted file index and product quantization.

8.1 IndexHNSWFlat

Faiss has a class IndexHNSWFlat implementing the HNSW structure. As usual, the suffix "Flat" indicates that the dataset vectors are fully stored in the index. The constructor accepts two parameters:

  • d: data dimensionality.
  • M: the number of edges that need to be added to every new node during insertion.

Additionally, through the hnsw field, IndexHNSWFlat provides several useful attributes (which can be modified) and methods:

  • hnsw.efConstruction: number of nearest neighbors to explore during construction.
  • hnsw.efSearch: number of nearest neighbors to explore during search.
  • hnsw.max_level: returns the maximum layer.
  • hnsw.entry_point: returns the entry point.
  • faiss.vector_to_array(index.hnsw.levels): returns a list of the maximum layers for each vector.
  • hnsw.set_default_probas(M: int, level_mult: float): allows setting the M and mL values respectively. By default, level_mult is set to 1/ln(M).

Faiss implementation of IndexHNSWFlat
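A minimal usage sketch based on the class and attributes described above (the random data and the particular parameter values are illustrative only):

```python
import faiss
import numpy as np

d, n = 128, 10_000
xb = np.random.random((n, d)).astype('float32')  # database vectors
xq = np.random.random((5, d)).astype('float32')  # query vectors

M = 32                                           # edges added per inserted node
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 64                   # neighbors explored during construction
index.add(xb)                                    # HNSW needs no separate training step

index.hnsw.efSearch = 32                         # neighbors explored during search
D, I = index.search(xq, 10)                      # distances and ids of the 10 nearest neighbors

print(index.hnsw.max_level)                           # maximum layer of the built graph
print(faiss.vector_to_array(index.hnsw.levels)[:10])  # per-vector layer assignments
```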

IndexHNSWFlat sets the values Mmₐₓ = M and Mmₐₓ₀ = 2 * M.

8.2 IndexHNSWFlat + IndexIVFPQ

IndexHNSWFlat can also be used in combination with other indexes. One example is the IndexIVFPQ described in the previous section. This composite index is created in two steps:

  1. IndexHNSWFlat  is initialized as a coarse quantizer.
  2. The quantizer is passed as a parameter to  the IndexIVFPQ  constructor.

Training and addition can be done using different or the same data.

Faiss implementation of IndexHNSWFlat + IndexIVFPQ
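A minimal sketch of the two steps above (the dataset sizes, the nlist value and the PQ settings are illustrative assumptions; the Faiss classes shown are the standard ones):

```python
import faiss
import numpy as np

d, n = 128, 200_000
xb = np.random.random((n, d)).astype('float32')
xq = np.random.random((5, d)).astype('float32')

nlist = 4096                              # many Voronoi partitions, as recommended above
quantizer = faiss.IndexHNSWFlat(d, 32)    # step 1: HNSW index over the Voronoi centroids

# step 2: the quantizer is passed to the IndexIVFPQ constructor
# (8 sub-vectors of 8 bits each for product quantization)
index = faiss.IndexIVFPQ(quantizer, d, nlist, 8, 8)

index.train(xb)                           # learns the centroids and the PQ codebooks
index.add(xb)

index.nprobe = 16                         # Voronoi partitions visited at search time
D, I = index.search(xq, 10)
```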

9. Conclusion

In this article, we have studied a robust algorithm that works especially well for large datasets of vectors. By using a multi-layer graph representation together with the candidate selection heuristic, its search speed scales efficiently while maintaining decent prediction accuracy. It is also worth noting that HNSW can be combined with other similarity search algorithms, which makes it very flexible.

Vyacheslav Yefimov
