Recommendation algorithm: HNSW (recommend products similar to those the user searched for / is interested in)

Overview of the HNSW algorithm

HNSW (Hierarchical Navigable Small World) is currently one of the most commonly used ANN (Approximate Nearest Neighbor) algorithms in the recommendation field. Its purpose is to quickly find the k nearest neighbors of a query among a very large set of candidates.

To find the k nearest neighbors of a query, the simplest idea is to compute the distance between the query and all N candidate elements, then select the k with the smallest distances. This classic brute-force approach has complexity O(N log k). Clearly that is too expensive to apply in real-world scenarios.
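The brute-force baseline described above can be sketched in a few lines of pure Python (the point/query values are made up for illustration):

```python
import heapq
import math

def brute_force_knn(query, candidates, k):
    """Return the k candidates nearest to `query` (Euclidean distance).

    Scans all N candidates and keeps the k smallest distances with a
    bounded heap, giving the O(N log k) cost described above.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # heapq.nsmallest maintains a size-k heap internally: O(N log k).
    return heapq.nsmallest(k, candidates, key=lambda c: dist(query, c))

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.2)]
print(brute_force_knn((0.0, 0.0), points, 2))  # → [(0.0, 0.0), (0.5, 0.2)]
```

Every graph method below exists to avoid this full scan over all N candidates.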

Many methods address this problem. HNSW, discussed here, is one of the most widely used search algorithms today. It is an upgraded version of its predecessor, the NSW algorithm, but both rest on the same simple idea: a connection structure over all N candidate elements is defined in advance as a graph, so that the N factor in the brute-force complexity is reduced and overall retrieval efficiency improves.

The overall graph structure can be depicted as follows:

Problem solved: efficient similarity search. In a recommendation system, how do we find the items closest to the user's query and recommend them (that is, recommend products similar to what the user searched for / is interested in)?

Common solutions include: Annoy, KD-Tree, LSH, PQ, NSW, HNSW, etc.

Development of Approximate Nearest Neighbor Search (ANNS): Proximity Graph -> NSW -> Skip List -> HNSW

Approximate Nearest Neighbor Search (ANNS)

1. Proximity Graph

Proximity Graph: the simplest graph-based algorithm

Idea: construct a graph in which each vertex is connected to its N nearest vertices. The target (red dot) is the vector to be queried. To search, start from an arbitrary vertex: traverse its friend nodes, find the one closest to the target, make it the new starting node, then traverse that node's friend nodes in turn. Iterate, getting ever closer, until the node closest to the target is found. The search then ends.
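The greedy walk just described can be sketched as follows; the graph layout and node names here are hypothetical, not taken from the article's figure:

```python
import math

def greedy_search(graph, coords, start, target):
    """Greedy walk: repeatedly move to the friend node closest to
    `target`; stop at a node none of whose friends is closer.

    `graph` maps node -> list of friend nodes; `coords` maps node ->
    coordinate tuple.
    """
    current = start
    while True:
        best = min(graph[current],
                   key=lambda n: math.dist(coords[n], target),
                   default=current)
        if math.dist(coords[best], target) >= math.dist(coords[current], target):
            return current  # local minimum: no friend node is closer
        current = best

# Tiny illustrative chain graph: A - B - C - D on a line.
coords = {'A': (0, 0), 'B': (2, 0), 'C': (4, 0), 'D': (6, 0)}
graph = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B', 'D'], 'D': ['C']}
print(greedy_search(graph, coords, 'A', (6, 0)))  # → 'D'
```

Note that this walk can only return a node reachable from the start, which is exactly why the isolated-point and bad-start problems listed below matter.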

Problems:

  1. Point K in the figure can never be reached by the search (it is isolated).
  2. If we want the top-k points closest to the target (red point) but those points are not connected to one another, search efficiency suffers.
  3. Does point D really need so many friend nodes? Extra edges increase construction complexity. How do we decide who is whose friend node?
  4. If the starting point is chosen badly (for example, far from the target), the search takes many steps.

2. Principle of NSW algorithm

NSW is the Navigable Small World graph: a navigable small-world structure without hierarchy.

How NSW solves the problems above:

  1. Some points cannot be reached -> when building the graph, every node is required to have friend nodes.
  2. Nearby points may not be adjacent -> any two nodes within a certain distance of each other must be friends.
  3. Some points have too many friend nodes -> each node's number of friend nodes is capped.
  4. The starting point may be far away -> add a "highway" mechanism.

2.1 NSW construction algorithm

When a new node is inserted into the graph, the m nodes closest to it are found by searching from a randomly chosen existing node (each node is allowed at most m friend nodes; m is set by the user), and the new node is connected to those m nearest nodes. A node's friend list keeps being updated as later nodes are inserted.
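A minimal sketch of this insertion rule is shown below. For brevity it finds the m nearest existing nodes by a full scan rather than by the greedy graph search a real NSW uses, and the coordinates are invented for illustration (so the resulting edges need not match the article's figure):

```python
import math

def nsw_insert(graph, coords, new_node, new_coord, m=3):
    """Insert `new_node` into an NSW-style graph, connecting it to (up
    to) its m nearest existing nodes. Sketch only: real NSW locates the
    neighbours via greedy search instead of scanning all nodes.
    """
    if graph:
        nearest = sorted(graph, key=lambda n: math.dist(coords[n], new_coord))[:m]
    else:
        nearest = []  # first node: nothing to connect to yet
    coords[new_node] = new_coord
    graph[new_node] = list(nearest)
    for n in nearest:
        graph[n].append(new_node)  # friend lists update as nodes arrive

graph, coords = {}, {}
for node, c in [('A', (0, 0)), ('B', (1, 0)), ('F', (0, 1)),
                ('C', (1, 1)), ('E', (2, 2))]:
    nsw_insert(graph, coords, node, c, m=3)
print(sorted(graph['E']))  # E links to its 3 nearest earlier nodes
```

Because early nodes are chosen as neighbours while the graph is still sparse, their edges can later become the "highways" described below.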

Example with m = 3 (each point connects to its 3 nearest friend points on insertion).

First insertion: the graph is empty; A is inserted at random and becomes the initial point. With only A in the graph, no friend nodes can be selected. Then B is inserted; A is the only candidate for B, so edge BA is added.

Second insertion: F is inserted. Only A and B are available for F, so edges FA and FB are added.

Third insertion: C is inserted. Only A, B, and F are available for C, so edges CA, CB, and CF are added.

Fourth insertion: E is inserted. Starting from any of A, B, C, F, compute the distance from the starting point to E and from each of the starting point's friend nodes to E, and take the nearest as the new starting point. If the nearest point is the starting point itself, check against m: if we do not yet have enough neighbors, take the second- and then third-closest points, never repeating a point, until 3 close points are found. E's three closest points are found, and edges EA, EC, and EF are added.

Fifth insertion: D is inserted exactly as E was: its three nearest nodes in the existing graph become its friend nodes and are connected.

Sixth insertion: G is inserted exactly as E was: its three nearest nodes in the existing graph become its friend nodes and are connected.

Nodes inserted in the early stages of graph construction are very likely to form "highways".

nth insertion: insert 6 more points into this graph, 3 of them very close to E and 3 very close to A. Now A is no longer among the 3 points closest to E, and E is no longer among the 3 points closest to A; but because A and E were added early in construction, edge AE still exists. We call such an edge a "highway", and it improves search efficiency. (If the entry point is E and the target is very close to A, we can jump from E straight to A along edge AE instead of approaching A in many small steps.)

Conclusion: the earlier a point is inserted, the more likely it is to end up on a "highway" edge; the later it is inserted, the less likely.

The beauty of this design is that it abandons Delaunay triangulation in favor of "mindless insertion" (the naive NSW insertion algorithm), which lowers the construction time complexity while still yielding a limited number of "highways" that speed up search.

2.2 NSW search algorithm

[Figure: NSW.png]

The edges in the graph serve two different purposes:

  1. Short-range edges, which form the approximate Delaunay graph required by the greedy search algorithm.
  2. Long-range edges, which give the greedy search logarithmic scaling and are responsible for the navigable-small-world (NSW) property of the graph.

Search optimizations:

  1. Maintain a visited list (visitedSet): points already traversed in a search task are never traversed again.
  2. Maintain a dynamic result list (result) holding the n points found so far that are closest to the query. For those n points, compute in parallel the distances from their friend nodes to the query, take the union of those friend nodes with the n points already in the list, select the n nearest points of the union, and use them to update the dynamic list.
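The two optimizations above can be sketched as a best-first search with a visited set and a bounded result heap. This is an illustrative pure-Python sketch with an invented toy graph, not the full NSW search (e.g., it uses one entry point and no parallelism):

```python
import heapq
import math

def nsw_search(graph, coords, entry, target, k=2):
    """Best-first NSW-style search with a visitedSet and a dynamic
    result list of the k closest points found so far.

    `candidates` is a min-heap of nodes to expand; `result` is a
    max-heap (negated distances) capped at size k. The search stops
    when the best remaining candidate is farther than the worst result.
    """
    d = lambda n: math.dist(coords[n], target)
    visited = {entry}                      # the visitedSet
    candidates = [(d(entry), entry)]       # min-heap: nodes to expand
    result = [(-d(entry), entry)]          # max-heap: best k so far
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -result[0][0] and len(result) >= k:
            break                          # cannot improve the top-k
        for nb in graph[node]:             # scan friend nodes
            if nb in visited:
                continue
            visited.add(nb)
            heapq.heappush(candidates, (d(nb), nb))
            heapq.heappush(result, (-d(nb), nb))
            if len(result) > k:
                heapq.heappop(result)      # drop the farthest point
    return sorted(n for _, n in result)

graph = {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B', 'D'], 'D': ['C']}
coords = {'A': (0, 0), 'B': (1, 0), 'C': (2, 0), 'D': (3, 0)}
print(nsw_search(graph, coords, 'A', (3, 0), k=2))  # → ['C', 'D']
```

HNSW keeps this same inner search and runs it layer by layer over a skip-list-like hierarchy of such graphs.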


Origin blog.csdn.net/weixin_43135178/article/details/134931780