Annoy (Approximate Nearest Neighbor Vector Search Library) Learning Notes - common pip commands and basic use of annoy

1. Write in front

While writing the YouTubeDNN recall for the fun-rec news recommendation system, I obtained user vectors and news vectors, and then needed to retrieve the TopK news items most similar to each user vector from a massive news pool. This calls for fast vector retrieval technology. One tool I have used before is faiss, and I recorded a blog post, Faiss (Facebook's open-source efficient similarity search library), on how to use it. However, faiss is not easy to install on Windows and feels a bit heavyweight. This time I came across another handy toolkit for vector retrieval: annoy. This article mainly records how to use the annoy toolkit for vector retrieval.

Simple summary: the annoy package performs approximate nearest-neighbor vector retrieval, quickly finding the TopK most similar items from a large collection.

For a detailed introduction to the annoy package, see https://github.com/spotify/annoy

2. Install annoy

First, we need to install annoy. We can install it directly with pip install annoy, or specify a mirror with pip install -i https://pypi.tuna.tsinghua.edu.cn/simple annoy.

But when I ran this command, it reported Microsoft Visual C++ 14.0 is required ..., because my system is currently Windows; it should work without issue on Linux or Mac.

I seem to have encountered this error before, when installing faiss or other packages that require a C++ build environment. The once-and-for-all fix is to install a C/C++ toolchain, but that is troublesome and takes up a lot of disk space.

I don't want to use that method for now, so I took another approach: go to a site that hosts prebuilt Python wheels, search for annoy there, find the build matching your Python version, and download it.

Then install it locally with pip install <absolute path to the .whl file>. This method worked for me. Now that package installation has come up, let's organize a little more knowledge about it.

When installing Python packages, the most common way is to use pip. Let's take this opportunity to learn pip's common commands and record them here. For details, see the pip must-have quick checklist.

# Install a Python package
pip install <package>

# Specify a version
pip install <package>==<version>
pip install "<package>>=2.22, <3"
pip install <package>!=2.22

# Install from a specified mirror
pip install -i <url> <package>  # domestic mirror urls include Tsinghua, USTC, Douban, etc.
# Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
# Douban:  http://pypi.douban.com/simple/

# Install from a local wheel file; .whl files can be downloaded offline from
# https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyhook
pip install <package>.whl

# Install from a GitHub repository
pip install git+<package's GitHub URL>

# Upgrade a package
pip install <package> -U

# List the versions available for installation (by passing an invalid version on purpose)
pip install <package>==lemon

# List installed packages
pip list
# List packages in the current environment that can be upgraded
pip list -o

# Show information about a package
pip show <package>
pip show <package> -f

# Uninstall a package
pip uninstall <package>

# Export the dependency list with the freeze command; handy for reproducing an environment
pip freeze > requirements.txt  # dump the versions of the packages installed in the current environment to a txt file
# Install from a dependency list
pip install -r requirements.txt

# To build a Python package into a wheel file (a .whl you can give to others), first install the wheel library
pip install wheel
# Build a specific package into a wheel file
pip wheel --wheel-dir DIR some-package[==version]  # DIR is the output directory, e.g. users/lemon/Downloads

# Build wheel files from a dependency list
pip wheel --wheel-dir DIR -r requirements.txt

Another way is to download the package directly and then copy it offline into the package directory of the corresponding environment.

  • Windows environment: Anaconda -> Lib -> site-packages
  • Linux environment: anaconda -> lib -> python version -> site-packages
  • Mac environment: anaconda -> pkgs

This way, you can also browse the corresponding folder to read the underlying source code of a specific package.
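If you would rather locate these site-packages directories programmatically instead of hunting through the Anaconda folders, Python's standard library can report them directly (a minimal sketch; the printed paths depend on your own environment):

```python
import site
import sysconfig

# Directories where pip installs third-party packages for this interpreter
print(site.getsitepackages())

# The canonical install location from the interpreter's own configuration
print(sysconfig.get_paths()["purelib"])
```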

3. Basic use of annoy

Here I mainly follow the example from annoy's GitHub README and write it down.

from annoy import AnnoyIndex
import random

f = 40  # Length of item vectors that will be indexed
t = AnnoyIndex(f, 'angular')
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10)  # build a forest of 10 trees
t.save('test.ann')  # must be saved here, otherwise u.load below has nothing to read

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann')  # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000))  # will find the 1000 nearest neighbors

This example is actually very easy to understand. The role of annoy is to quickly search for neighboring vectors among massive numbers of vectors, so we must first build an efficient search index over them; annoy organizes the vectors with a forest of trees. The code up to t.build(10) builds the index, and the real retrieval is the last line, which retrieves the 1000 vectors most similar to the vector at position 0. Note that the returned result includes the query item itself.
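To make the retrieval concrete, here is a tiny exact brute-force version of the same nearest-neighbor search in pure Python (angular_distance and exact_nns are my own illustrative helpers, not part of annoy); annoy approximates this ranking with its forest of trees instead of scanning every vector:

```python
import math

def angular_distance(u, v):
    # annoy's 'angular' metric: sqrt(2 * (1 - cos(u, v)))
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = dot / (nu * nv)
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))

def exact_nns(query, vectors, n):
    # Exhaustively rank every stored vector by angular distance (exact but O(N))
    order = sorted(range(len(vectors)), key=lambda i: angular_distance(query, vectors[i]))
    return order[:n]

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(exact_nns([1.0, 0.0], vectors, 2))  # → [0, 1]: the query matches item 0 exactly, item 1 is next
```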

The following is a list of commonly used functions about annoy:

  • Build index related functions

    • AnnoyIndex(f, metric): returns a new readable and writable index that stores f-dimensional vectors, where metric can be "angular", "euclidean", "manhattan", "hamming", or "dot". The "angular" distance is derived from cosine similarity via the formula sqrt(2 * (1 - cos(u, v)))
    • a.add_item(i, v): adds vector v at position i (a non-negative integer). The index holds up to max(i)+1 items; for example, with 10000 items the positions run from 0 to 9999, and each position i stores the vector of the corresponding item. This function builds the vector dictionary
    • a.build(n_trees, n_jobs=-1): builds a forest of n_trees trees. More trees give higher precision at query time. After the build, no more items can be added. n_jobs specifies the number of threads used to build the trees; -1 means use all available CPU cores
    • a.save(fn, prefault=False): saves the index to disk so it can be loaded later (see the next function). After saving, no more items can be added.
    • a.load(fn, prefault=False): loads (mmaps) an index from disk. If prefault is set to True, the entire file is pre-read into memory (using mmap with MAP_POPULATE). Default is False.


    The functions above show how to build the vector dictionary with the annoy package, organize the vectors (as a forest of trees), and save the index. The functions below are used to retrieve the TopK results.

  • Functions used in vector retrieval

    • a.get_nns_by_item(i, n, search_k=-1, include_distances=False): returns the n items closest to item i. During the query, search_k nodes are inspected, defaulting to n_trees * n. search_k gives a run-time tradeoff between accuracy and speed. When include_distances is True, a 2-element tuple of two lists is returned: the second list contains the corresponding distances.
    • a.get_nns_by_vector(v, n, search_k=-1, include_distances=False): same as the item-based query above, except that the query is a vector v. For example, given a user embedding, it returns the n nearest-neighbor items. When used this way, the distances are usually kept as well, since they can serve as features in the downstream ranking stage.
    • a.get_item_vector(i): returns the vector corresponding to index i
    • a.get_distance(i, j): returns the distance between item i and item j (note: earlier versions of annoy returned the squared distance)
  • index property function

    • a.get_n_items(): Returns the number of items in the index, that is, the size of the dictionary
    • a.get_n_trees(): the number of index trees

Two hyperparameters to consider: the number of trees n_trees and the number of nodes inspected during the search search_k.

  • n_trees: provided at build time; affects build time and index size. Larger values give more accurate results but a larger index.
  • search_k: provided at query time; affects search performance. Larger values give more accurate results but longer query times. If not provided, it defaults to n_trees * n, where n is the number of nearest neighbors requested.

I also ran a few small examples of my own here (screenshot omitted).

4. Applications in YoutubeDNN

When YoutubeDNN does recall, we obtain the user embeddings and item embeddings from the model. We then take a user's embedding and retrieve, from the massive pool of items, the TopK most similar ones to return as that user's candidate items. So, assuming we already have user_embs and item_embs, how do we perform fast nearest-neighbor retrieval with annoy?

I wrote a function here:

import collections
import pickle

from annoy import AnnoyIndex


def get_youtube_recall_res(user_embs, doc_embs, user_idx_2_rawid, doc_idx_2_rawid, topk):
    """Nearest-neighbor retrieval, using an annoy tree here"""
    # Build the index tree over doc_embs
    f = user_embs.shape[1]
    t = AnnoyIndex(f, 'angular')
    for i, v in enumerate(doc_embs):
        t.add_item(i, v)
    t.build(10)
    # The index tree can be saved with t.save('annoy.ann')

    # For each user vector, return the nearest TopK items
    user_recall_items_dict = collections.defaultdict(dict)
    for i, u in enumerate(user_embs):
        recall_doc_scores = t.get_nns_by_vector(u, topk, include_distances=True)
        # recall_doc_scores is ([doc_idx], [distances]); map the indices back to raw doc ids
        raw_doc_scores = list(recall_doc_scores)
        raw_doc_scores[0] = [doc_idx_2_rawid[idx] for idx in raw_doc_scores[0]]
        # Map back to the raw user id
        user_recall_items_dict[user_idx_2_rawid[i]] = dict(zip(*raw_doc_scores))

    # annoy returns angular distances, where smaller means more similar,
    # so sort each user's candidates in ascending order of distance
    user_recall_items_dict = {
        k: sorted(v.items(), key=lambda x: x[1])
        for k, v in user_recall_items_dict.items()
    }

    # Save a copy
    pickle.dump(user_recall_items_dict, open('youtube_u2i_dict.pkl', 'wb'))

    return user_recall_items_dict

There are two additional parameters here, user_idx_2_rawid and doc_idx_2_rawid. These are dictionaries that map the position index of each user vector to the user's raw id, and the index of each item vector to the raw item_id; the final dictionary should store the raw user ids and raw item ids. After this function runs, the result looks like this:

(screenshot of the resulting recall dictionary omitted)
Ok, the exploration of annoy ends here for now; if I learn something new later, I will add it.

Origin blog.csdn.net/wuzhongqiang/article/details/122516942