[python] Vector retrieval library Faiss usage guide

Faiss is a library developed by Facebook for efficient similarity search and clustering of dense vectors. It can search in vector sets of arbitrary size and also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with a complete Python interface, and some of the most useful algorithms are also implemented on the GPU. The official Faiss repository is: faiss .

Similarity search means finding the data most similar to a query by comparing similarities between points in a multidimensional space. In face recognition, for example, the distances between face vectors are compared to decide which known face the current face resembles. The technique is therefore widely used in information retrieval, computer vision, data analysis and other fields. When there is a large amount of data to search, a vector retrieval library is needed to speed up retrieval. Faiss includes multiple similarity search methods and provides both CPU and GPU support. Its advantage is that it greatly improves the speed of vector similarity search and reduces memory usage, at the cost of a small loss of precision. This article mainly describes the use of the Python 3 interface of Faiss. For the official Faiss tutorial, see: faiss official tutorial .

On Linux, Faiss can be installed as follows:

# CPU installation
pip install faiss-cpu
# GPU installation
pip install faiss-gpu

On Windows, installation requires conda; the commands are shown below. Activate the environment with conda activate before use, and do not install directly with pip.

# CPU installation
conda install -c pytorch faiss-cpu

# GPU installation
conda install -c pytorch faiss-gpu
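A minimal smoke test (not part of the original tutorial) can confirm the installation: build a tiny IndexFlatL2 and query it. The version attribute is assumed to be exposed by recent builds.

import numpy as np
import faiss

# Recent builds expose a version string; skip this line if yours does not
print(faiss.__version__)

# Tiny smoke test: index 100 random 8-dimensional vectors and query one of them
d = 8
xb = np.random.random((100, d)).astype('float32')
index = faiss.IndexFlatL2(d)
index.add(xb)
D, I = index.search(xb[:1], 3)
print(I)  # the first neighbour should be the query vector itself (id 0)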

1 Basic usage

1.1 Getting Started

Database creation

Faiss handles collections of vectors of a fixed dimension, stored as matrices. Faiss only uses 32-bit floating-point matrices, where the columns of the matrix are the feature dimensions of a vector and the rows are the vector samples. For vector retrieval we generally need two matrices:

  • The index matrix, which contains all the vectors to be indexed and in which we will search. Its shape is [nb, d], where nb is the number of index vectors and d is the vector dimension.
  • The query matrix, which contains the vectors to search for. Its shape is [nq, d], where nq is the number of query vectors. If we have only one query vector, then nq = 1.

The following example builds the input matrices required by Faiss.

import numpy as np
# Vector feature dimension
d = 64
# Number of vectors in the index matrix
nb = 10000
# Number of vectors in the query matrix
nq = 1000
# Set the random seed
np.random.seed(42)
# Randomly generate data in [0, 1) to build the index matrix; it must be float32
xb = np.random.random((nb, d)).astype('float32')
# Add a perturbation to the first column of the index matrix
xb[:, 0] += np.arange(nb) / 1000.
# Build the query matrix
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

Build an index library

Faiss is built around Index objects. An index preloads the index matrix and optionally preprocesses it to make searching more efficient. There are many types of indexes; here we use the simplest one, IndexFlatL2, which only performs an L2 (Euclidean) distance search. The vector dimension d must be set when the index is created. Most index types also need a training phase to analyze the distribution of the vectors; IndexFlatL2 can skip this step. Because it is a brute-force search, IndexFlatL2 is slow but exact. In Faiss, the add function preloads the index matrix and the search function retrieves query vectors. The attribute is_trained indicates whether the index has been trained (if it is False, a training stage is still required), and ntotal is the number of indexed vectors. An example of building an index library is as follows:

# Load the faiss library
import faiss
# Build the index with retrieval dimension d
index = faiss.IndexFlatL2(d)
# Check whether this index type has been trained; False means not trained
print(index.is_trained)
# Add the index vectors
index.add(xb)
# Print the number of indexed vectors
print(index.ntotal)
True
10000

Vector retrieval

The basic search operation on an index is a k-nearest-neighbor search: for each query vector, find its k most similar vectors in the database. Faiss uses search for queries, and the search function returns two numpy matrices:

  • The distance matrix D, with shape [nq, k]. Each row contains the distances to the k most similar vectors, sorted from closest to farthest.
  • The result id matrix I, with shape [nq, k]. Each row contains the ids of the k most similar vectors in the index library, sorted from most to least similar.

The retrieval code is shown below. Since the first feature of each vector was perturbed by a value proportional to its id, the ids of a query vector's retrieved neighbors tend to lie in the same range as the query's own id; for example, the first query vectors retrieve neighbors with relatively small ids.

import time

# Retrieve the 5 nearest vectors
k = 5
# Sanity check on the index: searching the index vectors themselves should work
D, I = index.search(xb[:5], k)

# Start time
start = time.time()
# Actual retrieval
D, I = index.search(xq, k)
# End time
end = time.time()
# Run on a very weak machine, so slow timings are expected
print("Time elapsed: {}s".format(end-start))

# Results for the first five query vectors
print(I[:5])
print('---divider---')
# Results for the last five query vectors
print(I[-5:])
Time elapsed: 2.492363214492798s
[[ 234  642  860  369  820]
 [ 145  430   49   27   62]
 [ 200  279  193  331  564]
 [1449  453  515 1173 1502]
 [ 108  442  133 1273  323]]
---divider---
[[ 842  781 1939 1535 2579]
 [ 321 1591  265 1449  873]
 [1687 1530 1257 1370  942]
 [1403  373 1032  862 1975]
 [ 852  211  673  937  228]]

The same computation can be done directly with numpy, as shown below. When the dimension is low and there are few samples, numpy can be faster than Faiss, but for large or high-dimensional data do not run the program below, as it may exhaust memory and crash the system.


# Squared L2 distance between two sets of vectors
def dist(xq, xb):
    xq2 = np.sum(xq**2, axis=1, keepdims=True)
    xb2 = np.sum(xb**2, axis=1, keepdims=True)
    xqxb = np.dot(xq, xb.T)
    # Faiss does not take the square root when computing the L2 distance, to save time
    # return np.sqrt(xq2 - 2 * xqxb + xb2.T)
    return xq2 - 2 * xqxb + xb2.T

# Get the top-k results
def get_result(dst, k=5):
    D = np.sort(dst)[:, :k]
    I = np.argsort(dst)[:, :k]
    return D, I

# Start time
start = time.time()
# dst = dist(xq,xb)
# D_, I_ = get_result(dst,k)
# End time
end = time.time()
# Run on a very weak machine, so slow timings are expected
# print("Time elapsed: {}s".format(end-start))

# Results for the first five query vectors
# print(I[:5])
# print('---divider---')
# Results for the last five query vectors
# print(I[-5:])
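As a sanity check (not in the original post), the numpy implementation above can be compared against the IndexFlatL2 result on a small subset of queries, which is cheap enough even on a weak machine. This sketch assumes the dist and get_result helpers and the index built earlier.

# Compare the numpy brute-force result with the Faiss result on the first 10 queries only
dst_small = dist(xq[:10], xb)
D_np, I_np = get_result(dst_small, k)
D_faiss, I_faiss = index.search(xq[:10], k)

# Neighbour ids should match (up to ties), and the squared L2 distances
# should agree within floating-point error
print(np.array_equal(I_np, I_faiss))
print(np.allclose(D_np, D_faiss, atol=1e-3))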

1.2 Speed up retrieval

As the previous code shows, retrieval is still slow, especially on a weak machine. To speed it up, Faiss partitions the dataset: Voronoi cells are defined in the d-dimensional space, and each database vector falls into one of the cells. This is a spatial nearest-neighbor search based on the Voronoi diagram. At search time, the query vector first determines which cell it falls into, and is then compared only with the index vectors in that cell and in a few neighboring cells.

In Faiss this is done with the IndexIVFFlat index. This index type requires a training phase, which can be performed on any collection of vectors with the same distribution as the database vectors. IndexIVFFlat also needs another index, the quantizer, which assigns each index vector to its Voronoi cell; this is usually an IndexFlatL2. Building this kind of index takes time, and the fewer neighboring cells visited during search, the faster but less accurate the retrieval. A practical balance between accuracy and speed is struck by tuning the number of cells visited.

IndexIVFFlat has two important parameters: nlist, the number of cells, and nprobe, the number of nearby cells visited during search (the default is 1). Increasing nprobe increases the search time roughly linearly, but the accuracy improves accordingly. An example is as follows:

# Number of cells
nlist = 100
# Number of nearest vectors to retrieve
k = 5
# Quantizer used to assign vectors to cells
quantizer = faiss.IndexFlatL2(d)
# Arguments: quantizer, vector dimension, number of cells
index = faiss.IndexIVFFlat(quantizer, d, nlist)
# The index has not been trained yet
print(index.is_trained)
# Train the index
index.train(xb)
print(index.is_trained)

# This step is still time-consuming
index.add(xb)
False
True

Search in 1 cell

Retrieval is now much faster, but the accuracy decreases: the result is not exactly the same as the earlier exact L2 search. This is because some of the true nearest neighbors fall into cells that were not visited. It can therefore be useful to visit more cells during retrieval.

# By default, 1 nearby cell is searched
print(index.nprobe)
# Start time
start = time.time()
D, I = index.search(xq, k)
# End time
end = time.time()
# Run on a very weak machine, so slow timings are expected
print("Time elapsed: {}s".format(end-start))
# Results for the last five query vectors
print(I[-5:])
1
Time elapsed: 0.09508228302001953s
[[ 937 1026  879  461  313]
 [ 321  927  514  581  960]
 [1530 1216 1924 1518 1497]
 [1032 1998 1185 2109 1500]
 [ 211  228  412  267   35]]

Search in multiple cells

# Set how many nearby cells to search
index.nprobe = 10
# Start time
start = time.time()
D, I = index.search(xq, k)
# End time
end = time.time()
# Run on a very weak machine, so slow timings are expected
print("Time elapsed: {}s".format(end-start))
# Results for the last five query vectors
print(I[-5:])
Time elapsed: 0.2911374568939209s
[[ 842  781 1939 1535 1951]
 [ 321  265 1449  873  947]
 [1687 1530 1370  942 1216]
 [1403  373 1032 1975 1411]
 [ 852  211  673  937  228]]

Search in all cells

When nprobe = nlist, all cells are searched. This gives exactly the same result as the exact L2 search, but it is slow.

# Set how many nearby cells to search
index.nprobe = nlist
# Start time
start = time.time()
D, I = index.search(xq, k)
# End time
end = time.time()
# Run on a very weak machine, so slow timings are expected
print("Time elapsed: {}s".format(end-start))
# Results for the last five query vectors
print(I[-5:])
Time elapsed: 0.5938379764556885s
[[ 842  781 1939 1535 2579]
 [ 321 1591  265 1449  873]
 [1687 1530 1257 1370  942]
 [1403  373 1032  862 1975]
 [ 852  211  673  937  228]]
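A common way to quantify this accuracy/speed trade-off, not shown in the original, is recall@k: the fraction of exact nearest neighbours that the IVF search also returns. A minimal sketch, reusing xb, xq, k, nlist and the IVF index from above (the flat index and recall_at_k helper are introduced here for illustration):

# Ground truth from exact (brute-force) search
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, I_true = flat.search(xq, k)

def recall_at_k(I_approx, I_true):
    # Fraction of true top-k neighbours that also appear in the approximate top-k
    hits = sum(len(set(a) & set(t)) for a, t in zip(I_approx, I_true))
    return hits / I_true.size

for nprobe in (1, 10, nlist):
    index.nprobe = nprobe
    _, I_ivf = index.search(xq, k)
    print(nprobe, recall_at_k(I_ivf, I_true))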

1.3 Lower memory usage

The indexes IndexFlatL2 and IndexIVFFlat both store the full vectors. To scale to very large datasets, Faiss provides variants that compress the stored vectors with a lossy compression based on product quantization. The vectors are still stored in Voronoi cells, but their size is reduced to a configurable number of bytes m (the feature dimension d must be a multiple of m).

The compression is based on a product quantizer, which can be seen as an additional level of quantization applied to sub-vectors of the vectors to be encoded. Since the vectors are not stored exactly, the distances returned by the search method are also approximations. The accuracy of this method is lower and training the quantizer takes time, so use it with care.

nlist = 100
# Number of sub-quantizers
m = 8
k = 4
# Quantizer used to assign vectors to cells
quantizer = faiss.IndexFlatL2(d)
# The final 8 means each sub-vector is encoded as 8 bits
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k)
index.nprobe = 10
# Search
D, I = index.search(xq, k)
print(I[-5:])
[[ 879 1951 1099  895]
 [ 907  906  828 1453]
 [1687 1216 1530 1247]
 [1346 1398 1599 1141]
 [ 673  545  894  267]]
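To see why IVFPQ saves memory, a rough back-of-the-envelope comparison of per-vector storage follows (inverted lists and ids add some overhead on top of this):

# IndexFlat stores d float32 values per vector; with 8 bits per sub-quantizer
# as configured above, IndexIVFPQ stores m one-byte codes per vector
print("IndexFlatL2: {} bytes/vector".format(d * 4))   # 64 * 4 = 256 bytes
print("IndexIVFPQ : {} bytes/vector".format(m))       # 8 bytes, plus index overhead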

2 Metrics and building blocks

2.1 Metrics

Faiss supports two metrics, L2 (Euclidean distance) and inner product.

  • For the L2 metric, Faiss does not take the square root when computing distances. If the exact Euclidean distance is needed, take the square root afterwards.
  • The inner product metric is related to cosine similarity. Cosine similarity evaluates how similar two vectors are by the cosine of the angle between them: the larger the cosine similarity, the smaller the angle and the more similar the vectors. It is computed as the inner product of the two vectors divided by the product of their norms. Usually only the inner product is computed; when both vectors are L2-normalized, the inner product equals the cosine similarity (a small numpy sketch of this relationship follows the list). In Faiss, an inner-product index is built with IndexFlatIP.
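A small numpy sketch, not from the original post, illustrating that the inner product of L2-normalized vectors equals the cosine similarity (the vectors a and b are arbitrary examples):

import numpy as np

a = np.array([1.0, 2.0, 3.0], dtype='float32')
b = np.array([2.0, 4.0, 5.0], dtype='float32')

# Cosine similarity: inner product divided by the product of the two norms
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of the L2-normalized vectors gives the same value
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print(cos, np.dot(a_n, b_n))  # both are approximately 0.996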

Computing the inner product directly, without normalization, does not give very accurate similarity results; the code is as follows.

# Load the faiss library
import faiss
# Build an inner-product index with dimension d
index = faiss.IndexFlatIP(d)
# Check whether this index type has been trained; False means not trained
print(index.is_trained)
# Add the index vectors
index.add(xb)
# Print the number of indexed vectors
print(index.ntotal)
import time


# Retrieve the 5 nearest vectors
k = 5
# Sanity check on the index: searching the index vectors themselves should work
D, I = index.search(xb[:5], k)
# Start time
start = time.time()
# Actual retrieval
D, I = index.search(xq, k)
# End time
end = time.time()
# Run on a very weak machine, so slow timings are expected
print("Time elapsed: {}s".format(end-start))

# Results for the first five query vectors; their ids should be close to 0
print(I[:5])
print(D[:5])
True
10000
Time elapsed: 1.2941181659698486s
[[9763 9880 9553 9863 9034]
 [9073 6585 9863 3904 9763]
 [9814 9766 9880 7796 9815]
 [9863 9553 9763 9003 9682]
 [9916 9763 9880 8035 9709]]
[[32.214005 31.82538  31.135605 31.076077 31.035263]
 [20.439157 20.378412 20.31539  20.28349  20.214535]
 [27.588043 27.546564 27.50164  27.338028 27.227661]
 [27.075897 26.859488 26.714897 26.675337 26.580912]
 [27.065765 26.823675 26.782066 26.74991  26.71664 ]]

After normalizing the data, compute the inner product; the code is as follows.

# Load the faiss library
import faiss
# Build an inner-product index with dimension d
index = faiss.IndexFlatIP(d)
# Check whether this index type has been trained; False means not trained
print(index.is_trained)

xb_ = xb.copy()
# L2-normalize each vector
faiss.normalize_L2(xb_)
# Add the index vectors
index.add(xb_)
# Print the number of indexed vectors
print(index.ntotal)
import time


# Retrieve the 5 nearest vectors
k = 5
# Start time
start = time.time()
# Actual retrieval
xq_ = xq.copy()
# L2-normalize each query vector
faiss.normalize_L2(xq_)
D, I = index.search(xq_, k)
# End time
end = time.time()
# Run on a very weak machine, so slow timings are expected
print("Time elapsed: {}s".format(end-start))

# Results for the first five query vectors
print(I[:5])
print(D[:5])
True
10000
Time elapsed: 1.6847233772277832s
[[ 860 1240  234  618  642]
 [ 145  273  348  437  228]
 [1223  200  279  193  605]
 [1502 1449 1696  515 1415]
 [ 442  323  133 1273  108]]
[[0.8677575  0.86685914 0.86608535 0.8649474  0.86269784]
 [0.83961076 0.82170016 0.81780475 0.81556916 0.81182253]
 [0.8426961  0.83725685 0.8371294  0.83701724 0.83649486]
 [0.8574949  0.8456532  0.8434802  0.8426977  0.83955705]
 [0.848532   0.8466242  0.84535056 0.8434353  0.84117293]]

The normalization can also be done directly with numpy before adding the vectors; this approach is recommended. The code is as follows.

# Load the faiss library
import faiss
# Build an inner-product index with dimension d
index = faiss.IndexFlatIP(d)
# Check whether this index type has been trained; False means not trained
print(index.is_trained)

# L2-normalize each row with numpy, equivalent to faiss.normalize_L2
xb_ = xb / np.linalg.norm(xb, axis=1, keepdims=True)
# Add the index vectors
index.add(xb_)
# Print the number of indexed vectors
print(index.ntotal)
import time

# Retrieve the 5 nearest vectors
k = 5
# Start time
start = time.time()
# Actual retrieval
# L2-normalize each query row with numpy
xq_ = xq / np.linalg.norm(xq, axis=1, keepdims=True)
D, I = index.search(xq_, k)
# End time
end = time.time()
# Run on a very weak machine, so slow timings are expected
print("Time elapsed: {}s".format(end-start))

# Results for the first five query vectors
print(I[:5])
print(D[:5])
True
10000
Time elapsed: 1.384207010269165s
[[ 860 1240  234  618  642]
 [ 145  273  348  437  228]
 [1223  200  279  193  605]
 [1502 1449 1696  515 1415]
 [ 442  323  133 1273  108]]
[[0.8677576  0.8668592  0.86608535 0.8649473  0.8626978 ]
 [0.83961076 0.8217001  0.8178047  0.8155691  0.8118226 ]
 [0.8426961  0.83725685 0.8371294  0.83701736 0.836495  ]
 [0.8574948  0.8456532  0.8434803  0.8426978  0.839557  ]
 [0.8485321  0.8466241  0.84535056 0.8434351  0.84117293]]

In addition, another commonly used distance is the Mahalanobis distance. For how to use the Mahalanobis distance in Faiss, see mahalnobis_to_L2.ipynb . For a description of the Mahalanobis distance, see Mahalanobis Distance .
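As a rough sketch of the idea behind that notebook (my own summary, not the notebook's exact code): the Mahalanobis distance can be reduced to an ordinary L2 search by whitening the vectors with a Cholesky factor of the inverse covariance, then searching with IndexFlatL2 as usual.

import numpy as np
import faiss

# Estimate the covariance of the database vectors and factor its inverse
cov = np.cov(xb, rowvar=False)
L = np.linalg.cholesky(np.linalg.inv(cov))  # inv(cov) = L @ L.T

# Whiten both database and queries; the squared L2 distance between the
# transformed vectors equals the squared Mahalanobis distance between the originals
xb_w = (xb @ L).astype('float32')
xq_w = (xq @ L).astype('float32')

index = faiss.IndexFlatL2(d)
index.add(xb_w)
D, I = index.search(xq_w, 5)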

2.2 Faiss building blocks

Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, and PQ encoding/decoding.

Clustering

Faiss provides an efficient implementation of k-means, as shown below.

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Randomly create data
# n_features: feature dimension, n_samples: number of samples, centers: number of cluster centers (labels),
# cluster_std: standard deviation of each cluster, random_state: random seed
data, label = make_blobs(n_features=3, n_samples=1000, centers=5, cluster_std=0.5, random_state=42)
data = data.astype('float32')
# Check the data shape
print(data.shape)
# Check the label shape
print(label.shape)
(1000, 3)
(1000,)
# Number of cluster centroids
ncentroids = 5
# Number of iterations
niter = 1
verbose = True
d = data.shape[1]
# Create the model
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
# Train the model
kmeans.train(data)
# Assign each point to its nearest centroid
D, I = kmeans.index.search(data, 1)
# "Accuracy": fraction of points whose cluster id matches the original label;
# cluster ids are arbitrary, so this is only a rough indication
sum(label==I.reshape(-1))/len(label)
Clustering 1000 points in 3D to 5 clusters, redo 1 times, 1 iterations
  Preprocessing in 0.00 s
  Iteration 0 (0.69 s, search 0.39 s): objective=53190.2 imbalance=1.456 nsplit=0
0.6
# Clustering with sklearn
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# KMeans clustering
model = KMeans(n_clusters=5)
pred = model.fit_predict(data)
# "Accuracy" against the original labels (same caveat as above)
sum(label==pred)/len(label)
0.6

Generally speaking, when the sample dimension is high or the amount of data is large, Faiss clustering is a very good choice.

Dimensionality reduction

For dimensionality reduction Faiss mainly uses PCA, which is much faster than sklearn. Sample code is shown below.

# Randomly create data
# n_features: feature dimension, n_samples: number of samples, centers: number of cluster centers (labels),
# cluster_std: standard deviation of each cluster, random_state: random seed
data, label = make_blobs(n_features=512, n_samples=1000, centers=10, cluster_std=0.5, random_state=42)
data = data.astype('float32')
# Reduce to 2 dimensions
mat = faiss.PCAMatrix(512, 2)
mat.train(data)
assert mat.is_trained
tr = mat.apply(data)

# Visualize the data
plt.scatter(tr[:, 0], tr[:, 1], s=80, c=label)
(Scatter plot of the data after Faiss PCA, colored by label.)

# Use sklearn PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
tr = pca.fit_transform(data)
# Visualize the data
plt.scatter(tr[:, 0], tr[:, 1], s=80, c=label)
(Scatter plot of the data after sklearn PCA, colored by label.)

Product quantization

Sometimes the original vector dimension or the number of vectors is too large and computing similarities is too slow. Product quantization compresses the original vectors, reducing their size, speeding up retrieval and saving memory. For an introduction to product quantization, see Product Quantization PQ of ANN . The compressed codes can also be decoded back into approximations of the original data; the reconstruction error is generally low, but training the quantizer is slow.

# Data dimension
d = 16
# Number of sub-quantizers
cs = 4

# Training data
nt = 10000
xt = np.random.rand(nt, d).astype('float32')
# Data to encode
n = 5000
x = np.random.rand(n, d).astype('float32')
# Train the quantizer (8 bits per sub-quantizer)
pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)
# Encode
codes = pq.compute_codes(x)
codes.shape
# Decode
x2 = pq.decode(codes)

# Compute the reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
avg_relative_error
0.016456977

For the introduction and use of other Faiss interfaces, see Faiss process and principle analysis .

3 Reference

Origin: blog.csdn.net/LuohenYJ/article/details/125897842