Hello, Vector DB | Probably the most accessible Faiss tutorial

Do you have any questions like:

How does NetEase Cloud Music recommend similar songs based on my music taste? How does Taobao judge my buying preferences? How does the mobile phone photo album recognize faces in photos and group photos of the same person into the same group?

In fact, the technology behind all of this is similarity search (sometimes also called nearest neighbor search). Similarity search plays a central role in many artificial intelligence (AI) and machine learning (ML) applications: given a query, it finds the items in a dataset that are most similar to it. For example, NetEase Cloud Music can query its system for the songs most similar to a user's favorites.

Although similarity search is powerful, it has a problem. When the amount of data is very large, traditional methods such as comparing the query against every item one by one can be very inefficient. This is where Faiss comes in. Faiss is a vector retrieval library developed by Facebook AI that provides an efficient and reliable solution for similarity search over large-scale data.

This article is part of the "Hello, Vector Database" series. It starts from Faiss, the origin of Milvus and Zilliz Cloud, and covers how to install Faiss, best practices, and how Faiss compares with vector databases. Let's get started with Faiss!

What is Faiss?

Faiss stands for Facebook AI Similarity Search. It is a vector retrieval library designed for handling large-scale data.

The core concept in Faiss is "vector similarity". Simply put, a vector is a sequence of numbers, and vector similarity measures how alike two vectors are. For example, a song has many elements and features, and we can use one number to represent each of them. A song can then be represented by a sequence of numbers (that is, a vector). To search for songs similar to your favorite song, we compare the similarity between song vectors.
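To make the idea concrete, here is a minimal sketch in plain Python. The song names and the three features (tempo, energy, acousticness) are made up for illustration. It compares vectors with two common measures, L2 (Euclidean) distance and cosine similarity.

```python
import math

# Hypothetical feature vectors: [tempo, energy, acousticness]
songs = {
    "favorite": [0.80, 0.90, 0.10],
    "song_a": [0.75, 0.85, 0.15],
    "song_b": [0.20, 0.10, 0.95],
}

def l2_distance(u, v):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: closer to 1 means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = songs["favorite"]
candidates = [name for name in songs if name != "favorite"]

# The most similar song is the one with the smallest distance to the query
closest = min(candidates, key=lambda name: l2_distance(query, songs[name]))

for name in candidates:
    print(name,
          round(l2_distance(query, songs[name]), 3),
          round(cosine_similarity(query, songs[name]), 3))
print("closest:", closest)
```

With only two candidates a loop like this is enough. The need for a library such as Faiss appears when the number of vectors grows into the millions.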

In this process, we can use Faiss to compare the similarity between millions (or even billions) of vectors quickly and efficiently. Think of Faiss as a super search engine that scans a large music database at lightning speed to pinpoint songs similar to your favorites.

The magic of Faiss is not limited to the music recommendation system. Faiss may be used in many daily life application scenarios, such as image recognition, text retrieval, data clustering, data analysis, etc. In short, if you want to quickly find similar data from a massive database, you can use Faiss.

Install Faiss

The following tutorial will show how to install Faiss on a Linux system:

  1. Install Conda.

Before installing Faiss, install Conda on your system. Conda is an open source software package and environment management system that runs on Windows, macOS, and Linux operating systems. Follow the steps below to install Conda on a Linux system.

  1. Download the Miniconda installer from the official website.

  2. Complete the hash verification of the installation package.

  3. Open a terminal and run the following command to start the installer:

bash Miniconda3-latest-Linux-x86_64.sh

  4. During installation, if you are unsure about a setting, accept the default. You can change the settings at any time after the installation is complete.

  5. After the installation is complete, close the terminal and reopen it so that all changes take effect.

  6. Check that the installation is correct. Type conda list in a terminal or Anaconda Prompt and press Enter. If Conda is installed correctly, a list of installed packages appears.

Faiss can then be installed via Conda. Faiss offers 2 versions of the software package: the CPU version (faiss-cpu) and the GPU version (faiss-gpu).

We can choose to install the CPU or GPU version of Faiss as needed in the following two ways:

  1. Install Faiss via the pytorch Conda channel (recommended).
  • Install the CPU version

conda install -c pytorch faiss-cpu

  • Install the GPU version

conda install -c pytorch faiss-gpu

  2. Install Faiss via the conda-forge channel.
  • Install the CPU version

conda install -c conda-forge faiss-cpu

  • Install the GPU version

conda install -c conda-forge faiss-gpu

To see which channel an installed Conda package came from, use the conda list command.

Demonstration using the SQuAD dataset

Now let's explore what Faiss can do with an example. In this example, we will use the Stanford Question Answering Dataset (SQuAD), a commonly used natural language processing (NLP) dataset. It consists of questions posed by crowdworkers on a set of Wikipedia articles, and the answer to each question is a segment of text from the corresponding reading passage. In total, it contains more than 100,000 question-answer pairs on more than 500 articles.

Before we dive into the example code, please download the SQuAD dataset:

  1. Download the SQuAD dataset (https://rajpurkar.github.io/SQuAD-explorer/)

The examples in this article use SQuAD 1.1. After the download is complete, save the JSON file (train-v1.1.json) in the directory from which you will run the example code, because the code opens the file by its name alone.

  2. Read the downloaded JSON file. You can load the data with Python's built-in json module.
import json

with open('train-v1.1.json', 'r') as file:
    squad_data = json.load(file)
  3. Import libraries

First, import all the necessary libraries: NumPy for number crunching, Faiss for similarity search, JSON for loading the dataset, and NLTK for tokenizing text.

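import numpy as np
import faiss
import json
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')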
  4. Load and preprocess data

Next, load the SQuAD dataset. The dataset is a JSON file, so we can use the load function of the json module to load the data.

with open('train-v1.1.json', 'r') as file:
    squad_data = json.load(file)

After the dataset is loaded, the data needs to be preprocessed. We tokenize each paragraph using NLTK's word_tokenize function, which splits text into individual words. We then represent each paragraph as a binary bag-of-words vector: the vector has one position per vocabulary word, and a position is set to 1 if that word occurs in the paragraph.

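# Collect every distinct token that occurs in any paragraph
vocabulary = set(word for article in squad_data['data'] for paragraph in article['paragraphs'] for word in word_tokenize(paragraph['context']))
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def convert_text_to_vector(text):
    # Binary bag-of-words: 1 at the position of every vocabulary word found in the text
    words = word_tokenize(text)
    # float32 is the dtype Faiss works with and uses half the memory of NumPy's float64 default
    bow_vector = np.zeros(len(vocabulary), dtype='float32')
    for word in words:
        if word in word_to_index:
            bow_vector[word_to_index[word]] = 1
    return bow_vector

# One dense vector per paragraph; this can take a lot of memory on the full training set
paragraph_vectors = [convert_text_to_vector(paragraph['context']) for article in squad_data['data'] for paragraph in article['paragraphs']]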
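To see what such a vector looks like, here is a toy version of the same encoding on a two-sentence corpus. It is only a sketch: it splits on whitespace instead of using NLTK, and it sorts the vocabulary (an addition made here) so that the word positions are the same on every run.

```python
# Toy corpus standing in for the SQuAD paragraphs
paragraphs = [
    "paris is the capital of france",
    "berlin is the capital of germany",
]

# Sorted so that each word gets a reproducible position
vocabulary = sorted({word for text in paragraphs for word in text.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def to_binary_bow(text):
    """Return a list with 1.0 at the position of every known word in `text`."""
    vector = [0.0] * len(vocabulary)
    for word in text.split():
        if word in word_to_index:  # words outside the vocabulary are ignored
            vector[word_to_index[word]] = 1.0
    return vector

print(vocabulary)
print(to_binary_bow("capital of france"))
```

Every vector has as many positions as there are words in the vocabulary, which is why the real SQuAD vectors are very long and mostly zero.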
  5. Build the index

After loading and preprocessing the data, we can build a Faiss index on it. This example uses IndexFlatL2, the basic index that performs exact (brute-force) search using L2 (Euclidean) distance.

dimension = len(vocabulary)
index = faiss.IndexFlatL2(dimension)
# Stack the list of NumPy arrays into a single two-dimensional array
paragraph_vectors = np.stack(paragraph_vectors).astype('float32')
index.add(paragraph_vectors)
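For intuition about what a flat L2 index computes, the sketch below performs the same kind of exhaustive search in plain NumPy. The function name flat_l2_search and the toy data are made up here, and this is not how Faiss is implemented internally. It only mirrors the behavior: every query is compared with every stored vector, and squared L2 distances are returned, which is also what IndexFlatL2 reports.

```python
import numpy as np

def flat_l2_search(database, queries, k):
    """Exact k-nearest-neighbor search by comparing every query with every vector.

    database: (n, d) array, queries: (m, d) array.
    Returns (distances, indices), each of shape (m, k), nearest first.
    Distances are squared L2 distances.
    """
    # diff[i, j] is queries[i] - database[j], shape (m, n, d)
    diff = queries[:, None, :] - database[None, :, :]
    squared = (diff ** 2).sum(axis=2)                 # (m, n)
    indices = np.argsort(squared, axis=1)[:, :k]      # k smallest per query
    distances = np.take_along_axis(squared, indices, axis=1)
    return distances, indices

database = np.array([[0, 0], [1, 0], [0, 3], [5, 5]], dtype="float32")
queries = np.array([[0.9, 0.1], [4, 4]], dtype="float32")

distances, indices = flat_l2_search(database, queries, k=2)
print(indices)
print(distances)
```

The work grows with the number of stored vectors times the number of queries, which is why exhaustive search becomes slow on very large collections.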
  6. Similarity search

Once the index is built, you can start looking for the passages in the dataset that are most similar to the query you enter.

Here is the search function:

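# Flat list of paragraph texts, in the same order as the vectors added to the index
paragraph_texts = [paragraph['context'] for article in squad_data['data'] for paragraph in article['paragraphs']]

def search_for_paragraphs(search_term, num_results):
    search_vector = convert_text_to_vector(search_term)
    search_vector = np.array([search_vector]).astype('float32')
    distances, indexes = index.search(search_vector, num_results)
    # The loop variable must not be called `index`, or it would shadow the Faiss index above
    for i, (distance, paragraph_id) in enumerate(zip(distances[0], indexes[0])):
        print(f"Result {i+1}, Distance: {distance}")
        print(paragraph_texts[paragraph_id])
        print()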

Specify the search term as "What is the capital of France?" and return the 5 most similar results.

search_term = "What is the capital of France?"
search_for_paragraphs(search_term, 5)

The search_for_paragraphs() function first converts the search term into a vector, using the same encoding as the paragraphs. It then calls the search method on the index, passing the number of results we want (num_results). The search method returns two two-dimensional arrays, each with one row per query: the first holds the distances of the most similar results, and the second holds their positions in the index. We use these positions to look up the actual passages in the dataset, and finally print the rank, distance, and text of each result.

The above example shows how to use Faiss to find similar text fragments. Of course, Faiss can be used in more complex application scenarios, such as searching in high-dimensional spaces.

Best Practices and Tips

Get familiar with the data: Before using Faiss, you need to spend a little time understanding the data. You can ask yourself some questions, such as: How big is this data set? Is the data information complete? Familiarity with the data will help in choosing the correct Faiss index type and determining the best way to handle the data.

Data preprocessing: How you preprocess the data greatly affects the results you get from Faiss. For text data, consider smarter ways to convert words to numbers, such as TF-IDF or Word2Vec. For image data, you can use a convolutional neural network (CNN) to extract feature vectors.
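As an illustration of the TF-IDF idea, the sketch below weights each word by how often it occurs in a document and how rare it is across documents. It uses one simple variant (raw term frequency times the logarithm of the inverse document frequency, with no smoothing). Library implementations often add smoothing or normalization, so treat the exact numbers as specific to this sketch. The three documents are made up.

```python
import math
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: in how many documents does each word appear?
document_frequency = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf_vector(tokens):
    """Weight each word by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)
        idf = math.log(len(tokenized) / document_frequency[word])
        vector.append(tf * idf)
    return vector

vectors = [tf_idf_vector(tokens) for tokens in tokenized]

# Weights of the first document, keyed by word for readability
weights = dict(zip(vocabulary, vectors[0]))
print(weights)
```

In the first document, "cat" ends up with a higher weight than "the" even though "the" occurs twice, because "the" also appears in another document and is therefore less distinctive.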

Choose the most suitable index type: Faiss provides a variety of index types, each of which has different applicable scenarios. Some indexes can efficiently handle high-dimensional data, some indexes are suitable for processing binary vectors, and some indexes are designed to handle large amounts of data. Therefore, you can choose the most suitable index type according to your needs and actual situation.

Batch query: If there are multiple queries that need to be run at the same time, Faiss can be used to process them together. It is more efficient to run batch queries at one time, and Faiss is optimized for batch processing.
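With Faiss, a batch is simply a two-dimensional array with one query per row, passed to a single search call. The NumPy sketch below hints at one reason batching can help: all distances for a whole batch can be computed with a single matrix multiplication instead of a Python loop. The function names and the random data are made up for illustration, and the sketch does not claim to reproduce Faiss internals.

```python
import numpy as np

rng = np.random.default_rng(0)
database = rng.random((1000, 16))   # 1000 stored vectors
queries = rng.random((8, 16))       # 8 queries, one per row

def nearest_one_by_one(database, queries):
    """Answer each query separately in a Python loop."""
    return np.array([
        np.argmin(((database - q) ** 2).sum(axis=1)) for q in queries
    ])

def nearest_batched(database, queries):
    """Answer all queries at once using ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2."""
    q_norms = (queries ** 2).sum(axis=1)[:, None]     # (m, 1)
    x_norms = (database ** 2).sum(axis=1)[None, :]    # (1, n)
    squared = q_norms - 2.0 * queries @ database.T + x_norms  # (m, n)
    return np.argmin(squared, axis=1)

print(nearest_one_by_one(database, queries))
print(nearest_batched(database, queries))
```

Both functions return the same nearest neighbors. The batched version does the heavy lifting in one matrix product, which numerical libraries execute very efficiently.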

Tune parameters: Faiss lets you adjust parameters flexibly. For example, with a cluster-based index you can set the number of clusters when building the index and the number of clusters probed per query (nprobe) when searching. The default values do not necessarily give the best performance for an index, so try different values to find the most suitable settings.
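The toy model below illustrates what these two parameters control in a cluster-based (inverted file) index. It is a simplified NumPy sketch, not Faiss code: the name nlist for the number of clusters, the randomly chosen centroids, and the random data are all assumptions made for illustration, and a real index would train its centroids, for example with k-means.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((2000, 8))
nlist = 20  # number of clusters, fixed when the index is built

# Build step: pick random vectors as centroids and assign every vector
# to the cluster of its nearest centroid.
centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)]

def squared_l2(a, b):
    """Pairwise squared L2 distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

assignments = np.argmin(squared_l2(vectors, centroids), axis=1)
clusters = [np.flatnonzero(assignments == c) for c in range(nlist)]

def ivf_search(query, k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    centroid_dist = squared_l2(query[None, :], centroids)[0]
    probed = np.argsort(centroid_dist)[:nprobe]
    candidates = np.concatenate([clusters[c] for c in probed])
    dist = squared_l2(query[None, :], vectors[candidates])[0]
    order = np.argsort(dist)[:k]
    return candidates[order], len(candidates)

query = rng.random(8)
exact = np.argsort(squared_l2(query[None, :], vectors)[0])[:5]

for nprobe in (1, 5, nlist):
    found, scanned = ivf_search(query, k=5, nprobe=nprobe)
    recall = len(set(found.tolist()) & set(exact.tolist())) / len(exact)
    print(f"nprobe={nprobe:2d} scanned={scanned:4d} recall@5={recall:.1f}")
```

A small nprobe scans few vectors and is fast but may miss true neighbors. Probing every cluster scans everything and matches the exact result. Tuning means finding the value between these extremes that suits your data and latency budget.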

Vector Database vs. Faiss

Faiss is an efficient approximate nearest neighbor (ANN) search library, but it reaches its limits when you need to store and retrieve tens of millions of vectors with real-time responses, or when you need more advanced features.

Vector databases can effectively solve these problems:

  • The vector database supports basic data addition, deletion, modification and query (CRUD) operations, can adjust the consistency level, and provides functions such as conditional filtering;

  • The vector database can provide stronger data persistence and perform better in disaster recovery and system availability;

  • The vector database adopts a distributed architecture that separates computing and storage, supports load balancing, flexible expansion, and higher system availability;

  • The vector database provides advanced functions such as multi-tenancy, metrics monitoring, and role-based access control (RBAC), and offers SDKs in various programming languages as well as RESTful APIs.

Milvus is the world's first open source vector database that can store, index and manage billions of vector data. Zilliz contributed Milvus to the LF AI & Data Foundation as an incubation project, and in June 2021, Milvus graduated from the foundation. Milvus Lite is a lightweight version of Milvus contributed by Ji Bin, an active community member.

Zilliz Cloud is a fully managed vector database built on Milvus. Using Zilliz Cloud, the speed of vector retrieval can be increased by more than 10 times, and developers can easily deploy and flexibly scale vector search applications.

Summary

This article describes what Faiss is, how to install Faiss, and how to use Faiss. At the end of this article, we also compared the difference between the vector database and Faiss. Faiss is a very useful tool that can help us search massive amounts of data efficiently and is applicable to different scenarios. I hope that everyone can be inspired by this article and start a new journey of exploration and learning. Of course, you are also welcome to continue to study the various index types provided by Faiss, explore more complex data preprocessing techniques, or try to build some vector similarity search applications yourself!


Origin my.oschina.net/u/4209276/blog/10090807