Faster, stronger and more stable: Tencent vector database evaluation

Vector database: the new foundation for the AI era

Artificial intelligence now touches almost every part of our lives, and its rapid development depends on processing ever larger volumes of data. Traditional databases struggle to support large-scale processing of complex data, and with the rise of large models, vector databases have come to the fore. NVIDIA CEO Jensen Huang mentioned vector databases for the first time in his NVIDIA GTC keynote and emphasized their importance for organizations building proprietary large language models. If large models act as a new generation of AI processors that supply the computing capability, vector databases are the key foundation that supplies the storage capability.

A vector database is a database system designed specifically for storing and querying vector data. It uses vectorized computation to process large-scale, complex data quickly, and for this kind of workload it outperforms relational databases. Unlike traditional databases, vector databases can handle high-dimensional data such as images, audio, and video, which removes a bottleneck of traditional relational databases. In addition, a vector database can be scaled out to multiple nodes using distributed, cloud, and edge computing technologies, which expands the scale of data it can process and improves the stability of vector storage, management, and querying.

We can't help but ask, what core technologies are behind the outstanding performance of vector databases compared to traditional databases?

Core technology of vector database

What is vector data?

Vector data is a sequence of numeric values that can represent the magnitude and direction of a quantity. Through embedding techniques, images, sounds, and text can all be expressed as high-dimensional vectors. For example, a picture can be converted into a vector composed of its pixel values.

In computer science and data science, vector data is widely used in various fields, such as machine learning, image processing, natural language processing, etc. For example, in machine learning, data are usually represented in the form of vectors, with each dimension representing a feature, allowing for the training and prediction of various models. By calculating vector data, tasks such as similarity comparison, cluster analysis, classification and prediction can be performed. The processing and analysis of vector data is critical to data science and artificial intelligence applications in many fields.
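To make "similarity comparison" concrete, here is a minimal Python sketch of two common measures, Euclidean (L2) distance and cosine similarity. The three-dimensional vectors and their names are made-up toy values for illustration; real embeddings typically have hundreds of dimensions.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by a real model).
cat = [1.0, 0.0, 0.0]
kitten = [0.8, 0.6, 0.0]
car = [0.0, 0.0, 1.0]

print(cosine_similarity(cat, kitten), cosine_similarity(cat, car))
print(l2_distance(cat, kitten), l2_distance(cat, car))
```

Under both measures, `kitten` comes out closer to `cat` than `car` does, which is the property a similarity search relies on.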

What is a vector database?

A vector database is a database system specifically designed to store and query vector data. It supports various operations on vector data. One is vector retrieval: given a query vector, find the most similar vectors in the database. Another is vector clustering: group the vectors in the database according to a given similarity measure, for example dividing pictures into different topics based on their content or style.
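As a rough illustration of vector retrieval, the sketch below scans a tiny in-memory collection and returns the entries closest to a query. The dictionary, the ids, and the function name `search_top_k` are hypothetical and are not part of any vector database API.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_top_k(database, query, k):
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = ((vec_id, l2_distance(vec, query)) for vec_id, vec in database.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# A toy "database" mapping ids to 2-dimensional vectors (made-up values).
database = {
    "img_1": [0.1, 0.9],
    "img_2": [0.8, 0.2],
    "img_3": [0.15, 0.85],
}

results = search_top_k(database, [0.1, 0.8], k=2)
print(results)
```

Every query here touches every stored vector, which is exactly the cost that index structures are designed to avoid on large data sets.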

What are the technical difficulties with vector databases?

Vector databases face four major difficulties: high dimensionality, sparsity, heterogeneity, and dynamics. High dimensionality means that vector data usually contains a large number of elements; the higher the dimension, the greater the demands on database performance. Sparsity means that many elements of a vector may be zero or close to zero, with only a few elements carrying significant non-zero values; the sparser the distribution, the harder the data is to process. Heterogeneity means that the elements of a vector may have different types or meanings, representing different features or attributes. Dynamics means that vector data may change over time or with its environment and may need to be updated in real time; the more frequent the updates, the greater the demands on query and retrieval.

Distributed system architecture and load balancing

Vector data sets are so large that a single machine cannot meet the storage and computing requirements, so a distributed system must be used. For example, the following figure is the distributed architecture diagram of Tencent Cloud Vector Database. Client requests are distributed across the nodes by a load balancer. Each node can serve read and write operations directly and is responsible for both computation and storage.
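The idea of spreading requests over nodes can be sketched with a simple round-robin dispatcher. This is only an illustration of load balancing in general: the `Node` and `RoundRobinBalancer` classes are hypothetical, and the article does not say which balancing strategy Tencent Cloud Vector Database actually uses.

```python
import itertools

class Node:
    """A stand-in for one database node that can serve requests directly."""
    def __init__(self, name):
        self.name = name
        self.handled = 0

    def handle(self, request):
        self.handled += 1
        return f"{self.name} handled {request}"

class RoundRobinBalancer:
    """Hands each incoming request to the next node in a fixed rotation."""
    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def dispatch(self, request):
        return next(self._cycle).handle(request)

nodes = [Node("node-0"), Node("node-1"), Node("node-2")]
balancer = RoundRobinBalancer(nodes)
replies = [balancer.dispatch(f"query-{i}") for i in range(6)]
print(replies)
```

With six requests and three nodes, each node ends up serving two requests.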

Vector indexing technology

Vector data has very high dimensionality, so a direct full scan or a tree-based index leads to poor efficiency or excessive memory use. Vector indexing is an indexing approach designed specifically for vector data, to speed up similarity search and query operations. First, the raw data is vectorized, so that each item is mapped to a point in a high-dimensional space. The index structure is then built according to the chosen definition of similarity between vectors. Commonly used vector indexing methods include:

  1. FLAT: In a FLAT index, vectors are stored as floating-point values without any compression. A search compares the query vector against every stored vector, so the results are exact. It works well when the data set is small, but its performance degrades significantly when the data set is very large. FLAT is suitable for scenarios where the amount of data is small and exact matching is required.
  2. HNSW: A graph-based algorithm that maintains high accuracy in high-dimensional space. HNSW organizes the data set into a multi-layer graph and searches for nearest neighbors by greedy traversal, starting in the sparse top layer and descending to the dense bottom layer. Building an HNSW index is more complex than building a FLAT index and requires more computing resources, but retrieval is faster. HNSW is suitable for scenarios that require fast approximate matching.
  3. IVF series: The core idea of the IVF series of indexes is to divide the high-dimensional space into multiple clusters and build an inverted list of vectors for each cluster. A query then scans only the clusters closest to it, which enables efficient similarity search over large-scale high-dimensional vector data.
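The IVF idea can be sketched in a few lines of Python. This is a simplified illustration, not a production index: the centroids are hard-coded here (in practice they would come from a clustering step such as k-means), and the names `build_ivf`, `ivf_search`, and `nprobe` are chosen for this example.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector to its nearest centroid; return one inverted list per cluster."""
    lists = [[] for _ in centroids]
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in lists[c]]
    candidates.sort(key=lambda vec_id: l2(query, vectors[vec_id]))
    return candidates[:k]

# Six toy 2-dimensional vectors forming three obvious groups (made-up values).
vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.8, 0.1], [10.0, 0.3]]
centroids = [[0.1, 0.05], [5.1, 5.0], [9.9, 0.2]]  # assumed precomputed

lists = build_ivf(vectors, centroids)
hits = ivf_search([5.0, 5.0], vectors, centroids, lists, k=2, nprobe=1)
print(lists, hits)
```

With `nprobe=1` the query compares against two vectors instead of all six. Raising `nprobe` scans more clusters, trading speed for recall, which is why IVF search is approximate while a FLAT scan is exact.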

Advantages of Tencent Cloud Vector Database

The technical difficulties above have, to some extent, held back the large-scale adoption of vector databases, and many leading technology companies and organizations around the world are investing in research in this field. In the international market, for example, Zilliz cooperates with NVIDIA, IBM, Microsoft, and other companies, and Pinecone has launched on Google Cloud and AWS and is gradually opening up the market. In the domestic market, Tencent Cloud VectorDB is undoubtedly one of the most eye-catching products, attracting more and more users with its strong performance, high stability and reliability, and excellent cost-effectiveness.

Powerful performance

A single index in Tencent Cloud Vector Database supports up to 1 billion vectors. At the same vector dimension and data scale, it delivers a large performance improvement over open source vector databases. The blogger personally tests this performance in the next section.

This performance rests on a cloud-native distributed architecture, together with extensive optimization of load balancing, vector retrieval, and vector analysis, which reflects Tencent Cloud's deep technical foundation.

Super cost-effective

Tencent Cloud's vector database has been cloud-native by design from the very beginning. Thanks to Tencent Cloud's complete infrastructure, users can work directly on the cloud, which greatly reduces machine costs and operation and maintenance costs. Choosing Tencent Cloud Vector Database is therefore a very cost-effective choice.

High compatibility

The vector database supports many types and formats of vector data and provides interfaces and tools for multiple languages and platforms, so it is highly compatible and easy for users to integrate. Tencent Cloud Vector Database provides two SDKs: a Python SDK and an HTTP SDK. For example, to use the Python SDK, you only need to run the following pip command to install the tcvectordb library.

Shell
pip install tcvectordb-0.0.2-py3-none-any.whl

The Python SDK provides APIs for creating tables, writing data, querying data, and similarity retrieval, which are very convenient to use. Users can find more tutorials and examples in the official manual.

Actual performance measurement: 128-dimensional vector query

Preparation

Tencent Cloud Vector Database has recently opened for internal beta testing. You can go to the official website and follow the product manual to try the beta yourself! The blogger was lucky to be among the first to try Tencent Cloud Vector Database and personally tested what everyone cares about most: its performance, stability, and reliability. If you have not had time to try it yet, follow along from the blogger's perspective for a first look!

Before the formal evaluation, let us introduce our testing tool, ann-benchmarks. It is a performance testing tool for evaluating approximate nearest neighbor (ANN) search libraries. It includes multiple real data sets from fields such as images, text, and bioinformatics, and each data set comes with a known set of nearest neighbors that serves as the ground truth for evaluation. ANN-Benchmarks also provides commonly used evaluation metrics, such as accuracy, query time, and memory consumption, to measure how different algorithms perform on approximate nearest neighbor search. The specific data set information is shown in the following table:

We first create a new vector database in the console:

After creating the new database, we can click on the instance list to view the basic information, specification information and architecture diagram of the newly created instance:

For testing needs, we also need to purchase a CVM cloud server. You can purchase servers with different configurations according to your own needs:

The blogger chose the pay-as-you-go model when purchasing, and the configuration is as follows:

Execute tests

After activating the CVM, we start the machine, open our testing tool ann-benchmarks, and install the packages that the test environment depends on:

Then we copy the configuration file at ann_benchmarks/algorithms/vector_db/config.yml and rename the copy to mytest.yml. Look up the vector database instance we just created and fill in its internal IP address and port.

Next run run.py to perform the 128-dimensional vector performance test:

Multi-process stress test results

We use Euclidean (L2) distance as the metric. With a data set of 1 million 128-dimensional vectors, we retrieve the 10 most similar documents (Top 10) and compare QPS. The HNSW index of Tencent Cloud Vector Database achieves a recall rate of more than 99% at a QPS of about 13,800 or more. Under the same test conditions, open source options such as Faiss and Elasticsearch reach a QPS of no more than 4,000. Tencent Cloud Vector Database therefore delivers more than 3 times the throughput.
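For readers unfamiliar with the two metrics, the sketch below shows how recall and QPS are typically computed, followed by the arithmetic behind the speedup figure. The function names and the toy id lists are illustrative assumptions; this is not the code ann-benchmarks itself runs.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Average fraction of the true top-k neighbours found in the retrieved top-k."""
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        total += len(set(got[:k]) & set(truth[:k])) / k
    return total / len(ground_truth)

def qps(num_queries, elapsed_seconds):
    """Queries per second: completed queries divided by wall-clock time."""
    return num_queries / elapsed_seconds

# Two toy queries with k=3: the first is fully correct, the second misses one neighbour.
retrieved = [[1, 2, 3], [4, 5, 9]]
ground_truth = [[1, 2, 3], [4, 5, 6]]
recall = recall_at_k(retrieved, ground_truth, k=3)

# Ratio of the QPS figures reported in the article.
speedup = 13800 / 4000

print(recall, speedup)
```

The ratio works out to 3.45, which is where the "more than 3 times" figure comes from. Recall and QPS should always be read together, since an approximate index can usually trade one for the other.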

Epilogue

The craze for large models signals that another wave of artificial intelligence has arrived. As a foundation of the AI era, vector databases will surely enter a new stage of rapid development. The blogger personally tried Tencent Cloud Vector Database, which effectively addresses many of the difficulties of traditional databases, and was impressed by its performance and stability. The future will be an era of AI. Let us embrace AI and vector databases together!

 


Origin blog.csdn.net/qq_41895747/article/details/132646636