Exploration and practice of BES in large-scale vector database scenarios


Introduction 

This article is based on the keynote of the same name, "The Exploration and Practice of BES in Large-Scale Vector Database Scenarios," delivered at the Vector Database sub-forum of the QCon Global Software Development Conference 2023 (Beijing) on September 5, 2023.

The full text is 5989 words, and the estimated reading time is 15 minutes.

A vector database is a database system specifically designed to store and query vector data. Through embedding techniques, the features of images, audio, text, and other data can be extracted and expressed as vectors, and the distance between vectors expresses the similarity of the underlying data. Feature vectors of the original data can therefore be stored in a vector database, and similar original data can then be found through vector retrieval, as in image search applications.

1. Introduction to vector database applications

Vector retrieval was already a well-developed technology before the emergence of large models. With the development of deep learning, vector databases are widely used in image, audio, and video search and recommendation, as well as in semantic retrieval, text question answering, face recognition, and other scenarios.


The emergence of large models has changed the way humans interact with computers, brought about a new revolution in artificial intelligence, and spawned a large number of AI-native applications.

However, large models are still in an early stage of development, and many problems remain in practical applications.

First, their knowledge is limited. Although large models can answer general questions, in vertical domains the professionalism of their answers still leaves room for improvement because of limited training data. Large models also suffer from hallucination: their answers may distort the facts.

Second, training a large model is slow and expensive, so it cannot be retrained frequently. This makes it hard for the model to incorporate real-time data, leaving it able to answer only general questions that are not time-sensitive.

Beyond knowledge, it is also difficult to guarantee the security of private data in a large model. For example, if I provide some of my private data to a large model, that information may well be spoken out as an answer when the model serves other users.

So how can we enhance the knowledge capabilities of large models while also protecting the security of private data?

For example, attaching the weather warning issued by the Meteorological Bureau to the question can help the large model answer it accurately.

This technique of enhancing a large model's capabilities with external data and tools is called prompt engineering.

From the discussion above, we can see that prompt engineering is of great significance for large model applications. Below we introduce in detail how to build a prompt engineering pipeline.


Let’s look at the core workflow of a large model application.

External real-time news, professional knowledge, industry data, and other materials are periodically embedded into vectors and stored in the vector database, incrementally building an external knowledge base for the large model. When a user asks a question, the most relevant content is first retrieved from the knowledge base and then spliced into the prompt to enhance the large model's answering ability.

In this way, external real-time data and knowledge can be introduced without retraining the large model, enhancing its answering ability and reducing deviations of its answers from the facts. Building the external knowledge base on a vector database also effectively protects the privacy and security of industry data.

In addition, storing the user's conversation history in the vector knowledge base allows highly relevant historical turns to be retrieved in subsequent conversations, mitigating the lack of long-term memory caused by the limited token window of large language models.
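As a minimal sketch of this workflow, the Python function below wires the three steps together. The `embed`, `search`, and `chat` callables are hypothetical stand-ins for an embedding model, the vector database, and the large model respectively; they are injected as parameters rather than being any specific BES or model API.

```python
from typing import Callable, List

def answer_with_knowledge_base(
    question: str,
    embed: Callable[[str], List[float]],               # embedding model: text -> vector
    search: Callable[[List[float], int], List[str]],   # vector DB: (vector, k) -> snippets
    chat: Callable[[str], str],                        # large model: prompt -> answer
    top_k: int = 3,
) -> str:
    """Retrieve relevant knowledge and splice it into the prompt before asking the LLM."""
    query_vector = embed(question)
    # Recall the most relevant snippets from the external knowledge base.
    snippets = search(query_vector, top_k)
    # Prompt engineering: hand the model the retrieved material together with the question.
    prompt = (
        "Answer the question using the reference material below.\n"
        "Reference material:\n" + "\n".join(snippets) + "\n"
        f"Question: {question}"
    )
    return chat(prompt)
```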


2. Engineering practice of Baidu Intelligent Cloud BES

Elasticsearch is a distributed search and analysis engine based on Apache Lucene. It ranks first among search engine databases and is the most popular open source solution in the world. It supports multiple data types, including structured and unstructured data; its interface is simple and easy to use, its documentation is complete, and there are a large number of practical cases in the industry.

Baidu Intelligent Cloud Elasticsearch (BES) is a mature public cloud product built on open source Elasticsearch, with cloud resource guarantees and operations capabilities. BES was released in 2015 as the first hosted ES service offered by a public cloud vendor. In 2018, we introduced the Baidu NLP word segmentation plug-in and added snapshot and restore based on the object storage BOS. In 2020, we added capabilities such as a hot-cold separation architecture based on BOS and provided vector retrieval, which is widely used inside Baidu and has substantial engineering accumulation. In 2023, we optimized the vector engine, resource packages, and other aspects for vector retrieval scenarios to meet the needs of large model scenarios.


The architecture of BES consists of two parts: the control platform and BES cluster instances. The control platform is a global platform for unified cluster management, monitoring and alerting, cluster scaling, and hot-cold tiering scheduling. A BES cluster instance is an Elasticsearch cluster built on cloud hosts and cloud disks, with node load balancing handled by the BLB layer-4 proxy. Data on disk can be periodically moved to the object storage BOS through policies to reduce storage cost.


The index architecture of BES adopts a Shared Nothing + MPP computing architecture overall and reuses the ES data flow. Data is organized by index and shard; shards can have multiple replicas, which are distributed and routed in the standard ES way. When performing vector retrieval, local computation happens on the nodes holding the data, and query efficiency is improved through multi-node parallel computing. By increasing the number of replicas together with scaling out, the number of nodes and resources that can participate in queries grows, improving the overall query QPS of the service.
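For example, with standard open-source Elasticsearch semantics (which BES follows for shard and replica distribution), the replica count of an index can be raised dynamically so that more nodes hold a copy of each shard. The endpoint and index name below are illustrative, using the elasticsearch Python client in its 8.x style.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

# Raise the replica count so more nodes can serve vector queries in parallel
# (combine with adding nodes to the cluster to actually gain QPS).
es.indices.put_settings(
    index="doc_vectors",
    settings={"index": {"number_of_replicas": 2}},
)
```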

Vector data is managed in the standard ES way, so using it is not much different from using scalar data. Vector data can be written together with scalar data and reuses ES's retrieval and processing capabilities for scalar data. Users write data in batches through the ES Bulk interface; after the data is written to each shard, segment files are formed on disk and then become searchable. ES regularly schedules segment merging in the background to improve retrieval efficiency. At query time, the request is sent to an arbitrary node, which becomes the coordinating node; it fans the query out to each shard for parallel computation and then merges the per-shard TopK results into the final result.


Using vectors in BES is also very simple. First, define the index mapping and specify the vector-related parameters; this step is equivalent to creating a table. Data can then be written through ES's Bulk interface. In practice, the raw data is usually vectorized by an embedding service and then written in batches. Vector retrieval can then be performed through a syntax we defined that stays close to ES style.
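The exact DSL of BES's self-developed vector plug-in is not reproduced here. As an assumption-laden sketch, the example below uses open-source Elasticsearch 8.x conventions (a `dense_vector` mapping, Bulk writes, and `knn` search) with the elasticsearch Python client to illustrate the same three steps; the index name, field names, and toy vectors are made up.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

# 1. Define the mapping ("create the table"), including vector parameters.
es.indices.create(
    index="doc_vectors",
    mappings={
        "properties": {
            "title": {"type": "keyword"},
            "vector": {"type": "dense_vector", "dims": 128,
                       "index": True, "similarity": "l2_norm"},
        }
    },
)

# 2. Batch-write documents; in real scenarios the vectors come from an Embedding service.
docs = [{"_index": "doc_vectors",
         "_source": {"title": f"doc-{i}", "vector": [0.1 * i] * 128}}
        for i in range(1000)]
helpers.bulk(es, docs)

# 3. Approximate nearest-neighbor retrieval.
resp = es.search(
    index="doc_vectors",
    knn={"field": "vector", "query_vector": [0.05] * 128,
         "k": 10, "num_candidates": 100},
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```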


In terms of vector index implementation, we chose to provide the vector-related capabilities through a self-developed plug-in. The core engine is implemented in C++ and called from ES through JNI.

We chose a self-developed plug-in for three reasons. First, a C++ implementation close to the bottom layer makes it easier to pursue extreme performance and to accelerate computation with SIMD and other optimizations. Second, it lets us rewrite the underlying storage format, which also helps squeeze out more performance. Third, it gives us more flexible control over the retrieval logic and lets us rewrite execution plans to implement more complex queries.

For the core vector retrieval engine, we chose to do secondary development on top of excellent vector libraries from the community. We compared the out-of-the-box performance of nmslib and faiss on ES (tested on an 8-CU virtual machine, with the segments produced by the written data left unmerged, using the 128-dimensional SIFT-1M dataset). HNSW achieves a higher recall rate at the cost of relatively high memory consumption, and nmslib's implementation performs better.

By modifying the vector retrieval engine implementation, we reused our custom Lucene columnar storage for the level-0 vector data of the HNSW index type and load it through mmap. On the one hand, this reduces data redundancy and wasted resources; on the other hand, loading data through mmap means that when memory is insufficient some pages can be swapped out and loaded back only when they need to be read, which also provides a degree of support for mixed memory + disk storage media.


Before diving into specific optimizations, let's review the principles of the current mainstream graph algorithms.

Graph-based algorithms are a relatively new approach to approximate vector retrieval. They are based on navigable small world theory: in a small-world network, any two points can reach each other within roughly six hops (the "six degrees of separation" idea). By building a "small world" network, similar to the real world, from the distance relationships between vectors, a greedy algorithm can follow the distance-based edges and approach the target vector one hop at a time.
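The greedy routing idea can be sketched in a few lines of Python: start from an entry point and repeatedly hop to whichever neighbor is closer to the query, stopping when no neighbor improves the distance. The graph and distance function here are illustrative toys; real NSW/HNSW implementations maintain a candidate heap and a dynamic result list rather than a single current point.

```python
import math
from typing import Dict, List, Sequence

def l2(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(graph: Dict[int, List[int]],
                  vectors: Dict[int, Sequence[float]],
                  query: Sequence[float],
                  entry: int) -> int:
    """Hop along edges, always moving to the neighbor closest to the query."""
    current, current_dist = entry, l2(vectors[entry], query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            d = l2(vectors[neighbor], query)
            if d < current_dist:            # greedy step: move closer to the target
                current, current_dist = neighbor, d
                improved = True
    return current
```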


As points are continuously inserted into the small world, the idea described above of connecting each point to its nearby neighbors means that newly inserted points become increasingly confined to a small local circle, and reaching a faraway point then takes a large number of hops.

To solve this problem, the industry proposed the HNSW algorithm, which borrows the skip-list idea from linked-list search. Graphs are built at multiple levels, with the number of vertices decreasing exponentially toward the upper levels to form sparser graphs; the sparser the graph, the better it connects distant regions.


Although an HNSW index answers queries quickly with a relatively high recall rate, it also has drawbacks such as slow graph construction and high CPU and memory overhead. During HNSW construction, every inserted point requires a search and distance calculations, so inserting a large number of points is computationally very expensive; importing data therefore becomes slow and blocks the foreground.

Therefore, we changed vector index construction into an asynchronous background mechanism: a write can return as soon as the data reaches the disk, and the HNSW index is then built in the background, triggered by ES's merge policy, on a schedule, or actively by the user. A streaming construction method reduces memory consumption during graph building, and construction runs in an independent thread pool so it does not affect foreground query requests.
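Conceptually, decoupling the write path from index construction looks like the sketch below: a write returns as soon as the raw vectors are persisted, while the expensive graph build is handed to a dedicated thread pool. This is only an illustration of the idea, not BES's actual implementation; the `persist` and `build_index` callables are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# A dedicated pool for index construction, kept separate from query-serving threads.
index_build_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="vector-build")

def write_vectors(batch: Sequence[Sequence[float]],
                  persist: Callable[[Sequence[Sequence[float]]], str],
                  build_index: Callable[[str], None]) -> str:
    """Return as soon as the raw data is on disk; build the HNSW index later."""
    segment_path = persist(batch)                        # foreground: durability only
    index_build_pool.submit(build_index, segment_path)   # background: graph construction
    return segment_path
```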


At the same time, we also optimized ES's segment merging strategy.

ES's default merge policy selects a set of segments to merge based on conditions such as the number of segments participating in a merge and the size of the new segment the merge would produce, then scores the different combinations and merges the best one. Multiple rounds of merging are generally performed to balance resource consumption against query efficiency. However, because building a vector index is usually more expensive than building ES's native data types, we adjusted the merge strategy for vector indexes and turned the multiple rounds of merging into a single merge to reduce the overhead of segment merging.

In addition, we support turning off automatic index construction before writing data. After the data has been written in batches and the segments merged, the vector index can then be built in one go through an API, which better suits bulk database-loading scenarios.


In addition, BES supports multiple vector index types and distance metrics, and provides workflow support for vector indexes that must be trained before they can be built, such as the IVF family. Let's first review the principles of the IVF algorithm.

IVF stands for inverted file index. The inverted index is a search engine term: keywords are extracted from documents or web pages to build an inverted structure so that the original documents can be found by keyword. So what is the "keyword" of a vector? Real-world vectors are generally distributed in clusters in space, and this clustering property is what IVF exploits.

The cluster centers of the vectors are extracted with the k-means algorithm; the cluster center a vector belongs to then serves as that vector's keyword. An inverted index built this way lets a query first hit the cluster centers, just like a search engine, and then brute-force search only the vectors under those centers, filtering out a large amount of data compared with a global search. If probing one cluster center is not accurate enough, several more can be probed: the more centers probed, the more accurate the result.
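A minimal IVF sketch under these assumptions: use k-means to find cluster centers, assign each vector to its nearest center to form the inverted lists, and at query time probe only the `nprobe` closest centers before scanning their lists exhaustively. numpy and scikit-learn are used purely for illustration; this is not the engine's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.random((10_000, 128)).astype(np.float32)   # toy vectors

# Training: extract cluster centers; each center acts as a vector's "keyword".
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(data)
inverted_lists = {c: np.where(kmeans.labels_ == c)[0] for c in range(100)}

def ivf_search(query: np.ndarray, k: int = 10, nprobe: int = 5) -> np.ndarray:
    # Hit the nearest cluster centers first (the more probed, the more accurate).
    center_dist = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probed = np.argsort(center_dist)[:nprobe]
    # Then brute-force only the vectors filed under those centers.
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    dist = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dist)[:k]]

top_ids = ivf_search(data[0])
```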


For vector indexes such as IVF, the BES workflow is to write the training data first and then call an API to train the model; in this step the training data is clustered and the center of each cluster is computed. Once the model is trained, a new index is created for the actual data, and the vector index is built from the trained model. The concrete vector index construction and merging mechanism is the same as described earlier.


Many scenarios require filtering the data by scalar conditions before performing vector retrieval, for example labeling vector data so that the retrieved vectors are guaranteed to match the labels.

To support such needs there are two approaches: post-filter and pre-filter. Post-filter performs the ANN retrieval first and then applies the filter to the retrieved results, but this may leave the final result set significantly smaller than K. Pre-filter filters the data first and then performs nearest-neighbor retrieval on what remains, which generally guarantees K results.

Therefore, we modified the original HNSW implementation so that, while traversing the graph to select nearest neighbors, the algorithm only considers vectors that satisfy the filter conditions. Concretely, the filter is first executed against ES's scalar data index to obtain the id list of matching documents; a bitmap is then constructed from that id list and passed to the vector engine through JNI, and during HNSW retrieval the nearby vertices are checked against the bitmap to produce a list of vectors that satisfy the filter.
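The idea can be illustrated by extending the earlier greedy-search sketch with an allow-set: the scalar filter is evaluated first and turned into a set (standing in for the bitmap) of permitted document ids, and the traversal only lets vertices in that set enter the result list. This is a simplified single-layer illustration in Python, not the engine's actual C++ implementation.

```python
import heapq
import math
from typing import Dict, List, Sequence, Set

def l2(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filtered_knn(graph: Dict[int, List[int]],
                 vectors: Dict[int, Sequence[float]],
                 query: Sequence[float],
                 entry: int,
                 allowed: Set[int],          # ids that passed the scalar filter (bitmap)
                 k: int = 10) -> List[int]:
    """Best-first graph traversal; only vertices in `allowed` may enter the results."""
    visited = {entry}
    candidates = [(l2(vectors[entry], query), entry)]   # min-heap of nodes to expand
    results: List[tuple] = []                           # max-heap (negated dist), size <= k
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(results) == k and dist > -results[0][0]:
            break            # nearest unexplored node is worse than our kth result: stop
        if node in allowed:  # the bitmap check: only filtered-in ids become results
            heapq.heappush(results, (-dist, node))
            if len(results) > k:
                heapq.heappop(results)
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (l2(vectors[nb], query), nb))
    return [n for _, n in sorted(results, reverse=True)]
```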


In actual testing, however, we found that performance and recall were not ideal. From our test data and from published research, we found that as the filter ratio rises, the filtered-out vertices reduce the number of connected paths, which directly slows the convergence of the HNSW algorithm and makes it easy to hit dead ends, resulting in a low recall rate. The data show that once the filter ratio exceeds about 90%, performance and recall drop sharply.

Therefore, we chose to rewrite the execution plan and combine scalar filtering with brute-force retrieval: when the filter ratio exceeds about 90%, brute-force retrieval over the filtered data achieves satisfactory results. SIMD instructions are also used to accelerate the brute-force scan.
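When the filter removes most of the data, brute-force scanning the survivors is straightforward. The numpy version below vectorizes the distance computation, playing roughly the role that SIMD instructions play in the C++ engine; the shapes and the filter mask are illustrative.

```python
import numpy as np

def brute_force_filtered(vectors: np.ndarray,       # (n, d) full vector data
                         allowed_mask: np.ndarray,   # (n,) bool, True = passed scalar filter
                         query: np.ndarray,          # (d,)
                         k: int = 10) -> np.ndarray:
    """Exact top-k over only the vectors that passed the filter."""
    allowed_ids = np.flatnonzero(allowed_mask)
    # Vectorized L2 distances over the (small) filtered subset, analogous to
    # the SIMD-accelerated scan in the C++ engine.
    dists = np.linalg.norm(vectors[allowed_ids] - query, axis=1)
    top = np.argsort(dists)[:k]
    return allowed_ids[top]

# Example: with a 95% filter ratio, only about 5% of vectors are scanned.
rng = np.random.default_rng(0)
vecs = rng.random((100_000, 128)).astype(np.float32)
mask = rng.random(100_000) < 0.05
ids = brute_force_filtered(vecs, mask, vecs[0])
```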


BES’s subsequent development plan mainly focuses on the following aspects.

The first is ease of use. Because the vector database mainly serves large model application developers, we hope to provide an out-of-the-box product experience and lower the threshold for understanding and using it. Currently, using BES's self-developed vector engine requires users to be familiar with ES, for example being able to express query logic in ES's DSL. Providing a more general and easy-to-use method, such as supporting kNN retrieval through SQL, would be friendlier to users.

The second is functionality: more index algorithms and similarity metrics need to be supported, such as DiskANN and Baidu's self-developed Puck & Tinker algorithms, to cover a wider variety of needs and scenarios. We are also considering heterogeneous computing capabilities to improve index construction and retrieval efficiency.

In terms of performance and cost, large-scale application scenarios require deeper optimization to reduce system overhead and improve resource usage efficiency. For example, ES runs on the JVM while the vector engine is developed in C++ and invoked through JNI, so managing global memory resources more flexibly, adapting to different customer workloads, and maximizing resource utilization will be one of our main optimization directions.

Finally, BES is currently offered as a managed cluster: users need to estimate cluster resources based on their own business volume and choose a reasonable package, which imposes a certain usage cost. Making resources more dynamic and elastic, so as to lower the barrier to entry and reduce costs for users, is also one of our optimization directions.


The Puck & Tinker algorithms just mentioned are Baidu's fully self-developed vector retrieval algorithms, which have been verified at large scale in the search business.

In the first international vector retrieval competition, BigANN, Puck ranked first on all four datasets it entered. On the BIGANN-10M dataset, at a similar recall rate, Puck delivers 1.7 times the performance of HNSW while consuming only about 21% of HNSW's memory. Puck is now open source; you are welcome to follow it at https://github.com/baidu/puck/tree/main/docs.


3. Case sharing

3.1 Multimodal data retrieval

The following introduces practical application cases of BES.

The first is the application of vector capabilities in multimodal retrieval scenarios on a video website. Concretely, videos are split into frames, image feature processing and temporal modeling are performed on the frames, and the frame sequences are converted into vectors by neural networks and written into BES to build a material vector library. Vector retrieval in BES then recalls the most similar results and feeds them to upstream business services, supporting scenarios such as video tagging, short-video production, and personalized recommendation.


3.2 Qianfan large model platform

The knowledge base of the Qianfan large model platform is a product designed specifically for knowledge question answering with large language models; it manages the knowledge uploaded by customers and provides fast query and retrieval.

Built on Baidu Intelligent Cloud BES, it lets users store and retrieve large numbers of knowledge base documents efficiently, manage corporate private-domain knowledge quickly, and build knowledge question answering applications, while ensuring the privacy and security of customer data.

There are two ways to use it. The first is to deploy a standalone large model knowledge base and build knowledge retrieval applications in one's own domain; the second is to bind it directly as a plug-in application on the Qianfan platform, supporting three types of applications: question answering, generation, and tasks.


The above is all the content shared.


——— END ———

Recommended reading

Large-scale practice of Wenshengtu: Revealing the story behind Baidu’s search for AIGC painting tools!

Application practice of large models in the field of code defect detection

Support OC code reconstruction practice through Python scripts (2): Data items provide code generation for module access data paths

Talk to InfoQ about Baidu’s open source high-performance search engine Puck

A brief discussion on search presentation layer scenario technology-tanGo practice
