Technology and Practice of Vector Retrieval in Large Model Application Scenarios

1. Introduction to Vector Retrieval Applications

A vector is a point in a multidimensional mathematical space: a string of numbers, one coordinate per dimension. Such a point is the projection into mathematical space of a digitized real-world object. Different points then stand in a mathematical relationship, distance, and that distance represents the similarity of the two objects.
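For intuition, here is a minimal sketch in plain Python of measuring that similarity; the three-dimensional "cat" and "car" vectors are invented purely for illustration, since real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two "cat"-like vectors point in similar directions; a "car" vector does not.
cat1 = [0.9, 0.1, 0.0]
cat2 = [0.8, 0.2, 0.1]
car  = [0.1, 0.0, 0.9]

print(cosine_similarity(cat1, cat2) > cosine_similarity(cat1, car))  # True
```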

picture

The process of converting unstructured data into vectors is called embedding. Through deep learning training, the discrete features of digitized real-world objects can be extracted and projected into a mathematical space as vectors, while, almost miraculously, preserving semantic similarity as the distance between vectors. This is the effect of embedding.

picture

Vector retrieval was already a mature technology before large language models emerged (before 2020). Together with deep learning, it has been widely used in traditional AI applications such as image, audio, and video search and recommendation, face recognition, and speech recognition.

picture

The emergence of large models has changed the way humans and computers interact and brought a new revolution in artificial intelligence. It became popular almost overnight and ushered in the era of large models. Of course, the technology is still in its infancy, and many problems remain in practical applications.

The first is that their knowledge capability is not strong enough. A large model's memory is limited by its parameter count, say 6 billion or 13 billion parameters; like human brain cells, it can only remember a limited amount.

Second, training a large model takes a long time and costs a great deal, so it cannot keep up with real-time events. For example, ChatGPT only knows what happened before 2021, because it was trained only on data from before 2021. In addition, large models have hallucination problems: their answers can distort the facts.

Beyond knowledge, large models also struggle to guarantee the security of private data. For example, if I give a model some of my private data, that information may well be spoken out as an answer when the model serves other users. Finally, large-model inference is also expensive; answering a single question carries a high cost.

So how can we enhance the knowledge capability of large models while also protecting private data? Consider the example on the right side of the figure below: by attaching some information from the Meteorological Bureau to the question, the large model can answer accurately based on that additional information.

This technical approach of enhancing large models with external data and tools is called prompt engineering.

picture

From the discussion above, it is clear that large-model applications are inseparable from prompt engineering. How is prompt engineering done? Essentially, you organize a corpus for the large model, and at query time first find the best-matching content in that massive corpus, then splice it into the prompt to enhance the answer. This is essentially a search engine.
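The retrieve-then-splice flow above can be sketched in a few lines. Everything here is invented for illustration: the corpus, the word-overlap stand-in for vector similarity, and the prompt template:

```python
# A toy sketch of the "search engine" flow: retrieve the best-matching
# passage from a corpus, then splice it into the prompt.

def score(query, doc):
    # Stand-in for vector similarity: count shared words.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

corpus = [
    "The weather bureau forecasts rain in Beijing tomorrow.",
    "ElasticSearch supports bulk indexing of documents.",
]

def build_prompt(question):
    best = max(corpus, key=lambda doc: score(question, doc))
    return f"Context: {best}\nQuestion: {question}\nAnswer based on the context."

print(build_prompt("Will it rain in Beijing tomorrow?"))
```

In a real system the scoring function is replaced by embedding the question and running a vector search over embedded documents.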

Vector retrieval technology fits this scenario perfectly. Today everything can be embedded; vectors are the universal form in which AI understands the world, and vector retrieval engineering connects AI with these materials, providing it with knowledge and memory.

The following two pictures show how vector retrieval technology supports prompt engineering.

Therefore, at this stage, large-model applications are inseparable from vector retrieval technology, and a vector database will be a required component of every large-model application, just as relational databases are for web applications.

picture

2. Overview of Vector Retrieval Technology

The core of vector retrieval engineering is of course the retrieval algorithm itself: how to find, among a massive set of vectors, the K most similar to a target vector, also called top-K search.

The simplest approach is of course brute force: compute the distance between the target vector and every vector in the set, sort, and take the first K. The obvious problem is that the cost grows linearly with data volume until it becomes unacceptable. For example, as my data accumulates over time, a query that takes 10 milliseconds today may take 100 milliseconds tomorrow, one second the day after, and minutes within a week. Since vector retrieval speed directly affects how quickly a large model can answer, this is hard to accept.
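The brute-force baseline is short enough to write out in full; the four toy vectors are illustrative only:

```python
import math

def brute_force_topk(query, vectors, k):
    """Exact top-K by Euclidean distance: O(N * D) work per query."""
    dists = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    dists.sort()
    return [i for _, i in dists[:k]]

vectors = [[0, 0], [1, 1], [5, 5], [1, 0]]
print(brute_force_topk([0.9, 0.9], vectors, 2))  # [1, 3]
```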

How to solve this? The simplest solution is distributed computing. For example, split 100 million vectors into ten parts and compute them in parallel on ten machines, which is ten times faster than using one machine. Even as the data grows, I just add more machines, and the result is guaranteed to be exact. Of course, the computing cost is high.
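The scatter-gather step that makes this exact can be sketched as follows: each "machine" computes a local top-K over its shard, and a coordinator merges the partial lists. The shard contents are illustrative:

```python
import math

def local_topk(query, shard, k):
    # What each machine computes over its own shard.
    scored = sorted((math.dist(query, v), v) for v in shard)
    return scored[:k]

def distributed_topk(query, shards, k):
    # Coordinator: gather each shard's top-K, merge, take the global top-K.
    partial = [p for shard in shards for p in local_topk(query, shard, k)]
    return [v for _, v in sorted(partial)[:k]]

shards = [[[0, 0], [5, 5]], [[1, 1], [9, 9]]]
print(distributed_topk([1, 0], shards, 2))  # [[0, 0], [1, 1]]
```

Because every shard returns its full local top-K, the merged answer equals the single-machine brute-force result.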

Therefore, the industry explored whether giving up the globally optimal result for a locally optimal one could save computation, which led to approximate nearest neighbor (ANN) algorithms. There are four families in the industry: hashing, tree search, inverted indexes, and graph search. Hashing, tree search, and inverted indexes are similar: they all pre-classify the data by some method, partitioning the space and placing data in order so as to reduce the amount of computation. Graph search is a relatively new idea.

We evaluate an algorithm mainly on two metrics. One is performance: query latency and the sustainable QPS. The other is recall, which measures query accuracy: we compare the results returned by the approximate algorithm against the exact global result set, and the degree of overlap is the recall. Brute force has 100% recall, while approximate algorithms vary; some can exceed 99%.
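The recall computation is simple enough to state precisely; the ID lists below are made up for illustration:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-K that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

exact  = [3, 7, 12, 25]   # ground-truth top-4 from brute force
approx = [3, 7, 25, 40]   # what an ANN index returned
print(recall_at_k(approx, exact))  # 0.75
```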

Next, I will introduce the specific algorithm ideas.

picture

IVF is an inverted file index. The inverted index is one of the core technologies of search engines: keywords are extracted from document pages to build an inverted lookup structure. So what is the "keyword" of a vector? Real-world vectors are generally clustered in their spatial distribution; see the figure below.

Cluster centers are extracted with the k-means algorithm, and the cluster center a vector belongs to serves as its keyword. Building an inverted index on this, a query can first hit a cluster center, just like a search engine, and then brute-force search only the vectors under that center, filtering out a large share of the data compared with a global search. If probing one cluster center is not accurate enough, you can probe several more; the more you probe, the more accurate the result.
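A toy IVF index makes the idea concrete. Real systems train the centroids with k-means; here the two cluster centers and four vectors are hard-coded to keep the sketch short:

```python
import math
from collections import defaultdict

centroids = [[0.0, 0.0], [10.0, 10.0]]  # would come from k-means training

def nearest_centroids(v, nprobe=1):
    order = sorted(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
    return order[:nprobe]

# Build the inverted lists: centroid id -> ids of vectors assigned to it.
vectors = [[0, 1], [1, 0], [9, 9], [10, 11]]
inverted = defaultdict(list)
for i, v in enumerate(vectors):
    inverted[nearest_centroids(v)[0]].append(i)

def ivf_search(query, k, nprobe=1):
    # Probe only nprobe lists, then brute-force within them.
    candidates = [i for c in nearest_centroids(query, nprobe) for i in inverted[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]

print(ivf_search([9.5, 9.5], k=1))  # [2]
```

With `nprobe=1` only half the data is scanned here; raising `nprobe` trades speed for recall, exactly as described above.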

picture

Can the clustering idea be optimized further? The answer is yes, using product quantization (PQ): a D-dimensional floating-point vector can be compressed into a short code of M bytes, which shrinks not only storage space but also memory usage and computation.

The idea is to split the D-dimensional vector into M sub-vectors, train cluster centers for each sub-space, and encode each sub-vector as the ID of its nearest center, so that a run of floating-point numbers is encoded as a single small integer, achieving compression. In the example, D=128 and M=8, which works out to 64x compression.
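Checking that arithmetic, assuming float32 storage and one byte (256 centroids) per sub-vector, which is the usual PQ configuration:

```python
# A 128-dimensional float32 vector vs. an 8-sub-vector PQ code.
D, M = 128, 8
float_bytes = D * 4   # 512 bytes of raw float32 storage
pq_bytes = M * 1      # 8 bytes: one centroid id (0-255) per sub-vector
print(float_bytes // pq_bytes)  # 64
```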

picture

The graph algorithm is a relatively new approach to approximate vector retrieval. It is based on the small-world theory, which says any two people in the world can be connected through six hops. If each vector is regarded as a person, a "small-world network" resembling the real one is constructed from the distance relationships between vectors, and a greedy algorithm follows the edges, hop by hop, toward the target vector.

picture

As points are continuously inserted into the small world, with edges built to nearby points (acquaintances) as described above, newly inserted points become increasingly confined to a small circle, and reaching them takes many hops. So how to break out of the circle? The key is to connect newly inserted points to distant points through some "chance encounters".

The industry's answer is the HNSW algorithm, which borrows the skip-list idea from linked-list search. A linked list is one-dimensional and a graph is two-dimensional: build graphs at multiple levels, decreasing the number of vertices exponentially at each level to form ever sparser graphs; the sparser the graph, the longer the edges that connect its vertices.
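The basic move that HNSW repeats on every layer is a greedy walk: hop to whichever neighbor is closer to the query, and stop when no neighbor improves. A toy sketch on a hand-made neighbor graph (the vectors and edges are invented for illustration, not produced by HNSW's insertion rules):

```python
import math

vectors = {0: [0, 0], 1: [2, 0], 2: [4, 0], 3: [6, 0], 4: [6, 2]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def greedy_search(query, entry=0):
    """Walk the graph greedily toward the query; return the local optimum."""
    current = entry
    while True:
        best = min(graph[current] + [current],
                   key=lambda n: math.dist(query, vectors[n]))
        if best == current:
            return current
        current = best

print(greedy_search([6, 1.5]))  # 4
```

HNSW runs this walk on the sparse upper layers first to cover long distances in few hops, then descends to denser layers to refine the answer.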

picture

For production use, vector retrieval must be combined with a distributed system design. These topics are very mature in the engineering of distributed data storage systems, and vector data is nothing special at the storage level, so it suffices to implement vector capabilities on top of a storage system.

picture

Therefore, the architecture of a vector retrieval system is a vector data storage system plus vector retrieval capability, with additional storage and filtering for scalar fields. For example, I can attach string tags to vector data for filtering in specific scenarios.

Considering that vector storage volume is not particularly large (even tens of billions of vectors do not add up to a very large scale in storage terms), the main concern is the scalability of compute, pursuing compactness and extreme performance. Hence the architecture shown on the right of the figure below is adopted, with a focus on the columnar data storage format of a single node.

As data scales grow, to expand compute more conveniently and elastically, the technical trend here will inevitably move toward cloud-native, storage-compute-separated designs that can quickly spin up additional compute replicas to absorb traffic peaks.

picture

3. Vector Retrieval Engineering Practice

Baidu's large-model scenarios resemble common industry scenarios, mainly using vector retrieval for knowledge enhancement and prompt engineering. In these scenarios, vector retrieval faces practical, large-scale engineering challenges.

picture

Baidu Smart Cloud integrated vector retrieval capabilities into ElasticSearch as early as 2020, and they have been widely used across Baidu's own business lines, backed by ample engineering practice and cloud-native operations guarantees on public cloud resources. Today, large-model scenarios call for technical optimization against new challenges.

picture

Baidu Smart Cloud was the first cloud vendor in China to provide a hosted ES service, and after years of development it continues to iterate on and enhance the product's capabilities.

picture

Using the vector retrieval function in ES is very simple; here is a demonstration. First create a table in the standard ES way, specifying the vector-index-related parameters; import data in batches with the standard ES bulk API; then query with a syntax we defined that stays close to the ES style.
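The demo itself is in the slide below. As a rough stand-in, the equivalent flow in open-source ElasticSearch 8.x syntax looks like this (`dense_vector` mapping plus a `knn` query); BES's actual vector syntax differs, so treat the index name, field names, and parameters here as illustrative only:

```json
PUT my_index
{ "mappings": { "properties": {
    "vec": { "type": "dense_vector", "dims": 3, "index": true, "similarity": "cosine" },
    "tag": { "type": "keyword" }
} } }

POST my_index/_bulk
{ "index": { "_id": "1" } }
{ "vec": [0.1, 0.2, 0.3], "tag": "news" }

POST my_index/_search
{ "knn": { "field": "vec", "query_vector": [0.1, 0.2, 0.3], "k": 5, "num_candidates": 50 } }
```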

picture

The architecture of Baidu ElasticSearch (BES) consists of two parts: the control platform and BES cluster instances. The control platform handles global cluster management, monitoring and alerting, and scheduled tasks such as periodically creating, deleting, and merging indexes. A BES cluster instance is an ElasticSearch cluster built on BBC cloud servers and CDS cloud disks, with node load balancing through a BLB layer-4 proxy in front. Data on disk can be periodically tiered down to BOS object storage by policy to reduce storage cost.

Vector data is organized and accessed as shown in the figure below under the standard architecture of ES.

picture

We chose to develop our own plug-in. First, we wanted the ultimate performance of a C++ implementation closer to the bottom layer. Second, it is convenient to rewrite the storage format for more extreme performance. Third, we can control the retrieval logic more flexibly and rewrite execution plans to implement more complex queries.

For the core vector retrieval engine, we chose to build on the excellent open-source vector libraries in the community. We compared the out-of-the-box performance of nmslib and faiss on ES (without optimization): HNSW consumes more memory overall, and nmslib's implementation performs better, so we optimized further on that basis.

picture

During HNSW graph construction, every inserted point requires its own search-and-compute pass, so inserting a large number of points carries a heavy computational overhead. Data import therefore becomes very slow and blocks the foreground.

We therefore changed this into an asynchronous index-building mechanism: a write returns as soon as the data is on disk. The background then triggers HNSW index construction, via ES's merge policy, on a schedule, or at the user's explicit request, using an independent thread pool so that foreground query requests are unaffected.
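The decoupling can be sketched with a queue and a worker thread; everything here (the log list, the set standing in for the HNSW graph) is invented to illustrate the mechanism, not BES's actual implementation:

```python
import queue
import threading

write_log = []           # data counts as durable once it lands here
index = set()            # stand-in for the HNSW graph
build_queue = queue.Queue()

def writer(item):
    write_log.append(item)   # fast path: acknowledge the write immediately
    build_queue.put(item)    # indexing happens later, off the hot path

def build_worker():
    while True:
        item = build_queue.get()
        if item is None:     # sentinel: shut down
            break
        index.add(item)      # stand-in for the expensive graph insertion

t = threading.Thread(target=build_worker)
t.start()
for i in range(100):
    writer(i)
build_queue.put(None)
t.join()
print(len(write_log), len(index))  # 100 100
```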

We also optimized ES's segment merge policy so that graph construction happens once, in the final merge, avoiding the intermediate cost of rebuilding the graph for every incremental merge.

picture

Many scenarios require filtering by scalar conditions before vector retrieval, for example tagging vector data and guaranteeing that retrieved vectors match the tags. The standard HNSW implementation cannot do this: if you run HNSW first and filter afterward, the result set cannot be guaranteed complete. We therefore modified the HNSW implementation so that the graph traversal only considers vertices that satisfy the filter conditions when selecting neighbors.

In actual testing, however, performance and recall were not ideal. From test data and study of the literature, we found that as the filtering ratio rises, filtered-out vertices reduce the number of connected paths, which directly slows the convergence of the HNSW algorithm and easily leads it into dead ends, lowering recall. The data show that once the filtering ratio exceeds about 90%, performance and recall drop sharply.

Therefore, we chose to rewrite the execution plan to combine scalar filtering with brute-force retrieval, achieving satisfactory results.
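The shape of such a plan can be sketched as follows; the 10% selectivity threshold, the tag scheme, and the data are invented for illustration, not BES's actual parameters:

```python
import math

def filtered_search(query, vectors, tags, wanted_tag, k, graph_search=None):
    """Pick a strategy based on how selective the scalar filter is."""
    allowed = [i for i, t in enumerate(tags) if t == wanted_tag]
    keep_ratio = len(allowed) / len(vectors)
    if graph_search is None or keep_ratio < 0.1:
        # Highly selective filter: brute force over the survivors is cheap
        # and guarantees complete, exact results.
        return sorted(allowed, key=lambda i: math.dist(query, vectors[i]))[:k]
    # Otherwise, use the filter-aware graph search (not shown here).
    return graph_search(query, allowed, k)

vectors = [[0, 0], [1, 1], [2, 2], [3, 3]]
tags = ["a", "b", "a", "b"]
print(filtered_search([0, 0], vectors, tags, "b", k=1))  # [1]
```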

picture

4. Summary and Prospects

Let's look at a picture that has circulated for a long time. Just as humans evolved from apes to Homo erectus, semi-structured and unstructured data storage systems have evolved from the simplest key-value stores through ever more complex structures, ending up looking more and more like traditional databases.

picture

With the popularity of large models, vector databases have emerged one after another in the industry: dedicated vector databases, both commercial and open source, from startup teams, while some traditional open-source storage systems have also added vector capabilities.

picture

Drawing on its own business scenarios, Baidu developed the Puck/Tinker vector retrieval algorithms, which hold certain advantages over the industry's open-source vector libraries and won first place in the BigANN competition.

Baidu Smart Cloud will combine Baidu's self-developed algorithms to launch a dedicated vector database, better supporting large-model businesses and related applications on the cloud. Stay tuned!

picture

The above is all the content shared today.

——END——
