An open-source tool that is more of a vector database than vector databases themselves!

ChatGPT has made large models popular, and vector databases along with them. Training a large model is expensive, and teaching it new knowledge takes too long. A vector database can serve as the "memory" module of a large model: it finds old questions similar to a new one and hands them to the model for processing, which greatly expands the model's application scope. This is why vector databases have become popular in the current wave.

In fact, vector databases have long been used in traditional AI and machine-learning scenarios such as face recognition, image search, and speech recognition.

The main task of a vector database is to find, among massive vector data, the vectors most similar to a target vector. Logically this is not difficult; it is just a TopN query, and such a query is easy to write in the SQL of a traditional relational database.
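To make the TopN idea concrete, here is a brute-force sketch in Python with NumPy (an illustrative example, not from the original article): compute the distance from every stored vector to the target, then keep the nearest n.

```python
import numpy as np

def top_n_similar(vectors, target, n=3):
    """Brute-force TopN: distance from every stored vector to the target,
    then the indices of the n nearest ones."""
    dists = np.linalg.norm(vectors - target, axis=1)  # Euclidean distance per row
    idx = np.argsort(dists)[:n]                       # indices of the n smallest
    return idx, dists[idx]

# A tiny demo: 5 stored 3-D vectors and one query vector
vectors = np.array([[0.0, 0, 0], [1, 1, 1], [2, 2, 2], [0, 0, 1], [5, 5, 5]])
target = np.array([0.0, 0, 0])
idx, d = top_n_similar(vectors, target, n=2)
print(idx.tolist())  # → [0, 3]
```

This full scan is exactly the "hard traversal" discussed below: it touches every row, which is why it does not scale without an index.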

However, no ordering can be defined over high-dimensional vectors, so indexes cannot be built in advance the way they are for scalar data. A traditional database can only do a hard, brute-force traversal, which requires a huge amount of computation and performs poorly. Vector databases optimize specifically for this, providing efficient methods for finding similar vectors and greatly improving performance.

On the surface, then, a vector database looks like a special-purpose database: load vector data into it and vector search just works. After all, that is how we use relational databases; we put structured data in, query it with SQL, and usually do not need to care how the SQL is parsed and optimized.

However, vector databases are not that simple, and implementing vector search with them is a complex and highly case-specific process.

1. Vector databases usually do not offer a simple, easy-to-use query syntax like SQL in relational databases; they only expose APIs over basic algorithms. You have to call these APIs from a general-purpose programming language (C/C++, Java, or preferably Python) to complete the search tasks yourself. Of these languages, only Python has mature, popular vector data types, so developing vector-related computations in the others is inconvenient.

Even Python, despite having vector data types, is limited in its big-data processing and parallel capabilities, so vector computation still runs into many difficulties. Relational databases do not have this problem: you query the data directly with SQL and usually do not need to rely on the computing power of a host language.

2. Vector computation requires case-specific data preprocessing; without it, the subsequent operations are meaningless. For example, floating-point vectors need normalization, dimensionality reduction, transformation, etc.; binary vectors need data conversion, dimension sorting and selection, etc.

Even for data in the same format, the preprocessing may differ. Text vectors may only need ordinary dimensionality reduction, while image data may also need convolution. These preprocessing techniques are numerous, carry many parameters that need tuning, and are usually tied to the specific data; they are highly case-specific.
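As a rough illustration of two of the preprocessing steps mentioned above, here is a minimal sketch in Python/NumPy (the article's typical host language; this is an assumption-laden toy, not a production pipeline): L2 normalization and a bare-bones PCA.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row vector to unit length, a common float-vector
    preprocessing step before cosine-based comparison."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms

def pca_reduce(X, k):
    """Minimal PCA: center the data, then project onto the top-k
    principal axes obtained from the SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

X = np.array([[1.0, 2, 3, 4], [2, 3, 1, 2], [1, 1, 1, -1], [1, 0, -2, -6]])
Xn = l2_normalize(X)   # every row now has norm 1
Xr = pca_reduce(X, 2)  # 4-D data reduced to 2-D
print(Xn.shape, Xr.shape)
```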

General vector databases do not provide such preprocessing methods, or provide only a few fixed ones that cannot meet individual needs. As a result, this work has to be completed in a peripheral programming language or by introducing third-party technology. By contrast, working with structured data in a relational database is much simpler: preprocessing is usually unnecessary, and when it is needed it amounts to little more than encoding or data-type conversion, which SQL itself can handle.

3. You need to choose an indexing algorithm, and even choose the vector database around it. The key to finding vectors efficiently is an index adapted to the characteristics of the data distribution. But there are many index algorithms, such as k-means clustering, IVF-PQ, HNSW and LSH, plus library toolkits such as Faiss, and just understanding them requires considerable mathematical knowledge. It is usually also necessary to tune algorithm parameters repeatedly (such as k in k-means; a developer who does not understand the data, or lacks experience, will find it almost impossible to set a suitable k) to achieve high search efficiency together with high accuracy.

There is even worse news. A vector database usually provides only a few indexing algorithms, and none of them is one-size-fits-all: the right choice depends on the data distribution. If you pick a vector database casually and then, after tuning algorithm parameters for a long time, discover that its indexes simply do not suit your data (the search efficiency or accuracy cannot meet the requirements), switching databases at that point is likely to be a disaster. The figure below shows the index creation methods of several top-ranked vector databases:

[Figure: index creation methods of several top-ranked vector databases]
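To make the role of the parameter k concrete, here is a bare-bones k-means sketch in NumPy (a hypothetical toy with deterministic initialization; real systems use tuned, randomized implementations):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Bare-bones k-means; how useful the resulting partition is
    depends heavily on choosing a suitable k for the data."""
    # deterministic, spread-out initialization, chosen for this sketch
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# two well-separated blobs; k=2 recovers them, a badly chosen k would not
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
labels, centers = kmeans(X, k=2)
print(labels.tolist())  # → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

An IVF-style index built on such clusters only pays off when each query can skip most clusters, which is exactly why k must fit the data.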

4. There is no settled conclusion on how to evaluate similarity. A vector database can only offer a few common measures. For example, Pinecone, currently the most popular vector database, provides only Euclidean distance, cosine similarity and dot-product similarity. But the right similarity measure has to be defined from the application scenario and the data characteristics.

Other common measures include the Pearson correlation coefficient, XOR distance, etc. Each measure suits different scenarios, and the corresponding index creation method may differ as well. For example, floating-point vectors are usually compared with Euclidean distance or cosine similarity and indexed with HNSW, while binary vectors may need XOR distance for similarity and IVF for indexing.

The choice of similarity measure directly affects both the efficiency and the accuracy of vector search, so you usually have to combine and re-tune index creation methods and similarity measures repeatedly until the desired query effect is reached (much as when creating an index, and sometimes you may even need to try another vector database).
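For reference, the measures discussed above are all one-liners in NumPy (an illustrative sketch):

```python
import numpy as np

a = np.array([1.0, 2, 3])
b = np.array([2.0, 3, 4])

euclidean = float(np.linalg.norm(a - b))                # Euclidean distance
dot = float(a @ b)                                      # dot-product similarity
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity

# XOR (Hamming) distance for binary vectors: count differing positions
u = np.array([1, 0, 1, 1], dtype=bool)
v = np.array([1, 1, 0, 1], dtype=bool)
hamming = int(np.count_nonzero(u ^ v))  # → 2
print(euclidean, dot, cosine, hamming)
```

The hard part, as the text notes, is not computing a measure but deciding which one actually matches the data and the application.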

When we run performance tests on a relational database, we usually just generate enough random data, considering at most the data types and value ranges of the fields. That does not work for vector computation. The space that high-dimensional vectors live in is too vast, and completely random data shows almost no clustering. If you build an index on such data with a clustering method, no cluster will contain many members, so the index can hardly reduce the amount of traversal, and search efficiency cannot improve.

Real vector data (such as fingerprint or face data) usually clusters at certain locations in high-dimensional space, which makes it possible to build effective indexes that ensure both the efficiency and the accuracy of similar-vector queries. Such tests can only be completed once you have the actual data, which again illustrates, from another angle, how case-specific vector computation is.
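The claim that purely random high-dimensional data barely clusters can be illustrated with a quick (hypothetical) experiment: pairwise distances between random vectors concentrate tightly around one value, leaving a clustering algorithm nothing to grab onto.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000                      # dimensionality
X = rng.random((200, d))      # 200 uniformly random d-dimensional vectors

# distances from the first vector to all the others
dists = np.linalg.norm(X[1:] - X[0], axis=1)

# relative spread of those distances: small means "everything is
# roughly equally far away", so clusters are meaningless
spread = float(dists.std() / dists.mean())
print(round(spread, 3))
```

With real, naturally clustered data this ratio is much larger, which is what makes cluster-based indexes effective.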

Seen this way, a vector database is not a ready-to-use product the way a relational database is. Completing a task with it involves so much case-specific work and experimentation that it feels more like a programming exercise. A vector database is thus closer to an implementation project than a directly usable product.

Since you are programming against a library of basic algorithms anyway, you might as well just pick a programming language that is easy to develop in and has the relevant algorithm libraries; there is little need to purchase a dedicated vector database.

SPL is such a programming language.

Compared with the host languages commonly used with vector databases (C++, Python, etc.), SPL has obvious advantages in program logic, code writing, and data storage. As long as its library of basic algorithms is rich enough, it is more convenient for implementing these vector search tasks.

1. Vector calculation and preprocessing

SPL itself provides rich and flexible vector computation and preprocessing methods, so operations between vectors can be performed quickly.

For example, the product of two matrices:

A=[[1,2],[2,3],[3,4]]

B=[[1],[2]]

M=mul(A,B)

Euclidean distance of vectors:

A1=10.(row())

A2=10.(rand())

D=dis(A1,A2)

Cosine similarity after vector normalization:

C=(A1**A2).sum()

PCA dimensionality reduction:

A=[[1,2,3,4],[2,3,1,2],[1,1,1,-1],[1,0,-2,-6]]

A'=pca(A,2)
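For readers more familiar with Python, roughly equivalent operations look like this in NumPy (a sketch; A1 and A2 below are stand-ins for the SPL sequences 10.(row()) and 10.(rand()), and the exact values differ):

```python
import numpy as np

# matrix product, as in M = mul(A, B)
A = np.array([[1, 2], [2, 3], [3, 4]])
B = np.array([[1], [2]])
M = A @ B  # → [[5], [8], [11]]

# Euclidean distance, as in D = dis(A1, A2)
A1 = np.arange(1, 11, dtype=float)         # stand-in vector
A2 = np.random.default_rng(0).random(10)   # stand-in random vector
D = float(np.linalg.norm(A1 - A2))

# cosine similarity of normalized vectors, as in C = (A1**A2).sum()
n1 = A1 / np.linalg.norm(A1)
n2 = A2 / np.linalg.norm(A2)
C = float(n1 @ n2)
print(M.tolist(), round(C, 3))
```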

2. Program logic matching the vector data type

The basic algorithms of current vector databases usually have to be driven from host programming languages (C++, Python, etc.), which results in low development efficiency. SPL has complete program logic of its own, integrated with vector data types and basic algorithms, so development is more efficient. C++ and Java, by contrast, have no vector data type in the strict sense and only provide class libraries of basic algorithms, which makes these languages awkward for vector computation.

3. Convenient development and debugging

SPL is a grid-style programming language, which keeps code neat and intuitive. The result of every cell's calculation is saved and can be viewed in the result panel on the right side of the IDE; a programmer can click any cell to inspect that step's result in real time, so whether a calculation is correct is clear at a glance (no need for the usual sprinkling of print statements). SPL also has very convenient debugging functions, including run, debug, run to cursor, and step over, plus setting breakpoints and computing the current cell, which fully cover the needs of editing and debugging programs. The picture below is the SPL IDE:

[Figure: the SPL IDE]

The red box marks SPL's debugging buttons, the blue box is the result panel, and the panel is showing the result data of A3.

4. Beyond vector calculations

Searching for similar vectors is rarely the whole job; it usually comes with other work, such as joining other structured data or processing text data. SPL has its own mature approach here, which is both simple and efficient; see the relevant documents on the Qian Academy site (https://c.raqsoft.com.cn/). General vector databases lack these capabilities and still need the host language, sometimes even the cooperation of a relational database, to complete such tasks, which is troublesome.

5. Storage and high performance

SPL has complete storage capabilities and can organize storage to favor efficient queries for each specific dataset. It provides a specialized binary file format with compression, columnar storage, ordering, parallel segmentation and other mechanisms that safeguard computing performance. These binary files are very flexible: storage can be designed around whatever algorithm is used, combining the advantages of file storage itself with algorithm-specific adjustment, so achieving high performance is no surprise.

Speaking of high performance, the language you use must make parallelism convenient. That is hard for Python: because of the Global Interpreter Lock (GIL), Python's multithreading is effectively pseudo-parallel, and when parallelism is truly needed you can usually only run multiple processes, whose overhead is far higher than that of multithreading. SPL has complete parallel capabilities, and in many cases the parallel code looks just like the single-threaded version except for an added @m option. For example, compute the Euclidean distance between each of 100 vectors and one target vector:


	A
1	=100.(10.(rand()))
2	=10.(rand())
3	=A1.(dis(~,A2))
4	=A1.@m(dis(~,A2))

A3 is the single-threaded version and A4 the multithreaded one; the two differ only by @m, yet computing efficiency improves several-fold.
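For contrast, a loose Python analogue of the A3/A4 pair is sketched below (an illustration, not a benchmark). Pure-Python loops gain nothing from threads because of the GIL, so the usual workaround is to partition the work across NumPy calls, which release the GIL inside their C routines, or to pay process-pool overhead.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# 100 random 10-D vectors and one target, as in the SPL grid above
rng = np.random.default_rng(0)
vecs = rng.random((100, 10))
target = rng.random(10)

def chunk_dist(chunk):
    # NumPy releases the GIL inside this call, so threads can overlap;
    # an equivalent pure-Python loop would not run in parallel
    return np.linalg.norm(chunk - target, axis=1)

# single-threaded version (analogous to A3)
d_single = np.linalg.norm(vecs - target, axis=1)

# multi-threaded version (loosely analogous to adding @m in A4)
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(chunk_dist, np.array_split(vecs, 4)))
d_multi = np.concatenate(parts)

print(bool(np.allclose(d_single, d_multi)))  # → True
```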

"SPL Practice: High-Dimensional Binary Vector Search" (hereinafter "Practice") is a case study of similar-vector search that we recently completed with SPL. It shows clearly how case-specific high-dimensional vector search is. First, choosing a clustering method is hard: the common k-means algorithm needs the parameter k fixed in advance, and in this case k could not be determined;

moreover, k-means moves centroids during clustering, and the centroid of binary vectors is not easy to define. To build a more effective index, we had to design clustering algorithms customized to this dataset: split clustering and stepwise clustering. Second is the choice of similarity measure. The industry's most common choice is cosine similarity, but for the binary vectors in this case it contradicts intuition. For example, when a vector contains many 1s, even if only a few dimensions differ, the computed cosine similarity is very large;

when there are many 0s and the same few dimensions differ, the computed cosine similarity is much smaller. We therefore chose XOR distance to evaluate similarity. The results match intuition better, and it brings another benefit: XOR distance is cheaper to compute.

The whole exercise demonstrates the convenience of programming in SPL. For basic algorithms, SPL provides the bits() function to pack a binary sequence bitwise into long integers; this one step turns the computation into bitwise operations and greatly improves efficiency (in both index building and vector comparison). The bit1() function then computes the XOR distance between two vectors, improving efficiency further.
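The idea behind bits()/bit1() can be sketched in Python (these functions are hypothetical stand-ins, not SPL's implementation): pack each binary vector into an integer, and the XOR distance becomes a popcount over the XOR of two words.

```python
def pack_bits(bits):
    """Pack a 0/1 sequence into a single integer, bitwise
    (a toy analogue of SPL's bits())."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word

def xor_distance(x, y):
    """XOR (Hamming) distance between two packed vectors: the number
    of bit positions where they differ (analogue of SPL's bit1())."""
    return bin(x ^ y).count("1")

u = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
v = pack_bits([1, 1, 0, 1, 0, 1, 1, 0])
print(xor_distance(u, v))  # → 3
```

One XOR plus one popcount replaces a per-dimension loop, which is where the efficiency gain comes from.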

Beyond these basic algorithms, SPL also implements the two clustering methods, split clustering and stepwise clustering, with ease. Both algorithms have fairly complex logic, yet their core SPL code takes only a dozen or so cells; in C++ or Java, hundreds or even thousands of lines would probably be needed.

For high performance, the practice uses the efficient computation methods SPL provides, such as group@n(), where the @n option groups by sequence number, far faster than ordinary hash grouping. SPL has many other efficient algorithms, such as binary search and ordered computation, and they are equally convenient to use, usually requiring only an @b or @o option.

As for parallelism, it is used in many places during the clustering process. SPL parallelizes ordinary functions simply by adding the @m option (e.g. group@nm()), making full use of multi-core CPUs. Python is weak in this respect, and so is not as easy to use as one might imagine.

Although this practice did not involve big data, SPL would have no problem there either: its complete cursor mechanism, combined with the efficient storage scheme, makes it easy to write high-performance vector computation programs over large datasets.

The most direct takeaway from this exercise is that similar-vector search is not simple. It usually cannot be accomplished just by permuting and combining the algorithm APIs a vector database provides (in fact, even those permutations demand considerable knowledge of statistics and mathematics, plus trial and error). In such cases, using SPL may be the better choice.

As for deployment, vector databases are generally heavyweight and complicated to deploy, debug, and maintain, and some are cloud-only, while data in many scenarios cannot go to the cloud for security reasons. SPL is very light: deployment, debugging and maintenance are easy, and it can even be embedded in applications, bringing vector computing capability anywhere.

SPL's set of clustering algorithms is not yet complete, but as these basic algorithms are filled in, SPL will be more of a vector database than the existing vector databases are.

GitHub:https://github.com/SPLWare/esProc


Origin blog.csdn.net/weixin_47080540/article/details/132680494