Cloud-Native Vector Computing Engine PieCloudVector: Dedicated Memory for Large Models

The PieData Computing System (abbreviated πDataCS) was officially released at the 2023 PieData Computing System Annual Technology Forum on Programmer's Day, October 24. πDataCS uses cloud-native technology to rebuild data storage and computing around the idea of "one storage, multiple computing engines", making AI models larger and faster and upgrading big data systems for the era of large models. In addition to the cloud-native virtual data warehouse PieCloudDB Database, the second computing engine supported by πDataCS, the cloud-native vector computing engine PieCloudVector, has also been officially released. PieCloudVector supports massive vector data storage and efficient vector queries, facilitating multi-modal AI applications built on large models.

AI will lead the next wave of global GDP growth. According to a June 2023 McKinsey report, generative AI (built on large models) could add approximately 2.6 to 4.4 trillion U.S. dollars to global GDP every year, roughly the United Kingdom's entire 2021 GDP (3.1 trillion U.S. dollars). Goldman Sachs likewise noted in an April 2023 report that generative AI could lift global GDP by 7%. The rapid rise of large models has driven continuous innovation in generative AI applications, and the growing demand for large-scale vector data processing and similarity search has in turn pushed vector databases forward.

PieCloudVector, Tuoshupai's self-developed cloud-native vector computing engine, is the second computing engine of πDataCS. It is a dimensional upgrade of the analytical database for the era of large models, with the goal of facilitating multi-modal AI applications built on large models and enabling the storage and efficient querying of massive vector data. PieCloudVector works with the embeddings produced by large models to help base models adapt quickly and be further developed for domain-specific AI scenarios.

1 Large Models and Vectors

With the explosive growth of data and the improvement of computing power, large models have become an important tool for handling complex problems and analyzing massive data. Large models refer to machine learning models with large parameter scales, high complexity, and powerful learning capabilities. These models typically consist of millions or even billions of parameters and are trained on large-scale data to acquire knowledge and reasoning capabilities. The emergence of large models has enabled significant breakthroughs in tasks in various fields, such as natural language processing, image recognition, speech recognition, and recommendation systems.

Vectorized representation of features

In mathematics and computer science, a vector is a quantity that has both magnitude and direction. Here, a vector represents a set of "features" as an array of floating-point numbers. These features are extracted, typically by a large model, from the digital representation (text, image, audio, video, etc.) of a real-world object (a cat, a flower, etc.), as shown in the figure above. By converting real objects into vector representations, computation and comparison can be performed in vector space, for example similarity calculation, cluster analysis, and classification. Vector representations also provide the foundation for recommendation systems, sentiment analysis, information retrieval, and other tasks.

2 What is a vector database

A vector database is a database system designed specifically to store and manage vector data, providing efficient storage, indexing, and query capabilities for vectors.

In vector search, different distance measures (such as Euclidean distance, cosine similarity, and Manhattan distance) can be used to compute the distance between two vectors: the closer the distance, the more similar the vectors. As shown in the figure below, the similarity between "Paipai" and "Sloth" can be determined by computing the cosine similarity of their vectors.

Calculate cosine similarity of vectors
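As a minimal illustration of the idea (with made-up example vectors, not output from PieCloudVector), cosine similarity can be computed as follows:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means the same direction, 0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings for "Paipai" and "Sloth" (illustrative values only)
paipai = np.array([0.9, 0.1, 0.3, 0.7])
sloth = np.array([0.8, 0.2, 0.4, 0.6])

print(cosine_similarity(paipai, sloth))  # close to 1.0, so the two vectors are similar
```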

Traditional databases excel at exact matching but lack native storage and indexing for high-dimensional floating-point vectors, so they cannot process vector data efficiently. Vector databases emerged to store and query vector data efficiently.

A vector database meets the specific needs of storing and processing vector data: it efficiently stores vectors together with the original entities (text/image/audio) and keeps them associated. On this basis it provides efficient similarity search, large-scale data management, complex vector computation, and real-time recommendation, helping users better utilize and analyze vector data and supporting large-model applications.

Tuoshupai believes that, in addition to efficient vector storage and similarity search, an excellent vector database must also provide transactional ACID guarantees and user permission control, so that insert, update, and delete operations on vector data execute correctly and data remains consistent under concurrent access. This delivers a stable, reliable, and secure service suitable for a wide range of data management and application scenarios, and it is the design philosophy behind PieCloudVector.

3 Cloud-Native Vector Computing Engine PieCloudVector

After evaluating open-source implementations such as pgvector and pgembedding and their performance, the Tuoshupai team chose not to adopt them and instead developed PieCloudVector entirely in-house to fit its users' scenarios. PieCloudVector offers efficient storage and retrieval of vector data, similarity search, vector indexing, vector clustering and classification, high-performance parallel computing, strong scalability, and fault tolerance.

3.1 PieCloudVector architecture

In terms of architecture, the Tuoshupai team drew on the experience and advantages it accumulated in eMPP (elastic MPP) and distributed architecture while building PieCloudDB, πDataCS's first computing engine and cloud-native virtual data warehouse, to create the eMPP distributed architecture of the vector computing engine PieCloudVector. As shown in the figure below, each Executor of PieCloudVector corresponds to a PieCloudVector instance, delivering high-performance, scalable, and reliable vector storage and similarity search. The converted vector representations are stored in "Jianmo", πDataCS's unified storage engine.

PieCloudVector’s eMPP distributed architecture

With just a client, users can run similarity searches from any language. With PieCloudVector, users can not only store and manage the vectors corresponding to their original data, but also call PieCloudVector's tools to perform fuzzy (approximate) search, which trades a little precision relative to an exhaustive global search for millisecond-level responses, further improving query efficiency.

3.2 PieCloudVector Features

PieCloudVector provides two search modes: exact search and fuzzy (approximate) search. Currently, PieCloudVector offers users the following features:

  • Approximate vector search (KNN/ANN)
  • Mainstream ANN algorithms such as IVFFlat and HNSW
  • Vector compression (PQ)
  • Parallel and distributed execution
  • SIMD/GPU acceleration
  • Langchain framework support

Next, we introduce the first two of these features in detail:

3.2.1 Approximate Search: KNN and ANN

K-Nearest Neighbor (KNN) search is one of the basic problems of vector search: given a query vector, find the K closest vectors among the existing N vectors. With K-nearest-neighbor search, applications such as similar-image retrieval, related-news recommendation, and user-profile matching can be built. It makes it possible to quickly find the vectors most similar to a given vector based on inter-vector distance or similarity, providing efficient similarity search and recommendation services.
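For reference, here is a minimal NumPy sketch of exact (brute-force) KNN, using randomly generated vectors rather than anything from PieCloudVector:

```python
import numpy as np

def knn_exact(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Exact KNN: compare the query against every stored vector using L2 distance."""
    distances = np.linalg.norm(vectors - query, axis=1)  # one distance per stored vector
    return np.argsort(distances)[:k]                     # indices of the k closest vectors

rng = np.random.default_rng(0)
vectors = rng.random((10_000, 128))    # N = 10,000 stored 128-dimensional vectors
query = rng.random(128)
print(knn_exact(query, vectors, k=5))  # indices of the 5 nearest neighbors
```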

However, as the amount of data grows, an exact query must compare the input vector against every record, and the computational cost grows rapidly. To address this, PieCloudVector builds vector indexes that capture the approximate relationships between data in advance and speed up queries. PieCloudVector introduces Approximate Nearest Neighbor (ANN) algorithms to build these indexes. With ANN, PieCloudVector avoids a full global scan, sacrificing some accuracy to accelerate queries and achieve millisecond-level fuzzy search.

PieCloudVector offers a variety of ANN algorithms when building vector indexes, including the popular IVFFlat (Inverted File with Flat) and HNSW (Hierarchical Navigable Small World) algorithms. Users can choose according to the characteristics of their data (a minimal sketch of building both index types follows the list below):

  • IVFFlat algorithm (left): a vector indexing algorithm based on inverted files. It groups the vector data ahead of time and builds an inverted index for each group. During a fuzzy query, IVFFlat only scans the groups closest to the target vector, which speeds up the search and reduces memory consumption. Because of this grouping, however, IVFFlat's accuracy is generally lower.
  • HNSW algorithm (right): a vector indexing algorithm based on hierarchical navigation. It builds its index by establishing a layered "relationship network" between data points, which takes more time and memory. In return, HNSW's accuracy is generally better than IVFFlat's: it captures the local structure and similarities between data points well and supports efficient approximate search.
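As a rough intuition for the two index types, the following sketch builds both with the open-source Faiss library; this is an illustrative assumption, not PieCloudVector's internal implementation, and parameters such as nlist, M, and nprobe are arbitrary example values.

```python
import numpy as np
import faiss  # open-source ANN library, used here only to illustrate the two index types

d, nlist, M = 128, 100, 32
xb = np.random.random((10_000, d)).astype("float32")  # vectors to index
xq = np.random.random((1, d)).astype("float32")       # query vector

# IVFFlat: cluster the data into nlist groups, then search only the closest groups
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)             # learn the coarse clustering (the "inverted file" groups)
ivf.add(xb)
ivf.nprobe = 10           # number of groups visited per query (recall/speed trade-off)
dist_ivf, idx_ivf = ivf.search(xq, 5)

# HNSW: build a layered "relationship network" (graph); no training step is needed
hnsw = faiss.IndexHNSWFlat(d, M)
hnsw.add(xb)              # building the graph costs more time/memory; accuracy is usually higher
dist_hnsw, idx_hnsw = hnsw.search(xq, 5)
```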

 

3.2.2 Vector compression

Vector similarity search requires a large amount of memory when processing large-scale data. For example, an index containing one million dense vectors typically needs several GB of memory. High-dimensional data makes the problem worse: as dimensionality grows, the vector representation space becomes extremely large and demands even more memory.

Vector compression via Product Quantization (PQ) is a common way to relieve this memory pressure. It compresses high-dimensional vectors and thereby significantly reduces memory usage. By splitting each vector into several sub-vectors and quantizing each sub-vector against a small codebook, PQ replaces the original high-dimensional floating-point vector with a short code, reducing memory requirements.

After applying PQ, the memory needed to store indexes can be reduced by up to 97%, allowing PieCloudVector to manage memory more efficiently on large-scale data sets and speed up similarity search. PQ also improves nearest-neighbor search itself, typically making it about 5.5 times faster. Combining PQ with an Inverted File (IVF) into an IVF+PQ composite index can raise search speed by a further 16.5 times without hurting accuracy, for an overall speedup of up to 92 times compared with an unquantized index.

Vector compression (Product Quantization)
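For illustration only, here is a minimal IVF+PQ composite index built with Faiss (again an assumption, not PieCloudVector code; the memory and speed figures above come from the article and are not reproduced by this snippet):

```python
import numpy as np
import faiss

d, nlist, m, nbits = 128, 100, 16, 8            # each vector -> 16 sub-vectors, 8-bit codes
xb = np.random.random((100_000, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                  # learn coarse clusters and the PQ codebooks
index.add(xb)                                    # each vector is stored as an m*nbits = 128-bit code
index.nprobe = 10

# 128 float32 values (512 bytes) per vector are replaced by a 16-byte PQ code
dist, idx = index.search(xb[:1], 5)
```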

4 Typical Application Scenarios of PieCloudVector

Based on how vectors are actually used, PieCloudVector's application scenarios can be roughly divided into four layers, each corresponding to a different stage in the use of vectors.

4.1 Data Preparation and Segmentation (Images, Text, Audio, etc.)

This layer covers data preparation and segmentation for raw data such as images, text, and audio. The raw data needs to be preprocessed, cleaned, and segmented, with features extracted, so that it is suitable for subsequent processing. In short, this step turns raw data into the input used to create embeddings; a minimal text-chunking sketch is shown below.
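A minimal sketch of the segmentation step for text, using a simple fixed-size chunker with overlap; the chunk sizes are arbitrary example values, and real pipelines often use smarter splitters:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split raw text into overlapping chunks so each piece fits the embedding model."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "PieCloudVector supports massive vector data storage and efficient query. " * 40
print(len(chunk_text(document)))  # number of chunks produced
```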

4.2 Create Embeddings

At this layer, data is converted into vector representations by appropriate algorithms or models, so that the vectors reflect the data's features and semantics. For example, models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Transformers can be used to generate embedding representations of images, text, or audio; a minimal sketch follows.
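A minimal sketch of creating embeddings with an off-the-shelf model, assuming the open-source sentence-transformers library and the commonly used all-MiniLM-L6-v2 model; any embedding model that outputs fixed-length vectors would fit here:

```python
from sentence_transformers import SentenceTransformer  # assumed embedding library

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model with 384-dimensional output
texts = [
    "PieCloudVector is a cloud-native vector computing engine.",
    "Vector databases store and query embeddings efficiently.",
]
embeddings = model.encode(texts)                 # NumPy array of shape (2, 384)
print(embeddings.shape)
```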

4.3 Storing vectors

At this layer, the created vector representation is stored for subsequent vector searches. PieCloudVector supports distributed vector storage, can flexibly expand storage resources, and reduces memory usage through vector compression.

4.4 Vector Search

At this layer, similarity search is performed over the stored vectors. PieCloudVector provides efficient vector search: through KNN and ANN search algorithms, it supports L2 distance, inner product, and cosine distance as distance measures and can quickly find the vectors most similar to a given query vector. This capability is widely used in similar-image retrieval, related-news recommendation, user-profile matching, and other scenarios.
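The three distance measures are closely related; a small NumPy sketch with illustrative vectors:

```python
import numpy as np

a = np.array([0.9, 0.1, 0.3, 0.7])
b = np.array([0.8, 0.2, 0.4, 0.6])

l2 = np.linalg.norm(a - b)                                        # L2 (Euclidean) distance
ip = np.dot(a, b)                                                 # inner product (larger = more similar)
cos = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine distance

# For unit-normalized vectors, inner product equals cosine similarity,
# and L2 distance becomes a monotonic function of cosine distance.
print(l2, ip, cos)
```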

The figure below shows the application architecture of PieCloudVector in a knowledge base system, covering six steps from text segmentation to the application returning an answer to the user. The knowledge base system uses PieCloudVector to support semantic search and answer retrieval: it converts text into vector representations and finds relevant answers through vector similarity search. This architecture can efficiently process large-scale text data and return accurate answers to users.

Application process architecture of knowledge base system
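A schematic sketch of this flow; the functions embed, vector_search, and llm_generate are hypothetical placeholders standing in for the embedding model, the PieCloudVector search call, and the large model, not actual PieCloudVector APIs:

```python
def answer_question(question, kb_chunks, kb_vectors, embed, vector_search, llm_generate):
    """Schematic knowledge-base QA flow: embed the question, retrieve similar chunks, ask the LLM."""
    q_vec = embed(question)                           # turn the user question into a vector
    top_ids = vector_search(kb_vectors, q_vec, k=3)   # similarity search over stored chunk vectors
    context = "\n".join(kb_chunks[i] for i in top_ids)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)                       # the large model composes the final answer
```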

Going forward, PieCloudVector will continue to iterate and evolve, providing dedicated memory and support for large models. As generative AI and large models develop, PieCloudVector will more deeply integrate the strengths of vector databases and combine tightly with other technologies and algorithms.

PieCloudVector will continue to improve its storage, indexing and query capabilities to cope with increasingly complex and large vector data. It will explore new quantization algorithms, approximate search methods, and parallel computing strategies to improve query efficiency and accuracy.

At the same time, PieCloudVector will be committed to integrating with application scenarios in different fields, and will gradually expand its support for multi-modal data processing and analysis capabilities to provide more comprehensive and flexible solutions.
