Preface
Gai Guoqiang, chairman of the China Database Alliance, said: The emergence of vector technology has opened a door for the management of unstructured data. Conceptually, a vector database is a database that stores data as high-dimensional vectors. Each vector contains multiple dimensions, and each dimension represents a different feature or attribute, which preserves the information in the original data. The characteristics of unstructured data are described through vector embeddings, and operations such as query, deletion, modification, and metadata filtering can be completed quickly. Compared with traditional relational databases, vector databases can perform similarity retrieval quickly and accurately by using vector similarity algorithms.
Vector databases are not actually a new technology, but they attracted little attention for a long time and so remain relatively unknown. However, as vector retrieval became a typical application scenario and a common requirement, the true value of vector databases gradually emerged. This article explains what a vector database is and surveys the vector database products currently available.
What is a vector database?
In the 21st century, information is diverse. Some of it is unstructured, such as text documents, rich media, and audio, while some is structured, such as application logs, tables, and charts. The development of artificial intelligence and machine learning (AI/ML) allows us to build a type of machine learning model called an embedding model. Embedding models encode various types of data into vectors in order to capture the meaning and context of an asset. This allows us to find similar assets by searching for neighboring data points. Vector search enables new experiences, such as taking a photo with a smartphone and then searching for similar images.
Vector databases store vectors as high-dimensional points and retrieve them. They have the additional capability of finding nearest neighbors in N-dimensional space quickly and efficiently. Typically, this capability is provided by k-nearest neighbor (k-NN) indexes built with algorithms such as Hierarchical Navigable Small World (HNSW) and Inverted File Index (IVF). Vector databases also provide other features, such as data management, fault tolerance, authentication and access control, and a query engine.
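To make the idea of nearest-neighbor search concrete, here is a minimal sketch of exact (brute-force) k-NN in plain Python. It is not how any particular product works: indexes such as HNSW and IVF exist precisely to avoid scanning every vector like this. The item names and the three-dimensional vectors are made up for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, vectors, k):
    """Return the ids of the k stored vectors closest to the query.

    This scans every stored vector, so it is exact but slow on large data.
    """
    ranked = sorted(vectors, key=lambda item_id: euclidean(query, vectors[item_id]))
    return ranked[:k]

# Hypothetical embeddings: similar items are assumed to lie close together.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.8],
    "bus": [0.1, 0.8, 0.9],
}

nearest = knn([0.9, 0.1, 0.05], vectors, k=2)
print(nearest)
```

With these toy vectors the query lands next to the two animal entries, which is the behavior a vector database reproduces at much larger scale with approximate indexes.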
What role do vector databases play in large models?
The vector database serves as the storage layer for large models and is crucial to mitigating their "hallucination" problem. With the continuous development of artificial intelligence technology, the application scenarios of large models in various industries continue to increase, and the multimodal data that needs to be processed becomes more complex. Vector databases, which hold the universal data form through which artificial intelligence understands the world, will play a key role in many fields. In the future, multimodal vectorization will become an important trend in vector databases. Converting multimodal data into vector form and compressing it allows large models to access that data more efficiently during learning and training, which makes them more intelligent and better at answering questions.
Vector databases have broad development prospects and are expected to become an important infrastructure in the field of artificial intelligence and promote artificial intelligence technology to a higher level of development.
Amazon OpenSearch Serverless vector engine
A few months ago, Amazon Cloud Technology launched the vector engine for Amazon OpenSearch Serverless, which provides users with simple, scalable, and high-performance similarity search capabilities. This enables users to easily create modern, machine learning-enhanced search experiences and generative AI applications without having to manage the underlying vector database infrastructure. Although it is still in preview, its performance and functionality are already very powerful.
Amazon OpenSearch Serverless is built on OpenSearch, a distributed, community-driven, Apache 2.0-licensed, 100% open source search and analytics suite with a wide range of usage scenarios, including real-time application monitoring, log analysis, and website search. OpenSearch provides a highly scalable system that makes large amounts of data quickly accessible and responsive, with an integrated visualization tool, OpenSearch Dashboards, that lets users easily explore their data. OpenSearch is powered by the Apache Lucene search library and supports a variety of search and analysis functions, including k-nearest neighbor (k-NN) search, SQL, anomaly detection, Machine Learning Commons, Trace Analytics, and full-text search. It has the following characteristics:
1. Achieve better operations with popular open source solutions hosted by AWS
2. Audit and protect data with a built-in certified data center and network architecture.
3. Systematically detect potential threats and react based on system status using technologies such as machine learning, alerting, and visualization.
4. Optimize time and resources to ensure focus on strategic work.
Amazon OpenSearch Service can help you easily perform interactive log analysis, real-time application monitoring, website search, and other tasks. OpenSearch is an open source distributed search and analytics suite derived from Elasticsearch. The service has tens of thousands of active customers, hosts hundreds of thousands of clusters, and processes hundreds of billions of requests every month. This is enough to show that OpenSearch is a very powerful product that attracts many users. Its working principle is shown in the figure below:
Usage scenarios
Amazon OpenSearch Serverless has many usage scenarios, including image search, document search, music retrieval, product recommendation, video search, location-based search, fraud detection, and anomaly detection. For example, hybrid search powered by the vector engine enables users to query vector embeddings, metadata, and descriptive information in a single query call, making it easy to provide more accurate, contextually relevant search results without building complex application code.
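As a rough illustration of what "a single query call" can look like, the sketch below builds a request body that combines a vector similarity clause with a metadata filter. It follows the general shape of the OpenSearch k-NN query DSL as I understand it; the field names `embedding` and `category` are hypothetical, and the exact options supported by the Serverless vector engine should be checked against the current documentation before use.

```python
import json

def build_hybrid_query(query_vector, k, category):
    """Build a k-NN search body that also filters on a metadata field.

    Only constructs the request body; sending it to a cluster is out of scope.
    """
    return {
        "size": k,
        "query": {
            "knn": {
                # "embedding" is a hypothetical vector field name.
                "embedding": {
                    "vector": query_vector,
                    "k": k,
                    # "category" is a hypothetical metadata field.
                    "filter": {"term": {"category": category}},
                }
            }
        },
    }

body = build_hybrid_query([0.1, 0.2, 0.3], k=5, category="shoes")
print(json.dumps(body, indent=2))
```

Keeping the filter inside the same request is what spares the application from running a vector search and a metadata lookup separately and merging the results itself.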
Other vector databases
Faiss
Faiss is an open source library focused on efficient similarity search and clustering of dense vectors. It is written in C++ and provides a complete Python/numpy wrapper. In addition, some of the most common algorithms also have GPU implementations. The library provides a variety of indexing algorithms for building different types of indexes, and it supports similarity calculation based on Euclidean distance or dot product. Some index types are simple baselines that perform exact search. For most of the available index structures, there are trade-offs between search time, search quality, and the memory used per indexed vector.
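The trade-off between search time and search quality can be seen in a toy inverted-file (IVF-style) index. This is a plain-Python sketch of the general idea, not Faiss code and not its API: the class name `TinyIVF`, the fixed centroids, and the `nprobe` parameter name are assumptions chosen for illustration. Vectors are grouped by their nearest centroid, and a search only scans the `nprobe` closest groups.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class TinyIVF:
    """Toy inverted-file index: one list of vectors per centroid."""

    def __init__(self, centroids):
        # Centroids are given up front here; real systems learn them by clustering.
        self.centroids = centroids
        self.lists = [[] for _ in centroids]

    def add(self, vec_id, vec):
        # Store the vector in the list of its nearest centroid.
        nearest = min(range(len(self.centroids)), key=lambda i: l2(vec, self.centroids[i]))
        self.lists[nearest].append((vec_id, vec))

    def search(self, query, k, nprobe):
        # Scan only the nprobe lists whose centroids are closest to the query.
        order = sorted(range(len(self.centroids)), key=lambda i: l2(query, self.centroids[i]))
        candidates = [item for i in order[:nprobe] for item in self.lists[i]]
        candidates.sort(key=lambda item: l2(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:k]]

index = TinyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]])
for vec_id, vec in [("a", [1.0, 1.0]), ("b", [4.0, 4.0]), ("c", [6.0, 6.0]), ("d", [9.0, 9.0])]:
    index.add(vec_id, vec)

query = [4.9, 4.9]
fast = index.search(query, k=2, nprobe=1)   # scans one list: less work, may miss neighbors
exact = index.search(query, k=2, nprobe=2)  # scans all lists: same result as brute force
print(fast, exact)
```

In this example the single-probe search misses vector "c", which sits just across the cluster boundary, while probing both lists recovers it at the cost of scanning every vector.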
Milvus
Milvus is an open source distributed vector database characterized by high availability, high performance, and easy scalability, and it is used to retrieve massive amounts of vector data in real time. It is built on vector search libraries such as Faiss, Annoy, and HNSW, and is designed primarily for similarity retrieval over dense vectors. In addition to the functions of a vector retrieval library, Milvus also supports features such as data partitioning and sharding, data persistence, incremental data ingestion, scalar-vector hybrid queries, and time travel. At the same time, Milvus has greatly optimized the performance of vector retrieval, which lets it meet the application needs of various vector retrieval scenarios.
Chroma
Chroma is a lightweight vector database implemented on top of a vector retrieval library. It integrates all the elements needed by beginners and provides a simple API. Currently only CPU computation is supported, but the product quantization method is used to divide the vector dimensions into multiple segments and run k-means clustering on each segment separately, which reduces storage space and improves retrieval efficiency. It can also be integrated with LangChain to implement language model-based applications. Chroma has the advantage of being easy to use and lightweight, but its functionality is relatively simple and it does not support GPU acceleration.
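The product quantization idea described above (split each vector into segments, then replace each segment with the index of its nearest cluster center) can be sketched in a few lines of plain Python. This is an illustration of the general technique, not Chroma's code. The codebooks below are hand-written stand-ins; in practice they would come from running k-means on each segment of the training data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vec, m):
    """Split vec into m equal-length segments (len(vec) must be divisible by m)."""
    size = len(vec) // m
    return [vec[i * size:(i + 1) * size] for i in range(m)]

def encode(vec, codebooks):
    """Replace each segment with the index of its nearest codeword."""
    segments = split(vec, len(codebooks))
    return [
        min(range(len(book)), key=lambda j: sq_dist(seg, book[j]))
        for seg, book in zip(segments, codebooks)
    ]

def decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen codewords."""
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out

# Hypothetical codebooks, one per segment, each holding two 2-dimensional codewords.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

vector = [0.9, 0.8, 0.1, 0.7]
codes = encode(vector, codebooks)   # two small integers instead of four floats
approx = decode(codes, codebooks)   # lossy reconstruction of the original vector
print(codes, approx)
```

The storage saving comes from keeping only the small integer codes per vector plus one shared set of codebooks, at the price of an approximate reconstruction.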
Elasticsearch
Elasticsearch is an open source distributed search engine that can be used for search, log statistics, analysis, system monitoring, and other functions, and it can address a variety of emerging use cases. At the heart of the Elastic Stack, Elasticsearch centralizes data, performs searches on the fly, fine-tunes relevance, runs powerful analytics, and scales easily.
Tencent Cloud VectorDB
Tencent Cloud Vector Database (Tencent Cloud VectorDB) is a fully managed, self-developed, enterprise-level distributed database service dedicated to storing, retrieving, and analyzing multidimensional vector data. The database supports multiple index types and similarity calculation methods. A single index supports vectors at the billion scale, with millions of QPS and millisecond-level query latency. Tencent Cloud Vector Database can not only provide an external knowledge base for large models and improve the accuracy of their answers, but can also be widely used in AI fields such as recommendation systems and natural language processing.
Application scenarios of vector databases
The application scenarios of vector databases are very wide, mainly including the following aspects:
Image and video processing
In image and video processing scenarios, a large amount of image and video data needs to be processed, where image and video data are often represented by vectors. Vector databases can be used to store and manage image and video feature vector data, and use vector similarity algorithms to achieve efficient image and video processing.
Natural language processing
In natural language processing scenarios, a large amount of text data needs to be processed, and text data is often represented by vectors. Vector databases can be used to store and manage text vector data and use vector similarity algorithms to achieve efficient natural language processing.
Recommendation system
In the recommendation system scenario, a large amount of user behavior data and product feature data need to be processed, among which product feature data is often represented by vectors. Vector databases can be used to store and manage product feature vector data, predict customer needs and provide personalized experiences tailored to their interests.
Search engine
In the search engine scenario, a large amount of text data needs to be processed and mapped into vector space for search. Vector databases can be used to store and manage text vector data and use vector similarity algorithms to enable efficient searches.
Facial recognition and identity verification
In face recognition and identity verification scenarios, a large amount of face data needs to be processed and mapped into vector space for comparison. Vector databases can be used to store and manage facial feature vector data, and use vector similarity algorithms to achieve efficient face recognition and identity verification.
Personalized chatbot based on “facts”
With a vector database holding the underlying facts, a chatbot can provide interactive responses and assistance to better support your customers.
Summary
Vector databases accelerate large model training and inference, and large models in turn make vector databases more important. Vector databases improve developer productivity with production-grade support for vector embedding search, and they are extremely scalable and efficient. The vector databases listed above all have their own advantages. In my personal opinion, however, Amazon OpenSearch Serverless stands out for its powerful performance and scalability and can meet the needs of various applications. If you want to learn about or use vector databases, Amazon Cloud Technology recently also began offering a free trial of its vector database service, which is worth a look.