Vector Database Comparison and Selection Guide

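Source: DeepHub IMBA
This article is about 3,400 words; suggested reading time is 6 minutes.
This article examines practical ways to store and retrieve vector data and perform similarity searches.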

Vector databases are designed for efficient storage, retrieval and similarity search of high-dimensional vector data. Raw data is represented as continuous, meaningful high-dimensional vectors through a process called embedding.

This article will examine practical ways to store/retrieve vector data and perform similarity searches. Before we delve further, we will first introduce two key functions of vector databases:

1. Ability to perform searches

When given a query vector, a vector database can retrieve the most similar vectors based on a specified similarity measure such as cosine similarity or Euclidean distance. This allows applications to find related items or data points based on their similarity to a given query.
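As a rough illustration of what such a search computes, here is a minimal exact (brute-force) sketch in plain Python. The function names and the toy three-dimensional vectors are made up for this example; a real vector database would use an index instead of scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, vectors, k=2, metric="cosine"):
    """Return the ids of the k stored vectors most similar to the query."""
    if metric == "cosine":
        # Higher similarity is better, so sort in descending order.
        ranked = sorted(vectors.items(),
                        key=lambda kv: cosine_similarity(query, kv[1]),
                        reverse=True)
    else:
        # Smaller distance is better, so sort in ascending order.
        ranked = sorted(vectors.items(),
                        key=lambda kv: euclidean_distance(query, kv[1]))
    return [item_id for item_id, _ in ranked[:k]]

# Toy embeddings (hypothetical values, not produced by a real model).
store = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
result = top_k([1.0, 0.0, 0.0], store, k=2)
```

With these toy vectors, both metrics rank "cat" first and "dog" second for the query above.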

2. High performance

Vector databases usually use indexing techniques, such as approximate nearest neighbor (ANN) algorithms, to speed up the search process. These indexing methods reduce the computational cost of searching high-dimensional vector spaces, where traditional methods such as space partitioning become impractical.
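To make the idea of an ANN index more concrete, the sketch below uses random hyperplane hashing, one simple member of the ANN family. It is chosen only because it fits in a few lines; it is not the index used by any product named in this article, and all function names are hypothetical. Vectors that fall on the same side of every random hyperplane share a bucket, so a query only has to be compared against the vectors in its own bucket.

```python
import random

def make_hyperplanes(dim, n_planes, seed=0):
    """Draw n_planes random normal vectors; each defines a hyperplane through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def bucket_key(vector, planes):
    """Hash a vector to a tuple of booleans: which side of each hyperplane it lies on."""
    return tuple(sum(p * v for p, v in zip(plane, vector)) >= 0 for plane in planes)

def build_index(vectors, planes):
    """Group item ids by bucket key. This is the (cheap) indexing step."""
    index = {}
    for item_id, vec in vectors.items():
        index.setdefault(bucket_key(vec, planes), []).append(item_id)
    return index

def candidates(query, index, planes):
    """Return only the ids that share the query's bucket, instead of every id."""
    return index.get(bucket_key(query, planes), [])

# Toy data (hypothetical values).
vectors = {
    "a": [1.0, 0.2, 0.0, 0.1],
    "b": [0.9, 0.3, 0.1, 0.0],
    "c": [-1.0, 0.0, 0.5, 0.2],
    "d": [0.0, -0.8, 0.1, 0.9],
}
planes = make_hyperplanes(dim=4, n_planes=3)
index = build_index(vectors, planes)
shortlist = candidates(vectors["a"], index, planes)
```

The shortlist would normally be re-ranked with an exact similarity measure. Because a true neighbor can land in a different bucket, the result is approximate, which is the accuracy-for-speed trade-off that ANN indexes make.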

Introduction

The vector database field is expanding rapidly. How should you weigh the options? I have sorted them into five main categories:

  • Pure vector databases, such as Pinecone, which is itself built on Faiss (listed below)

  • Full-text search databases, such as ElasticSearch, which started out as search engines and have now added vector storage and retrieval

  • Vector libraries, such as Faiss, Annoy and Hnswlib, which are not databases and only handle vector processing

  • NoSQL databases that support vectors, such as MongoDB, Cosmos DB and Cassandra, which are long-established data stores that have added vector functions

  • SQL databases that support vectors, such as SingleStoreDB or PostgreSQL, which differ from the above in that they support SQL statements

In addition to the five main categories above, there are platforms such as Vertex AI and Databricks whose capabilities go beyond databases; we will not discuss them here.

1. Pure vector database

Pure vector databases are designed specifically for storing and retrieving vectors. They include Chroma, LanceDB, Marqo, Milvus/Zilliz, Pinecone, Qdrant, Vald, Vespa, Weaviate and others. Data is organized and indexed based on vector representations of objects or data points. These vectors can be numerical representations of various types of data, including images, text documents, audio files, or any other form of structured or unstructured data.

Advantages

  • Efficient similarity search using indexing techniques

  • Scalability for large datasets and high query workloads

  • Support for high-dimensional data

  • Supports HTTP and JSON based APIs

  • Native support for vector operations including addition, subtraction, dot product, cosine similarity

Disadvantages

Vectors only: A pure vector database can store vectors and some metadata, but little else. Most use cases also need other data, such as descriptions of entities, attributes and hierarchies (graphs) or locations (geospatial), which requires integrating other stores.

Limited or no SQL support: Pure vector databases often use their own query language, which makes it difficult to run traditional analytics on vectors and related information, and to combine vectors with other data types.

No full CRUD: Pure vector databases are not really designed for create, update and delete operations. Data must be vectorized and indexed first, and the focus of these databases is ingesting vector data and querying nearest neighbors by vector similarity. Indexing vector data is computationally intensive, costly and time-consuming, which makes real-time updates largely impractical. For example, Pinecone's IMI index (Inverted Multi-Index, a variant of ANN) incurs storage overhead and is computationally intensive; it is designed primarily for static or semi-static datasets, in which vectors are rarely added, modified or deleted. Milvus uses indexes such as product quantization (PQ) and Hierarchical Navigable Small World (HNSW), approximate techniques that trade search accuracy against efficiency. These indexes have to be configured with a number of parameters, and poor parameter choices can degrade the quality of search results or cause inefficiencies.

Missing enterprise features: Many vector databases lag seriously behind in basic features, including ACID transactions, disaster recovery, RBAC, metadata filtering, database manageability and observability. These gaps can cause serious business problems, and closing them yourself takes a great deal of development work.


2. Full-text search database

Such databases include Elastic/Lucene, OpenSearch, and Solr.

Advantages

  • High scalability and performance, especially for unstructured text documents

  • Rich text retrieval features such as built-in foreign language support, customizable tokenizers, stemmers, stoplists and N-grams

  • Mostly based on open source libraries (Apache Lucene)

  • Mature and large ecosystem of integrations, including vector libraries

Disadvantages

  • Not optimized for vector search or similarity matching

  • Primarily designed for full-text search rather than semantic search, so applications built on them lack the full context needed for retrieval-augmented generation (RAG) and similar workloads. Enabling semantic search requires additional tools plus a host of custom scoring and relevance models.

  • Limited support for other data formats (images, audio, video)

  • Little or no GPU support

These databases are generally chosen when vector features are being added to an existing project, the data volume is small, and the main business will not be much affected. They are not recommended when a large project has to be re-architected.

3. Open source vector library

For many developers, open source vector libraries such as Faiss, Annoy and Hnswlib are a good starting point. Faiss is a library for similarity search and clustering of dense vectors. Annoy (Approximate Nearest Neighbors Oh Yeah) is a lightweight library for approximate nearest neighbor (ANN) search. Hnswlib is a library that implements the HNSW algorithm for ANN search.

Advantages

  • Fast nearest-neighbor search

  • Built for high-dimensional data

  • Supports ANN-oriented index structures, including inverted files, product quantization and random projections

  • Supports use cases for recommender systems, image search, and natural language processing

  • SIMD (Single Instruction, Multiple Data) and GPU support to speed up vector similarity search operations

Disadvantages

  • Maintenance and integration hassle

  • May sacrifice search accuracy compared to exact methods

  • Need to deploy and maintain yourself: You need to build and maintain complex infrastructure to provide sufficient CPU, GPU and memory resources for application requirements.

  • Limited or no support for metadata filtering, SQL, CRUD operations, transactions, high availability, disaster recovery, and backup and restore

They are called libraries (or packages) rather than databases because they provide only a small set of highly specialized functions. They are a good choice for getting started or building a simple demo, but using them directly in production is not recommended.
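To give a feel for what "building the rest yourself" means, the sketch below wraps a bare nearest-neighbor scan with the two most basic missing pieces: update/delete and metadata filtering. The class TinyVectorStore and its methods are hypothetical and written only for this illustration; the exact scan stands in for whatever library index you would actually call, and persistence, transactions, backup and high availability would still be left to you.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory wrapper: the glue code a vector library leaves to you."""

    def __init__(self):
        self._vectors = {}   # item id -> vector
        self._metadata = {}  # item id -> dict of attributes

    def upsert(self, item_id, vector, metadata=None):
        """Create or overwrite an item together with its metadata."""
        self._vectors[item_id] = list(vector)
        self._metadata[item_id] = dict(metadata or {})

    def delete(self, item_id):
        """Remove an item; silently ignore ids that are not present."""
        self._vectors.pop(item_id, None)
        self._metadata.pop(item_id, None)

    def search(self, query, k=3, where=None):
        """Return up to k ids nearest to the query whose metadata matches `where`."""
        where = where or {}
        hits = []
        for item_id, vec in self._vectors.items():
            meta = self._metadata[item_id]
            # Apply the metadata filter before scoring the vector.
            if all(meta.get(key) == value for key, value in where.items()):
                hits.append((math.dist(query, vec), item_id))
        hits.sort()
        return [item_id for _, item_id in hits[:k]]

store = TinyVectorStore()
store.upsert("a", [0.0, 0.0], {"lang": "en"})
store.upsert("b", [1.0, 1.0], {"lang": "zh"})
store.upsert("c", [0.1, 0.0], {"lang": "zh"})

unfiltered = store.search([0.0, 0.0], k=2)
filtered = store.search([0.0, 0.0], k=2, where={"lang": "zh"})
store.delete("c")
after_delete = store.search([0.0, 0.0], k=2, where={"lang": "zh"})
```

Even this toy version has to keep vectors and metadata in sync by hand, which hints at why the full list of missing features above is costly to close.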

4. NoSQL databases with vector support

These include NoSQL databases such as MongoDB, Cassandra, DataStax Astra, Cosmos DB and Rockset, key-value databases such as Redis, and other special-purpose databases such as Neo4j (a graph database).

Almost all of these databases gained vector capabilities only recently, through vector search extensions, so be sure to test them thoroughly if you use them.

Advantages

For specific data models, NoSQL databases offer high performance and scalability. Neo4j can be combined with LLMs for social network or knowledge graph applications. A vector-capable time-series database such as kdb might be able to combine vector data with financial market data.

Disadvantages

The vector capabilities of NoSQL databases are basic, nascent and largely untested. Many NoSQL databases added vector support only this year. For example:

In April, Rockset announced support for basic vector search.

In May, Cassandra announced plans to add vector search.

In May, Azure Cosmos DB announced vector search support for MongoDB vCore.

In June, DataStax and MongoDB announced vector search capabilities (both in preview).

Vector search performance on NoSQL databases can vary widely, depending on the supported vector functions, indexing methods and hardware acceleration. Moreover, NoSQL databases are not known for query efficiency, and adding vector functions will not make them fast.

My view has not changed: complex data belongs in a relational database. Using something like MongoDB as auxiliary storage is fine, but using it as the primary store and primary query engine is what so-called "full-stack" front-end developers do. Because they don't understand the trade-offs, they think everything is simple.

5. SQL databases with vector support

These are similar to the previous category, but they are relational databases at heart and support SQL queries. Examples include SingleStoreDB, PostgreSQL (through pgvector/Supabase Vector, in beta), ClickHouse and Kinetica.

Adding basic vector functionality to an established database is not a difficult task. For example, the vector database Chroma was built on top of ClickHouse.

Advantages

Contains vector search functions such as dot product, cosine similarity, Euclidean distance and Manhattan distance.

Find k-nearest neighbors using similarity scores.

Multi-model SQL databases provide hybrid queries and can combine vectors with other data for more meaningful results.

Most SQL databases can be deployed as a service, fully managed on the cloud.
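The hybrid-query idea can be sketched with Python's built-in sqlite3 module. This is only an illustration under loud assumptions: SQLite has no native vector support, so a Python function is registered under the made-up name cosine_similarity as a stand-in for the vector functions that the databases above expose, and embeddings are stored as JSON text. The table, columns and values are hypothetical.

```python
import json
import math
import sqlite3

def cosine_similarity(a_json, b_json):
    """Cosine similarity of two vectors stored as JSON arrays."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL (stand-in for a built-in vector function).
conn.create_function("cosine_similarity", 2, cosine_similarity)
conn.execute(
    "CREATE TABLE products (name TEXT, category TEXT, price REAL, embedding TEXT)"
)
rows = [
    ("red sneaker", "shoes", 59.0, [0.9, 0.1, 0.0]),
    ("blue sneaker", "shoes", 120.0, [0.85, 0.15, 0.0]),
    ("red scarf", "accessories", 25.0, [0.7, 0.0, 0.3]),
    ("toaster", "kitchen", 40.0, [0.0, 0.2, 0.9]),
]
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [(name, cat, price, json.dumps(emb)) for name, cat, price, emb in rows],
)

# Hybrid query: an ordinary relational filter plus ranking by vector similarity.
query_vec = json.dumps([1.0, 0.0, 0.0])
hits = conn.execute(
    """
    SELECT name, cosine_similarity(embedding, ?) AS score
    FROM products
    WHERE price < 100
    ORDER BY score DESC
    LIMIT 2
    """,
    (query_vec,),
).fetchall()
names = [name for name, _ in hits]
```

Here the price filter removes "blue sneaker" even though its embedding is close to the query, so the two rows returned are "red sneaker" and "red scarf". Combining a structured predicate with similarity ranking in a single statement is the main attraction of this category.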

Disadvantages

SQL databases are designed for structured data, whereas vectors usually represent unstructured data such as images, audio and text. While relational databases can often store text and blobs, most do not vectorize this unstructured data for machine learning.

Most SQL databases are not (yet) optimized for vector search. The indexing and query mechanisms of relational databases are designed mainly for structured data, not for high-dimensional vector data. Their performance on vector workloads may not be particularly good, although SQL databases that support vectors may add extensions or new features to improve vector search.

Traditional SQL databases do not scale out horizontally, and their performance degrades as the data grows. Using SQL databases to process large datasets of high-dimensional vectors may require additional optimizations, such as partitioning the data or using specialized indexing techniques, to maintain efficient query performance.

Summary

So how do you choose?

1. If you are just getting started or building a demo, you can use an open source vector library directly. Faiss, for example, can handle billion-scale data locally, but it cannot serve external clients on its own.

2. For a product, if you are developing new features and taking them live, you must separate the vector storage from the existing storage. Experienced teams can choose a pure vector database, or build on an open source vector library themselves if the requirements are simple, to keep the system stable.

3. If you must add vector functions to an existing system, for example storing and retrieving a large amount of vector data in Elastic or MongoDB, then test thoroughly and hope for the best. The problems you run into may be unknown to ChatGPT and missing from Stack Overflow as well.

4. Vector storage is still maturing and some features are incomplete, so stick to mature, stable releases and do not take risks in production.

Finally, let's talk about the architectural suggestions:

Microservice architecture is a style of software architecture in which an application is split into a set of small, independent services, each focused on solving a specific business problem or providing a specific function. This fine-grained division allows each microservice to be scaled, deployed and maintained independently as needed.

Vector search is no exception and should be independent as a separate service. If the services are independent, the storage should also be independent.
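One way to picture this separation is the sketch below. All class and method names are hypothetical, and the in-memory class merely stands in for a separately deployed vector service with its own storage. The business service keeps its own records and talks to vector search only through a narrow interface that returns ids.

```python
import math
from typing import List, Protocol, Sequence

class VectorSearch(Protocol):
    """The only contract the business service depends on."""
    def nearest_ids(self, query: Sequence[float], k: int) -> List[str]: ...

class InMemoryVectorSearch:
    """Stand-in for an independent vector service that owns the embeddings."""

    def __init__(self, vectors):
        self._vectors = dict(vectors)

    def nearest_ids(self, query, k):
        # Rank stored ids by Euclidean distance to the query.
        ranked = sorted(self._vectors,
                        key=lambda item_id: math.dist(query, self._vectors[item_id]))
        return ranked[:k]

class ProductService:
    """Business service: owns product records and never touches embeddings."""

    def __init__(self, products, search: VectorSearch):
        self._products = dict(products)
        self._search = search

    def similar_products(self, query, k=2):
        ids = self._search.nearest_ids(query, k)
        # Ignore ids the business store does not know about.
        return [self._products[i] for i in ids if i in self._products]

vector_service = InMemoryVectorSearch(
    {"p1": [0.0, 0.0], "p2": [1.0, 1.0], "p3": [0.1, 0.0]}
)
catalog = ProductService(
    {"p1": "kettle", "p2": "bicycle", "p3": "teapot"}, vector_service
)
similar = catalog.similar_products([0.0, 0.0], k=2)
```

Because the business service depends only on the interface, the vector backend can be swapped or scaled without touching the business storage.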

Of course, if you insist on putting vector storage and business data together, I have no objection. I am not the one who will have to solve the problems, so I will just watch the fun.

Editor: Yu Tengkai

Proofreading: Wang Yuqing


Origin blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/131862241