A Database Born for AI: Milvus Explained with Hands-On Practice

1 vector database

1.1 Origin of vector database

In today's digital age, artificial intelligence (AI) is rapidly changing the way we live and work. From smart assistants to self-driving cars, AI is becoming the innovation engine of every industry. However, the rise of AI also brings a key challenge: how to efficiently process and analyze increasingly rich and complex data. In this context, vector database technology has emerged, providing a powerful acceleration engine for AI.

Data challenges in the AI era: As the scope of AI applications expands, massive amounts of data pour into every industry. Images, text, audio and other data forms have become the inputs of AI. These data are multimodal, high-dimensional, complex and highly correlated. Traditional relational databases remain useful in some scenarios, but they cannot handle such multimodal, high-dimensional data well. A database technology better suited to the needs of AI applications is therefore required: the vector database.

An acceleration engine for AI: a vector database is a database specially designed for storing and retrieving vector data. Its core idea is to map data into a vector space, so that similarity computation, clustering, classification and retrieval become more efficient and accurate.

A vector database is a database used specifically to store and query vectors, where the stored vectors come from the vectorization of text, speech, images, video, etc. Compared with traditional databases, a vector database can not only perform basic CRUD operations (create, read, update, delete), but also perform fast similarity searches over vector data.

1.2 How Vector Databases Empower Large Models

Vector databases expand the boundaries of large models in two respects: the time boundary and the space boundary.

Expansion of the time boundary: vector databases give large language models (LLMs) "long-term memory".

Current large models (whether the GPT series in NLP or the ResNet series in CV) are pretrained models with a clear training cut-off date, so they know nothing about events after that date. By introducing a vector database that stores vectors of the latest information, the application boundary of the large model can be greatly extended: the model stays quasi-real-time, its applicability improves, and it can be adjusted dynamically. In this sense, vector databases give large models long-term memory.

Suppose a pretrained news summarization model finished training at the end of 2021; by 2023, many news events and trends have changed. To let the model handle this new information, a vector database can be used to store and query vectors of 2023 news articles.

In a recommendation system, a pretrained large model may not recognize the features of new users and new products. With a vector database, the feature vectors of users and products can be updated in real time, so that the model can make more accurate recommendations based on the latest information.

In addition, vector databases can support real-time monitoring and analysis. For example, in finance, a pretrained stock prediction model has no access to price data after its training cut-off; by storing the latest stock price vectors in a vector database, the model can analyze and predict price trends in near real time. Similarly, in customer service, a vector database allows a large model to trace a conversation back to its beginning.
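
To make the idea concrete, here is a minimal, framework-agnostic sketch of how retrieval over stored vectors can supply a model with post-cutoff knowledge. embed_text is a stand-in for a real embedding model, and the example documents are made up for illustration.

# "Long-term memory" sketch: store vectors for post-cutoff documents and
# retrieve the closest one as context for the large model.
import numpy as np

def embed_text(text: str) -> np.ndarray:
    # Placeholder embedding: a deterministic random vector per text within one run.
    # With a real embedding model, the retrieved article would be the semantically closest one.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=768)
    return vec / np.linalg.norm(vec)

# 1. Vectorize and store news published after the model's training cut-off.
news_2023 = [
    "Central bank raises interest rates in March 2023",
    "New electric-vehicle subsidy announced in 2023",
]
store = [(doc, embed_text(doc)) for doc in news_2023]

# 2. At question time, retrieve the most similar stored document ...
question = "What happened to interest rates in 2023?"
q_vec = embed_text(question)
best_doc = max(store, key=lambda item: float(item[1] @ q_vec))[0]

# 3. ... and hand it to the large model as extra context.
prompt = f"Context: {best_doc}\nQuestion: {question}"
print(prompt)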

Expansion of the space boundary: vector databases can help address privacy leakage, currently one of the issues enterprises worry about most when using large models.

The prompts given by users may contain sensitive information. According to media reports, one employee used ChatGPT to check a piece of code for bugs, and that source code was related to the measurement data of semiconductor equipment; another employee asked ChatGPT to optimize a piece of code directly related to yield and output recording equipment, entering the code into the service.

These behaviors directly led to the leakage of Samsung's key data. ChatGPT itself has also had privacy incidents in which a small portion of conversation history and payment data became visible to other users. Such data is extremely sensitive, and a locally deployed vector database can solve this problem to a large extent.

After a vector database is deployed locally, it can store a large amount of the enterprise's private data. With the large model also deployed locally or in a private cloud, the model can access the private data in the vector database through a dedicated agent under controlled conditions, so the large model can assist the company's business without exposing private data to the Internet.

1.3 Multimodal search with vector databases

Vector databases are inherently multimodal: through machine learning methods they can process and understand information from different sources and modalities, such as text, images, audio and video. The vectorization process exposes the hidden information inside these different modalities of data and thus supports multimodal applications.

A typical application scenario is multilingual search. A vector database supports cross-language information retrieval: users can search a book library in multiple languages such as English, French and Chinese, without translating the book titles in advance. This works because vector representations capture semantic similarity, allowing queries and content in different languages to be matched to each other.
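
As an illustration, the following sketch assumes the third-party sentence-transformers package and the multilingual model named below are available; any multilingual embedding model could be substituted.

# Cross-language matching sketch (assumes sentence-transformers is installed).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

titles = ["Pride and Prejudice", "Le Petit Prince", "The Three-Body Problem"]
query = "小王子"   # Chinese query for "The Little Prince"

title_vecs = model.encode(titles, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = title_vecs @ query_vec            # cosine similarity (vectors are normalized)
print(titles[int(np.argmax(scores))])      # expected to match "Le Petit Prince"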

1.4 Vector database rankings

2 Milvus database introduction

2.1 Overview of Milvus

Milvus is a cloud-native vector database that is highly available, high-performance and easy to scale, designed for real-time recall over massive vector data.

Milvus official website: https://milvus.io/

Milvus is built on top of vector search libraries such as FAISS, Annoy and HNSW, and its core goal is dense-vector similarity retrieval. On top of these libraries, Milvus adds features such as data partitioning, data persistence, incremental data ingestion, scalar-vector hybrid queries and time travel, and greatly optimizes vector retrieval performance to meet the requirements of any vector retrieval scenario. In general, users are advised to deploy Milvus with Kubernetes for the best availability and elasticity.

Milvus adopts a shared-storage architecture in which storage and computing are completely separated, and compute nodes support horizontal scaling. Architecturally, Milvus separates the data flow from the control flow and is divided into four layers: the access layer, the coordinator service, the worker nodes and the storage layer. Each layer is independent of the others and can be scaled and recovered independently.

The Milvus vector database helps users easily handle retrieval over massive unstructured data (images/video/audio/text). A single-node Milvus can complete a billion-scale vector search within seconds, and the distributed architecture meets users' horizontal scaling requirements.

The characteristics of Milvus are summarized as follows:

  • High performance: performs vector similarity retrieval over massive datasets with high performance.
  • High availability and reliability: Milvus supports scaling out on the cloud, and its disaster-recovery capability ensures high service availability.
  • Hybrid query: Milvus supports scalar field filtering during vector similarity retrieval, enabling hybrid queries.
  • Developer friendly: a Milvus ecosystem that supports multiple languages and tools.

2.2 Milvus key concepts

Unstructured data: Unstructured data has an irregular structure, no unified predefined data model, and cannot be conveniently represented by the two-dimensional tables of a relational database. Unstructured data includes pictures, video, audio, natural language, etc., and accounts for about 80% of all data. Unstructured data can be processed by converting it into vector data using artificial intelligence (AI) or machine learning (ML) models.

Feature vector: A vector, also known as an embedding vector, is a continuous vector transformed from discrete variables (unstructured data such as pictures, video, audio and natural language) by embedding techniques. Mathematically, a vector is an n-dimensional array of floating-point numbers or binary values.

Through modern vector transformation techniques, such as AI or ML models, unstructured data can be abstracted into vectors in an n-dimensional feature-vector space. Approximate nearest neighbor (ANN) search can then be used to compute the similarity between pieces of unstructured data.

Vector similarity retrieval: Similarity retrieval compares the target object with the data in the database and recalls the most similar results. Likewise, vector similarity retrieval returns the most similar vectors. Approximate nearest neighbor (ANN) algorithms speed up the computation of distances between vectors and thus the retrieval itself. If two vectors are very similar, the source data they represent is also very similar.

Collection: Contains a set of entities and can be thought of as a table in a relational database management system (RDBMS).

Entity: Contains a set of fields. A field corresponds to the actual object: it can be structured data representing an object property, or a vector representing the object's characteristics. The primary key is a unique value used to refer to an entity. Note: you can customize the primary key, otherwise Milvus generates it automatically. Milvus currently does not deduplicate primary keys, so entities with the same primary key may appear in a collection.

Field: A component of an entity. A field can be structured data, such as a number or a string, or a vector. Note: Milvus 2.0 supports scalar field filtering, and a collection can have only one primary key field.

The correspondence between Milvus and a relational database is as follows:

Milvus           Relational database
Collection       Table
Entity           Row
Field            Column
Primary key      Primary key

Partition: A partition is a subdivision of a collection. Milvus supports dividing a collection's data into multiple parts in physical storage; this process is called partitioning, and each partition can contain multiple segments.

Segment: Milvus automatically creates data files (segments) by merging inserted data. A collection can contain multiple segments, and a segment can contain multiple entities. During a search, Milvus searches each segment and returns the merged results.

Sharding: Sharding distributes write operations across different nodes so that Milvus can fully use the cluster's parallel computing power for writes. By default a single collection has 2 shards. Milvus currently shards by hashing the primary key, and plans to support more flexible methods such as random and custom sharding in the future. Note: partitioning reduces the amount of data read by narrowing the search, while sharding parallelizes write operations across multiple machines.
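
For illustration, here is a hedged pymilvus sketch of working with partitions. It assumes a connection is already established and a collection named word_vector exists (it is created later in section 3); the partition name "news_2023" is made up for the example.

# Partition sketch (assumes an existing connection and collection).
from pymilvus import Collection

collection = Collection("word_vector")

# Create a partition and (optionally) write into it explicitly.
collection.create_partition("news_2023")
# collection.insert([...], partition_name="news_2023")

# Searching only selected partitions reduces the amount of data scanned:
# collection.search(data=[...], anns_field="embeding", param={...},
#                   limit=10, partition_names=["news_2023"])

print(collection.partitions)   # includes the default "_default" partition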

Index: An index is built from the original data and speeds up searches over collection data. Milvus supports multiple index types, and an index type can be specified per vector field to improve query performance. Currently a vector field supports only one index type; when the index type is switched, Milvus automatically deletes the previous index. A similarity search engine compares an input object with the objects in a database to find those most similar to the input. Indexing is the process of organizing data efficiently; it greatly speeds up queries over large datasets and plays an important role in implementing similarity search. After a large vector dataset is indexed, queries can be routed to the clusters (subsets of the data) most likely to contain vectors similar to the query. In practice this trades some accuracy for faster queries on truly large-scale vector datasets.

PChannel: A PChannel is a physical channel. Each PChannel corresponds to a log storage topic. By default, 256 PChannels are allocated when the Milvus cluster starts to store the logs that record data insertions, deletions and updates.

VChannel: VChannel represents a logical channel (virtual channel). Each collection will be allocated a set of VChannels for recording data insertions, deletions, and updates. VChannels are logically separate but physically share resources.

Binlog: A binlog is a binary log, a smaller unit within a segment, that records and processes updates and changes of data in the Milvus vector database. The data of a segment is stored in multiple binlogs. Milvus has three types of binlogs: InsertBinlog, DeleteBinlog and DDLBinlog.

Log broker: The log broker is a publish-subscribe system that supports replay. It is responsible for persisting streamed data, reliable asynchronous query execution, event notification and returning query results. It also ensures the integrity of incremental data when worker nodes recover from crashes.

Log subscriber: The log subscriber updates the local data by subscribing to the log sequence, and provides the service in the form of a read-only copy.

Log sequence (Log sequence): The log sequence records all operations that change the state of the collection in Milvus.

Normalization: Normalization transforms an embedding (vector) so that its norm equals 1. If inner product (IP) is used to compute embedding similarity, all embeddings must be normalized; after normalization, the inner product equals cosine similarity.
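
A quick numerical check of this statement, in plain NumPy and independent of Milvus:

# After normalization, inner product equals cosine similarity.
import numpy as np

a, b = np.random.rand(768), np.random.rand(768)
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)   # norms are now 1

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner_product = np.dot(a_n, b_n)

print(np.isclose(cosine, inner_product))   # True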

2.3 Milvus architecture

Milvus documentation: https://milvus.io/docs

The whole system is divided into four levels:

  • Access layer: the facade of the system, consisting of a group of stateless proxies. It provides the endpoints that users connect to and is responsible for validating client requests and merging the returned results.
  • Coordinator service: the brain of the system, responsible for assigning tasks to the worker nodes. The coordinator service has four roles: root coord, data coord, query coord and index coord.
  • Worker node: the limbs of the system, responsible for executing the instructions issued by the coordinator service and the data manipulation language (DML) commands initiated by the proxy. Worker nodes are divided into three roles: data node, query node and index node.
  • Storage: the skeleton of the system, responsible for persisting Milvus data. It consists of three parts: metadata storage (meta store), message storage (log broker) and object storage.

Each layer is independent of the others and can be scaled and recovered independently.

2.3.1 Access layer

The access layer consists of a group of stateless proxies, the facade of the whole system that provides the endpoints for user connections. The access layer is responsible for validating client requests and aggregating the returned results.

  • The proxy itself is stateless; it generally provides a unified access address and serves requests through load-balancing components (Nginx, Kubernetes Ingress, NodePort, LVS).
  • Because Milvus adopts a massively parallel processing (MPP) architecture, the proxy performs global aggregation and post-processing on the intermediate results returned by the worker nodes before returning them to the client.

2.3.2 Coordinator service

The coordinator service is the brain of the system, responsible for assigning tasks to the worker nodes. Its duties include cluster topology and node management, load balancing, timestamp generation, data declaration and data management.

The Coordination Service has four roles:

  • Root coordinator (root coord): handles data definition language (DDL) and data control language (DCL) requests, such as creating or deleting collections, partitions and indexes, and is responsible for maintaining the central timestamp oracle (TSO) service and advancing the time tick window.
  • Query coordinator (query coord): manages the topology and load balancing of query nodes and the handover from growing segments to sealed segments. A segment on a query node has only two states, growing and sealed, corresponding to incremental data and historical data respectively.
  • Data coordinator (data coord): manages the topology of data nodes, maintains data metadata, and triggers background data operations such as flush and compact.
  • Index coordinator (index coord): manages the topology of index nodes, builds indexes and maintains index metadata.

2.3.3 Worker nodes

Worker nodes (execution nodes) are the limbs of the system, responsible for executing the instructions issued by the coordinator service and the DML commands initiated by the proxy. Because storage and computing are separated, worker nodes are stateless and, together with Kubernetes, can quickly scale out or in and recover from failures.

Worker nodes are divided into three roles:

  • Query node: obtains incremental log data by subscribing to the message storage (log broker) and converts it into growing segments, loads historical data from object storage, and provides hybrid scalar + vector search and query.
  • Data node: obtains incremental log data by subscribing to the message storage, processes change requests, and packs log data into snapshots persisted in object storage.
  • Index node: performs index-building tasks. Index nodes do not need to reside in memory and can be implemented in a serverless fashion.

2.3.4 Storage service

The storage service is the skeleton of the system, responsible for the persistence of Milvus data, and is divided into three parts: metadata storage (meta store), message storage (log broker) and object storage (object storage).

Metadata storage: stores snapshots of metadata, such as collection schemas, node status and checkpoints for message consumption. Metadata storage requires extremely high availability, strong consistency and transaction support, so etcd is the best choice for this scenario. etcd also handles service registration and health checks.

Object storage: stores log snapshot files, scalar/vector index files and intermediate query results. Milvus uses MinIO as object storage and also supports deployment on AWS S3 and Azure Blob, two widely used low-cost storage services. However, because object storage has high access latency and is billed per request, Milvus plans to support memory- or SSD-based cache pools and to separate hot and cold data to improve performance and reduce cost.

Message storage: message storage is a publish-subscribe system that supports replay. It persists streamed data and supports reliable asynchronous query execution, event notification and result returning. When a node recovers from downtime, the integrity of incremental data is guaranteed by replaying the message store.

Currently, the distributed version of Milvus relies on Pulsar for message storage, and the stand-alone version of Milvus relies on RocksDB for message storage. The message storage can also be replaced by streaming storage such as Kafka and Pravega.

Milvus is designed around logs and follows the principle that the log is the data: in version 2.0 no physical tables are maintained, and data reliability is guaranteed through log persistence and log snapshots.

As the backbone of the system, the log system handles data persistence and decoupling. Through the publish-subscribe mechanism of logs, Milvus decouples the system's read and write components. In the most simplified model, the system consists of two roles: the message store (log broker), which maintains the "log sequence", and the "log subscriber". The log sequence records all operations that change the state of a collection, and log subscribers update their local data by subscribing to the log sequence, serving as read-only replicas. The publish-subscribe mechanism also leaves room for extensions such as change data capture (CDC) and fully distributed deployment.

2.4 Main components of Milvus

Milvus supports two deployment modes: standalone and cluster (distributed). The two modes have exactly the same capabilities, and users can choose according to data size, traffic and other factors. A Milvus deployed in standalone mode cannot currently be upgraded online to cluster mode.

2.4.1 Standalone Milvus

The stand-alone version of Milvus includes three components:

  • Milvus is responsible for providing the core functionality of the system.
  • etcd is a metadata engine for managing metadata access and storage of Milvus internal components, such as proxy, index node, etc.
  • MinIO is a storage engine responsible for maintaining Milvus data persistence.

 

2.4.2 Distributed Milvus

The distributed version of Milvus consists of eight microservice components and three third-party dependencies, and each microservice component can be deployed independently using Kubernetes.

Microservice components

  • Root coord
  • Proxy
  • Query coord
  • Query node
  • Index coord
  • Index node
  • Data coord
  • Data node

Third-party dependencies

  • etcd is responsible for storing the metadata information of each component in the cluster.
  • MinIO is responsible for handling the data persistence of large files in the cluster, such as index files and full binary log files.
  • Pulsar is responsible for managing the logs of recent change operations, outputting stream logs and providing log subscription services.

 

2.5 Milvus application scenarios

With the Milvus vector database, you can quickly build a vector similarity retrieval system that meets the needs of your own scenario. Typical Milvus use cases include:

  • Image retrieval system: search for images by image, and instantly return the image most similar to the uploaded image from the massive database.
  • Video retrieval system: Convert video key frames into vectors and insert them into Milvus to retrieve similar videos or make real-time video recommendations.
  • Audio retrieval system: quickly retrieve massive audio data such as speeches, music, and sound effects, and return similar audio.
  • Molecular formula retrieval system: ultra-high-speed retrieval of similar chemical molecular structures, superstructures, and substructures.
  • Recommendation system: recommend relevant information or products based on user behavior and needs.
  • Intelligent question-and-answer robot: The interactive intelligent question-and-answer robot can automatically answer questions for users.
  • DNA sequence classification system: By comparing similar DNA sequences, genes can be accurately classified in milliseconds.
  • Text search engine: Help users search for the desired information from the text database through keywords.

3 Milvus deployment and use

3.1 Milvus installation

# Download the docker-compose file for Milvus standalone (v2.2.13)
wget https://github.com/milvus-io/milvus/releases/download/v2.2.13/milvus-standalone-docker-compose.yml -O docker-compose.yml

# Start the services in the background
sudo docker-compose up -d

# Check that the containers are running
sudo docker-compose ps

The output of the last command should look like this:

      Name                     Command                  State                            Ports
--------------------------------------------------------------------------------------------------------------------
milvus-etcd         etcd -advertise-client-url ...   Up             2379/tcp, 2380/tcp
milvus-minio        /usr/bin/docker-entrypoint ...   Up (healthy)   9000/tcp
milvus-standalone   /tini -- milvus run standalone   Up             0.0.0.0:19530->19530/tcp, 0.0.0.0:9091->9091/tcp

 Verify connection:

docker port milvus-standalone 19530/tcp
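
If pymilvus is already installed (see section 3.3), the connection can also be verified from Python. The snippet below assumes Milvus is running on the local machine; adjust the host otherwise.

# Optional connectivity check from Python (requires pymilvus).
from pymilvus import connections, utility

connections.connect(host="127.0.0.1", port=19530)
print(utility.get_server_version())   # prints the Milvus server version if reachable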

Stop Milvus

sudo docker-compose down

To delete the data after stopping Milvus:

sudo rm -rf  volumes

3.2 Milvus visualization tool Attu

Attu project address: https://github.com/zilliztech/attu

Correspondence between Milvus and Attu:

Milvus version    Recommended Attu image version
v2.0.x            v2.0.5
v2.1.x            v2.1.5
v2.2.x            v2.2.6

Run the following command:

docker run -p 8000:3000  -e MILVUS_URL={your machine IP}:19530 zilliz/attu:v2.2.6

After the container starts, visit "http://{your machine IP}:8000" in the browser and click "Connect" to enter the Attu service (a username and password can be provided if Milvus authentication is enabled).

After connecting, the interface looks as follows:

 

3.3 Using Milvus with Python

Install pymilvus

pip install pymilvus==2.2.15

3.3.1 Create a database

from pymilvus import connections, db

conn = connections.connect(host="192.168.1.156", port=19530)
database = db.create_database("sample_db")

Switch to the new database and list all databases:

db.using_database("sample_db")
db.list_database()

3.3.2 Create a collection

from pymilvus import CollectionSchema, FieldSchema, DataType
from pymilvus import Collection, db, connections

conn = connections.connect(host="192.168.1.156", port=19530)
db.using_database("sample_db")

m_id = FieldSchema(name="m_id", dtype=DataType.INT64, is_primary=True,)
embeding = FieldSchema(name="embeding", dtype=DataType.FLOAT_VECTOR, dim=768,)
count = FieldSchema(name="count", dtype=DataType.INT64,)
desc = FieldSchema(name="desc", dtype=DataType.VARCHAR, max_length=256,)
schema = CollectionSchema(
  fields=[m_id, embeding, desc, count],
  description="Test embeding search",
  enable_dynamic_field=True
)

collection_name = "word_vector"
collection = Collection(name=collection_name, schema=schema, using='default', shards_num=2)
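
Besides viewing it in Attu, the newly created collection can be checked directly from Python (this assumes the connection from the snippet above is still active):

from pymilvus import utility

print(utility.has_collection("word_vector"))   # True
print(utility.list_collections())              # contains "word_vector"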

View the creation result through Attu:

3.3.3 Create an index

from pymilvus import Collection, utility, connections, db

conn = connections.connect(host="192.168.1.156", port=19530)
db.using_database("sample_db")

index_params = {
  "metric_type": "IP",
  "index_type": "IVF_FLAT",
  "params": {"nlist": 1024}
}

collection = Collection("word_vector")
collection.create_index(
  field_name="embeding",
  index_params=index_params
)

utility.index_building_progress("word_vector")

 View the results via Attu:

Index types:

  • FLAT: high accuracy, suitable for small datasets; essentially a brute-force search.
  • IVF_FLAT: inverted-file index; balances accuracy and speed.
  • IVF: the inverted file first clusters the points in the space; at query time it compares distances to the cluster centers and then searches the nearest clusters for the closest N points.
  • IVF_SQ8: quantization-based; disk, CPU and GPU friendly.
  • SQ8: scalar quantization of the vectors, converting the floating-point representation into an integer representation (4 bytes -> 1 byte).
  • IVF_PQ: fast, but with lower accuracy. Each vector is split into m sub-vectors and each sub-vector is clustered; at query time the query vector is split the same way, the distance to the cluster centers is computed per sub-vector, and the per-segment distances are summed into the final distance. Using the symmetric distance (the distance between cluster centers) avoids extra computation via a lookup table, but with larger error.
  • HNSW: graph-based index with efficient search; builds a multi-layer NSW graph (see the sketch after this list).
  • ANNOY: tree-based index, high recall.
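
For comparison with the IVF_FLAT index built above, here is a sketch of the parameters for a graph-based HNSW index; the parameter values are illustrative, not tuned recommendations.

# Illustrative alternative to IVF_FLAT: a graph-based HNSW index.
hnsw_index_params = {
  "metric_type": "IP",
  "index_type": "HNSW",
  "params": {"M": 8, "efConstruction": 64}   # graph degree / build-time candidate list size
}
# collection.create_index(field_name="embeding", index_params=hnsw_index_params)

# At search time HNSW takes an "ef" parameter instead of IVF's "nprobe", e.g.:
# search_params = {"metric_type": "IP", "params": {"ef": 64}}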

3.3.4 Insert data

from pymilvus import Collection, db, connections
import numpy as np

conn = connections.connect(host="192.168.1.156", port=19530)
db.using_database("sample_db")
coll_name = 'word_vector'

mids, embedings, counts, descs = [], [], [], []
data_num = 100
for idx in range(0, data_num):
    mids.append(idx)
    embedings.append(np.random.normal(0, 0.1, 768).tolist())
    descs.append(f'random num {idx}')
    counts.append(idx)

collection = Collection(coll_name)
mr = collection.insert([mids, embedings, descs, counts])
print(mr)

The output is:

(insert count: 100, delete count: 0, upsert count: 0, timestamp: 443639998144839682, success count: 100, err count: 0)
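
Inserted data is buffered before being sealed into segments. To make sure it is persisted immediately, a flush can be triggered explicitly; this is an optional step (Milvus also flushes automatically in the background):

collection.flush()                 # seal the buffered data into segments
print(collection.num_entities)     # expected to report 100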

View via Attu:

3.3.5 Search data

from pymilvus import Collection, db, connections
import numpy as np

conn = connections.connect(host="192.168.1.156", port=19530)
db.using_database("sample_db")
coll_name = 'word_vector'

search_params = {
    "metric_type": 'IP',
    "offset": 0,
    "ignore_growing": False,
    "params": {"nprobe": 16}
}

collection = Collection(coll_name)
collection.load()

results = collection.search(
    data=[np.random.normal(0, 0.1, 768).tolist()],
    anns_field="embeding",
    param=search_params,
    limit=16,
    expr=None,
    # output_fields=['m_id', 'embeding', 'desc', 'count'],
    output_fields=['m_id', 'desc', 'count'],
    consistency_level="Strong"
)
collection.release()
print(results[0].ids)
print(results[0].distances)
hit = results[0][0]
print(hit.entity.get('desc'))
print(results)

The output is as follows:

[0, 93, 77, 61, 64, 79, 22, 43, 25, 35, 83, 49, 51, 84, 75, 36]
[0.7047597169876099, 0.5948767066001892, 0.54373699426651, 0.5294350981712341, 0.5216281414031982, 0.5035749673843384, 0.41662347316741943, 0.4026581346988678, 0.40143388509750366, 0.3841533362865448, 0.371593713760376, 0.35352253913879395, 0.3377170264720917, 0.33591681718826294, 0.32786160707473755, 0.3214406967163086]
random num 0
['["id: 0, distance: 0.7047597169876099, entity: {\'m_id\': 0, \'desc\': \'random num 0\', \'count\': 0}", "id: 93, distance: 0.5948767066001892, entity: {\'m_id\': 93, \'desc\': \'random num 93\', \'count\': 93}", "id: 77, distance: 0.54373699426651, entity: {\'m_id\': 77, \'desc\': \'random num 77\', \'count\': 77}", "id: 61, distance: 0.5294350981712341, entity: {\'m_id\': 61, \'desc\': \'random num 61\', \'count\': 61}", "id: 64, distance: 0.5216281414031982, entity: {\'m_id\': 64, \'desc\': \'random num 64\', \'count\': 64}", "id: 79, distance: 0.5035749673843384, entity: {\'m_id\': 79, \'desc\': \'random num 79\', \'count\': 79}", "id: 22, distance: 0.41662347316741943, entity: {\'m_id\': 22, \'desc\': \'random num 22\', \'count\': 22}", "id: 43, distance: 0.4026581346988678, entity: {\'m_id\': 43, \'desc\': \'random num 43\', \'count\': 43}", "id: 25, distance: 0.40143388509750366, entity: {\'m_id\': 25, \'desc\': \'random num 25\', \'count\': 25}", "id: 35, distance: 0.3841533362865448, entity: {\'m_id\': 35, \'desc\': \'random num 35\', \'count\': 35}"]']

3.3.6 Delete data

from pymilvus import Collection, db, connections

conn = connections.connect(host="192.168.1.156", port=19530)
db.using_database("sample_db")
coll_name = 'word_vector'

collection = Collection(coll_name)

ids = [str(idx) for idx in range(10)]
temp_str = ', '.join(ids)
query_expr = f'm_id in [{temp_str}]'
result = collection.delete(query_expr)

print(result)

The output is:

(insert count: 0, delete count: 10, upsert count: 0, timestamp: 443640854673883146, success count: 0, err count: 0)

To improve retrieval performance, Milvus introduces a bitset. When data is deleted, Milvus performs a soft delete: soft-deleted vectors still exist in the database but are not included in vector similarity searches or queries. Each bit in the bitset corresponds to an indexed vector; if a vector's bit is set to 1, the vector is considered soft-deleted and is skipped during vector searches.
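
The following toy snippet illustrates the bitset idea only conceptually (it is not how Milvus is implemented internally): deleted vectors remain in storage but are masked out of the similarity computation.

# Conceptual bitset illustration, not Milvus internals.
import numpy as np

vectors = np.random.rand(5, 4)                     # 5 stored vectors
deleted = np.array([0, 1, 0, 0, 1], dtype=bool)    # bit set to 1 => soft-deleted

query = np.random.rand(4)
scores = vectors @ query
scores[deleted] = -np.inf                          # deleted vectors can never be returned
print(int(np.argmax(scores)))                      # best match among live vectors only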

4 Summary

Vector database technology is an innovation born for AI. It leverages the advantages of vector representations to provide an efficient solution for storing, retrieving and analyzing multimodal, high-dimensional and complex data. As AI applications continue to develop, vector databases will become one of the important tools accelerating AI innovation, bringing more efficient and intelligent solutions to every industry. In this AI-driven era, vector databases will undoubtedly continue to play an important role.
