An article explaining dynamic Schema in detail

Author of this article: Zilliz developer evangelist Yujian Tang; Zilliz chief engineer Cao Zhenshan

Schemas are common in databases, but dynamic Schemas are not.

For example, SQL databases have predefined Schemas. The user defines the Schema when creating a table, and it does not change on its own as data arrives; modifying it later requires an explicit migration. The role of the Schema is to tell the database the table structure the user wants and to ensure that each row of data conforms to it. NoSQL databases usually support dynamic Schema or do not need a Schema at all (that is, there is no need to define properties for each object when creating the database).
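The contrast can be sketched with Python's standard library alone. This is a minimal illustration, not Milvus code: the `books` table and the in-memory list of JSON documents are assumptions made for the example, with SQLite standing in for a SQL database and plain JSON strings standing in for a NoSQL store.

```python
import json
import sqlite3

# A fixed (SQL-style) Schema: the columns are declared when the table is created.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title VARCHAR(200))")
conn.execute("INSERT INTO books (id, title) VALUES (?, ?)", (1, "Lord of the Flies"))

# Inserting into a column that the Schema does not declare is rejected.
try:
    conn.execute(
        "INSERT INTO books (id, title, author) VALUES (?, ?, ?)",
        (2, "The Great Gatsby", "F. Scott Fitzgerald"),
    )
    fixed_schema_error = None
except sqlite3.OperationalError as exc:
    fixed_schema_error = str(exc)

# A schemaless (NoSQL-style) store: each document carries its own set of keys.
documents = [
    json.dumps({"id": 1, "title": "Lord of the Flies"}),
    json.dumps({"id": 2, "title": "The Great Gatsby", "author": "F. Scott Fitzgerald"}),
]
document_keys = [sorted(json.loads(doc)) for doc in documents]

print(fixed_schema_error)
print(document_keys)
```

The SQL insert fails because `author` was never declared, while the document store accepts the extra key without any change to a Schema.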

In the Milvus community, dynamic Schema support was one of the most requested features. To better meet user needs, Milvus released this feature in 2.2.9, so that the database Schema can "dynamically change" as users add data. Since then, users no longer need to strictly follow the predefined Schema when inserting data, and can add data in JSON format just like in NoSQL databases.

However, we found that many users still have questions about what dynamic Schema does in a vector database and about the pros and cons of using it. This article answers them one by one.

01. What is a database Schema?

What is a database schema? Let’s take an example:

A Schema defines how data is inserted into and stored in the database. The above figure shows how to create a standard Schema for a relational database.

The database in the picture above has 4 tables in total, and each table has its own Schema. The table in the middle of the picture has 4 columns of data, and the other three tables have 2 columns each.

In addition, we also need to define the data type in the Schema. The "Employee", "Title" and "DeptName" columns will all be strings (i.e. VARCHAR), "CourseID" is also a string, the "EmpID" and "DeptID" column data are integers, and the "Date" column data type can be Date or VARCHAR.

02. What is a vector database Schema?

Let's continue with the example from our previous article, "How to build a chatbot using LlamaIndex?". The following figure shows one piece of data in a Zilliz Cloud instance:

If the Schema is defined in a traditional database, we need to create an 11-column Schema for this piece of data.

Among them, the data types of the six columns "id", "paragraph", "subtitle", "publication", "article_url" and "title" are VARCHAR; the data types of the three columns "reading_time", "responses" and "claps" are Integer (INT); the data type of "date" column is date (DATE); the data type of the remaining last column "embedding" is floating point vector (FLOAT_VECTOR), which is used to store Embedding vector data.
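As a rough illustration of what such a fixed 11-column Schema implies, the sketch below maps each column listed above to a Python type and checks a row against it. It is plain Python rather than any database API; `ARTICLE_SCHEMA`, `validate_row` and the placeholder values in `sample_row` are hypothetical names and data introduced only for this example.

```python
from datetime import date

# Column name -> expected Python type, mirroring the 11 columns listed above.
ARTICLE_SCHEMA = {
    "id": str, "paragraph": str, "subtitle": str, "publication": str,
    "article_url": str, "title": str,          # VARCHAR columns
    "reading_time": int, "responses": int, "claps": int,  # INT columns
    "date": date,                               # DATE column
    "embedding": list,                          # FLOAT_VECTOR: a list of floats
}

def validate_row(row):
    """Return a list of problems; an empty list means the row fits the Schema."""
    problems = []
    for name in row:
        if name not in ARTICLE_SCHEMA:
            problems.append(f"unknown column: {name}")
    for name, expected in ARTICLE_SCHEMA.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"wrong type for {name}")
    return problems

# Placeholder values, made up for the example.
sample_row = {
    "id": "row-1", "paragraph": "text", "subtitle": "sub", "publication": "pub",
    "article_url": "https://example.com/post", "title": "A title",
    "reading_time": 5, "responses": 2, "claps": 100,
    "date": date(2023, 1, 1),
    "embedding": [0.1, 0.2, 0.3],
}

print(validate_row(sample_row))
print(validate_row({**sample_row, "author": "someone"}))
```

With a fixed Schema, the second row is flagged because `author` is not one of the declared columns. This is the restriction that dynamic Schema relaxes.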

How to use the Dynamic Schema function in Milvus vector database?

The following code snippet shows how to enable the dynamic Schema feature in Milvus, insert data into dynamic fields and perform filtered searches.

from pymilvus import (
   connections,
   FieldSchema, CollectionSchema, DataType,
   Collection,
)

DIMENSION = 8
COLLECTION_NAME = "books"

# Connect to a local Milvus instance.
connections.connect("default", host="localhost", port="19530")

# Static fields: primary key, book title and the vector column.
fields = [
   FieldSchema(name='id', dtype=DataType.INT64, is_primary=True),
   FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=200),
   FieldSchema(name='embeddings', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
# enable_dynamic_field=True turns on dynamic Schema for this Collection.
schema = CollectionSchema(fields=fields, enable_dynamic_field=True)
collection = Collection(name=COLLECTION_NAME, schema=schema)

# Each row carries a different dynamic field: isbn, author or claps.
data_rows = [
   {"id": 1, "title": "Lord of the Flies", "embeddings": [0.64, 0.44, 0.13, 0.47, 0.74, 0.03, 0.32, 0.6], "isbn": "978-0399501487"},
   {"id": 2, "title": "The Great Gatsby", "embeddings": [0.9, 0.45, 0.18, 0.43, 0.4, 0.4, 0.7, 0.24], "author": "F. Scott Fitzgerald"},
   {"id": 3, "title": "The Catcher in the Rye", "embeddings": [0.43, 0.57, 0.43, 0.88, 0.84, 0.69, 0.27, 0.98], "claps": 100},
]
collection.insert(data_rows)

# Build an index on the vector column and load the Collection before searching.
collection.create_index("embeddings", {"index_type": "FLAT", "metric_type": "L2"})
collection.load()

vector_to_search = [0.57, 0.94, 0.19, 0.38, 0.32, 0.28, 0.61, 0.07]
# ANNS search filtered on a dynamic field (claps) and a static field (title).
result = collection.search(
   data=[vector_to_search], anns_field="embeddings",
   param={}, limit=3, expr="claps > 30 || title == 'The Great Gatsby'",
   output_fields=["title", "author", "claps", "isbn"], consistency_level="Strong")

for hits in result:
   for hit in hits:
      print(hit.to_dict())

In the created Collection "books", we defined a Schema that contains 3 fields: id, title and embeddings. id is the primary key column, a unique identifier for each row of data, and its data type is INT64. title represents the book title, and its data type is VARCHAR. embeddings is a vector column, and the vector dimension is 8. Note that the vector data in the code of this article is randomly set and is for demonstration purposes only.

schema = CollectionSchema(fields=fields, enable_dynamic_field=True)
collection = Collection(name=COLLECTION_NAME, schema=schema)

We enable dynamic Schema by passing a parameter to the CollectionSchema object when defining it. In short, just add enable_dynamic_field and set its value to True.

data_rows = [
   {"id": 1, "title": "Lord of the Flies","embeddings": [0.64, 0.44, 0.13, 0.47, 0.74, 0.03, 0.32, 0.6],"isbn": "978-0399501487"},
   {"id": 2, "title": "The Great Gatsby","embeddings": [0.9, 0.45, 0.18, 0.43, 0.4, 0.4, 0.7, 0.24],"author": "F. Scott Fitzgerald"},
   {"id": 3, "title": "The Catcher in the Rye","embeddings": [0.43, 0.57, 0.43, 0.88, 0.84, 0.69, 0.27, 0.98],"claps": 100},
]

In the above code, we have inserted 3 rows of data. The row with id=1 includes the dynamic field isbn, the row with id=2 includes author, and the row with id=3 includes claps. These dynamic fields have different data types, including string types (isbn and author) and an integer type (claps).

result = collection.search(data=[vector_to_search],anns_field="embeddings",param={},limit=3,expr="claps > 30 || title =='The Great Gatsby'",output_fields=["title", "author", "claps", "isbn"],consistency_level="Strong")

In the above code, we run a filtered search. It combines ANNS (approximate nearest neighbor search) with scalar filtering on both static and dynamic fields, and retrieves the data that meets the conditions specified in the expr parameter. The output includes the title, author, claps and isbn fields. The expr parameter can filter on Schema fields (also called static fields) such as title, as well as on dynamic fields such as claps.

After running the code, the output is as follows:

{'id': 2, 'distance': 0.40939998626708984, 'entity': {'title': 'The Great Gatsby', 'author': 'F. Scott Fitzgerald'}}
{'id': 3, 'distance': 1.8463000059127808, 'entity': {'title': 'The Catcher in the Rye', 'claps': 100}}
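To see why only two of the three rows come back, the filter can be imitated in plain Python. This is only a sketch of the filter's logic and not how Milvus evaluates expr internally; the `matches` helper is hypothetical, and it assumes that a row without the dynamic field claps simply fails the `claps > 30` condition, which is consistent with the output above.

```python
# The three inserted rows, without their vectors.
rows = [
    {"id": 1, "title": "Lord of the Flies", "isbn": "978-0399501487"},
    {"id": 2, "title": "The Great Gatsby", "author": "F. Scott Fitzgerald"},
    {"id": 3, "title": "The Catcher in the Rye", "claps": 100},
]

def matches(row):
    """Imitate: claps > 30 || title == 'The Great Gatsby'."""
    # Assumption: a row that lacks the dynamic field "claps" fails "claps > 30".
    claps = row.get("claps")
    return (claps is not None and claps > 30) or row["title"] == "The Great Gatsby"

matching_ids = [row["id"] for row in rows if matches(row)]
print(matching_ids)  # [2, 3]
```

The row with id=1 has neither a claps value above 30 nor the title 'The Great Gatsby', so it is filtered out. The row with id=2 matches on the static field and the row with id=3 matches on the dynamic field.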

How does Milvus implement dynamic Schema functionality?

Milvus lets users add dynamic fields with different names and data types to each row of data by using a hidden metadata column. When a user creates a table and turns on dynamic fields, Milvus creates a hidden column named $meta in the table's Schema. JSON is a language-independent data format that is widely supported by modern programming languages, so this hidden dynamic column uses JSON as its data type.

Milvus organizes data in a columnar structure. During the data insertion process, the dynamic field data in each row of data is packaged into JSON data, and the JSON data in all rows together form a hidden dynamic column $meta.
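The packing step described above can be sketched in a few lines of Python. This is a simplified model of the idea, not Milvus's actual storage code: `STATIC_FIELDS` and `to_columns` are hypothetical names, and the vectors are shortened for readability.

```python
import json

# Fields declared in the Schema; everything else in a row is a dynamic field.
STATIC_FIELDS = ("id", "title", "embeddings")

def to_columns(rows):
    """Turn row-based data into columns plus one hidden JSON column, $meta."""
    columns = {name: [] for name in STATIC_FIELDS}
    columns["$meta"] = []
    for row in rows:
        for name in STATIC_FIELDS:
            columns[name].append(row[name])
        # Pack whatever is left over into a single JSON value for this row.
        dynamic = {k: v for k, v in row.items() if k not in STATIC_FIELDS}
        columns["$meta"].append(json.dumps(dynamic))
    return columns

rows = [
    {"id": 1, "title": "Lord of the Flies", "embeddings": [0.64, 0.44], "isbn": "978-0399501487"},
    {"id": 2, "title": "The Great Gatsby", "embeddings": [0.9, 0.45], "author": "F. Scott Fitzgerald"},
    {"id": 3, "title": "The Catcher in the Rye", "embeddings": [0.43, 0.57], "claps": 100},
]

columns = to_columns(rows)
for meta in columns["$meta"]:
    print(meta)
```

Each static field becomes an ordinary column, while the dynamic fields of every row end up as one JSON value in the $meta column, which is why rows can carry different dynamic fields without changing the Schema.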

03. The two sides of dynamic Schema

Of course, the function of dynamic Schema may not be suitable for all users. You can choose to turn on or off dynamic Schema according to your own scenarios and needs.

On the one hand, dynamic Schema is easy to set up and can be turned on without complicated configuration. It also adapts to changes in the data model at any time, so developers do not need to restructure or adjust their code.

On the other hand, using dynamic Schema for filtered search is much slower than fixed Schema; batch insertion on dynamic Schema is more complicated, and it is recommended that users use the row-based insertion interface to write dynamic field data.

Of course, to address the above challenges, Milvus has integrated a vectorized execution model to improve filtered search efficiency. The idea of vectorized execution is to stop calling an operator to process one row of data at a time, as the volcano model does, and to process a batch of data at a time instead. This computing mode also has better data locality during calculations, significantly improving overall system performance.
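The difference between the two execution styles can be shown with a toy filter. The sketch below is a conceptual illustration in plain Python, not Milvus's execution engine: the function names, the batch size and the claps values are made up for the example, and the only thing it measures is how many times the operator is invoked.

```python
def volcano_filter(values, threshold):
    """Row-at-a-time (volcano style): the operator is invoked once per row."""
    calls = 0
    kept = []
    for value in values:
        calls += 1  # one operator call for a single row
        if value > threshold:
            kept.append(value)
    return kept, calls

def vectorized_filter(values, threshold, batch_size=4):
    """Batch-at-a-time (vectorized style): the operator is invoked once per batch."""
    calls = 0
    kept = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        calls += 1  # one operator call for the whole batch
        kept.extend(v for v in batch if v > threshold)
    return kept, calls

# Hypothetical values of a "claps" column.
claps_column = [100, 12, 45, 7, 31, 88, 3, 60]

volcano_result, volcano_calls = volcano_filter(claps_column, 30)
vector_result, vector_calls = vectorized_filter(claps_column, 30)
print(volcano_result == vector_result, volcano_calls, vector_calls)
```

Both functions return the same rows, but the batch version makes far fewer operator calls and works on contiguous slices of the column, which is where the data locality benefit mentioned above comes from.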

04. Summary

By now, you should have a deeper understanding of how to use dynamic Schema in Milvus. Keep in mind that dynamic Schema has two sides. On the one hand, it is easy to set up and gives users flexibility. On the other hand, filtered search using dynamic Schema is slower than with a fixed Schema, and batch insertion with dynamic Schema is more complicated. Milvus uses a vectorized execution model to offset some of these disadvantages and improve overall system performance.

In the future, we will also enhance the scalar index capability in Milvus 2.4, accelerate filtering queries through inverted indexes of static and dynamic fields, and improve the performance and efficiency of dynamic Schema management and queries.


Origin my.oschina.net/u/4209276/blog/10584950