Transcend Workshop | Zilliz Partner and Technical Director Luan Xiaofan: "Decrypting Milvus, the AI-Native Database for the Era of Large Models"


In the Data 3.0 era, powered by large models plus data, it has become an inevitable trend for enterprises and developers to build applications with little or no code and to deliver services through natural language. Through the dialogue capabilities of large language models, we can directly mine and release the value of data, which greatly lowers the barrier for data users and opens up more possibilities.

In November, we held an offline community meetup themed "New Horizons for Data in the Era of Large Models", inviting experts from industry, academia, and research to focus on several major directions of exploration at the intersection of large models and data: vector databases, LLM + Data, LLM + SQL, and LLM + Tools. The speakers shared their technical explorations and practical experience and jointly discussed key issues and development trends. Over the next few issues, we will share the content in text form with members of the DB-GPT community for joint learning and discussion.



PART 1
Background


This article introduces Milvus, a database built for the era of large models. The Milvus project has been open source since 2019, and its four years of development fall into two phases: since 2021 it has carried the label Cloud Native, and since 2022 it has added a new label, AI Native.

Vector databases were born to solve the problem that unstructured data is hard to process. Unstructured data includes not only the long-form text we are discussing today, but also multi-modal information such as videos, images, and audio, which is why multi-modal large models are such a hot topic now.

Starting around 2019, the first scenarios vector databases entered were image search and text-image cross-modal search, so the initial goal was to solve the problem of how to retrieve unstructured data and how to do semantic retrieval. Of course, there were many open problems here, both on the model side and on the infra side.

You can see that with the development of models, including large models, the model-side problems now have some preliminary solutions. But on the infra side, if we are dealing with a very large amount of data, how can we quickly recall the required data from billions or tens of billions of images and texts to provide better capabilities for large models? Whether in the so-called long-term memory of RAG or Agents, or in the process of large model training and inference, there are similar demands. This was the original motivation behind vector databases.

Why combine vector retrieval with a database? If you think about how to process data, you will naturally think of using OceanBase, ODPS, and various other databases. But in the field of unstructured data, ever since AI became popular, what everyone has built are chimney-style, siloed applications: a single AI team trains the model on top and cleans the data underneath, and all data governance, the entire infra side, even the container runtime (e.g. Pouch), is maintained by the algorithm team itself. Large companies may do better, but as AI develops over the next 5 to 10 years there will inevitably be a layering process, so we hope to provide a really good infra layer on the AI data side.

What is a vector database? Simply put, a vector database is a system for managing and querying high-dimensional vectors. It has several notable features:

First, since it is a database, it must have the basic ability to add, delete, update, and query. Many of today's vector databases, and the previous generation of vector retrieval systems, have very weak update and delete capabilities; they still follow an offline import mode similar to traditional search, where a batch of data is generated every day and imported after training. From this point of view, they are not vector databases. Since it is a database, basic CRUD must be well supported, and there must be a basic schema definition and clearly supported data types. The schema can be dynamic, but as a database it should meet these basic requirements.

Second, obviously, a vector database must have strong vector retrieval capabilities.

Finally, the semantics of a vector database must be rich enough. Is a system that only does ANN a vector database? I think yes: the most basic operation of a vector database is nearest-neighbor matching. But in the course of its development, the vector database has evolved many new semantics, which is actually a very interesting topic. Operations available in traditional databases, such as join, group-by, and count, all have corresponding implementations in vector databases.

The entire process of using a vector database can be divided into three stages. First, how do you get vectors? Obtaining vectors relies on embedding models: OpenAI's Ada, as well as very good open-source models in China such as BGE and Alibaba's GTE, are all excellent embedding models. After an embedding is generated, it is inserted into the vector database. In the middle comes the second stage, index construction, which will be familiar from traditional search. Indexing is offline and consumes a lot of computing power, so one core challenge for vector databases is how to reduce the cost of building indexes; another is how to mine connections and semantics within the data during indexing so as to reduce the query load during online serving. The third stage is querying against the built index. Query also has many core challenges: how to handle streaming data, how to handle deletes, how to handle various complex semantics, and performance is also a very important challenge.
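To make the three stages concrete, here is a minimal sketch using pymilvus (an assumption on my part: pymilvus 2.4+ with Milvus Lite; `fake_embed` is a stand-in for a real embedding model such as Ada, BGE, or GTE, and all names and parameters are illustrative):

```python
import hashlib
import numpy as np
from pymilvus import MilvusClient

DIM = 8  # toy dimension; real embedding models (Ada, BGE, GTE) output 768+ dims

def fake_embed(text: str) -> list[float]:
    """Stage 1 stand-in: a deterministic pseudo-embedding instead of a real model."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).random(DIM)
    return (v / np.linalg.norm(v)).tolist()

client = MilvusClient("milvus_demo.db")  # Milvus Lite: local, file-backed
client.create_collection(collection_name="docs", dimension=DIM)

# Stages 1-2: embed documents and insert; Milvus builds the index for serving.
docs = ["vector databases manage embeddings", "HNSW is a graph index"]
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": fake_embed(t), "text": t} for i, t in enumerate(docs)],
)

# Stage 3: embed the query the same way and run an ANN search against the index.
hits = client.search(
    collection_name="docs",
    data=[fake_embed("which index uses graphs?")],
    limit=2,
    output_fields=["text"],
)
print(hits)
```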

The picture above is the architecture diagram of Milvus 1.0, our 2019 definition of what a vector database should look like. A simple summary: there is a write-ahead log, a file system to store files, a vector index engine such as Faiss/Annoy/HNSW, some meta and schema definitions, and some basic filtering operations; it supports simple deletes, though not with good performance, and it supports multiple SDKs. This is the earliest definition of a vector database, arguably the first in the world. We published a SIGMOD 2021 paper describing what a vector database is.

Going from the surface to the core: what is the heart of a vector database? It is the vector index. The vector index is the key to a vector database's ability to index and query data with high performance.

There are many different types of vector index implementations. The figure above lists four of the more mainstream ones. Tree-based indexes, the most typical being Annoy, and hash-based indexes are no longer mainstream in the industry due to performance and accuracy issues. The most mainstream is quantization-based FAISS. FAISS is a good implementation, but in the quantization space it now has a strong competitor, Google's ScaNN, which is also one of the better-performing open-source frameworks. The last category is graph-based indexes, the best known of which is HNSW. It is essentially a design similar to a skip list: the bottom layer is a complete graph, and several index graphs are built on top of it. When you search, much as in a skip list, you first find some neighbor nodes in the upper layers and then search down layer by layer until you find the final result at the bottom. There are many variations of graph indexes, such as Yahoo's open-source NGT and Microsoft's open-source disk-based graph solution. If you are interested, you can read the related papers; basically every project has a paper behind it.
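As a rough illustration of how these trade-offs surface in practice, this is how an index type is chosen in pymilvus (a hedged sketch assuming pymilvus 2.4+ and a collection created without an auto index; the numbers are illustrative, not tuned):

```python
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")

# Graph-based index (HNSW): a skip-list-like layered graph; fast, memory-hungry.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="HNSW",
    metric_type="L2",
    params={"M": 16, "efConstruction": 200},  # graph degree / build-time beam width
)
client.create_index(collection_name="docs", index_params=index_params)

# A quantization-based alternative in the FAISS/ScaNN family would instead be:
#   index_params.add_index(field_name="vector", index_type="IVF_PQ",
#                          metric_type="L2", params={"nlist": 1024, "m": 8})
```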

We need both a database and vector retrieval, so why not just use PGvector or Elasticsearch? The relationship between them is like that between a Tesla and a classic car. The main reasons for this analogy are the following:

First, there is a big difference between the distribution of vector data and that of scalar data. Traditional databases basically shard by hashing or by range, so that primary-key and secondary indexes can be used efficiently during queries. But in a vector database, vectors cannot serve as the sharding key; that is the most basic logic. Therefore, for most vector databases, a search needs to consult all shards, which looks much more like an OLAP workload. It is somewhat similar to an MPP architecture, though not exactly, so many OLTP databases are naturally unsuited to vector retrieval.

Second, one hundred percent accuracy does not matter. What does it mean to not be completely correct? The large model itself is not a very stable system and cannot give deterministic answers. Search for large models does not require a 100% correct answer; it requires an answer that is semantically similar, which is what everyone calls semantics. Why does this matter? Everyone knows that a system like OceanBase is used in transaction scenarios: it relies on stability and correctness, and no transaction can be wrong. In the field of vector retrieval, that constraint can be relaxed; it is not critical at all. As long as we can get a reasonably good answer, it is OK. This opens up a very large space for optimization behind it. If databases did not need to be 100% correct, you can imagine they could be optimized 10x or 100x. The concept of AI for DB that people talked about in the past is very difficult to apply in traditional databases, but it works in vector databases, and works very well: using a model for auto-tuning is very effective, with very low loss. (A sketch after this list shows how this accuracy dial is exposed.)

Third, computing power requirements. Traditional database bottlenecks may be on IO or the network, and for some databases on the CPU; but for most databases, especially OLTP databases, the bottleneck is generally not the CPU. For vector databases, the bottleneck is the CPU, or more precisely the memory bandwidth, which means the challenges they face are completely different from those of traditional databases. Many techniques need to be tried, including optimizing bandwidth, optimizing caches, and heterogeneous computing. This is also one of the reasons why vector databases combine well with GPUs.

The last point is that semantic complexity will become higher and higher. In the past, vector databases were used for ANN, and Elasticsearch/PG can also do ANN, but vector databases are more than that.
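Returning to the second point: the "not 100% correct" property is exposed as an explicit dial in search parameters. A hedged sketch, reusing the `client` and `fake_embed` helper from the first sketch and assuming an HNSW index, where `ef` is the search-time beam width:

```python
query_vec = fake_embed("query text")  # helper from the earlier sketch

# Small ef: fast, lower recall -- often good enough, since exactness is not required.
fast = client.search(
    collection_name="docs",
    data=[query_vec],
    limit=10,
    search_params={"params": {"ef": 32}},
)

# Large ef: slower, higher recall -- for the cases that need tighter answers.
accurate = client.search(
    collection_name="docs",
    data=[query_vec],
    limit=10,
    search_params={"params": {"ef": 256}},
)
```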

You can see that there are many query patterns, such as filtering based on clustering. For example, suppose you search for photos of dogs, but the expressed intent is "I do not want photos of cats". If you put this type of query into Faiss alone, I find it hard to do; you can only label the cats, which goes back down the old path of traditional databases. But could we instead filter out the entire "cat" cluster? There is also a more interesting scenario called KNN join: given two tables, one containing male guests and one containing female guests, match the male and female guests to each other through vector similarity. These are actually some very interesting scenarios. In other words, vector databases can do many of the things traditional databases can do; the key is how to do them.
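A hedged sketch of the "dogs but not cats" query as a filtered vector search (the `photos` collection and its `label` field are hypothetical; Milvus evaluates the boolean expression alongside the ANN search):

```python
hits = client.search(
    collection_name="photos",                # hypothetical collection with a label field
    data=[fake_embed("a photo of a dog")],
    limit=10,
    filter='label != "cat"',                 # scalar predicate applied during the search
    output_fields=["label", "url"],
)
```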

The figure above lists some of the more popular vector databases, such as Milvus, Redis, PGvector, and Elasticsearch. How should you choose? As shown on the right side of the figure, I have set out some basic capability goals and some advanced capability goals. As a database, the basic capability goals must be met, and the advanced capabilities cannot be avoided if you want to run in real production. When choosing a vector database, you can build a comparison table using the indicators in the figure.


PART 2
When AI Native Meets Cloud Native


Why did we build a 2.0 product? The Milvus 1.0 architecture described above is a very naive database implementation. Starting in 2021, we decided to redo the database, with one label being cloud native and scalable.

Of course, to be cloud native there are some key points that cannot be ignored. First, how to integrate with cloud infrastructure: after separating storage and compute, streaming data is stored in a distributed write-ahead log. Second, and very importantly, as data volume grows, a single machine can no longer hold the user's data; this is an important reason why we had to build a distributed system. Third, how to integrate with the public cloud: by 2021, Kubernetes had become a very mature system, so the team kept thinking about how to use Kubernetes to run a stateless database well. Fourth, serverless is a very important point in AIGC scenarios, because most large models are API services, and the majority of developers do not want to maintain the underlying infrastructure themselves. Finally, aspiration: commercial factors aside, Zilliz hopes to build a top-notch database product, a truly distributed vector database, and that has indeed been achieved.

As shown above, these are some of the core capabilities Milvus 2.0 provides to users.

First, cloud-native distribution: we want to scale to tens of billions of vectors with sufficient elasticity, separating storage and compute. All data lives in underlying shared storage, such as object storage and message queues.

Second, unified stream and batch processing. This is not the streaming of the traditional Lambda architecture; rather, we want a single system that handles users' streaming inserts well and makes the data queryable in real time. (A sketch of this follows the list below.)

Third, a pluggable engine. As you can see, there are many choices of vector index, with different trade-offs between them: some have better performance, some have a lower memory footprint, and no single index can satisfy everyone's needs. For large companies, performance may be the major concern; for small companies, cost or memory footprint may become the most important metrics. We therefore want the engine itself to be pluggable.

Finally, one experience from laptop to cloud: you can very easily deploy on your own laptop, run it in your company's Kubernetes cluster, and, more importantly, run it on the cloud.
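On the stream-batch point above, a minimal sketch (reusing the `client` and `fake_embed` helper from the first sketch): a row written through the streaming path becomes searchable immediately, with no offline export/rebuild cycle:

```python
# Streaming write: no nightly batch import, no manual index rebuild step.
client.insert(
    collection_name="docs",
    data=[{"id": 100, "vector": fake_embed("freshly inserted row"),
           "text": "freshly inserted row"}],
)

# The same row is visible to queries right away -- stream and batch in one system.
hits = client.search(
    collection_name="docs",
    data=[fake_embed("freshly inserted row")],
    limit=1,
    output_fields=["text"],
)
```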

In the future ecosystem of large models, the API will become the most important weapon. You have already seen OpenAI's Assistants, including its GPTs; essentially they are function calls. Although they also provide many retrieval capabilities, function calling is the most important: it can string many things together and help developers build as fast as possible. The API may therefore become the flywheel of future database products. SQL support is certainly nice, but even without SQL, if the API is popular enough and large models learn how to call it, the barrier for developers is just as low.
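For example, a vector database can expose its search as a tool in OpenAI's function-calling format, letting an agent string it into business logic. A hedged sketch; the tool name and parameters here are hypothetical:

```python
# A hypothetical vector-search tool declared in OpenAI's function-calling format.
vector_search_tool = {
    "type": "function",
    "function": {
        "name": "vector_search",
        "description": "Semantic search over a Milvus collection.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Natural-language query"},
                "top_k": {"type": "integer", "description": "Number of results"},
            },
            "required": ["query"],
        },
    },
}

def vector_search(query: str, top_k: int = 5) -> list[str]:
    """Runs when the model emits a vector_search tool call (helpers from the first sketch)."""
    hits = client.search(collection_name="docs", data=[fake_embed(query)],
                         limit=top_k, output_fields=["text"])
    return [hit["entity"]["text"] for hit in hits[0]]
```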

As shown above, this is the final architecture of Milvus 2.0. The design philosophy boils down to two points: first, storage-compute separation; second, all compute nodes run as microservices. All index nodes, query nodes, and data nodes, including the proxies in front, are Kubernetes pods, all microservices. All data lives in the middle layer, stored on Kafka and S3, so the system can scale very quickly along with Kubernetes, requiring only some in-memory changes; once scaled, data is pulled up directly from S3.

Milvus also provides rich capabilities for AIGC scenarios that were missing in the previous generation of vector retrieval systems.

Sparse embeddings: everyone is familiar with TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 (Best Match 25), but today's sparse vectors also have AI-based extraction methods that do a better job of keyword matching; plus hybrid scalar-and-vector query capabilities and rich API support.
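A hedged sketch of such a hybrid query, assuming pymilvus 2.4+, which added sparse vector fields and hybrid search with reciprocal-rank-fusion reranking; the collection and field names are hypothetical:

```python
from pymilvus import AnnSearchRequest, Collection, RRFRanker, connections

connections.connect(host="localhost", port="19530")
collection = Collection("docs_hybrid")      # hypothetical: has dense and sparse fields

dense_query_vec = [0.1] * 768               # placeholder dense embedding
sparse_query_vec = {17: 0.6, 4096: 0.3}     # placeholder sparse {dim: weight} map

# One ANN request per field: dense for semantics, sparse for keyword matching.
dense_req = AnnSearchRequest(data=[dense_query_vec], anns_field="dense",
                             param={"metric_type": "IP", "params": {"ef": 64}}, limit=20)
sparse_req = AnnSearchRequest(data=[sparse_query_vec], anns_field="sparse",
                              param={"metric_type": "IP"}, limit=20)

# Fuse the two ranked lists with reciprocal rank fusion.
hits = collection.hybrid_search([dense_req, sparse_req], RRFRanker(), limit=5,
                                output_fields=["text"])
```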

Multi-tenancy support: if you are building a knowledge base today with perhaps one million users, how should its schema be designed in the database? With the traditional one-table-per-user approach, the number of tables would explode. Supporting multi-tenancy well inside a single vector database is a challenge, but Milvus already has this capability.
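One way this works (a hedged sketch assuming the partition-key feature introduced around Milvus 2.3): mark the tenant field as the partition key, so a million tenants share one collection instead of one table each:

```python
from pymilvus import DataType, MilvusClient

client = MilvusClient("milvus_demo.db")

schema = client.create_schema(auto_id=False, enable_dynamic_field=True)
schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("vector", DataType.FLOAT_VECTOR, dim=768)
# Partition key: Milvus hashes tenant_id into internal partitions, so one
# collection serves all tenants -- no table-per-user explosion.
schema.add_field("tenant_id", DataType.VARCHAR, max_length=64, is_partition_key=True)

index_params = client.prepare_index_params()
index_params.add_index(field_name="vector", index_type="AUTOINDEX", metric_type="IP")
client.create_collection(collection_name="kb", schema=schema, index_params=index_params)

# A query filters on the partition key and is routed to that tenant's data only.
hits = client.search(
    collection_name="kb",
    data=[[0.0] * 768],                     # placeholder query embedding
    limit=5,
    filter='tenant_id == "user_123"',
)
```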

Massive offline data import, similar to Bulk Insert / Bulk Load in HBase: the ability to import hundreds of millions or even billions of records into the vector database very quickly and make them queryable immediately.
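In pymilvus this surfaces as the bulk-insert utility (a hedged sketch; the file path is hypothetical, and the files must already be prepared in the object storage Milvus is configured to read):

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")

# Kick off a server-side bulk load from prepared files -- the Bulk Load idea:
# data bypasses the row-by-row insert path entirely.
task_id = utility.do_bulk_insert(
    collection_name="docs",
    files=["imports/docs_batch_0.json"],    # hypothetical path in object storage
)

# The import runs asynchronously; poll until the rows become queryable.
state = utility.get_bulk_insert_state(task_id)
print(state.state_name, state.row_count)
```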

In addition: dynamic schema, range search, and disk-based indexes, including the ability to keep data on disk via mmap, which most vector databases do not have.
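Range search, for example, replaces "top-k nearest" with "everything inside a similarity band". A hedged sketch, assuming Milvus 2.3+ and an IP-metric index on the `docs` collection from the first sketch; the thresholds are illustrative:

```python
# For IP (higher score = closer): return hits with radius < score <= range_filter,
# rather than a fixed top-k.
hits = client.search(
    collection_name="docs",
    data=[fake_embed("query text")],
    limit=100,
    search_params={
        "metric_type": "IP",
        "params": {"radius": 0.6, "range_filter": 1.0},
    },
)
```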

Beyond these there are many other capabilities, such as CDC, multi-vector support, and inverted indexes on scalars. These are all on the roadmap and are expected to ship over the course of this year.

Finally, performance. When it comes to vector databases, what users care about most is performance. If you are interested, or are building a vector database yourself, you can run our VectorDB Benchmark. It is fully open source and has fairly rich test sets, including filtered test sets, searches with various parameters, and datasets of different sizes.

So how do you optimize performance? It mainly comes down to three things. The first is compute: how to find the cheapest and most efficient computing power. Besides GPUs, ARM is a point worth digging into deeply, as are the new Intel CPU instruction sets; using AVX-512 VNNI and the latest-generation AMX instructions actually improves performance very nicely. At present, few vendors support ARM SVE, so we did our work on AMX. We also support NVIDIA's latest GPU graph index, which we contributed to the community; its performance is much better than the traditional GPU index.

The second is the algorithm side: how to optimize the graph, improve graph quality, and prune as aggressively as possible during search. This is a very important way to improve performance.

Finally, query scheduling, including dynamic batching: how to merge requests and how to balance load across the cluster, which brings us back to the territory of traditional databases. So vector retrieval is essentially high-performance computing plus a database; that is the job to be done.


PART 3
Use Cases


You can look over these traditional scenarios yourselves; I will not expand on them here. If you work on related businesses, we can discuss the specifics.

The first widely used scenario is RAG, which mainly solves four problems: first, the hallucination problem of large models; second, data freshness; third, data security; and finally, how to verify the results output by the large model, since RAG provides a reference link. Whether you build it with a graph database or a vector database, the two do not conflict and complement each other well.
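The shape of a minimal RAG loop over a vector database, as a hedged sketch reusing the `client` and `fake_embed` helper from the first sketch; `llm` is a placeholder for any chat-completion call, and the prompt format is illustrative:

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError

def answer_with_rag(question: str, top_k: int = 3) -> str:
    # 1. Recall: semantic search over the private, up-to-date corpus.
    hits = client.search(collection_name="docs", data=[fake_embed(question)],
                         limit=top_k, output_fields=["text"])
    context = [hit["entity"]["text"] for hit in hits[0]]

    # 2. Ground the model: the retrieved chunks double as reference links, which
    #    is what addresses hallucination, freshness, security, and verifiability.
    prompt = (
        "Answer using only the context below and cite the snippets you used.\n\n"
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(context))
        + f"\n\nQuestion: {question}"
    )
    return llm(prompt)
```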

The second interesting scenario is called semantic caching. There is a fairly popular project on GitHub called GPTCache. The simplest line of thinking: we use Redis to cache MySQL data, so is it possible to cache the output of large models in the same way? That is how the project came about.

The overall idea is not complicated: use a vector database for semantic retrieval, and if a question semantically matches a cached question, assume the answer is probably similar. It may not yet be a fully production-ready scenario, but it does offer a good line of thought. In the inference stage of large models, people use a similar idea: recall first, then let the large model refine the result, which can save many tokens. It is quite a common approach.
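The core mechanism hand-rolled in a few lines, as a hedged sketch of the idea behind GPTCache rather than its actual API, reusing the `fake_embed` and `llm` stubs from the earlier sketches; the threshold is illustrative:

```python
import numpy as np

cache: list[tuple[np.ndarray, str]] = []    # (question embedding, cached answer)
THRESHOLD = 0.92                            # cosine-similarity cut-off, illustrative

def cached_llm(question: str) -> str:
    q = np.asarray(fake_embed(question))    # embeddings are already L2-normalized
    for emb, answer in cache:
        if float(q @ emb) >= THRESHOLD:     # semantic hit: reuse the old answer
            return answer
    answer = llm(question)                  # miss: pay for a real model call
    cache.append((q, answer))
    return answer
```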


PART 4
OpenAI Dev Day: What Does It Really Mean?

Finally, a newer topic: how should we view OpenAI Dev Day? As you know, the November 6 launch event introduced several important features: first, building your own GPTs; second, GPT-4 Turbo's support for very long contexts; third, built-in retrieval and function calling. What we paid most attention to was the multi-modal API; we ran many tests and can briefly share the results. I think OpenAI's retrieval is still at a very rudimentary stage: the things others have not solved well, it has not solved well either. For example, long-document summarization: give it a book and ask what the book is about, and this is very hard to solve with RAG. The same goes for very specific questions: given 100 documents, each with an ID, ask what document ID 50 says, and ChatGPT will tell you it cannot find it. I think the deep integration of search and large models is a very good topic. Both are fundamentally based on probability, so search and generation must be solved together, but for everyone this is still a very early stage. I would rather the large-model companies build a better ecosystem: through function calling, with the agent at the center, all the surrounding vendors provide better APIs. For example, graph databases provide one set of APIs and vector databases provide another, and the agent strings the business logic together with itself at the center. That is my expectation for the future.

Appendix

01 DB-GPT framework
https://github.com/eosphoros-ai/DB-GPT

02 Text2SQL fine-tuning
https://github.com/eosphoros-ai/DB-GPT-Hub

03 DB-GPT front-end visualization project
https://github.com/eosphoros-ai/DB-GPT-Web

04 DB-GPT plugin repository
https://github.com/eosphoros-ai/DB-GPT-Plugins

05 Text2SQL learning materials and frontier tracking
https://github.com/eosphoros-ai/Awesome-Text2SQL

This article is shared from the WeChat official account ZILLIZ (Zilliztech).
