Lower the barrier to building retrieval systems and implement RAG applications with ease! A surprise launch: Zilliz Cloud Pipelines

Zilliz Cloud officially launches Pipelines!

Zilliz Cloud Pipelines converts unstructured data such as documents, text fragments, and images into searchable vectors and stores them in Collections. It helps developers simplify their engineering work and build RAG applications for a variety of scenarios, reducing the setup and maintenance of a complex production system to a few API calls.

01. Why do we need Zilliz Cloud Pipelines?

Semantic information retrieval systems are widely used in applications and Internet services, from the familiar web search and e-commerce image search to the recently popular Retrieval-Augmented Generation (RAG) applications. Modern retrieval systems usually use deep learning models to extract features from unstructured data such as text and images and convert them into high-dimensional vectors, a process known in the industry as "embedding." The extracted vectors are then stored and retrieved with dedicated vector databases such as Zilliz Cloud and Milvus. With the development of deep learning, vector-based retrieval has become increasingly common in recent years.

However, building such a retrieval system requires deep expertise and engineering experience. Many developers want to try vector retrieval but are deterred by the complex data-processing and model-inference systems needed to generate embeddings. Zilliz Cloud Pipelines now solves this problem conveniently and effectively: it provides an easy-to-use API that converts unstructured data such as documents, text fragments, and images into searchable vectors and stores them in Collections.

Reasons to choose Zilliz Cloud Pipelines:

  • A simplified development process: developers can convert unstructured data into searchable vectors and retrieve them from the vector database without building complex systems.

  • High-quality embeddings that meet business retrieval needs, even without professional experience in deep learning or retrieval systems.

  • There is no need to worry about scalability. Even if the data volume and query frequency increase by several orders of magnitude, the system can easily cope with it.

We are currently releasing a public preview of Zilliz Cloud Pipelines that supports semantic search over documents. More types of Pipelines will follow to cover more diverse information retrieval scenarios, such as more flexible data preprocessing, image and video search, and multi-modal search.

02. How Zilliz Cloud Pipelines work

There are three types of Zilliz Cloud Pipelines: the Ingestion pipeline, the Search pipeline, and the Deletion pipeline.

Ingestion Pipeline

The Ingestion pipeline can convert unstructured data into searchable vectors and import the vectors into the Zilliz Cloud vector database for subsequent queries.

An Ingestion pipeline can be configured with multiple functions, each of which passes input fields through conversion logic to produce output fields. For example, we can take documents as input; the function automatically splits these documents into segments and converts them into vectors. A function can also retain additional user-supplied information about a document, which can later be used to filter results during vector search.

In Zilliz Cloud, one Ingestion pipeline corresponds to one Collection. When an Ingestion pipeline is created, Zilliz Cloud automatically creates the corresponding Collection and derives its data format (schema) from the pipeline configuration.

INDEX_DOC Function

The INDEX_DOC function splits the input text document into fragments and converts each fragment into a vector. It maps the input field (doc_name) to four output fields (doc_name, chunk_id, chunk_text, and embedding). These four fields constitute the scalar and vector fields of the new Collection, and the field names cannot be changed.

Note that an Ingestion pipeline must contain exactly one INDEX_DOC function.

PRESERVE Function

The PRESERVE function stores a user-defined input field as an additional scalar field in the new Collection, used to hold extra information that describes the document. This information is stored with every fragment of the document. Each PRESERVE function saves one scalar field, and up to 5 PRESERVE functions can be added to an Ingestion pipeline.
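Putting the two functions together, the sketch below shows roughly what creating an Ingestion pipeline looks like through the Zilliz Cloud RESTful API. The region, project ID, cluster ID, Collection name, and the `publish_year` PRESERVE field are placeholders for illustration, and the exact endpoint path and payload field names should be verified against the Zilliz Cloud Pipelines documentation.

```python
import requests

CLOUD_REGION = "gcp-us-west1"     # placeholder: your Zilliz Cloud region
API_KEY = "YOUR_API_KEY"          # placeholder
PROJECT_ID = "proj-xxxxxxxx"      # placeholder
CLUSTER_ID = "in03-xxxxxxxx"      # placeholder: target cluster for the new Collection

BASE_URL = f"https://controller.api.{CLOUD_REGION}.zillizcloud.com/v1/pipelines"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Exactly one INDEX_DOC function (splits the document and embeds each chunk),
# plus an optional PRESERVE function that keeps a user-supplied scalar field.
# All field and parameter names below are assumptions for illustration.
ingestion_pipeline = {
    "projectId": PROJECT_ID,
    "name": "my_doc_ingestion",
    "description": "Split documents into chunks and embed them",
    "type": "INGESTION",
    "clusterId": CLUSTER_ID,
    "newCollectionName": "my_knowledge_base",
    "functions": [
        {"name": "index_doc", "action": "INDEX_DOC",
         "inputField": "doc_url", "language": "ENGLISH"},
        {"name": "keep_year", "action": "PRESERVE",
         "inputField": "publish_year", "outputField": "publish_year",
         "fieldType": "Int16"},
    ],
}

resp = requests.post(BASE_URL, headers=HEADERS, json=ingestion_pipeline)
resp.raise_for_status()
print(resp.json())  # the response is assumed to contain the new pipeline's ID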

Example: Create a knowledge base

With the help of the Ingestion pipeline, we can easily build a knowledge base that supports semantic retrieval over existing documents and their related data (such as document author, publication date, etc.). The original text of each document fragment, its vector, and the additional document information are all stored in the vector database.
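As a rough sketch of this workflow, the snippet below runs an Ingestion pipeline (such as the one created in the previous sketch) on a single document reachable by URL. The pipeline ID, document URL, and the `doc_url` and `publish_year` field names are illustrative assumptions; consult the official documentation for the exact request format.

```python
import requests

CLOUD_REGION = "gcp-us-west1"               # placeholder
API_KEY = "YOUR_API_KEY"                    # placeholder
INGESTION_PIPELINE_ID = "pipe-xxxxxxxx"     # placeholder: ID returned when the pipeline was created

BASE_URL = f"https://controller.api.{CLOUD_REGION}.zillizcloud.com/v1/pipelines"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Ingest one document reachable by URL (e.g. a file in an object-storage bucket).
# "doc_url" and "publish_year" are assumed field names; "publish_year" only applies
# if a matching PRESERVE function was configured.
payload = {
    "data": {
        "doc_url": "https://example-bucket.s3.amazonaws.com/milvus_faq.md",
        "publish_year": 2023,
    }
}
resp = requests.post(f"{BASE_URL}/{INGESTION_PIPELINE_ID}/run", headers=HEADERS, json=payload)
resp.raise_for_status()
print(resp.json())  # typically reports how many chunks were created and stored
```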

Search Pipeline

The Search pipeline converts query text (a string) into a vector and performs a vector similarity search in the vector database, returning the Top-K similar vectors, the original text of the corresponding fragments, and the additional document information. We can use a Search pipeline to implement semantic retrieval. Only one function, SEARCH_DOC_CHUNK, can be added to a Search pipeline.

SEARCH_DOC_CHUNK Function

The SEARCH_DOC_CHUNK function converts the query text into a vector and retrieves the k document fragments most relevant to the query vector in the vector database.

Example: Semantic-based retrieval

Once an Ingestion pipeline has been created, a Search pipeline can query the corresponding Collection for similar text-fragment vectors. The properties of the embedding model ensure that the returned fragments are the ones in the knowledge base most semantically similar to the query text.
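Here is a minimal sketch of this flow, again with placeholder IDs; the endpoint paths, the `limit` and `filter` parameter names, and the response shape are assumptions to be checked against the Zilliz Cloud documentation.

```python
import requests

CLOUD_REGION = "gcp-us-west1"     # placeholder
API_KEY = "YOUR_API_KEY"          # placeholder
PROJECT_ID = "proj-xxxxxxxx"      # placeholder
CLUSTER_ID = "in03-xxxxxxxx"      # placeholder: cluster holding the Collection

BASE_URL = f"https://controller.api.{CLOUD_REGION}.zillizcloud.com/v1/pipelines"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# A Search pipeline holds exactly one SEARCH_DOC_CHUNK function, pointed at the
# Collection that the Ingestion pipeline created.
search_pipeline = {
    "projectId": PROJECT_ID,
    "name": "my_doc_search",
    "type": "SEARCH",
    "functions": [
        {"name": "search_chunks", "action": "SEARCH_DOC_CHUNK",
         "inputField": "query_text",
         "clusterId": CLUSTER_ID, "collectionName": "my_knowledge_base"},
    ],
}
resp = requests.post(BASE_URL, headers=HEADERS, json=search_pipeline)
resp.raise_for_status()
search_pipeline_id = resp.json()["data"]["pipelineId"]  # response shape assumed

# Run the pipeline: the query text is embedded and the Top-K most similar
# chunks come back, optionally filtered on a PRESERVE-d scalar field.
payload = {
    "data": {"query_text": "How does Milvus handle data consistency?"},
    "params": {"limit": 3, "filter": "publish_year >= 2022"},  # parameter names assumed
}
resp = requests.post(f"{BASE_URL}/{search_pipeline_id}/run", headers=HEADERS, json=payload)
resp.raise_for_status()
for hit in resp.json().get("data", {}).get("result", []):   # response shape assumed
    print(hit.get("chunk_text"), hit.get("distance"))
```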

Deletion Pipeline

The Deletion pipeline deletes all fragments of a specified document from the Collection. Only one PURGE_DOC_INDEX function can be added to a Deletion pipeline.

PURGE_DOC_INDEX Function

The PURGE_DOC_INDEX function deletes all document fragments with the specified `doc_name`. Users can use it to efficiently remove documents from the vector database.

Example: Efficiently delete document data

If you have created an Ingestion pipeline, you can use a Deletion pipeline and specify a `doc_name` in the corresponding Collection to delete the whole document at once, without performing a separate deletion operation on each fragment.
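A corresponding sketch for deletion, with the same caveats about placeholder IDs and assumed endpoint and field names:

```python
import requests

CLOUD_REGION = "gcp-us-west1"     # placeholder
API_KEY = "YOUR_API_KEY"          # placeholder
PROJECT_ID = "proj-xxxxxxxx"      # placeholder
CLUSTER_ID = "in03-xxxxxxxx"      # placeholder

BASE_URL = f"https://controller.api.{CLOUD_REGION}.zillizcloud.com/v1/pipelines"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# A Deletion pipeline holds exactly one PURGE_DOC_INDEX function.
deletion_pipeline = {
    "projectId": PROJECT_ID,
    "name": "my_doc_deletion",
    "type": "DELETION",
    "functions": [
        {"name": "purge_doc", "action": "PURGE_DOC_INDEX",
         "inputField": "doc_name",
         "clusterId": CLUSTER_ID, "collectionName": "my_knowledge_base"},
    ],
}
resp = requests.post(BASE_URL, headers=HEADERS, json=deletion_pipeline)
resp.raise_for_status()
deletion_pipeline_id = resp.json()["data"]["pipelineId"]  # response shape assumed

# Delete every fragment of one document in a single call.
payload = {"data": {"doc_name": "milvus_faq.md"}}
resp = requests.post(f"{BASE_URL}/{deletion_pipeline_id}/run", headers=HEADERS, json=payload)
resp.raise_for_status()
print(resp.json())
```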

Click the link to view the Zilliz Pipelines demo in the article

03. Summary

As a platform designed specifically for developers, Zilliz Cloud Pipelines brings more possibilities to AI application development:

  • Supplement LLMs with domain-specific or private knowledge. User questions are converted into vectors and matched against vectors in a knowledge base, and the highly relevant knowledge retrieved is supplied to the model. This improves the accuracy of large language models (LLMs) in RAG applications and effectively mitigates the problems of over-reliance on the LLM and of outdated training data, especially in applications such as chatbots and content generation systems (see the sketch after this list).

  • Improve recall for applications built on keyword retrieval. Keyword retrieval often fails to capture semantic similarity, yet many traditional applications, such as on-site page search for independent websites, are built on it. Using embeddings and vector recall can greatly increase the probability of hitting the key information and improve search quality.
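To illustrate the RAG pattern described in the first point, here is a rough sketch that retrieves relevant chunks through an existing Search pipeline and assembles them into an LLM prompt. The pipeline ID, the response shape, and the `retrieve_context` and `build_rag_prompt` helpers are hypothetical, and the actual LLM call is left to whichever model or SDK you use.

```python
import requests

CLOUD_REGION = "gcp-us-west1"          # placeholder
API_KEY = "YOUR_API_KEY"               # placeholder
SEARCH_PIPELINE_ID = "pipe-xxxxxxxx"   # placeholder: an existing Search pipeline

BASE_URL = f"https://controller.api.{CLOUD_REGION}.zillizcloud.com/v1/pipelines"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}


def retrieve_context(question: str, top_k: int = 3) -> list[str]:
    """Embed the question via the Search pipeline and return the most relevant chunks."""
    payload = {"data": {"query_text": question}, "params": {"limit": top_k}}
    resp = requests.post(f"{BASE_URL}/{SEARCH_PIPELINE_ID}/run", headers=HEADERS, json=payload)
    resp.raise_for_status()
    results = resp.json().get("data", {}).get("result", [])  # response shape assumed
    return [r["chunk_text"] for r in results]


def build_rag_prompt(question: str) -> str:
    """Prepend the retrieved knowledge to the user question before calling an LLM."""
    context = "\n\n".join(retrieve_context(question))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# The assembled prompt can then be sent to any LLM of your choice.
print(build_rag_prompt("How does Milvus handle data consistency?"))
```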

Currently, developers can use this feature for free by creating a Serverless cluster in Zilliz Cloud. Next, the feature will gradually roll out to Standard Edition and Enterprise Edition clusters. Going forward, we will continue to improve the customization options of Zilliz Cloud Pipelines and expand into retrieval scenarios such as images and videos. Everyone is welcome to try it!
