Azure Machine Learning - Integrated data chunking and embedding in Azure AI Search

In indexer-based indexing, Azure AI Search's integrated vectorization adds data chunking and text-to-vector embedding to the skillset, and it also adds text-to-vector conversion for queries.

Follow TechLead for all-dimensional AI knowledge. The author has 10+ years of experience in internet service architecture, AI product development, and team management. He holds degrees from Tongji University and Fudan University, is a member of the Fudan Robot Intelligence Laboratory, an Alibaba Cloud certified senior architect, a certified project management professional, and has led the development of AI products with revenues in the hundreds of millions.


1. Component diagram

The figure below shows the integrated vectorized components.

[Figure: components of integrated vectorization]

Here are the components involved in integrated vectorization:

  • A supported data source for indexer-based indexing.
  • An index that specifies vector fields, with a vectorizer definition assigned to each vector field.
  • A skillset that provides the Text Split skill for data chunking, plus a vectorization skill (the AzureOpenAIEmbedding skill, or a custom skill that points to an external embedding model).
  • (Optional) Index projections, used to push chunked data to a secondary index (also defined in the skillset).
  • An embedding model, deployed on Azure OpenAI or available via an HTTP endpoint.
  • An indexer that drives the end-to-end process. The indexer also specifies the schedule, field mappings, and properties used for change detection.

This list focuses on integrated vectorization, but your solution is not limited to it. You can add more AI enrichment skills, create a knowledge store, add semantic ranking, and apply relevance tuning and other query capabilities.

2. Availability and Pricing

Availability of integrated vectorization depends on the embedding model. If you are using Azure OpenAI, check regional availability.

If you are using a custom skill with an Azure hosting mechanism (such as Azure Functions, Azure Web Apps, or Azure Kubernetes Service), review the product-availability-by-region page for feature availability.

Data chunking (the Text Split skill) is free and available in all regions for all Azure AI services.

3. What solutions does integrated vectorization support?

  • Divide large documents into chunks, which is useful for both vector and non-vector scenarios. For vector scenarios, chunks help you satisfy the input constraints of embedding models. For non-vector scenarios, you might have a chat-style search application in which GPT assembles responses from indexed chunks. Chat-style search can use either vectorized or non-vectorized chunks.

  • Build a vector store in which all fields are vector fields and the document ID (required for a search index) is the only string field. Query the vector index to retrieve document IDs, then send the documents' vector fields to another model.

  • Combine vector and text fields for hybrid search, with or without semantic ranking. Integrated vectorization simplifies [all scenarios supported by vector search].
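The "vector store" scenario above can be sketched as an index schema: a single string key field plus vector fields that share a vector search profile and a query-time vectorizer. This is an illustrative payload only; the field names (`chunk_id`, `content_vector`), the dimensions, the profile names, and the `<your-resource>` endpoint are assumptions, and the exact JSON shape can vary by REST API version.

```python
import json

# Sketch of a vector-store index definition (field names and endpoint are
# illustrative placeholders, not values from a real deployment).
index_definition = {
    "name": "chunk-vector-index",
    "fields": [
        # The only string field: the document (chunk) key.
        {"name": "chunk_id", "type": "Edm.String", "key": True, "filterable": True},
        # A vector field tied to a vector search profile.
        {"name": "content_vector", "type": "Collection(Edm.Single)",
         "searchable": True, "dimensions": 1536,
         "vectorSearchProfile": "my-profile"},
    ],
    "vectorSearch": {
        "profiles": [
            {"name": "my-profile", "algorithm": "my-hnsw",
             "vectorizer": "my-vectorizer"}
        ],
        "algorithms": [{"name": "my-hnsw", "kind": "hnsw"}],
        # The vectorizer handles text-to-vector conversion at query time.
        "vectorizers": [
            {"name": "my-vectorizer", "kind": "azureOpenAI",
             "azureOpenAIParameters": {
                 "resourceUri": "https://<your-resource>.openai.azure.com",
                 "deploymentId": "text-embedding-ada-002",
             }}
        ],
    },
}

print(json.dumps(index_definition, indent=2))
```

The key point is that the vectorizer named in the profile must correspond to the same embedding model used to populate the vector field at indexing time.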

4. When to use integrated vectorization

We recommend using Azure AI Studio's built-in vectorization support. If this approach doesn't meet your needs, you can create indexers and skillsets that call integrated vectorization using Azure AI Search's programming interface.

5. How to use integrated vectorization

For query-only vectorization:

  1. Add a [vectorizer] to the index. It should reference the same embedding model that was used to generate the vectors in the index.
  2. Assign the [vectorizer] to the vector field.
  3. [Build a vector query] that specifies the text string to vectorize.
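The query-only steps above end in a search request in which you pass raw text and let the service vectorize it. A minimal sketch of such a request body follows; the field name `content_vector`, the query string, and `k` are illustrative assumptions, and the payload shape may differ across REST API versions.

```python
import json

# Sketch of a vector query body that relies on the index's vectorizer:
# kind="text" tells the service to convert the text to a vector at query time.
query_payload = {
    "vectorQueries": [
        {
            "kind": "text",               # service-side vectorization
            "text": "how do I rotate a storage key?",  # example query text
            "fields": "content_vector",   # assumed vector field name
            "k": 5,                       # number of nearest neighbors
        }
    ],
}

print(json.dumps(query_payload, indent=2))
```

Without a vectorizer on the index, you would instead have to embed the query text yourself and send a `kind: "vector"` query containing the raw vector.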

A more common scenario is data chunking and vectorization during indexing:

  1. [Create a data source connection] to a supported data source for indexer-based indexing.
  2. [Create a skillset] that calls the [Text Split skill] for chunking and the [AzureOpenAIEmbedding skill] (or a custom skill) to vectorize the chunks.
  3. [Create an index] that specifies a query-time [vectorizer] and assigns it to a vector field.
  4. [Create an indexer] to drive the whole process, from data retrieval through skillset execution to indexing.
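Step 2 above can be sketched as a skillset definition that chains the Text Split skill into an embedding skill. This is a sketch under assumptions: the chunk sizes, skill contexts, output names, and the `<your-resource>` endpoint are placeholders, and parameter names can vary by API version.

```python
import json

# Sketch of a chunking + embedding skillset (all names are illustrative).
skillset = {
    "name": "chunking-embedding-skillset",
    "skills": [
        {
            # Split each document's content into overlapping pages (chunks).
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "context": "/document",
            "textSplitMode": "pages",
            "maximumPageLength": 2000,
            "pageOverlapLength": 200,
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "textItems", "targetName": "pages"}],
        },
        {
            # Embed each chunk with an Azure OpenAI deployment.
            "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
            "context": "/document/pages/*",
            "resourceUri": "https://<your-resource>.openai.azure.com",
            "deploymentId": "text-embedding-ada-002",
            "inputs": [{"name": "text", "source": "/document/pages/*"}],
            "outputs": [{"name": "embedding", "targetName": "vector"}],
        },
    ],
}

print(json.dumps(skillset, indent=2))
```

Note how the embedding skill's context is the split skill's output (`/document/pages/*`), so every chunk is embedded individually.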

6. Restrictions

Make sure you understand the [Azure OpenAI quotas and limits for embedding models]. Azure AI Search has a retry policy, but if the quota is exhausted, retries fail.

Azure OpenAI tokens-per-minute limits apply per model, per subscription. Keep this in mind if you use the same embedding model for both query and indexing workloads. Where possible, [follow best practices]: provide an embedding model for each workload, and try deploying them in different subscriptions.

Keep in mind that in Azure AI Search, there are per-tier and per-workload [service limits].
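When you call an embedding endpoint from your own code (for example, from a custom skill) and hit the throttling described above, a common client-side pattern is exponential backoff with jitter. This is a generic illustration, not the service's built-in retry policy; the exception type stands in for an HTTP 429 response.

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff plus jitter.

    RuntimeError stands in for a throttling error (HTTP 429) here;
    in real code you would catch your HTTP client's specific exception.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:
            # Double the delay each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
    raise RuntimeError("retries exhausted")
```

Backoff only helps with transient throttling; if the quota itself is exhausted, requests keep failing, which is why separate deployments per workload are recommended above.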

Finally, the following features are currently not supported:

  • [Customer-managed encryption keys]
  • [Shared private link connections] with a vectorizer
  • Batch processing for integrated data chunking and vectorization

7. Advantages of integrated vectorization

Here are some important advantages of integrated vectorization:

  • There are no separate data chunking and vectorization pipelines. Code is easier to write and maintain.

  • End-to-end indexing is automated. When data changes in the source (such as Azure Storage, Azure SQL, or Cosmos DB), the indexer can carry those updates through the entire pipeline: retrieval, document cracking, optional AI enrichment, data chunking, vectorization, and indexing.

  • Project chunked content to a secondary index. Secondary indexes are created just like any search index (a schema containing fields and other constructs), but the indexer populates them along with the primary index. During the same indexing run, the content of each source document flows to fields in the primary and secondary indexes.

    Secondary indexes are well suited to data chunking and retrieval-augmented generation (RAG) applications. Suppose a large PDF file is the source document: the primary index might contain basic information (title, date, author, description), while the secondary index contains the chunks of content. Chunk-level vectorization makes it easier to find relevant information (each chunk is searchable) and to return relevant responses, especially in chat-style search applications.
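The projection of chunks into a secondary index is configured on the skillset. A minimal sketch of such an index projection definition follows; the index name, parent key field, and mapping names are illustrative assumptions, and the exact shape can vary by API version.

```python
import json

# Sketch of an index projection: route each chunk under /document/pages/*
# into its own document in a secondary (chunk) index.
index_projections = {
    "selectors": [
        {
            "targetIndexName": "chunk-index",       # assumed secondary index
            "parentKeyFieldName": "parent_id",      # links chunk to source doc
            "sourceContext": "/document/pages/*",   # one projection per chunk
            "mappings": [
                {"name": "chunk", "source": "/document/pages/*"},
                {"name": "vector", "source": "/document/pages/*/vector"},
                {"name": "title", "source": "/document/title"},
            ],
        }
    ],
    # Skip writing the parent documents themselves into the chunk index.
    "parameters": {"projectionMode": "skipIndexingParentDocuments"},
}

print(json.dumps(index_projections, indent=2))
```

Each chunk becomes its own searchable document in the target index, with the parent key field tying it back to the source document.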

8. Chunked indexing

Chunking is the process of dividing content into smaller, manageable parts (chunks) that can be processed independently. Chunking is necessary when a source document is too large for the maximum input size of the embedding model or large language model, but you may also find that chunking provides a better index structure for [RAG patterns] and chat-style search.
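To make the idea concrete, here is a simplified, standalone version of what the Text Split skill does for you: fixed-size chunks with an overlap so that sentences cut at a boundary still appear whole in one chunk. The sizes are illustrative; the real skill also supports sentence-aware splitting.

```python
def chunk_text(text: str, max_len: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters.

    A simplified illustration of data chunking; the Text Split skill
    performs this (and more) inside the indexing pipeline.
    """
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # last chunk reached the end of the text
        start += max_len - overlap  # step forward, keeping the overlap
    return chunks
```

The overlap is what lets chunk-level retrieval tolerate content that straddles a chunk boundary.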

The following diagram shows the components of chunked indexing.

[Figure: components of chunked indexing]



Origin blog.csdn.net/magicyangjay111/article/details/134494658