Applying Large Models to Knowledge Retrieval Technology Architecture


Overview

In almost all practical applications of large language models (LLMs), there are situations where you want the model to generate answers grounded in specific data, rather than general answers based on its training set. For example, a corporate chatbot should be able to cite a specific article on the company website, and a legal analytics tool should be able to cite prior documents from the same case. How this external data is brought in is a key design decision.

At a high level, there are two main ways to reference specific data:

  1. Insert the data into the model's prompt as context and instruct the response to draw on that information.

  2. Fine-tune the model by providing hundreds or thousands of prompt-completion pairs.

Shortcomings of Existing Large Language Models for Knowledge Retrieval

For context-based methods:

The context size of the model is limited; the latest davinci-003 model can only process a maximum of 4,000 tokens at a time, and many documents do not fit into that context. Processing more tokens means longer processing times, which degrades the user experience in customer-facing scenarios. It also means higher API costs, and not necessarily more accurate responses if the information in context is not well targeted.

For the fine-tuning method:

Generating prompt-completion pairs is time-consuming and potentially expensive.

Many repositories of information you may want to reference are very large. For example, if your application is a study aid for medical students preparing for the USMLE, you would need to provide the model with complete training examples across multiple disciplines.

Some external data sources change rapidly. For example, retraining a model based on a customer support queue that empties daily or weekly is not optimal.

Best practices for fine-tuning are still being developed. An LLM can itself be used to help generate training data, but doing this effectively may add implementation complexity.

Simplified solution

[Figure: retrieval-augmented generation architecture overview]

The design above has been known by various names, the most common being "retrieval-augmented generation" (RAG) or "RETRO". Related links and concepts:

RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RETRO: Improving Language Models by Retrieving from Trillions of Tokens
REALM: Retrieval-Augmented Language Model Pre-Training
Retrieval-augmented generation

The idea is to retrieve relevant information from outside the language model (a non-parametric approach) and augment the LLM with that context in the prompt. This architecture can effectively bypass most of the limitations of the fine-tuning and context-only methods.

Retrieve

[Figure: retrieval flow from data sources to text chunks, embeddings, and the search index]

Retrieving relevant information deserves further explanation. As shown above, data can come from multiple sources, depending on the use case. For the data to be useful, it must be small enough that multiple pieces fit into the context, and there must be a way to assess relevance. A typical prerequisite is therefore to split the text into chunks (e.g., via utilities in the LangChain package) and then compute embeddings for those chunks.
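As a rough illustration, here is a minimal chunking sketch using LangChain's text splitter; the chunk size, overlap, and input filename are illustrative and should be tuned to your own token budget.

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Split a raw document into overlapping chunks that are small enough to embed
    # and to fit several of them into a prompt later.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # approximate characters per chunk
        chunk_overlap=100,  # overlap so ideas are not cut off mid-sentence
    )

    document_text = open("company_handbook.txt").read()  # hypothetical source file
    chunks = splitter.split_text(document_text)
    print(f"Split document into {len(chunks)} chunks")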

Language model embeddings are numerical representations of concepts in text and have seemingly limitless uses. Here is how they work: an embedding model converts text into a high-dimensional vector of numbers that can be efficiently compared with other such vectors to support tasks like recommendation, classification, and search. We store the results of this computation in what I will generically refer to below as the search index and entity store; see below for a more advanced discussion.

[Figure: computing embeddings for text chunks and storing them in the search index / entity store]

Back to the flow: when a user submits a question, the application can process the message in a number of ways, but the critical step is computing another embedding, this time over the user's text. We can now perform a semantic search against the search index and entity store by comparing the new embedding vector with the precomputed set of vectors. This semantic search draws on what the language model has "learned" and is not limited to keyword matching. From the results of this search, we can quantitatively identify one or more relevant text chunks that can help answer the user's question.
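A minimal sketch of this search step, assuming the chunks produced above and the OpenAI embeddings endpoint; the model name, top-k value, and the in-memory "index" are illustrative, and a production system would typically use a vector database.

    import numpy as np
    import openai

    def embed(texts):
        # Compute one embedding vector per input text.
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([d["embedding"] for d in resp["data"]])

    # Precompute the "search index": one embedding per text chunk.
    chunk_vectors = embed(chunks)

    def top_k_chunks(question, k=3):
        # Embed the user's question and rank chunks by cosine similarity.
        q = embed([question])[0]
        sims = chunk_vectors @ q / (
            np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
        )
        best = np.argsort(sims)[::-1][:k]
        return [chunks[i] for i in best]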

Augment

Building a prompt from the related text chunks is straightforward. The prompt starts with some basic prompt engineering, instructing the model to avoid "hallucination", i.e. making up an answer that sounds plausible but is not real. If applicable, we instruct the model to answer in a specific format, such as an ordinal rating of "high", "medium", or "low". Finally, we provide the relevant information so the language model can answer using the specific data. In its simplest form, we simply append ("Document 1: " + TextChunk1 + "Document 2: " + TextChunk2 + ...) until the context is filled.
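A minimal prompt-assembly sketch along these lines; the wording of the instructions and the character-based budget are illustrative (a real implementation would count tokens rather than characters).

    def build_prompt(question, relevant_chunks, max_context_chars=4000):
        # Basic instructions to discourage hallucination.
        header = (
            "Answer the question using only the documents below. "
            "If the answer is not contained in the documents, say you do not know "
            "rather than guessing.\n\n"
        )
        context = ""
        for i, chunk in enumerate(relevant_chunks, start=1):
            addition = f"Document {i}: {chunk}\n\n"
            if len(context) + len(addition) > max_context_chars:
                break  # stop appending once the context budget is filled
            context += addition
        return header + context + f"Question: {question}\nAnswer:"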

Finally, the assembled prompt is sent to the large language model. The answer is parsed from the completion and returned to the user.
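Putting the pieces together, a minimal end-to-end sketch that reuses the top_k_chunks and build_prompt helpers from the earlier snippets and the (v0.x) OpenAI completions client; the model name and the sample question are illustrative.

    import openai

    def answer(question):
        prompt = build_prompt(question, top_k_chunks(question))
        resp = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt,
            max_tokens=256,
            temperature=0,  # deterministic answers suit factual Q&A
        )
        # Parse the answer out of the completion and return it to the caller.
        return resp["choices"][0]["text"].strip()

    print(answer("What is our refund policy for annual plans?"))  # hypothetical question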

That's it! While this is a simple version of the design, it is cheap, accurate, and suitable for many lightweight use cases. I've used this setup in industry prototypes with great success. The plug-and-play version in the openai-cookbook repository is a convenient starting point.

Advanced designs

I'd like to take a moment to discuss some research advances that may find their way into retrieval-augmented generation architectures. I expect applied LLM products to incorporate most of these within 6-9 months.

Generate-then-read pipelines

These approaches use an LLM to process the user's input before retrieving relevant data.

Essentially, the user's question may look nothing like the text that actually contains a meaningful answer. For example, "What is the syntax for a list comprehension in Python?" is very different from an example in a code repository (like "newlist = [x for x in tables if "customer" in x]"). One proposed approach, Hypothetical Document Embeddings (HyDE), uses the LLM to generate a hypothetical contextual document that may contain made-up details but mimics a real answer. Embedding this document and searching the data store with it retrieves more relevant (real) results; those results are then used to generate the actual answer shown to the user.
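A rough sketch of the HyDE idea, reusing the helpers from the earlier snippets; the prompt wording and model name are illustrative rather than taken from the HyDE paper.

    def hyde_retrieve(question, k=3):
        # Step 1: ask the model for a plausible (possibly inaccurate) answer passage.
        hypothetical = openai.Completion.create(
            model="text-davinci-003",
            prompt=f"Write a short passage that answers this question:\n{question}\n",
            max_tokens=200,
        )["choices"][0]["text"]
        # Step 2: search the chunk index with the hypothetical passage, not the question.
        return top_k_chunks(hypothetical, k=k)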

Similarly, another method called "generate-then-read" (GenRead) builds on this idea by applying a clustering algorithm over multiple generated contextual documents. It generates several sample contexts and ensures they differ in meaningful ways. This biases the language model toward returning more diverse hypothetical contextual documents, which (after embedding) return more varied results from the data store and increase the chance that the completion includes an accurate answer.
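To make the clustering idea concrete, here is a rough sketch under the same assumptions as above; the use of scikit-learn's KMeans and the sampling parameters are my own illustration, not the GenRead authors' code.

    from sklearn.cluster import KMeans

    def diverse_contexts(question, n_samples=8, n_clusters=3):
        # Sample several candidate background passages at high temperature.
        resp = openai.Completion.create(
            model="text-davinci-003",
            prompt=f"Write a background passage relevant to this question:\n{question}\n",
            max_tokens=200,
            temperature=1.0,
            n=n_samples,
        )
        candidates = [c["text"] for c in resp["choices"]]
        # Embed the candidates, cluster them, and keep one representative per cluster.
        vectors = embed(candidates)
        labels = list(KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors))
        return [candidates[labels.index(c)] for c in range(n_clusters)]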

Improved data structures for LLM indexing and response synthesis

The GPT Index project is excellent and well worth a read. It utilizes a collection of data structures that are both created by and optimized for language models. GPT Index supports several types of index, described below; a brief usage sketch follows the index descriptions. The basic response synthesis is "select the top k relevant documents and append them to the context", but there are various implementation strategies to choose from.

List Index - each node represents a text chunk, otherwise unchanged. In the default setting, all nodes are combined into the context (during the response synthesis step).

[Figure: list index structure]

Vector Store Index - this is equivalent to the simple design explained in the previous section. Each text chunk is stored along with an embedding; a query embedding is compared against the chunk embeddings, and the k most similar chunks are returned for inclusion in the context.

[Figure: vector store index structure]

Keyword Index - Supports fast and efficient lexical searches for specific strings.

[Figure: keyword index structure]

Tree Index - this is useful when your data is organized hierarchically. Consider a clinical documentation application: you might want the text to include both high-level guidance ("Here are some general ways to improve heart health") and low-level text (references to side effects and instructions for a specific blood pressure drug regimen). There are several ways of traversing the tree to generate a response; two are shown below.

[Figure: two tree index traversal strategies]

GPT Index provides index composability, which means you can build indexes on top of other indexes. For example, in a code assistant scenario, you could build one tree index over your internal GitHub repositories and another tree index over Wikipedia, then add a keyword index on top of both tree indexes.
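For reference, a minimal usage sketch of the early gpt_index interface; the project has since been renamed LlamaIndex and its API has changed, so class names such as GPTSimpleVectorIndex and the docs/ folder here are illustrative.

    from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

    # Load source files from a folder and build a vector store index over them.
    documents = SimpleDirectoryReader("docs/").load_data()
    index = GPTSimpleVectorIndex(documents)

    # Retrieval and response synthesis both happen inside the query call.
    response = index.query("What are the common side effects of drug X?")
    print(response)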

Extended context size

Some of the methods outlined in this post may sound "hacky" because they work around the relatively small context sizes of current models. There are substantial research efforts aimed at lifting this limitation.

It is expected that GPT-4 will be launched within the next 1-3 months. It is rumored to have a larger context size.

This paper from the Google AI team includes an exploration of many engineering tradeoffs. One of the configurations allows context lengths up to 43,000 tokens.

A new state-space model architecture scales linearly with context size, rather than quadratically like the Transformer. Although that model's performance lags in other areas, it shows that research and development aimed at improving model characteristics such as context size is progressing.

In my opinion, advances in context size will be matched by growing demands for data retrieval; in other words, it is safe to assume that splitting and selecting text will continue to be required, even as these capabilities evolve.

Persistent state (e.g., conversation history)

When an LLM is exposed to users as a conversational interface, a key challenge is keeping the conversation history in context.

An overview of the relevant strategies is beyond the scope of this article; for a recent code example involving progressive summarization and knowledge retrieval, see this LangChain example (https://github.com/langchain-ai/langchain/tree/master/libs/langchain).
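As one concrete pattern, here is a minimal sketch of progressive summarization of chat history using LangChain's summary-buffer memory; class names reflect the 2023-era API, and the token limit is illustrative.

    from langchain.llms import OpenAI
    from langchain.chains import ConversationChain
    from langchain.memory import ConversationSummaryBufferMemory

    llm = OpenAI(temperature=0)
    # Keep recent turns verbatim and summarize older turns once the limit is reached.
    memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=500)
    chat = ConversationChain(llm=llm, memory=memory)

    chat.predict(input="My deployment fails with error 503.")
    chat.predict(input="What did I say the error code was?")  # answered from the retained history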
