LlamaIndex: A New Document Summary Index for QA Systems

From: ChallengeHub

In this blog post, we introduce a brand new LlamaIndex data structure: the Document Summary Index. We describe how it can help provide better retrieval performance compared to traditional semantic search, and walk through an example.

https://github.com/jerryjliu/llama_index



1

Background

One of the core use cases of large language models (LLMs) is question answering over a user's own data. To do this, we pair the LLM with a "retrieval" model that performs information retrieval over a knowledge corpus, then use the LLM to synthesize a response from the retrieved text. This overall framework is called Retrieval-Augmented Generation (RAG).

Most users building LLM-powered QA systems today do some form of the following:

  1. Split each source document into chunks of text

  2. Store the text chunks in a vector database

  3. At query time, retrieve chunks by embedding similarity and/or keyword filters

  4. Synthesize a response over the retrieved chunks (a minimal sketch of this flow follows the list)
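
For reference, here is a minimal sketch of that standard flow. This is an illustration under assumptions (the directory name and query are invented, and the llama_index version is assumed to be the same one used in the snippets later in this post), not code from the original post:

from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex

# 1. load source documents (chunking into nodes happens on index construction)
documents = SimpleDirectoryReader("data").load_data()

# 2. embed the text chunks and store them in a vector index
index = GPTVectorStoreIndex.from_documents(documents)

# 3 + 4. retrieve the top-k chunks by embedding similarity, then synthesize an answer
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What are the sports teams in Toronto?")
print(response)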

For a variety of reasons, this approach yields limited retrieval performance.

2

Limitations of Existing Methods

Embedding-based retrieval over text chunks has some limitations:

  • Text chunks lack global context. Often a question requires context beyond what is indexed in any particular chunk.

  • The top-k value / similarity score threshold must be carefully tuned. Set it too small and you will miss relevant context; set it too large and cost/latency increase while more irrelevant context adds noise.

  • Embeddings do not always select the most relevant context for a given question, because the text embeddings are computed independently of the question.

Adding keyword filters is one way to enhance retrieval results, but it brings its own set of challenges: we need to identify appropriate keywords for each document, either manually or with an NLP keyword-extraction/topic-tagging model, and we also need to infer the right keywords from the query itself (a toy illustration follows).
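
To make that challenge concrete, here is a toy illustration (invented for this post, not from the original) of how a keyword filter only works when the query's inferred keywords happen to overlap with each document's tags:

# toy keyword filter (illustrative only): documents tagged by hand
docs = {
    "toronto": {"canada", "ontario", "sports"},
    "boston": {"usa", "massachusetts", "sports"},
}

def keyword_filter(query_keywords):
    # a document passes the filter only if its tags overlap the query keywords
    return [doc_id for doc_id, tags in docs.items() if tags & set(query_keywords)]

print(keyword_filter(["sports"]))         # ['toronto', 'boston']
print(keyword_filter(["hockey", "nba"]))  # [] -- both docs missed: tags too coarse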

3

Document Summary Index

LlamaIndex proposes a new index that extracts and indexes an unstructured text summary for each document. This index can help improve retrieval performance beyond existing approaches: it indexes more information than a single text chunk, and it carries more semantic meaning than keyword tags. It also allows for a more flexible form of retrieval: we can use both LLM-based retrieval and embedding-based retrieval.

4

How It Works

At build time, we ingest each document and use an LLM to extract a summary from it. We also split each document into text chunks (nodes). Both the summaries and the nodes are stored in our document store abstraction, and we maintain a mapping from each summary to its source document/nodes.
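
Conceptually, the stored state looks something like the following sketch (an illustration of the mapping, not llama_index internals):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DocEntry:
    summary: str          # LLM-extracted summary of the document
    node_ids: List[str]   # ids of the text chunks (nodes) the document was split into

@dataclass
class DocSummaryIndexSketch:
    # mapping from document id to its summary and source nodes
    docs: Dict[str, DocEntry] = field(default_factory=dict)

    def nodes_for_doc(self, doc_id: str) -> List[str]:
        # retrieval over summaries returns *all* nodes of a selected document
        return self.docs[doc_id].node_ids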

At query time, we retrieve relevant documents based on their summaries, using one of the following approaches:

  • LLM-based retrieval: we present sets of document summaries to the LLM and ask it to determine which documents are relevant, along with their relevance scores.

  • Embedding-based retrieval: we retrieve relevant documents by summary embedding similarity (with a top-k cutoff).

Note that retrieval over document summaries (even with the embedding-based approach) differs from embedding-based retrieval over text chunks: the retriever for a document summary index retrieves all nodes of any selected document, instead of returning relevant chunks at the node level.

Storing document summaries also enables LLM-based retrieval. Instead of feeding the entire document to the LLM up front, we can first have the LLM inspect the concise summary to decide whether the document is relevant to the query at all. This leverages the reasoning capabilities of the LLM, which go beyond embedding-based lookup, while avoiding the cost/latency of feeding the entire document to the LLM.

5

Idea

Retrieving documents by their summaries can be thought of as a middle ground between semantic search over text chunks and brute-force summarization over all documents. We look up documents by the relevance of their summaries to the given query, then return all nodes corresponding to the retrieved documents.

Why do this? Retrieving context at the document level provides the user with more context than top-k retrieval over text chunks. And it is a more flexible/automatic approach than topic modeling: no more worrying about whether your text has the right keyword tags!

6

Example

Let's walk through an example that showcases the document summary index, using Wikipedia articles about different cities.

The rest of this guide shows the relevant code snippets; the full walkthrough is available in the accompanying notebook.

We can build a GPTDocumentSummaryIndex over a set of documents, passing in a ResponseSynthesizer object to synthesize summaries for the documents.

from llama_index import (
    SimpleDirectoryReader,
    LLMPredictor,
    ServiceContext,
    ResponseSynthesizer
)
from llama_index.indices.document_summary import GPTDocumentSummaryIndex
from langchain.chat_models import ChatOpenAI

# load docs, define service context
...

# build the index
response_synthesizer = ResponseSynthesizer.from_args(response_mode="tree_summarize", use_async=True)
doc_summary_index = GPTDocumentSummaryIndex.from_documents(
    city_docs, 
    service_context=service_context,
    response_synthesizer=response_synthesizer
)
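
The elided setup above might look like the following (an assumption based on the imports, not the post's exact code; the data/ directory with one Wikipedia article per city is invented):

# load docs (one Wikipedia article per city) -- path is illustrative
city_docs = SimpleDirectoryReader("data").load_data()

# define service context: use ChatGPT for both summarization and retrieval
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, chunk_size_limit=1024
)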

Once indexed, we can get a summary of any given document:

summary = doc_summary_index.get_document_summary("Boston")

Next, let's look at an example of LLM-based index retrieval.

from llama_index.indices.document_summary import DocumentSummaryIndexRetriever

retriever = DocumentSummaryIndexRetriever(
    doc_summary_index,
    # choice_select_prompt=choice_select_prompt,
    # choice_batch_size=choice_batch_size,
    # format_node_batch_fn=format_node_batch_fn,
    # parse_choice_select_answer_fn=parse_choice_select_answer_fn,
    # service_context=service_context
)
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")
print(retrieved_nodes[0].score)
print(retrieved_nodes[0].node.get_text())

The retriever fetches a set of relevant nodes for a given query.

Note that the LLM returns a relevance score in addition to the document text:

8.0
Toronto ( (listen) tə-RON-toh; locally [təˈɹɒɾ̃ə] or [ˈtɹɒɾ̃ə]) is the capital city of the Canadian province of Ontario. With a recorded population of 2,794,356 in 2021, it is the most populous city in Canada...
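
Embedding-based retrieval over the summaries follows the same pattern. Here is a minimal sketch, assuming the DocumentSummaryIndexEmbeddingRetriever class from the same module (check your installed version for the exact class name and parameters):

from llama_index.indices.document_summary import DocumentSummaryIndexEmbeddingRetriever

# retrieve documents whose summary embeddings are most similar to the query,
# then return all nodes of the selected documents
retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=1,
)
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")
print(retrieved_nodes[0].node.get_text())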

High-Level API

# high-level API: build a query engine directly from the index
query_engine = doc_summary_index.as_query_engine(
  response_mode="tree_summarize", use_async=True
)
response = query_engine.query("What are the sports teams in Toronto?")
print(response)

Low-Level API

# use retriever as part of a query engine
from llama_index.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = ResponseSynthesizer.from_args()

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Toronto?")
print(response)


Source: blog.csdn.net/qq_27590277/article/details/130633592