How to summarize large documents using LangChain and OpenAI

LLMs still have some limitations when summarizing very large documents. Here are some ways to mitigate these effects.

Translated from "How to Summarize Large Documents with LangChain and OpenAI" by Usama Jamil.

Large language models make many tasks easier, such as building chatbots, translating languages, and summarizing text. We used to write dedicated models for summarization, and they always had performance issues. Now we can do this easily with large language models (LLMs). For example, state-of-the-art (SOTA) LLMs can already process entire books within their context window. But there are still some limitations when summarizing very large documents.

LLM limitations on large document summarization

The context limit, or context length, of an LLM refers to the number of tokens the model can handle. Each model has its own context length, also known as the maximum token count or token limit. For example, the GPT-4 Turbo model has a context length of 128,000 tokens; it loses information beyond that number of tokens. Some SOTA LLMs have context limits of up to 1 million tokens. However, as context limits increase, LLMs suffer from limitations such as the recency and primacy effects. We will also look at ways to mitigate these effects.

  • The primacy effect in LLMs means that the model places more emphasis on information presented at the beginning of a sequence.
  • The recency effect means that the model emphasizes the most recent information it processes.

Both of these effects bias the model toward specific parts of the input data. The model may skip important information in the middle of the sequence.

The second issue is cost. We can work around the context limit by splitting the text, but we still cannot afford to pass an entire book through the model; it would cost a lot. For example, if we had a book with 1 million tokens and passed it directly to GPT-4, the total cost would be about $90 (prompt and completion tokens). We need a balanced way to summarize the text that takes into account price, context limits, and the full context of the book.
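
To make the cost concern concrete, here is a minimal back-of-the-envelope sketch. The per-token rates and the completion length below are illustrative assumptions, not current OpenAI prices, so plug in the provider's actual pricing.

# Rough cost model: the bill grows linearly with prompt and completion tokens.
# The rates below are illustrative assumptions -- check current pricing.
prompt_rate_per_1k = 0.03       # assumed $ per 1K prompt tokens
completion_rate_per_1k = 0.06   # assumed $ per 1K completion tokens
prompt_tokens = 1_000_000       # e.g. an entire book passed in at once
completion_tokens = 50_000      # assumed length of the generated output
cost = (prompt_tokens / 1000) * prompt_rate_per_1k + (completion_tokens / 1000) * completion_rate_per_1k
print(f"Estimated cost: ${cost:.2f}")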

In this tutorial, you will learn how to summarize an entire book while taking into account the model's price and context limits. Let's start.

Set up the environment

To follow this tutorial, you'll need the following:

  • Python installed
  • An IDE (VS Code works)

To install dependencies, open your terminal and enter the following command:

pip install langchain openai tiktoken fpdf2 pandas

This command will install all required dependencies.

Load books

You will be using Charles Dickens's David Copperfield, which is publicly available for use in this project. Let's load the book using the PyPDFLoader provided by LangChain.

from langchain.document_loaders import PyPDFLoader

# Load the book
loader = PyPDFLoader("David-Copperfield.pdf")
pages = loader.load_and_split()

This loads the entire book, but we are only interested in the main content, so we can skip pages such as the preface and introduction.

# Cut out the opening and closing parts
pages = pages[6:1308]
# Combine the pages, and replace the tabs with spaces
text = ' '.join([page.page_content.replace('\t', ' ') for page in pages])

Now we have the content. Let's print the first 200 characters.
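
For example, a quick check that the content looks right (a minimal sketch; adjust the slice length as you like):

# Print the first 200 characters to verify the content loaded correctly
print(text[:200])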

Preprocessing

Let's remove unnecessary content from the text, such as non-printable characters, extra spaces, etc.

import re
def clean_text(text):
   # Remove the specific phrase 'Free eBooks at Planet eBook.com' and surrounding whitespace
   cleaned_text = re.sub(r'\s*Free eBooks at Planet eBook\.com\s*', '', text, flags=re.DOTALL)
   # Remove extra spaces
   cleaned_text = re.sub(r' +', ' ', cleaned_text)
   # Remove non-printable characters, optionally preceded by 'David Copperfield'
   cleaned_text = re.sub(r'(David Copperfield )?[\x00-\x1F]', '', cleaned_text)
   # Replace newline characters with spaces
   cleaned_text = cleaned_text.replace('\n', ' ')
   # Remove spaces around hyphens
   cleaned_text = re.sub(r'\s*-\s*', '', cleaned_text)
   return cleaned_text
clean_text=clean_text(text)

After cleaning the data, we can dive into the summarization problem.

Load OpenAI API

Before using the OpenAI API, we need to configure it and provide our credentials.

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key-here"

Enter your API key here and it will set the environment variable.

Let's see how many tokens there are in the book:

from langchain import OpenAI
llm = OpenAI()
Tokens = llm.get_num_tokens(clean_text)
print(f"We have {Tokens} tokens in the book")

There are over 466,000 tokens in the book, and passing them all directly to the LLM would be expensive. Therefore, to reduce the cost, we will use K-means clustering to extract the important chunks from the book.

Note: The decision to use K-means clustering was inspired by a tutorial by data expert Greg Kamradt.

In order to get the important parts of the book, let's first divide the book into different chunks.

Split content into documents

We will use LangChain's SemanticChunker utility to split the book content into documents.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
text_splitter = SemanticChunker(
   OpenAIEmbeddings(), breakpoint_threshold_type="interquartile"
)
docs = text_splitter.create_documents([clean_text])

SemanticChunker takes two parameters. The first is the embedding model; the embeddings generated by this model are used to split the text based on semantics. The second is breakpoint_threshold_type, which determines where the text should be split into different chunks based on semantic similarity.

Note: By processing these smaller, semantically similar chunks, we aim to minimize the recency and primacy effects in the LLM. This strategy enables the model to handle each small context more effectively, ensuring more balanced interpretation and response generation.
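
As a quick sanity check before moving on, you can inspect how many chunks were produced and preview one of them (an optional sketch; the chunk count will vary with the breakpoint threshold):

# Inspect the chunking result
print(f"Number of chunks: {len(docs)}")
# Preview the first chunk
print(docs[0].page_content[:200])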

Get embeddings for each document

Now, let's get the embeddings for each generated document. You will use OpenAI's default method to obtain the embeddings.

import numpy as np
import openai
def get_embeddings(text):
   response = openai.embeddings.create(
       model="text-embedding-3-small",
       input=text
   )
   return response.data
embeddings = get_embeddings([doc.page_content for doc in docs])

The get_embeddings method returns the embeddings for all the documents.

OpenAI recently released the text-embedding-3-small model, which is considered cheaper and faster.

Data rearrangement

Next, we convert the list of document contents and their embeddings into a pandas DataFrame for easier data processing and analysis.

import pandas as pd
content_list = [doc.page_content for doc in docs]
df = pd.DataFrame(content_list, columns=['page_content'])
vectors = [embedding.embedding for embedding in embeddings]
array = np.array(vectors)
embeddings_series = pd.Series(list(array))
df['embeddings'] = embeddings_series
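
Before clustering, it can be worth confirming that the DataFrame has one row per document and that each embedding has the expected dimensionality (text-embedding-3-small returns 1,536-dimensional vectors by default). A small optional check:

# Sanity check: one row per document, one embedding vector per row
print(df.shape)
# Embedding dimensionality (1536 for text-embedding-3-small by default)
print(len(df['embeddings'].iloc[0]))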

Efficient clustering using Faiss

We now convert the document vectors into a Faiss-compatible format, cluster them into 50 groups using K-means, and then create a Faiss index for efficient similarity search between documents.

import numpy as np
import faiss
# Convert to float32 if not already
array = array.astype('float32') 
num_clusters = 50
# Vectors dimensionality
dimension = array.shape[1] 
# Train KMeans with Faiss
kmeans = faiss.Kmeans(dimension, num_clusters, niter=20, verbose=True)
kmeans.train(array)
# Directly access the centroids
centroids = kmeans.centroids 
# Create a new index for the original dataset
index = faiss.IndexFlatL2(dimension)
# Add original dataset to the index
index.add(array) 

This K-means step groups the documents into 50 clusters.

Note: The reason for choosing K-means clustering is that each cluster will have similar content or context, since all documents in a cluster have related embeddings, and we will select the document closest to each centroid.

Select the important documents

Now we will select only the most important document from each cluster. To do this, we will pick only the vector closest to each centroid.

D, I = index.search(centroids, 1)

This code uses a search method on the index to find the closest document to each centroid in a list of centroids. It returns two arrays:

  • D, which contains the distances of the nearest documents to their respective centroids, and
  • I, which contains the indices of these closest documents. The second parameter, 1, in the search method specifies that only the single closest document is found for each centroid (a quick shape check follows below).
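
As a quick shape check (optional sketch), with 50 centroids and one nearest neighbour each, both arrays come back shaped (50, 1):

# Both arrays have one row per centroid and one column for the single nearest match
print(D.shape, I.shape)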

Now we need to sort the selected document indices so the documents stay in the book's original order.

sorted_array = np.sort(I, axis=0)
sorted_array=sorted_array.flatten()
extracted_docs = [docs[i] for i in sorted_array]

Get a summary of each document

The next step is to get a summary of each selected document using the GPT-4 model. Because we summarize only the extracted documents rather than the whole book, this keeps costs down. To use GPT-4, let's define the model.

from langchain_openai import ChatOpenAI

model = ChatOpenAI(temperature=0, model="gpt-4")

Define the prompt and use LangChain's prompt template to pass it to the model.

from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template("""
You will be given different passages from a book one by one. Provide a summary of the following text. Your result must be detailed and at least 2 paragraphs. When summarizing, directly dive into the narrative or descriptions from the text without using introductory phrases like 'In this passage'. Directly address the main events, characters, and themes, encapsulating the essence and significant details from the text in a flowing narrative. The goal is to present a unified view of the content, continuing the story seamlessly as if the passage naturally progresses into the summary.
Passage:
```{text}```
SUMMARY:
"""
)

This prompt template will help the model summarize the document more effectively and efficiently.

The next step is to define the LangChain chain using LangChain Expression Language (LCEL).

chain = (
    prompt
    | model
    | StrOutputParser()
)

The summary chain uses StrOutputParser to parse the output. There are other output parsers available to explore.
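
Before looping over every extracted document, you can invoke the chain once on a single passage to confirm the pipeline works end to end (a minimal test sketch):

# Quick test: summarize a single extracted document first
sample_summary = chain.invoke({"text": extracted_docs[0].page_content})
print(sample_summary[:300])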

Finally, you can apply the defined chain to each document to get its summary.

from tqdm import tqdm
final_summary = ""

for doc in tqdm(extracted_docs, desc="Processing documents"):
   # Get the new summary.
   new_summary = chain.invoke({"text": doc.page_content})
   # Append the new summary to the running final summary.
   final_summary += new_summary

The code above applies the chain to each document one by one and appends each summary to final_summary.

Save the summary as a PDF

The next step is to format the summary and save it as a PDF.

from fpdf import FPDF

class PDF(FPDF):
   def header(self):
       # Select Arial bold 15
       self.set_font('Arial', 'B', 15)
       # Move to the right
       self.cell(80)
       # Framed title
       self.cell(30, 10, 'Summary', 1, 0, 'C')
       # Line break
       self.ln(20)

   def footer(self):
       # Go to 1.5 cm from bottom
       self.set_y(-15)
       # Select Arial italic 8
       self.set_font('Arial', 'I', 8)
       # Page number
       self.cell(0, 10, 'Page %s' % self.page_no(), 0, 0, 'C')

# Instantiate PDF object and add a page
pdf = PDF()
pdf.add_page()
pdf.set_font("Arial", size=12)

# FPDF's built-in fonts only support Latin-1, so replace unsupported characters
# Replace 'final_summary' with your actual text variable if different
final_summary_latin1 = final_summary.encode('latin-1', 'replace').decode('latin-1')
pdf.multi_cell(0, 10, final_summary_latin1)

# Save the PDF to a file
pdf_output_path = "s_output1.pdf"
pdf.output(pdf_output_path)

So here is the complete summary of the book in PDF format.

Conclusion

In this tutorial, we explored the complexities of using LLMs to summarize large texts, such as entire books, while addressing challenges related to context limits and cost. We learned how to preprocess the text and implemented a strategy combining semantic chunking and K-means clustering to manage the model's context limits effectively.

By using efficient clustering, we extracted the key passages and reduced the overhead of processing massive amounts of text directly. This approach not only significantly reduces cost by minimizing the number of tokens processed, but also mitigates the recency and primacy effects inherent in LLMs, ensuring balanced consideration of all text passages.

Building AI applications on top of LLM APIs has attracted much attention, and vector databases play an important role here by providing efficient storage and retrieval of contextual embeddings.

MyScaleDB is a vector database designed specifically for AI applications, taking into account factors such as cost, accuracy, and speed. Its SQL-friendly interface allows developers to start building AI applications without having to learn new tools.

