After trying N possibilities in LangChain, I discovered the secrets of chunking!


Chunking is the most challenging problem in building retrieval-augmented generation (RAG) (https://zilliz.com.cn/use-cases/llm-retrieval-augmented-generation) applications. Chunking is the process of dividing text into smaller pieces. Although it sounds simple, there are many details to get right, and different types of text content call for different chunking strategies.

In this tutorial, we will explore the effects of different chunking strategies by applying them to the same text. Visit the link (https://github.com/ytang07/llm-experimentation/blob/main/test_langchain_chunking.py) for the code used in this article.


01.

Introduction to LangChain Chunking


LangChain is an LLM orchestration framework with built-in tools for chunking and loading documents. This chunking tutorial focuses on setting chunking parameters, with minimal use of the LLM itself. In short, we write a function that loads and chunks a document according to its parameters, then prints the resulting text chunks. In the experiments below, we run multiple parameter values through this function.


  • LangChain chunking code: imports and setup


The first part of the code covers imports and setup. Of the many import statements below, os and dotenv are commonly used; here they only handle environment variables.


Next, let's walk through the LangChain and pymilvus imports in more depth.


First are the three imports used for fetching and splitting the documents: NotionDirectoryLoader loads a directory containing markdown/Notion documents, while the MarkdownHeaderTextSplitter and RecursiveCharacterTextSplitter split the text in a markdown document based on headers (the header splitter) or a set of preselected character delimiters (the recursive splitter).
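
To make the difference between the two splitters concrete, here is a minimal sketch (not from the original article) of how MarkdownHeaderTextSplitter attaches header metadata to each split. The sample markdown is invented for illustration:

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Hypothetical markdown, invented for this example
md = """## Levels
Engineers progress through levels.
## Distinguished Engineers
A distinguished engineer shapes company-wide technical direction."""

# Split on second-level headers, storing each header's text under "Section"
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("##", "Section")])
for doc in splitter.split_text(md):
    print(doc.metadata, "->", doc.page_content)
# {'Section': 'Levels'} -> Engineers progress through levels.
# {'Section': 'Distinguished Engineers'} -> A distinguished engineer shapes ...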


Next come the retriever imports. We use Milvus, the OpenAIEmbeddings model, and the OpenAI large language model (LLM). SelfQueryRetriever is LangChain's native retriever, which allows the vector database to "query itself".


The last LangChain import is AttributeInfo, which passes attribute metadata into the SelfQueryRetriever.


As for the pymilvus imports, I typically only use them at the end to clean up the database.


The last step before writing the function is loading the environment variables and declaring some constants. The headers_to_split_on variable lists all the markdown headers we want to split on; path tells LangChain where to find the Notion documents.


import os
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import Milvus
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from pymilvus import connections, utility
from dotenv import load_dotenv


load_dotenv()
zilliz_uri = os.getenv("ZILLIZ_CLUSTER_01_URI")
zilliz_token = os.getenv("ZILLIZ_CLUSTER_01_TOKEN")


headers_to_split_on = [
    ("##", "Section"),
]
path = './notion_docs'


  • Build the chunking experiment function


Building the chunking experiment function is the most critical part of this tutorial. As mentioned before, this function takes parameters for document loading and chunking: the path to the documents, the headers to split on (the splitters), the chunk size, the chunk overlap, and whether to clean up the database by dropping the Collection. This last parameter defaults to True, meaning the Collection is dropped and the database cleaned after each run.


Note that collections should be created and deleted as little as possible to avoid unnecessary overhead.
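
One way to follow that advice, sketched here as an assumption rather than part of the tutorial's function, is to check whether the Collection already exists with pymilvus's utility.has_collection and attach to it instead of re-embedding. The variables (all_splits, test_collection_name) are the ones defined in the function below:

from pymilvus import connections, utility

# Sketch: reuse an existing Collection instead of rebuilding it on every run
connections.connect(uri=zilliz_uri, token=zilliz_token)
if utility.has_collection(test_collection_name):
    # Attach to the existing Collection; skips re-embedding the documents
    vectordb = Milvus(
        embedding_function=OpenAIEmbeddings(),
        connection_args={"uri": zilliz_uri, "token": zilliz_token},
        collection_name=test_collection_name,
    )
else:
    vectordb = Milvus.from_documents(
        documents=all_splits,
        embedding=OpenAIEmbeddings(),
        connection_args={"uri": zilliz_uri, "token": zilliz_token},
        collection_name=test_collection_name,
    )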


The first part of the function loads the documents from the path through the NotionDirectoryLoader. Here, only the content of the first document is grabbed.


Next, get the splitters. First, the markdown splitter splits on the headers passed in above. Then, the recursive splitter splits on the chunk size and overlap.


Once the splitting is complete, we initialize a LangChain Milvus instance using the environment variables, the OpenAI embeddings, the splits, and a Collection name. We also create a list of metadata fields via the AttributeInfo object to help the SelfQueryRetriever understand which "Section" each text chunk belongs to.


With all of the above in place, we get the LLM and pass it to the SelfQueryRetriever. The retriever comes into play when we ask questions about the document. I also set up the function so it reports which chunking strategy it is testing. Finally, the Collection can be dropped on demand.


def test_langchain_chunking(docs_path, splitters, chunk_size, chunk_overlap, drop_collection=True):

    path = docs_path
    loader = NotionDirectoryLoader(path)
    docs = loader.load()
    md_file = docs[0].page_content

    # Let's create groups based on the section headers in our page
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=splitters)
    md_header_splits = markdown_splitter.split_text(md_file)

    # Define our text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    all_splits = text_splitter.split_documents(md_header_splits)

    test_collection_name = f"EngineeringNotionDoc_{chunk_size}_{chunk_overlap}"

    vectordb = Milvus.from_documents(documents=all_splits,
                                     embedding=OpenAIEmbeddings(),
                                     connection_args={"uri": zilliz_uri,
                                                      "token": zilliz_token},
                                     collection_name=test_collection_name)

    metadata_fields_info = [
        AttributeInfo(
            name="Section",
            description="Part of the document that the text comes from",
            type="string or list[string]"
        ),
    ]
    document_content_description = "Major sections of the document"

    llm = OpenAI(temperature=0)
    retriever = SelfQueryRetriever.from_llm(llm, vectordb, document_content_description, metadata_fields_info, verbose=True)

    res = retriever.get_relevant_documents("What makes a distinguished engineer?")
    print(f"""Responses from chunking strategy:
{chunk_size}, {chunk_overlap}""")
    for doc in res:
        print(doc)

    # this is just for rough cleanup, we can improve this
    # lots of user considerations to understand for real experimentation use cases though
    if drop_collection:
        connections.connect(uri=zilliz_uri, token=zilliz_token)
        utility.drop_collection(test_collection_name)

02.

LangChain chunking experiments and results


Now for the exciting part! Let's look at the results of the chunking experiments.


  • Testing LangChain chunking


The following code block shows how to run our experiment function. It defines five experiments, testing chunk sizes of 32, 64, 128, 256, and 512 with chunk overlaps of 4, 8, 16, 32, and 64 respectively. To run the tests, we loop through the list of tuples and call the function written above.


chunking_tests = [(32, 4), (64, 8), (128, 16), (256, 32), (512, 64)]
for test in chunking_tests:
    test_langchain_chunking(path, headers_to_split_on, test[0], test[1])

The outputs are shown below. Let's take a closer look at the results of each experiment. The test question was "What makes a distinguished engineer?"



  • Chunk size 32, overlap 4



Clearly, a chunk size of 32 is too short; this chunking strategy is completely ineffective.


  • Chunk size 64, overlap 8



This strategy does not work well at first either, but it eventually surfaces an answer to the question: Werner Vogels, Chief Technology Officer (CTO) of Amazon.


  • Chunk size 128, overlap 16


At a chunk size of 128, the answers start to look like more complete sentences, with fewer "engineer"-type fragments. This strategy works fairly well and retrieves the Werner Vogels-related text. One drawback is that special characters such as \xa0 and \n appear in the answers, a sign that our chunk size may now be too long.
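
If those characters become a problem, one option (my addition, not part of the original experiment) is a small cleanup pass over the splits before they are inserted into Milvus, right after text_splitter.split_documents in the function above:

# Normalize non-breaking spaces and newlines in each chunk before indexing
for doc in all_splits:
    doc.page_content = doc.page_content.replace("\xa0", " ").replace("\n", " ")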


  • Chunk size 256, overlap 32



Although the answers return relevant content, this chunk size is too long.


  • Chunk size 512, overlap 64



We already saw that a chunk size of 256 is too long, and at 512 the retriever pulls back the contents of entire sections. This raises the question: do we want a single line back, or an entire section's content? The answer depends on the use case.
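
A quick way to inform that judgment (my addition, not part of the original experiments) is to print the size distribution of the splits each strategy produces before committing to one:

# Compare split-size distributions across strategies
lengths = [len(doc.page_content) for doc in all_splits]
print(f"{len(lengths)} chunks | min {min(lengths)} | max {max(lengths)} | avg {sum(lengths) / len(lengths):.0f}")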


03.

Summary


This tutorial explored the effects of five different chunking strategies. When choosing a chunking strategy, determine the most appropriate chunk size based on the results you expect to get back. Next, we will test the effects of different chunk overlaps. Stay tuned!


About the author

Yujian Tang
Developer Advocate, Zilliz
