Knowledge base document processing

This project is a personal knowledge base assistant designed to answer user questions based on the content of a personal knowledge base. The knowledge base should support various types of data and allow users to import, export, and manage them easily. In this project we use several classic Datawhale open source courses as examples, cover multiple file types, and introduce the processing method for each type, so that users can build their own knowledge base without difficulty.

1 Knowledge base design

Our knowledge base uses several classic Datawhale open source courses and (parts of) videos as examples, including:
pdf: "Detailed Explanation of Machine Learning Formulas", PDF version: https://github.com/datawhalechina/pumpkin-book/releases
md: "LLM Introductory Tutorial for Developers, Part 1: Prompt Engineering": https://github.com/datawhalechina/prompt-engineering-for-developers
mp4: "Introduction Guide to Reinforcement Learning": https://www.bilibili.com/video/BV1HZ4y1v7eX/?spm_id_from=333.999.0.0&vd_source=4922e78f7a24c5981f1ddb6a8ee55ab9

Project repository: https://github.com/datawhalechina/llm-universe

We will place the knowledge base source data in the ../../data_base/knowledge_db directory.


2 Document loading

2.1 PDF document

We use PyMuPDFLoader to read the PDF files in the knowledge base. PyMuPDFLoader is the fastest of the PDF parsers; the result contains detailed metadata about the PDF and its pages, and one document is returned per page.

## Install the necessary libraries
pip install rapidocr_onnxruntime -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install "unstructured[all-docs]" -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pymupdf -i https://pypi.tuna.tsinghua.edu.cn/simple

from langchain.document_loaders import PyMuPDFLoader

# Create a PyMuPDFLoader instance; the argument is the path of the PDF to load
loader = PyMuPDFLoader("../../data_base/knowledge_db/pumkin_book/pumpkin_book.pdf")

# Call the loader's load method to load the PDF file
pages = loader.load()

Explore the loaded data. The documents are loaded and stored in the pages variable:

  • The variable type of pages is List.
  • By printing the length of pages, you can see how many pages the PDF contains in total.
print(f"载入后的变量类型为:{
      
      type(pages)},",  f"该 PDF 一共包含 {
      
      len(pages)} 页")
载入后的变量类型为:<class 'list'>, 该 PDF 一共包含 196 页

Each element in pages is a document, and its type is langchain.schema.document.Document. The Document type contains two attributes:

  • page_content contains the content of the document.
  • metadata is descriptive data related to the document.
page = pages[1]
print(f"每一个元素的类型:{type(page)}.", 
    f"该文档的描述性数据:{page.metadata}", 
    f"查看该文档的内容:\n{page.page_content[0:1000]}", 
    sep="\n------\n")
每一个元素的类型:<class 'langchain.schema.document.Document'>.
------
该文档的描述性数据:{'source': '../../data_base/knowledge_db/pumkin_book/pumpkin_book.pdf', 'file_path': '../../data_base/knowledge_db/pumkin_book/pumpkin_book.pdf', 'page': 1, 'total_pages': 196, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'xdvipdfmx (20200315)', 'creationDate': "D:20230303170709-00'00'", 'modDate': '', 'trapped': ''}
------
查看该文档的内容:
前言
“周志华老师的《机器学习》(西瓜书)是机器学习领域的经典入门教材之一,周老师为了使尽可能多的读
者通过西瓜书对机器学习有所了解, 所以在书中对部分公式的推导细节没有详述,但是这对那些想深究公式推
导细节的读者来说可能“不太友好”,本书旨在对西瓜书里比较难理解的公式加以解析,以及对部分公式补充
具体的推导细节。”
读到这里,大家可能会疑问为啥前面这段话加了引号,因为这只是我们最初的遐想,后来我们了解到,周
老师之所以省去这些推导细节的真实原因是,他本尊认为“理工科数学基础扎实点的大二下学生应该对西瓜书
中的推导细节无困难吧,要点在书里都有了,略去的细节应能脑补或做练习”。所以...... 本南瓜书只能算是我
等数学渣渣在自学的时候记下来的笔记,希望能够帮助大家都成为一名合格的“理工科数学基础扎实点的大二
下学生”。
使用说明
• 南瓜书的所有内容都是以西瓜书的内容为前置知识进行表述的,所以南瓜书的最佳使用方法是以西瓜书
为主线,遇到自己推导不出来或者看不懂的公式时再来查阅南瓜书;
• 对于初学机器学习的小白,西瓜书第 1 章和第 2 章的公式强烈不建议深究,简单过一下即可,等你学得
有点飘的时候再回来啃都来得及;
• 每个公式的解析和推导我们都力 (zhi) 争 (neng) 以本科数学基础的视角进行讲解,所以超纲的数学知识
我们通常都会以附录和参考文献的形式给出,感兴趣的同学可以继续沿着我们给的资料进行深入学习;
• 若南瓜书里没有你想要查阅的公式,或者你发现南瓜书哪个地方有错误,请毫不犹豫地去我们 GitHub 的
Issues(地址:https://github.com/datawhalechina/pumpkin-book/issues)进行反馈,在对应版块
提交你希望补充的公式编号或者勘误信息,我们通常会在 24 小时以内给您回复,超过 24 小时未回复的
话可以微信联系我们(微信号:at-Sm1les);
配套视频教程:https://www.bilibili.com/video/BV1Mh411e7VU
在线阅读地址:https://datawhalechina.github.io/pumpkin-book(仅供第 1 版)
最新版 PDF 获取地址:https://github.com/datawhalechina/pumpkin-book/re


2.2 Markdown document

We can read in markdown documents in an almost identical way:

from langchain.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("../../data_base/knowledge_db/prompt_engineering/1. 简介 Introduction.md")
pages = loader.load()

The object read in is exactly the same type as for the PDF document:

print(f"载入后的变量类型为:{
      
      type(pages)},",  f"该 Markdown 一共包含 {
      
      len(pages)} 页")
载入后的变量类型为:<class 'list'>, 该 Markdown 一共包含 1 页
page = pages[0]
print(f"每一个元素的类型:{type(page)}.", 
    f"该文档的描述性数据:{page.metadata}", 
    f"查看该文档的内容:\n{page.page_content[0:]}", 
    sep="\n------\n")


2.3 MP4 video

LangChain provides an interface for fetching and transcribing YouTube videos, but if we want to process a local MP4 video directly, we first need to transcribe it into text and then load the text into LangChain.
We use Whisper to transcribe the video. The installation of Whisper is not described here; for details, refer to the tutorial:
Zhihu | How to install Whisper, the open source free offline speech recognition artifact: https://zhuanlan.zhihu.com/p/595691785
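
Besides the command-line tool used below, Whisper also exposes a Python API. A minimal sketch, assuming the openai-whisper package and ffmpeg are installed and reusing the same file paths as the command-line example below:

import whisper

# Load the large model (weights are downloaded on first use; a smaller model
# such as "base" also works if local resources are limited)
model = whisper.load_model("large")

# Transcribe the local MP4; Whisper extracts the audio track via ffmpeg
result = model.transcribe("../../data_base/knowledge_db/easy_rl/强化学习入门指南.mp4", language="zh")

# Save the transcript next to the video so it can be loaded like any text file
with open("../../data_base/knowledge_db/easy_rl/强化学习入门指南.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])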

Here we use Whisper directly from the command line to write the transcription results into the original directory:

whisper ../../data_base/knowledge_db/easy_rl/强化学习入门指南.mp4 --model large --model_dir whisper-large --language zh --output_dir ../../data_base/knowledge_db/easy_rl

The above command uses the whisper tool to perform the transcription; note that the model_dir parameter should point to the whisper-large model weights you downloaded locally. After the conversion completes, a 强化学习入门指南.txt file is generated in the original directory. We can load this txt file directly:

from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("../../data_base/knowledge_db/easy_rl/强化学习入门指南.txt")
pages = loader.load()

The loaded data attributes are the same as above:

page = pages[0]
print(f"每一个元素的类型:{type(page)}.", 
    f"该文档的描述性数据:{page.metadata}", 
    f"查看该文档的内容:\n{page.page_content[0:1000]}", 
    sep="\n------\n")


3 Document segmentation

Text splitters in LangChain all split based on chunk_size (chunk size) and chunk_overlap (overlap size between adjacent chunks).


  • chunk_size refers to the number of characters or tokens (such as words, sentences, etc.) contained in each chunk.
  • chunk_overlap refers to the number of characters shared between two adjacent chunks; it is used to maintain contextual coherence and avoid losing context during segmentation (see the toy example below).
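
A tiny toy example (the string and the parameter values are chosen purely for illustration) makes the two parameters concrete:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Each chunk holds at most 10 characters; adjacent chunks share up to 3 characters
toy_splitter = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=3)
print(toy_splitter.split_text("abcdefghijklmnopqrstuvwxyz"))
# The string has no natural separators, so it is cut purely by length and each
# chunk starts with the last few characters of the previous chunk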

LangChain provides a variety of document splitting methods, which differ in how they determine the boundaries between chunks, which characters/tokens a chunk consists of, and how chunk size is measured:

  • RecursiveCharacterTextSplitter(): Split text by characters, recursively trying different separators.
  • CharacterTextSplitter(): Split text by characters.
  • MarkdownHeaderTextSplitter(): Split Markdown files based on specified headers (see the sketch after this list).
  • TokenTextSplitter(): Split text by tokens.
  • SentenceTransformersTokenTextSplitter(): Split text by tokens using a sentence-transformers tokenizer.
  • Language(): for splitting code in C++, Python, Ruby, Markdown, etc.
  • NLTKTextSplitter(): Split text into sentences using NLTK (Natural Language Toolkit).
  • SpacyTextSplitter(): Split text into sentences using spaCy.
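
As a brief aside, here is a minimal sketch of MarkdownHeaderTextSplitter; the toy Markdown string and the header labels are made up purely for illustration. Each returned chunk carries the headers it falls under as metadata:

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Map each Markdown header level to a metadata key of our choosing
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# A toy Markdown string; in real use you would pass page.page_content of the loaded md file
md_chunks = md_splitter.split_text("# 简介\n\n这是简介部分的正文。\n\n## 提示工程\n\n这是提示工程小节的正文。")
for chunk in md_chunks:
    print(chunk)  # each chunk records the headers above it as metadata

The rest of this section demonstrates RecursiveCharacterTextSplitter, which is what the project actually uses: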
''' 
* RecursiveCharacterTextSplitter: recursive character text splitting.
  It splits recursively on different characters (in this priority order: ["\n\n", "\n", " ", ""]),
  so that semantically related content stays in the same chunk as much as possible.
  The four parameters of RecursiveCharacterTextSplitter to pay attention to are:

* separators - the list of separator strings
* chunk_size - the character limit for each chunk
* chunk_overlap - the length of the overlap between two adjacent chunks
* length_function - the function used to measure length
'''
# Import the text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Maximum length of a single chunk in the knowledge base
CHUNK_SIZE = 500

# Overlap length between adjacent chunks in the knowledge base
OVERLAP_SIZE = 50

# Here we use the PDF file as an example
from langchain.document_loaders import PyMuPDFLoader

# Create a PyMuPDFLoader instance; the argument is the path of the PDF to load
loader = PyMuPDFLoader("../../data_base/knowledge_db/pumkin_book/pumpkin_book.pdf")

# Call the loader's load method to load the PDF file
pages = loader.load()
page = pages[1]

# Use the recursive character text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=OVERLAP_SIZE
)
text_splitter.split_text(page.page_content[0:1000])
['前言\n“周志华老师的《机器学习》(西瓜书)是机器学习领域的经典入门教材之一,周老师为了使尽可能多的读\n者通过西瓜书对机器学习有所了解, 所以在书中对部分公式的推导细节没有详述,但是这对那些想深究公式推\n导细节的读者来说可能“不太友好”,本书旨在对西瓜书里比较难理解的公式加以解析,以及对部分公式补充\n具体的推导细节。”\n读到这里,大家可能会疑问为啥前面这段话加了引号,因为这只是我们最初的遐想,后来我们了解到,周\n老师之所以省去这些推导细节的真实原因是,他本尊认为“理工科数学基础扎实点的大二下学生应该对西瓜书\n中的推导细节无困难吧,要点在书里都有了,略去的细节应能脑补或做练习”。所以...... 本南瓜书只能算是我\n等数学渣渣在自学的时候记下来的笔记,希望能够帮助大家都成为一名合格的“理工科数学基础扎实点的大二\n下学生”。\n使用说明\n• 南瓜书的所有内容都是以西瓜书的内容为前置知识进行表述的,所以南瓜书的最佳使用方法是以西瓜书\n为主线,遇到自己推导不出来或者看不懂的公式时再来查阅南瓜书;\n• 对于初学机器学习的小白,西瓜书第 1 章和第 2 章的公式强烈不建议深究,简单过一下即可,等你学得',
 '有点飘的时候再回来啃都来得及;\n• 每个公式的解析和推导我们都力 (zhi) 争 (neng) 以本科数学基础的视角进行讲解,所以超纲的数学知识\n我们通常都会以附录和参考文献的形式给出,感兴趣的同学可以继续沿着我们给的资料进行深入学习;\n• 若南瓜书里没有你想要查阅的公式,或者你发现南瓜书哪个地方有错误,请毫不犹豫地去我们 GitHub 的\nIssues(地址:https://github.com/datawhalechina/pumpkin-book/issues)进行反馈,在对应版块\n提交你希望补充的公式编号或者勘误信息,我们通常会在 24 小时以内给您回复,超过 24 小时未回复的\n话可以微信联系我们(微信号:at-Sm1les);\n配套视频教程:https://www.bilibili.com/video/BV1Mh411e7VU\n在线阅读地址:https://datawhalechina.github.io/pumpkin-book(仅供第 1 版)\n最新版 PDF 获取地址:https://github.com/datawhalechina/pumpkin-book/re']
split_docs = text_splitter.split_documents(pages)
print(f"切分后的文件数量:{len(split_docs)}")
切分后的文件数量:737
print(f"切分后的字符数(可以用来大致评估 token 数):{
      
      sum([len(doc.page_content) for doc in split_docs])}")
切分后的字符数(可以用来大致评估 token 数):314712
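
If you want an actual token count rather than a character-based estimate, here is a minimal sketch using the tiktoken package (an assumption on our part: cl100k_base is the encoding used by recent OpenAI models, and other embedding models tokenize differently):

import tiktoken

# Count the tokens of every chunk with the cl100k_base encoding
enc = tiktoken.get_encoding("cl100k_base")
total_tokens = sum(len(enc.encode(doc.page_content)) for doc in split_docs)
print(f"切分后的 token 总数:{total_tokens}")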

4 Document vectorization (embeddings)

In machine learning and natural language processing (NLP), Embeddings are a technique that converts categorical data, such as words, sentences, or entire documents, into real-number vectors. These real vectors can be better understood and processed by computers. The main idea behind embedding is that similar or related objects should be close together in the embedding space.
For example, we can use word embeddings to represent text data. In word embedding, each word is converted into a vector that captures the semantic information of the word. For example, the two words "king" and "queen" will be very close in the embedding space because they have similar meanings. And "apple" and "orange" will also be close because they are both fruits. The two words "king" and "apple" will be farther apart in the embedding space because their meanings are different.

Let's now embed the chunks we created above.
Three approaches are shown here: using an OpenAI model to generate embeddings, using a model from HuggingFace to generate embeddings, or using the API of another platform.

  • The OpenAI model requires an API key; it is relatively expensive for a large number of tokens, but very convenient.
  • HuggingFace models can be deployed locally and suitable models can be chosen freely, which is very flexible, but it places some demands on local resources.
  • APIs from other platforms (here, Zhipu AI) can be used by readers who cannot easily obtain an OpenAI key.

If you just want to try things out, you can use the pre-generated embeddings directly, or deploy a small model locally.
**HuggingFace** is an excellent open source library: we only need to provide the model name and it will automatically download and load the corresponding model for us.

# Before running, configure your API keys as environment variables (e.g. in a .env file)
import os
import openai
import zhipuai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read the local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
zhipuai.api_key = os.environ['ZHIPUAI_API_KEY']
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from zhipuai_embedding import ZhipuAIEmbeddings

# embedding = OpenAIEmbeddings() 
# embedding = HuggingFaceEmbeddings(model_name="moka-ai/m3e-base")
embedding = ZhipuAIEmbeddings()
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query1 = "机器学习"
query2 = "强化学习"
query3 = "大语言模型"

# Generate an embedding for each query with the chosen embedding class
emb1 = embedding.embed_query(query1)
emb2 = embedding.embed_query(query2)
emb3 = embedding.embed_query(query3)

# Convert the results to numpy arrays for the calculations below
emb1 = np.array(emb1)
emb2 = np.array(emb2)
emb3 = np.array(emb3)

We can inspect the embedding directly; its dimensionality depends on the model used.

print(f"{
      
      query1} 生成的为长度 {
      
      len(emb1)} 的 embedding , 其前 30 个值为: {
      
      emb1[:30]}")
机器学习 生成的为长度 1024 的 embedding , 其前 30 个值为: [-0.02768379  0.07836673  0.1429528  -0.1584693   0.08204    -0.15819356
 -0.01282174  0.18076552  0.20916627  0.21330206 -0.1205181  -0.06666514
 -0.16731478  0.31798768  0.0680017  -0.13807729 -0.03469152  0.15737721
  0.02108428 -0.29145902 -0.10099868  0.20487919 -0.03603597 -0.09646764
  0.12923686 -0.20558454  0.17238656  0.03429411  0.1497675  -0.25297147]

We have generated the corresponding vectors, but how do we measure the relevance between a document and a question?
Here are two commonly used methods:

  • Calculate the dot product of two vectors.
  • Calculate the cosine similarity of two vectors.

The dot product is a scalar obtained by multiplying the elements at corresponding positions of two vectors and summing the results. The larger the dot product, the more similar the two vectors are.
Here we compute it directly with numpy:

print(f"{
      
      query1}{
      
      query2} 向量之间的点积为:{
      
      np.dot(emb1, emb2)}")
print(f"{
      
      query1}{
      
      query3} 向量之间的点积为:{
      
      np.dot(emb1, emb3)}")
print(f"{
      
      query2}{
      
      query3} 向量之间的点积为:{
      
      np.dot(emb2, emb3)}")
机器学习 和 强化学习 向量之间的点积为:17.218882120572722
机器学习 和 大语言模型 向量之间的点积为:16.522186236712727
强化学习 和 大语言模型 向量之间的点积为:11.368461841901752

Dot product: simple and fast to compute, with no extra normalization step, but it is influenced by the magnitudes of the vectors as well as their direction.
Cosine similarity: normalizes away the magnitudes and compares only the direction of the vectors, so the scores are directly comparable.
Cosine similarity divides the dot product of two vectors by the product of their norms. Its basic formula is:

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$$

The cosine function ranges between -1 and 1, so the cosine similarity of two vectors lies in [-1, 1]. When the angle between the two vectors is 0° (the vectors coincide in direction) the similarity is 1; when the angle is 180° (the vectors point in opposite directions) the similarity is -1. In other words, the closer the value is to 1, the more similar the two vectors are.
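
Written out with numpy, the formula is a one-liner (a minimal sketch; the sklearn cosine_similarity call used below computes the same quantity for batches of vectors):

import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    # dot product divided by the product of the two vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))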

print(f"{
      
      query1}{
      
      query2} 向量之间的余弦相似度为:{
      
      cosine_similarity(emb1.reshape(1, -1) , emb2.reshape(1, -1) )}")
print(f"{
      
      query1}{
      
      query3} 向量之间的余弦相似度为:{
      
      cosine_similarity(emb1.reshape(1, -1) , emb3.reshape(1, -1) )}")
print(f"{
      
      query2}{
      
      query3} 向量之间的余弦相似度为:{
      
      cosine_similarity(emb2.reshape(1, -1) , emb3.reshape(1, -1) )}")
机器学习 和 强化学习 向量之间的余弦相似度为:[[0.68814796]]
机器学习 和 大语言模型 向量之间的余弦相似度为:[[0.63382724]]
强化学习 和 大语言模型 向量之间的余弦相似度为:[[0.43555894]]

It can be seen that the model considers machine learning and reinforcement learning to be more closely related, while the correlation between reinforcement learning and large language models is weaker. (This is related to when the training corpus was collected; the embedding model's corpus probably contained little material about large language models.)

So far, we have learned the basic processing of documents, but how do we manage the embeddings we generate and find the content most relevant to a query? Do we need to traverse all documents every time? A vector database can help us manage and search these vectors efficiently.
