Implement a ChatBlog using LangChain


Preface

Through this article you will learn how to use langchain to build your own knowledge-base Q&A.
In fact, the principles behind most chatpdf-style products are similar. I will roughly divide the process into the following four steps:

  1. build knowledge base
  2. Vectorize the knowledge base
  3. recall
  4. Reading comprehension with LLM

Next, let's look at how to build a knowledge base out of our own blogs. The knowledge-base data used in this ChatBlog comes from a blog I wrote before: Retrieval Question and Answer System Based on Sentence-Bert.

Environment

As usual, the environment is an essential part:

langchain==0.0.148
openai==0.27.4
chromadb==0.3.21
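
If you are starting from scratch, installing the pinned versions looks like this (note: the code below also imports beautifulsoup4 and lxml for HTML parsing, which the original list does not pin):

pip install langchain==0.0.148 openai==0.27.4 chromadb==0.3.21 beautifulsoup4 lxml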

1. Build a knowledge base

This part is relatively simple, so let's go straight to the code:

import re

from bs4 import BeautifulSoup
from langchain.docstore.document import Document


def get_blog_text():
    data_path = 'blog.txt'
    with open(data_path, 'r') as f:
        data = f.read()
    # Strip the HTML tags and keep only the visible text
    soup = BeautifulSoup(data, 'lxml')
    text = soup.get_text()
    return text


# Custom sentence-based splitting, so that no sentence gets cut in half
def split_paragraph(text, max_length=300):
    """Split the article into paragraphs of at most max_length characters."""
    # Drop newlines and collapse runs of whitespace
    text = text.replace('\n', '')
    text = re.sub(r'\s+', ' ', text)

    # First split the article into sentences on Chinese/English
    # end-of-sentence punctuation; the capturing group keeps the delimiters
    sentences = re.split(r'([;。！!.？?])', text)

    # Re-attach each delimiter to the sentence that precedes it
    new_sents = []
    for i in range(len(sentences) // 2):
        sent = sentences[2 * i] + sentences[2 * i + 1]
        new_sents.append(sent)
    if len(sentences) % 2 == 1:
        new_sents.append(sentences[-1])

    # Then pack whole sentences into paragraphs of up to max_length
    paragraphs = []
    current_length = 0
    current_paragraph = ""
    for sentence in new_sents:
        sentence_length = len(sentence)
        if current_length + sentence_length <= max_length:
            current_paragraph += sentence
            current_length += sentence_length
        else:
            paragraphs.append(current_paragraph.strip())
            current_paragraph = sentence
            current_length = sentence_length
    paragraphs.append(current_paragraph.strip())

    # Wrap each paragraph in a langchain Document
    documents = []
    for paragraph in paragraphs:
        new_doc = Document(page_content=paragraph)
        print(new_doc)
        documents.append(new_doc)
    return documents

content = get_blog_text()
documents = split_paragraph(content)

It must be explained here that I did not use the document-splitting utilities langchain provides. langchain ships with many ways to split documents; interested readers can look at the source code in langchain.text_splitter. The screenshot below shows roughly which splitter classes exist. They are all similar in spirit: the goal is to split the text into segments as reasonably as possible.
[Screenshot: the splitter classes available in langchain.text_splitter]
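
For comparison, here is what the built-in route would look like with one of those classes (a minimal sketch, assuming langchain 0.0.148):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tries separators in order ("\n\n", "\n", " ", "") so that chunks break
# at natural boundaries where possible
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=0)
documents = splitter.create_documents([content])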
Note the max_length we set here. With chatgpt it can be at most 4096, because chatgpt allows an input of at most 4096 tokens; converted to Chinese the usable text is actually shorter, since a Chinese character often maps to more than one token, and the prompt itself also consumes tokens, so some headroom has to be reserved.
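
If you want to budget precisely instead of guessing by character count, you can count tokens with tiktoken (a sketch; tiktoken is OpenAI's tokenizer library and is not in the pinned environment above):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def num_tokens(text):
    # How many tokens this text will consume in the model's context window
    return len(enc.encode(text))

# Keep max_length small enough that context + prompt + answer stay under 4096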

If the splitting is done badly, the impact on the output is quite large. Here we split by sentence; for a blog it is actually more reasonable to split by subtitle, which is what CSDN's Q&A robot does. Haha, allow me a shameless plug here: the effect is still very good, surpassing all humans. If you are not convinced, you can challenge it:
https://ask.csdn.net/

Later I will also find time to write a blog about the implementation details of the CSDN question-and-answer robot, so stay tuned.


2. Vectorize the knowledge base

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma


# Persist the embedding data to local disk
def persist_embedding(documents):
    persist_directory = 'db'
    embedding = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(documents=documents, embedding=embedding, persist_directory=persist_directory)
    vectordb.persist()
    vectordb = None  # drop the handle; the data now lives on disk

By default OpenAIEmbeddings uses the text-embedding-ada-002 model to do the embedding. You can also change it to another one: langchain provides the integrations shown below, and you can even load a local sentence-vector model. One thing to note here: if you use openai as the embedding model, you need network access to OpenAI's API (in China, that means a proxy/VPN).

[Screenshot: embedding integrations provided by langchain]
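
For example, a local sentence-vector model can be plugged in through HuggingFaceEmbeddings (a minimal sketch; it needs the sentence-transformers package, and the model name below is only an illustration):

from langchain.embeddings import HuggingFaceEmbeddings

# Runs locally, so no OpenAI access is needed; quality depends on the model.
# Remember to load with the same embedding you persisted with.
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
vectordb = Chroma.from_documents(documents=documents, embedding=embedding, persist_directory='db')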

After vectorization we save the results, so that next time we can load them directly. Here I use Chroma to store the vectorized data, but langchain also supports other vector databases, as follows:

[Screenshot: vector stores supported by langchain]

This is actually the first time I have used Chroma; interested readers can explore it themselves. FAISS is probably used more often. In the Q&A robot I use pgvector, because our database is PostgreSQL and pgvector is its vector-storage extension; there is no special reason beyond that. In fact the various vector databases are all much alike. What really affects recall speed and quality is how the index is built; the best-known method is HNSW, which is worth reading about if you are interested.
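
If you would rather use FAISS, the shape of the code is almost identical (a minimal sketch, assuming the faiss-cpu package is installed):

from langchain.vectorstores import FAISS

# Build the index from the same documents and embedding as above,
# then save it to / load it from a local folder
vectordb = FAISS.from_documents(documents, embedding)
vectordb.save_local('faiss_db')
vectordb = FAISS.load_local('faiss_db', embedding)
retriever = vectordb.as_retriever(search_kwargs={"k": 5})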

3. Recall

retriever = None


def load_embedding():
    global retriever
    embedding = OpenAIEmbeddings()
    # Reload the persisted vectors; the embedding function must match the
    # one used when the data was persisted
    vectordb = Chroma(persist_directory='db', embedding_function=embedding)
    retriever = vectordb.as_retriever(search_kwargs={"k": 5})

k=5 means the top 5 results are recalled.

The as_retriever function also has a search_type parameter, which defaults to similarity. The parameter is explained as follows:

search_type Search type: "similarity" or "mmr". search_type="similarity" uses a similarity search in the retriever object, where it selects the text block vector most similar to the question vector. search_type="mmr" uses maximum marginal relevance search, where similarity is optimized for diversity among the queried selected documents.
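
So if the top-5 recalled paragraphs are too similar to one another, switching to MMR is a one-line change (a minimal sketch against the same vectordb as above):

# Trade a little raw similarity for diversity among the recalled paragraphs
retriever = vectordb.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5},
)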

4. Use LLM for reading comprehension

from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate


def prompt(query):
    # The template (in Chinese, matching the knowledge base) tells the model:
    # carefully judge whether the query is related to the given Context,
    # answer only from the provided text, and reply "我不知道" ("I don't know")
    # if the query is unrelated to the material.
    prompt_template = """请注意:请谨慎评估query与提示的Context信息的相关性,只根据本段输入文字信息的内容进行回答,如果query与提供的材料无关,请回答"我不知道",另外也不要回答无关答案:
    Context: {context}
    Question: {question}
    Answer:"""
    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )
    # Recall the paragraphs most relevant to the query
    docs = retriever.get_relevant_documents(query)
    # Stuff the recalled paragraphs into the prompt and let the LLM answer
    chain = load_qa_chain(ChatOpenAI(temperature=0), chain_type="stuff", prompt=PROMPT)
    result = chain({"input_documents": docs, "question": query}, return_only_outputs=True)

    return result['output_text']

In essence, the recalled text is used as part of the prompt, and chatgpt then distills the answer from it. The splitting mentioned earlier has a great impact on the result, and that shows up here too: if the splitting is poor, the recalled paragraphs are poor, and it is hard for the prompt to produce a good answer.

Note: network access to OpenAI is required here as well.
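
Putting the pieces together, a minimal driver that wires up the functions above might look like this (a sketch; the API key placeholder and the sample query are mine, not from the original post):

import os

# OPENAI_API_KEY must be available to langchain; set it here or in your shell
os.environ.setdefault('OPENAI_API_KEY', 'sk-...')

if __name__ == '__main__':
    content = get_blog_text()
    documents = split_paragraph(content)
    persist_embedding(documents)  # only needed the first time
    load_embedding()
    print(prompt('Sentence-Bert召回系统是怎么做的?'))  # example query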

5. Effect

[Screenshot: ChatBlog answering a question about the blog]
The answer is spot on.

Summary

1. The whole thing works like reading comprehension, but you can adjust the prompt, for example: 请你结合Context和你自己现有的知识, 回答以下问题 ("please combine the Context with your own existing knowledge to answer the following question")
2. All code: https://github.com/seanzhang-zhichen/ChatBlog

Origin: blog.csdn.net/qq_44193969/article/details/130815310