Preface
In this article, you will learn how to use langchain to build a Q&A system over your own knowledge base. The principles behind most chatpdf-style products are similar; I will roughly divide the process into the following four steps:
- Build the knowledge base
- Vectorize the knowledge base
- Recall
- Reading comprehension with an LLM
Next, let's look at how to build ChatBlog, using my own blog posts as the knowledge-base data. The knowledge base used in this article comes from a blog I wrote earlier: Retrieval Question and Answer System Based on Sentence-Bert.
Environment
As usual, the environment comes first:
langchain==0.0.148
openai==0.27.4
chromadb==0.3.21
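The pinned versions above can be installed in one go. Setting the API key as shown is my own addition (any way of providing it to the openai client works); the key value is a placeholder:

```shell
pip install langchain==0.0.148 openai==0.27.4 chromadb==0.3.21
# the OpenAI embedding and chat calls below read the key from the environment
export OPENAI_API_KEY="sk-..."
```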
1. Build a knowledge base
This part is relatively simple, so let's go straight to the code:
```python
import re

from bs4 import BeautifulSoup  # also requires beautifulsoup4 and lxml installed
from langchain.docstore.document import Document


def get_blog_text():
    data_path = 'blog.txt'
    with open(data_path, 'r') as f:
        data = f.read()
    soup = BeautifulSoup(data, 'lxml')
    text = soup.get_text()
    return text


# Custom sentence-level splitting, so that no sentence is cut in half
def split_paragraph(text, max_length=300):
    """Split the article into paragraphs."""
    text = text.replace('\n', '')
    text = re.sub(r'\s+', ' ', text)
    # First split the article into sentences, keeping the punctuation
    sentences = re.split('(;|。|!|\!|\.|?|\?)', text)
    new_sents = []
    for i in range(int(len(sentences) / 2)):
        # Glue each sentence back onto its trailing punctuation mark
        sent = sentences[2 * i] + sentences[2 * i + 1]
        new_sents.append(sent)
    if len(sentences) % 2 == 1:
        new_sents.append(sentences[len(sentences) - 1])
    # Then merge sentences into paragraphs no longer than max_length
    paragraphs = []
    current_length = 0
    current_paragraph = ""
    for sentence in new_sents:
        sentence_length = len(sentence)
        if current_length + sentence_length <= max_length:
            current_paragraph += sentence
            current_length += sentence_length
        else:
            paragraphs.append(current_paragraph.strip())
            current_paragraph = sentence
            current_length = sentence_length
    paragraphs.append(current_paragraph.strip())
    documents = []
    for paragraph in paragraphs:
        new_doc = Document(page_content=paragraph)
        print(new_doc)
        documents.append(new_doc)
    return documents


content = get_blog_text()
documents = split_paragraph(content)
```
One thing to note: I did not use the document splitters that langchain provides. langchain has many ways to split documents; interested readers can look at the source code in langchain.text_splitter. There are several splitter types, but they are all similar: the goal is simply to split the text into reasonable segments.
We also set a max_length here. If you use chatgpt, this length can be at most 4096, because chatgpt allows at most 4096 input Tokens (and converted to Chinese, the usable text is actually shorter). The prompt itself also consumes Tokens, so some room needs to be reserved.
If the segmentation is done poorly, the impact on the output is considerable. Here we split by sentence; in practice it is more reasonable to split by the blog's subheadings, which is what CSDN's Q&A robot does. A shameless plug here: its results are quite good, surpassing all humans. If you don't believe it, you can challenge it yourself:
https://ask.csdn.net/
Later I will also find time to write a blog about the CSDN Q&A robot and share the implementation details, so stay tuned.
2. Vectorize the knowledge base
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma


# Persist the vector data
def persist_embedding(documents):
    # Write the embedding data to local disk
    persist_directory = 'db'
    embedding = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(documents=documents, embedding=embedding, persist_directory=persist_directory)
    vectordb.persist()
    vectordb = None
```
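For readers who want to see what "persist once, reload later" amounts to without a vector database, here is a minimal stdlib-only stand-in (the file name and vectors are made up for illustration; Chroma does this far more efficiently, with real index structures):

```python
import json
import os
import tempfile

# Write embeddings to disk once, then reload them instead of
# re-calling the (paid) embedding API on every run.
def persist(vectors, path):
    with open(path, 'w') as f:
        json.dump(vectors, f)

def load(path):
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), 'db.json')
persist({"doc-0": [0.1, 0.2]}, path)
print(load(path))  # → {'doc-0': [0.1, 0.2]}
```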
The embedding here is done with OpenAIEmbeddings' default model, text-embedding-ada-002. You can also change it to another one: langchain provides several embedding backends, and you can also load a local sentence-vector model to do the embedding. One thing to note: if you use openai's embedding model, you need to be able to reach the openai API (in China, that means a VPN).
After vectorization we save the results, so that next time we can simply load them instead of re-embedding. I use Chroma here to store the vectorized data, but langchain supports many other vector databases as well.
This is actually my first time using Chroma; interested readers can explore it on their own. FAISS is probably used more often. In the Q&A robot I use pgvector, because our database is PostgreSQL and pgvector is its vector-storage extension; there is no special reason beyond that. In fact, the various vector databases are all similar. What really affects recall speed and quality is how the index is built; the best-known method is HNSW, which is worth reading about if you are interested.
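To make concrete what an index like HNSW is speeding up, here is the naive baseline it replaces: score every stored vector against the query and take the top k. This linear scan is fine for a blog-sized corpus but is exactly what becomes too slow at scale (the vectors below are toy values I made up):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Brute-force recall: O(n) comparisons per query.
def brute_force_top_k(query, vectors, k=2):
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

store = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(brute_force_top_k([1.0, 0.1], store))  # → [0, 2]
```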
3. Recall
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

retriever = None


def load_embedding():
    global retriever
    embedding = OpenAIEmbeddings()
    vectordb = Chroma(persist_directory='db', embedding_function=embedding)
    retriever = vectordb.as_retriever(search_kwargs={"k": 5})
```
k=5 means the top 5 recalled results. The as_retriever function also takes a search_type parameter, which defaults to similarity. The parameter is documented as follows:
search_type — search type: "similarity" or "mmr". search_type="similarity" performs a similarity search in the retriever object, selecting the text-chunk vectors most similar to the question vector. search_type="mmr" uses maximal marginal relevance search, which optimizes for diversity among the selected documents while keeping them relevant to the query.
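To make the MMR description concrete, here is a small greedy MMR sketch over toy vectors (my own illustration of the idea, not langchain's implementation): each round it picks the candidate that balances relevance to the query against redundancy with what was already picked.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance: relevance minus redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -float('inf')
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # redundancy = similarity to the closest already-selected doc
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

# doc 1 is nearly a duplicate of doc 0; with diversity weighted heavily,
# MMR skips it in favor of the less similar doc 2
print(mmr([1, 0], [[1, 0], [1, 0.01], [0.5, 0.5]], k=2, lambda_mult=0.3))  # → [0, 2]
```

A pure similarity search would return docs 0 and 1 here, answering with two near-identical chunks; MMR trades a little relevance for coverage.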
4. Use LLM for reading comprehension
```python
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate


def prompt(query):
    # The template (in Chinese) says: carefully judge whether the query is
    # related to the provided Context; answer only from the given text, and
    # reply "我不知道" ("I don't know") if the material is unrelated.
    prompt_template = """请注意:请谨慎评估query与提示的Context信息的相关性,只根据本段输入文字信息的内容进行回答,如果query与提供的材料无关,请回答"我不知道",另外也不要回答无关答案:
Context: {context}
Question: {question}
Answer:"""
    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )
    docs = retriever.get_relevant_documents(query)
    # Prompt the model with the recalled docs and return the answer
    chain = load_qa_chain(ChatOpenAI(temperature=0), chain_type="stuff", prompt=PROMPT)
    result = chain({"input_documents": docs, "question": query}, return_only_outputs=True)
    return result['output_text']
```
In essence, the recalled text becomes part of the prompt, and chatgpt then summarizes an answer from it. The segmentation mentioned earlier has a big impact here too: if the segmentation is poor, the recalled data will be poor, and it is hard for chatgpt to extract an answer from the prompt.
Note: access to the openai API is also required here.
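What the "stuff" chain type does can be sketched in plain Python: it stuffs all recalled chunks into a single prompt and makes one completion request. This is my own simplified illustration (chunk strings and question are made up), not langchain's code:

```python
# Simplified English version of the template used above
prompt_template = """Context: {context}
Question: {question}
Answer:"""

def build_stuff_prompt(docs, question):
    # "stuff" = concatenate every recalled chunk into the context slot
    context = "\n\n".join(docs)
    return prompt_template.format(context=context, question=question)

prompt = build_stuff_prompt(["chunk one", "chunk two"], "What is ChatBlog?")
print(prompt)
```

Other chain types (e.g. map_reduce) instead query the model once per chunk and combine the answers, which avoids the token limit but costs more calls.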
5. Effect
The answer it gives is spot-on.
Summary
1. The whole thing works much like reading comprehension, but you can adjust the prompt, for example: 请你结合Context和你自己现有的知识, 回答以下问题 ("Please combine the Context with your own existing knowledge to answer the following question")
2. Full code: https://github.com/seanzhang-zhichen/ChatBlog