1. Document-based Q&A

1.1 Create a vector store

CSVLoader loads the CSV data; the loader is later used together with the model.
DocArrayInMemorySearch is used as the vector store. It is an in-memory store, so there is no need to connect to an external database.
To create the vector store, import VectorstoreIndexCreator, the vector store index creator.

from langchain.chains import RetrievalQA  # retrieval QA chain: runs retrieval over the documents
from langchain.chat_models import ChatOpenAI  # OpenAI chat model
from langchain.document_loaders import CSVLoader  # document loader for data stored as CSV
from langchain.vectorstores import DocArrayInMemorySearch  # vector store
from IPython.display import display, Markdown  # helpers for displaying output in Jupyter

file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
# Inspect the data
import pandas as pd
data = pd.read_csv(file,header=None)

The data is text with two fields: name and description.
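To make the loading step concrete, here is a minimal standard-library sketch of how a CSV file can be turned into one text document per row. It assumes the loader emits each row as "column: value" lines, which is how CSVLoader is understood to behave. The function name rows_to_documents and the two sample rows are hypothetical and are not taken from the real catalog file.

```python
import csv
import io
from typing import List


def rows_to_documents(csv_text: str) -> List[str]:
    """Turn each CSV row into one text document made of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))  # first row is the header
    documents = []
    for row in reader:
        lines = [f"{key.strip()}: {value.strip()}" for key, value in row.items()]
        documents.append("\n".join(lines))
    return documents


# Hypothetical sample with the same two fields as the catalog (name, description).
sample = (
    "name,description\n"
    "Sun Shirt,UPF 50+ rated shirt\n"
    "Rain Jacket,Waterproof shell\n"
)
docs_text = rows_to_documents(sample)
print(docs_text[0])
```

Each resulting string plays the role of one document's page content, which is the unit that later gets embedded and retrieved.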

# Create the vector store
from langchain.indexes import VectorstoreIndexCreator  # vector store index creator
'''
Specify the vector store class. Once the creator is built, call from_loaders on it,
passing in a list of document loaders.
'''
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])
query ="Please list all your shirts with sun protection \
in a table in markdown and summarize each one."
response = index.query(query)  # query the index to generate a response
display(Markdown(response))  # display the content returned by the query

The result is a Markdown table with the name and a summarized description of every shirt with sun protection.

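'''
Create embeddings for the loaded text so that they can be stored in a vector store,
using the vector store's from_documents method. It takes a list of documents and
an embedding object, and builds the overall vector store.
'''
from langchain.embeddings import OpenAIEmbeddings
docs = loader.load()  # load the CSV rows as a list of documents
embeddings = OpenAIEmbeddings()  # embedding model
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)
query = "Please suggest a shirt with sunblocking"
docs = db.similarity_search(query)  # find texts similar to the query; returns a list of documents
len(docs)  # four documents are returned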

# Answer questions about the documents
retriever = db.as_retriever()  # create a retriever, the generic retrieval interface
llm = ChatOpenAI(temperature = 0.0,max_tokens=1024)  # create the language model
qdocs = "".join([docs[i].page_content for i in range(len(docs))])  # merge the page content of all documents into one variable
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.")  # ask the model to list all sun-protection shirts and summarize each one in a Markdown table
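The similarity search used above can be pictured as a nearest-neighbor lookup over embedding vectors. The following is a rough sketch in plain Python, assuming cosine similarity as the metric and hand-made 3-dimensional vectors in place of real embeddings. The names cosine_similarity and top_k are illustrative and are not part of LangChain.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], doc_vecs: List[List[float]], k: int = 4) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for document embeddings (assumed values).
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, doc_vectors, k=4)
for index, score in results:
    print(index, round(score, 3))
```

The default of k=4 mirrors the four documents returned by the search above. A real vector store performs the same ranking over the stored embeddings.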

The examples above use the in-memory DocArrayInMemorySearch vector store (a persistent store such as Chroma could be used instead). Using a LangChain chain, we now create a retrieval QA chain that answers questions over the retrieved documents. To create such a chain, we pass in a few different things:

1. A language model, which generates the text at the end
2. A chain type. Here we use stuff, which stuffs all the documents into the context and makes a single call to the language model
3. A retriever

qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."  # create a query and run the chain on it
response = qa_stuff.run(query)
display(Markdown(response))  # show the result with display and Markdown

The RetrievalQA chain encapsulates two steps: merging the retrieved text fragments and calling the language model. Without the RetrievalQA chain, we would have to implement it like this:

# Merge the retrieved text fragments into a single text
qdocs = "".join([docs[i].page_content for i in range(len(docs))])
# Pass the merged text to the LLM together with the question
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.")

1.2 Different types of chains

What if you want to run the same kind of question answering over many different chunks? The experiment above returned only 4 documents. When there are many documents, a few different methods (chain types) can be used:

Map Reduce
passes each chunk to the language model together with the question, collects the individual responses, and then uses one more language model call to summarize them into a final answer. It can operate on any number of documents, and the per-chunk calls can be processed in parallel, but it requires more calls. It treats all documents as independent.
Refine
loops over the documents iteratively, building on the answer from the previous documents. It is well suited to combining information and building up an answer progressively. Because each call depends on the result of the previous one, it usually takes longer, and it needs basically as many calls as Map Reduce.
Map Re-rank
makes a single language model call per document and asks the model to return a score, then chooses the answer with the highest score. This relies on the model knowing what the score should be, so the prompt has to tell it (for example, that an answer relevant to the document should get a high score) and the instructions need tuning. The calls can be processed in batches relatively quickly, but the method is more expensive.
Stuff
combines all the documents into a single prompt and makes one call to the language model.
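The four strategies above differ mainly in how many model calls they make and in what each call sees. Here is a simplified sketch, assuming a stand-in FakeLLM class (hypothetical, not a LangChain class) that counts calls and returns a "score|answer" string. The prompts are illustrative and are not LangChain's actual templates.

```python
from typing import List


class FakeLLM:
    """Stand-in for a real model. Counts calls and replies 'score|answer N',
    where the score is simply how often 'sun' occurs in the prompt."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"{prompt.lower().count('sun')}|answer {self.calls}"


def stuff(llm, docs: List[str], question: str) -> str:
    # One call: every document goes into a single prompt.
    return llm("\n\n".join(docs) + f"\n\nQuestion: {question}")


def map_reduce(llm, docs: List[str], question: str) -> str:
    # One call per document, then one extra call to merge the partial answers.
    partials = [llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    return llm("Merge these answers:\n" + "\n".join(partials))


def refine(llm, docs: List[str], question: str) -> str:
    # One call per document; each call sees the answer built so far.
    answer = ""
    for doc in docs:
        answer = llm(f"Answer so far: {answer}\nContext: {doc}\nQuestion: {question}")
    return answer


def map_rerank(llm, docs: List[str], question: str) -> str:
    # One call per document; keep the answer with the highest score.
    replies = [llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    scored = [reply.split("|", 1) for reply in replies]
    best = max(scored, key=lambda pair: int(pair[0]))
    return best[1]


# Hypothetical documents and question.
docs = [
    "Sun shirt with UPF 50 sun protection",
    "Rain jacket with a waterproof shell",
    "Sun hat with a wide brim",
]
question = "Which shirts block the sun?"

call_counts = {}
for strategy in (stuff, map_reduce, refine, map_rerank):
    llm = FakeLLM()
    strategy(llm, docs, question)
    call_counts[strategy.__name__] = llm.calls

best_answer = map_rerank(FakeLLM(), docs, question)
print(call_counts, best_answer)
```

With three documents, stuff makes one call, map_reduce makes one per document plus a final merge, and refine and map_rerank make one per document. Only the refine calls have to run strictly in sequence.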

2. Local Knowledge Base Q&A

2.1 Overall Framework

[Figure: overall framework]

Possible improvements (directions):
- Change the LLM
- Change the embedding model
- Change the text segmentation method
- Multi-GPU accelerated model deployment
- Improve the quality of top-k retrieval recall
When data privacy and private deployment are required, LangChain combined with a large model is a convenient way to run inference.

2.2 Text Segmentation

Langchain source code: https://github.com/hwchase17/langchain/blob/master/langchain/text_splitter.py

Common parameters of LangChain's built-in text splitting module:

chunk_size: the maximum size of a text block;
chunk_overlap: the maximum amount of overlap between two adjacent text blocks. Keeping some overlap preserves continuity between blocks and can be implemented with a sliding window, so this parameter is fairly important;
length_function: the method used to calculate the length of a text block. By default it simply counts characters.
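The effect of chunk_size and chunk_overlap can be illustrated with a character-level sliding window. This is a simplified sketch, assuming plain character counting as the length function. LangChain's own splitters first split on separators and then merge the pieces, so their output will differ. The function name sliding_window_chunks is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into chunks of at most chunk_size characters, where each chunk
    repeats the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk shares two characters with its predecessor, which is the continuity that chunk_overlap is meant to preserve.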

Other text splitters:

LatexTextSplitter: splits text along LaTeX headings, titles, enumerations, etc.
MarkdownTextSplitter: splits text along Markdown headings, code blocks, or horizontal rules.
NLTKTextSplitter: uses NLTK's splitter.
PythonCodeTextSplitter: splits text along Python class and method definitions.
RecursiveCharacterTextSplitter: splitter for generic text. It takes a list of characters as an argument and keeps paragraphs (then sentences, then words) together as much as possible.
SpacyTextSplitter: uses spaCy's splitter.
TokenTextSplitter: splits according to the number of OpenAI tokens.
Let's take an example: segmenting Chinese text with a ChineseTextSplitter class that inherits from LangChain's CharacterTextSplitter. The regular expression sent_sep_pattern matches the delimiters of Chinese sentences (such as the period, exclamation mark, question mark, and semicolon):

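import re
from typing import List

from langchain.text_splitter import CharacterTextSplitter


# Chinese text splitter class
class ChineseTextSplitter(CharacterTextSplitter):
    def __init__(self, pdf: bool = False, **kwargs):
        super().__init__(**kwargs)
        self.pdf = pdf

    def split_text(self, text: str) -> List[str]:
        if self.pdf:
            # Clean up PDF text: collapse runs of blank lines and normalize whitespace
            text = re.sub(r"\n{3,}", "\n", text)
            text = re.sub(r'\s', ' ', text)
            text = text.replace("\n\n", "")
        sent_sep_pattern = re.compile('([﹒﹔﹖﹗．。！？]["’”」』]{0,2}|(?=["‘“「『]{1,2}|$))')  # the full-width colon and semicolon (：；) were removed from the delimiters
        sent_list = []
        for ele in sent_sep_pattern.split(text):
            if sent_sep_pattern.match(ele) and sent_list:
                # A delimiter: attach it to the sentence before it
                sent_list[-1] += ele
            elif ele:
                sent_list.append(ele)
        return sent_list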
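To see what the sentence-splitting logic produces, the core of split_text can be run on its own without LangChain. The sketch below copies the regular expression and loop from the class above (non-PDF path only) into a standalone function. The function name split_chinese_sentences and the sample sentence are hypothetical.

```python
import re
from typing import List

# Same pattern as sent_sep_pattern in the class above.
SENT_SEP_PATTERN = re.compile('([﹒﹔﹖﹗．。！？]["’”」』]{0,2}|(?=["‘“「『]{1,2}|$))')


def split_chinese_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping each delimiter with its sentence."""
    sentences: List[str] = []
    for piece in SENT_SEP_PATTERN.split(text):
        if SENT_SEP_PATTERN.match(piece) and sentences:
            # The piece is a delimiter (or empty): attach it to the previous sentence.
            sentences[-1] += piece
        elif piece:
            sentences.append(piece)
    return sentences


sample_text = "今天天气很好。我们去公园吧！你觉得呢？"
sentences = split_chinese_sentences(sample_text)
for sentence in sentences:
    print(sentence)
```

Because the pattern is wrapped in a capturing group, re.split returns the delimiters as separate pieces, and the loop glues each one back onto the sentence it ends.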

2.3 Graphical process

[Figure: process diagram]


[LLM] Langchain usage [3] (document-based question and answer)
