Local knowledge base Q&A based on LangChain+LLM: from enterprise single document Q&A to batch document Q&A

Preface

In the past six months, the popularity of ChatGPT has put the whole LLM field in the spotlight. But an LLM is, after all, pre-trained on past data: it cannot access the latest knowledge, nor the private knowledge inside each enterprise.

  • In order to obtain the latest knowledge, the ChatGPT Plus version integrates Bing search, and some models likewise call a search tool at inference time
  • In order to handle an enterprise's private knowledge, you can either fine-tune an open-source model, or build a local knowledge-base Q&A system on top of the vector database and LLM integrations provided by langchain, a framework positioned as "linking various AI models and tools" (what is special about a vector database here? A traditional database retrieves, say, images by keyword matching, while a vector database retrieves the same or semantically similar vectors and returns those results; see the short sketch below)
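
To make the "semantic search" idea concrete, here is a minimal sketch. The sentences and the 4-dimensional vectors are purely illustrative stand-ins for the embeddings a real model would produce; the sketch ranks candidate texts by cosine similarity to a query vector, which is essentially what a vector database does at scale.

import numpy as np

# assumption: these tiny 4-dim vectors stand in for real embedding vectors
# (real embeddings typically have hundreds to thousands of dimensions)
doc_vectors = {
    "How to reset my enterprise VPN password": np.array([0.90, 0.10, 0.00, 0.20]),
    "Quarterly sales report template":         np.array([0.10, 0.80, 0.30, 0.00]),
    "Steps to recover a forgotten login":      np.array([0.80, 0.20, 0.10, 0.30]),
}
query_vector = np.array([0.85, 0.15, 0.05, 0.25])   # stand-in embedding of "I forgot my password"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# rank the documents by semantic similarity to the query, highest first
for text, vec in sorted(doc_vectors.items(), key=lambda kv: cosine(query_vector, kv[1]), reverse=True):
    print(f"{cosine(query_vector, vec):.3f}  {text}")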

Therefore, more and more people have started paying attention to langchain and combining it with LLMs, which has directly pushed forward the combined application of databases, knowledge graphs and LLMs (for details see the next article: A practical introduction to knowledge graphs: from what a KG is to combining LLMs with KGs/DBs in practice).

This article focuses on explaining

  • What LangChain is and the overall structure of langchain
  • An interpretation of the key source code of the langchain-ChatGLM project. The point is not just to use it as a tool: the better you understand how a tool works, the more smoothly you can use it. It was not easy to interpret at first, because the project involves many sub-projects and technical points, so it is easy to get lost. Fortunately, after following the project step by step, I can present a clear code structure. It took me nearly a week from first encountering the langchain-ChatGLM project to sorting out and writing up the whole source code; with this article in hand, you may be able to figure it out in less than a day (roughly a 7x gain in efficiency), which is part of the value of this article
  • An upgraded version of the langchain-ChatGLM project: langchain-Chatchat

If you have any questions while reading, feel free to leave a comment. I will reply to each one in time so we can discuss and dig deeper together.

Part 1: What is LangChain: LLM's plug-in/function library

1.1 The overall structure of langchain

In plain terms, langchain (official website address, GitHub address) wraps many functions commonly used in AI applications into libraries and provides interfaces for calling various commercial model APIs as well as open-source models. It supports the following components.

Friends encountering it for the first time may be overwhelmed by so many components (it wraps so many things that it seems to want to cover every function/tool an LLM application could need). To make it easier to understand, look at it at a high level first: the whole langchain library is divided into three layers: the basic layer, the capability layer and the application layer.

1.1.1 Basic layer: models, LLMs, index

1.1.1.1 Models: models

Various types of models and model integrations, such as OpenAI's various APIs/GPT-4, are wrapped behind a unified interface for base models,
for example completing a question and answer through the API:

import os
os.environ["OPENAI_API_KEY"] = '你的api key'
from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003",max_tokens=1024)
llm("怎么评价人工智能")

The answer obtained is as shown below

1.1.1.2 LLMs layer

This layer mainly wraps the capabilities of the models layer and exposes them as services, mainly including:

  • Various LLM model management platforms: emphasizing the richness and ease of use of models
  • Integrated service capability products: Emphasis on out-of-the-box use
  • Differentiated capabilities: For example, focusing on prompt management (including prompt management, prompt optimization and prompt serialization), model running mode based on shared resources, etc.

Examples are Google's PaLM Text APIs, or the model_token_mapping table in the llms/openai.py file:

        model_token_mapping = {
            "gpt-4": 8192,
            "gpt-4-0314": 8192,
            "gpt-4-0613": 8192,
            "gpt-4-32k": 32768,
            "gpt-4-32k-0314": 32768,
            "gpt-4-32k-0613": 32768,
            "gpt-3.5-turbo": 4096,
            "gpt-3.5-turbo-0301": 4096,
            "gpt-3.5-turbo-0613": 4096,
            "gpt-3.5-turbo-16k": 16385,
            "gpt-3.5-turbo-16k-0613": 16385,
            "text-ada-001": 2049,
            "ada": 2049,
            "text-babbage-001": 2040,
            "babbage": 2049,
            "text-curie-001": 2049,
            "curie": 2049,
            "davinci": 2049,
            "text-davinci-003": 4097,
            "text-davinci-002": 4097,
            "code-davinci-002": 8001,
            "code-davinci-001": 8001,
            "code-cushman-002": 2048,
            "code-cushman-001": 2048,
        }
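
As a quick illustration of what this table is for, langchain's OpenAI wrapper (in the 0.0.x versions) uses it to look up a model's context window so that the completion budget can be computed from the prompt length. A hedged sketch; the model names are only examples and an OPENAI_API_KEY is assumed to be set as in the earlier snippet.

from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003")

# context window size for a given model name, backed by model_token_mapping
print(llm.modelname_to_contextsize("gpt-4"))              # 8192
print(llm.modelname_to_contextsize("text-davinci-003"))   # 4097

# tokens left for the completion after accounting for this prompt
print(llm.max_tokens_for_prompt("How should we evaluate artificial intelligence?"))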

1.1.1.3 Index: Vector scheme, KG scheme

This layer stores and retrieves various user documents such as private text, images and PDFs (in effect it structures documents so that external data can interact with the model). There are two concrete implementation schemes: a Vector scheme and a KG (knowledge graph) scheme

Vector scheme of Index

For the Vector scheme: the file is first split into chunks, and the chunks are then encoded, stored and retrieved. You can refer to this code file: langchain/libs/langchain/langchain/indexes/vectorstore.py, which implements, in order:

Module imports: various type checks, data structures, and predefined classes and functions. Then one function, _get_default_text_splitter, and two classes, VectorStoreIndexWrapper and VectorstoreIndexCreator, are implemented.

_get_default_text_splitter function:
This private function returns a default text splitter that recursively splits text into chunks of size 1000, with the overlap between chunks set to 0 by default.

# 默认的文本分割器函数
def _get_default_text_splitter() -> TextSplitter:
    return RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
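
To see what this default splitter actually produces, here is a small usage sketch; the sample text is purely illustrative.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# illustrative long text; in practice this would be the content of a loaded document
long_text = "LangChain's Index module stores and retrieves private documents for Q&A. " * 200
chunks = splitter.split_text(long_text)

print(len(chunks))        # number of chunks produced
print(len(chunks[0]))     # each chunk is at most roughly 1000 characters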

Next is the VectorStoreIndexWrapper class:
This is a wrapper class, mainly for convenient access to and querying of the vector store (Vector Store)

  1. vectorstore: attributes of a vector storage object
        vectorstore: VectorStore  # 向量存储对象
    
        class Config:
            """Configuration for this pydantic object."""
    
            extra = Extra.forbid            # 额外配置项
            arbitrary_types_allowed = True  # 允许任意类型
  2. query: a method that accepts a question string and queries the vector store to get the answer
    # 查询向量存储的函数
    def query(
        self,
        question: str,                                          # 输入的问题字符串
        llm: Optional[BaseLanguageModel] = None,                # 可选的语言模型参数,默认为None
        retriever_kwargs: Optional[Dict[str, Any]] = None,      # 提取器的可选参数,默认为None
        **kwargs: Any                                           # 其他关键字参数
    ) -> str:
        """Query the vectorstore."""                            # 函数的文档字符串,描述函数的功能
    
        # 如果没有提供语言模型参数,则使用OpenAI作为默认语言模型,并设定温度参数为0
        llm = llm or OpenAI(temperature=0)                      
    
        # 如果没有提供提取器的参数,则初始化为空字典
        retriever_kwargs = retriever_kwargs or {}               
    
        # 创建一个基于语言模型和向量存储提取器的检索QA链
        chain = RetrievalQA.from_chain_type(
            llm, retriever=self.vectorstore.as_retriever(**retriever_kwargs), **kwargs
        )
    
        # 使用创建的QA链运行提供的问题,并返回结果
        return chain.run(question)
    A note on the "retriever" (extractor) that appears above

    The retriever first fetches documents or fragments relevant to the question from a large corpus, and a generator then produces the answer based on these retrieved documents.

    Retrievers can be built on many different techniques, including:

        a. Keyword-based retrieval: use keyword matching to find relevant documents
        b. Vector space model: represent both documents and queries as vectors and retrieve relevant documents by computing the similarity between them
        c. Deep-learning-based methods: use pre-trained neural models (such as BERT, RoBERTa, etc.) to encode documents and queries into vectors and compute their similarity
        d. Indexing methods: such as the inverted index commonly used by search engines, which can quickly find documents containing specific words or phrases
    These methods can be used independently or in combination to improve the accuracy and speed of retrieval

  3. query_with_sources: similar to query, but also returns the data sources related to the query results
        # 查询向量存储并返回数据源的函数
        def query_with_sources(
            self,
            question: str,
            llm: Optional[BaseLanguageModel] = None,
            retriever_kwargs: Optional[Dict[str, Any]] = None,
            **kwargs: Any
        ) -> dict:
            """Query the vectorstore and get back sources."""
            llm = llm or OpenAI(temperature=0)              # 默认使用OpenAI作为语言模型
            retriever_kwargs = retriever_kwargs or {}       # 提取器参数
            chain = RetrievalQAWithSourcesChain.from_chain_type(
                llm, retriever=self.vectorstore.as_retriever(**retriever_kwargs), **kwargs
            )
            return chain({chain.question_key: question})

Finally, there is the VectorstoreIndexCreator class:

This is a class that creates a vector storage index

  1. vectorstore_cls: vector storage class used, default is Chroma
        vectorstore_cls: Type[VectorStore] = Chroma              # 默认使用Chroma作为向量存储类
    A simplified vector store can be thought of as a large table or database where each row represents an item (such as a document, image, or sentence) and each item has a high-dimensional vector associated with it. The dimensionality can range from tens to thousands, depending on the embedding model used.
    For example:

    Item ID | Vector (in a high-dimensional space)
    1       | [0.34, -0.2, 0.5, ...]
    2       | [-0.1, 0.3, -0.4, ...]
    ...     | ...

    As for Chroma, it is a common vector database that can be integrated with LangChain to build various applications on top of language models.

  2. embedding: the embedding class used, the default is OpenAIEmbeddings
        embedding: Embeddings = Field(default_factory=OpenAIEmbeddings)  # 默认使用OpenAIEmbeddings作为嵌入类
  3. text_splitter: Text splitter for splitting text
        text_splitter: TextSplitter = Field(default_factory=_get_default_text_splitter)  # 默认文本分割器
  4. from_loaders: Create a vector storage index from the given list of loaders
        # 从加载器创建向量存储索引的函数
        def from_loaders(self, loaders: List[BaseLoader]) -> VectorStoreIndexWrapper:
            """Create a vectorstore index from loaders."""
            docs = []
            for loader in loaders:              # 遍历加载器
                docs.extend(loader.load())      # 加载文档
            return self.from_documents(docs)
  5. from_documents: Create a vector storage index from the given list of documents
        # 从文档创建向量存储索引的函数
        def from_documents(self, documents: List[Document]) -> VectorStoreIndexWrapper:
            """Create a vectorstore index from documents."""
            sub_docs = self.text_splitter.split_documents(documents)      # 分割文档
            vectorstore = self.vectorstore_cls.from_documents(
                sub_docs, self.embedding, **self.vectorstore_kwargs       # 从文档创建向量存储
            )
            return VectorStoreIndexWrapper(vectorstore=vectorstore)       # 返回向量存储的包装对象
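
Putting the two classes together, a typical usage looks like the hedged sketch below; the file path and question are placeholders, and an OPENAI_API_KEY is assumed for the default embeddings and LLM.

from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

# load a local text file (placeholder path), build a Chroma-backed index from it,
# then ask questions against the index
loader = TextLoader("./data/company_faq.txt", encoding="utf-8")
index = VectorstoreIndexCreator().from_loaders([loader])

print(index.query("What is the refund policy?"))
print(index.query_with_sources("What is the refund policy?"))
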
KG scheme of Index

For the KG scheme: this part uses the LLM to extract triples from the file and stores them as a knowledge graph for subsequent retrieval. You can refer to this code file: langchain/libs/langchain/langchain/indexes/graph.py

"""Graph Index Creator."""                     # 定义"图索引创建器"的描述

# 导入相关的模块和类型定义
from typing import Optional, Type              # 导入可选类型和类型的基础类型
from langchain import BasePromptTemplate       # 导入基础提示模板
from langchain.chains.llm import LLMChain      # 导入LLM链
from langchain.graphs.networkx_graph import NetworkxEntityGraph, parse_triples  # 导入Networkx实体图和解析三元组的功能
from langchain.indexes.prompts.knowledge_triplet_extraction import (  # 从知识三元组提取模块导入对应的提示
    KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT,
)
from langchain.pydantic_v1 import BaseModel                      # 导入基础模型
from langchain.schema.language_model import BaseLanguageModel    # 导入基础语言模型的定义

class GraphIndexCreator(BaseModel):  # 定义图索引创建器类,继承自BaseModel
    """Functionality to create graph index."""   # 描述该类的功能为"创建图索引"

    llm: Optional[BaseLanguageModel] = None      # 定义可选的语言模型属性,默认为None
    graph_type: Type[NetworkxEntityGraph] = NetworkxEntityGraph  # 定义图的类型,默认为NetworkxEntityGraph

    def from_text(
        self, text: str, prompt: BasePromptTemplate = KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT
    ) -> NetworkxEntityGraph:                  # 定义一个方法,从文本中创建图索引
        """Create graph index from text."""    # 描述该方法的功能
        if self.llm is None:                   # 如果语言模型为None,则抛出异常
            raise ValueError("llm should not be None")
        graph = self.graph_type()  # 创建一个新的图
        chain = LLMChain(llm=self.llm, prompt=prompt)  # 使用当前的语言模型和提示创建一个LLM链
        output = chain.predict(text=text)      # 使用LLM链对文本进行预测
        knowledge = parse_triples(output)      # 解析预测输出得到的三元组
        for triple in knowledge:               # 遍历所有的三元组
            graph.add_triple(triple)           # 将三元组添加到图中
        return graph  # 返回创建的图

    async def afrom_text(             # 定义一个异步版本的from_text方法
        self, text: str, prompt: BasePromptTemplate = KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT
    ) -> NetworkxEntityGraph:
        """Create graph index from text asynchronously."""  # 描述该异步方法的功能
        if self.llm is None:          # 如果语言模型为None,则抛出异常
            raise ValueError("llm should not be None")
        graph = self.graph_type()     # 创建一个新的图
        chain = LLMChain(llm=self.llm, prompt=prompt)       # 使用当前的语言模型和提示创建一个LLM链
        output = await chain.apredict(text=text)   # 异步使用LLM链对文本进行预测
        knowledge = parse_triples(output)          # 解析预测输出得到的三元组
        for triple in knowledge:                   # 遍历所有的三元组
            graph.add_triple(triple)               # 将三元组添加到图中
        return graph                               # 返回创建的图
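
A short usage sketch of GraphIndexCreator; the sample sentences are illustrative and an OpenAI key is assumed.

from langchain.indexes import GraphIndexCreator
from langchain.llms import OpenAI

creator = GraphIndexCreator(llm=OpenAI(temperature=0))

# extract (subject, predicate, object) triples from free text and build a graph
graph = creator.from_text("LangChain was created by Harrison Chase. LangChain integrates with Chroma.")
print(graph.get_triples())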

In addition, indexing relies on the following capabilities:

  • Document Loaders, a standard interface for document loading that integrates with documents and data sources in many formats, such as Arxiv, Email, Excel, Markdown, PDF (which is why it can power applications like ChatPDF), YouTube, and so on. Related modules include docstore (which contains wikipedia.py, among others) and document_transformers
  • embeddings (langchain/libs/langchain/langchain/embeddings), which covers various embedding algorithms, each in its own code file (a minimal sketch of the shared interface they implement follows this list):
    elasticsearch.py, google_palm.py, gpt4all.py, huggingface.py, huggingface_hub.py
    llamacpp.py, minimax.py, modelscope_hub.py, mosaicml.py
    openai.py
    sentence_transformer.py, spacy_embeddings.py, tensorflow_hub.py, vertexai.py
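
A minimal sketch of the common embeddings interface these files all implement; the model name is just an example and requires the sentence-transformers package to be installed.

from langchain.embeddings import HuggingFaceEmbeddings

# every Embeddings implementation exposes the same two methods
emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

query_vec = emb.embed_query("what is a vector database?")    # one vector for a query
doc_vecs  = emb.embed_documents(["doc one", "doc two"])      # one vector per document

print(len(query_vec), len(doc_vecs))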

1.1.2 Capability layer: Chains, Memory, Tools

If the basic layer provides the core capabilities, the capability layer gives those capabilities hands, feet and a brain, so that they can remember and act on everything. It includes Chains, Memory and Tools.

  • Chains:
    In short, a chain is a sequence of calls to various components, which may include a prompt template, a language model and an output parser, working together to process user input, generate a response and post-process the output (a minimal chain-plus-memory sketch appears at the end of this subsection, just before 1.1.3).

    Concretely, chains abstract and customize different execution logic for different needs; chains can be nested and executed in sequence. Through this layer, LLM capabilities can be linked to various domains, for example interacting with an Elasticsearch database: elasticsearch_database.
    Another example is question answering over a knowledge graph: graph_qa. The code file chains/graph_qa/base.py implements a KG-based Q&A system. The concrete steps are: first, look up relevant information in the knowledge graph for the extracted entities via self.graph.get_entity_knowledge(entity), which returns everything related to an entity in the form of triples; then all triples are joined to form the context; finally the question and the context are fed into qa_chain together to obtain the final answer



       entities = get_entities(entity_string)  # 获取实体列表。
            context = ""               # 初始化上下文。
            all_triplets = []          # 初始化三元组列表。
            for entity in entities:    # 遍历每个实体
                all_triplets.extend(self.graph.get_entity_knowledge(entity))  # 获取实体的所有知识并加入到三元组列表中。
            context = "\n".join(all_triplets)          # 用换行符连接所有的三元组作为上下文。
            
            # 打印完整的上下文。
            _run_manager.on_text("Full Context:", end="\n", verbose=self.verbose)
            _run_manager.on_text(context, color="green", end="\n", verbose=self.verbose)
            
            # 使用上下文和问题获取答案。
            result = self.qa_chain(
                {"question": question, "context": context},
                callbacks=_run_manager.get_child(),
            )
            return {self.output_key: result[self.qa_chain.output_key]}  # 返回答案
    For example, code can be generated and executed automatically: llm_math, etc.
    For example, for private-domain data: qa_with_sources. The code file chains/qa_with_sources/vector_db.py answers questions using a vector database; the core lies in the following two functions
    reduce_tokens_below_limit
    # 定义基于向量数据库的问题回答类
    class VectorDBQAWithSourcesChain(BaseQAWithSourcesChain):
        """Question-answering with sources over a vector database."""
        
        # 定义向量数据库的字段
        vectorstore: VectorStore = Field(exclude=True)
    
        """Vector Database to connect to."""
        # 定义返回结果的数量
        k: int = 4
    
        # 是否基于token限制来减少返回结果的数量
        reduce_k_below_max_tokens: bool = False
    
        # 定义返回的文档基于token的最大限制
        max_tokens_limit: int = 3375
    
        # 定义额外的搜索参数
        search_kwargs: Dict[str, Any] = Field(default_factory=dict)
    
        # 定义函数来根据最大token限制来减少文档
        def _reduce_tokens_below_limit(self, docs: List[Document]) -> List[Document]:
            num_docs = len(docs)
    
            # 检查是否需要根据token减少文档数量
            if self.reduce_k_below_max_tokens and isinstance(
                self.combine_documents_chain, StuffDocumentsChain
            ):
                tokens = [
                    self.combine_documents_chain.llm_chain.llm.get_num_tokens(
                        doc.page_content
                    )
                    for doc in docs
                ]
                token_count = sum(tokens[:num_docs])
    
                # 减少文档数量直到满足token限制
                while token_count > self.max_tokens_limit:
                    num_docs -= 1
                    token_count -= tokens[num_docs]
    
            return docs[:num_docs]
    _get_docs
        # 获取相关文档的函数
        def _get_docs(
            self, inputs: Dict[str, Any], *, run_manager: CallbackManagerForChainRun
        ) -> List[Document]:
            question = inputs[self.question_key]
    
            # 从向量存储中搜索相似的文档
            docs = self.vectorstore.similarity_search(
                question, k=self.k, **self.search_kwargs
            )
            return self._reduce_tokens_below_limit(docs)
    For example, for SQL data sources: sql_database; you can focus on this code file: chains/sql_database/query.py

    For example, for model dialogue: chat_models, including these code files: __init__.py, anthropic.py, azure_openai.py, base.py, fake.py, google_palm.py, human.py, jinachat.py, openai.py, promptlayer_openai.py, vertexai.py

    In addition, there are some more eye-catching chains:
    constitutional_ai: post-processes the final result to handle bias and compliance issues, ensuring the final output conforms to given values
    llm_checker: lets the LLM automatically check whether its own output is logically sound
  • Memory:
    In short, memory is used to save the conversational state when interacting with the model and to handle long-term memory.

    Specifically, this layer has two core points:
    → memorize and store, in a structured way, the inputs and outputs produced while Chains execute, providing context for the next interaction; this part can simply be stored in Redis
    → build a knowledge graph from the interaction history and return accurate results based on the associated information; the corresponding code file is memory/kg.py
    # 定义知识图谱对话记忆类
    class ConversationKGMemory(BaseChatMemory):
        """知识图谱对话记忆类
    
        在对话中与外部知识图谱集成,存储和检索对话中的知识三元组信息。
        """
    
        k: int = 2  # 考虑的上下文对话数量
        human_prefix: str = "Human"  # 人类前缀
        ai_prefix: str = "AI"  # AI前缀
        kg: NetworkxEntityGraph = Field(default_factory=NetworkxEntityGraph)  # 知识图谱实例
        knowledge_extraction_prompt: BasePromptTemplate = KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT          # 知识提取提示
        entity_extraction_prompt: BasePromptTemplate = ENTITY_EXTRACTION_PROMPT  # 实体提取提示
        llm: BaseLanguageModel                  # 基础语言模型
        summary_message_cls: Type[BaseMessage] = SystemMessage  # 总结消息类
        memory_key: str = "history"             # 历史记忆键
    
        def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
            """返回历史缓冲区。"""
            entities = self._get_current_entities(inputs)  # 获取当前实体
    
            summary_strings = []
            for entity in entities:  # 对于每个实体
                knowledge = self.kg.get_entity_knowledge(entity)      # 获取与实体相关的知识
                if knowledge:
                    summary = f"On {entity}: {'. '.join(knowledge)}."  # 构建总结字符串
                    summary_strings.append(summary)
            context: Union[str, List]
            if not summary_strings:
                context = [] if self.return_messages else ""
            elif self.return_messages:
                context = [
                    self.summary_message_cls(content=text) for text in summary_strings
                ]
            else:
                context = "\n".join(summary_strings)
    
            return {self.memory_key: context}
    
        @property
        def memory_variables(self) -> List[str]:
            """始终返回记忆变量列表。"""
            return [self.memory_key]
    
        def _get_prompt_input_key(self, inputs: Dict[str, Any]) -> str:
            """获取提示的输入键。"""
            if self.input_key is None:
                return get_prompt_input_key(inputs, self.memory_variables)
            return self.input_key
    
        def _get_prompt_output_key(self, outputs: Dict[str, Any]) -> str:
            """获取提示的输出键。"""
            if self.output_key is None:
                if len(outputs) != 1:
                    raise ValueError(f"One output key expected, got {outputs.keys()}")
                return list(outputs.keys())[0]
            return self.output_key
    
        def get_current_entities(self, input_string: str) -> List[str]:
            """从输入字符串中获取当前实体。"""
            chain = LLMChain(llm=self.llm, prompt=self.entity_extraction_prompt)
            buffer_string = get_buffer_string(
                self.chat_memory.messages[-self.k * 2 :],
                human_prefix=self.human_prefix,
                ai_prefix=self.ai_prefix,
            )
            output = chain.predict(
                history=buffer_string,
                input=input_string,
            )
            return get_entities(output)
    
        def _get_current_entities(self, inputs: Dict[str, Any]) -> List[str]:
            """获取对话中的当前实体。"""
            prompt_input_key = self._get_prompt_input_key(inputs)
            return self.get_current_entities(inputs[prompt_input_key])
    
        def get_knowledge_triplets(self, input_string: str) -> List[KnowledgeTriple]:
            """从输入字符串中获取知识三元组。"""
            chain = LLMChain(llm=self.llm, prompt=self.knowledge_extraction_prompt)
            buffer_string = get_buffer_string(
                self.chat_memory.messages[-self.k * 2 :],
                human_prefix=self.human_prefix,
                ai_prefix=self.ai_prefix,
            )
            output = chain.predict(
                history=buffer_string,
                input=input_string,
                verbose=True,
            )
            knowledge = parse_triples(output)  # 解析三元组
            return knowledge
    
        def _get_and_update_kg(self, inputs: Dict[str, Any]) -> None:
            """从对话历史中获取并更新知识图谱。"""
            prompt_input_key = self._get_prompt_input_key(inputs)
            knowledge = self.get_knowledge_triplets(inputs[prompt_input_key])
            for triple in knowledge:
                self.kg.add_triple(triple)  # 向知识图谱中添加三元组
    
        def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, str]) -> None:
            """将此对话的上下文保存到缓冲区。"""
            super().save_context(inputs, outputs)
            self._get_and_update_kg(inputs)
    
        def clear(self) -> None:
            """清除记忆内容。"""
            super().clear()
            self.kg.clear()  # 清除知识图谱内容
  • Tools layer: tools.
    The Chains layer can already execute specific logic based on LLM + Prompt, but it is unrealistic to implement every piece of logic as a chain; the rest can be provided through the Tools layer. Tools are best understood as skills, such as search, Wikipedia, weather forecasts, a ChatGPT service, and so on.
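
To tie the Chains and Memory ideas together before moving on to Agents, here is a minimal, hedged sketch of a conversation chain whose memory carries earlier turns into later prompts; it assumes an OpenAI key is set.

from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# the chain wires together: a prompt template + an LLM + a memory object
conversation = ConversationChain(
    llm=OpenAI(temperature=0),
    memory=ConversationBufferMemory(),   # stores prior turns and injects them as context
    verbose=True,
)

print(conversation.predict(input="Hi, I am building a local knowledge-base QA system."))
print(conversation.predict(input="What did I just say I was building?"))  # answered from memory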

1.1.3 Application layer: Agents

  • Agents:
    In short, with the basic layer and the capability layer in place, we can build all kinds of interesting and valuable services; this is where Agents come in.

    Specifically, an Agent acts as a proxy: it sends requests to the LLM, takes actions, and checks the results until the task is finished, including acting for tasks that the LLM itself cannot handle (such as search or calculation, much like the Bing and calculator plug-ins of ChatGPT Plus).
    For example, an Agent can use Wikipedia to look up Barack Obama's date of birth and then use a calculator to work out his age in 2023
    # pip install wikipedia
    from langchain.agents import load_tools
    from langchain.agents import initialize_agent
    from langchain.agents import AgentType
    
    tools = load_tools(["wikipedia", "llm-math"], llm=llm)
    agent = initialize_agent(tools, 
                             llm, 
                             agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, 
                             verbose=True)
    
    
    agent.run("奥巴马的生日是哪天? 到2023年他多少岁了?")
    In addition, for Wikipedia, you can pay attention to this code file: langchain/docstore/wikipedia.py  ...

The final overall technical architecture of langchain is shown in the figure below (there is also another architecture diagram)

1.2 Some application examples of langchain: Internet search + document Q&A

Theory alone may not make clear what langchain is actually used for, so to make it easier to understand, here are a few application examples of langchain.

1.2.1 Search Google and return answers

This example is implemented with the help of SerpApi, which provides an API for Google search.

So first register an account on the SerpApi website (https://serpapi.com/), generate an API key, and set it in the environment variables:

import os
os.environ["OPENAI_API_KEY"] = '你的api key'
os.environ["SERPAPI_API_KEY"] = '你的api key'

Then, start writing code

from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.llms import OpenAI
from langchain.agents import AgentType

# 加载 OpenAI 模型
llm = OpenAI(temperature=0,max_tokens=2048) 

 # 加载 serpapi 工具
tools = load_tools(["serpapi"])

# 如果搜索完想再计算一下可以这么写
# tools = load_tools(['serpapi', 'llm-math'], llm=llm)

# 如果搜索完想再让他再用python的print做点简单的计算,可以这样写
# tools=load_tools(["serpapi","python_repl"])

# 工具加载后都需要初始化,verbose 参数为 True,会打印全部的执行详情
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)

# 运行 agent
agent.run("What's the date today? What great events have taken place today in history?")

1.2.2 Implement a document conversation robot in less than 50 lines of code

As is well known, ChatGPT's training data only goes up to 2021, so it does not know the latest knowledge on the Internet (unless it calls a search function such as Bing). With "LangChain + ChatGPT's API", a conversational bot over your own documents can be implemented in less than 50 lines of code.

Assuming all of the 2022 content lives in a 2022.txt document, the following code lets ChatGPT answer questions about 2022.

The principle is also very simple:

  1. Vectorize the user's input/prompt
  2. Split the document
  3. Vectorize the split text.
    Only after vectorization can the similarity between vectors be computed.
  4. Store the vectorized text in a vector database
  5. Find the relevant text in the vector database according to the user's input/prompt (the answer is determined by similarity matching between the prompt/input vector and the paragraph vectors in the text)
  6. Finally, the answer is returned through the LLM
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import os                            # 导入os模块,用于操作系统相关的操作
import jieba as jb                   # 导入结巴分词库
from langchain.chains import ConversationalRetrievalChain   # 导入用于创建对话检索链的类
from langchain.chat_models import ChatOpenAI                # 导入用于创建ChatOpenAI对象的类
from langchain.document_loaders import DirectoryLoader      # 导入用于加载文件的类
from langchain.embeddings import OpenAIEmbeddings           # 导入用于创建词向量嵌入的类
from langchain.text_splitter import TokenTextSplitter       # 导入用于分割文档的类
from langchain.vectorstores import Chroma                   # 导入用于创建向量数据库的类

# 初始化函数,用于处理输入的文档
def init():  
    files = ['2022.txt']      # 需要处理的文件列表
    for file in files:        # 遍历每个文件
        with open(f"./data/{file}", 'r', encoding='utf-8') as f:   # 以读模式打开文件
            data = f.read()   # 读取文件内容

        cut_data = " ".join([w for w in list(jb.cut(data))])       # 对读取的文件内容进行分词处理
        cut_file = f"./data/cut/cut_{file}"      # 定义处理后的文件路径和名称
        with open(cut_file, 'w') as f:           # 以写模式打开文件
            f.write(cut_data)                    # 将处理后的内容写入文件

# 新建一个函数用于加载文档
def load_documents(directory):  
    # 创建DirectoryLoader对象,用于加载指定文件夹内的所有.txt文件
    loader = DirectoryLoader(directory, glob='**/*.txt')  
    docs = loader.load()  # 加载文件
    return docs  # 返回加载的文档

# 新建一个函数用于分割文档
def split_documents(docs):  
    # 创建TokenTextSplitter对象,用于分割文档
    text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)  
    docs_texts = text_splitter.split_documents(docs)  # 分割加载的文本
    return docs_texts  # 返回分割后的文本

# 新建一个函数用于创建词嵌入
def create_embeddings(api_key):  
    # 创建OpenAIEmbeddings对象,用于获取OpenAI的词向量
    embeddings = OpenAIEmbeddings(openai_api_key=api_key)  
    return embeddings  # 返回创建的词嵌入

# 新建一个函数用于创建向量数据库
def create_chroma(docs_texts, embeddings, persist_directory):  
    # 使用文档,embeddings和持久化目录创建Chroma对象
    vectordb = Chroma.from_documents(docs_texts, embeddings, persist_directory=persist_directory)  
    vectordb.persist()      # 持久化存储向量数据
    return vectordb         # 返回创建的向量数据库

# load函数,调用上面定义的具有各个职责的函数
def load():
    docs = load_documents('./data/cut')        # 调用load_documents函数加载文档
    docs_texts = split_documents(docs)         # 调用split_documents函数分割文档
    api_key = os.environ.get('OPENAI_API_KEY')   # 从环境变量中获取OpenAI的API密钥
    embeddings = create_embeddings(api_key)      # 调用create_embeddings函数创建词嵌入

    # 调用create_chroma函数创建向量数据库
    vectordb = create_chroma(docs_texts, embeddings, './data/cut/')  

    # 创建ChatOpenAI对象,用于进行聊天对话
    openai_ojb = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")  

    # 从模型和向量检索器创建ConversationalRetrievalChain对象
    chain = ConversationalRetrievalChain.from_llm(openai_ojb, vectordb.as_retriever())  
    return chain  # 返回该对象

# 调用load函数,获取ConversationalRetrievalChain对象
chain = load()  

# 定义一个函数,根据输入的问题获取答案
def get_ans(question):  
    chat_history = []      # 初始化聊天历史为空列表
    result = chain({       # 调用chain对象获取聊天结果
        'chat_history': chat_history,  # 传入聊天历史
        'question': question,          # 传入问题
    })
    return result['answer']      # 返回获取的答案

if __name__ == '__main__':       # 如果此脚本作为主程序运行
    s = input('please input:')   # 获取用户输入
    while s != 'exit':      # 如果用户输入的不是'exit'
        ans = get_ans(s)    # 调用get_ans函数获取答案
        print(ans)  # 打印答案
        s = input('please input:')  # 获取用户输入

//To be updated


Part 2: Local knowledge base Q&A based on LangChain + ChatGLM-6B (first version in July 23)

2.1 Core steps: How to implement local knowledge base Q&A through LangChain+LLM

In July 2023, a question-and-answer application over a local knowledge base, built along langchain's lines, appeared on GitHub: langchain-ChatGLM (this is its GitHub address; there are also similar projects that now support Vicuna-13b, such as LangChain-ChatGLM-Webui). Its goal is to build a knowledge-base Q&A solution that is friendly to Chinese scenarios and open-source models and can run offline.

The implementation principle of this project is shown in the figure below (similar to document-based Q&A, the process is: 1 load the document -> 2 read the document -> 3/4 split the document -> 5/6 vectorize the text -> 8/9 vectorize the question -> 10 match the top k document vectors most similar to the question vector -> 11/12/13 add the matched text to the prompt as context, together with the question -> 14/15 submit to the LLM to generate the answer)

  1. The first stage: load file - read file - text splitter
    Load file: read the knowledge base file stored locally.
    Read file: read the content of the loaded file, usually converting it into text format.
    Text splitter: split the text according to certain rules (such as paragraphs, sentences, words, etc.). The following is only sample code (not the source code of the langchain-ChatGLM project):

        def _load_file(self, filename):
            # 判断文件类型
            if filename.lower().endswith(".pdf"):  # 如果文件是 PDF 格式
                loader = UnstructuredFileLoader(filename)   # 使用 UnstructuredFileLoader 加载器来加载 PDF 文件
                text_splitor = CharacterTextSplitter()      # 使用 CharacterTextSplitter 来分割文件中的文本
                docs = loader.load_and_split(text_splitor)  # 加载文件并进行文本分割
            else:          # 如果文件不是 PDF 格式
                loader = UnstructuredFileLoader(filename, mode="elements")  # 使用 UnstructuredFileLoader 加载器以元素模式加载文件
                text_splitor = CharacterTextSplitter()      # 使用 CharacterTextSplitter 来分割文件中的文本
                docs = loader.load_and_split(text_splitor)  # 加载文件并进行文本分割
            return docs    # 返回处理后的文件数据
    
  2. The second stage: text vectorization (embedding) - storage in the vector database
    Text vectorization (embedding): this usually involves NLP feature extraction; the split text is converted into numeric vectors

        # 初始化方法,接受一个可选的模型名称参数,默认值为 None
        def __init__(self, model_name=None) -> None:  
            if not model_name:  # 如果没有提供模型名称
                # 使用默认的嵌入模型
                # 创建一个 HuggingFaceEmbeddings 对象,模型名称为类的 model_name 属性
                self.embeddings = HuggingFaceEmbeddings(model_name=self.model_name)  
    

    Store in the vector database: after vectorizing the text, store it in a vector database (vectorstore). FAISS is one such store and is explained in detail in the next section; the snippet below uses Chroma

    def init_vector_store(self):
        persist_dir = os.path.join(VECTORE_PATH, ".vectordb")  # 持久化向量数据库的地址
        print("向量数据库持久化地址: ", persist_dir)              # 打印持久化地址
    
        # 如果持久化地址存在
        if os.path.exists(persist_dir):  
            # 从本地持久化文件中加载
            print("从本地向量加载数据...")
            # 使用 Chroma 加载持久化的向量数据
            vector_store = Chroma(persist_directory=persist_dir, embedding_function=self.embeddings)  
    
        # 如果持久化地址不存在
        else:      
            # 加载知识库
            documents = self.load_knownlege()  
            # 使用 Chroma 从文档中创建向量存储
            vector_store = Chroma.from_documents(documents=documents, 
                                                 embedding=self.embeddings,
                                                 persist_directory=persist_dir)  
            vector_store.persist()      # 持久化向量存储
        return vector_store             # 返回向量存储

    The implementation of load_knownlege is

    def load_knownlege(self):
        docments = []         # 初始化一个空列表来存储文档
    
        # 遍历 DATASETS_DIR 目录下的所有文件
        for root, _, files in os.walk(DATASETS_DIR, topdown=False):
            for file in files:
                filename = os.path.join(root, file)      # 获取文件的完整路径
                docs = self._load_file(filename)         # 加载文件中的文档
    
                # 更新 metadata 数据
                new_docs = []             # 初始化一个空列表来存储新文档
                for doc in docs:
                    # 更新文档的 metadata,将 "source" 字段的值替换为不包含 DATASETS_DIR 的相对路径
                    doc.metadata = {"source": doc.metadata["source"].replace(DATASETS_DIR, "")} 
                    print("文档2向量初始化中, 请稍等...", doc.metadata)  # 打印正在初始化的文档的 metadata
                    new_docs.append(doc)  # 将文档添加到新文档列表
    
                docments += new_docs      # 将新文档列表添加到总文档列表
    
        return docments      # 返回所有文档的列表
  3. The third stage: question vectorization
    The user's query or question is converted into a vector, using the same embedding method as for the text so that the comparison happens in the same vector space (a short sketch of this step appears right after this list).

  4. The fourth stage: match the top k text vectors most similar to the question vector.
    This step is the core of information retrieval: by computing cosine similarity, Euclidean distance, etc., the text vectors closest to the question vector are found

        def query(self, q):
            """在向量数据库中查找与问句向量相似的文本向量"""
            vector_store = self.init_vector_store()
            docs = vector_store.similarity_search_with_score(q, k=self.top_k)
            for doc in docs:
                dc, s = doc
                yield s, dc
  5. The fifth stage: the matched text is added to the prompt as context, together with the question.
    The matched text forms a question-related context that is fed to the language model.

  6. The sixth stage: submit to the LLM to generate the answer.
    Finally, the question and the context are submitted to the language model (such as the GPT series), which generates the answer,
    for example a knowledge-base query (code source):

    class KnownLedgeBaseQA:
        # 初始化
        def __init__(self) -> None:
            k2v = KnownLedge2Vector()      # 创建一个知识到向量的转换器
            self.vector_store = k2v.init_vector_store()     # 初始化向量存储
            self.llm = VicunaLLM()         # 创建一个 VicunaLLM 对象
        
        # 获得与查询相似的答案
        def get_similar_answer(self, query):
            # 创建一个提示模板
            prompt = PromptTemplate(
                template=conv_qa_prompt_template, 
                input_variables=["context", "question"]  # 输入变量包括 "context"(上下文) 和 "question"(问题)
            )
    
            # 使用向量存储来检索文档
            retriever = self.vector_store.as_retriever(search_kwargs={"k": VECTOR_SEARCH_TOP_K}) 
            docs = retriever.get_relevant_documents(query=query)  # 获取与查询相关的文本
    
            context = [d.page_content for d in docs]     # 从文本中提取出内容
            result = prompt.format(context="\n".join(context), question=query) # 格式化模板,并用从文本中提取出的内容和问题填充
            return result                 # 返回结果
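
As promised in stage 3 above, here is a hedged sketch of vectorizing the user's question with the same embedding model used for indexing and matching it against the vector store; the model name, persist directory and question are illustrative assumptions.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# must be the same embedding model that was used when building the store
embeddings = HuggingFaceEmbeddings(model_name="GanymedeNil/text2vec-large-chinese")

vector_store = Chroma(persist_directory="./data/.vectordb", embedding_function=embeddings)

question = "What was the company's revenue in 2022?"
# similarity_search_with_score embeds the question internally and compares it with the stored vectors
docs_and_scores = vector_store.similarity_search_with_score(question, k=3)
for doc, score in docs_and_scores:
    print(score, doc.page_content[:80])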
    

As you can see, this way of combining langchain + LLM is particularly suitable for vertical domains or large group companies that want to use an LLM's conversational ability to build an internal, private Q&A system; it is also suitable for individuals who want to ask questions about a specific set of English papers. A popular open-source example is ChatPDF; from the document-processing perspective, its implementation flow is as follows (picture source):

2.2 Facebook AI Similarity Search (FAISS): Efficient vector similarity retrieval

Faiss stands for Facebook AI Similarity Search (official introduction page, GitHub address). It is a tool developed by Facebook's AI team for large-scale similarity search problems. It is written in C++ with a Python interface, can handle billion-scale indexes, and achieves millisecond-level retrieval performance.

Simply put, Faiss's job is to wrap our own set of candidate vectors into an index so that retrieving the TopK most similar vectors becomes much faster; some index types also support building on GPU.

2.2.1 Basic process of Faiss retrieval of similar vector TopK

Retrieving the TopK similar vectors with Faiss can basically be divided into three steps:

  1. Get vector library
    import numpy as np
    d = 64                                           # 向量维度
    nb = 100000                                      # index向量库的数据量
    nq = 10000                                       # 待检索query的数目
    np.random.seed(1234)             
    xb = np.random.random((nb, d)).astype('float32')
    xb[:, 0] += np.arange(nb) / 1000.                # index向量库的向量
    xq = np.random.random((nq, d)).astype('float32')
    xq[:, 0] += np.arange(nq) / 1000.                # 待检索的query向量
  2. Use faiss to build the index and add the vectors to the index.
    Here the brute-force index IndexFlatL2 is used; the L2 suffix indicates the similarity metric used by the index, namely the L2 norm (Euclidean distance).
    import faiss          
    index = faiss.IndexFlatL2(d)             
    print(index.is_trained)         # 输出为True,代表该类index不需要训练,只需要add向量进去即可
    index.add(xb)                   # 将向量库中的向量加入到index中
    print(index.ntotal)             # 输出index中包含的向量总数,为100000
  3. Use the faiss index to retrieve the TopK most similar vectors for each query.
    k = 4                     # topK的K值
    D, I = index.search(xq, k)# xq为待检索向量,返回的I为每个待检索query最相似TopK的索引list,D为其对应的距离
    print(I[:5])
    print(D[-5:])

    The printed output is:
    >>> 
    [[  0 393 363  78]
     [  1 555 277 364]
     [  2 304 101  13]
     [  3 173  18 182]
     [  4 288 370 531]]
    [[0.         7.17517328 7.2076292  7.25116253]
     [0.         6.32356453 6.6845808  6.79994535]
     [0.         5.79640865 6.39173603 7.28151226]
     [0.         7.27790546 7.52798653 7.66284657]
     [0.         6.76380348 7.29512024 7.36881447]]

2.2.2 FAISS’s various ways to build indexes

An index can be constructed via the index_factory interface, which takes the vector dimension, a parameter string and a metric:

dim, measure = 64, faiss.METRIC_L2
param = 'Flat'
index = faiss.index_factory(dim, param, measure)
  • dim is the vector dimension
  • The most important argument is param, the parameter string passed to index_factory, which specifies what type of index to build
  • measure is the distance metric; the two most commonly used are Euclidean distance and inner product. To compute cosine similarity, simply L2-normalize the vectors and use the inner-product metric, as in the short sketch below
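
As mentioned in the last bullet, cosine similarity can be obtained by L2-normalizing the vectors and then searching with the inner-product metric; a short, illustrative sketch reusing xb, xq and d from section 2.2.1:

import faiss

xb_cos = xb.copy()
xq_cos = xq.copy()
faiss.normalize_L2(xb_cos)           # in-place L2 normalization
faiss.normalize_L2(xq_cos)

index_ip = faiss.IndexFlatIP(d)      # inner product on unit vectors == cosine similarity
index_ip.add(xb_cos)
D, I = index_ip.search(xq_cos, 4)    # D now holds cosine similarities (higher is more similar)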

According to the referenced article, faiss officially supports eight metrics, namely:

  1. METRIC_INNER_PRODUCT (inner product)
  2. METRIC_L1 (Manhattan distance)
  3. METRIC_L2 (Euclidean distance)
  4. METRIC_Linf (infinity norm)
  5. METRIC_Lp (p-norm)
  6. METRIC_BrayCurtis (Bray-Curtis dissimilarity)
  7. METRIC_Canberra (Canberra distance)
  8. METRIC_JensenShannon (Jensen-Shannon divergence)

2.2.2.1 Flat: brute-force search

  • Advantages: this method is the most accurate and has the highest recall among all Faiss indexes; no other method beats it on that front
  • Disadvantages: slow, with high memory usage
  • Usage: the candidate vector set is small (within about 500,000) and memory is not tight
  • Note: although this is still brute-force search, faiss's brute-force search is much faster than a hand-written brute-force loop, so it is far from useless; if you need exact brute-force search, faiss is still the recommended choice
  • Build method:
dim, measure = 64, faiss.METRIC_L2
param = 'Flat'
index = faiss.index_factory(dim, param, measure)
index.is_trained                                   # 输出为True
index.add(xb)                                      # 向index中添加向量

2.2.2.2 IVFx,Flat: inverted file + brute-force search

  • Advantages: IVF uses the idea of an inverted index. In document retrieval, inversion means a keyword points to the many docs that contain it; since there are far fewer keywords than docs, this greatly reduces retrieval time. How is inversion applied to vectors? Take the vector IDs under each cluster centroid and hang the non-centroid vectors behind their centroid ID. At query time, find the nearest centroid IDs and only search the non-centroid vectors under those centroids, improving efficiency by shrinking the search scope
  • Disadvantages: the speedup is moderate
  • Usage: compared with Flat it greatly speeds up retrieval; recommended when the collection reaches millions of vectors
  • Parameters: the x in IVFx is the number of k-means cluster centroids
  • Build method:
dim, measure = 64, faiss.METRIC_L2 
param = 'IVF100,Flat'                           # 代表k-means聚类中心为100,   
index = faiss.index_factory(dim, param, measure)
print(index.is_trained)                          # 此时输出为False,因为倒排索引需要训练k-means,
index.train(xb)                                  # 因此需要先训练index,再add向量
index.add(xb)                                     

2.2.2.3 PQx: product quantization

  • Advantages: uses product quantization to improve on plain search by cutting a vector's dimensions into x segments, retrieving each segment separately, and intersecting the per-segment results to obtain the final TopK. It is therefore very fast, uses little memory, and has a relatively high recall
  • Disadvantages: recall drops more noticeably than with brute-force search
  • Usage: memory is extremely scarce, fast retrieval is required, and recall is not that important
  • Parameters: the x in PQx is the number of segments the vector is cut into, so x must divide the vector dimension evenly; the larger x is, the finer the segmentation and the higher the time complexity
  • Build method:
dim, measure = 64, faiss.METRIC_L2 
param =  'PQ16' 
index = faiss.index_factory(dim, param, measure)
print(index.is_trained)                          # 此时输出为False,因为倒排索引需要训练k-means,
index.train(xb)                                  # 因此需要先训练index,再add向量
index.add(xb)          

2.2.2.4 IVFx,PQy: inverted file + product quantization

  • Advantages: this method is widely used in industry and all of its metrics are acceptable; it uses product quantization to improve IVF's k-means, cutting a vector's dimensions into x segments and running k-means retrieval on each segment separately
  • Disadvantages: it combines the strengths of many approaches, but also inherits some of their weaknesses
  • Usage: in general, if you have no extreme requirements in any particular dimension, this is the most recommended method
  • Parameters: IVFx, PQy, where x and y mean the same as above
  • Build method:
dim, measure = 64, faiss.METRIC_L2  
param =  'IVF100,PQ16'
index = faiss.index_factory(dim, param, measure) 
print(index.is_trained)                          # 此时输出为False,因为倒排索引需要训练k-means, 
index.train(xb)                                  # 因此需要先训练index,再add向量
index.add(xb)

2.2.2.5 LSH: locality-sensitive hashing

  • Principle: hashing is familiar to everyone, and vectors can also use hashing to speed up search. The hashing here is Locality-Sensitive Hashing (LSH): unlike traditional hashing, which tries to avoid collisions, LSH relies on collisions to find neighbors. If two points in a high-dimensional space are very close, a hash function is designed so that, after hashing and bucketing, the two points fall into the same bucket with high probability; if the two points are far apart, the probability of them landing in the same bucket is very small
  • Advantages: training is very fast, batch import is supported, the index uses very little memory, and retrieval is also fast
  • Disadvantages: recall is very low
  • Usage: the candidate vector library is very large, retrieval is offline, and memory resources are scarce
  • Build method:
dim, measure = 64, faiss.METRIC_L2  
param =  'LSH'
index = faiss.index_factory(dim, param, measure) 
print(index.is_trained)                          # 此时输出为True
index.add(xb)       

2.2.2.6 HNSWx

  • Advantages: an improved graph-based retrieval method. Retrieval is extremely fast, returning results within seconds even at the billion-vector scale, with recall almost comparable to Flat (it can reach an astonishing 97%). The retrieval time complexity is roughly log(log(n)), so the size of the candidate set can almost be ignored. It also supports batch import and is extremely well suited to online tasks, offering a millisecond-level experience
  • Disadvantages: building the index is extremely slow and uses a lot of memory (the most in Faiss, more than the memory occupied by the original vectors)
  • Parameters: the x in HNSWx is the maximum number of neighbors each node connects to when building the graph; the larger x is, the more complex the graph, the more accurate the query and, of course, the slower the index construction. x can be any integer from 4 to 64
  • Usage: when memory is not a concern and there is plenty of time to build the index
  • Build method:
dim, measure = 64, faiss.METRIC_L2   
param =  'HNSW64' 
index = faiss.index_factory(dim, param, measure)  
print(index.is_trained)                          # 此时输出为True 
index.add(xb)

2.3 Project deployment: langchain + ChatGLM-6B to build a local knowledge base for Q&A

2.3.1 Deployment process one: support multiple usage modes

The LLM can be selected according to actual business needs. This project uses ChatGLM-6B, whose GitHub address is https://github.com/THUDM/ChatGLM-6B.
ChatGLM-6B is an open-source, Chinese-English bilingual conversational language model based on the General Language Model (GLM) architecture, with 6.2 billion parameters. Combined with model quantization, it can be deployed locally on consumer-grade graphics cards (a minimum of 6 GB of video memory at the INT4 quantization level).

ChatGLM-6B uses technology similar to ChatGPT and is optimized for Chinese Q&A and dialogue. After bilingual Chinese-English training on about 1T tokens, supplemented by supervised fine-tuning, feedback bootstrapping, reinforcement learning from human feedback and other techniques, the 6.2-billion-parameter ChatGLM-6B can generate answers that are quite consistent with human preferences.

  1. Create a new python3.8.13 environment (the model file can still be used)
    conda create -n langchain python==3.8.13
  2. Pull items
    git clone https://github.com/imClumsyPanda/langchain-ChatGLM.git
  3. Enter directory
    cd langchain-ChatGLM
  4. Install requirements.txt
    conda activate langchain
    pip install -r requirements.txt
  5. The highest langchain version supported by the current environment is 0.0.166; if 0.0.174 cannot be installed, try installing 0.0.166 first. Then open the configuration file:
    vi configs/model_config.py
  6. Set the path of chatglm-6b to your own
    "chatglm-6b": {
        "name": "chatglm-6b",
        "pretrained_model_name": "/data/sim_chatgpt/chatglm-6b",
        "local_model_path": None,
        "provides": "ChatGLM"
    },
  7. Modify the code file to be run: webui.py
    vi webui.py
  8. Set share in the last launch function to True and inbrowser to True
  9. Execute webui.py file
    python webui.py
    A network problem may prevent the public link from being created; in that case you can map the cloud server's port to a local port, reference: https://www.cnblogs.com/monologuesmw/p/14465117.html

Corresponding output:


Video memory occupied: about 15 G

2.3.2 Deployment process 2: Support online experience on multiple communities

Project address: https://github.com/thomas-yanxin/LangChain-ChatGLM-Webui
HuggingFace community online demo: https://huggingface.co/spaces/thomas-yanxin/LangChain-ChatLLM

In addition, it also supports online demos in the ModelScope community and the PaddlePaddle AI Studio community.

  1. Download project
    git clone https://github.com/thomas-yanxin/LangChain-ChatGLM-Webui.git
  2. Enter directory
    cd LangChain-ChatGLM-Webui
  3. Install required packages
    pip install -r requirements.txt
    pip install gradio==3.10
  4. Modify config.py
    init_llm = "ChatGLM-6B"
    
    llm_model_dict = {
        "chatglm": {
            "ChatGLM-6B": "/data/sim_chatgpt/chatglm-6b",
  5. Modify the app.py file: set share in the launch function to True and inbrowser to True, then execute the webui.py file:
    python webui.py


 Video memory takes up about 13G


Part 3: An in-depth, line-by-line analysis: source code interpretation of the langchain-ChatGLM project (first edition, July 2023)

Let’s review the architecture diagram of the langchain-ChatGLM project again ( picture source )

You will find that the project mainly consists of the following modules

  1. chains: Working link implementation, such as chains/local_doc_qa implements Q&A based on local documents.
  2. configs: configuration file storage
  3. knowledge_base/content: used to store uploaded original files
  4. loader: Implementation class of document loader
  5. models: llm's interface class and implementation class, providing streaming output support for open source models
  6. textsplitter: implementation class of text segmentation
  7. vectorstores: used to store vector library files, that is, local knowledge base ontology
  8. ..

Next, to make the code easier for readers to understand at a glance,

  1. I have added Chinese comments to essentially every line of the project code below
  2. and, for smoother understanding, I walk through the code folders in the order of the project's execution flow (rather than the order in which the folders are listed on GitHub in the picture above)

If you have any questions, you can leave a comment at any time

3.1 agent:custom_agent/bing_search

3.1.1 agent/custom_agent.py

from langchain.agents import Tool          # 导入工具模块
from langchain.tools import BaseTool       # 导入基础工具类
from langchain import PromptTemplate, LLMChain      # 导入提示模板和语言模型链
from agent.custom_search import DeepSearch          # 导入自定义搜索模块

# 导入基础单动作代理,输出解析器,语言模型单动作代理和代理执行器
from langchain.agents import BaseSingleActionAgent, AgentOutputParser, LLMSingleActionAgent, AgentExecutor    
from typing import List, Tuple, Any, Union, Optional, Type      # 导入类型注释模块
from langchain.schema import AgentAction, AgentFinish           # 导入代理动作和代理完成模式
from langchain.prompts import StringPromptTemplate          # 导入字符串提示模板
from langchain.callbacks.manager import CallbackManagerForToolRun      # 导入工具运行回调管理器
from langchain.base_language import BaseLanguageModel       # 导入基础语言模型
import re                                                   # 导入正则表达式模块

# 定义一个代理模板字符串
agent_template = """
你现在是一个{role}。这里是一些已知信息:
{related_content}
{background_infomation}
{question_guide}:{input}

{answer_format}
"""

# 定义一个自定义提示模板类,继承自字符串提示模板
class CustomPromptTemplate(StringPromptTemplate):
    template: str          # 提示模板字符串
    tools: List[Tool]      # 工具列表

    # 定义一个格式化函数,根据提供的参数生成最终的提示模板
    def format(self, **kwargs) -> str:
        intermediate_steps = kwargs.pop("intermediate_steps")
        # 判断是否有互联网查询信息
        if len(intermediate_steps) == 0:
            # 如果没有,则给出默认的背景信息,角色,问题指导和回答格式
            background_infomation = "\n"
            role = "傻瓜机器人"
            question_guide = "我现在有一个问题"
            answer_format = "如果你知道答案,请直接给出你的回答!如果你不知道答案,请你只回答\"DeepSearch('搜索词')\",并将'搜索词'替换为你认为需要搜索的关键词,除此之外不要回答其他任何内容。\n\n下面请回答我上面提出的问题!"

        else:
            # 否则,根据 intermediate_steps 中的 AgentAction 拼装 background_infomation
            background_infomation = "\n\n你还有这些已知信息作为参考:\n\n"
            action, observation = intermediate_steps[0]
            background_infomation += f"{observation}\n"
            role = "聪明的 AI 助手"
            question_guide = "请根据这些已知信息回答我的问题"
            answer_format = ""

        kwargs["background_infomation"] = background_infomation
        kwargs["role"] = role
        kwargs["question_guide"] = question_guide
        kwargs["answer_format"] = answer_format
        return self.template.format(**kwargs)  # 格式化模板并返回

# 定义一个自定义搜索工具类,继承自基础工具类
class CustomSearchTool(BaseTool):
    name: str = "DeepSearch"           # 工具名称
    description: str = ""              # 工具描述

    # 定义一个运行函数,接受一个查询字符串和一个可选的回调管理器作为参数,返回DeepSearch的搜索结果
    def _run(self, query: str, run_manager: Optional[CallbackManagerForToolRun] = None):
        return DeepSearch.search(query = query)

    # 定义一个异步运行函数,但由于DeepSearch不支持异步,所以直接抛出一个未实现错误
    async def _arun(self, query: str):
        raise NotImplementedError("DeepSearch does not support async")

# 定义一个自定义代理类,继承自基础单动作代理
class CustomAgent(BaseSingleActionAgent):
    # 定义一个输入键的属性
    @property
    def input_keys(self):
        return ["input"]

    # 定义一个计划函数,接受一组中间步骤和其他参数,返回一个代理动作或者代理完成
    def plan(self, intermedate_steps: List[Tuple[AgentAction, str]],
            **kwargs: Any) -> Union[AgentAction, AgentFinish]:
        return AgentAction(tool="DeepSearch", tool_input=kwargs["input"], log="")

# 定义一个自定义输出解析器,继承自代理输出解析器
class CustomOutputParser(AgentOutputParser):
    # 定义一个解析函数,接受一个语言模型的输出字符串,返回一个代理动作或者代理完成
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # 使用正则表达式匹配输出字符串,group1是调用函数名字,group2是传入参数
        match = re.match(r'^[\s\w]*(DeepSearch)\(([^\)]+)\)', llm_output, re.DOTALL)
        print(match)

        # 如果语言模型没有返回 DeepSearch() 则认为直接结束指令
        if not match:
            return AgentFinish(
                return_values={"output": llm_output.strip()},
                log=llm_output,
            )
        # 否则的话都认为需要调用 Tool
        else:
            action = match.group(1).strip()
            action_input = match.group(2).strip()
            return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)


# 定义一个深度代理类
class DeepAgent:
    tool_name: str = "DeepSearch"       # 工具名称
    agent_executor: any                 # 代理执行器
    tools: List[Tool]                   # 工具列表
    llm_chain: any                      # 语言模型链

    # 定义一个查询函数,接受一个相关内容字符串和一个查询字符串,返回执行器的运行结果
    def query(self, related_content: str = "", query: str = ""):
        tool_name = self.tool_name
        result = self.agent_executor.run(related_content=related_content, input=query, tool_name=self.tool_name)
        return result       # 返回执行器的运行结果

    # 在初始化函数中,首先从DeepSearch工具创建一个工具实例,并添加到工具列表中
    def __init__(self, llm: BaseLanguageModel, **kwargs):
        tools = [
                    Tool.from_function(
                        func=DeepSearch.search,
                        name="DeepSearch",
                        description=""
                    )
                ]
        self.tools = tools      # 保存工具列表
        tool_names = [tool.name for tool in tools]    # 提取工具列表中的工具名称
        output_parser = CustomOutputParser()          # 创建一个自定义输出解析器实例
        # 创建一个自定义提示模板实例
        prompt = CustomPromptTemplate(template=agent_template,
                                      tools=tools,
                                      input_variables=["related_content","tool_name", "input", "intermediate_steps"])
        # 创建一个语言模型链实例
        llm_chain = LLMChain(llm=llm, prompt=prompt)
        self.llm_chain = llm_chain      # 保存语言模型链实例

        # 创建一个语言模型单动作代理实例
        agent = LLMSingleActionAgent(
            llm_chain=llm_chain,
            output_parser=output_parser,
            stop=["\nObservation:"],
            allowed_tools=tool_names
        )

        # 创建一个代理执行器实例
        agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True)
        self.agent_executor = agent_executor         # 保存代理执行器实例
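
Overall, the purpose of this file is to build a deep-search AI agent: the agent receives a question, formats it into the prompt template above, and the LLM either answers directly or replies with DeepSearch('搜索词'), which CustomOutputParser turns into a tool call whose result is fed back in as background information on the next round. A minimal usage sketch follows (illustrative only, not project source code; it assumes an llm instance has already been loaded, e.g. via models/shared.py):

# Illustrative sketch: assumes `llm` is an already-loaded BaseLanguageModel instance (see models/shared.py)
from agent.custom_agent import DeepAgent

deep_agent = DeepAgent(llm)      # builds the prompt template, LLMChain, output parser and agent executor
answer = deep_agent.query(related_content="", query="向量数据库和传统数据库的区别是什么?")
print(answer)                    # either a direct answer, or an answer grounded in DeepSearch results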

3.1.2 agent/bing_search.py

#coding=utf8
# 声明文件编码格式为 utf8

from langchain.utilities import BingSearchAPIWrapper
# 导入 BingSearchAPIWrapper 类,这个类用于与 Bing 搜索 API 进行交互

from configs.model_config import BING_SEARCH_URL, BING_SUBSCRIPTION_KEY
# 导入配置文件中的 Bing 搜索 URL 和 Bing 订阅密钥

def bing_search(text, result_len=3):
    # 定义一个名为 bing_search 的函数,该函数接收一个文本和结果长度的参数,默认结果长度为3

    if not (BING_SEARCH_URL and BING_SUBSCRIPTION_KEY):
        # 如果 Bing 搜索 URL 或 Bing 订阅密钥未设置,则返回一个错误信息的文档
        return [{"snippet": "please set BING_SUBSCRIPTION_KEY and BING_SEARCH_URL in os ENV",
                 "title": "env inof not fould",
                 "link": "https://python.langchain.com/en/latest/modules/agents/tools/examples/bing_search.html"}]

    search = BingSearchAPIWrapper(bing_subscription_key=BING_SUBSCRIPTION_KEY,
                                  bing_search_url=BING_SEARCH_URL)
    # 创建 BingSearchAPIWrapper 类的实例,该实例用于与 Bing 搜索 API 进行交互

    return search.results(text, result_len)
    # 返回搜索结果,结果的数量由 result_len 参数决定

if __name__ == "__main__":
    # 如果这个文件被直接运行,而不是被导入作为模块,那么就执行以下代码

    r = bing_search('python')
    # 使用 Bing 搜索 API 来搜索 "python" 这个词,并将结果保存在变量 r 中

    print(r)
    # 打印出搜索结果

3.2 models: contains the model classes and the document loader

  • models: llm's interface class and implementation class, providing streaming output support for open source models
  • loader: Implementation class of document loader

3.2.1 models/chatglm_llm.py

from abc import ABC  # 导入抽象基类
from langchain.llms.base import LLM           # 导入语言学习模型基类
from typing import Optional, List             # 导入类型标注模块
from models.loader import LoaderCheckPoint    # 导入模型加载点
from models.base import (BaseAnswer,          # 导入基本回答模型
                         AnswerResult)        # 导入回答结果模型


class ChatGLM(BaseAnswer, LLM, ABC):  # 定义ChatGLM类,继承基础回答、语言学习模型和抽象基类
    max_token: int = 10000          # 最大的token数
    temperature: float = 0.01       # 温度参数,用于控制生成文本的随机性
    top_p = 0.9  # 排序前0.9的token会被保留
    checkPoint: LoaderCheckPoint = None  # 检查点模型
    # history = []          # 历史记录
    history_len: int = 10   # 历史记录长度

    def __init__(self, checkPoint: LoaderCheckPoint = None):  # 初始化方法
        super().__init__()  # 调用父类的初始化方法
        self.checkPoint = checkPoint  # 赋值检查点模型

    @property
    def _llm_type(self) -> str:  # 定义只读属性_llm_type,返回语言学习模型的类型
        return "ChatGLM"

    @property
    def _check_point(self) -> LoaderCheckPoint:  # 定义只读属性_check_point,返回检查点模型
        return self.checkPoint

    @property
    def _history_len(self) -> int:  # 定义只读属性_history_len,返回历史记录的长度
        return self.history_len

    def set_history_len(self, history_len: int = 10) -> None:  # 设置历史记录长度
        self.history_len = history_len

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:  # 定义_call方法,实现模型的具体调用
        print(f"__call:{prompt}")  # 打印调用的提示信息
        response, _ = self.checkPoint.model.chat(  # 调用模型的chat方法,获取回答和其他信息
            self.checkPoint.tokenizer,  # 使用的分词器
            prompt,  # 提示信息
            history=[],  # 历史记录
            max_length=self.max_token,      # 最大长度
            temperature=self.temperature    # 温度参数
        )
        print(f"response:{response}")  # 打印回答信息
        print(f"+++++++++++++++++++++++++++++++++++")  # 打印分隔线
        return response  # 返回回答

    def generatorAnswer(self, prompt: str,
                         history: List[List[str]] = [],
                         streaming: bool = False):  # 定义生成回答的方法,可以处理流式输入

        if streaming:  # 如果是流式输入
            history += [[]]  # 在历史记录中添加新的空列表
            for inum, (stream_resp, _) in enumerate(self.checkPoint.model.stream_chat(  # 对模型的stream_chat方法返回的结果进行枚举
                    self.checkPoint.tokenizer,  # 使用的分词器
                    prompt,  # 提示信息
                    history=history[-self.history_len:-1] if self.history_len > 1 else [],  # 使用的历史记录
                    max_length=self.max_token,  # 最大长度
                    temperature=self.temperature  # 温度参数
            )):
                # self.checkPoint.clear_torch_cache()  # 清空缓存
                history[-1] = [prompt, stream_resp]  # 更新最后一个历史记录
                answer_result = AnswerResult()  # 创建回答结果对象
                answer_result.history = history  # 更新回答结果的历史记录
                answer_result.llm_output = {"answer": stream_resp}  # 更新回答结果的输出
                yield answer_result  # 生成回答结果
        else:  # 如果不是流式输入
            response, _ = self.checkPoint.model.chat(  # 调用模型的chat方法,获取回答和其他信息
                self.checkPoint.tokenizer,  # 使用的分词器
                prompt,  # 提示信息
                history=history[-self.history_len:] if self.history_len > 0 else [],  # 使用的历史记录
                max_length=self.max_token,  # 最大长度
                temperature=self.temperature  # 温度参数
            )
            self.checkPoint.clear_torch_cache()  # 清空缓存
            history += [[prompt, response]]  # 更新历史记录
            answer_result = AnswerResult()  # 创建回答结果对象
            answer_result.history = history  # 更新回答结果的历史记录
            answer_result.llm_output = {"answer": response}  # 更新回答结果的输出
            yield answer_result  # 生成回答结果
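
To make the call path concrete, here is a minimal sketch of driving this class directly (illustrative only; `checkpoint` is assumed to be a LoaderCheckPoint that has already loaded chatglm-6b and its tokenizer):

# Illustrative only: `checkpoint` is an assumed, already-initialized LoaderCheckPoint instance
from models.chatglm_llm import ChatGLM

llm = ChatGLM(checkPoint=checkpoint)
llm.set_history_len(3)

# With streaming=True, each yielded AnswerResult carries a progressively longer partial answer
for answer_result in llm.generatorAnswer("什么是向量数据库?", history=[], streaming=True):
    print(answer_result.llm_output["answer"])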

3.2.2 models/shared.py

The function of this file is to load the LLM, either a local model or a remote model called through an API such as FastChat

import sys      # 导入sys模块,通常用于与Python解释器进行交互
from typing import Any      # 从typing模块导入Any,用于表示任何类型

# 从models.loader.args模块导入parser,可能是解析命令行参数用
from models.loader.args import parser       
# 从models.loader模块导入LoaderCheckPoint,可能是模型加载点
from models.loader import LoaderCheckPoint  

# 从configs.model_config模块导入llm_model_dict和LLM_MODEL
from configs.model_config import (llm_model_dict, LLM_MODEL)  
# 从models.base模块导入BaseAnswer,即模型的基础类
from models.base import BaseAnswer  

# 定义一个名为loaderCheckPoint的变量,类型为LoaderCheckPoint,并初始化为None
loaderCheckPoint: LoaderCheckPoint = None  


def loaderLLM(llm_model: str = None, no_remote_model: bool = False, use_ptuning_v2: bool = False) -> Any:
    """
    初始化 llm_model_ins LLM
    :param llm_model: 模型名称
    :param no_remote_model: 是否使用远程模型,如果需要加载本地模型,则添加 `--no-remote-model
    :param use_ptuning_v2: 是否使用 p-tuning-v2 PrefixEncoder
    :return:
    """
    pre_model_name = loaderCheckPoint.model_name      # 获取loaderCheckPoint的模型名称
    llm_model_info = llm_model_dict[pre_model_name]   # 从模型字典中获取模型信息

    if no_remote_model:      # 如果不使用远程模型
        loaderCheckPoint.no_remote_model = no_remote_model  # 将loaderCheckPoint的no_remote_model设置为True
    if use_ptuning_v2:       # 如果使用p-tuning-v2
        loaderCheckPoint.use_ptuning_v2 = use_ptuning_v2    # 将loaderCheckPoint的use_ptuning_v2设置为True

    if llm_model:            # 如果指定了模型名称
        llm_model_info = llm_model_dict[llm_model]  # 从模型字典中获取指定的模型信息

    if loaderCheckPoint.no_remote_model:  # 如果不使用远程模型
        loaderCheckPoint.model_name = llm_model_info['name']  # 将loaderCheckPoint的模型名称设置为模型信息中的name
    else:  # 如果使用远程模型
        loaderCheckPoint.model_name = llm_model_info['pretrained_model_name']  # 将loaderCheckPoint的模型名称设置为模型信息中的pretrained_model_name

    loaderCheckPoint.model_path = llm_model_info["local_model_path"]  # 设置模型的本地路径

    if 'FastChatOpenAILLM' in llm_model_info["provides"]:  # 如果模型信息中的provides包含'FastChatOpenAILLM'
        loaderCheckPoint.unload_model()  # 卸载模型
    else:  # 如果不包含
        loaderCheckPoint.reload_model()  # 重新加载模型

    provides_class = getattr(sys.modules['models'], llm_model_info['provides'])  # 获取模型类
    modelInsLLM = provides_class(checkPoint=loaderCheckPoint)  # 创建模型实例
    if 'FastChatOpenAILLM' in llm_model_info["provides"]:      # 如果模型信息中的provides包含'FastChatOpenAILLM'
        modelInsLLM.set_api_base_url(llm_model_info['api_base_url'])  # 设置API基础URL
        modelInsLLM.call_model_name(llm_model_info['name'])    # 设置模型名称
    return modelInsLLM  # 返回模型实例
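
As a rough sketch of how this module is used elsewhere in the project: the module-level loaderCheckPoint must be set before loaderLLM is called. The constructor arguments below are assumptions and may differ from the project's actual webui/api entry points:

# Illustrative only: the exact LoaderCheckPoint constructor signature is an assumption
import models.shared as shared
from models.loader import LoaderCheckPoint
from models.loader.args import parser

args = parser.parse_args([])                             # parse with default command-line arguments
shared.loaderCheckPoint = LoaderCheckPoint(vars(args))   # set the module-level checkpoint first
llm_model_ins = shared.loaderLLM()                       # returns a BaseAnswer instance, e.g. ChatGLM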

// To be updated..

3.3 configs: Configuration file storage model_config.py

import torch.cuda
import torch.backends
import os
import logging
import uuid

LOG_FORMAT = "%(levelname) -5s %(asctime)s" "-1d: %(message)s"
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.basicConfig(format=LOG_FORMAT)

# 在以下字典中修改属性值,以指定本地embedding模型存储位置
# 如将 "text2vec": "GanymedeNil/text2vec-large-chinese" 修改为 "text2vec": "User/Downloads/text2vec-large-chinese"
# 此处请写绝对路径
embedding_model_dict = {
    "ernie-tiny": "nghuyong/ernie-3.0-nano-zh",
    "ernie-base": "nghuyong/ernie-3.0-base-zh",
    "text2vec-base": "shibing624/text2vec-base-chinese",
    "text2vec": "GanymedeNil/text2vec-large-chinese",
    "m3e-small": "moka-ai/m3e-small",
    "m3e-base": "moka-ai/m3e-base",
}

# Embedding model name
EMBEDDING_MODEL = "text2vec"

# Embedding running device
EMBEDDING_DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"


# supported LLM models
# llm_model_dict 处理了loader的一些预设行为,如加载位置,模型名称,模型处理器实例
# 在以下字典中修改属性值,以指定本地 LLM 模型存储位置
# 如将 "chatglm-6b" 的 "local_model_path" 由 None 修改为 "User/Downloads/chatglm-6b"
# 此处请写绝对路径
llm_model_dict = {
    "chatglm-6b-int4-qe": {
        "name": "chatglm-6b-int4-qe",
        "pretrained_model_name": "THUDM/chatglm-6b-int4-qe",
        "local_model_path": None,
        "provides": "ChatGLM"
    },
    "chatglm-6b-int4": {
        "name": "chatglm-6b-int4",
        "pretrained_model_name": "THUDM/chatglm-6b-int4",
        "local_model_path": None,
        "provides": "ChatGLM"
    },
    "chatglm-6b-int8": {
        "name": "chatglm-6b-int8",
        "pretrained_model_name": "THUDM/chatglm-6b-int8",
        "local_model_path": None,
        "provides": "ChatGLM"
    },
    "chatglm-6b": {
        "name": "chatglm-6b",
        "pretrained_model_name": "THUDM/chatglm-6b",
        "local_model_path": None,
        "provides": "ChatGLM"
    },
    "chatglm2-6b": {
        "name": "chatglm2-6b",
        "pretrained_model_name": "THUDM/chatglm2-6b",
        "local_model_path": None,
        "provides": "ChatGLM"
    },
    "chatglm2-6b-int4": {
        "name": "chatglm2-6b-int4",
        "pretrained_model_name": "THUDM/chatglm2-6b-int4",
        "local_model_path": None,
        "provides": "ChatGLM"
    },
    "chatglm2-6b-int8": {
        "name": "chatglm2-6b-int8",
        "pretrained_model_name": "THUDM/chatglm2-6b-int8",
        "local_model_path": None,
        "provides": "ChatGLM"
    },
    "chatyuan": {
        "name": "chatyuan",
        "pretrained_model_name": "ClueAI/ChatYuan-large-v2",
        "local_model_path": None,
        "provides": None
    },
    "moss": {
        "name": "moss",
        "pretrained_model_name": "fnlp/moss-moon-003-sft",
        "local_model_path": None,
        "provides": "MOSSLLM"
    },
    "vicuna-13b-hf": {
        "name": "vicuna-13b-hf",
        "pretrained_model_name": "vicuna-13b-hf",
        "local_model_path": None,
        "provides": "LLamaLLM"
    },

    # 通过 fastchat 调用的模型请参考如下格式
    "fastchat-chatglm-6b": {
        "name": "chatglm-6b",             # "name"修改为fastchat服务中的"model_name"
        "pretrained_model_name": "chatglm-6b",
        "local_model_path": None,
        "provides": "FastChatOpenAILLM",  # 使用fastchat api时,需保证"provides"为"FastChatOpenAILLM"
        "api_base_url": "http://localhost:8000/v1"  # "name"修改为fastchat服务中的"api_base_url"
    },
    "fastchat-chatglm2-6b": {
        "name": "chatglm2-6b",              # "name"修改为fastchat服务中的"model_name"
        "pretrained_model_name": "chatglm2-6b",
        "local_model_path": None,
        "provides": "FastChatOpenAILLM",    # 使用fastchat api时,需保证"provides"为"FastChatOpenAILLM"
        "api_base_url": "http://localhost:8000/v1"  # "name"修改为fastchat服务中的"api_base_url"
    },

    # 通过 fastchat 调用的模型请参考如下格式
    "fastchat-vicuna-13b-hf": {
        "name": "vicuna-13b-hf",          # "name"修改为fastchat服务中的"model_name"
        "pretrained_model_name": "vicuna-13b-hf",
        "local_model_path": None,
        "provides": "FastChatOpenAILLM",  # 使用fastchat api时,需保证"provides"为"FastChatOpenAILLM"
        "api_base_url": "http://localhost:8000/v1"  # "name"修改为fastchat服务中的"api_base_url"
    },
}

# LLM 名称
LLM_MODEL = "chatglm-6b"
# 量化加载8bit 模型
LOAD_IN_8BIT = False
# Load the model with bfloat16 precision. Requires NVIDIA Ampere GPU.
BF16 = False
# 本地lora存放的位置
LORA_DIR = "loras/"

# LLM lora path,默认为空,如果有请直接指定文件夹路径
LLM_LORA_PATH = ""
USE_LORA = True if LLM_LORA_PATH else False

# LLM streaming reponse
STREAMING = True

# Use p-tuning-v2 PrefixEncoder
USE_PTUNING_V2 = False

# LLM running device
LLM_DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

# 知识库默认存储路径
KB_ROOT_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "knowledge_base")

# 基于上下文的prompt模版,请务必保留"{question}"和"{context}"
PROMPT_TEMPLATE = """已知信息:
{context} 

根据上述已知信息,简洁和专业的来回答用户的问题。如果无法从中得到答案,请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”,不允许在答案中添加编造成分,答案请使用中文。 问题是:{question}"""

# 缓存知识库数量,如果是ChatGLM2,ChatGLM2-int4,ChatGLM2-int8模型若检索效果不好可以调成’10’
CACHED_VS_NUM = 1

# 文本分句长度
SENTENCE_SIZE = 100

# 匹配后单段上下文长度
CHUNK_SIZE = 250

# 传入LLM的历史记录长度
LLM_HISTORY_LEN = 3

# 知识库检索时返回的匹配内容条数
VECTOR_SEARCH_TOP_K = 5

# 知识检索内容相关度 Score, 数值范围约为0-1100,如果为0,则不生效,经测试设置为小于500时,匹配结果更精准
VECTOR_SEARCH_SCORE_THRESHOLD = 0

NLTK_DATA_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "nltk_data")

FLAG_USER_NAME = uuid.uuid4().hex

logger.info(f"""
loading model config
llm device: {LLM_DEVICE}
embedding device: {EMBEDDING_DEVICE}
dir: {os.path.dirname(os.path.dirname(__file__))}
flagging username: {FLAG_USER_NAME}
""")

# 是否开启跨域,默认为False,如果需要开启,请设置为True
# is open cross domain
OPEN_CROSS_DOMAIN = False

# Bing 搜索必备变量
# 使用 Bing 搜索需要使用 Bing Subscription Key,需要在azure port中申请试用bing search
# 具体申请方式请见
# https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/create-bing-search-service-resource
# 使用python创建bing api 搜索实例详见:
# https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/quickstarts/rest/python
BING_SEARCH_URL = "https://api.bing.microsoft.com/v7.0/search"
# 注意不是bing Webmaster Tools的api key,

# 此外,如果是在服务器上,报Failed to establish a new connection: [Errno 110] Connection timed out
# 是因为服务器加了防火墙,需要联系管理员加白名单,如果公司的服务器的话,就别想了GG
BING_SUBSCRIPTION_KEY = ""

# 是否开启中文标题加强,以及标题增强的相关配置
# 通过增加标题判断,判断哪些文本为标题,并在metadata中进行标记;
# 然后将文本与往上一级的标题进行拼合,实现文本信息的增强。
ZH_TITLE_ENHANCE = False

3.4 loader: document loading and text conversion

3.4.1 loader/pdf_loader.py

# 导入类型提示模块,用于强化代码的可读性和健壮性
from typing import List

# 导入UnstructuredFileLoader,这是一个从非结构化文件中加载文档的类
from langchain.document_loaders.unstructured import UnstructuredFileLoader

# 导入PaddleOCR,这是一个开源的OCR工具,用于从图片中识别和读取文字
from paddleocr import PaddleOCR

# 导入os模块,用于处理文件和目录
import os

# 导入fitz模块,用于处理PDF文件
import fitz

# 导入nltk模块,用于处理文本数据
import nltk

# 导入模型配置文件中的NLTK_DATA_PATH,这是nltk数据的路径
from configs.model_config import NLTK_DATA_PATH

# 设置nltk数据的路径,将模型配置中的路径添加到nltk的数据路径中
nltk.data.path = [NLTK_DATA_PATH] + nltk.data.path

# 定义一个类,UnstructuredPaddlePDFLoader,该类继承自UnstructuredFileLoader
class UnstructuredPaddlePDFLoader(UnstructuredFileLoader):

    # 定义一个内部方法_get_elements,返回一个列表
    def _get_elements(self) -> List:

        # 定义一个内部函数pdf_ocr_txt,用于从pdf中进行OCR并输出文本文件
        def pdf_ocr_txt(filepath, dir_path="tmp_files"):
            # 将dir_path与filepath的目录部分合并成一个新的路径
            full_dir_path = os.path.join(os.path.dirname(filepath), dir_path)

            # 如果full_dir_path对应的目录不存在,则创建这个目录
            if not os.path.exists(full_dir_path):
                os.makedirs(full_dir_path)
            
            # 创建一个PaddleOCR实例,设置一些参数
            ocr = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=False, show_log=False)

            # 打开pdf文件
            doc = fitz.open(filepath)

            # 创建一个txt文件的路径
            txt_file_path = os.path.join(full_dir_path, f"{os.path.split(filepath)[-1]}.txt")

            # 创建一个临时的图片文件路径
            img_name = os.path.join(full_dir_path, 'tmp.png')

            # 打开txt_file_path对应的文件,并以写模式打开
            with open(txt_file_path, 'w', encoding='utf-8') as fout:
                # 遍历pdf的所有页面
                for i in range(doc.page_count):
                    # 获取当前页面
                    page = doc[i]

                    # 获取当前页面的文本内容,并写入txt文件
                    text = page.get_text("")
                    fout.write(text)
                    fout.write("\n")

                    # 获取当前页面的所有图片
                    img_list = page.get_images()

                    # 遍历所有图片
                    for img in img_list:
                        # 将图片转换为Pixmap对象
                        pix = fitz.Pixmap(doc, img[0])

                        # 如果图片有颜色信息,则将其转换为RGB格式
                        if pix.n - pix.alpha >= 4:
                            pix = fitz.Pixmap(fitz.csRGB, pix)
                        
                        # 保存图片
                        pix.save(img_name)

                        # 对图片进行OCR识别
                        result = ocr.ocr(img_name)

                        # 从OCR结果中提取文本,并写入txt文件
                        ocr_result = [i[1][0] for line in result for i in line]
                        fout.write("\n".join(ocr_result))
            
            # 如果图片文件存在,则删除它
            if os.path.exists(img_name):
                os.remove(img_name)
            
            # 返回txt文件的路径
            return txt_file_path

        # 调用上面定义的函数,获取txt文件的路径
        txt_file_path = pdf_ocr_txt(self.file_path)

        # 导入partition_text函数,该函数用于将文本文件分块
        from unstructured.partition.text import partition_text

        # 对txt文件进行分块,并返回分块结果
        return partition_text(filename=txt_file_path, **self.unstructured_kwargs)

# 运行入口
if __name__ == "__main__":
    # 导入sys模块,用于操作Python的运行环境
    import sys

    # 将当前文件的上一级目录添加到Python的搜索路径中
    sys.path.append(os.path.dirname(os.path.dirname(__file__)))

    # 定义一个pdf文件的路径
    filepath = os.path.join(os.path.dirname(os.path.dirname(__file__)), "knowledge_base", "samples", "content", "test.pdf")

    # 创建一个UnstructuredPaddlePDFLoader的实例
    loader = UnstructuredPaddlePDFLoader(filepath, mode="elements")

    # 加载文档
    docs = loader.load()

    # 遍历并打印所有文档
    for doc in docs:
        print(doc)

// To be updated..

3.5  textsplitter : document segmentation

3.5.1 textsplitter/ali_text_splitter.py

The code of ali_text_splitter.py is as follows

# 导入CharacterTextSplitter模块,用于文本切分
from langchain.text_splitter import CharacterTextSplitter  
import re                  # 导入正则表达式模块,用于文本匹配和替换
from typing import List    # 导入List类型,用于指定返回的数据类型
 
# 定义一个新的类AliTextSplitter,继承自CharacterTextSplitter
class AliTextSplitter(CharacterTextSplitter):  
    # 类的初始化函数,如果参数pdf为True,那么使用pdf文本切分规则,否则使用默认规则
    def __init__(self, pdf: bool = False, **kwargs):  
        # 调用父类的初始化函数,接收传入的其他参数
        super().__init__(**kwargs)  
        self.pdf = pdf          # 将pdf参数保存为类的成员变量

    # 定义文本切分方法,输入参数为一个字符串,返回值为字符串列表
    def split_text(self, text: str) -> List[str]:  
        if self.pdf:            # 如果pdf参数为True,那么对文本进行预处理

            # 替换掉连续的3个及以上的换行符为一个换行符
            text = re.sub(r"\n{3,}", r"\n", text)  
            # 将所有的空白字符(包括空格、制表符、换页符等)替换为一个空格
            text = re.sub('\s', " ", text)  
            # 将连续的两个换行符替换为一个空字符
            text = re.sub("\n\n", "", text)  
        
        # 导入pipeline模块,用于创建一个处理流程
        from modelscope.pipelines import pipeline  

        # 创建一个document-segmentation任务的处理流程
        # 用的模型为damo/nlp_bert_document-segmentation_chinese-base,计算设备为cpu
        p = pipeline(
            task="document-segmentation",
            model='damo/nlp_bert_document-segmentation_chinese-base',
            device="cpu")
        result = p(documents=text)    # 对输入的文本进行处理,返回处理结果
        sent_list = [i for i in result["text"].split("\n\t") if i]  # 将处理结果按照换行符和制表符进行切分,得到句子列表
        return sent_list              # 返回句子列表

Among them, there are three points worth noting:

  • The parameter use_document_segmentation specifies whether to use semantic segmentation of documents.
    The document semantic segmentation model used here is open-sourced by DAMO Academy: nlp_bert_document-segmentation_chinese-base (this is its paper)
  • In addition, if you use this model for document semantic segmentation, you need to install modelscope[nlp]: pip install "modelscope[nlp]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
  • And considering that three models are being used together, this may not be friendly to low-end GPUs, so the segmentation model is loaded onto the CPU for computation here; if needed, you can replace the device with your own GPU id (a minimal usage sketch follows)
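
A minimal usage sketch of AliTextSplitter (assuming modelscope[nlp] is installed and the DAMO segmentation model can be downloaded on first use):

# Illustrative only
from textsplitter.ali_text_splitter import AliTextSplitter

splitter = AliTextSplitter(pdf=False)
sent_list = splitter.split_text("深度学习是机器学习的一个分支,它用多层神经网络对数据进行高层抽象。强化学习则研究智能体如何在环境中通过试错学习策略。")
for s in sent_list:
    print(s)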

3.6 knowledge_base: stores files uploaded by users and quantifies them

There are two subdirectories under knowledge_base. One is content, which stores the original files uploaded by users; the other is vector_store, which stores the vector library files, i.e. the local knowledge base itself. Since content simply holds whatever users upload, there is nothing to analyze there. Under vector_store there are two files: index.faiss and index.pkl

3.7 chains: vector search/matching

As mentioned before, the "FAISS" in "FAISS Index, FAISS Search" in the figure at the beginning of this section is a library launched by Facebook AI to effectively search for similarities in large-scale high-dimensional vector spaces. In large-scale data sets Quickly finding the vector most similar to a given vector is an important part of many AI applications, such as recommendation systems, natural language processing, image retrieval, etc.

3.7.1 chains/modules/vectorstores.py file: given a query vector, find the most similar text vectors in the vector database

It is mainly about the use of FAISS (Facebook AI Similarity Search) and the definition of a FAISS vector storage class (FAISSVS, the FAISSVS class inherits from the FAISS class), including the following main methods:

  • max_marginal_relevance_search
    Given a query statement, first convert the query statement into an embedding vector " embedding = self.embedding_function(query) ", and then call  the max_marginal_relevance_search_by_vector  function for MMR search
    #  使用最大边际相关性返回被选中的文本
    def max_marginal_relevance_search(
        self,
        query: str,            # 查询
        k: int = 4,            # 返回的文档数量,默认为 4
        fetch_k: int = 20,     # 用于传递给 MMR 算法的抓取文档数量
        **kwargs: Any,
    ) -> List[Tuple[Document, float]]:    
    
        # 查询向量化
        embedding = self.embedding_function(query)
        # 调用:max_marginal_relevance_search_by_vector
        docs = self.max_marginal_relevance_search_by_vector(embedding, k, fetch_k)
        return docs
    → max_marginal_relevance_search_by_vector
    uses the Maximal Marginal Relevance (MMR) method to return relevant text for a given embedding vector. MMR is an algorithm that balances the relevance and diversity of query results: the returned texts should be as similar as possible to the query, while the set of returned texts should be as diverse as possible
    # 使用最大边际相关性返回被选中的文档,最大边际相关性旨在优化查询的相似性和选定文本之间的多样性
    def max_marginal_relevance_search_by_vector(
        self, embedding: List[float], k: int = 4, fetch_k: int = 20, **kwargs: Any
    ) -> List[Tuple[Document, float]]:
    
        # 使用索引在文本中搜索与嵌入向量相似的内容,返回最相似的fetch_k个文本的得分和索引
        scores, indices = self.index.search(np.array([embedding], dtype=np.float32), fetch_k)  
        
        # 通过索引从文本中重构出嵌入向量,-1表示没有足够的文本返回
        embeddings = [self.index.reconstruct(int(i)) for i in indices[0] if i != -1] 
    
        # 使用最大边际相关性算法选择出k个最相关的文本
        mmr_selected = maximal_marginal_relevance(
            np.array([embedding], dtype=np.float32), embeddings, k=k
        )  
    
        selected_indices = [indices[0][i] for i in mmr_selected]    # 获取被选中的文本的索引
        selected_scores = [scores[0][i] for i in mmr_selected]      # 获取被选中的文本的得分
        docs = []
        for i, score in zip(selected_indices, selected_scores):     # 对于每个被选中的文本索引和得分
            if i == -1:  # 如果索引为-1,表示没有足够的文本返回
                continue
    
            _id = self.index_to_docstore_id[i]       # 通过索引获取文本的id
            doc = self.docstore.search(_id)          # 通过id在文档库中搜索文本
            if not isinstance(doc, Document):        # 如果搜索到的文本不是Document类型,抛出错误
                raise ValueError(f"Could not find document for id {_id}, got {doc}")
            docs.append((doc, score))        # 将文本和得分添加到结果列表中
        return docs                          # 返回结果列表
    
  • __from
    is used to create a FAISSVS instance from a set of text and corresponding embedding vectors. The method first creates a FAISS index and adds embedding vectors, then creates a text store to store the text associated with each embedding vector
    # 从给定的文本、嵌入向量、元数据等信息构建一个FAISS索引对象
    def __from(
        cls,
        texts: List[str],                 # 文本列表,每个文本将被转化为一个文本对象
        embeddings: List[List[float]],    # 对应文本的嵌入向量列表
        embedding: Embeddings,            # 嵌入向量生成器,用于将查询语句转化为嵌入向量
        metadatas: Optional[List[dict]] = None,
        **kwargs: Any,
    ) -> FAISS:
    
        faiss = dependable_faiss_import()      # 导入FAISS库
        index = faiss.IndexFlatIP(len(embeddings[0]))      # 使用FAISS库创建一个新的索引,索引的维度等于嵌入文本向量的长度
        index.add(np.array(embeddings, dtype=np.float32))  # 将嵌入向量添加到FAISS索引中
    
        # quantizer = faiss.IndexFlatL2(len(embeddings[0]))
        # index = faiss.IndexIVFFlat(quantizer, len(embeddings[0]), 100)
        # index.train(np.array(embeddings, dtype=np.float32))
        # index.add(np.array(embeddings, dtype=np.float32))
    
        documents = []
        for i, text in enumerate(texts):      # 对于每一段文本
            # 获取对应的元数据,如果没有提供元数据则使用空字典
            metadata = metadatas[i] if metadatas else {}  
    
            # 创建一个文本对象并添加到文本列表中
            documents.append(Document(page_content=text, metadata=metadata))  
    
        # 为每个文本生成一个唯一的ID
        index_to_id = {i: str(uuid.uuid4()) for i in range(len(documents))}  
    
        # 创建一个文本库,用于存储文本对象和对应的ID
        docstore = InMemoryDocstore(
            {index_to_id[i]: doc for i, doc in enumerate(documents)}  
        )
    
        # 返回FAISS对象
        return cls(embedding.embed_query, index, docstore, index_to_id)  

The above is the main content of this code. By using FAISS and MMR, it can help us find the most relevant text for a given query among a large amount of text.
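
To make the MMR idea concrete, here is a simplified, self-contained sketch of the greedy selection loop (illustrative only; this is not the langchain/FAISS implementation, and lambda_mult is a hypothetical weighting parameter):

import numpy as np

def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    # Greedy MMR: trade off similarity to the query against similarity to already-selected documents
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_i, best_score = None, float("-inf")
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])                                           # similarity to the query
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)  # similarity to picked docs
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        candidates.remove(best_i)
    return selected   # indices of the chosen documents, in selection order

# query_vec and doc_vecs would be embedding vectors produced by the embedding model;
# lowering lambda_mult pushes the selection toward more diverse (less redundant) documents.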

3.7.2 chains/local_doc_qa.py code file: vector search

  1. Importing packages and modules
    The beginning of the code is a series of import statements that import necessary Python packages and modules, including file loaders, text splitters, model configurations, and some Python built-in modules and other third-party libraries.
  2. Rewrite the hash method of the HuggingFaceEmbeddings class.
    The code defines a function named _embeddings_hash and assigns it to the __hash__ method of the HuggingFaceEmbeddings class. The purpose of this is to enable the HuggingFaceEmbeddings object to be hashed, that is, it can be used as a dictionary key or added to a collection.
  3. Loading vector memory
    defines a function named load_vector_store , which is used to load a vector memory from the local and returns an object of FAISS class. The lru_cache decorator is used, which can cache the recently used CACHED_VS_NUM results and improve code efficiency.
  4. File tree traversal:
    The tree function is a recursive function that traverses all files in a specified directory and returns a list containing the full paths and file names of all files. It can ignore specified files or directories
  5. Loading files:
    The load_file function selects the appropriate loader and text splitter based on the file extension, loads and splits the file
  6. Generate the prompt:
    The generate_prompt function builds the prompt from the related documents and the query; the prompt template is provided by the prompt_template parameter (see the sketch after this list)
  7. Create a document list
    search_result2docs
    # 创建一个空列表,用于存储文档
    def search_result2docs(search_results):
        docs = []
    
        # 对于搜索结果中的每一项
        for result in search_results:
            # 创建一个文档对象
            # 如果结果中包含"snippet"关键字,则其值作为页面内容,否则页面内容为空字符串
            # 如果结果中包含"link"关键字,则其值作为元数据中的源链接,否则源链接为空字符串
            # 如果结果中包含"title"关键字,则其值作为元数据中的文件名,否则文件名为空字符串
            doc = Document(page_content=result["snippet"] if "snippet" in result.keys() else "",
                           metadata={"source": result["link"] if "link" in result.keys() else "",
                                     "filename": result["title"] if "title" in result.keys() else ""})
    
            # 将创建的文档对象添加到列表中
            docs.append(doc)
        
        # 返回文档列表
        return docs
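
For intuition, generate_prompt essentially substitutes the retrieved documents and the user's question into PROMPT_TEMPLATE from model_config.py. A simplified sketch (an approximation, not necessarily the project's exact implementation):

# Simplified sketch: assumes the "{context}" and "{question}" placeholders shown in model_config.py
from configs.model_config import PROMPT_TEMPLATE

def generate_prompt_sketch(related_docs, query, prompt_template=PROMPT_TEMPLATE):
    context = "\n".join(doc.page_content for doc in related_docs)   # concatenate the retrieved chunks
    return prompt_template.replace("{context}", context).replace("{question}", query)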

After that, a class named LocalDocQA is defined, mainly used for document-based question and answer tasks. The main function of the document-based question and answer task is to return an answer based on a given set of documents (here called the knowledge base) and questions entered by the user. The main methods of the LocalDocQA class include:

  • init_cfg() : This method initializes some variables, including assigning llm_model (a language model used to generate answers) to self.llm, assigning a HuggingFace-based embedding model to self.embeddings, and assigning the input parameter top_k to self.top_k
  • init_knowledge_vector_store() : This method is responsible for initializing the knowledge vector library. It first checks the input file path, and for each file in the path, loads the file content into a Document object, then converts these documents into embedded vectors and stores them in the vector library
  • one_knowledge_add() : This method is used to add a new knowledge document to the knowledge base. It creates the input title and content as a Document object, then converts it into an embedded vector and adds it to the vector library
  • get_knowledge_based_answer() : This method generates an answer based on the given knowledge base and the user-entered question. It first finds the documents in the knowledge base most relevant to the user's question, then generates a prompt containing the relevant documents and the user's question, and passes the prompt to llm_model to generate the answer. Note that this method calls similarity_search_with_score, which was implemented above
  • get_knowledge_based_conent_test() : This method is for testing; it returns the documents most relevant to the input query together with the query prompt
        # query: query content
        # vs_path: knowledge base path
        # chunk_conent: whether to enable contextual association
        # score_threshold: search matching score threshold
        # vector_search_top_k: number of knowledge base entries to return, default 5
        # chunk_size: context length of a single matched chunk
        def get_knowledge_based_conent_test(self, query, vs_path, chunk_conent,
                                            score_threshold=VECTOR_SEARCH_SCORE_THRESHOLD,
                                            vector_search_top_k=VECTOR_SEARCH_TOP_K, chunk_size=CHUNK_SIZE):
  • get_search_result_based_answer() : This method is similar to get_knowledge_based_answer(), but here the results of bing_search are used as the knowledge base
    def get_search_result_based_answer(self, query, chat_history=[], streaming: bool = STREAMING):
        # 对查询进行 Bing 搜索,并获取搜索结果
        results = bing_search(query)
    
        # 将搜索结果转化为文本的形式
        result_docs = search_result2docs(results)
    
        # 生成用于提问的提示语
        prompt = generate_prompt(result_docs, query)
    
        # 通过 LLM(长语言模型)生成回答
        for answer_result in self.llm.generatorAnswer(prompt=prompt, history=chat_history,
                                                      streaming=streaming):
            # 获取回答的文本
            resp = answer_result.llm_output["answer"]
    
            # 获取聊天历史
            history = answer_result.history
    
            # 将聊天历史中的最后一项的提问替换为当前的查询
            history[-1][0] = query
    
            # 组装回答的结果
            response = {"query": query,
                        "result": resp,
                        "source_documents": result_docs}
    
            # 返回回答的结果和聊天历史
            yield response, history
    As you can see, the main difference from the previous method is that this one generates the answer directly from the search engine's results, whereas the previous one uses similarity search over the knowledge base to find the most relevant texts and then generates the answer based on those texts.
    The bing_search used here was defined in Section 3.1.2
  • Next are three methods, delete_file_from_vector_store, update_file_from_vector_store and list_file_from_vector_store, for deleting files from the vector store, updating files, and listing files respectively.


        # 删除向量存储中的文件
        def delete_file_from_vector_store(self,
                                          filepath: str or List[str],  # 文件路径,可以是单个文件或多个文件列表
                                          vs_path):      # 向量存储路径
            vector_store = load_vector_store(vs_path, self.embeddings)  # 从给定路径加载向量存储
            status = vector_store.delete_doc(filepath)   # 删除指定文件
            return status  # 返回删除状态
    
        # 更新向量存储中的文件
        def update_file_from_vector_store(self,
                                          filepath: str or List[str],  # 需要更新的文件路径,可以是单个文件或多个文件列表
                                          vs_path,  # 向量存储路径
                                          docs: List[Document],):      # 需要更新的文件内容,文件以文档形式给出
            vector_store = load_vector_store(vs_path, self.embeddings)  # 从给定路径加载向量存储
            status = vector_store.update_doc(filepath, docs)  # 更新指定文件
            return status  # 返回更新状态
    
        # 列出向量存储中的文件
        def list_file_from_vector_store(self,
                                        vs_path,  # 向量存储路径
                                        fullpath=False):  # 是否返回完整路径,如果为 False,则只返回文件名
            vector_store = load_vector_store(vs_path, self.embeddings)  # 从给定路径加载向量存储
            docs = vector_store.list_docs()      # 列出所有文件
            if fullpath:  # 如果需要完整路径
                return docs  # 返回完整路径列表
            else:  # 如果只需要文件名
                return [os.path.split(doc)[-1] for doc in docs]  # 用 os.path.split 将路径和文件名分离,只返回文件名列表

The __main__ part of the code is an example of instantiating and using the LocalDocQA class

  1. It first initializes an llm_model_ins object
  2. Then create an instance of LocalDocQA and call its init_cfg() method to initialize
  3. After that, it specifies a query and the path to the knowledge base
  4. Then call the get_knowledge_based_answer() or get_search_result_based_answer() method to get the answer based on the query, and print out the answer and source document information
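
A rough sketch of that flow (inside chains/local_doc_qa.py; method and parameter names follow the descriptions above and are assumptions that may not match the project's __main__ block exactly):

# Illustrative only: llm_model_ins is assumed to be an LLM instance loaded via models/shared.py
local_doc_qa = LocalDocQA()
local_doc_qa.init_cfg(llm_model=llm_model_ins, embedding_model="text2vec", top_k=5)

# Build (or extend) the local knowledge base from one or more files
vs_path, _ = local_doc_qa.init_knowledge_vector_store(["knowledge_base/samples/content/test.pdf"])

# Ask a question against that knowledge base
query = "本项目支持哪些文件格式?"
for response, history in local_doc_qa.get_knowledge_based_answer(query=query, vs_path=vs_path, chat_history=[]):
    print(response["result"])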

3.7.3 chains/text_load.py

There is one last project file under the chains folder: chains/text_load.py (langchain-ChatGLM/text_load.py at master · imClumsyPanda/langchain-ChatGLM · GitHub), as shown below

import os
import pinecone 
from tqdm import tqdm
from langchain.llms import OpenAI
from langchain.text_splitter import SpacyTextSplitter
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

#一些配置文件
openai_key="你的key" # 注册 openai.com 后获得
pinecone_key="你的key" # 注册 app.pinecone.io 后获得
pinecone_index="你的库" #app.pinecone.io 获得
pinecone_environment="你的Environment"  # 登录pinecone后,在indexes页面 查看Environment
pinecone_namespace="你的Namespace" #如果不存在自动创建

#科学上网你懂得
os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'

#初始化pinecone
pinecone.init(
    api_key=pinecone_key,
    environment=pinecone_environment
)
index = pinecone.Index(pinecone_index)

#初始化OpenAI的embeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_key)

#初始化text_splitter
text_splitter = SpacyTextSplitter(pipeline='zh_core_web_sm',chunk_size=1000,chunk_overlap=200)

# 读取目录下所有后缀是txt的文件
loader = DirectoryLoader('../docs', glob="**/*.txt", loader_cls=TextLoader)

#读取文本文件
documents = loader.load()

# 使用text_splitter对文档进行分割
split_text = text_splitter.split_documents(documents)
try:
	for document in tqdm(split_text):
		# 获取向量并储存到pinecone
		Pinecone.from_documents([document], embeddings, index_name=pinecone_index)
except Exception as e:
    print(f"Error: {e}")
    quit()

3.8 vectorstores: MyFAISS.py

This folder contains two files: __init__.py (just one line of code: from .MyFAISS import MyFAISS) and MyFAISS.py, shown in the following code

# 从langchain.vectorstores库导入FAISS
from langchain.vectorstores import FAISS
# 从langchain.vectorstores.base库导入VectorStore            
from langchain.vectorstores.base import VectorStore
# 从langchain.vectorstores.faiss库导入dependable_faiss_import
from langchain.vectorstores.faiss import dependable_faiss_import  

from typing import Any, Callable, List, Dict  # 导入类型检查库
from langchain.docstore.base import Docstore  # 从langchain.docstore.base库导入Docstore

# 从langchain.docstore.document库导入Document
from langchain.docstore.document import Document  

import numpy as np      # 导入numpy库,用于科学计算
import copy             # 导入copy库,用于数据复制
import os               # 导入os库,用于操作系统相关的操作
from configs.model_config import *  # 从configs.model_config库导入所有内容


# 定义MyFAISS类,继承自FAISS和VectorStore两个父类
class MyFAISS(FAISS, VectorStore):

Next, the following functions are implemented one by one

3.8.1 Define the initialization function of the class: __init__

    # 定义类的初始化函数
    def __init__(
            self,
            embedding_function: Callable,
            index: Any,
            docstore: Docstore,
            index_to_docstore_id: Dict[int, str],
            normalize_L2: bool = False,
    ):
        # 调用父类FAISS的初始化函数
        super().__init__(embedding_function=embedding_function,
                         index=index,
                         docstore=docstore,
                         index_to_docstore_id=index_to_docstore_id,
                         normalize_L2=normalize_L2)
        # 初始化分数阈值
        self.score_threshold=VECTOR_SEARCH_SCORE_THRESHOLD
        # 初始化块大小
        self.chunk_size = CHUNK_SIZE
        # 初始化块内容
        self.chunk_conent = False

3.8.2 seperate_list: Decompose a list into multiple sublists

    # 定义函数seperate_list,将一个列表分解成多个子列表,每个子列表中的元素在原列表中是连续的
    def seperate_list(self, ls: List[int]) -> List[List[int]]:
        # TODO: 增加是否属于同一文档的判断
        lists = []
        ls1 = [ls[0]]
        for i in range(1, len(ls)):
            if ls[i - 1] + 1 == ls[i]:
                ls1.append(ls[i])
            else:
                lists.append(ls1)
                ls1 = [ls[i]]
        lists.append(ls1)
        return lists
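
For example, seperate_list([1, 2, 3, 7, 8, 10]) returns [[1, 2, 3], [7, 8], [10]]: indices that are consecutive in the original list end up in the same sublist.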

3.8.3 similarity_search_with_score_by_vector: find the k closest texts for a given input vector

The similarity_search_with_score_by_vector function is used to perform a similarity search through vectors, returning the text most similar to a given embedding vector and the corresponding score

    # 定义函数similarity_search_with_score_by_vector,根据输入的向量,查找最接近的k个文本
    def similarity_search_with_score_by_vector(
            self, embedding: List[float], k: int = 4
    ) -> List[Document]:
        # 调用dependable_faiss_import函数,导入faiss库
        faiss = dependable_faiss_import()

        # 将输入的列表转换为numpy数组,并设置数据类型为float32
        vector = np.array([embedding], dtype=np.float32)

        # 如果需要进行L2归一化,则调用faiss.normalize_L2函数进行归一化
        if self._normalize_L2:
            faiss.normalize_L2(vector)

        # 调用faiss库的search函数,查找与输入向量最接近的k个向量,并返回他们的分数和索引
        scores, indices = self.index.search(vector, k)

        # 初始化一个空列表,用于存储找到的文本
        docs = []
        # 初始化一个空集合,用于存储文本的id
        id_set = set()

        # 获取文本库中文本的数量
        store_len = len(self.index_to_docstore_id)

        # 初始化一个布尔变量,表示是否需要重新排列id列表
        rearrange_id_list = False

        # 遍历找到的索引和分数
        for j, i in enumerate(indices[0]):
            # 如果索引为-1,或者分数小于阈值,则跳过这个索引
            if i == -1 or 0 < self.score_threshold < scores[0][j]:
                # This happens when not enough docs are returned.
                continue

            # 如果索引存在于index_to_docstore_id字典中,则获取对应的文本id
            if i in self.index_to_docstore_id:
                _id = self.index_to_docstore_id[i]

            # 如果索引不存在于index_to_docstore_id字典中,则跳过这个索引
            else:
                continue
            # 从文本库中搜索对应id的文本
            doc = self.docstore.search(_id)

            # 如果不需要拆分块内容,或者文档的元数据中没有context_expand字段,或者context_expand字段的值为false,则执行以下代码
            if (not self.chunk_conent) or ("context_expand" in doc.metadata and not doc.metadata["context_expand"]):
                # 匹配出的文本如果不需要扩展上下文则执行如下代码
                # 如果搜索到的文本不是Document类型,则抛出异常
                if not isinstance(doc, Document):
                    raise ValueError(f"Could not find document for id {_id}, got {doc}")
                # 在文本的元数据中添加score字段,其值为找到的分数
                doc.metadata["score"] = int(scores[0][j])

                # 将文本添加到docs列表中
                docs.append(doc)
                continue

            # 将文本id添加到id_set集合中
            id_set.add(i)

            # 获取文本的长度
            docs_len = len(doc.page_content)

            # 遍历范围在1到i和store_len - i之间的数字k
            for k in range(1, max(i, store_len - i)):
                # 初始化一个布尔变量,表示是否需要跳出循环
                break_flag = False

                # 如果文本的元数据中有context_expand_method字段,并且其值为"forward",则扩展范围设置为[i + k]
                if "context_expand_method" in doc.metadata and doc.metadata["context_expand_method"] == "forward":
                    expand_range = [i + k]

                # 如果文本的元数据中有context_expand_method字段,并且其值为"backward",则扩展范围设置为[i - k]
                elif "context_expand_method" in doc.metadata and doc.metadata["context_expand_method"] == "backward":
                    expand_range = [i - k]

                # 如果文本的元数据中没有context_expand_method字段,或者context_expand_method字段的值不是"forward"也不是"backward",则扩展范围设置为[i + k, i - k]
                else:
                    expand_range = [i + k, i - k]

                # 遍历扩展范围
                for l in expand_range:
                    # 如果l不在id_set集合中,并且l在0到len(self.index_to_docstore_id)之间,则执行以下代码
                    if l not in id_set and 0 <= l < len(self.index_to_docstore_id):
                        # 获取l对应的文本id
                        _id0 = self.index_to_docstore_id[l]

                        # 从文本库中搜索对应id的文本
                        doc0 = self.docstore.search(_id0)

                        # 如果文本长度加上新文档的长度大于块大小,或者新文本的源不等于当前文本的源,则设置break_flag为true,跳出循环
                        if docs_len + len(doc0.page_content) > self.chunk_size or doc0.metadata["source"] != \
                                doc.metadata["source"]:
                            break_flag = True
                            break

                        # 如果新文本的源等于当前文本的源,则将新文本的长度添加到文本长度上,将l添加到id_set集合中,设置rearrange_id_list为true
                        elif doc0.metadata["source"] == doc.metadata["source"]:
                            docs_len += len(doc0.page_content)
                            id_set.add(l)
                            rearrange_id_list = True

                # 如果break_flag为true,则跳出循环
                if break_flag:
                    break

        # 如果不需要拆分块内容,或者不需要重新排列id列表,则返回docs列表
        if (not self.chunk_conent) or (not rearrange_id_list):
            return docs

        # 如果id_set集合的长度为0,并且分数阈值大于0,则返回空列表
        if len(id_set) == 0 and self.score_threshold > 0:
            return []

        # 对id_set集合中的元素进行排序,并转换为列表
        id_list = sorted(list(id_set))

        # 调用seperate_list函数,将id_list分解成多个子列表
        id_lists = self.seperate_list(id_list)

        # 遍历id_lists中的每一个id序列
        for id_seq in id_lists:
            # 遍历id序列中的每一个id
            for id in id_seq:
                # 如果id等于id序列的第一个元素,则从文档库中搜索对应id的文本,并深度拷贝这个文本
                if id == id_seq[0]:
                    _id = self.index_to_docstore_id[id]

                    # doc = self.docstore.search(_id)
                    doc = copy.deepcopy(self.docstore.search(_id))

                # 如果id不等于id序列的第一个元素,则从文本库中搜索对应id的文档,将新文本的内容添加到当前文本的内容后面
                else:
                    _id0 = self.index_to_docstore_id[id]
                    doc0 = self.docstore.search(_id0)
                    doc.page_content += " " + doc0.page_content

            # 如果搜索到的文本不是Document类型,则抛出异常
            if not isinstance(doc, Document):
                raise ValueError(f"Could not find document for id {_id}, got {doc}")
            # 计算文本的分数,分数等于id序列中的每一个id在分数列表中对应的分数的最小值
            doc_score = min([scores[0][id] for id in [indices[0].tolist().index(i) for i in id_seq if i in indices[0]]])

            # 在文本的元数据中添加score字段,其值为文档的分数
            doc.metadata["score"] = int(doc_score)

            # 将文本添加到docs列表中
            docs.append(doc)
        # 返回docs列表
        return docs
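
In short, this method first runs a plain FAISS top-k search; when chunk_conent is enabled, each hit is expanded forward and/or backward within the same source document until chunk_size would be exceeded, consecutive indices are then merged via seperate_list, and each merged document is returned with the minimum score of its group as its score.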

3.8.4 delete_doc method: delete the text from the specified source in the text library

    #定义了一个名为 delete_doc 的方法,这个方法用于删除文本库中指定来源的文本
    def delete_doc(self, source: str or List[str]):
        # 使用 try-except 结构捕获可能出现的异常
        try:
            # 如果 source 是字符串类型
            if isinstance(source, str):
                # 找出文本库中所有来源等于 source 的文本的id
                ids = [k for k, v in self.docstore._dict.items() if v.metadata["source"] == source]

                # 获取向量存储的路径
                vs_path = os.path.join(os.path.split(os.path.split(source)[0])[0], "vector_store")

            # 如果 source 是列表类型
            else:
                # 找出文本库中所有来源在 source 列表中的文本的id
                ids = [k for k, v in self.docstore._dict.items() if v.metadata["source"] in source]

                # 获取向量存储的路径
                vs_path = os.path.join(os.path.split(os.path.split(source[0])[0])[0], "vector_store")

            # 如果没有找到要删除的文本,返回失败信息
            if len(ids) == 0:
                return f"docs delete fail"

            # 如果找到了要删除的文本
            else:
                # 遍历所有要删除的文本id
                for id in ids:
                    # 获取该id在索引中的位置
                    index = list(self.index_to_docstore_id.keys())[list(self.index_to_docstore_id.values()).index(id)]

                    # 从索引中删除该id
                    self.index_to_docstore_id.pop(index)

                    # 从文本库中删除该id对应的文本
                    self.docstore._dict.pop(id)

                # TODO: 从 self.index 中删除对应id,这是一个未完成的任务
                # self.index.reset()
                # 保存当前状态到本地
                self.save_local(vs_path)

                # 返回删除成功的信息
                return f"docs delete success"

        # 捕获异常
        except Exception as e:
            # 打印异常信息
            print(e)
            # 返回删除失败的信息
            return f"docs delete fail"

3.8.5 update_doc and list_docs

   # 定义了一个名为 update_doc 的方法,这个方法用于更新文档库中的文档
    def update_doc(self, source, new_docs):
        # 使用 try-except 结构捕获可能出现的异常
        try:
            # 删除旧的文档
            delete_len = self.delete_doc(source)

            # 添加新的文档
            ls = self.add_documents(new_docs)

            # 返回更新成功的信息
            return f"docs update success"
        # 捕获异常
        except Exception as e:
            # 打印异常信息
            print(e)

            # 返回更新失败的信息
            return f"docs update fail"

    # 定义了一个名为 list_docs 的方法,这个方法用于列出文档库中所有文档的来源
    def list_docs(self):
        # 遍历文档库中的所有文档,取出每个文档的来源,转换为集合,再转换为列表,最后返回这个列表
        return list(set(v.metadata["source"] for v in self.docstore._dict.values()))

For more, see the July Online class: practical combination of LLM with langchain/knowledge graphs/databases [problem solving, practicality is king]


Part 4: Source code analysis of the upgraded Langchain-Chatchat project (September 2023)

In September 2023, the original LangChain + ChatGLM-6B project was upgraded and became the current Langchain-Chatchat project.

Its main update is the addition of a server folder, which includes

  • chat, including
    __init__.py
    chat.py
    knowledge_base_chat.py
    openai_chat.py
    search_engine_chat.py
    utils.py
  • db, including
    models
        __init__.py
        base.py
        knowledge_base_model.py (that is, the implementation of KnowledgeBaseModel, explained in Section 4.2.1 below)
        knowledge_file_model.py (that is, the implementation of KnowledgeFile, explained in Section 4.2.2 below)

    repository
        __init__.py
        knowledge_base_repository.py (involves the implementation of add_kb_to_db, explained in Section 4.3.1 below)
        knowledge_file_repository.py (involves the implementation of add_file_to_db/add_docs_to_db, explained in Section 4.3.2 below)
  • knowledge_base, including
    kb_cache
        base.py (involving the implementation of KBServiceFactory, explained in Section 4.1.2 below)
        faiss_cache.py


    kb_service
        __init__.py
        base.py
        default_kb_service.py
        faiss_kb_service.py
        milvus_kb_service.py
        pg_kb_service.py


    __init__.py
    kb_api.py
    kb_doc_api.py (this implements the core logic of multi-document Q&A, as explained in Section 4.1.1 below)
    migrate.py
    utils.py
  • model_workers
  • static

and other folders

4.1 server/knowledge_base: Enterprise knowledge base Q&A based on batch documents

The latest version of the project implements batch document-based question answering, for example:

# 开始遍历自定义的文档集合(docs)
for file_name, v in docs.items():
    try:
        # 对于v中的每个条目,检查它是否已经是Document类型
        # 如果不是,那么将其转换为Document对象
        v = [x if isinstance(x, Document) else Document(**x) for x in v]
        
        # 根据文件名和知识库名称创建KnowledgeFile对象
        kb_file = KnowledgeFile(filename=file_name, knowledge_base_name=knowledge_base_name)
        
        # 在知识库中更新该文件的文档
        kb.update_doc(kb_file, docs=v, not_refresh_vs_cache=True)
        
        # ...

4.1.1 knowledge_base/kb_doc_api.py

The following is a line-by-line analysis of the project file Langchain-Chatchat/server/knowledge_base/kb_doc_api.py

  1. Import module :

    • Some standard libraries are imported, such as os and urllib, for operating system operations and URL parsing respectively.
    • The relevant fastapi modules are imported; fastapi is a modern, high-performance web framework for building APIs
    • Import custom modules and configurations, such as model configurations and tool functions of the knowledge base
  2. DocumentWithScore class :

    • This is a subclass of the Document class that adds a score field. It is most likely the data structure used to return each document's relevance score when searching
  3. search_docs function :

    • This function is used to search documents in the knowledge base.
    • It accepts query string, knowledge base name, maximum number of returned documents, and scoring threshold as parameters.
    • Use KBServiceFactory to obtain the corresponding knowledge base service, and then search for documents in the knowledge base.
    • Returns a list of searched documents that are relevant to the query, each with a relevance score
      # 定义一个用于搜索文档的函数
      def search_docs(query: str = Body(..., description="用户输入", examples=["你好"]),
                      knowledge_base_name: str = Body(..., description="知识库名称", examples=["samples"]),
                      top_k: int = Body(VECTOR_SEARCH_TOP_K, description="匹配向量数"),
                      score_threshold: float = Body(SCORE_THRESHOLD, description="知识库匹配相关度阈值,取值范围在0-1之间,SCORE越小,相关度越高,取到1相当于不筛选,建议设置在0.5左右", ge=0, le=1),
                      ) -> List[DocumentWithScore]:
      
          # 根据知识库名称获取相应的知识库服务实例
          kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
          
          # 如果没有找到对应的知识库服务实例,返回空列表
          if kb is None:
              return []
          
          # 调用知识库服务的search_docs方法来搜索与查询字符串匹配的文档
          docs = kb.search_docs(query, top_k, score_threshold)
          
          # 将搜索到的文档转换为DocumentWithScore对象,包括文档内容和匹配分数
          data = [DocumentWithScore(**x[0].dict(), score=x[1]) for x in docs]
      
          # 返回带分数的匹配文档列表
          return data

      In summary, search_docs searches the given knowledge base for documents matching the query string and returns the top_k most relevant documents together with their matching scores.

  4. list_files function:

    • Used to list all files in the specified knowledge base.
    • It first validates the knowledge base name and then uses KBServiceFactory to obtain the corresponding knowledge base service.
    • If the knowledge base exists, it returns the names of all documents in it.
  5. _save_files_in_thread function:

    • This is a private function used to save uploaded files to the specified knowledge base in background threads.
    • It defines an inner save_file function for saving a single file.
    • If a file already exists and the user has not chosen to overwrite it, it checks the file size and returns a message stating that the file already exists.
    • Otherwise, it writes the file contents to disk.
    • At the end, it uses the run_in_thread_pool helper to perform the file-save operations in a multi-threaded fashion and returns the save result for each file (a generic sketch of the idea follows below).
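
      As a rough, generic illustration of this "save files in a thread pool" idea (this sketch does not reproduce the project's run_in_thread_pool helper or the real function's parameters):

      # Generic sketch: save several (filename, bytes) pairs concurrently and
      # yield a per-file result dict, roughly what a threaded file-save step does
      import os
      from typing import Dict
      from concurrent.futures import ThreadPoolExecutor, as_completed

      def save_one(doc_path: str, filename: str, content: bytes, override: bool = False) -> dict:
          target = os.path.join(doc_path, filename)
          # if the file exists, is not to be overridden and has the same size, report "already exists"
          if os.path.exists(target) and not override and os.path.getsize(target) == len(content):
              return {"file_name": filename, "msg": "file already exists", "ok": False}
          with open(target, "wb") as f:
              f.write(content)
          return {"file_name": filename, "msg": "saved", "ok": True}

      def save_files_in_threads(doc_path: str, files: Dict[str, bytes], override: bool = False):
          # submit one save task per file and yield results as they complete
          with ThreadPoolExecutor() as pool:
              futures = [pool.submit(save_one, doc_path, name, data, override)
                         for name, data in files.items()]
              for fut in as_completed(futures):
                  yield fut.result()
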
  6. upload_docs function:

    • Purpose: upload documents and optionally vectorize them.
    • Input:
      • A series of files, the knowledge base name, file-processing parameters, etc.
    • Main functionality:
      • Validate the knowledge base name.
      • Save the uploaded files to disk.
      • Vectorize the saved files.
    • Output: a basic response indicating whether the operation was successful, plus a list of failed files.

      Note that another upload_docs function is defined elsewhere in the project: that one belongs to the KnowledgeBase class and calls self.kb_service.upload_docs(docs) internally, so to a certain extent it is a proxy method that hands the upload task over to kb_service, i.e. a simplified upload interface in the context of the KnowledgeBase class.

      The second upload_docs function below is probably the one that actually handles document uploading and vectorization; in real application scenarios, the KnowledgeBase class's method may call that second upload_docs to perform the actual upload operation.
  7. delete_docs function:

    • Purpose: delete the specified files from the knowledge base.
    • Input:
      • The knowledge base name, the list of file names to delete, and other processing parameters.
    • Main functionality:
      • Validate the knowledge base name.
      • Delete the specified files from the knowledge base.
    • Output: a basic response indicating whether the operation was successful, plus a list of failed files.
  8. update_docs function:

    • Purpose: update documents in the knowledge base.
    • Input:
      • The knowledge base name, the list of file names to update, and other processing parameters.
    • Main functionality:
      • Generate documents from the files and vectorize them.
      • Update the specified files in the knowledge base, i.e. add them to the knowledge base's file list.
    • Output: a basic response indicating whether the operation was successful, plus a list of failed files.
      def update_docs(
          knowledge_base_name: str = Body(..., description="知识库名称", examples=["samples"]),
          file_names: List[str] = Body(..., description="文件名称,支持多文件", examples=["file_name"]),
          chunk_size: int = Body(CHUNK_SIZE, description="知识库中单段文本最大长度"),
          chunk_overlap: int = Body(OVERLAP_SIZE, description="知识库中相邻文本重合长度"),
          zh_title_enhance: bool = Body(ZH_TITLE_ENHANCE, description="是否开启中文标题加强"),
          override_custom_docs: bool = Body(False, description="是否覆盖之前自定义的docs"),
          docs: Json = Body({}, description="自定义的docs", examples=[{"test.txt": [Document(page_content="custom doc")]}]),
          not_refresh_vs_cache: bool = Body(False, description="暂不保存向量库(用于FAISS)")
      ) -> BaseResponse:
          '''
          更新知识库文档
          '''
      
          # 验证知识库名称
          if not validate_kb_name(knowledge_base_name):
              return BaseResponse(code=403, msg="Don't attack me")
      
          # 获取知识库服务
          kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
          if kb is None:
              return BaseResponse(code=404, msg=f"未找到知识库 {knowledge_base_name}")
      
          failed_files = {}
          kb_files = []
      
          # 生成需要加载docs的文件列表
          for file_name in file_names:
              # 获取文件详情
              file_detail = get_file_detail(kb_name=knowledge_base_name, filename=file_name)
              # 如果该文件之前使用了自定义docs,则根据参数决定略过或覆盖
              if file_detail.get("custom_docs") and not override_custom_docs:
                  continue
              if file_name not in docs:
                  try:
                      # 将文件名和知识库名组合为一个KnowledgeFile对象,并添加到kb_files列表中
                      kb_files.append(KnowledgeFile(filename=file_name, knowledge_base_name=knowledge_base_name))
                  except Exception as e:
                      # 记录失败的文件和错误信息
                      msg = f"加载文档 {file_name} 时出错:{e}"
                      logger.error(f'{e.__class__.__name__}: {msg}',
                                   exc_info=e if log_verbose else None)
                      failed_files[file_name] = msg
      
          # 从文件生成docs,并进行向量化
          for status, result in files2docs_in_thread(kb_files,
                                                     chunk_size=chunk_size,
                                                     chunk_overlap=chunk_overlap,
                                                     zh_title_enhance=zh_title_enhance):
              if status:
                  # 成功处理文件后,更新知识库中的文档
                  kb_name, file_name, new_docs = result
                  kb_file = KnowledgeFile(filename=file_name, knowledge_base_name=knowledge_base_name)
                  kb_file.splited_docs = new_docs
                  kb.update_doc(kb_file, not_refresh_vs_cache=True)
              else:
                  # 记录失败的文件和错误信息
                  kb_name, file_name, error = result
                  failed_files[file_name] = error
      
          # 将自定义的docs进行向量化
          for file_name, v in docs.items():
              try:
                  # 对于v中的每个条目,检查它是否已经是Document类型
                  # 如果不是,那么将其转换为Document对象
                  v = [x if isinstance(x, Document) else Document(**x) for x in v]
      
                  # 根据文件名和知识库名称创建KnowledgeFile对象
                  kb_file = KnowledgeFile(filename=file_name, knowledge_base_name=knowledge_base_name)
      
                  # 在知识库中更新该文件的文档
                  kb.update_doc(kb_file, docs=v, not_refresh_vs_cache=True)
      
              except Exception as e:
                  # 当遇到异常时,构建一个错误消息并使用logger进行记录
                  msg = f"为 {file_name} 添加自定义docs时出错:{e}"
                  logger.error(f'{e.__class__.__name__}: {msg}',
                               exc_info=e if log_verbose else None)
                  failed_files[file_name] = msg
      
          # 如果需要刷新向量库,则进行保存
          if not not_refresh_vs_cache:
              kb.save_vector_store()
      
          # 返回响应,包括失败文件列表
          return BaseResponse(code=200, msg=f"更新文档完成", data={"failed_files": failed_files})
      The KBServiceFactory and KnowledgeFile will be introduced in the following sections respectively: "4.1.2 Implementation of KBServiceFactory: knowledge_base/kb_service/base.py" and "4.2.2 Implementation of KnowledgeFile"
  9. download_doc function:

    • Purpose: download the specified document from the knowledge base.
    • Input:
      • The knowledge base name, the name of the file to download, and other parameters.
    • Main functionality:
      • Provides file preview and download functionality (a schematic sketch follows below).
    • Output: a file response that allows the user to download or preview the file.
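
      A schematic of the preview/download distinction using FastAPI's FileResponse (paths and parameters here are illustrative, not the project's actual implementation):

      # Illustration: return a file either inline (preview) or as an attachment (download)
      from fastapi.responses import FileResponse

      def serve_doc(file_path: str, file_name: str, preview: bool = False) -> FileResponse:
          if preview:
              # "inline" asks the browser to render the file instead of saving it
              return FileResponse(path=file_path,
                                  headers={"Content-Disposition": f'inline; filename="{file_name}"'})
          # passing filename= makes FastAPI emit an "attachment" Content-Disposition header
          return FileResponse(path=file_path, filename=file_name)
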
  10. recreate_vector_store function:

    • Purpose: rebuild the vector store from the document content on disk.
    • Input:
      • The knowledge base name, the vector store type, and other processing parameters.
    • Main functionality:
      • Clear the existing vector store.
      • List the files in the knowledge base folder.
      • Generate documents for each file and add them to the knowledge base.
    • Output: a streaming response that reports the progress of the operation in real time (a generic sketch follows below).
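
      The "streaming response for real-time progress" can be pictured with this generic FastAPI sketch (the payload fields and helper names are assumptions, not the project's code):

      # Generic sketch: stream one JSON line per processed file so a client
      # can display rebuild progress while vectorization is still running
      import json
      from typing import List
      from fastapi.responses import StreamingResponse

      def stream_recreate_progress(file_names: List[str]) -> StreamingResponse:
          def event_stream():
              total = len(file_names)
              for i, name in enumerate(file_names, start=1):
                  # ... load the file, split it into docs, embed and add them to the vector store ...
                  yield json.dumps({"finished": i, "total": total, "doc": name}, ensure_ascii=False) + "\n"
          return StreamingResponse(event_stream(), media_type="text/event-stream")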

Overall, this code mainly provides CRUD operations (create, read, update, delete) and related vectorization processing for knowledge base documents.

4.1.2 Implementation of KBServiceFactory: knowledge_base/kb_service/base.py

  1. Imported modules:

    • Basic Python libraries such as os and operator.
    • Libraries for data manipulation and vectorization such as numpy and sklearn.
    • Modules within the project such as langchain.embeddings.base, langchain.docstore.document, etc.
  2. SupportedVSType class

    • This is a simple class that defines the supported vector storage types. For example: FAISS, MILVUS, etc.
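
      Judging from that description (and from the kb_service files listed earlier), it is essentially a namespace of string constants, roughly like this sketch (member names/values are assumptions):

      # Sketch of a "supported vector store types" constants class
      class SupportedVSType:
          FAISS = "faiss"
          MILVUS = "milvus"
          PG = "pg"            # a pgvector-backed store, matching pg_kb_service.py above
          DEFAULT = "default"  # fallback, matching default_kb_service.py above
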
  3. KBService class :

    • This is an abstract base class that defines the basic functionality and behavior of a knowledge base service.
    • Initialization: given a knowledge base name and an embedding model name, it initializes the service.
    • It provides a series of methods, such as
      create_kb (create knowledge base)
          # 创建知识库方法
          def create_kb(self):
              # 检查doc_path路径是否存在
              if not os.path.exists(self.doc_path):
                  # 如果不存在,创建该目录
                  os.makedirs(self.doc_path)
              # 调用子类中定义的do_create_kb方法来执行具体的知识库创建过程
              self.do_create_kb()
              # 将新的知识库添加到数据库中,并返回操作状态
              status = add_kb_to_db(self.kb_name, self.vs_type(), self.embed_model)
              # 返回操作状态
              return status
      The add_kb_to_db will be analyzed in "4.3.1 knowledge_base_repository.py: Implementing add_kb_to_db" below.

      add_doc (adding files to the knowledge base)
          # 向知识库添加文件方法
          def add_doc(self, kb_file: KnowledgeFile, docs: List[Document] = [], **kwargs):
              # 判断docs列表是否有内容
              if docs:
                  # 设置一个标志,表示这是自定义文档列表
                  custom_docs = True
      
                  # 遍历传入的文档列表
                  for doc in docs:
                      # 为每个文档的metadata设置默认的"source"属性,值为文件的路径
                      doc.metadata.setdefault("source", kb_file.filepath)
              else:
                  # 如果没有提供docs,从kb_file中读取文档内容
                  docs = kb_file.file2text()
                  # 设置一个标志,表示这不是自定义文档列表
                  custom_docs = False
      
              # 如果docs列表有内容
              if docs:
                  # 删除与kb_file相关联的现有文档
                  self.delete_doc(kb_file)
      
                  # 调用子类中定义的do_add_doc方法来执行具体的文档添加过程,并返回文档信息
                  doc_infos = self.do_add_doc(docs, **kwargs)
      
                  # 将新的文档信息添加到数据库中,并返回操作状态
                  status = add_file_to_db(kb_file,
                                          custom_docs=custom_docs, 
                                          docs_count=len(docs),
                                          doc_infos=doc_infos)
              else:
                  # 如果docs列表为空,则设置操作状态为False
                  status = False
              # 返回操作状态
              return status
      Among them, add_file_to_db will be analyzed in "4.3.2 knowledge_file_repository.py: Implementing add_file_to_db/add_docs_to_db" below.
    • There are also some abstract methods, such as do_create_kb, do_search, etc. Subclasses must implement these methods
  4. KBServiceFactory class :

    • This is a factory class that returns the corresponding knowledge base service instance based on the provided vector storage type.
    • Use static methods to obtain different knowledge base service instances
      # 知识库服务工厂类
      class KBServiceFactory:
      
          # 根据向量存储类型返回相应的知识库服务实例
          @staticmethod
          def get_instance(vs_type: str, knowledge_base_name: str) -> KBService:
              if vs_type == SupportedVSType.FAISS:
                  from server.knowledge_base.kb_faiss import KBServiceFaiss
                  return KBServiceFaiss(knowledge_base_name)
              elif vs_type == SupportedVSType.MILVUS:
                  from server.knowledge_base.kb_milvus import KBServiceMilvus
                  return KBServiceMilvus(knowledge_base_name)
              else:
                  raise ValueError(f"Unsupported VS type: {vs_type}")
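
      As a usage illustration of the snippet above (tying it to the SupportedVSType sketch earlier; the argument values are placeholders):

      # Obtain a FAISS-backed service for the "samples" knowledge base and use it
      kb = KBServiceFactory.get_instance(SupportedVSType.FAISS, "samples")
      kb.create_kb()                          # create the doc folder, vector store and DB record
      docs = kb.search_docs("你好", 3, 1.0)    # top 3 hits; threshold 1.0 means no filtering
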
  5. get_kb_details function:

    • Gets the details of all knowledge bases from both the directories on disk and the database, and merges them.
  6. get_kb_file_details function:

    • Gets the details of all files of the specified knowledge base from both the directory and the database, and merges the information.
  7. EmbeddingsFunAdapter class:

    • This is an adapter class used to add extra functionality on top of an Embeddings class.
    • The embed_documents and embed_query methods embed the input text and normalize the results (see the sketch below).
    • aembed_documents and aembed_query are their asynchronous versions.
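
      The "embed and then normalize" step boils down to L2-normalizing each embedding vector; a generic sketch (not the adapter's actual code):

      # L2-normalize each row so that inner product equals cosine similarity afterwards
      import numpy as np

      def normalize(embeddings: np.ndarray) -> np.ndarray:
          norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
          return embeddings / np.clip(norms, a_min=1e-12, a_max=None)
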
  8. score_threshold_process function:

    • A small helper that keeps only the documents whose score passes the given threshold (per the search_docs parameters above, a smaller score means higher relevance) and returns at most the top k of them, as sketched below.
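
      A minimal sketch of that filtering step, under the assumption stated above that a smaller score means higher relevance (the real function's signature may differ):

      # Keep only hits whose distance-style score is within the threshold, then take the top k
      from typing import List, Optional, Tuple
      from langchain.docstore.document import Document

      def score_threshold_process(score_threshold: Optional[float], k: int,
                                  docs: List[Tuple[Document, float]]) -> List[Tuple[Document, float]]:
          if score_threshold is not None:
              docs = [(doc, score) for doc, score in docs if score <= score_threshold]
          return docs[:k]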

4.2 Update of server/db/models folder

4.2.1 Implementation of KnowledgeBaseModel

Implemented in server/db/models/knowledge_base_model.py

from sqlalchemy import Column, Integer, String, DateTime, func

from server.db.base import Base


class KnowledgeBaseModel(Base):
    """
    知识库模型
    """
    __tablename__ = 'knowledge_base'
    id = Column(Integer, primary_key=True, autoincrement=True, comment='知识库ID')
    kb_name = Column(String(50), comment='知识库名称')
    vs_type = Column(String(50), comment='向量库类型')
    embed_model = Column(String(50), comment='嵌入模型名称')
    file_count = Column(Integer, default=0, comment='文件数量')
    create_time = Column(DateTime, default=func.now(), comment='创建时间')

    def __repr__(self):
        return f"<KnowledgeBase(id='{self.id}', kb_name='{self.kb_name}', vs_type='{self.vs_type}', embed_model='{self.embed_model}', file_count='{self.file_count}', create_time='{self.create_time}')>"
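
For readers less familiar with SQLAlchemy, the model above maps to an ordinary database table and can be exercised on its own roughly as follows (a standalone illustration, not the project's own startup code; the SQLite URL and field values are placeholders):

# Create the tables behind the registered models in a throwaway SQLite DB and insert one record
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from server.db.base import Base
from server.db.models.knowledge_base_model import KnowledgeBaseModel

engine = create_engine("sqlite:///./test_kb.db")
Base.metadata.create_all(engine)   # emits CREATE TABLE for every model registered on Base

Session = sessionmaker(bind=engine)
with Session() as session:
    session.add(KnowledgeBaseModel(kb_name="samples", vs_type="faiss", embed_model="m3e-base"))
    session.commit()
    print(session.query(KnowledgeBaseModel).filter_by(kb_name="samples").first())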

4.2.2 Implementation of KnowledgeFile

After careful search, the database model that backs KnowledgeFile, namely KnowledgeFileModel, turns out to be implemented in the project file server/db/models/knowledge_file_model.py

# 导入sqlalchemy所需的模块和函数
from sqlalchemy import Column, Integer, String, DateTime, Float, Boolean, JSON, func

# 从server.db.base导入Base类,这通常用于ORM的基础模型
from server.db.base import Base

# 定义KnowledgeFileModel类,用于映射“知识文件”数据模型
class KnowledgeFileModel(Base):
    """
    知识文件模型
    """
    __tablename__ = 'knowledge_file'
    id = Column(Integer, primary_key=True, autoincrement=True, comment='知识文件ID')

    file_name = Column(String(255), comment='文件名')
    file_ext = Column(String(10), comment='文件扩展名')
    kb_name = Column(String(50), comment='所属知识库名称')
    document_loader_name = Column(String(50), comment='文档加载器名称')

    text_splitter_name = Column(String(50), comment='文本分割器名称')
    file_version = Column(Integer, default=1, comment='文件版本')
    file_mtime = Column(Float, default=0.0, comment="文件修改时间")
    file_size = Column(Integer, default=0, comment="文件大小")
    custom_docs = Column(Boolean, default=False, comment="是否自定义docs")
    docs_count = Column(Integer, default=0, comment="切分文档数量")
    create_time = Column(DateTime, default=func.now(), comment='创建时间')

    # 定义对象的字符串表示形式
    def __repr__(self):
        return (f"<KnowledgeFile(id='{self.id}', file_name='{self.file_name}', "
                f"file_ext='{self.file_ext}', kb_name='{self.kb_name}', "
                f"document_loader_name='{self.document_loader_name}', "
                f"text_splitter_name='{self.text_splitter_name}', "
                f"file_version='{self.file_version}', create_time='{self.create_time}')>")

# 定义FileDocModel类,用于映射“文件-向量库文档”数据模型
class FileDocModel(Base):
    """
    文件-向量库文档模型
    """
    # 定义表名为'file_doc'
    __tablename__ = 'file_doc'

    # 定义id字段为主键,并设置自动递增,并且附加注释
    id = Column(Integer, primary_key=True, autoincrement=True, comment='ID')
    # 定义知识库名称字段,并附加注释
    kb_name = Column(String(50), comment='知识库名称')

    # 定义文件名称字段,并附加注释
    file_name = Column(String(255), comment='文件名称')

    # 定义向量库文档ID字段,并附加注释
    doc_id = Column(String(50), comment="向量库文档ID")
    # 定义元数据字段,默认为一个空字典
    meta_data = Column(JSON, default={})

    # 定义对象的字符串表示形式
    def __repr__(self):
        return f"<FileDoc(id='{self.id}', kb_name='{self.kb_name}', file_name='{self.file_name}', doc_id='{self.doc_id}', metadata='{self.metadata}')>"

4.3 server/db/repository

4.3.1 knowledge_base_repository.py: implement add_kb_to_db

def add_kb_to_db(session, kb_name, vs_type, embed_model):
    # 查询指定名称的知识库是否存在于数据库中
    kb = session.query(KnowledgeBaseModel).filter_by(kb_name=kb_name).first()
    
    # 如果指定的知识库不存在,则创建一个新的知识库实例
    if not kb:
        kb = KnowledgeBaseModel(kb_name=kb_name, vs_type=vs_type, embed_model=embed_model)
        # 将新的知识库实例添加到session,这样可以在之后提交到数据库
        session.add(kb)
    else:  # 如果知识库已经存在,则更新它的vs_type和embed_model
        kb.vs_type = vs_type
        kb.embed_model = embed_model
    
    # 返回True,表示操作成功完成
    return True
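
Note that add_kb_to_db receives the session as a parameter and never commits it; in this kind of repository layer, the commit/rollback is typically handled by a wrapper around the session. A generic sketch of that pattern (the project's own helper is not reproduced here, and its name and details may differ):

# Generic "run inside a session, then commit or roll back" decorator sketch
from functools import wraps
from sqlalchemy.orm import sessionmaker

SessionLocal = sessionmaker()  # in real code: sessionmaker(bind=engine)

def with_session(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        with SessionLocal() as session:
            try:
                result = func(session, *args, **kwargs)
                session.commit()       # persist everything added/modified on the session
                return result
            except Exception:
                session.rollback()     # undo partial changes on error
                raise
    return wrapper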

The KnowledgeBaseModel class itself has already been analyzed in "4.2.1 Implementation of KnowledgeBaseModel" above.

4.3.2 knowledge_file_repository.py: implement add_file_to_db/add_docs_to_db

  • add_file_to_db adds the file record to the database and finally calls add_docs_to_db
  • add_docs_to_db adds the file's documents to the database
# 定义向数据库添加文件的函数
def add_file_to_db(session,  # 数据库会话对象
                kb_file: KnowledgeFile,       # 知识文件对象
                docs_count: int = 0,           # 文档数量,默认为0
                custom_docs: bool = False,     # 是否为自定义文档,默认为False
                doc_infos: List[str] = [],     # 文档信息列表,默认为空。形式为:[{"id": str, "metadata": dict}, ...]
                ):
    # 从数据库中查询与知识库名相匹配的知识库记录
    kb = session.query(KnowledgeBaseModel).filter_by(kb_name=kb_file.kb_name).first()

    # 如果该知识库存在
    if kb:
        # 查询与文件名和知识库名相匹配的文件记录
        existing_file: KnowledgeFileModel = (session.query(KnowledgeFileModel)
                                             .filter_by(file_name=kb_file.filename,
                                                        kb_name=kb_file.kb_name)
                                            .first())
        # 获取文件的修改时间
        mtime = kb_file.get_mtime()
        # 获取文件的大小
        size = kb_file.get_size()

        # 如果该文件已存在
        if existing_file:
            # 更新文件的修改时间
            existing_file.file_mtime = mtime
            # 更新文件的大小
            existing_file.file_size = size
            # 更新文档数量
            existing_file.docs_count = docs_count
            # 更新自定义文档标志
            existing_file.custom_docs = custom_docs
            # 文件版本号自增
            existing_file.file_version += 1
        # 如果文件不存在
        else:
            # 创建一个新的文件记录对象
            new_file = KnowledgeFileModel(
                file_name=kb_file.filename,
                file_ext=kb_file.ext,
                kb_name=kb_file.kb_name,
                document_loader_name=kb_file.document_loader_name,
                text_splitter_name=kb_file.text_splitter_name or "SpacyTextSplitter",
                file_mtime=mtime,
                file_size=size,
                docs_count = docs_count,
                custom_docs=custom_docs,
            )
            # 知识库的文件计数增加
            kb.file_count += 1
            # 将新文件添加到数据库会话中
            session.add(new_file)

        # 添加文档到数据库
        add_docs_to_db(kb_name=kb_file.kb_name, file_name=kb_file.filename, doc_infos=doc_infos)

    # 返回True表示操作成功
    return True

Looking at the second-to-last statement in the code above, we can see that add_file_to_db ultimately calls add_docs_to_db to add the documents to the database.

def add_docs_to_db(session,
                   kb_name: str,
                   file_name: str,
                   doc_infos: List[Dict]):
    '''
    将某知识库某文件对应的所有Document信息添加到数据库
    doc_infos形式:[{"id": str, "metadata": dict}, ...]
    '''
    for d in doc_infos:
        obj = FileDocModel(
            kb_name=kb_name,
            file_name=file_name,
            doc_id=d["id"],
            meta_data=d["metadata"],
        )
        session.add(obj)
    return True

// To be updated..


References and Recommended Reading

  1. langchain official website: LangChain, https://python.langchain.com/ ; API reference: https://api.python.langchain.com/en/latest/api_reference.html ;
    the langchain Chinese site (the translation is not great yet)
  2. LangChain Panorama
  3. Understand langchain in one article (ignore the title; reading that article alone is not enough)
  4. How to Build a Smart Chatbot in 10 mins with LangChain
  5. Several tutorials about FAISS: Faiss introduction and application experience records,
  6. QLoRA: 4-bit quantization + LoRA, using a 3090 to build a personal knowledge base on a 33B LLM with DB-GPT
  7. Build enhanced QA based on LangChain+LLM; Use LangChain to build large language model applications; What is LangChain
  8. July's practical course on combining LLM with langchain / knowledge graphs / databases [problem-solving oriented, practicality is king]

postscript

This article has gone through three stages

  1. langchain has many components:
    to understand it thoroughly, you have to work through them one by one.
    When I first started looking at this library, I was genuinely confused and had no idea where to start; after ten days, I could open the files one by one and read them directly...
    In short, everything is a process.
  2. Interpretation of the langchain-ChatGLM project source code:
    to be honest, it was quite confusing at first because there were so many project files. Fortunately, after a week of sorting, it all fell into place.
  3. Starting a new part: Part 4, a source-code analysis of the upgraded September 2023 version, Langchain-Chatchat

Create, modify, and optimize records

  1. July 5 to July 9: wrote one part per day
  2. 7.10: improved the first part's introduction to what langchain is
  3. 7.11: reorganized the written content according to the latest update of the langchain-ChatGLM project
  4. 7.12: after finishing the first 3.8 sections, adjusted the order in which each folder is interpreted to follow the project's workflow;
    after nearly a week of work, the "overall code structure of langchain-ChatGLM" was finally sorted out
  5. 7.15: supplemented the content on the langchain architecture and, for ease of understanding, divided the entire langchain library into three major layers: basic layer, capability layer, and application layer
  6. 7.17: started writing the fourth part, focusing on Section 4.2: using knowledge graphs to enhance LLM pre-training, reasoning, and interpretability
  7. 7.26: continued the fourth part and started updating the fifth part: the combination of LLM and databases
  8. 9.15: extracted the original fourth and fifth parts into a new article: Popular Introduction to Knowledge Graphs: from what KG is to the practical combination of LLM and KG/DB
  9. 9.16: started writing the new Part 4: a source-code analysis of the upgraded September 2023 version, Langchain-Chatchat
