[Tencent Cloud Lab] Building an Enterprise Private Knowledge Base from Scratch with a Crawler + Vector Database + LLM

1. Foreword

This article demonstrates, from scratch, how to combine a crawler -> vector database -> LLM into a large-model knowledge base, offering ideas and direction for the enterprise large-model solutions that are popular today. It does not rely on any crawler framework, LangChain, ChatGLM tooling, or the like; instead it explains large models and vector databases from the most basic perspective, in an easy-to-understand and intuitive way.

    The most popular Chinese open-source large models today are ChatGLM (Zhipu AI), Baichuan, and the like. Although their cognitive capabilities cannot yet keep up with ChatGPT 3.5, their open-source releases have attracted a large number of AI researchers.

At present, the biggest problems with large language models are:

1. Research cost is high. Running a 13B-or-larger model at full precision requires more than 24 GB of GPU memory. If quantization does not meet quality requirements, a lot of cost goes into early research, and when multiple LLM projects are developed in parallel, project teams run into resource contention;

2. Training cost is high and the return is unpredictable. As every "alchemist" who refines models knows, the biggest problem with alchemy is that the assembled training set, the number of training epochs, and the various hyperparameters may still yield a useless "elixir". Worse, knowledge changes day by day; to update the model's knowledge, you must retrain;

3. Hallucination (talking nonsense). Hallucination means you ask a question and the model gives you a confident, clear answer that is wrong; if you are not a professional, you may be misled by it. LLM hallucination is not deliberate fantasy but a consequence of the training method and data sources: LLMs are trained on huge volumes of Internet text covering all kinds of topics and contexts.

These are the common problems of current LLMs, and they give both model developers and users headaches. For enterprise-level AI applications, a solution everyone is currently exploring is the combination of a vector database + LLM: it addresses research cost, training cost, and hallucination, compensating for missing or stale model knowledge with accurate content from the knowledge base.

The principle: store the key knowledge points in a vector database; when a question is asked, decompose it via word segmentation or a large model, retrieve the key content from the vector database, and then feed that content to the LLM to get the desired answer. This makes the AI knowledge base maintainable. The pattern works with both the OpenAI API and privatized LLMs.
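As a minimal sketch of this retrieve-then-prompt flow (the function names here are hypothetical placeholders standing in for the components built later in this article, not a real API):

# Hypothetical retrieve-then-prompt sketch; search_knowledge_base and llm_chat
# stand in for the vector search and LLM chat pieces implemented below
def answer(question: str) -> str:
    docs = search_knowledge_base(question, limit=3)  # retrieve the most relevant fragments
    context = "\n".join(doc["text"] for doc in docs)  # concatenate them as context
    prompt = "请按照\"" + question + "\"进行总结,内容是:" + context  # wrap with an instruction
    return llm_chat(prompt)  # feed the wrapped content to the LLM

The same prompt template appears in the final code in Section 5.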

Next, we will study the vector database + LLM combination from an exploratory perspective (LangChain and the like are not used here, because they are deeply encapsulated and unsuited to theoretical research and exploration).

2. Implementation goals

The goal this time is to build a knowledge Q&A system based on Tencent Cloud VectorDB and an LLM:

1. Build a vector database (Tencent Cloud VectorDB is used here);

2. Develop knowledge base collection and storage tools

(1) Write crawler tools to crawl knowledge base data;

(2) Write the data storage service;

3. Develop the LLM large model dialogue function and add the vector database knowledge base to the large model dialogue;

Environment preparation:

Python: 3.10

LLM: ChatGLM3

Operating environment: Windows 11 WSL2 Ubuntu 22.04

Development tools: VS Code

3. Develop crawler and knowledge base storage tools

3.1. Environment setup

Create an independent Python virtual environment, storing its contents in venv:

python  -m venv venv

Activate the venv. On WSL2/Ubuntu:

source venv/bin/activate

(On native Windows the equivalent would be venv\Scripts\activate.bat.)

3.2. Crawler tool development

Determine the URL address to crawl:

https://cloud.tencent.com/document/product/1709

Write a Crawling.py crawler to fetch the vector database documentation content.

Introduce dependency packages:

import requests
import json
import re
from bs4 import BeautifulSoup

Install the dependencies:

pip install requests
pip install bs4
pip install lxml

Define relevant variables:

seed = "https://cloud.tencent.com/document/product/1709"
baseUrl="https://cloud.tencent.com"
appendUrlList=[]
appendDataList = []

Get the URLs of all sub-sections of the documentation tree. The CSS selector textarea.J-qcSideNavListData locates the textarea element, and the JSON navigation description is parsed from its text.

def getCrawl(seed):
    # Fetch the seed page and extract the JSON navigation tree
    # embedded in the textarea.J-qcSideNavListData element
    textdata = requests.get(seed).text
    soup = BeautifulSoup(textdata, 'lxml')
    nodes = soup.select("textarea.J-qcSideNavListData")
    jsonObj = json.loads(nodes[0].getText())["list"]
    getChild(jsonObj)

def getChild(nowObj):
    # Recursively walk the navigation tree, collecting title/link pairs
    if nowObj is not None:
        for n in nowObj:
            links = baseUrl + n["link"]
            data = {"title": n["title"], "link": links}
            appendUrlList.append(data)
            if n.get("children") is not None:
                getChild(n.get("children"))
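For reference, the JSON pulled from the textarea presumably has a nested shape roughly like the following, since the code reads list, title, link, and children (the values below are placeholders, not real links):

# Illustrative shape only; the real values come from the page
{
    "list": [
        {
            "title": "...",
            "link": "/document/product/1709/xxxxx",
            "children": [
                {"title": "...", "link": "/document/product/1709/xxxxx"}
            ]
        }
    ]
}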

Traverse the collected address list and fetch the text content of each page. The HTML tags must be stripped from the text, otherwise there will be a lot of interfering content:

def crawlData():
    getCrawl(seed)
    for data in appendUrlList:
        url = data["link"]
        print("Crawling: " + data["title"] + "        " + data["link"])
        textdata = requests.get(url).text
        soup = BeautifulSoup(textdata, 'lxml')
        nodes = soup.select("div.J-markdown-box")
        if nodes is not None and len(nodes) > 0:
            text = nodes[0].get_text()
            text = text[:20000]  # truncate here, or overly long text overflows downstream limits
            stringText = re.sub('\n+', '\n', text)  # collapse repeated newlines
            data = {"url": url, "title": data["title"], "text": stringText}
            appendDataList.append(data)
    return appendDataList

At this point the dynamic-acquisition part of the knowledge base is complete. Relatively simple!
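To sanity-check the crawler on its own before wiring up the database, a quick test block (my addition, not part of the original tool) can be appended to Crawling.py:

# Standalone smoke test: run Crawling.py directly and inspect the first few results
if __name__ == '__main__':
    results = crawlData()
    print("Crawled %d documents" % len(results))
    for item in results[:3]:
        print(item["title"], item["url"])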

3.3. Development of vector knowledge base storage function

3.3.1 Create Tencent Cloud vector database

Tencent Cloud VectorDB currently offers a low-spec instance for free. You only need to search for "Vector Database" in the console.

Select a region closest to your location and click New to create one

Create a free beta vector database

Enter the instance and enable external network access:

    Set the IP addresses allowed to access the instance. For testing, just enter 0.0.0.0/0 so that all IPs can connect; this saves you from working out which outbound IP of a multi-network setup needs whitelisting.

    Get the external IP:

Get the key:

Once you get this information, you can write the information into the code.

In addition, if you want to query the inserted data or create databases and collections, you can also click DMC to log in to the management console:

Login to DMC

Query data

3.3.2 Develop vector storage

Things to note before starting:

1. [Important] Do not create a filter index on the text field behind the vector; it wastes a large amount of memory and has no effect.

2. [Required indexes] The primary-key field id and the vector field vector are currently fixed and required; see the example below.

3. [Other indexes] Index only the fields used as query conditions during retrieval. For example, to filter by a book's author, the author field must be indexed, otherwise it cannot be used as a filter at query time. Fields that are never filtered on should not be indexed, since that wastes memory.

4. The vector database supports dynamic schema: when writing data, any field can be written without defining it in advance, similar to MongoDB.

5. The example creates an index for book fragments. Suppose a fragment's fields are {id, vector, segment, bookName, page}: id must be globally unique as the primary key, segment is the text fragment, and vector is the embedding of the segment, which needs a vector index. If we want to query the contents of a specific book by name, bookName must also be indexed; the remaining fields need no conditional queries and therefore no indexes.

6. When creating a collection with Embedding enabled, make sure the dimension of the vector index matches the dimension of the vectors produced by the chosen Embedding model. The model-to-dimension mapping is shown in the SDK enum listed below.

Create TencentVDB.py file

Introduce dependency packages

from Crawling import crawlData
import tcvectordb
from tcvectordb.model.collection import Embedding
from tcvectordb.model.document import Document, Filter, SearchParams
from tcvectordb.model.enum import FieldType, IndexType, MetricType, EmbeddingModel, ReadConsistency
from tcvectordb.model.index import Index, VectorIndex, FilterIndex, HNSWParams, IVFFLATParams

Turn off debug mode

tcvectordb.debug.DebugEnable = False

Create a TencentVDB class; its pieces are explained chunk by chunk below.

Initialize the client connection to tcvectordb; the connection details are passed in from main later:

class TencentVDB:
    def __init__(self, url: str, username: str, key: str, timeout: int = 30):
        """
        Initialize the client
        """
        # The read_consistency set here is reused by subsequent SDK calls
        self._client = tcvectordb.VectorDBClient(url=url, username=username, key=key,
                                                 read_consistency=ReadConsistency.EVENTUAL_CONSISTENCY,
                                                 timeout=timeout)

Create the database and collection (you can also create them in Tencent Cloud DMC):

    def create_db_and_collection(self):
        database = 'crawlingdb'
        coll_embedding_name = 'tencent_knowledge'

        # Step 1: create the database 'crawlingdb'
        db = self._client.create_database(database)
        database_list = self._client.list_databases()
        for db_item in database_list:
            print(db_item.database_name)

        # Index rules: id is the primary key; vector gets an HNSW vector index
        # (1024 dims, matching text2vec-large-chinese); title is filterable
        index = Index()
        index.add(VectorIndex('vector', 1024, IndexType.HNSW, MetricType.COSINE, HNSWParams(m=16, efconstruction=200)))
        index.add(FilterIndex('id', FieldType.String, IndexType.PRIMARY_KEY))
        index.add(FilterIndex('title', FieldType.String, IndexType.FILTER))
        ebd = Embedding(vector_field='vector', field='text', model=EmbeddingModel.TEXT2VEC_LARGE_CHINESE)

        # Step 2: create a Collection with server-side Embedding enabled
        db.create_collection(
            name=coll_embedding_name,
            shard=3,
            replicas=0,
            description='爬虫向量数据库实验',
            index=index,
            embedding=ebd,
            timeout=50
        )

There are many Embedding model options; choose whichever fits your actual needs.

The model enum definitions are located at: \venv\Lib\site-packages\tcvectordb\model\enum.py

 BGE_BASE_ZH = ("bge-base-zh", 768)
 M3E_BASE = ("m3e-base", 768)
 TEXT2VEC_LARGE_CHINESE = ("text2vec-large-chinese", 1024)
 E5_LARGE_V2 = ("e5-large-v2", 1024)
 MULTILINGUAL_E5_BASE = ("multilingual-e5-base", 768)
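If you switch models, the dimension passed to VectorIndex must change to match. For example, a hypothetical switch to m3e-base would look like:

# Hypothetical: m3e-base produces 768-dimensional vectors,
# so the vector index dimension must be 768 instead of 1024
index.add(VectorIndex('vector', 768, IndexType.HNSW, MetricType.COSINE, HNSWParams(m=16, efconstruction=200)))
ebd = Embedding(vector_field='vector', field='text', model=EmbeddingModel.M3E_BASE)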

Call the crawler and write its data into the vector database:

    def upsert_data(self):
        # Get the Collection object (names must match those created above)
        db = self._client.database('crawlingdb')
        coll = db.collection('tencent_knowledge')

        # upsert writes data; there may be some indexing delay
        # 1. Dynamic schema: besides the mandatory id and vector fields, any other field may be written
        # 2. upsert overwrites: if a document id already exists, the new data replaces the old
        #    (the old document is deleted, then the new one is inserted)
        data = crawlData()
        docList = []
        for dd in data:
            docList.append(Document(id=dd["url"],
                                    text=dd["text"],
                                    title=dd["title"]))
        coll.upsert(documents=docList, build_index=True)
        print("Successfully wrote the data into Tencent Cloud VectorDB")

Invoke it:

if __name__ == '__main__':
    test_vdb = TencentVDB('http://xxxxxxxx.clb.ap-beijing.tencentclb.com:50000', key='xxxxx', username='root')
    # test_vdb.create_db_and_collection()  # run once first to create the database and collection
    test_vdb.upsert_data()

After execution, it will output the crawl progress and write results:

If you see this error:

code=1, message=There was an error with the embedding: token rate limit reached

it means too much content was collected and the free account's embedding rate limit was reached; delete some stored collections or reduce the amount written per run.
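One possible workaround, sketched here on the assumption that the limit is triggered by large single requests, is to upsert in small batches with a pause in between:

import time

# Sketch: write documents in small batches to stay under the embedding token rate limit
def upsert_in_batches(coll, doc_list, batch_size=5, pause_seconds=2):
    for i in range(0, len(doc_list), batch_size):
        coll.upsert(documents=doc_list[i:i + batch_size], build_index=True)
        time.sleep(pause_seconds)  # give the rate limiter time to recover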

Log in to DMC to check whether the data has landed in the database:

4. Develop LLM model dialogue function

The LLM used is ChatGLM3. I prefer this model because the 6B variant is fast and relatively accurate. Answering questions here takes about 14 GB of GPU memory; if you have less, quantize as appropriate, or use a cloud GPU provider such as AutoDL, where building and verifying a full setup costs less than 10 yuan.
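For reference, the ChatGLM repositories document loading a quantized model roughly as follows; treat this as a sketch and verify it against the version of the model you download:

# 4-bit quantization as documented by the ChatGLM projects; requires cpm_kernels
# and substantially reduces the GPU memory needed
from transformers import AutoModel
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).quantize(4).cuda()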

Introduce dependencies

Create file requirements.txt

protobuf
transformers>=4.30.2
cpm_kernels
torch>=2.0
gradio~=3.39
sentencepiece
accelerate
sse-starlette
streamlit>=1.24.0
fastapi>=0.104.1
uvicorn~=0.24.0
loguru~=0.7.2

Install the LLM dependencies:

pip install -r requirements.txt

Download the ChatGLM3 model, domestic download address:

https://modelscope.cn/models/ZhipuAI/chatglm3-6b/

https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base/

https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k/summary

Just choose one of the three; the 32K variant mainly supports long-context conversations. The model is more than ten gigabytes, so it is best kept on a solid-state drive to reduce loading time.
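Besides downloading through the browser, ModelScope's Python SDK can fetch the model; a sketch, assuming the modelscope package is installed (pip install modelscope):

from modelscope import snapshot_download

# Downloads to the local ModelScope cache and returns the directory path;
# point MODEL_PATH at this directory
model_dir = snapshot_download('ZhipuAI/chatglm3-6b')
print(model_dir)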

Code the ChatGLM chat dialogue, using Streamlit as the chat UI framework.

Introduce dependency packages:

import os
import streamlit as st
import torch
from transformers import AutoModel, AutoTokenizer

Set the model location. My code here sits alongside the THUDM directory; you can also use an absolute path pointing to the downloaded model folder.

MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/chatglm3-6b')
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", MODEL_PATH)

Determine whether the machine has an NVIDIA GPU and whether the CUDA driver is installed. If not, refer to my CUDA installation article:

https://blog.csdn.net/cnor/article/details/129170865

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

Set title

# Configure the page title, icon and layout
st.set_page_config(
    page_title="我的AI知识库",
    page_icon=":robot:",
    layout="wide"
)

Load the model, deciding whether to compute on CUDA or the CPU:

@st.cache_resource
def get_model():
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
    if 'cuda' in DEVICE:  # NVIDIA (and AMD) GPUs can use half precision
        model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).to(DEVICE).eval()
    else:  # CPU and other devices fall back to float32
        model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).float().to(DEVICE).eval()
    # Multi-GPU support: replace the line above with the two lines below,
    # setting num_gpus to your actual GPU count
    # from utils import load_model_on_gpus
    # model = load_model_on_gpus("THUDM/chatglm3-6b", num_gpus=2)
    return tokenizer, model

Add some adjustment controls in the page sidebar, and handle the chat history so that follow-up questions can use previous context.

# Load the ChatGLM3 model and tokenizer
tokenizer, model = get_model()

# Initialize chat history and past key values
if "history" not in st.session_state:
    st.session_state.history = []
if "past_key_values" not in st.session_state:
    st.session_state.past_key_values = None

# Sidebar sliders for max_length, top_p and temperature
max_length = st.sidebar.slider("max_length", 0, 32768, 8000, step=1)
top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
temperature = st.sidebar.slider("temperature", 0.0, 1.0, 0.8, step=0.01)


# Button to clear the chat history
buttonClean = st.sidebar.button("清理会话历史", key="clean")
if buttonClean:
    st.session_state.history = []
    st.session_state.past_key_values = None
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    st.rerun()

# Render the chat history
for i, message in enumerate(st.session_state.history):
    if message["role"] == "user":
        with st.chat_message(name="user", avatar="user"):
            st.markdown(message["content"])
    else:
        with st.chat_message(name="assistant", avatar="assistant"):
            st.markdown(message["content"])

# Input and output placeholders
with st.chat_message(name="user", avatar="user"):
    input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
    message_placeholder = st.empty()

Read the user's input and use model.stream_chat to send it to the transformers model, rendering the reply as a stream:

# Get the user input
prompt_text = st.chat_input("请输入您的问题")

# If the user entered something, generate a reply
if prompt_text:
    input_placeholder.markdown(prompt_text)
    # Note: do not reset history here, otherwise multi-turn context is lost
    history = st.session_state.history
    past_key_values = st.session_state.past_key_values
    for response, history, past_key_values in model.stream_chat(
        tokenizer,
        prompt_text,
        history,
        past_key_values=past_key_values,
        max_length=max_length,
        top_p=top_p,
        temperature=temperature,
        return_past_key_values=True,
    ):
        message_placeholder.markdown(response)
    st.session_state.history = history
    st.session_state.past_key_values = past_key_values
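Assuming the chat code above is saved as main.py (a placeholder name; use whatever you named the file), launch the UI with:

streamlit run main.py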

The steps above complete the large-model dialogue part. Next we extend this code to connect it to Tencent Cloud VectorDB (you could equally use Milvus or a similar vector database here).

Output result:

At this point, the answers still come from the model's original training; it has no knowledge related to Tencent Cloud VectorDB.

5. Combining Tencent Cloud VectorDB with the LLM

Building on step 4, we add the following content to complete this large-model knowledge base.

Add a Tencent Cloud vector database query function:

# Additional imports needed in this file
import tcvectordb
from tcvectordb.model.document import SearchParams

def searchTvdb(txt):
    conn_params = {
        'url': 'http://lb-xxxxx.clb.ap-beijing.tencentclb.com:50000',
        'key': 'xxxxxxxxxxxxxxxxxx',
        'username': 'root',
        'timeout': 20
    }

    vdb_client = tcvectordb.VectorDBClient(
        url=conn_params['url'],
        username=conn_params['username'],
        key=conn_params['key'],
        timeout=conn_params['timeout'],
    )

    db = vdb_client.database('crawlingdb')
    coll = db.collection('tencent_knowledge')
    # Server-side embedding: pass raw text, get back the 3 most similar documents
    embeddingItems = [txt]
    search_by_text_res = coll.searchByText(embeddingItems=embeddingItems, limit=3, params=SearchParams(ef=100))
    return search_by_text_res.get('documents')

def listToString(doc_lists):
    # searchByText returns one result list per input text; concatenate all fragments
    result = ""
    for docs in doc_lists:
        for doc in docs:
            result = result + doc["text"]
    return result
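A quick usage check helps make the return shape clear: searchByText returns one result list per entry in embeddingItems, which is why listToString iterates two levels deep. A hypothetical example:

# Hypothetical query; docs[0] is the match list for the first (only) input text
docs = searchTvdb("向量数据库如何计费")
for doc in docs[0]:
    print(doc["title"], doc["id"])  # id stores the source URL
print(listToString(docs)[:200])  # preview of the concatenated text fed to the LLM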

The user's question is used to query the vector database, and some auxiliary wording is prepended to the retrieved content so the LLM understands our intent to use it as a knowledge base.

Because the vector database returns a lot of content and resources are limited, a switch is added here: ordinary GLM chat keeps the dialogue history, while knowledge-base mode turns the history off.

def on_mode_change():
    mode = st.session_state.dialogue_mode
    text = f"已切换到 {mode} 模式。"
    st.session_state.history = []
    st.session_state.past_key_values = None
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
       
    st.toast(text)

dialogue_mode = st.sidebar.selectbox("请选择对话模式",
                                ["腾讯云知识库对话",
                                "正常LLM对话(支持历史)",
                                ],
                                on_change=on_mode_change,
                                key="dialogue_mode",
                                )

Change the input-handling section to:

if prompt_text:
    mode = st.session_state.dialogue_mode
    template_data = ""
    if mode == "腾讯云知识库对话":
        result = searchTvdb(prompt_text)
        knowledge = listToString(result)
        # Prepend an instruction so the LLM summarizes the retrieved knowledge
        template_data = "请按照\"" + prompt_text + "\"进行总结,内容是:" + knowledge
        template_data = template_data[:20000]  # truncate to stay within the context limit
    else:
        template_data = prompt_text

    input_placeholder.markdown(prompt_text)
    history = st.session_state.history
    past_key_values = st.session_state.past_key_values
    if mode == "腾讯云知识库对话":
        history = []  # knowledge-base mode runs without dialogue history
    for response, history, past_key_values in model.stream_chat(
        tokenizer,
        template_data,
        history,
        past_key_values=past_key_values,
        max_length=max_length,
        top_p=top_p,
        temperature=temperature,
        return_past_key_values=True,
    ):
        message_placeholder.markdown(response)

    # Update history and past key values only in normal chat mode
    if mode != "腾讯云知识库对话":
        st.session_state.history = history
        st.session_state.past_key_values = past_key_values
    else:
        # Append the source titles and URLs (stored as ids) as reference links
        endString = ""
        for doc in result[0]:
            endString = endString + "\n\n" + doc["title"] + "     " + doc["id"]
        response = response + "\n\n参考链接:\n\n" + endString
    message_placeholder.markdown(response)

Final result:


Summary

From this exercise we can conclude that it is feasible to build your own enterprise private AI model with a crawler + vector database + LLM. The stronger the LLM, the better it handles long content and the material retrieved from the vector database. Tencent Cloud VectorDB performs well, is easy to use, and supports a variety of Embedding models; it is worth a try.

Origin: blog.csdn.net/cnor/article/details/134665954