Building a large language model question-answering knowledge base on Amazon Web Services

As the quality of large language models has improved markedly, related applications keep emerging and are becoming increasingly popular. One widely followed technical route is large language model (LLM) + knowledge retrieval, which compensates for the shortcomings of general-purpose LLMs in private-domain question answering and addresses problems such as unsupported answers and hallucinations in professional fields. The basic idea is to slice private-domain knowledge documents, vectorize the slices, recall relevant slices through vector retrieval, and feed them as context into the large language model for synthesis and summarization.

In practice, the knowledge base can be built with two kinds of indexes, inverted and vector, both of which play a key role in the knowledge-recall step of the question-answering process. Unlike ordinary document or log indexing, knowledge vectorization relies on the semantic capability of deep models and involves additional steps such as document splitting and the deployment and inference of a vectorization (embedding) model. When building the vectorized knowledge database, not only the volume of original documents but also the splitting granularity, vector dimension and other factors must be considered. In the end, the number of knowledge items indexed in the vector database can reach a very large magnitude, mainly for the following two reasons:

• The stock of existing documents in many industries, such as finance, medicine and law, is very large, and new documents are produced at a high rate as well.

• To improve recall, documents are often split at multiple granularities (by sentence and by paragraph) and stored redundantly.

These details pose challenges to the write and query performance of the knowledge vector database. To optimize the construction and management of the vectorized knowledge base, the following knowledge base construction process is built on Amazon Web Services:

• An S3 Bucket event handler triggers a Lambda function in real time when a knowledge file lands in storage, and the Lambda starts the corresponding Glue job (a minimal Lambda handler sketch follows this list)

• The Glue job parses and splits the document, then calls the Embedding model deployed on SageMaker for vectorization

• The vectors are injected into Amazon OpenSearch via bulk writes
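The following is a minimal sketch of such a Lambda handler, assuming a Glue job named ingest-knowledge-job and an S3 ObjectCreated event trigger; the job name, argument names and bucket layout are illustrative assumptions, not the article's exact configuration.

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; starts the Glue ingestion job
    for the uploaded knowledge file."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Pass the file location to the Glue Python shell job as job arguments
        # (the argument names below are assumptions for illustration).
        glue.start_job_run(
            JobName="ingest-knowledge-job",
            Arguments={
                "--bucket": bucket,
                "--object_key": key,
            },
        )
    return {"statusCode": 200}
```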

This article also summarizes best practices and experience for the steps involved in the whole process, including how to vectorize knowledge and how to optimize the vector database.

Knowledge Vectorization

 Document Splitting

The step preceding knowledge vectorization is splitting the knowledge, and preserving semantic integrity is the most important consideration. Two aspects are discussed below, with some experience summarized for each:

 a. Method of splitting fragments

For this part of the work, Langchain, a popular large language model integration framework, provides many Document Loaders and Text Splitters; some are worth referencing, but many are repetitive.

At present, the most common baseline is Langchain's RecursiveCharacterTextSplitter, which is Langchain's default splitter. It splits using a multi-level list of separator characters - ["\n\n", "\n", " ", ""]. By default it splits by paragraph first; if a resulting chunk still exceeds chunk_size, it continues splitting with the next-level separator until the chunk_size requirement is met.
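A minimal usage sketch of that default splitter; the chunk_size, chunk_overlap and file name below are assumptions for illustration:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Uses the default separator hierarchy: ["\n\n", "\n", " ", ""]
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,      # maximum fragment size (measured in characters by default)
    chunk_overlap=30,    # overlap between adjacent fragments
)

with open("doc.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())
```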

However, this approach is relatively crude and may still break apart key content. For some other document formats, more fine-grained practices are possible:

• FAQ files should be split at the granularity of one question plus one answer. The subsequent vectorization input can use the question alone or question + answer

• For Markdown files, "#" is the special character that marks a heading. MarkdownHeaderTextSplitter can be used as the splitter; it better guarantees that content stays associated with the corresponding heading.

• PDF files contain richer formatting information. Langchain provides many Loaders, but the splitting quality of PDFMinerPDFasHTMLLoader is better: it converts the PDF into HTML and then splits by HTML blocks. This method preserves the font size of each block, from which the hierarchy of the content can be deduced, so a paragraph's heading can be linked to its parent heading one level up, making the information more complete. A hedged splitting sketch follows this list.
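As an example of format-aware splitting, the sketch below uses Langchain's MarkdownHeaderTextSplitter; the heading levels chosen and the file name are assumptions for illustration:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split on the first three heading levels and keep them as metadata
headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

with open("manual.md", encoding="utf-8") as f:
    docs = md_splitter.split_text(f.read())

for doc in docs:
    # doc.page_content holds the section text; doc.metadata holds its headings,
    # which can be prepended to the text before vectorization.
    print(doc.metadata, doc.page_content[:50])
```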

 b. Model support for fragment length

Since the split fragments are fed to the vectorization model for inference, the model's Max_seq_length limit must be considered; exceeding it may cause truncation and incomplete semantics. Judged by the supported Max_seq_length, current Embedding models fall into two classes, as shown in the following table (the four listed are models with hands-on practical experience).

| Model name | Max_seq_length |
| --- | --- |
| paraphrase-multilingual-mpnet-base-v2 (sbert.net) | 128 |
| text2vec-base-chinese (text2vec) | 128 |
| text2vec-large-chinese (text2vec) | 512 |
| text-embedding-ada-002 (OpenAI) | 8192 |

Max_seq_length here refers to the number of tokens, which is not the same as the number of characters. From previous testing, one token corresponds to roughly 1.5 Chinese characters for the first three models; for large language models such as ChatGLM, one token is generally about 2 characters. If counting tokens during splitting is inconvenient, you can simply convert by this ratio to make sure no truncation occurs.
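If exact token counts are preferred over the rough ratio, the embedding model's own tokenizer can be reused as the splitter's length function. A minimal sketch, assuming the text2vec model is loaded from the Hugging Face id shibing625/text2vec-base-chinese (the model id and chunk sizes are assumptions):

```python
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained("shibing625/text2vec-base-chinese")

def token_len(text: str) -> int:
    # Count tokens the same way the embedding model will see them
    return len(tokenizer.encode(text, add_special_tokens=False))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=128,             # stay within the model's Max_seq_length
    chunk_overlap=16,
    length_function=token_len,  # measure chunks in tokens, not characters
)
```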

The first three are BERT-based Embedding models, while OpenAI's text-embedding-ada-002 is GPT-3 based. The former are suitable for vectorizing sentences or short paragraphs; the latter, offered through OpenAI's SaaS interface, is suitable for long texts but cannot be deployed privately.

The choice can be validated against recall quality. From current practical experience, text-embedding-ada-002 can rank Chinese texts by similarity, but its scores are not discriminative enough (they concentrate around 0.7), which makes it hard to judge by a threshold whether similar knowledge was actually recalled.

In addition, there is another way to mitigate the length-limit problem: number the split fragments so that adjacent fragments have adjacent numbers. When a fragment is recalled, its neighbors can also be fetched through a range query against the vector database, which again helps preserve the semantic integrity of the recalled content. A hedged query sketch follows.
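A minimal sketch of that neighbor expansion with opensearch-py, assuming each indexed item carries the illustrative fields doc_id and chunk_no (field names, index name, endpoint and window size are all assumptions):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-domain-endpoint", "port": 443}],
                    use_ssl=True)

def fetch_neighbors(doc_id: str, chunk_no: int, window: int = 1):
    """Fetch fragments adjacent to a recalled fragment by chunk number."""
    body = {
        "size": 2 * window + 1,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"doc_id": doc_id}},
                    {"range": {"chunk_no": {
                        "gte": chunk_no - window,
                        "lte": chunk_no + window,
                    }}},
                ]
            }
        },
        "sort": [{"chunk_no": "asc"}],
    }
    return client.search(index="knowledge-index", body=body)["hits"]["hits"]
```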

 Vectorized Model Selection

The four models above were compared only in terms of supported text length; there is not yet an authoritative conclusion on their relative quality. Public leaderboards can give a sense of each model's performance, but most of the evaluations on such lists are benchmarks on public datasets, and whether a benchmark conclusion holds in a real production scenario must be checked case by case. In principle, however, the following experience can be shared:

• A model fine-tuned on the vertical domain has an obvious advantage over the original vector model

• Current vectorization models fall into two categories, symmetric and asymmetric. Without fine-tuning, symmetric recall (Query-to-Question) is recommended for FAQs; for document fragment knowledge, an asymmetric recall model (Query-to-Answer, i.e. query to document fragment) is recommended.

• If there is no obvious difference in quality, prefer a model with a lower vector dimension. High-dimensional vectors (such as OpenAI's text-embedding-ada-002) put pressure on the vector database in terms of both retrieval performance and cost.

Vectorization Parallelism

In real business scenarios, the number of documents ranges from the hundreds up to the million level, and with the redundant multi-granularity recall approach the corresponding knowledge items can reach a scale of up to 100 million. Because the whole offline computation is so large, it must run concurrently; otherwise it cannot keep up with knowledge additions and iteration on vector retrieval quality. The work divides into the following three computation stages.

Parallel Document Segmentation

The concurrency granularity of this stage is the file, and the file formats to be processed are diverse (plain TXT, Markdown, PDF, etc.), each with its own splitting logic. A big data framework such as Spark is not appropriate for this kind of parallelism, while multi-process concurrency on a multi-core instance is too primitive and makes tasks hard to observe and track. AWS Glue's Python shell engine is therefore a good choice. Its main advantages:

• Concurrency at file granularity is easy to set up and simple to control. With built-in retry and timeout mechanisms, tasks are easy to track and observe, and logs go directly to AWS CloudWatch

• Dependency packages are easy to build and run; they can be specified with the --additional-python-modules parameter. The Glue Python runtime also already ships with dependencies such as opensearch-py

Parallel Vectorized Inference

Since the split paragraphs and sentences outnumber the documents many times over, the inference throughput of the vector model determines the throughput of the whole pipeline. A SageMaker Endpoint is used here to deploy the vectorization model. In general, GPU instance inference, multi-node Endpoints / Endpoint auto scaling, and Server-Side / Client-Side batch inference are all effective measures for increasing model throughput. For the offline construction of a vector knowledge base specifically, the following strategies can be adopted:

• GPU instance deployment: a vectorization model can run inference on CPU instances, but in the offline scenario the inference concurrency is high and GPU throughput is roughly 20x that of CPU. Therefore GPU inference can be used offline, while CPU inference remains suitable online.

• Multi-node Endpoint for temporary bursts of vector generation: handle the load by deploying a multi-node Endpoint, which can be shut down once processing finishes

• Client-Side batch inference: for offline inference, building batches on the client side is easy, so Server-Side batching does not need to be enabled. Server-Side batching generally implies a waiting window, such as 50 ms or 100 ms, which pays off for large language models with high inference latency but is not suitable for vectorized inference. A minimal invocation sketch follows this list.
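A minimal sketch of client-side batching against a SageMaker Endpoint with boto3; the endpoint name, batch size and JSON payload schema are assumptions that depend on how the embedding model container was packaged:

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

def embed_batch(sentences, endpoint_name="embedding-endpoint", batch_size=32):
    """Send sentences to the embedding endpoint in client-side batches."""
    vectors = []
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        resp = smr.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": batch}),  # payload schema is container-specific
        )
        vectors.extend(json.loads(resp["Body"].read()))
    return vectors
```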

 OpenSearch batch injection

Write operations to Amazon OpenSearch can be batched through the bulk API, which has a large advantage over writing documents one at a time. A hedged sketch follows.
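A minimal bulk injection sketch with opensearch-py; the index name, field names, endpoint and chunking size are illustrative assumptions:

```python
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "my-domain-endpoint", "port": 443}],
                    use_ssl=True)

def bulk_inject(chunks, vectors, index="knowledge-index"):
    """Write text fragments and their vectors to OpenSearch in bulk requests."""
    actions = (
        {
            "_index": index,
            "_source": {
                "content": chunk,
                "embedding": vector,   # knn_vector field defined in the index mapping
            },
        }
        for chunk, vector in zip(chunks, vectors)
    )
    # helpers.bulk batches the action stream and reports any failed documents
    success, errors = helpers.bulk(client, actions, chunk_size=500)
    return success, errors
```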

 Vector database optimization

The choice of approximate search algorithm for the vector database, an appropriate cluster size, and the tuning of cluster settings are also critical to the read and write performance of the knowledge base. The following aspects need to be considered:

 Algorithm selection

OpenSearch provides two k-NN algorithms: HNSW (Hierarchical Navigable Small World) and IVF (Inverted File).

Several factors should be weighed when choosing a k-NN search algorithm. If memory is not a limiting factor, prefer HNSW, which offers both low latency and good recall. If memory usage must be controlled, consider IVF, which reduces memory while keeping query speed and quality close to HNSW. If memory is the dominant constraint, consider adding PQ (product quantization) encoding to HNSW or IVF to reduce memory further, noting that PQ may lower accuracy. The algorithm and optimization method therefore need to be chosen by weighing these factors against the specific application requirements. An index-creation sketch follows.
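A minimal sketch of creating a k-NN index with HNSW via opensearch-py; the index name, vector dimension, space type and HNSW parameters (m, ef_construction, ef_search) are assumptions to be tuned for the actual workload:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-domain-endpoint", "port": 443}],
                    use_ssl=True)

index_body = {
    "settings": {
        "index": {
            "knn": True,                      # enable the k-NN plugin on this index
            "knn.algo_param.ef_search": 512,  # query-time HNSW parameter
        }
    },
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "doc_id": {"type": "keyword"},
            "chunk_no": {"type": "integer"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,             # must match the embedding model output
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"m": 16, "ef_construction": 512},
                },
            },
        }
    },
}

client.indices.create(index="knowledge-index", body=index_body)
```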

 Cluster Size Estimation

Once the algorithm is chosen, the required memory can be calculated from the algorithm's memory formula to derive the k-NN cluster size.
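As a rough worked example, using the approximate rule of thumb from the OpenSearch k-NN documentation for HNSW, about 1.1 * (4 * d + 8 * m) bytes per vector; the vector count, dimension and m below are assumptions, and JVM heap and OS overhead are not included:

```python
# Estimated native memory needed by the HNSW graphs (excluding the JVM heap)
num_vectors = 100_000_000   # number of knowledge items (assumption)
dimension = 768             # embedding dimension (assumption)
m = 16                      # HNSW m parameter (assumption)

bytes_per_vector = 1.1 * (4 * dimension + 8 * m)
total_gb = num_vectors * bytes_per_vector / 1024**3
print(f"Approximate k-NN memory requirement: {total_gb:.0f} GB")
# ~330 GB in this example, to be spread across the data nodes of the cluster
```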

 Batch injection optimization

When injecting a large amount of data into the knowledge vector store, some key performance optimizations deserve attention. The main strategies are listed below, followed by a hedged settings sketch:

  • Disable the refresh interval

  • Increase the number of indexing threads

  • Increase the k-NN memory ratio
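A minimal sketch of applying these settings with opensearch-py; the exact values (refresh interval, thread count, circuit breaker limit) are assumptions to be tuned, and the refresh interval should be restored after the bulk load finishes:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-domain-endpoint", "port": 443}],
                    use_ssl=True)

# 1. Disable refresh on the target index for the duration of the bulk load
client.indices.put_settings(index="knowledge-index",
                            body={"index": {"refresh_interval": "-1"}})

# 2. Give the k-NN plugin more native indexing threads (cluster-level setting)
client.cluster.put_settings(
    body={"persistent": {"knn.algo_param.index_thread_qty": 4}})

# 3. Raise the k-NN memory circuit breaker limit (share of RAM usable for graphs)
client.cluster.put_settings(
    body={"persistent": {"knn.memory.circuit_breaker.limit": "70%"}})

# ... run the bulk injection, then restore refresh so new data becomes searchable
client.indices.put_settings(index="knowledge-index",
                            body={"index": {"refresh_interval": "30s"}})
```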
