Application practice of knowledge question answering based on large language model – knowledge base construction (Part 2)


The previous article introduced the general process of building a knowledge base along with some optimization details, but it did not walk through a concrete scenario with hands-on experience and benchmarks. This article therefore focuses on a specific scenario.

Target scenario: build a knowledge base from 10,000 PubMed medical academic articles, with fast ingestion and query speed.

The discussion covers OpenSearch cluster sizing, knowledge base index design, and the details of the experimental steps.

01

Resource Estimation

Generally speaking, the OpenSearch resource configuration should be chosen according to the following cluster design guidelines:

  • If the workload is search-heavy, use a shard size of 10-30 GB; if the workload is log-heavy, use a shard size of 30-50 GB;

  • Try to set the number of shards to an even multiple of the number of data nodes, which helps distribute shards evenly across the data nodes;

  • The number of shards per node is proportional to the JVM heap memory, with no more than 25 shards per GB of heap;

  • Plan roughly 1.5 vCPUs per shard; for example, an 8-vCPU node can support around 5-6 shards;

  • If the k-NN field is enabled, refer to the following table for memory estimation.

[Table: k-NN memory estimation formulas. For the nmslib HNSW engine, the required native memory is approximately 1.1 * (4 * dimension + 8 * m) bytes per vector.]

Based on the information at hand, only the number of original documents to be indexed is known. Because of intermediate processing such as document chunking, the actual memory and storage footprint cannot be estimated directly, so a small-batch experiment is needed for extrapolation.

In the small-batch experiment, 300 documents were indexed, producing about 203k records after chunking and 4.5 GB of storage. Scaling proportionally, indexing 10,000 documents would produce roughly 7 million records and about 150 GB of storage, as shown below:

[Figure: storage extrapolation from the 300-document sample to 10,000 documents]
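As a quick sanity check, the proportional scaling above can be reproduced with a few lines of Python (the measured numbers come from the small-batch experiment):

# Small-batch measurement: 300 documents -> ~203k chunks, ~4.5 GB of storage
docs_sampled, chunks_sampled, storage_gb_sampled = 300, 203_000, 4.5

scale = 10_000 / docs_sampled                 # target corpus of 10,000 documents
est_chunks = chunks_sampled * scale           # ~6.8 million records
est_storage_gb = storage_gb_sampled * scale   # ~150 GB
print(f"{est_chunks:,.0f} records, {est_storage_gb:.0f} GB")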

Since the knowledge Q&A chatbot scenario is a search workload, the shard size should be kept in the 10-30 GB range to improve search performance. The number of shards generally follows the principle of being a multiple of the number of data nodes; for a 2-node cluster, that means 2, 4, 8, 16, and so on. Based on the total storage of about 150 GB, candidate shard counts are 8, 10, 12, 14, or 16. With 8 shards, each shard holds about 18.75 GB, which meets the requirement.


For vector retrieval, the HNSW algorithm is used to balance recall and latency. Referring to the benchmark conclusions in reference [1], the HNSW parameter m can be set to 16. For memory planning, the memory usage is calculated according to the formula in the table above:

[Figure: memory estimate: 7,000,000 vectors * 1.1 * (4 * 768 + 8 * 16) bytes ≈ 22.9 GB]

Generally, off-heap memory accounts for about 50% of each node's memory. With the best-practice setting knn.memory.circuit_breaker.limit=70%, roughly 35% of total node memory is available to k-NN, so the cluster needs about 22.9 GB / 35% ≈ 65 GB of memory in total.
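A minimal Python sketch of this calculation, assuming the nmslib HNSW estimate of roughly 1.1 * (4 * dimension + 8 * m) bytes per vector from the table in the previous section:

num_vectors = 7_000_000
dimension = 768
m = 16

# native (off-heap) memory needed by the k-NN graphs
knn_bytes = num_vectors * 1.1 * (4 * dimension + 8 * m)
knn_gb = knn_bytes / 1024 ** 3        # ~22.9 GB

# off-heap memory is ~50% of node memory, and the circuit breaker allows 70% of that => ~35%
total_memory_gb = knn_gb / 0.35       # ~65 GB across the cluster
print(f"k-NN memory: {knn_gb:.1f} GB, total node memory needed: {total_memory_gb:.1f} GB")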


For vCPU planning, assuming 8 shards and a factor of 1.5 vCPUs per shard, at least 12 vCPUs are needed. Combining the C-series and R-series instance configurations and prices below with the memory and vCPU requirements, choose either two c6g.4xlarge nodes (C-series) or two r6g.2xlarge nodes (R-series).

[Table: C-series and R-series instance configurations and pricing]

02

Index Building Experiments

There are three main points to pay attention to in index construction:

  • Data integrity: ensure that all knowledge can be queried and that no data is lost due to ingestion failures.

  • Construction speed: chunking and recall strategies may be tuned repeatedly, requiring repeated re-ingestion, so speed matters greatly for end-to-end development and optimization efficiency.

  • Query performance: ensure a real-time conversational experience in this scenario.

The entire ingestion process can be divided into three stages: text chunking, text vectorization, and ingestion into Amazon OpenSearch. Text chunking and vectorization are transient workloads whose throughput can, in principle, be scaled linearly by increasing the number of concurrent Glue jobs and the number of nodes behind the Amazon SageMaker endpoint. OpenSearch, however, is a pre-provisioned resource (note: the OpenSearch Serverless k-NN vector engine to be released this year will change this). The last two stages, vectorization and OpenSearch ingestion, are therefore the likely bottlenecks of the pipeline. Because an end-to-end test is hard to decompose for bottleneck analysis, these two stages are tested separately.

Experiment 1 – Embedding Model Throughput Test

  1. Use paraphrase-multilingual-deploy.ipynb to deploy the embedding model on 10 g4dn.xlarge instances;

    https://github.com/aws-samples/private-llm-qa-bot/blob/main/notebooks/embedding/paraphrase-multilingual-deploy.ipynb

  2. To eliminate the impact of writing to the downstream OpenSearch, temporarily comment out the relevant write code in aos_write_job.py;

    https://github.com/aws-samples/private-llm-qa-bot/blob/main/code/aos_write_job.py

  3. Use batch_upload_docs.py to launch multiple Glue jobs for concurrent execution (a conceptual sketch of the fan-out follows this list).

    https://github.com/aws-samples/private-llm-qa-bot/blob/main/code/batch_upload_docs.py
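For illustration only, here is a conceptual sketch of how multiple runs of the same Glue job could be launched concurrently. The partitioning argument below is hypothetical; the actual fan-out logic lives in batch_upload_docs.py in the repository linked above.

import boto3

glue = boto3.client("glue")

def launch_concurrent_runs(job_name, concurrent_runs_quota, common_args):
    """Hypothetical sketch: start several runs of the same Glue job.
    concurrent_runs_quota must stay within the job's Maximum concurrency setting."""
    run_ids = []
    for i in range(concurrent_runs_quota):
        response = glue.start_job_run(
            JobName=job_name,
            # '--partition_idx' is a made-up argument showing how work could be split across runs
            Arguments={**common_args, "--partition_idx": str(i)},
        )
        run_ids.append(response["JobRunId"])
    return run_ids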

In this part of the pipeline, the throughput of the vectorization step can be tuned through the Glue job parallelism and the client-side batch size. When GPU utilization is low, increasing the client-side batch size improves utilization. A quick test confirmed this hypothesis; see the following experimental results:

[Table: embedding throughput under different Glue job concurrency and batch size settings]
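To make the client-side batch size concrete, here is a minimal sketch of batched calls to the embedding endpoint. The JSON payload shape is an assumption; the actual request and response format depends on the deployed paraphrase-multilingual model container.

import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

def embed_batch(texts, endpoint_name, batch_size=32):
    """Send texts to the SageMaker embedding endpoint in batches of batch_size."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = smr_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": batch}),  # assumed payload shape
        )
        embeddings.extend(json.loads(response["Body"].read()))
    return embeddings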

Experiment 2 – Amazon OpenSearch Ingestion Test

1. Randomly generate vectors in place of the embedding model call; refer to the following code:

import numpy as np

AOS_BENCHMARK_ENABLED = True  # when True, skip the real embedding endpoint


def get_embedding(smr_client, text_arrs, endpoint_name=EMB_MODEL_ENDPOINT):
    if AOS_BENCHMARK_ENABLED:
        # return random 768-dimension vectors so that only OpenSearch ingestion is measured
        return [np.random.rand(768).tolist() for _ in range(len(text_arrs))]

    # otherwise call the SageMaker endpoint to calculate embeddings
    ...
    return embeddings


2. Build the OpenSearch cluster and index, and optimize the settings;

a. Build the corresponding index

The vector field involves two parameters: ef_construction and m. ef_construction specifies the size of the dynamic candidate list used when constructing the k-NN graph; larger values build a more accurate graph but slow down indexing. m specifies the number of bidirectional links created for each vector; larger values improve retrieval accuracy but significantly increase memory usage. Referring to the benchmark conclusions in the blog "Choose the k-NN algorithm for your billion-scale use case with OpenSearch" (https://aws.amazon.com/cn/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/), for the current data scale, ef_construction: 128 and m: 16 are sufficient to guarantee the recall rate. In addition, pay attention to the following points when building the index:

  1. Add a publish_date field to make it easy to delete or update knowledge by time later;

  2. Add an integer idx field to record the position of each chunk within the full document, so that adjacent context chunks can be recalled with a range query (see the sketch after the mapping below);

  3. Set fields that are used only for filtering, not keyword recall, to the keyword type, which benefits indexing speed. For details, refer to the following code:

PUT chatbot-index
{
    "settings" : {
        "index":{
            "number_of_shards" : 8,
            "number_of_replicas" : 0,
            "knn": "true",
            "knn.algo_param.ef_search": 32,
            "refresh_interval": "60s"
        }
    },
    "mappings": {
        "properties": {
            "publish_date" : {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss"
            },
            "idx" : {
                "type": "integer"
            },
            "doc_type" : {
                "type" : "keyword"
            },
            "doc": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            },
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            },
            "doc_title": {
                "type": "keyword"
            },
            "doc_category": {
                "type": "keyword"
            },
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 128,
                        "m": 16
                    }
                }           
            }
        }
    }
}
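As an illustration of point 2 above, the following opensearch-py sketch recalls the chunks adjacent to a hit at position idx within the same document. The field names match the mapping above; the client construction and authentication details are simplified assumptions.

from opensearchpy import OpenSearch

aos_endpoint = "my-domain.us-east-1.es.amazonaws.com"  # hypothetical domain endpoint
# authentication options omitted for brevity
aos_client = OpenSearch(hosts=[{"host": aos_endpoint, "port": 443}], use_ssl=True)

def fetch_adjacent_chunks(index_name, doc_title, hit_idx, window=1):
    """Recall chunks whose idx is within +/- window of the hit, restricted to the same document."""
    body = {
        "size": 2 * window + 1,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"doc_title": doc_title}},
                    {"range": {"idx": {"gte": hit_idx - window, "lte": hit_idx + window}}},
                ]
            }
        },
        "sort": [{"idx": "asc"}],
    }
    return aos_client.search(index=index_name, body=body)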


b. Set k-NN-related cluster parameters; refer to "Application practice of knowledge question answering based on large language model – knowledge base construction (Part 1)":

PUT /_cluster/settings
{
    "transient": {
        "knn.algo_param.index_thread_qty": 8,
        "knn.memory.circuit_breaker.limit": "70%"
    }
}


c. Launch multiple Glue jobs for concurrent ingestion; refer to the following command:

# Note: ${Concurrent_num} must not exceed the maximum limit set in
# Glue job -> Job details -> Advanced properties -> Maximum concurrency
python batch_upload_docs.py \
  --bucket "${bucket_name}" \
  --aos_endpoint "${OpenSearch_Endpoint}" \
  --emb_model_endpoint "${EmbeddingModel_Endpoint}" \
  --concurrent_runs_quota ${Concurrent_num} \
  --job_name "${Glue_jobname}"


3. Details of some experimental results


The parameters adjusted in each round are marked in bold for reference, to guide parameter tuning for subsequent data ingestion.

[Table: Experiment 2 ingestion results under different parameter combinations]

Experiment 3 – Full-Pipeline Ingestion Test

a. Details of some experimental records

[Table: Experiment 3 full-pipeline ingestion records]

b. Preliminary Experimental Conclusions


The experiment records above show that after splitting 10,000 documents into 7 million vectors, ingestion can be completed in about one hour by tuning the client concurrency, the number of inference endpoint nodes, and the inference batch size, with no integrity issues. This meets the requirements of large-scale knowledge base construction; if the document volume continues to grow, the OpenSearch nodes and SageMaker endpoint nodes can be scaled out further.

03

Summary of Index Construction Experience

Previous OpenSearch ingestion best practices generally did not cover k-NN, so their conclusions cannot be applied directly to k-NN indexes. Through the three experiments above and multiple rounds of testing, the following practical experience and conclusions were obtained for reference:

a. CPU utilization is strongly and positively correlated with the ef_construction and m parameters. With larger ef_construction and m values, the CPU easily reaches 100%. With other parameters unchanged, when ef_construction is 512 the CPU utilization stays at 100% for long periods; when it is lowered to 2, utilization is mostly below 20% with peaks under 30%.

b. Client parallelism is positively, but not linearly, correlated with OpenSearch ingestion speed and load. Multiple clients improve ingestion speed, but too many clients can cause large numbers of errors such as (429, '429 Too Many Requests /_bulk') and (503, "No server available to handle the request..").

c. An exponential backoff retry mechanism ensures ingestion integrity and avoids large-scale write failures caused by transient cluster unavailability. The opensearch-py package provides the bulk helper shown below. If there are too many concurrent clients, CPU utilization may stay at 100%; within max_retries attempts, the helper waits initial_backoff seconds before the first retry and doubles the wait on each subsequent retry, capped by max_backoff. Setting a larger initial_backoff avoids a flood of 429 errors under high client concurrency. In addition, the number of clients should not be too large, otherwise 503-related errors become much more likely; for occasional 503 errors, Glue's retry mechanism can be used to guarantee write completeness.

from opensearchpy import helpers

# chunk_size is the number of documents per bulk request; the default is 500
# max_chunk_bytes is the maximum bytes per bulk request; the 100 MB default is too large and can be reduced to 10-15 MB
# max_retries is the number of retries
# initial_backoff is the number of seconds to sleep before the first retry; the wait doubles on each subsequent retry
# max_backoff is the maximum wait time in seconds
response = helpers.bulk(client,
    doc_generator,
    max_retries=3,
    initial_backoff=200, # the default is 2; a much larger value is recommended
    max_backoff=800,
    max_chunk_bytes=10 * 1024 * 1024) # 10 MB, the community-recommended value


Note: For large-scale data ingestion in production, it is not recommended to use the vector store interface provided by LangChain. Its source code shows that the default implementation is a single client and does not use an exponential backoff retry mechanism internally, so neither ingestion speed nor completeness can be guaranteed.

d. After writing completes, it is recommended to query the deduplicated document count to verify write integrity. You can run the following DSL in the Dev Tools of OpenSearch Dashboards to query the total number of distinct documents. Note that the cardinality aggregation is approximate; increasing the precision_threshold parameter improves its accuracy.

POST /{index_name}/_search
{
  "size": 0,
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "{field_name}",
        "precision_threshold": 20000
      }
    }
  }
}


=> 10000


At the same time, the number of chunks per document can be counted by document name, which helps surface potential document-processing quality issues. Refer to the following code:

GET /{index_name}/_search
{
  "size": 0,
  "aggs": {
    "distinct_values": {
      "terms": {
        "field": "doc_title"
      }
    }
  }
}


=>
...
"aggregations": {
    "distinct_values": {
      "buckets": [
        {
          "key": "ai-content/batch/PMC10000335.txt",
          "doc_count": 42712
        },
        {
          "key": "ai-content/batch/PMC10005506.txt",
          "doc_count": 5279
        },
        ...
        {
          "key": "ai-content/batch/PMC10008235.txt",
          "doc_count": 9
        },
        {
          "key": "ai-content/batch/PMC10001778.txt",
          "doc_count": 1
        }
      ]
    }


e. With refresh_interval set to -1, the number of 503 errors increased significantly under otherwise identical parameters. After changing it to 60s, the situation improved markedly; a similar adjustment can be made if the same problem occurs.
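As a sketch (assuming an existing opensearch-py client named aos_client), the refresh interval can be relaxed during bulk ingestion and tightened again afterwards:

def set_refresh_interval(aos_client, index_name, interval="60s"):
    """Update index.refresh_interval, e.g. '60s' during ingestion, back to '1s' afterwards."""
    aos_client.indices.put_settings(
        index=index_name,
        body={"index": {"refresh_interval": interval}},
    )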

04

Retrieval Performance Tuning

After data injection, the out-of-the-box query performance is very poor, with query latency of several seconds or even more than ten seconds, so some necessary optimizations are required. There are two key points:

a. Segment merge

A segment is the smallest search unit in OpenSearch. If each shard contains only one segment, search efficiency is highest. To achieve this, we can reduce the rate at which small segments are generated by controlling the refresh interval, or merge segments manually. This reduces overhead during search and improves search speed.


You can execute the merge with the following DSL in the Dev Tools of OpenSearch Dashboards. The whole merge process takes quite a long time; before executing it, you can increase the maximum number of threads used for merging to speed it up.

# merge segments
POST /{index_name}/_forcemerge?max_num_segments=1&pretty


# increase max_thread_count for merge tasks
PUT {index_name}/_settings
{
  "index.merge.scheduler.max_thread_count": 8
}


You can execute the following DSL before and after merging to check the current segments:

GET _cat/segments/{index_name}?v&h=index,segment,shard,docs.count,docs.deleted,size


The following table shows the state after the segment merge. Each shard contains only one segment, the data is evenly distributed, and documents marked for deletion have been cleaned up.

[Table: segment distribution per shard after force merge]

b. k-NN index warmup


Since k-NN index performance depends heavily on whether the index data structures are cached in memory, the available cache capacity has a large impact on performance. The following DSL command warms up the k-NN index:

GET /_plugins/_knn/warmup/{index_name}?pretty


The warm-up runs very quickly, and performance improves significantly once it completes. You can check the KNNGraphMemoryUsagePercentage metric of the OpenSearch domain in CloudWatch to confirm whether it has finished, as shown in the figure:

[Figure: CloudWatch KNNGraphMemoryUsagePercentage metric after warm-up]
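As an alternative to the CloudWatch console, the k-NN plugin's stats API can be polled to check per-node graph memory usage after the warm-up. A minimal sketch, assuming an existing opensearch-py client named aos_client:

def knn_graph_memory_usage(aos_client):
    """Return graph_memory_usage_percentage per node from the k-NN stats API."""
    stats = aos_client.transport.perform_request("GET", "/_plugins/_knn/stats")
    return {
        node_id: node.get("graph_memory_usage_percentage")
        for node_id, node in stats.get("nodes", {}).items()
    }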

05

Epilogue

Building on the previous blog in this series, this article goes into more detail through a real data scenario, focusing on building vector-based knowledge bases for large document collections faster and more completely. This is instructive for constructing industry knowledge bases in fields such as finance, law, and healthcare.

The first part of this article provides guidance on selecting the cluster configuration for Amazon OpenSearch, and the second, third, and fourth parts summarize preliminary experience with data ingestion and retrieval performance.


Several related blog posts in this series will go further, including:

  • "Performance Evaluation and Selection Analysis of Amazon OpenSearch Vector Database" will focus on Amazon OpenSearch as a vector database, discuss its advantages and positioning, give more detailed benchmarks in terms of indexing and query, and provide users with richer reference information.

  • "Large language model-based knowledge question answering application practice-knowledge recall optimization" will discuss how to better recall the corresponding knowledge under the premise and background of knowledge base construction, including various applicable recall methods and practical skills. In addition, the code details mentioned in this article can refer to the supporting materials:

  1. Code repository aws-samples/private-llm-qa-bot

    https://github.com/aws-samples/private-llm-qa-bot

  2. Workshop "Intelligent Question Answering System Based on Amazon OpenSearch + Large Language Model" (Chinese and English versions)

    https://github.com/aws-samples/private-llm-qa-bot

References:

1. Choose the k-NN algorithm for your billion-scale use case with OpenSearch

https://aws.amazon.com/cn/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/

About the authors


Li Yuanbo

Analytics and AI/ML solutions architect at Amazon Web Services, focusing on end-to-end architecture design and business optimization for AI/ML scenarios, and responsible for the Amazon Clean Rooms service on the data analytics side. He has worked in the Internet industry for many years and has rich hands-on experience in user profiling, refined operations, recommendation systems, and big data processing.


Sun Jian

Big data solutions architect at Amazon Web Services, responsible for consulting and architecture design of big data solutions on Amazon Web Services, and committed to big data research and promotion. He has extensive experience in big data operations and tuning, container solutions, lakehouse integration, and enterprise big data applications.


Tang Shijian

Data analytics solutions architect at Amazon Web Services, responsible for consulting and architecture design of customers' big data solutions.


Guo Ren

AI and machine learning solutions architect at Amazon Web Services, responsible for consulting and design of machine learning solution architectures on Amazon Web Services, and dedicated to implementing and promoting machine learning solutions in gaming, e-commerce, Internet media, and other industries. Before joining Amazon Web Services, he worked on the open source and standardization of data intelligence technologies and has rich design and practical experience.
