Langchain-ChatGLM configuration file parameter test

1 Known parameters that may affect dialogue quality (located in configs/model_config.py):

# Sentence length used when splitting text
SENTENCE_SIZE = 100
# Length of the single context segment built around each match
CHUNK_SIZE = 250
# Number of rounds of chat history passed to the LLM
LLM_HISTORY_LEN = 3
# Number of matched entries returned by knowledge-base retrieval
VECTOR_SEARCH_TOP_K = 5
# Relevance score for retrieved knowledge; the value ranges roughly from 0 to 1100.
# 0 disables the filter; in testing, values below 500 gave more precise matches.
VECTOR_SEARCH_SCORE_THRESHOLD = 0

Among these, the parameters most likely to affect how the knowledge base is used are CHUNK_SIZE (the length of each cited context segment), VECTOR_SEARCH_TOP_K (the number of knowledge-base passages retrieved), and VECTOR_SEARCH_SCORE_THRESHOLD (the relevance score threshold that retrieved content must meet). In this experiment, the same questions are put to the system under different parameter configurations, and the answers given under each configuration are ranked question by question. Finally, a Friedman test followed by a Nemenyi post-hoc test is used to analyze whether the answer rankings of the different configurations differ significantly.
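
To give a rough sense of where these three parameters act, here is a simplified sketch of a FAISS-based retrieval step in LangChain. It is not langchain-ChatGLM's actual code: the embedding model name and index path are assumptions, and the real project stitches a CHUNK_SIZE-long context window around each match rather than simply truncating it.

# Simplified sketch of knowledge-base retrieval (assumptions noted above)
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

CHUNK_SIZE = 250                      # max length of the context kept per match
VECTOR_SEARCH_TOP_K = 5               # number of matched chunks to retrieve
VECTOR_SEARCH_SCORE_THRESHOLD = 500   # 0 disables the filter; lower values are stricter

embeddings = HuggingFaceEmbeddings(model_name="GanymedeNil/text2vec-large-chinese")  # assumed embedding model
store = FAISS.load_local("vector_store", embeddings)  # assumed path of a previously built index

def retrieve(query: str) -> list[str]:
    # similarity_search_with_score returns (Document, L2 distance); smaller distance = closer match
    hits = store.similarity_search_with_score(query, k=VECTOR_SEARCH_TOP_K)
    if VECTOR_SEARCH_SCORE_THRESHOLD > 0:
        hits = [(doc, score) for doc, score in hits if score < VECTOR_SEARCH_SCORE_THRESHOLD]
    # keep at most CHUNK_SIZE characters of each matched chunk for the prompt
    return [doc.page_content[:CHUNK_SIZE] for doc, _ in hits]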

2 Design Questions
The knowledge base is the book "Introduction to Deep Learning: Theory and Implementation Based on Python". The questions put to the model cover the following types:
Knowledge type (K - knowledge): general deep learning questions that do not strictly require the knowledge base to answer, although the reference answers are contained in it.
Content type (C - content): questions about specific content of the book that can only be answered with the knowledge base.
Questions are also divided into the following two categories:
General type (G - general): questions about general concepts, or questions that require summarizing part of the book's content.
Specific type (S - specific): questions about technical details of deep learning or of the book's content.

Each question carries one label from each classification. For example, KG denotes a knowledge-general question such as "What is deep learning?", and CS denotes a content-specific question such as "What data set was used as the training example for the handwritten digit recognition example in the book?". We designed 5 questions for each of the four categories, 20 questions in total.

Question list:
KG:
1 What is deep learning
2 What is a neural network
3 What is a convolutional neural network
4 Briefly introduce backpropagation in neural networks
5 What is overfitting in neural networks and how can it be avoided
CG:
1 What kinds of neural networks are mainly introduced in the book
2 What methods for updating neural network parameters are introduced in the book
3 What commonly used activation functions are introduced in the book
4 What methods for setting the initial values of neural network weights are introduced in the book
5 What methods for suppressing overfitting are introduced in the book
KS:
1 Why can't the initial values of neural network weights all be set to 0
2 Why are weight gradients generally computed with backpropagation rather than numerical differentiation
3 Why can the network only be made effectively deeper with nonlinear activation functions
4 What are the roles of the convolutional layer and the pooling layer in a convolutional neural network
5 Why should the training data set and the test data set be kept separate
CS:
1 What activation function does the book use for the output layer of a neural network that solves classification problems
2 What data set is used as the training data in the book's handwritten digit recognition sample program
3 Why does the book's handwritten digit recognition sample program process the input data in batches
4 What advantages of batch normalization are mentioned in the book
5 What conditions prone to overfitting are mentioned in the book

For each type of question, we rank the answers according to the following criteria:
K questions:
1 Correctness of the answer: whether the model's answer contains factual errors
2 Citation relevance: whether the original text cited by the model is relevant to the answer
C questions:
1 Comprehensiveness: whether the model correctly recounts all the relevant content in the book
2 Fidelity to the original text: whether the model fabricates content not mentioned in the book (regardless of whether the fabricated part happens to be correct)
3 Citation relevance: whether the original content cited by the model is relevant to the answer

3 Experimental steps
1 Modify the relevant parameters in the model configuration file, then start langchain-ChatGLM's webui.py to open the web question-answering interface
2 In the interface, import the PDF of "Introduction to Deep Learning: Theory and Implementation Based on Python" as the knowledge base
3 Ask the model the 20 questions designed above one by one, saving the model's complete answer and the original text it cites
4 Repeat steps 1-3 under each parameter configuration (a sketch for partly automating this is shown after the table below)

Test groups:

1 Effect of VECTOR_SEARCH_SCORE_THRESHOLD (together with CHUNK_SIZE and VECTOR_SEARCH_TOP_K) on dialogue quality
Group    CHUNK_SIZE    VECTOR_SEARCH_TOP_K    VECTOR_SEARCH_SCORE_THRESHOLD
1        250           5                      0
2        250           10                     0
3        500           5                      0
4        250           5                      500
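
Step 4 (repeating the experiment for every group) can be partly automated. The sketch below is an assumption rather than part of the project: it rewrites the three constants in configs/model_config.py before each run, after which webui.py is restarted and the 20 questions are asked through the interface.

# Sketch of switching between the four test groups by editing configs/model_config.py
import re
from pathlib import Path

TEST_GROUPS = [
    {"CHUNK_SIZE": 250, "VECTOR_SEARCH_TOP_K": 5,  "VECTOR_SEARCH_SCORE_THRESHOLD": 0},
    {"CHUNK_SIZE": 250, "VECTOR_SEARCH_TOP_K": 10, "VECTOR_SEARCH_SCORE_THRESHOLD": 0},
    {"CHUNK_SIZE": 500, "VECTOR_SEARCH_TOP_K": 5,  "VECTOR_SEARCH_SCORE_THRESHOLD": 0},
    {"CHUNK_SIZE": 250, "VECTOR_SEARCH_TOP_K": 5,  "VECTOR_SEARCH_SCORE_THRESHOLD": 500},
]

def apply_config(group: dict, path: str = "configs/model_config.py") -> None:
    """Overwrite the three parameters in the config file with the values of one test group."""
    text = Path(path).read_text(encoding="utf-8")
    for name, value in group.items():
        # replace lines such as "CHUNK_SIZE = 250" in place
        text = re.sub(rf"^{name}\s*=.*$", f"{name} = {value}", text, flags=re.M)
    Path(path).write_text(text, encoding="utf-8")

# Example: switch to test group 2, then restart webui.py and run the 20 questions
apply_config(TEST_GROUPS[1])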

[Figure: answer scores (image not reproduced here)]

Data analysis:
A Friedman test and a Nemenyi post-hoc test were applied (see the Excel file of dialogue scores for the detailed analysis). Across all question types, there is no significant difference between the answers of the four configurations (p-value = 0.8368).
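
For reference, the same two tests can be reproduced in Python. The rank matrix below contains placeholder values rather than the experiment's data (the real ranks live in the scoring spreadsheet), and the snippet assumes the scipy and scikit-posthocs packages are installed.

# Friedman test plus Nemenyi post-hoc test on per-question rankings (placeholder data)
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# rows = questions, columns = the rank given to configurations 1-4 on that question
ranks = np.array([
    [1, 2, 3, 4],
    [2, 1, 4, 3],
    [1, 3, 2, 4],
    [3, 1, 4, 2],
    [2, 3, 1, 4],
])

stat, p = friedmanchisquare(*(ranks[:, j] for j in range(ranks.shape[1])))
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

# pairwise p-values between the four configurations
print(sp.posthoc_nemenyi_friedman(ranks))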


For the four question categories (knowledge, content, general, and specific), there is likewise no significant difference between the answers of the four configurations. It is worth noting that for content-type questions the gap between configurations appears large, with sizable between-group gaps between configurations 1 and 2 and between configurations 2 and 3, but these differences are not statistically significant (overall p-value = 0.233; configurations 1 vs. 2, Q test p-value = 0.350; configurations 2 vs. 3, Q test p-value = 0.350).

Experimental conclusions and parameter-tuning suggestions:
Answers produced by langchain-ChatGLM together with a local knowledge base are affected by the CHUNK_SIZE, VECTOR_SEARCH_TOP_K, and VECTOR_SEARCH_SCORE_THRESHOLD parameters, but within the ranges tested these changes had no significant effect on the overall accuracy of the answers.

It is also worth noting that in test groups 2 and 3, the larger VECTOR_SEARCH_TOP_K and CHUNK_SIZE made the model's replies noticeably longer, which sharply increased GPU memory consumption on the server: in groups 1 and 4 the GPU memory typically filled up only after about 15 questions, while in groups 2 and 3 it filled up after just 1 or 2 questions. In practical applications, these two parameters should be kept at relatively low values, or simply left at their defaults of 250 and 5.

This experiment has the following limitations:
1 Only the book "Introduction to Deep Learning: Theory and Implementation Based on Python" was used as knowledge-base data. The experiment was not run on a large-scale knowledge base, nor did it test whether mixing texts on different topics in the same knowledge base interferes with the model.
2 The answers and the recalled citations were scored manually, which involves a degree of subjectivity. In addition, since a double-blind setup was not possible, my expectations for the different configurations may have biased the scoring.
3 The LLM's responses refer to the conversation history. Because restarting the model for every question would have been too time-consuming, a conversation was only restarted when GPU memory filled up, so earlier questions in the history may have influenced later answers.

Note: the complete dialogues are too long (nearly 100,000 words) to be included in this article.


Origin: blog.csdn.net/Raine_Yang/article/details/131707718