[2023 Huazhong Cup Mathematical Modeling] Question B: Similarity Measurement and Difficulty Assessment of Elementary School Math Word Problems: Detailed Modeling Scheme and Implementation Code


Update time: 2023-5-1 14:00

1 Problem Statement

Question B: Similarity Measurement and Difficulty Evaluation of Elementary School Math Word Problems

A MOOC online education platform hopes to provide personalized teaching and enable users to study independently. While a user is studying, the system randomly selects from the question bank a number of in-class test questions synchronized with the sample questions, records and analyzes the student's learning and answering data, and automatically generates homework (or practice) questions after class. In addition, the system can periodically revisit the content involved in the student's error-prone questions and automatically recommend other questions of similar type and difficulty for extended practice. To realize these functions, how to measure the similarity between questions and how to evaluate the difficulty of questions are the key problems this product must solve. Taking elementary school math word problems as an example, there are two main bases for measuring the similarity between questions:

  1. The question stem text. This method can generally only find questions whose stem text is similar. However, some questions have similar stem texts but different key words, so their meanings differ greatly; other questions have unrelated backgrounds and almost entirely different stem texts, yet their solution ideas and methods are exactly the same. Therefore, this method has limited effect.
  2. Knowledge points and other information labelled on each question in advance. The recommendation quality of this method depends on how the knowledge points are divided and on their granularity. If the division is too coarse, the recommended questions may differ too much from the sample questions or from the user's error-prone questions; if it is too fine, the recommendations may be too uniform. In either case, the purpose of extended practice cannot really be achieved.

There are two common ways to assess the difficulty of questions:

  1. Determined by the type of exam. For example, the questions on a math competition are generally harder than those on a typical elementary school final exam.

  2. Teachers make subjective judgments based on experience.

The above methods of judging question similarity and assessing difficulty have obvious limitations. The company has hired your team to try to solve these problems. Taking elementary school math word problems as an example, the specific tasks are as follows:

  1. Design a measure of the similarity between two elementary school math word problems.

  2. Establish a mathematical model for assessing the difficulty of elementary school mathematics word problems.

  3. Attachment 1 is a sample question bank containing 100 word problems. Classify the questions in Attachment 1 by similarity or by difficulty (a question is not restricted to belonging to only one category). If a question has no similar questions, it may be placed in a category of its own. Evaluate the complexity of the algorithm and whether it can be applied to a larger question bank.

  4. Attachment 2 contains 10 questions. Use the above models or methods to analyze the difficulty of these questions, and for each of them find the most similar question(s) in Attachment 1 (if there is no similar question, write "none"). Evaluate the complexity of the algorithm and whether it can be applied to a larger question bank.

Note 1: The elementary school math word problems referred to in this problem are questions that focus on the four arithmetic operations and have a certain practical background.
Note 2: Another common practice in teaching is to define the difficulty of a question by its actual scoring rate. However, the actual scoring rate depends not only on how well students have learned before the test, but also on many "non-technical" factors, such as the wording, sentence patterns, and tone used in the question, and even the position of the question in the test paper; moreover, the actual scoring rate can only be obtained by collecting real test-paper data, which is very labor-intensive. Therefore, this problem is concerned with the "technical" difficulty of a question, regardless of the actual scoring rate.

Attachment description:

  1. Attachment 1 is a CSV file with no header row, containing 2 columns and 100 rows. The first column is the question number, in the form "P001", "P002", and so on. The second column is the question text.

  2. Attachment 2 is a CSV file with no header row, containing 2 columns and 10 rows. The first column is the question number, in the form "Q001", "Q002", and so on. The second column is the question text.

Sample content of Attachment 1.csv:

P001 Distribute a batch of candy to the children in the kindergarten class. If each person distributes 3 candies, there will be 21 remaining; if each person distributes 4 candies, there will be 6 remaining. How many children are there in the kindergarten class? How many candies are there in this batch?
P002 The two sisters go to school from home, the elder sister walks 50 meters per minute, and the younger sister walks 45 meters per minute. If the younger sister walks 5 minutes earlier than the older sister, the two sisters can arrive at school at the same time. Q: How far is it from home to school?
P003 A steel plant uses two trucks, A and B, to haul ore from a mine 90 kilometers away from the plant. Truck A starts from the mine and truck B starts from the plant at the same time, traveling toward each other at 40 and 50 kilometers per hour respectively; each turns back immediately on reaching its destination, back and forth repeatedly. If loading and unloading time is ignored and neither truck ever stops, how many kilometers from the mine are the two trucks when they meet for the third time?

Sample content of Attachment 2.csv:

Q001 A passenger train is 150 meters long and travels 30 meters per second; a freight train is 200 meters long and travels 20 meters per second. The two trains travel toward each other. As they pass each other, for how long does the passenger-train driver see the freight train going by? For how long does the freight-train driver see the passenger train going by?
Q002 A group of travelers decides to take several buses so that each bus carries the same number of people. At first, 22 people boarded each bus and one person was left without a seat; if one bus drove away empty, all the travelers could be divided equally among the remaining buses. Given that each bus holds at most 32 people, how many buses are there? How many travelers are there?

2 Mathematical model

2.1 Question 1

According to the problem description, we need to design methods to measure the similarity between questions and to evaluate the difficulty of questions.

First, we can convert each question into a vector containing its key information (the numbers, keywords, etc. appearing in the question), and then use the cosine similarity to measure the similarity between two such vectors. Specifically, let the vectors corresponding to questions $a$ and $b$ be $\boldsymbol{a}=(a_1,a_2,\ldots,a_n)$ and $\boldsymbol{b}=(b_1,b_2,\ldots,b_n)$; then the similarity between the two questions can be expressed as

$$\mathrm{sim}(\boldsymbol{a},\boldsymbol{b})=\frac{\boldsymbol{a}\cdot\boldsymbol{b}}{\|\boldsymbol{a}\|\,\|\boldsymbol{b}\|},$$

where $\boldsymbol{a}\cdot\boldsymbol{b}$ is the inner product of $\boldsymbol{a}$ and $\boldsymbol{b}$, and $\|\boldsymbol{a}\|$ and $\|\boldsymbol{b}\|$ are their norms.

However, directly using all the key information of a question as a vector may introduce errors into the similarity calculation. Therefore, we need to screen and weight the key information of each question to improve the accuracy of the similarity measure. The specific implementation steps are as follows:

... omitted, please download the full document
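
Since the concrete screening and weighting steps are omitted here, the following is only a minimal sketch of the general idea for Question 1: segment the question text with jieba, give numbers and a few operation/relationship cue words larger weights, and compare the resulting vectors with cosine similarity. The cue-word list, the `<NUM>` token, and the weights 1/2/3 are illustrative assumptions for this sketch, not the scheme in the full document.

import jieba
import numpy as np
from collections import Counter

# Illustrative relationship/operation cue words (assumed for this sketch only)
OP_WORDS = {"一共", "剩下", "相遇", "平均", "每", "倍"}

def question_to_weighted_bag(text):
    """Segment the question text and return a weighted bag of keywords."""
    bag = Counter()
    for word in jieba.lcut(text):
        word = word.strip()
        if not word:
            continue
        if word.isdigit():
            bag["<NUM>"] += 2.0   # collapse all numbers into one token with a higher weight
        elif word in OP_WORDS:
            bag[word] += 3.0      # relationship/operation keywords get the largest weight
        else:
            bag[word] += 1.0
    return bag

def similarity(text_a, text_b):
    """Cosine similarity between the weighted keyword vectors of two questions."""
    a, b = question_to_weighted_bag(text_a), question_to_weighted_bag(text_b)
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = np.sqrt(sum(v * v for v in a.values()))
    norm_b = np.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

Two questions can then be compared with similarity(text_a, text_b), which returns a value between 0 and 1.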

2.2 Question 2

Using the K-Means clustering algorithm, the questions are divided into three categories: difficult, medium, and easy. We cluster first, and then analyze the data in each cluster to decide which difficulty level it corresponds to.
The TF-IDF model represents each question description as a vector whose elements are the weights of the words in that description, so that the similarity of different descriptions can be compared. The K-Means clustering model then clusters the question description vectors and groups similar questions into the same category; the number of clusters is set to the number of categories required, and each question is assigned to one of them.
This approach can be expressed as the following mathematical model:
Let the question set be $Q$, containing $N$ questions.
TF-IDF model:
Define the term-frequency matrix $\mathbf{X} \in \mathbb{R}^{N \times M}$, whose element $x_{i,j}$ in row $i$, column $j$ is the frequency of word $j$ in question $i$.
Define the inverse document frequency (IDF) vector $\mathbf{idf} \in \mathbb{R}^{M}$, whose $j$-th element is $idf_j = \log \frac{N}{df_j}$, where $df_j$ is the number of questions in which word $j$ appears.
Define the TF-IDF matrix $\mathbf{W} \in \mathbb{R}^{N \times M}$, whose element $w_{i,j}$ in row $i$, column $j$ is the weight of word $j$ in question $i$ given by the TF-IDF model, i.e. $w_{i,j} = x_{i,j} \times idf_j$.
K-Means clustering model:
Define the clustering result vector $\mathbf{c} \in \mathbb{R}^{N}$, whose $i$-th element $c_i$ is the category number assigned to question $i$, with $c_i \in [1, k]$, where $k$ is the number of clusters.
Define the cluster-centroid matrix $\boldsymbol{\mu} \in \mathbb{R}^{k \times M}$, whose $i$-th row $\mu_i$ is the centroid of class $i$, i.e. the mean of the description vectors of the questions belonging to class $i$.
Define the sample distance metric $\mathrm{dist}(x_i, \mu_j)$, where $x_i$ is the description vector of question $i$ and $\mu_j$ is the $j$-th cluster centroid; Euclidean distance or cosine similarity can be used.
Define the loss function $J(\{\boldsymbol{\mu}_j\}_{1}^{k}, \mathbf{c}, \mathbf{W})$, which measures the quality of the clustering; the sum of squared errors (SSE) or another clustering index can be used.
The goal of the clustering model is to minimize the loss $J$ and obtain the optimal clustering result $\mathbf{c}$ and cluster centroids $\boldsymbol{\mu}$.
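
For instance, if the squared Euclidean distance is used as the sample distance and SSE as the loss, the objective can be written explicitly as

$$J(\{\boldsymbol{\mu}_j\}_{1}^{k}, \mathbf{c}, \mathbf{W}) = \sum_{i=1}^{N} \left\| \mathbf{w}_i - \boldsymbol{\mu}_{c_i} \right\|^2,$$

where $\mathbf{w}_i$ is the $i$-th row of $\mathbf{W}$ (the TF-IDF vector of question $i$) and $\boldsymbol{\mu}_{c_i}$ is the centroid of the cluster to which question $i$ is assigned.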

The evaluation indicators for K-Means clustering quality generally include the Silhouette Coefficient, the Calinski-Harabasz Index, and the Davies-Bouldin Index. Among them, the silhouette coefficient is the most commonly used; its formula is:

$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}$$

where $a(i)$ is the average distance between sample $i$ and the other samples in the same cluster, and $b(i)$ is the average distance between sample $i$ and all the samples in the nearest other cluster.

The silhouette coefficient $s$ takes values in $[-1, 1]$; the larger the value, the better the clustering. If a sample's $s$ value is negative, it indicates that the sample should probably be assigned to another cluster.
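
As a small illustration of how the per-sample silhouette values can be used (assuming a TF-IDF matrix `features` and a fitted model `kmeans` as in the code under 2.3 Question 3 below), questions with a negative s(i) can be flagged as candidates for reassignment or for a separate category:

from sklearn.metrics import silhouette_score, silhouette_samples

# Assumes `features` (TF-IDF matrix) and `kmeans` (fitted KMeans) as in the code in Section 2.3 below
overall = silhouette_score(features, kmeans.labels_)
per_sample = silhouette_samples(features, kmeans.labels_)

# Samples with a negative silhouette value are probably assigned to the wrong cluster
misplaced = (per_sample < 0).nonzero()[0]
print(f'Overall silhouette: {overall:.3f}; questions with negative s(i): {len(misplaced)}')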

2.3 Question 3

Use the clustering algorithm from Question 2, or another clustering algorithm. Several clustering algorithms do not require the number of clusters to be specified in advance:

  1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  2. OPTICS (Ordering Points To Identify the Clustering Structure)
  3. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)

Among them, DBSCAN is the most commonly used; its main idea is to define clusters by local density, which makes it insensitive to noisy data points.

Note that outliers have a strong influence on the K-Means algorithm. One option is to remove the outliers first and place them in a separate category, or to use the DBSCAN algorithm instead (a DBSCAN sketch is given after the clustering results below).

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from pyecharts import Scatter

# Read the CSV file (no header row: column 0 = question ID, column 1 = question text)
df = pd.read_csv('附件1.csv', header=None)
# Use the question descriptions as features
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df[1])
# Train the clustering model
kmeans = KMeans(n_clusters=3, random_state=0).fit(features)

# Evaluate the clustering quality
score = silhouette_score(features, kmeans.labels_)
print('Clustering evaluation (Silhouette Score):', score)

# Visualize the clustering results
# ... omitted, please download the full document

# Print the clustering results
for i in range(len(kmeans.labels_)):
    print('Question ID: {}, question text: {}'.format(df[0][i], df[1][i]))
    print('Cluster label: {}'.format(kmeans.labels_[i]))
    print('---------------------------')

(Figure: printed clustering results for Attachment 1, omitted)

It can be seen that the clustering effect is not particularly ideal, and further improvement is needed.
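
One direction for improvement, as noted above, is to switch to DBSCAN so that isolated questions are separated out automatically. Below is a minimal sketch that reuses the TF-IDF matrix `features` from the code above; the `eps` and `min_samples` values are illustrative and would need tuning on the actual question bank.

from sklearn.cluster import DBSCAN

# Cluster the TF-IDF vectors with DBSCAN using cosine distance
# (eps and min_samples are illustrative values, not tuned results)
db_labels = DBSCAN(eps=0.6, min_samples=2, metric='cosine').fit_predict(features)

# Label -1 marks noise points, i.e. questions with no sufficiently similar neighbours;
# each of these can be placed in its own single-question category
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = list(db_labels).count(-1)
print('DBSCAN clusters:', n_clusters, ', isolated questions:', n_noise)

The noise label -1 directly gives the "separate category for questions with no similar counterpart" allowed by the task statement.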


2.4 Question 4

Use the similarity calculation method from Question 1: traverse Attachment 1 and Attachment 2 and compute the similarity for each pair of questions in turn.

import pandas as pd
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Read the CSV file (no header row: column 0 = question ID, column 1 = question text)
data = pd.read_csv('附件1.csv', header=None)
questions = data[1]

# Define a function that converts a question into a list of keywords, weights the
# different keywords, and returns a space-separated keyword string:

def process_question(question):
    # ... omitted, please download the full document
    return ' '.join(key_words)  # return a space-separated keyword string

# Convert all questions into keyword vectors:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([process_question(question) for question in questions])
# Finally, compute the similarity between every pair of questions and print the results:
for i in range(len(questions)):
    for j in range(i + 1, len(questions)):
        similarity = cosine_similarity(vectors[i], vectors[j])[0][0]
        print(f"Similarity between question {i+1} and question {j+1}: {similarity:.4f}")


3 Full version download

Send me a private message, or open the following link in a browser:

betterbench.top/#/68/detail


Source: blog.csdn.net/weixin_43935696/article/details/130455693