2023 Certification Cup Mathematical Modeling, Phase 2: Complete Original Paper Walkthrough of Question C

Hello everyone. Since the competition questions were released yesterday, I have finally finished a complete paper for the second stage of Certification Cup Question C.

I can vouch for the originality and quality of this paper. It is not one of those half-finished throwaways that paste in a pile of models and code with no real application just to fool people.

The complete Question C second-stage paper runs 64 pages in total: 7 pages of revision notes, 47 pages of main text, and 10 pages of appendix.
The biggest difference between the second stage of Certification Cup Question C and the first stage is that the question now gives specific categories, so machine learning can be used for classification prediction. The first step, of course, is to merge the data and extract features.

After that I first established the original classification model, which predicts the 6 levels directly. Its accuracy is low, but it yields the misjudgment and false-alarm table and lays the groundwork for what follows. I then built a two-class prediction model separating the most critical situations from all others, and the accuracy on the test set improved greatly. In the paper I devote a page to explaining the award-winning points here; those who get the paper should read it carefully.

For the second question you need to refine the classification, that is, do a cluster analysis. Its hardest part is analyzing the medical mechanism of each abnormal sub-category in order to choose the clustering indicators. I spent 3 hours analyzing the feature mechanism of each sub-category, after which the clustering itself is straightforward. At that point, Question 1 can judge the risk level of a segment and Question 2 can distinguish the refined categories within it, and the solution is complete.

My energy really is limited, and I don't have the stamina to type out a much longer written explanation, so some points may not be covered in enough detail. You can watch my video walkthrough instead:

2023 Certification Cup Mathematical Modeling Phase 2: Complete Hands-on, Nanny-Level Teaching of Question C! (bilibili)

This article is long and hard to finish in one sitting, so don't forget to like and bookmark it so you can find it again.

OK, here is my solution.

Table of contents:

Summary:

Questions in the second stage: the ECG data were carefully interpreted by experts and divided into several arrhythmia categories, and the data files have been renamed according to the interpretation results. The categories are grouped into 6 levels by degree of risk, listed from most to least dangerous:
1. Life-threatening arrhythmia that requires immediate rescue: ventricular flutter (VFL); ventricular fibrillation (VF).
2. Life-threatening arrhythmia: Torsades de Pointes (VTTdP).
3. Life-threatening ventricular arrhythmias: high-frequency ventricular tachycardia (VTHR).
4. Potentially dangerous ventricular arrhythmias: low-frequency ventricular tachycardia (VTLR); ventricular premature-beat bigeminy (B); high-grade ventricular ectopy (HGEA); ventricular escape rhythm (VER).
5. Supraventricular arrhythmias: atrial fibrillation (AFIB); supraventricular tachycardia (SVTA); sinus bradycardia (SBR); first-degree heart block (BI); nodal rhythm (NOD).
6. No significant risk and normal sinus rhythm: sinus rhythm with bundle branch block (BBB); sinus rhythm with a single extrasystole (Ne); normal sinus rhythm (N).

Question 1:

We hope to construct a classification algorithm for the degree of danger and deploy it on the ECG machine, so that it can raise an alarm whenever it encounters a life-threatening situation. The decision time for critical situations may not exceed 2 seconds, so only the data in the data set may be used for validation (each file in the data set was produced from an ECG segment 2 seconds in duration), and when extracting time-domain features, operations such as period extension of the signal in the time domain are not allowed. The decision time may, of course, be shorter than 2 seconds.

Understating the threat level by 1 level (e.g., judging a level 3 as level 4) is called a minor misjudgment, and understating it by 2 levels or more (e.g., judging a level 3 as level 5) is a serious misjudgment. Overstating the threat by 1 level is a minor false alarm, and overstating by 2 levels or more is a severe false alarm.

At a minimum we require that, for the most critical situations (levels 1 and 2), neither the sensitivity nor the specificity of the judgment falls below the maximum value computable from this data set (that is, they may differ from 1 only where this data set makes 1 unattainable). For the other risk levels, you and your team are asked to propose reasonable performance requirements for the algorithm based on medical practice (you may use the concepts defined above), then establish an effective mathematical model and construct a classification algorithm that meets those requirements.

First merge the data:

The files of each category should be merged; take the first category, for example:
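As a sketch of the merging step in Python, assuming each 2-second segment is a single-column text file whose name begins with its category label (the file layout and names here are my assumption, not taken from the data set):

```python
import glob
import os
import tempfile

import numpy as np

def merge_category(folder, prefix):
    # Stack every segment file whose name starts with the category label
    # into one (n_segments, n_samples) array.
    files = sorted(glob.glob(os.path.join(folder, prefix + "*.txt")))
    return np.vstack([np.loadtxt(f) for f in files])

# Tiny demo with synthetic files (real files are named after the expert
# interpretation, e.g. something like "VF_001.txt" -- naming is assumed).
demo_dir = tempfile.mkdtemp()
for i in range(3):
    np.savetxt(os.path.join(demo_dir, f"VF_{i}.txt"), np.random.randn(360))
merged_vf = merge_category(demo_dir, "VF")
print(merged_vf.shape)  # (3, 360)
```

The merged array then feeds directly into the feature-extraction step.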

Extract features:
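The paper does not list its exact feature set here, so the sketch below computes a few common time-domain descriptors of a 2-second segment (illustrative choices only; note that none of them require extending the signal's period, per the problem's constraint):

```python
import numpy as np

def time_domain_features(seg):
    # A handful of time-domain descriptors of one 2-second segment.
    # The actual feature set used in the paper is not specified; these
    # are illustrative stand-ins.
    seg = np.asarray(seg, dtype=float)
    diff = np.diff(seg)
    return np.array([
        seg.mean(),                                      # mean level
        seg.std(),                                       # overall variability
        np.sqrt(np.mean(seg ** 2)),                      # RMS energy
        np.abs(seg).max(),                               # peak amplitude
        np.mean(np.abs(diff)),                           # mean absolute slope
        np.mean(np.sign(seg[:-1]) != np.sign(seg[1:])),  # zero-crossing rate
    ])

# Demo on a synthetic sine segment.
feats = time_domain_features(np.sin(np.linspace(0, 4 * np.pi, 720)))
print(feats.shape)  # (6,)
```

One feature vector per segment, stacked over all merged segments, gives the design matrix for the classifiers below.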

After extraction, the original classification model of 6 categories is first established:
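The paper trains a proper machine-learning classifier on the extracted features; as a stand-in to make the 6-level pipeline concrete, here is a minimal nearest-centroid sketch on synthetic features (everything in it is illustrative, nothing comes from the real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 6 risk levels, 20 four-dimensional feature
# vectors per level, centred at the level number so they are separable.
X_train = np.vstack([rng.normal(loc=lvl, scale=0.3, size=(20, 4))
                     for lvl in range(1, 7)])
y_train = np.repeat(np.arange(1, 7), 20)

# One centroid per risk level in feature space.
centroids = np.vstack([X_train[y_train == lvl].mean(axis=0)
                       for lvl in range(1, 7)])

def predict_level(x):
    # Assign the risk level whose feature centroid is closest.
    return 1 + int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

acc = np.mean([predict_level(x) == y for x, y in zip(X_train, y_train)])
print(acc)
```

Any multi-class learner slots into the same scaffold; only `predict_level` changes.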

Actual classification predictions:

Prediction Accuracy:

Predicted classification results:

From the computation we obtain the misjudgment and false-alarm table:
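Under the problem's definitions (level numbers grow as risk falls, so a higher predicted number understates the threat), the four error counts can be tallied directly from the true and predicted levels. A minimal Python sketch with made-up labels:

```python
import numpy as np

def misjudgment_table(y_true, y_pred):
    # Levels run 1 (most dangerous) .. 6 (least dangerous). Predicting a
    # HIGHER number understates the threat (misjudgment); a LOWER number
    # overstates it (false alarm), per the problem statement.
    d = np.asarray(y_pred) - np.asarray(y_true)
    return {
        "minor_misjudgment":   int(np.sum(d == 1)),
        "serious_misjudgment": int(np.sum(d >= 2)),
        "minor_false_alarm":   int(np.sum(d == -1)),
        "severe_false_alarm":  int(np.sum(d <= -2)),
    }

# Made-up labels: one of each error type plus one correct judgment.
table = misjudgment_table([3, 3, 3, 2, 6], [4, 5, 3, 1, 4])
print(table)
```

Applied to the model's predictions, this yields the misjudgment/false-alarm table directly.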

The most critical situation to be judged in practice:

Prediction results:

On the training set both reach 1, which is already the maximum sensitivity and specificity that this data set can yield.
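To make the metric concrete, here is a minimal Python sketch of how sensitivity and specificity for the pooled critical class (levels 1 and 2) can be checked; the label arrays are toy values, not the competition data:

```python
import numpy as np

def sens_spec(y_true, y_pred, critical_levels=(1, 2)):
    # "Positive" = the most critical situations (levels 1 and 2 pooled).
    t = np.isin(y_true, critical_levels)
    p = np.isin(y_pred, critical_levels)
    sensitivity = np.sum(t & p) / np.sum(t)     # critical cases caught
    specificity = np.sum(~t & ~p) / np.sum(~t)  # non-critical correctly rejected
    return sensitivity, specificity

# Toy labels where every judgment is correct.
se, sp = sens_spec([1, 2, 5, 6, 3], [1, 2, 5, 6, 3])
print(se, sp)  # 1.0 1.0
```

The same function, run on the training-set predictions, is what verifies that both values reach 1.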

The test set shows a large improvement over the original classification model.

In the paper I wrote a page describing the award-winning points:

Take a look at the classification prediction results table:

The test set contains 305 samples in total:

For the situations at the other risk levels:

The same steps, first divide the categories:

Then ordinary machine-learning classification is enough; let's look at the prediction results:

The accuracy is also greatly improved over the original classification model. That completes the first question.

Question 2:

We hope to further distinguish the different types of arrhythmia on the basis of Question 1. The final discriminant algorithm should satisfy the requirements of Question 1 while also producing the specific classification within each risk level. Please establish a reasonable mathematical model with your team, construct a classification algorithm for the different types of arrhythmia, and evaluate the algorithm's performance in detail.

The most difficult part of this question is deciding which features to use to separate these sub-categories:

That is, for these sub-categories, how should the features be chosen and ranked?

For example, if you want to separate them by the mean value of each segment, you must first analyze, from the perspective of the medical mechanism, whether the sub-categories actually differ in mean value at all, and if so, how: which sub-category's segment data has the larger mean?

The difficulty therefore lies in collecting and reading a large amount of material on the medical mechanism of each sub-category. I spent more than 3 hours studying the electrocardiogram of each sub-category and working out the medical-mechanism analysis for each one:
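As one concrete example of turning a medical-mechanism argument into a feature: VTHR and VTLR differ mainly in ventricular rate, so an estimated heart rate separates them. The sketch below uses a deliberately crude peak counter on a synthetic spike train (a real QRS detector such as Pan-Tompkins would be used in practice; the sampling rate and threshold here are assumptions):

```python
import numpy as np

def heart_rate_bpm(seg, fs):
    # Crude R-peak count: local maxima above an adaptive threshold.
    # Only meant to illustrate converting "VTHR beats faster than VTLR"
    # into a numeric feature, not to be a robust detector.
    seg = np.asarray(seg, dtype=float)
    thr = seg.mean() + 2 * seg.std()
    peaks = [i for i in range(1, len(seg) - 1)
             if seg[i] > thr and seg[i] >= seg[i - 1] and seg[i] > seg[i + 1]]
    duration_s = len(seg) / fs
    return 60 * len(peaks) / duration_s

# Synthetic 2-second "ECG" at an assumed 360 Hz: 4 sharp spikes -> 120 bpm.
fs = 360
seg = np.zeros(2 * fs)
seg[[90, 270, 450, 630]] = 5.0
print(heart_rate_bpm(seg, fs))  # 120.0
```

The same reasoning applies to every other mechanism-derived feature: first argue which sub-categories should differ, then compute the quantity that exposes the difference.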

Then there is the cluster analysis:
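A minimal sketch of the clustering step, using plain k-means (Lloyd's algorithm) on synthetic two-dimensional features; the paper's actual clustering indicators and method are not restated here, so treat this as a generic stand-in:

```python
import numpy as np

def kmeans(X, init_idx, iters=20):
    # Plain Lloyd's algorithm with a fixed initialisation for
    # reproducibility: assign points to the nearest centre, then move
    # each centre to the mean of its points, and repeat.
    centers = X[list(init_idx)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.vstack([X[labels == j].mean(axis=0)
                             for j in range(len(centers))])
    return labels

rng = np.random.default_rng(1)
# Two well-separated synthetic "sub-categories" within one risk level.
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2))])
labels = kmeans(X, init_idx=(0, len(X) - 1))
print(labels[0], labels[-1])  # 0 1
```

In the real model the clustering runs on the mechanism-motivated features, one clustering per risk level, so each cluster corresponds to a medical sub-category.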

At this point, after importing any segment of data, we can not only obtain the most-critical-versus-other discrimination through the machine-learning algorithm from Question 1, but also, after that discrimination, use the cluster-analysis model we established to identify the specific sub-category it belongs to.

OK, that wraps it up.

Finally, some template code is attached. Note that this is not the code I actually used in the solution:

function [tree, discrete_dim] = train_C4_5(S, inc_node, Nu, discrete_dim)

    % Classify using Quinlan's C4.5 algorithm
    % Inputs:
    %   S            - training data; each row is a sample whose last column
    %                  is its label (internally transposed so that each column
    %                  is a sample and each row a feature)
    %   inc_node     - percentage of incorrectly assigned samples allowed at a
    %                  node; recursion stops once a node holds fewer samples
    %                  than this threshold, guarding against overfitting
    %                  (5-10 works well: too large lowers classification
    %                  accuracy, too small may overfit)
    %   Nu           - threshold for deciding whether a feature is discrete or
    %                  continuous (conventionally set to 10): a feature with
    %                  fewer than Nu distinct values is treated as discrete,
    %                  all other features as continuous
    % Outputs:
    %   tree         - the trained decision tree
    %   discrete_dim - per-feature flag: 0 for a continuous feature, otherwise
    %                  the number of distinct values of that discrete feature

    train_patterns = S(:, 1:end-1)';   % one column per sample, one row per feature
    train_targets  = S(:, end)';       % 1 x M vector of sample labels
    [Ni, M]  = size(train_patterns);   % Ni = number of features, M = number of samples
    inc_node = inc_node*M/100;         % convert the percentage into a sample count
    if isempty(discrete_dim)
        % Determine which input features are discrete, and discretise the
        % corresponding dimension of the test patterns.
        discrete_dim = zeros(1, Ni);   % all features start as continuous (0); a
                                       % discrete feature's entry is later set to
                                       % its number of distinct values
        for i = 1:Ni                               % loop over the features
            Ub = unique(train_patterns(i,:));      % distinct values of feature i
            Nb = length(Ub);                       % number of distinct values
            if (Nb <= Nu)
                % Fewer than Nu distinct values: treat feature i as discrete.
                discrete_dim(i) = Nb;
                % The commented-out lines below snap each test sample's value of
                % a discrete feature to the nearest distinct training value:
                % dist = abs(ones(Nb,1)*test_patterns(i,:) - Ub'*ones(1, size(test_patterns,2)));
                % [m, in] = min(dist);           % nearest distinct value per test sample
                % test_patterns(i,:) = Ub(in);   % replace with the snapped values
            end
        end
    end

All the data tables and solution results I used above are as follows:

OK, let's stop here; I'm quite tired, and the explanation may not be detailed enough. For the detailed explanation video, as well as all the data tables above and the complete finished paper, click my personal card below to view them ↓:


Origin blog.csdn.net/smppbzyc/article/details/130765497