2022 National Mathematical Modeling Contest, Problem C Review: Composition Analysis and Identification of Ancient Glass Products

Last month I took part in the National Mathematical Modeling Contest and worked on Problem C. I was lucky enough to win a provincial first prize and have the paper put forward for a national award. I have since tidied up the programs and reviewed the work. Here I briefly present the problem-solving ideas and process for the sub-questions of Questions 1, 2 and 3, along with the Python programs.

Read and display data:

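# Imports used throughout this post
import random, re
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
# Read the data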
data1 = pd.read_excel('附件.xlsx',sheet_name='表单1')
data2 = pd.read_excel('附件.xlsx',sheet_name='表单2')
data3 = pd.read_excel('附件.xlsx',sheet_name='表单3')

data1: 

 data2:

data3:

Preprocessing:

1) Fill in the missing color values in Form 1 and the missing chemical-composition values in Form 2

Group by the three features other than color (type, decoration, and surface weathering) and take the mode of the color within each group to fill the missing color of the corresponding samples.

def get_yanse(x):
    return x.value_counts().index[0]
display(data1.groupby(['类型','表面风化','纹饰'])['颜色'].agg(get_yanse,))
"""
类型  表面风化  纹饰
铅钡  无风化   A     浅蓝
          C      紫
    风化    A     浅蓝
          C     浅蓝
高钾  无风化   A     蓝绿
          C     浅蓝
    风化    B     蓝绿
Name: 颜色, dtype: object
"""

data1[data1['颜色'].isna()]
"""
文物编号	纹饰	类型	颜色	表面风化
18	19	A	铅钡	NaN	风化
39	40	C	铅钡	NaN	风化
47	48	A	铅钡	NaN	风化
57	58	C	铅钡	NaN	风化
"""

# The mode color of every group with missing data is 浅蓝 (light blue), so fill with 浅蓝
data1['颜色'] = data1['颜色'].fillna('浅蓝')

As the problem statement specifies, fill the missing chemical-composition values in Form 2 with 0:

data2.fillna(0,inplace=True)
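The group-wise mode fill described above can also be done in one step instead of inspecting the groups by hand. The following is a minimal sketch on a hypothetical toy table (the English column names and values are assumptions, not the contest data); `group_mode` is an illustrative helper, not part of the original program.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (assumed names and values, not the contest data)
toy = pd.DataFrame({
    "type":  ["PbBa", "PbBa", "PbBa", "K", "K", "K"],
    "state": ["weathered"] * 3 + ["fresh"] * 3,
    "color": ["light blue", "light blue", np.nan, "green", np.nan, "green"],
})

def group_mode(s):
    """Most frequent non-missing value of a group (NaN if the group has none)."""
    counts = s.value_counts()          # NaN is excluded by default
    return counts.index[0] if len(counts) else np.nan

# transform broadcasts each group's mode back onto the rows of that group
mode_per_row = toy.groupby(["type", "state"])["color"].transform(group_mode)
toy["color"] = toy["color"].fillna(mode_per_row)
print(toy)
```

Filling per group keeps the logic correct even if different groups have different modes, which a single constant fill would not.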

2) Extract the relic number from each sampling-point label (文物采样点) in Form 2, and join Form 2 to Form 1 on that number

# Extract the relic number from the sampling-point label
def get_number(x):
    # The label starts with the relic number; a leading zero is dropped by int()
    return int(re.match(r'\d+', x).group())
data2['文物编号'] = data2['文物采样点'].apply(get_number)

# Join Form 1 and Form 2 on the relic number
data_merge = pd.merge(data1,data2,on = '文物编号')
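To make the extraction and join concrete, here is a small self-contained sketch. The rows are hypothetical, written in the style of the two forms, and `relic_number` is an illustrative name. The `validate` argument is an optional safeguard that is not in the original program.

```python
import re
import pandas as pd

def relic_number(label):
    """Leading digits of a sampling-point label as an integer."""
    return int(re.match(r"\d+", label).group())

# Hypothetical rows in the style of Form 1 and Form 2
form1 = pd.DataFrame({"文物编号": [3, 8, 49], "类型": ["高钾", "铅钡", "铅钡"]})
form2 = pd.DataFrame({"文物采样点": ["03部位1", "03部位2", "08严重风化点", "49未风化点"]})

form2["文物编号"] = form2["文物采样点"].apply(relic_number)
# validate raises an error if a relic number is duplicated in form1
merged = pd.merge(form1, form2, on="文物编号", validate="one_to_many")
print(merged)
```

One relic can have several sampling points, so the join is one-to-many and the merged table has one row per sampling point.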

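3) Remove invalid rows, i.e. those whose chemical components sum to a value outside 85%~105%

View invalid data:

# Assumption: zong_chengfen (total composition) is not defined in the original post;
# here it is taken as the row sum of the component columns (column 6 onward)
zong_chengfen = data_merge.iloc[:, 6:].sum(axis=1)
data_merge[~((zong_chengfen >= 85) & (zong_chengfen <= 105))]

# Keep the valid rows
data_merge = data_merge[(zong_chengfen >= 85) & (zong_chengfen <= 105)]
# data_merge.to_excel('问题一连接两表并处理后数据.xlsx')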
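As a quick illustration of the validity rule, the sketch below applies the same 85%~105% check to three hypothetical rows (the column names and values are made up for the example).

```python
import pandas as pd

# Hypothetical compositions in percent
toy = pd.DataFrame({
    "SiO2": [70.0, 30.0, 60.0],
    "PbO":  [20.0, 40.0, 50.0],
    "BaO":  [5.0, 10.0, 0.0],
})
total = toy.sum(axis=1)           # row totals: 95, 80, 110
valid = total.between(85, 105)    # inclusive on both ends by default
print(toy[valid])
```

Only the first row is kept: the second sums to less than 85 and the third to more than 105.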

4) For the merged data, derive the weathering type of each sampling point from the description in the problem attachment, the relic number, and the sampling-point label.

# Get the weathering type of each sampling point
# Step 1
def get_fh(x):
    # Runs of non-digit characters in the label; the first one is the text after the relic number
    list_ = re.findall(r'[^\d*]+', x)
    return list_[0] if list_ else np.nan
data_merge['采样点风化类型'] = data_merge['文物采样点'].apply(get_fh)
data_merge['采样点风化类型'].value_counts()
"""
部位       10
未风化点     10
严重风化点     3
Name: 采样点风化类型, dtype: int64
"""

# Step 2: labels that only say 部位 (position) fall back to the relic's surface weathering
data_merge.replace({'部位':np.nan},inplace=True)
data_merge['采样点风化类型'] = data_merge['采样点风化类型'].fillna(data_merge['表面风化'])
data_merge['采样点风化类型'].value_counts()
"""
风化       29
无风化      25
未风化点     10
严重风化点     3
Name: 采样点风化类型, dtype: int64
"""

# Step 3: unify the label names
data_merge['采样点风化类型'] = data_merge['采样点风化类型'].replace({'无风化':'未风化点', '风化':'风化点'})
data_merge['采样点风化类型'].value_counts()
"""
未风化点     35
风化点      29
严重风化点     3
Name: 采样点风化类型, dtype: int64
"""

Display:

with sns.color_palette('rainbow'):
    # count_pieplot is the author's own plotting wrapper; the source can be found in the 'seaborn' wrapper post on the author's blog
    count_pieplot(data_merge,1,2,vars = ['采样点风化类型'],hue = '类型',show_value=True)

Question 1:

1. Taking the glass type into account, analyze the statistical patterns in the chemical composition of relic samples with and without surface weathering

For high-potassium and lead-barium glass separately, we compare the chemical composition of unweathered and weathered samples to find the statistical patterns of weathering. Line graphs and box plots are used for the analysis. Box plots show the range over which each component is distributed and also make the central value (the median), minimum and maximum easy to read off.

Line graph analysis: take the mean of each chemical component for weathered and unweathered samples, draw line graphs, and compare the two.

# Extract the chemical formula in parentheses from a column name
def get_huaxue(x):
    return re.findall(r'\((.*)\)',x)[0]

plt.figure(figsize=(10, 8))
plt.subplot(211)
plt.plot(range(14),data_merge.query("采样点风化类型 == '风化点' & 类型 == '高钾'").iloc[:,6:-1].mean(),label = '风化')
plt.plot(range(14),data_merge.query("采样点风化类型 == '未风化点' & 类型 == '高钾'").iloc[:,6:-1].mean(),label = '未风化')
plt.legend()
plt.xticks(range(14),list(map(get_huaxue, data_merge.iloc[:,6:-1].columns)),fontsize = 10)
plt.title('高钾风化与未风化的各成分均值统计对比图',fontsize = 15)
# plt.savefig('高钾风化与未风化的各成分均值统计对比图')

plt.subplot(212)
plt.plot(range(14),data_merge.query("采样点风化类型 == '风化点' & 类型 == '铅钡'").iloc[:,6:-1].mean(),label = '风化')
plt.plot(range(14),data_merge.query("采样点风化类型 == '严重风化点' & 类型 == '铅钡'").iloc[:,6:-1].mean(),label = '严重风化')
plt.plot(range(14),data_merge.query("采样点风化类型 == '未风化点' & 类型 == '铅钡'").iloc[:,6:-1].mean(),label = '未风化')

plt.legend()
plt.xticks(range(14),list(map(get_huaxue, data_merge.iloc[:,6:-1].columns)),fontsize = 10)
plt.title('铅钡风化与未风化的各成分均值统计对比图',fontsize = 15)
# plt.savefig('风化与未风化的各成分均值统计对比图')
plt.subplots_adjust(hspace=0.35)
plt.savefig('风化与未风化的各成分均值统计对比图')

Box plot analysis: high-potassium and lead-barium glass are taken out and plotted separately.

def get_colors(color_style):
    cnames = sns.xkcd_rgb
    if color_style =='light':
        colors = list(filter(lambda x:x[:5]=='light',cnames.keys()))
    elif color_style =='dark':
        colors = list(filter(lambda x:x[:4]=='dark',cnames.keys()))
    elif color_style =='all':
        colors = cnames.keys()
    colors = list(map(lambda x:cnames[x], colors))
    return colors

# Box plot wrapper: one subplot per numeric column
def boxplot(data, rows = 3, cols = 4, figsize = (13, 8), vars  =None, hue = None, width = 0.25,
            order = None, color_style ='light',subplots_adjust = (0.2, 0.2)):
    
    fig = plt.figure(figsize = figsize)
    hue = data[hue] if isinstance(hue,str) and hue in data.columns else hue
    data = data if not vars else data[vars]
    
    colors = get_colors(color_style)
    ax_num = 1
    for col in data.columns:
        if isinstance(data[col].values[0],(np.int64,np.int32,np.int16,np.int8,np.float16,np.float32,np.float64)):
            plt.subplot(rows, cols, ax_num)
            sns.boxplot(x = hue,y = data[col].values,color=random.sample(colors,1)[0],width= width,order = order)
            plt.xlabel(col)
#             data[col].plot(kind = 'box',color=random.sample(colors,1)[0])
            ax_num+=1
    
    plt.subplots_adjust(hspace = subplots_adjust[0],wspace=subplots_adjust[1])
# data_merge
data = pd.read_excel('问题一连接两表并处理后数据.xlsx').iloc[:,1:]

boxplot(data.query("类型 == '高钾'"), 4, 4, hue = '采样点风化类型',
        vars = data.columns[1:].tolist(), figsize=(11, 8), subplots_adjust=(0.52,0.25))
plt.savefig('高钾有无风化化学成分箱线图分析.jpg')

High potassium:

boxplot(data.query("类型 == '铅钡'"), 4, 4, hue = '采样点风化类型', vars = data.columns[1:].tolist(),
        figsize=(13, 8), subplots_adjust=(0.55,0.25), order = ['未风化点','风化点','严重风化点'])
plt.savefig('铅钡有无风化化学成分箱线图分析.jpg')

 Lead barium: 

Combining the descriptive statistics of the chemical components with the box plots and line graphs, we can roughly observe the following statistical patterns of weathering:

  1. High-potassium glass has a relatively high silica content. After weathering, the proportions of potassium oxide and aluminum oxide decrease while the proportion of silica increases; the proportions of the other components change little.
  2. In lead-barium glass the proportion of silica falls markedly as the degree of weathering increases, while the proportions of lead oxide, barium oxide and phosphorus pentoxide rise after weathering. Severely weathered samples have a higher proportion of sulfur dioxide than weathered samples, whereas weathered and unweathered samples differ little in sulfur dioxide. The proportions of the other components change little.

2. According to the detection data of the weathering point, predict the chemical composition content before weathering.

Most people get stuck here. I basically had an idea for Question 2 on the first night and started on its clustering, but I was stuck on this sub-question until the early morning, and my teammates had no good idea either. I knew from the start that machine learning could not be used for modeling: there are no matched samples before and after weathering, since the data all belong to different relic numbers. After spending some time on other approaches, I thought of the simplest one: take the ratio of the average content of each component in unweathered samples to that in weathered samples as a scaling coefficient, then multiply the composition of a weathered sample by this ratio to predict its composition before weathering. This is also the 'average' method mentioned in the reference answer.

However, using the plain average directly is not suitable. Even for the same glass type and the same degree of weathering, the content of a given component can differ considerably between samples. A few sample points with very large values pull the mean up, so that it no longer represents most of the data.

To deal with this, we weight the sample points with the normal density function and use the weighted means to obtain the before/after weathering ratio of each component.

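from scipy import stats
def zhengtai(x):
    # Standard normal density, used as the weight of a standardized value
    return stats.norm.pdf(x)

# Wrapper: takes the high-potassium or lead-barium data and a sampling-point weathering type
def get_mean(data, label):
    d_ = data[data['采样点风化类型'] == label].iloc[:,6:-1].copy()
    # Standardize each component (z-score)
    d_scaler = (d_ - d_.mean(axis = 0))/(d_.std(axis = 0))
    # Weighted mean: values far from the column mean get small weights
    weights = zhengtai(d_scaler)
    mean_ = (weights * d_).sum(axis = 0)/weights.sum(axis = 0)
    mean_ = mean_.fillna(0)
    # Return the plain mean and the weighted mean
    return d_.mean(), mean_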
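To see what the weighting does, consider a single component with one unusually large value. The sketch below is a minimal NumPy version of the same idea for one column; `normal_weighted_mean` and the sample values are illustrative assumptions, not part of the original program.

```python
import numpy as np

def normal_weighted_mean(values):
    """Weighted mean whose weights are the standard normal density of the z-scores."""
    values = np.asarray(values, dtype=float)
    std = values.std(ddof=1)                  # sample standard deviation, as pandas uses
    if std == 0:                              # all values equal: nothing to reweight
        return values.mean()
    z = (values - values.mean()) / std
    weights = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return (weights * values).sum() / weights.sum()

sample = [9.0, 10.0, 11.0, 10.0, 30.0]        # hypothetical contents with one outlier
print("plain mean:   ", np.mean(sample))
print("weighted mean:", normal_weighted_mean(sample))
```

The plain mean is 14, pulled up by the single value of 30, while the weighted mean stays much closer to the cluster around 10 because the outlier has a large z-score and therefore a small weight.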

 Get the mean and weighted mean of the high potassium data:

a1, a2 = get_mean(data_merge.query("类型 == '高钾'"), '未风化点')
b1, b2 = get_mean(data_merge.query("类型 == '高钾'"), '风化点')

gj_mean = pd.concat((a1,a2,b1,b2),axis = 1)
gj_mean.columns = ['未风化点均值','未风化点加权均值','风化点均值','风化点加权均值']
gj_mean

 

To visualize:

d_ = data_merge.query("类型 == '高钾'")
fig, axes = plt.subplots(4,4,figsize = (16,16))
axes = axes.ravel()
for num,fea in enumerate(data_merge.columns[6:-1]):
    sns.stripplot(x = '采样点风化类型',palette = sns.color_palette('deep',3),
                  y = fea,
                  data =d_,ax = axes[num],
                  jitter = True) # jitter the points so that they do not overlap
    wfh_mean,wfh_mean_jiaquan = gj_mean[['未风化点均值','未风化点加权均值']].loc[fea,:]
    fh_mean, fh_mean_jiaquan = gj_mean[['风化点均值','风化点加权均值']].loc[fea,:]

    color = ['b','r']
    axes[num].plot([-0.5,0.5],[wfh_mean]*2, c = color[0], label = '未风化均值')
    axes[num].plot([-0.5,0.5],[wfh_mean_jiaquan]*2, '--', c = color[0], label = '未风化加权均值')
    
    axes[num].plot([0.5, 1.5], [fh_mean]*2, c = color[1], label = '风化均值')
    axes[num].plot([0.5, 1.5],[fh_mean_jiaquan]*2, '--', c= color[1], label = '风化加权均值')
    if num==1:
        axes[num].legend()
axes[-1].axis('off')
axes[-2].axis('off')
plt.subplots_adjust(0.2,0.2)
plt.show()

 Get the mean and weighted mean of lead and barium data:

a1, a2 = get_mean(data_merge.query("类型 == '铅钡'"), '未风化点')
b1, b2 = get_mean(data_merge.query("类型 == '铅钡'"), '风化点')
c1, c2 = get_mean(data_merge.query("类型 == '铅钡'"), '严重风化点')
qb_mean = pd.concat((a1,a2,b1,b2,c1,c2),axis = 1)
qb_mean.columns = ['未风化点均值','未风化点加权均值','风化点均值','风化点加权均值','严重风化点均值','严重风化点加权均值']
qb_mean

 

Visualization: 

d_ = data_merge.query("类型 == '铅钡'")
fig, axes = plt.subplots(4,4,figsize = (16,16))
axes = axes.ravel()
for num,fea in enumerate(data_merge.columns[6:-1]):
    sns.stripplot(x = '采样点风化类型',palette = sns.color_palette('deep',3),
                  y = fea,
                  data =d_,ax = axes[num], 
                  order = ['未风化点','风化点','严重风化点'],
                  jitter = True) # jitter the points so that they do not overlap
    
    wfh_mean,wfh_mean_jiaquan = qb_mean[['未风化点均值','未风化点加权均值']].loc[fea,:]
    fh_mean, fh_mean_jiaquan = qb_mean[['风化点均值','风化点加权均值']].loc[fea,:]
    yz_fh_mean, yz_fh_mean_jiaquan = qb_mean[['严重风化点均值','严重风化点加权均值']].loc[fea,:]
    
    color = ['b', 'orange', 'g']
    axes[num].plot([-0.5, 0.5],[wfh_mean]*2, c = color[0], label = '未风化均值')
    axes[num].plot([-0.5, 0.5],[wfh_mean_jiaquan]*2, '--', c = color[0], label = '未风化加权均值')
    
    axes[num].plot([0.5, 1.5], [fh_mean]*2, c = color[1], label = '风化均值')
    axes[num].plot([0.5, 1.5],[fh_mean_jiaquan]*2, '--', c= color[1], label = '风化加权均值')
    
    axes[num].plot([1.5, 2.5],[yz_fh_mean]*2, c = color[2], label = '严重风化均值')
    axes[num].plot([1.5, 2.5],[yz_fh_mean_jiaquan]*2, '--',c = color[2], label = '严重风化加权均值')
    

    axes[num].set_xlabel(None)
    if num==1:
        axes[num].legend()
axes[-1].axis('off')
axes[-2].axis('off')
plt.subplots_adjust(0.2,0.2)
plt.show()

 

 Get the weighted mean ratio before and after weathering:

w_gj = (gj_mean.iloc[:,1]/gj_mean.iloc[:,3]).replace({np.inf:0})
w_qb = (qb_mean['未风化点加权均值']/qb_mean['风化点加权均值']).replace({np.inf:0})
w_qb_yz = (qb_mean['未风化点加权均值']/qb_mean['严重风化点加权均值']).replace({np.inf:0})

W = pd.DataFrame([w_gj,w_qb,w_qb_yz],index = ['高钾:风化前后加权比例','铅钡:风化前后加权比例','铅钡:严重风化前后加权比例']).T

Prediction:

The chemical composition of a weathered sampling point is multiplied by the before/after weathering ratio to obtain the predicted composition before weathering.

Note: one problem remains. When a component of a weathered sampling point is 0, the prediction is 0 no matter what ratio it is multiplied by. That is unsatisfactory, because it disagrees with most unweathered sampling points, even though a few unweathered points do have a value of 0 for that component. In this case we use the weighted mean of the component before weathering as the predicted value.

Comparison of the predictions before and after this treatment, for weathered sampling points where a component is 0:

High-potassium unweathered sampling points:

High-potassium weathered sampling points:

As shown above, potassium oxide is nonzero at most unweathered points, at close to 10%, and magnesium oxide is also mostly nonzero, though smaller.

Direct prediction:

Using the unweathered weighted mean as the predicted value:

After this treatment the results are still within the valid range, so I consider the method feasible; it avoids overly rigid predictions.
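The whole prediction rule (scale by the ratio, fall back to the unweathered weighted mean where the measured value is 0, then renormalize) can be tried on a few made-up numbers. In this sketch all means and measurements are hypothetical and only three components are used.

```python
import numpy as np
import pandas as pd

# Hypothetical weighted means (percent) for three components
unweathered_mean = pd.Series({"SiO2": 60.0, "K2O": 10.0, "CaO": 5.0})
weathered_mean = pd.Series({"SiO2": 90.0, "K2O": 0.5, "CaO": 1.0})
ratio = unweathered_mean / weathered_mean        # before/after scaling factor

# Two hypothetical weathered sampling points; K2O was not detected at the second
weathered = pd.DataFrame({"SiO2": [93.0, 95.0], "K2O": [1.0, 0.0], "CaO": [0.8, 1.2]})

pred = weathered.replace({0: np.nan}) * ratio    # zeros become NaN so they can be refilled
pred = pred.fillna(unweathered_mean)             # fall back to the unweathered weighted mean
pred = pred.div(pred.sum(axis=1), axis=0) * 100  # renormalize each row to 100%
print(pred.round(2))
```

Without the fallback, the second point would be predicted to contain no potassium oxide before weathering, which contradicts the typical unweathered composition.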

High Potassium Prediction Program:

# High-potassium prediction
fh_gj = data_merge.query("类型 == '高钾' &  采样点风化类型 == '风化点'").iloc[:, 6:-1]
# Mark zeros as missing so they can be filled with the unweathered weighted mean below
fh_gj = fh_gj.replace({0:np.nan})

gj_pred = fh_gj*W['高钾:风化前后加权比例']

gj_pred = gj_pred.fillna(gj_mean['未风化点加权均值'])
# Renormalize each row to 100% (note: judge for yourself whether this is reasonable)
gj_pred = (gj_pred.T/gj_pred.sum(axis = 1)*100).T

# Add back the non-composition features
d_ = data_merge.loc[gj_pred.index,:][['文物编号','纹饰','类型','颜色',
                                      '表面风化','文物采样点','采样点风化类型']]
gj_pred = pd.concat((d_, gj_pred), axis = 1)

Lead and Barium Prediction Program: 

# Lead-barium prediction
fh_qb = data_merge.query("类型 == '铅钡' &  采样点风化类型 == '风化点'").iloc[:, 6:-1]
fh_qb = fh_qb.replace({0:np.nan})

yz_fh_qb = data_merge.query("类型 == '铅钡' &  采样点风化类型 == '严重风化点'").iloc[:, 6:-1]
yz_fh_qb = yz_fh_qb.replace({0:np.nan})

qb_pred = (fh_qb * W['铅钡:风化前后加权比例'])
yz_qb_pred = (yz_fh_qb * W['铅钡:严重风化前后加权比例'])

qb_pred = qb_pred.fillna(qb_mean['未风化点加权均值'])
yz_qb_pred = yz_qb_pred.fillna(qb_mean['未风化点加权均值'])

# Renormalize each row to 100%
qb_pred = (qb_pred.T/qb_pred.sum(axis = 1)*100).T
yz_qb_pred = (yz_qb_pred.T/yz_qb_pred.sum(axis = 1)*100).T

# Combine the predictions for weathered and severely weathered points
qb_pred = pd.concat((yz_qb_pred,qb_pred))

# Add back the non-composition features
d_ = data_merge.loc[qb_pred.index,:][['文物编号','纹饰','类型','颜色',
                                   '表面风化','文物采样点','采样点风化类型']]
qb_pred = pd.concat((d_, qb_pred), axis = 1)

Result = pd.concat((gj_pred, qb_pred)).reset_index().iloc[:, 1:]
# Result.to_excel('加权平均法预测风化前含量结果.xlsx')
Result

Prediction results (shown as a screenshot):

Question 2:

1. Analyze the classification rules of high-potassium glass and lead-barium glass according to the attached data;

Use box plots to compare the chemical composition of high-potassium and lead-barium glass and to explore the rules that determine the glass type:

# data_merge
data = pd.read_excel('问题一连接两表并处理后数据.xlsx').iloc[:,1:]
# boxplot is the author's wrapper function defined above
boxplot(data,4,4,hue = '类型', vars = data.columns[1:].tolist(),
        figsize=(11, 8), subplots_adjust=(0.52,0.25))
plt.savefig('高钾铅钡玻璃化学成分箱线图分析.jpg')

Density plots of the feature distributions:

# stat = "count", "frequency", "density", "probability"
def distplot(data, rows = 3, cols = 4, bins = 10, vars = None, hue = None, kind = 'hist',stat = 'count', shade = True,
             figsize = (12, 5), color_style = 'all', alpha = 0.7, subplots_adjust = (0.3, 0.2)):
    assert kind in ['hist', 'kde','both'], "kind must be 'hist', 'kde' or 'both'"
    assert stat in ["count", "frequency", "density", "probability"], 'stat must in ["count", "frequency", "density", "probability"]'
    
    fig = plt.figure(figsize = figsize)
    hue_name = hue if isinstance(hue,str) or hue is None else hue.name
    hue = data[hue] if isinstance(hue,str) and hue in data.columns else hue
    data = data if not vars else data[vars]
    
    colors = get_colors(color_style)
    
    ax_num = 1
    for col in data.columns:
        if isinstance(data[col].values[0],(np.int64,np.int32,np.int16,np.int8,np.float16,np.float32,np.float64)) and col!=hue_name:
            plt.subplot(rows, cols, ax_num)
            if kind == 'hist':
                sns.histplot(x = data.loc[:,col],bins = bins,color=random.sample(colors,1)[0],hue = hue,alpha = alpha,stat = stat)
            elif kind == 'kde':
                # fill= replaces the deprecated shade= argument of kdeplot
                sns.kdeplot(x = data.loc[:,col],color=random.sample(colors,1)[0],alpha = alpha,hue = hue,fill = shade)
            else:
                # histplot with kde=True replaces the deprecated distplot
                sns.histplot(x = data.loc[:,col],color=random.sample(colors,1)[0],kde=True,bins = bins)
                plt.xlabel(col)
            ax_num+=1

    plt.subplots_adjust(hspace = subplots_adjust[0],wspace=subplots_adjust[1])
with sns.color_palette('rainbow_r'):
    distplot(data.iloc[:,1:],4,4,kind = 'kde',figsize=(11,7),subplots_adjust=(0.5,0.35),hue = '类型')

According to the figure above, the distributions of silicon dioxide, potassium oxide, lead oxide, barium oxide, phosphorus pentoxide and strontium oxide differ considerably between high-potassium and lead-barium glass. Lead oxide stands out: its content in high-potassium glass is concentrated near 0, whereas in lead-barium glass it is relatively large. These features, lead oxide in particular, are therefore likely to be important for distinguishing the two types. A glass-type prediction model is built next to confirm and verify this.

The model built here can be used directly for the prediction in Question 3, so its main purpose is to predict well. The models chosen are simple, because the data set is small and the distinguishing features are obvious. L1-regularized logistic regression and a decision tree each separate high-potassium and lead-barium glass completely, with a cross-validation accuracy of 100%. These two models were chosen not only because they are simple but also because they select important features and are less affected by multicollinearity. I saw many people reach for neural networks and other overly complicated models right away. I don't think that is necessary: the data set is small and logistic regression is enough. Random forests or support vector machines can of course also be used.

Since the process is simple (encode and standardize the data in the usual way, then build the model with the sklearn library), I only describe the idea here and do not show the program. Here are a few visualizations of the model:

Analyzing the weight plot in the figure above, the main deciding factors are silicon dioxide, potassium oxide, lead oxide, barium oxide and so on. Lead oxide has the largest weight, i.e. it is the most decisive for the classification result. These factors are consistent with the conclusions of the box plot analysis.

The decision tree classifies these samples perfectly. Given how a decision tree works, the classification condition here is very simple: lead oxide alone separates high-potassium from lead-barium glass, and the decision boundary is a linear hyperplane, as the lead oxide box plot shows, where one straight line is enough to separate the two types. However, a prediction based only on lead oxide relies too heavily on a single feature, which makes the model less usable on abnormal data such as noise or missing content values. A random forest is used instead.

Both logistic regression and the support vector machine use all the features for prediction, which gives the behavior we want in practice. Considering how the support vector machine works, and that its upper limit is higher than that of logistic regression, the support vector machine is chosen as the final model.
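Since the post does not show the modeling program, here is a minimal sketch of what the L1-regularized logistic regression and decision tree step might look like with sklearn. It is not the author's code: the data are synthetic stand-ins with three assumed columns (SiO2, K2O, PbO), and the hyperparameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 30
# Synthetic stand-in for the real table: columns are (SiO2, K2O, PbO) in percent
high_k = np.column_stack([rng.normal(70, 8, n), rng.normal(9, 3, n), rng.normal(0.3, 0.2, n)])
pb_ba = np.column_stack([rng.normal(40, 12, n), rng.normal(0.3, 0.2, n), rng.normal(30, 8, n)])
X = np.vstack([high_k, pb_ba])
y = np.array([0] * n + [1] * n)     # 0 = high potassium, 1 = lead barium

# Standardize, then fit a logistic regression with an L1 penalty
logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)

for name, clf in [("L1 logistic regression", logit), ("decision tree", tree)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean())

logit.fit(X, y)
# Coefficients shrunk to exactly 0 mark features the model does not use
print(logit[-1].coef_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the cross-validation score is not inflated by information from the held-out fold.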

Decision Plane Display:

 

      

2. Select the appropriate chemical composition for each category to divide it into subcategories, give the specific division method and division results, and analyze the rationality and sensitivity of the classification results.

Here we use the weathering type of the sampling points as the reference for clustering high-potassium and lead-barium glass into subclasses. According to the problem statement and the data analysis, the chemical composition of a given glass type changes greatly with weathering, so unweathered, weathered and severely weathered glass are taken as the subclasses. High-potassium glass is divided into two subclasses (unweathered and weathered), and lead-barium glass into three (unweathered, weathered and severely weathered). The weathering labels of the sampling points in the original data are used to evaluate the subsequent clustering results.

Filter features:

There are several ways to filter features:

The dispersion of each chemical component can serve as the criterion, i.e. select the features with a large standard deviation. Within a single glass type, a feature with small dispersion has similar content across samples and is therefore hard to use for further subdivision. Conversely, a feature with large dispersion leaves room for further subdivision.

Alternatively, train a classification model with the chemical components as variables and the weathering type as the label, then select features by their weights or importances. For this problem that may work better than selecting by standard deviation: the purpose of the clustering is clear, so the clusters obtained from the selected features should match the weathering types, and features with a large standard deviation are not necessarily suitable for that and may even hurt the result. Even so, with the first method the selected features can still be adjusted slightly by inspecting the clustering results.
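The first screening method amounts to ranking the components of one glass type by standard deviation. A minimal sketch on hypothetical values (three assumed columns, four made-up samples):

```python
import pandas as pd

# Hypothetical compositions (percent) of one glass type
toy = pd.DataFrame({
    "SiO2": [60.0, 75.0, 92.0, 95.0],
    "K2O":  [12.0, 9.0, 0.5, 0.0],
    "SrO":  [0.02, 0.03, 0.02, 0.03],
})
ranking = toy.std().sort_values(ascending=False)
print(ranking)
```

Note that the standard deviation depends on the scale of each component, so a major component such as silica will tend to rank first regardless of how informative it is. This is one reason the model-based ranking may be preferable.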

Random forest selection features:

Here a random forest classifier is used for feature selection. A single decision tree is not used because the data set is so simple that the tree can split it with one feature, which gives that feature an importance of 1 and all others 0, and so provides no basis for selecting several features.

from sklearn.cluster import KMeans, AgglomerativeClustering,DBSCAN
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV 
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score 
from sklearn.preprocessing import StandardScaler  # used by the clustering code below
from itertools import permutations


data = pd.read_excel('问题一连接两表并处理后数据.xlsx').iloc[:,1:]
d1 = data.query("类型 == '高钾'")
d1.index = range(len(d1))
d2 = data.query("类型 == '铅钡'")
d2.index = range(len(d2))

  High potassium:

x = d1.drop('采样点风化类型', axis = 1).iloc[:, 6:]
y = d1['采样点风化类型']

model = RandomForestClassifier()
# Hyperparameter tuning
parameters = {'max_depth':range(1,5),'min_samples_leaf':[1,2],'criterion':['gini','entropy'],'min_impurity_decrease':[0.01,0.02,0.025]}

grid_search = GridSearchCV(model, parameters,cv=5) 
grid_search.fit(x, y)  # fit on the data

print('网格搜索最高精度为:',grid_search.best_score_)
print('参数最优值:',grid_search.best_params_) 
# Use the tuned model; GridSearchCV has already refit it on all the data
model = grid_search.best_estimator_

gj_fea_df = pd.DataFrame([x.columns,model.feature_importances_],index = ['化学成分','特征重要性']).T
gj_fea_df.sort_values('特征重要性', ascending = False)
"""
化学成分	特征重要性
0	二氧化硅(SiO2)	0.231851
5	氧化铝(Al2O3)	0.170588
2	氧化钾(K2O)	0.146777
3	氧化钙(CaO)	0.138207
6	氧化铁(Fe2O3)	0.097733
10	五氧化二磷(P2O5)	0.096698
7	氧化铜(CuO)	0.034813
4	氧化镁(MgO)	0.02908
8	氧化铅(PbO)	0.017656
11	氧化锶(SrO)	0.017181
12	氧化锡(SnO2)	0.009706
9	氧化钡(BaO)	0.007264
1	氧化钠(Na2O)	0.002448
13	二氧化硫(SO2)	0.0
"""
# List of high-potassium features in descending order of importance
gj_fea = gj_fea_df.sort_values('特征重要性', ascending = False)['化学成分'].tolist()

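For comparison, the high-potassium features ranked by standard deviation (the first screening method above; dt is a descriptive-statistics table that is not shown in this post):

# Stored under a separate name so that the random-forest ranking gj_fea used below is not overwritten
gj_fea_std = dt.T.loc[:,'高钾(标准差)'].sort_values(ascending = False).index
dt.T.loc[:,'高钾(标准差)'].sort_values(ascending = False)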

"""
二氧化硅(SiO2)     14.466726
氧化钾(K2O)        5.307753
氧化钙(CaO)        3.308126
氧化铝(Al2O3)      3.076662
氧化铁(Fe2O3)      1.566033
氧化铜(CuO)        1.492236
五氧化二磷(P2O5)     1.280603
氧化钠(Na2O)       1.088707
氧化钡(BaO)        0.841629
氧化镁(MgO)        0.711802
氧化锡(SnO2)       0.556257
氧化铅(PbO)        0.514144
二氧化硫(SO2)       0.157164
氧化锶(SrO)        0.043866
Name: 高钾(标准差), dtype: float64
"""

 Lead barium:

x = d2.drop('采样点风化类型', axis = 1).iloc[:, 6:]
y = d2['采样点风化类型']

# Hyperparameter tuning
model = RandomForestClassifier()
parameters = {'max_depth':range(1,5),'min_samples_leaf':[1,2],'criterion':['gini','entropy'],'min_impurity_decrease':[0.01,0.02]}

grid_search = GridSearchCV(model, parameters,cv=5) 
grid_search.fit(x, y)  # fit on the data

print('网格搜索最高精度为:',grid_search.best_score_)
print('参数最优值:',grid_search.best_params_) 
# Use the tuned model; GridSearchCV has already refit it on all the data
model = grid_search.best_estimator_

qb_fea_df = pd.DataFrame([x.columns,model.feature_importances_],index = ['化学成分','特征重要性']).T
qb_fea_df.sort_values('特征重要性', ascending = False)
"""
化学成分	特征重要性
0	二氧化硅(SiO2)	0.227769
8	氧化铅(PbO)	0.207812
3	氧化钙(CaO)	0.092374
10	五氧化二磷(P2O5)	0.087535
11	氧化锶(SrO)	0.06342
5	氧化铝(Al2O3)	0.056736
7	氧化铜(CuO)	0.050561
9	氧化钡(BaO)	0.042228
13	二氧化硫(SO2)	0.03454
4	氧化镁(MgO)	0.033355
2	氧化钾(K2O)	0.030859
1	氧化钠(Na2O)	0.027895
6	氧化铁(Fe2O3)	0.027078
12	氧化锡(SnO2)	0.017839
"""
# List of lead-barium features in descending order of importance
qb_fea = qb_fea_df.sort_values('特征重要性', ascending = False)['化学成分'].tolist()

Screening features: add features one at a time in descending order of importance. After each addition, cluster the data and use f1-score to measure how well the clusters match the weathering types of the sampling points. If adding a feature does not improve the score, delete it.

def pinggu_gj(pred):
    score = 0
    for i in permutations([0,1]):
        true1 = d1['采样点风化类型'].replace({'未风化点':i[0],'风化点':i[1]})
        score_ = f1_score(pred, true1, average='weighted')
        score = score_ if score_>score else score
    return score

def pinggu_qb(pred):
    score = 0
    for i in permutations([0,1,2]):
        true2 = d2['采样点风化类型'].replace({'未风化点':i[0], '风化点':i[1], '严重风化点':i[2]})
        score_ = f1_score(pred, true2, average='weighted')
        score = score_ if score_>score else score
    return score
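The permutation loop in these two functions is needed because K-means numbers its clusters arbitrarily: cluster 0 may correspond to either weathering type. A generic version of the same idea is sketched below; `best_match_f1` and the toy labels are illustrative assumptions, and the true labels are passed to `f1_score` first, which is the argument order sklearn documents.

```python
from itertools import permutations
from sklearn.metrics import f1_score

def best_match_f1(true_labels, cluster_ids):
    """Best weighted f1-score over all assignments of cluster ids to class names."""
    classes = sorted(set(true_labels))
    best = 0.0
    # Try every assignment of cluster ids 0..k-1 to the class names
    for perm in permutations(range(len(classes))):
        mapping = dict(zip(classes, perm))
        mapped_true = [mapping[t] for t in true_labels]
        best = max(best, f1_score(mapped_true, cluster_ids, average="weighted"))
    return best

true_labels = ["unweathered", "unweathered", "weathered", "weathered"]
print(best_match_f1(true_labels, [1, 1, 0, 0]))   # same grouping, swapped ids
print(best_match_f1(true_labels, [1, 1, 0, 1]))   # one point in the wrong cluster
```

The first call scores a perfect match even though the cluster ids are swapped relative to the alphabetical class order; the second is penalized for the misplaced point.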

gj_score = 0
qb_score = 0
gj_del_fea = []
qb_del_fea = []
best_gj_fea = []
best_qb_fea = []
for i in range(1, 15):
    # Top-i features by importance, minus those already rejected
    fea1 = [f for f in gj_fea[:i] if f not in gj_del_fea]
    d1_x = d1[fea1]
    
    fea2 = [f for f in qb_fea[:i] if f not in qb_del_fea]
    d2_x = d2[fea2]
    
    model1 = KMeans( n_clusters=2)
    scaler1 = StandardScaler()
    a1 = scaler1.fit_transform(d1_x)
    y1 = model1.fit_predict(a1)
    score1 = pinggu_gj(y1)
    if score1 > gj_score: 
        gj_score = score1
        best_gj_fea = fea1
    else:
        # The newly added feature did not improve the match, so reject it
        gj_del_fea.append(gj_fea[i-1])
        
    scaler2 = StandardScaler()
    model2 = KMeans( n_clusters=3)
    a2 = scaler2.fit_transform(d2_x)
    y2 = model2.fit_predict(a2)
    score2 = pinggu_qb(y2)
    if score2 > qb_score:
        qb_score = score2
        best_qb_fea = fea2
    else:
        qb_del_fea.append(qb_fea[i-1])

        
print(f'高钾数据选择的特征:{best_gj_fea}')
# 高钾数据选择的特征:['二氧化硅(SiO2)', '五氧化二磷(P2O5)']

print(f'铅钡数据选择的特征:{best_qb_fea}')
# 铅钡数据选择的特征:['二氧化硅(SiO2)', '氧化铅(PbO)', '氧化钡(BaO)', '二氧化硫(SO2)']

print(f'高钾f1-score:', gj_score, '铅钡f1-score:', qb_score )
# 高钾f1-score: 0.9435154217762913 铅钡f1-score: 0.8987315891105979

Results:

The features selected for high-potassium glass are silicon dioxide and phosphorus pentoxide.

The features selected for lead-barium glass are silicon dioxide, lead oxide, barium oxide and sulfur dioxide.

The f1-scores for the match between the clusters from the selected features and the weathering types of the original sampling points are 0.94 for high-potassium glass and 0.90 for lead-barium glass.

Use the selected features to cluster again, and save the clustering results.

Visualize the clustering results with two-dimensional and three-dimensional scatter plots, using the 2 and 3 most important features as the axes. (Only two features were selected for high-potassium glass, so for the three-dimensional plot it is enough to add the most important of the remaining components. The plots are mainly meant to show the approximate clustering effect.)

d1_ = pd.concat((d1.loc[:,['文物编号','类型','文物采样点']], d1.loc[:,'二氧化硅(SiO2)':'二氧化硫(SO2)']),axis = 1)
d2_ = pd.concat((d2.loc[:,['文物编号','文物采样点']], d2.loc[:,'二氧化硅(SiO2)':'二氧化硫(SO2)']),axis = 1)

d1_x = d1[['二氧化硅(SiO2)', '五氧化二磷(P2O5)']]
d2_x = d2[['二氧化硅(SiO2)', '氧化铅(PbO)', '氧化钡(BaO)', '二氧化硫(SO2)']]

# K-means clustering
model1 = KMeans( n_clusters=2)
scaler1 = StandardScaler()
a1 = scaler1.fit_transform(d1_x)
y1 = model1.fit_predict(a1)
d1_['label'] = y1

scaler2 = StandardScaler()
model2 = KMeans( n_clusters=3)
a2 = scaler2.fit_transform(d2_x)
y2 = model2.fit_predict(a2)
d2_['label'] = y2

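from mpl_toolkits.mplot3d import Axes3D
# Assumption: the colors list is not defined in the original post; any three colors will do
colors = ['tab:blue', 'tab:orange', 'tab:red']
fig = plt.figure(figsize=(15,8))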

# ax = fig.add_subplot(121, projection='3d')
ax = plt.subplot(121)
for i in range(d1_['label'].nunique()):
    data = d1_[d1_['label']==i]
    ax.scatter(data['二氧化硅(SiO2)'],data['五氧化二磷(P2O5)'],c=colors[i],label =i)
#     ax.scatter(data['二氧化硅(SiO2)'],data['五氧化二磷(P2O5)'],data['氧化铝(Al2O3)'],c=colors[i],label =i)
ax.set_xlabel('二氧化硅(SiO2)')
ax.set_ylabel('五氧化二磷(P2O5)')
# ax.set_zlabel('氧化铝(Al2O3)')
plt.title('高钾:Kmeans聚类结果图',fontsize = 15)
plt.legend()
# plt.savefig('高钾聚类效果三维散点图')


# ax = fig.add_subplot(122, projection='3d')
ax = plt.subplot(122)
for i in range(d2_['label'].nunique()):
    data = d2_[d2_['label']==i]
    ax.scatter(data['二氧化硅(SiO2)'],data['氧化铅(PbO)'],c=colors[i],label =i)
#     ax.scatter(data['二氧化硅(SiO2)'],data['氧化铅(PbO)'],data['氧化钡(BaO)'],c=colors[i],label =i)
ax.set_xlabel('二氧化硅(SiO2)')
ax.set_ylabel('氧化铅(PbO)')
# ax.set_zlabel('氧化钡(BaO)')
plt.title('铅钡:Kmeans聚类聚类结果图',fontsize = 15)
plt.legend()
# plt.savefig('铅钡聚类效果三维散点图')
plt.savefig('Kmeans聚类聚类结果图')
plt.show()

 

 3D:

 

The effect of hierarchical clustering is basically the same as that of K-means clustering.

Rationality analysis:

As the figures above show, the high-potassium samples are clustered into two classes. Compared with the weathering types of the sample points, the cluster with the higher silica content corresponds to weathered samples and the cluster with the lower content to unweathered samples, which agrees with the original data. The same holds for lead-barium glass; for example, the red points in the two-dimensional scatter plot match the severely weathered points. The clustering result is therefore considered reasonable.

Sensitivity analysis:

Idea: to explore the sensitivity of the classification results, we use the control-variable method. The content of one feature is varied over a range while the other features are held fixed, the trained K-means model predicts the cluster of the modified sample points, and f1-score quantifies the prediction results. If the predictions stay unchanged, the classification is insensitive to changes in individual features and the result is stable.
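The sensitivity analysis is described only in words, so the following is a sketch of one way it could be carried out, not the author's program. It uses synthetic two-feature data in place of the selected components, perturbs one feature by a relative amount, and compares the new cluster assignments with the baseline; `sensitivity` and the chosen perturbation sizes are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two well-separated groups in two features
X = np.vstack([rng.normal([65, 1.0], [3, 0.3], (20, 2)),
               rng.normal([93, 0.3], [2, 0.1], (20, 2))])

scaler = StandardScaler().fit(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(X))
base = kmeans.predict(scaler.transform(X))       # baseline cluster assignments

def sensitivity(feature, rel_changes):
    """f1-score between baseline labels and labels after scaling one feature."""
    scores = {}
    for r in rel_changes:
        X_shift = X.copy()
        X_shift[:, feature] *= 1 + r             # perturb one feature, hold the rest fixed
        shifted = kmeans.predict(scaler.transform(X_shift))
        scores[r] = f1_score(base, shifted, average="weighted")
    return scores

print(sensitivity(feature=0, rel_changes=[-0.10, -0.05, 0.0, 0.05, 0.10]))
```

A score that stays at or near 1 across the range indicates that the cluster assignments are insensitive to that feature; a sharp drop marks the perturbation size at which points start to change cluster.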

Question 3: Analyze the chemical composition of the unknown glass relics in Form 3 of the attachment, identify their type, and analyze the sensitivity of the classification results.

First check whether the feature distributions of the test data differ too much from those of the training set, then use the model built in Question 2 to predict the test data. The support vector machine and the random forest give the same results.

Visual analysis:

def boxplot1(data1,data2, rows = 3, cols = 4,figsize = (13, 8),color_style ='light'):
    fig = plt.figure(figsize = figsize)
    colors = get_colors(color_style=color_style)
    ax_num = 1
    for col in data1.columns:
        if isinstance(data1[col].iloc[0],(np.integer, np.floating)):  # np.int / np.float were removed from NumPy
            plt.subplot(rows, cols, ax_num)
            data_ = pd.DataFrame([data1[col].values,data2[col].values],index = ['原数据','表单3数据']).T
            sns.boxplot(data = data_,color=random.sample(colors,1)[0],width=0.2,)
#             sns.histplot(data = data_,color=random.sample(colors,1)[0],kde = True)

            plt.xlabel(col)
#             data[col].plot(kind = 'box',color = random.sample(darkcolors,1)[0])
            ax_num+=1
    plt.subplots_adjust(hspace = 0.5)
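data3 = data3.fillna(0)
d_ = data_merge.iloc[:,6:-1]

# Note: data4 is not defined in this post; it presumably holds the chemical-composition columns of data3
boxplot1(data4,d_,4,4)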
plt.savefig('原数据与表单3数据箱线图分析分布情况')
plt.show()

Sensitivity analysis: the content of each feature is varied over a range while the other features are held fixed, and f1-score is used to quantify the prediction results. The predictions are stable.

Origin blog.csdn.net/weixin_46707493/article/details/127348898