Python-based machine learning (statistics and visualization of training data) [100011285]


1. Statistics and visualization of training data

Each sample in the training set "Training-set.csv" has five attributes: "id", "a", "b", "c", and "t". The "id" field is merely a sequence number and is not useful for model training. The attribute "t" is the target value and takes one of two classes, 0 or 1. There are 3476 samples with label 0 and 3524 samples with label 1, for a total of 7000 training samples.

Table 1 lists various statistics of the training data. From the table it can be preliminarily inferred that for samples with label 0, the three attribute values all lie in the interval [-12, 12] and are concentrated near 0 (inferred from the fact that the 25th percentile of each attribute is greater than -4 and the 75th percentile is less than 4); for samples with label 1, the three attribute values all lie in the interval [-22, 22] and are roughly uniformly distributed within it (inferred from the 25th percentile being around -10 and the 75th percentile around 10 for all three attributes).

Table 1 Statistics of the training set data

                   Samples labeled 0             Samples labeled 1
Attribute          a        b        c        a        b        c
Mean             0.086   -0.133   -0.038   -0.223    0.007    0.365
Variance         4.852    4.736    4.751   12.308   12.521   12.258
Minimum        -11.592  -11.743  -11.828  -21.985  -21.994  -21.971
25th percentile -3.733   -3.815   -3.719  -10.659  -10.332   -9.942
Median          -0.008   -0.143   -0.065   -0.385   -0.055    0.550
75th percentile  3.802    3.455    3.544   10.310   10.249   10.554
Maximum         11.941   11.838   11.567   21.999   21.988   21.989
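Statistics like those in Table 1 can be reproduced with a short standard-library script. The generator below is a hypothetical stand-in for "Training-set.csv" (the real file is not reproduced here), drawing label-0 points from inside a sphere of radius 12 and label-1 points from the hollowed-out cube, so the printed numbers only approximate the table:

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical stand-in for the real training data: label 0 lies inside a
# sphere of radius 12 centred at the origin, label 1 in the cube [-22, 22]^3
# with that sphere hollowed out.
def sample(label):
    while True:
        p = [random.uniform(-22, 22) for _ in range(3)]
        r = math.dist(p, (0, 0, 0))
        if label == 0 and r < 12:
            return p
        if label == 1 and r >= 12:
            return p

data = {0: [sample(0) for _ in range(3476)],
        1: [sample(1) for _ in range(3524)]}

for label, points in data.items():
    a_values = [p[0] for p in points]  # statistics for attribute "a"
    q1, median, q3 = statistics.quantiles(a_values, n=4)
    print(label, round(statistics.mean(a_values), 3), round(min(a_values), 3),
          round(q1, 3), round(median, 3), round(q3, 3), round(max(a_values), 3))
```

The same loop applied to attributes "b" and "c" fills in the remaining columns.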

The three attribute values can be treated as coordinates in three-dimensional space, with the samples of the two labels marked in two colors. In Figure 1, the upper-left panel plots "a" against "b", the upper-right plots "a" against "c", and the lower-left plots "b" against "c"; the lower-right panel is a cross-section along the axis perpendicular to attribute "c". Blue points are samples with label 0 and green points are samples with label 1. It can be seen that the samples with label 0 lie inside a sphere centred at the coordinate origin with radius 12, while the samples with label 1 are distributed in a cube with that sphere hollowed out.
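A plotting sketch along the lines of Figure 1 could look as follows, assuming matplotlib is available; the inline points are placeholders for the real data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Placeholder (a, b, c) triples for the two classes; with the real data these
# would be read from "Training-set.csv".
pts0 = [(0.5, -1.0, 1.5), (3.0, 2.0, -1.5)]          # label 0 (blue)
pts1 = [(18.0, -15.0, 20.0), (-20.0, 19.0, -16.0)]   # label 1 (green)

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
pairs = [(0, 1, "a", "b"), (0, 2, "a", "c"), (1, 2, "b", "c")]
for ax, (i, j, xl, yl) in zip(axes.flat, pairs):
    ax.scatter([p[i] for p in pts0], [p[j] for p in pts0], c="blue", s=8)
    ax.scatter([p[i] for p in pts1], [p[j] for p in pts1], c="green", s=8)
    ax.set_xlabel(xl)
    ax.set_ylabel(yl)

# lower-right panel: cross-section perpendicular to "c", here |c| < 2
ax = axes[1][1]
slice0 = [p for p in pts0 if abs(p[2]) < 2]
slice1 = [p for p in pts1 if abs(p[2]) < 2]
ax.scatter([p[0] for p in slice0], [p[1] for p in slice0], c="blue", s=8)
ax.scatter([p[0] for p in slice1], [p[1] for p in slice1], c="green", s=8)
ax.set_title("cross-section |c| < 2")
fig.savefig("figure1.png")
```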

To observe the distribution more clearly, compute the Euclidean distance from each training sample to the coordinate origin and collect the statistics shown in Table 2. It can be seen that a sphere centred at (0, 0, 0) with a radius of 11 can approximately separate the data of the two labels.

Figure 1 Distribution of training set data in space

Table 2 Euclidean distance statistics from the training set data to the coordinate origin

Label      Mean  Variance  Minimum  25th pct  Median  75th pct  Maximum
0         7.988     2.178    0.687     6.594   8.343     9.513   11.995
1        20.540     6.057   10.008    15.884  21.112    25.109   36.506
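The distance computation behind Table 2 is straightforward with the standard library; the handful of points below are illustrative stand-ins for the real samples:

```python
import math
import statistics

# Illustrative samples; with the real data, read the (a, b, c) columns
# from "Training-set.csv" instead.
samples = [(3.0, 4.0, 0.0), (1.0, 2.0, 2.0), (6.0, 8.0, 0.0)]

# Euclidean distance from each sample to the coordinate origin
dists = [math.dist(p, (0.0, 0.0, 0.0)) for p in samples]
print(statistics.mean(dists))            # average distance
print(statistics.quantiles(dists, n=4))  # 25th/50th/75th percentiles
```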

Table 3 below shows the statistics of the test set data. The test set contains 1000 samples in total.

Table 3 Statistics of the test set data

Attribute          a        b        c
Mean             0.086   -0.133   -0.038
Variance         4.852    4.736    4.751
Minimum        -11.592  -11.743  -11.828
25th percentile -3.733   -3.815   -3.719
Median          -0.008   -0.143   -0.065
75th percentile  3.802    3.455    3.544
Maximum         11.941   11.838   11.567

Figure 2 below shows the positions of the test set samples in three-dimensional space: the upper-left panel plots "a" against "b", the upper-right plots "a" against "c", the lower-left plots "b" against "c", and the lower-right panel is a cross-section along the axis perpendicular to attribute "c".

Figure 2 Distribution of test set data in space

Compute the Euclidean distance from each test sample to the coordinate origin; the resulting statistics are given in Table 4 below.

Table 4 Euclidean distance statistics from test set data to coordinate origin

Statistic          Value
Mean              11.062
Variance           0.575
Minimum           10.000
25th percentile   10.550
Median            11.091
75th percentile   11.562
Maximum           11.995

Table 4 shows that the test samples lie at least 10 and at most 12 units from the coordinate origin; they are mainly distributed near the spherical shell centred at (0, 0, 0) with a diameter of 22 (radius 11).
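The radius-11 rule suggested by Table 2 can be written directly as a baseline classifier. This is only a sketch derived from the statistics above, not part of the original experiments:

```python
import math

RADIUS = 11.0  # separating sphere radius inferred from Table 2

def classify_by_distance(point):
    """Rule-of-thumb classifier: label 0 inside the sphere, label 1 outside."""
    return 0 if math.dist(point, (0.0, 0.0, 0.0)) < RADIUS else 1

print(classify_by_distance((1.0, 2.0, 2.0)))     # distance 3     -> 0
print(classify_by_distance((12.0, 12.0, 12.0)))  # distance ~20.8 -> 1
```

Since the test samples all lie within distance 10 to 12 of the origin, i.e. right around this boundary, such a rule is exactly where both classes are hardest to tell apart.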

2. Model training and testing

Based on the preceding visualization and statistical analysis of the data sets, I decided to build models with k-nearest neighbors (KNN) and a support vector machine (SVM).

KNN requires no prior training and can predict the test set directly from the training set. Table 5 below shows how the test accuracy changes with the value of K; the distance function used is the Euclidean distance.

Table 5 Test accuracy of KNN model

K value        1      2      3      4      5      6      7      8      9
Accuracy   77.2%  61.8%  62.6%  58.3%  58.9%  56.2%  56.3%  54.8%  56.3%
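A minimal KNN predictor of the kind used for Table 5 (Euclidean distance, majority vote) can be sketched as follows; the toy points are illustrative, not taken from the actual data set:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of ((a, b, c), label) pairs; returns the majority label
    among the k nearest neighbours under Euclidean distance."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy training set: class 0 near the origin, class 1 far away
train = [((0, 0, 1), 0), ((1, 0, 0), 0), ((0, 1, 0), 0),
         ((20, 20, 20), 1), ((20, 20, 19), 1), ((19, 20, 20), 1)]
print(knn_predict(train, (0.5, 0.5, 0.5), k=3))  # -> 0
print(knn_predict(train, (19, 19, 19), k=3))     # -> 1
```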

The second algorithm is the support vector machine. The experiment used a Gaussian (RBF) kernel with a soft margin and regularization. Table 6 below shows how the accuracy on the training and test sets changes with the regularization coefficient C.

Table 6 Changes in the accuracy of the training set and test set of the SVM model

C value                    1      5     10     20     80    100    200    400    500
Training accuracy (%)  96.66  98.34  99.00  99.47  99.91  99.93  99.97  99.97  99.99
Test accuracy (%)       65.0   72.1   73.6   75.1   76.8   76.5   76.9   77.2   77.2
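An experiment of this shape can be sketched with scikit-learn's `SVC`, assuming that library is available; the data below are synthetic stand-ins for the real CSV files, so the accuracies will not match Table 6 exactly:

```python
import math
import random
from sklearn.svm import SVC

random.seed(0)

# Synthetic stand-in for the training data: label 0 inside the radius-12
# sphere, label 1 outside it within the cube [-22, 22]^3.
def make_point(label):
    while True:
        p = [random.uniform(-22, 22) for _ in range(3)]
        inside = math.dist(p, (0, 0, 0)) < 12
        if inside == (label == 0):
            return p

X = [make_point(0) for _ in range(300)] + [make_point(1) for _ in range(300)]
y = [0] * 300 + [1] * 300

for C in (1, 100, 500):
    clf = SVC(C=C, kernel="rbf", gamma="scale")  # Gaussian kernel, soft margin
    clf.fit(X, y)
    print(C, clf.score(X, y))  # training accuracy grows with C
```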

3. Analysis and Summary

The KNN model needs no pre-training and can predict the test data directly, whereas the SVM must first be fitted on the training set and then uses the trained model for prediction. When more training data is collected, KNN can incorporate it and predict immediately, while the SVM must be retrained. On the other hand, the SVM extracts some information from the data set into a compact model, which KNN does not.

From the preceding analysis, in the training set the two classes are interleaved near the spherical shell centred at the coordinate origin with radius between 10 and 12, and the two classes transition across this region. Building an effective boundary that separates the classes in this region is difficult. The test data are also mainly distributed near this region, so whether KNN or SVM is used, the model's accuracy on the test set reaches only 77.2%, and pushing the test accuracy higher is hard.

Since an irregular boundary is needed at the interface between the two classes to separate the data well, a deep neural network could be trained to learn such a boundary; how to choose the number of layers and the number of neurons per layer, however, is a further problem.
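One possible follow-up along these lines, sketched with scikit-learn's `MLPClassifier` (not part of the original experiments; the data are synthetic stand-ins and the layer sizes are just one guess at the open question above):

```python
import math
import random
from sklearn.neural_network import MLPClassifier

random.seed(0)

# Synthetic stand-in for the training data, as in the SVM sketch above.
def make_point(label):
    while True:
        p = [random.uniform(-22, 22) for _ in range(3)]
        inside = math.dist(p, (0, 0, 0)) < 12
        if inside == (label == 0):
            return p

raw = [make_point(0) for _ in range(300)] + [make_point(1) for _ in range(300)]
X = [[v / 22 for v in p] for p in raw]  # scale features into [-1, 1]
y = [0] * 300 + [1] * 300

# Hidden-layer sizes (32, 16) are an arbitrary starting point to tune.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))  # training accuracy
```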



Origin blog.csdn.net/s1t16/article/details/131675127