Python-based machine learning (statistics and visualization of training data) [100011285]


1. Statistics and visualization of training data

Each sample in the training set "Training-set.csv" has five attributes: "id", "a", "b", "c", and "t". The "id" field is merely a sequence number and is not useful for model training. The attribute "t" is the target value and takes one of two classes, 0 or 1. There are 3476 samples with label 0 and 3524 samples with label 1, for a total of 7000 training samples.

Table 1 lists various statistics of the training data. From the table it can be preliminarily inferred that for samples with label 0, the three attribute values all lie in the interval [-12, 12] and are concentrated near 0 (inferred from the fact that the 25th percentile of each attribute is greater than -4 and the 75th percentile is less than 4); for samples with label 1, the three attribute values all lie in the interval [-22, 22] and are roughly uniformly distributed within it (inferred from the 25th percentile being around -10 and the 75th percentile around 10 for all three attributes).

Table 1 Statistics of the training set data

                   Samples labeled 0             Samples labeled 1
Attribute          a        b        c        a        b        c
Mean             0.086   -0.133   -0.038   -0.223    0.007    0.365
Variance         4.852    4.736    4.751   12.308   12.521   12.258
Minimum        -11.592  -11.743  -11.828  -21.985  -21.994  -21.971
25th percentile -3.733   -3.815   -3.719  -10.659  -10.332   -9.942
Median          -0.008   -0.143   -0.065   -0.385   -0.055    0.550
75th percentile  3.802    3.455    3.544   10.310   10.249   10.554
Maximum         11.941   11.838   11.567   21.999   21.988   21.989
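Statistics like those in Table 1 can be reproduced with a short standard-library script. The generator below is a hypothetical stand-in for "Training-set.csv" (the real file is not reproduced here), drawing label-0 points from inside a sphere of radius 12 and label-1 points from the hollowed-out cube, so the printed numbers only approximate the table:

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical stand-in for the real training data: label 0 lies inside a
# sphere of radius 12 centred at the origin, label 1 in the cube [-22, 22]^3
# with that sphere hollowed out.
def sample(label):
    while True:
        p = [random.uniform(-22, 22) for _ in range(3)]
        r = math.dist(p, (0, 0, 0))
        if label == 0 and r < 12:
            return p
        if label == 1 and r >= 12:
            return p

data = {0: [sample(0) for _ in range(3476)],
        1: [sample(1) for _ in range(3524)]}

for label, points in data.items():
    a_values = [p[0] for p in points]  # statistics for attribute "a"
    q1, median, q3 = statistics.quantiles(a_values, n=4)
    print(label, round(statistics.mean(a_values), 3), round(min(a_values), 3),
          round(q1, 3), round(median, 3), round(q3, 3), round(max(a_values), 3))
```

The same loop applied to attributes "b" and "c" fills in the remaining columns.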

The three attribute values can be treated as coordinates in three-dimensional space, with the samples of the two labels marked in two colors. In Figure 1, the upper-left panel plots "a" against "b", the upper-right plots "a" against "c", and the lower-left plots "b" against "c"; the lower-right panel is a cross-section along the axis perpendicular to attribute "c". Blue points are samples with label 0 and green points are samples with label 1. It can be seen that the samples with label 0 lie inside a sphere centred at the coordinate origin with radius 12, while the samples with label 1 are distributed in a cube with that sphere hollowed out.
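A plotting sketch along the lines of Figure 1 could look as follows, assuming matplotlib is available; the inline points are placeholders for the real data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Placeholder (a, b, c) triples for the two classes; with the real data these
# would be read from "Training-set.csv".
pts0 = [(0.5, -1.0, 1.5), (3.0, 2.0, -1.5)]          # label 0 (blue)
pts1 = [(18.0, -15.0, 20.0), (-20.0, 19.0, -16.0)]   # label 1 (green)

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
pairs = [(0, 1, "a", "b"), (0, 2, "a", "c"), (1, 2, "b", "c")]
for ax, (i, j, xl, yl) in zip(axes.flat, pairs):
    ax.scatter([p[i] for p in pts0], [p[j] for p in pts0], c="blue", s=8)
    ax.scatter([p[i] for p in pts1], [p[j] for p in pts1], c="green", s=8)
    ax.set_xlabel(xl)
    ax.set_ylabel(yl)

# lower-right panel: cross-section perpendicular to "c", here |c| < 2
ax = axes[1][1]
slice0 = [p for p in pts0 if abs(p[2]) < 2]
slice1 = [p for p in pts1 if abs(p[2]) < 2]
ax.scatter([p[0] for p in slice0], [p[1] for p in slice0], c="blue", s=8)
ax.scatter([p[0] for p in slice1], [p[1] for p in slice1], c="green", s=8)
ax.set_title("cross-section |c| < 2")
fig.savefig("figure1.png")
```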

To observe the distribution more clearly, compute the Euclidean distance from each training sample to the coordinate origin and collect the statistics shown in Table 2. It can be seen that a sphere centred at (0, 0, 0) with a radius of 11 can approximately separate the data of the two labels.

Figure 1 Distribution of training set data in space

Table 2 Euclidean distance statistics from the training set data to the coordinate origin

Label      Mean  Variance  Minimum  25th pct  Median  75th pct  Maximum
0         7.988     2.178    0.687     6.594   8.343     9.513   11.995
1        20.540     6.057   10.008    15.884  21.112    25.109   36.506
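The distance computation behind Table 2 is straightforward with the standard library; the handful of points below are illustrative stand-ins for the real samples:

```python
import math
import statistics

# Illustrative samples; with the real data, read the (a, b, c) columns
# from "Training-set.csv" instead.
samples = [(3.0, 4.0, 0.0), (1.0, 2.0, 2.0), (6.0, 8.0, 0.0)]

# Euclidean distance from each sample to the coordinate origin
dists = [math.dist(p, (0.0, 0.0, 0.0)) for p in samples]
print(statistics.mean(dists))            # average distance
print(statistics.quantiles(dists, n=4))  # 25th/50th/75th percentiles
```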

Table 3 below shows the statistics of the test set data. The test set contains 1000 samples in total.

Table 3 Statistics of the test set data

Attribute          a        b        c
Mean             0.086   -0.133   -0.038
Variance         4.852    4.736    4.751
Minimum        -11.592  -11.743  -11.828
25th percentile -3.733   -3.815   -3.719
Median          -0.008   -0.143   -0.065
75th percentile  3.802    3.455    3.544
Maximum         11.941   11.838   11.567

Figure 2 below shows the positions of the test set samples in three-dimensional space: the upper-left panel plots "a" against "b", the upper-right plots "a" against "c", the lower-left plots "b" against "c", and the lower-right panel is a cross-section along the axis perpendicular to attribute "c".

Figure 2 Distribution of test set data in space

Compute the Euclidean distance from each test sample to the coordinate origin; the resulting statistics are given in Table 4 below.

Table 4 Euclidean distance statistics from test set data to coordinate origin

Statistic          Value
Mean              11.062
Variance           0.575
Minimum           10.000
25th percentile   10.550
Median            11.091
75th percentile   11.562
Maximum           11.995

Table 4 shows that the test samples lie at least 10 and at most 12 units from the coordinate origin; they are mainly distributed near the spherical shell centred at (0, 0, 0) with a diameter of 22 (radius 11).
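The radius-11 rule suggested by Table 2 can be written directly as a baseline classifier. This is only a sketch derived from the statistics above, not part of the original experiments:

```python
import math

RADIUS = 11.0  # separating sphere radius inferred from Table 2

def classify_by_distance(point):
    """Rule-of-thumb classifier: label 0 inside the sphere, label 1 outside."""
    return 0 if math.dist(point, (0.0, 0.0, 0.0)) < RADIUS else 1

print(classify_by_distance((1.0, 2.0, 2.0)))     # distance 3     -> 0
print(classify_by_distance((12.0, 12.0, 12.0)))  # distance ~20.8 -> 1
```

Since the test samples all lie within distance 10 to 12 of the origin, i.e. right around this boundary, such a rule is exactly where both classes are hardest to tell apart.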

2. Model training and testing

Based on the preceding visualization and statistical analysis of the data sets, I decided to build models with k-nearest neighbors (KNN) and a support vector machine (SVM).

KNN requires no prior training and can predict the test set directly from the training set. Table 5 below shows how the test accuracy changes with the value of K; the distance function used is the Euclidean distance.

Table 5 Test accuracy of KNN model

K value        1      2      3      4      5      6      7      8      9
Accuracy   77.2%  61.8%  62.6%  58.3%  58.9%  56.2%  56.3%  54.8%  56.3%
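A minimal KNN predictor of the kind used for Table 5 (Euclidean distance, majority vote) can be sketched as follows; the toy points are illustrative, not taken from the actual data set:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of ((a, b, c), label) pairs; returns the majority label
    among the k nearest neighbours under Euclidean distance."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy training set: class 0 near the origin, class 1 far away
train = [((0, 0, 1), 0), ((1, 0, 0), 0), ((0, 1, 0), 0),
         ((20, 20, 20), 1), ((20, 20, 19), 1), ((19, 20, 20), 1)]
print(knn_predict(train, (0.5, 0.5, 0.5), k=3))  # -> 0
print(knn_predict(train, (19, 19, 19), k=3))     # -> 1
```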

The second algorithm is the support vector machine. The experiment used a Gaussian (RBF) kernel with a soft margin and regularization. Table 6 below shows how the accuracy on the training and test sets changes with the regularization coefficient C.

Table 6 Changes in the accuracy of the training set and test set of the SVM model

C value                    1      5     10     20     80    100    200    400    500
Training accuracy (%)  96.66  98.34  99.00  99.47  99.91  99.93  99.97  99.97  99.99
Test accuracy (%)       65.0   72.1   73.6   75.1   76.8   76.5   76.9   77.2   77.2
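An experiment of this shape can be sketched with scikit-learn's `SVC`, assuming that library is available; the data below are synthetic stand-ins for the real CSV files, so the accuracies will not match Table 6 exactly:

```python
import math
import random
from sklearn.svm import SVC

random.seed(0)

# Synthetic stand-in for the training data: label 0 inside the radius-12
# sphere, label 1 outside it within the cube [-22, 22]^3.
def make_point(label):
    while True:
        p = [random.uniform(-22, 22) for _ in range(3)]
        inside = math.dist(p, (0, 0, 0)) < 12
        if inside == (label == 0):
            return p

X = [make_point(0) for _ in range(300)] + [make_point(1) for _ in range(300)]
y = [0] * 300 + [1] * 300

for C in (1, 100, 500):
    clf = SVC(C=C, kernel="rbf", gamma="scale")  # Gaussian kernel, soft margin
    clf.fit(X, y)
    print(C, clf.score(X, y))  # training accuracy grows with C
```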

3. Analysis and Summary

The KNN model needs no pre-training and can predict the test data directly, whereas the SVM must first be fitted on the training set and then uses the trained model for prediction. When more training data is collected, KNN can incorporate it and predict immediately, while the SVM must be retrained. On the other hand, the SVM extracts some information from the data set into a compact model, which KNN does not.

From the preceding analysis, in the training set the two classes are interleaved near the spherical shell centred at the coordinate origin with radius between 10 and 12, and the two classes transition across this region. Building an effective boundary that separates the classes in this region is difficult. The test data are also mainly distributed near this region, so whether KNN or SVM is used, the model's accuracy on the test set reaches only 77.2%, and pushing the test accuracy higher is hard.

Since an irregular boundary is needed at the interface between the two classes to separate the data well, a deep neural network could be trained to learn such a boundary; how to choose the number of layers and the number of neurons per layer, however, is a further problem.
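One possible follow-up along these lines, sketched with scikit-learn's `MLPClassifier` (not part of the original experiments; the data are synthetic stand-ins and the layer sizes are just one guess at the open question above):

```python
import math
import random
from sklearn.neural_network import MLPClassifier

random.seed(0)

# Synthetic stand-in for the training data, as in the SVM sketch above.
def make_point(label):
    while True:
        p = [random.uniform(-22, 22) for _ in range(3)]
        inside = math.dist(p, (0, 0, 0)) < 12
        if inside == (label == 0):
            return p

raw = [make_point(0) for _ in range(300)] + [make_point(1) for _ in range(300)]
X = [[v / 22 for v in p] for p in raw]  # scale features into [-1, 1]
y = [0] * 300 + [1] * 300

# Hidden-layer sizes (32, 16) are an arbitrary starting point to tune.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))  # training accuracy
```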



Origin blog.csdn.net/s1t16/article/details/131675127