Machine Learning
1. Statistics and visualization of training data
Each sample in the training set "Training-set.csv" has five attributes: "id", "a", "b", "c", and "t". "id" is a serial number and is not useful for model training. The attribute "t" is the target value and takes one of two labels, 0 or 1. There are 3476 samples with label 0 and 3524 samples with label 1, for a total of 7000 training samples.
Table 1 shows summary statistics of the training data. From the table it can be preliminarily inferred that for samples with label 0, the three attribute values all lie in the interval [-12, 12] and are concentrated near 0 (inferred from the fact that, for all three attributes, the 25th percentile is greater than -4 and the 75th percentile is less than 4); for samples with label 1, the three attribute values all lie in the interval [-22, 22] and are roughly uniformly distributed within it (inferred from the 25th percentile being around -10 and the 75th percentile around 10 for all three attributes).
Table 1 Various statistical values of training set data
Statistic | Label 0: a | Label 0: b | Label 0: c | Label 1: a | Label 1: b | Label 1: c |
---|---|---|---|---|---|---|
Mean | 0.086 | -0.133 | -0.038 | -0.223 | 0.007 | 0.365 |
Standard deviation | 4.852 | 4.736 | 4.751 | 12.308 | 12.521 | 12.258 |
Minimum | -11.592 | -11.743 | -11.828 | -21.985 | -21.994 | -21.971 |
25th percentile | -3.733 | -3.815 | -3.719 | -10.659 | -10.332 | -9.942 |
Median | -0.008 | -0.143 | -0.065 | -0.385 | -0.055 | 0.550 |
75th percentile | 3.802 | 3.455 | 3.544 | 10.310 | 10.249 | 10.554 |
Maximum | 11.941 | 11.838 | 11.567 | 21.999 | 21.988 | 21.989 |
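The statistics in Table 1 can be reproduced with a `groupby`/`describe` call. Since the CSV itself is not reproduced here, the sketch below uses synthetic stand-in data that mirrors the geometry inferred above (label 0 inside a sphere of radius 12, label 1 in the cube outside it); with the real file, the same two lines at the end apply unchanged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for "Training-set.csv": label-0 points uniform inside
# the sphere of radius 12, label-1 points in the [-22, 22] cube outside it.
cube0 = rng.uniform(-12, 12, size=(20000, 3))
pts0 = cube0[np.linalg.norm(cube0, axis=1) < 12][:3476]
cube1 = rng.uniform(-22, 22, size=(20000, 3))
pts1 = cube1[np.linalg.norm(cube1, axis=1) >= 12][:3524]

train = pd.concat([
    pd.DataFrame(pts0, columns=["a", "b", "c"]).assign(t=0),
    pd.DataFrame(pts1, columns=["a", "b", "c"]).assign(t=1),
], ignore_index=True)

# Per-label summary statistics -- the quantities reported in Table 1.
print(train.groupby("t")[["a", "b", "c"]].describe().round(3))
```

Note that `describe()` reports the standard deviation (`std`), which is what the table's spread row contains.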
The three attribute values can be treated as coordinates in three-dimensional space, with the samples labeled 0 and 1 marked in two colors. In Figure 1, the upper-left panel plots "a" against "b", the upper-right panel "a" against "c", and the lower-left panel "b" against "c"; the lower-right panel is a cross-section perpendicular to the "c" axis. Blue points are samples with label 0 and green points are samples with label 1. The samples with label 0 are distributed inside a sphere centered at the origin with radius 12, while the samples with label 1 are distributed in a cube with that sphere hollowed out.
To observe the distribution more clearly, the Euclidean distance from each training sample to the coordinate origin was computed; the relevant statistics are shown in Table 2. The sphere is centered at (0, 0, 0), and a sphere of radius 11 approximately separates the data of the two labels.
Figure 1 Distribution of training set data in space
Table 2 Euclidean distance statistics from the training set data to the coordinate origin
Label | Mean | Standard deviation | Minimum | 25th percentile | Median | 75th percentile | Maximum |
---|---|---|---|---|---|---|---|
0 | 7.988 | 2.178 | 0.687 | 6.594 | 8.343 | 9.513 | 11.995 |
1 | 20.540 | 6.057 | 10.008 | 15.884 | 21.112 | 25.109 | 36.506 |
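The distance computation behind Table 2 is a one-liner. A minimal sketch, assuming the training data is loaded in a DataFrame `train` with columns a, b, c, t (a tiny hypothetical sample stands in for the real CSV here):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample in place of the real CSV.
train = pd.DataFrame({
    "a": [1.0, -3.0, 15.0, -20.0],
    "b": [2.0, 4.0, -12.0, 5.0],
    "c": [-1.0, 0.5, 8.0, 10.0],
    "t": [0, 0, 1, 1],
})

# Euclidean distance of every sample to the origin (0, 0, 0).
train["dist"] = np.linalg.norm(train[["a", "b", "c"]].values, axis=1)

# Per-label distance statistics, as in Table 2.
print(train.groupby("t")["dist"].describe().round(3))
```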
Table 3 below shows various statistical values of the test set data. The test set has a total of 1000 samples.
Table 3 Various statistical values of test set data
Statistic | a | b | c |
---|---|---|---|
Mean | 0.086 | -0.133 | -0.038 |
Standard deviation | 4.852 | 4.736 | 4.751 |
Minimum | -11.592 | -11.743 | -11.828 |
25th percentile | -3.733 | -3.815 | -3.719 |
Median | -0.008 | -0.143 | -0.065 |
75th percentile | 3.802 | 3.455 | 3.544 |
Maximum | 11.941 | 11.838 | 11.567 |
Figure 2 below shows the positions of the test set samples in three-dimensional space: the upper-left panel plots "a" against "b", the upper-right panel "a" against "c", and the lower-left panel "b" against "c"; the lower-right panel is a cross-section perpendicular to the "c" axis.
Figure 2 Distribution of test set data in space
The Euclidean distance from each test sample to the coordinate origin was computed in the same way; the resulting statistics are shown in Table 4 below.
Table 4 Euclidean distance statistics from test set data to coordinate origin
Statistic | Euclidean distance to the origin |
---|---|
Mean | 11.062 |
Standard deviation | 0.575 |
Minimum | 10.000 |
25th percentile | 10.550 |
Median | 11.091 |
75th percentile | 11.562 |
Maximum | 11.995 |
Table 4 shows that the test samples lie between 10 and 12 units from the coordinate origin, and are mainly distributed near the spherical surface centered at (0, 0, 0) with diameter 22 (radius 11).
2. Model training and testing
Based on the preceding visualization and statistical analysis of the data sets, I decided to build models with K-Nearest Neighbors (KNN) and Support Vector Machine (SVM).
KNN requires no prior training phase; the training set can be used directly to predict the test set. Table 5 below shows how the test accuracy changes with the value of K. The distance function used here is the Euclidean distance.
Table 5 Test accuracy of KNN model
K value | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Accuracy | 77.2% | 61.8% | 62.6% | 58.3% | 58.9% | 56.2% | 56.3% | 54.8% | 56.3% |
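A K sweep like the one in Table 5 can be sketched with scikit-learn's `KNeighborsClassifier`. The arrays below are synthetic stand-ins for the real train/test splits (labels follow the radius-11 rule inferred earlier), so the printed accuracies will not match the table:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real arrays: features are the three
# coordinates (a, b, c); labels follow the radius-11 rule inferred above.
X_train = rng.uniform(-22, 22, size=(500, 3))
y_train = (np.linalg.norm(X_train, axis=1) > 11).astype(int)
X_test = rng.uniform(-22, 22, size=(200, 3))
y_test = (np.linalg.norm(X_test, axis=1) > 11).astype(int)

# Sweep K as in Table 5; metric="euclidean" matches the distance above.
for k in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    knn.fit(X_train, y_train)  # KNN only stores the data; no real training
    print(f"K={k}: test accuracy = {knn.score(X_test, y_test):.1%}")
```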
The second algorithm is the support vector machine. The experiment used a Gaussian (RBF) kernel with a soft margin and regularization. Table 6 below shows how the training-set and test-set accuracy change with the regularization coefficient C.
Table 6 Changes in the accuracy of the training set and test set of the SVM model
C value | 1 | 5 | 10 | 20 | 80 | 100 | 200 | 400 | 500 |
---|---|---|---|---|---|---|---|---|---|
Training set accuracy (%) | 96.66 | 98.34 | 99.00 | 99.47 | 99.91 | 99.93 | 99.97 | 99.97 | 99.99 |
Test set accuracy (%) | 65.0 | 72.1 | 73.6 | 75.1 | 76.8 | 76.5 | 76.9 | 77.2 | 77.2 |
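The C sweep in Table 6 corresponds to varying the `C` parameter of scikit-learn's `SVC` with an RBF kernel. A sketch on the same hypothetical stand-in data as above (so the printed numbers are illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Hypothetical stand-in data, generated the same way as in the KNN sketch.
X_train = rng.uniform(-22, 22, size=(500, 3))
y_train = (np.linalg.norm(X_train, axis=1) > 11).astype(int)
X_test = rng.uniform(-22, 22, size=(200, 3))
y_test = (np.linalg.norm(X_test, axis=1) > 11).astype(int)

# Gaussian (RBF) kernel with a soft margin; C is the regularization
# coefficient swept in Table 6 (larger C means weaker regularization).
for C in (1, 10, 100, 500):
    svm = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    print(f"C={C}: train = {svm.score(X_train, y_train):.2%}, "
          f"test = {svm.score(X_test, y_test):.2%}")
```

The widening gap between training and test accuracy at large C in Table 6 is the usual sign of the soft margin tightening toward overfitting.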
3. Analysis and Summary
The KNN model requires no pre-training and can predict the test data directly, whereas SVM must first be trained on the training set and then uses the trained model for prediction. When new training data arrive, KNN can make predictions immediately, while SVM needs to be retrained. On the other hand, SVM extracts some structural information from the data set (the learned decision boundary), an aspect in which KNN has no advantage.
From the preceding analysis, the two classes of training data are interleaved near the spherical shell centered at the coordinate origin with radius between 10 and 12, where the two classes transition into each other. Constructing an effective boundary that separates the two classes in this region is difficult. Since the test data are also mainly distributed near this region, both the KNN and SVM models reach a test accuracy of only 77.2%, and further improving the test accuracy is difficult.
Because an irregular boundary is needed to separate the data well at the interface between the two classes, a deep neural network could be trained to learn such an irregular boundary; however, how to choose the number of layers and the number of neurons per layer is itself a new problem.
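As a rough sketch of that idea, a small multi-layer perceptron could be tried with scikit-learn's `MLPClassifier`. The layer count and widths below are exactly the open hyperparameters just mentioned, chosen here arbitrarily for illustration, and the data is again a hypothetical stand-in:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in data following the radius-11 rule inferred earlier.
X = rng.uniform(-22, 22, size=(1000, 3))
y = (np.linalg.norm(X, axis=1) > 11).astype(int)

# Two hidden layers of 32 neurons each -- an arbitrary starting point,
# not a tuned architecture.
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=1000,
                    random_state=0)
mlp.fit(X, y)
print(f"training accuracy: {mlp.score(X, y):.2%}")
```

In practice the architecture would be chosen by validation on held-out data, exactly the open question raised above.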