My CSDN blog column: https://blog.csdn.net/yty_7
Github address: https://github.com/yot777/
Single and multiple features
In the example from the previous section, each label corresponded to a single feature:
Sample | Feature (height, in meters) | Label |
A | 1.51 | 0 |
B | 1.61 | 1 |
C | 1.76 | 1 |
D | 2.1 | 1 |
E | 1.58 | 0 |
F | 1.68 | 1 |
In practice, a label very often depends on several features at once. Take the well-known linear programming problem as an example:
In this problem there are two features, x1 and x2, which together determine the sales profit of the machine tool plant.
Solving it with the graphical method of linear programming gives:
The yellow area in the figure above is the feasible region; every coordinate (x1, x2) inside it is assigned label 1 (feasible).
Every coordinate (x1, x2) outside the feasible region is assigned label 0 (not feasible).
In this way, two features are mapped to a single label.
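The mapping from two features to one label can be sketched in a few lines of Python. The constraints below are hypothetical stand-ins chosen for illustration, since the original problem statement is not reproduced in this text; `label_point` is likewise an assumed helper name.

```python
def label_point(x1, x2):
    """Return 1 if (x1, x2) satisfies every constraint (feasible), else 0."""
    # Hypothetical constraints, assumed for illustration only.
    constraints = [
        2 * x1 + x2 <= 10,
        x1 + x2 <= 8,
        x2 <= 7,
        x1 >= 0,
        x2 >= 0,
    ]
    return 1 if all(constraints) else 0


# A few example coordinates and the labels they receive
points = [(1, 2), (4, 5), (2, 7)]
labels = [label_point(x1, x2) for x1, x2 in points]
print(labels)  # [1, 0, 0]
```

Each point carries two features, yet the function returns exactly one label for it, which is the "multiple features, one label" relationship described above.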
Training set and test set
From the graphical solution we can read off the features and labels of the following 10 samples.
Five of them have label 0 and five have label 1; together these 10 samples make up the source data set:
Sample | Feature x1 | Feature x2 | Label |
1 | 1 | 2 | 1 |
2 | 4 | 5 | 0 |
3 | 2 | 1 | 1 |
4 | 4 | 2 | 1 |
5 | 6 | 1 | 0 |
6 | 3 | 3 | 1 |
7 | 5 | 2 | 0 |
8 | 4 | 5 | 0 |
9 | 2 | 7 | 0 |
10 | 2 | 6 | 1 |
As we saw in the introduction, the first two steps of machine learning are:
Learn: build a mathematical model from a large amount of training data so that the machine "learns" a pattern in the data.
Review: run test data through the trained model to verify that the learned pattern is correct.
Here we have only 10 samples. Typically 70% to 80% of the data is used to build the model; this part is called the training set.
The remaining 20% to 30% is used to check whether the trained model assigns the correct labels; this part is called the test set.
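As a quick check of the arithmetic, the number of samples in each part can be computed from the ratio. This is a minimal sketch; `split_sizes` is a hypothetical helper name, not something defined elsewhere in this series.

```python
def split_sizes(n_samples, train_ratio):
    """Return (number of training samples, number of test samples)."""
    n_train = round(n_samples * train_ratio)
    n_test = n_samples - n_train  # everything not used for training
    return n_train, n_test


print(split_sizes(10, 0.8))  # (8, 2)
print(split_sizes(10, 0.7))  # (7, 3)
```

With only 10 samples, an 80% training share leaves just 2 samples for testing.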
In machine learning, the following four variables are conventionally used for the training set, the test set, and their features and labels:
X_train holds the training set features and y_train holds the training set labels.
X_test holds the test set features and y_test holds the test set labels.
Note: X is uppercase because it is a matrix (each sample can have several features); y is lowercase because it is a vector (each sample has exactly one label).
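The matrix-versus-vector distinction shows up directly in the array dimensions. The small sketch below uses the first three samples from the table; the names `X` and `y` simply follow the convention just described.

```python
import numpy as np

# Three samples with two features each: a 2-D array (matrix)
X = np.array([[1, 2], [4, 5], [2, 1]])
# One label per sample: a 1-D array (vector)
y = np.array([1, 0, 1])

print(X.ndim, X.shape)  # 2 (3, 2)
print(y.ndim, y.shape)  # 1 (3,)
```

The first dimension of X and the length of y must agree, because every row of features needs exactly one label.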
Using an 80% training set and a 20% test set, we divide the table above into the following four parts:
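Implementing the training set and test set in Python
The code is as follows:
import numpy as np
# Source data matrix: each row is [x1, x2, label]
S = np.array([[1,2,1],[4,5,0],[2,1,1],[4,2,1],[6,1,0],[3,3,1],[5,2,0],[4,5,0],[2,7,0],[2,6,1]])
print('Source data matrix:\n',S)
# X_train, training set feature matrix: first 8 rows, every column except the last
X_train = S[:8, :-1]
print('X_train (training set feature matrix):\n',X_train)
# y_train, training set label vector: first 8 rows, last column only
y_train = S[:8, -1]
print('y_train (training set label vector):\n',y_train)
# X_test, test set feature matrix: last 2 rows, every column except the last
X_test = S[8:, :-1]
print('X_test (test set feature matrix):\n',X_test)
# y_test, test set label vector: last 2 rows, last column only
y_test = S[8:, -1]
print('y_test (test set label vector):\n',y_test)
Output:
Source data matrix:
 [[1 2 1]
 [4 5 0]
 [2 1 1]
 [4 2 1]
 [6 1 0]
 [3 3 1]
 [5 2 0]
 [4 5 0]
 [2 7 0]
 [2 6 1]]
X_train (training set feature matrix):
 [[1 2]
 [4 5]
 [2 1]
 [4 2]
 [6 1]
 [3 3]
 [5 2]
 [4 5]]
y_train (training set label vector):
 [1 0 1 1 0 1 0 0]
X_test (test set feature matrix):
 [[2 7]
 [2 6]]
y_test (test set label vector):
 [0 1]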
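The split above always takes the first 8 rows for training. A common variation, not used in this chapter, is to shuffle the rows before splitting so that the test set does not depend on the order of the table. Below is a minimal sketch under that assumption; `shuffled_split`, `train_ratio`, and `seed` are hypothetical names introduced here for illustration.

```python
import numpy as np


def shuffled_split(S, train_ratio=0.8, seed=0):
    """Shuffle the rows of S, then split into features/labels for train and test."""
    rng = np.random.default_rng(seed)        # seeded so the split is repeatable
    order = rng.permutation(len(S))          # random ordering of the row indices
    n_train = round(len(S) * train_ratio)
    train, test = S[order[:n_train]], S[order[n_train:]]
    # The last column is the label; the other columns are features
    return train[:, :-1], train[:, -1], test[:, :-1], test[:, -1]


S = np.array([[1, 2, 1], [4, 5, 0], [2, 1, 1], [4, 2, 1], [6, 1, 0],
              [3, 3, 1], [5, 2, 0], [4, 5, 0], [2, 7, 0], [2, 6, 1]])
X_train, y_train, X_test, y_test = shuffled_split(S)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
# (8, 2) (8,) (2, 2) (2,)
```

Which rows end up in the test set depends on the seed, but the shapes match those of the fixed slicing approach.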
Summary
In practice, a label very often depends on several features at once.
The part of the source data used to build the mathematical model is called the training set; it generally accounts for 70% to 80% of the source data.
The part of the source data used to check whether the trained model assigns the correct labels is called the test set; it generally accounts for 20% to 30% of the source data.
In machine learning, the following four variables are conventionally used for the training set, the test set, and their features and labels:
X_train holds the training set features and y_train holds the training set labels.
X_test holds the test set features and y_test holds the test set labels.
Note: X is uppercase because it is a matrix (each sample can have several features); y is lowercase because it is a vector (each sample has exactly one label).