Starfruit Python Machine Learning 3: Single and Multiple Features, Training Set and Test Set

My CSDN blog column: https://blog.csdn.net/yty_7

Github address: https://github.com/yot777/

 

Single and multiple features

In the labels-and-features example of the previous section, each label corresponded to a single feature:

Sample  Feature (height, in meters)  Label
A       1.51                         0
B       1.61                         1
C       1.76                         1
D       2.1                          1
E       1.58                         0
F       1.68                         1

In real life, however, a label often depends on several features at once. Take the well-known linear programming problem as an example:

In this problem there are two features, x1 and x2, which together determine the sales profit of the machine tool plant.

Using the graphical method of linear programming, we get:

The yellow part of the figure above is the feasible region. Every coordinate (x1, x2) inside the feasible region is given label 1 (feasible).

Every coordinate (x1, x2) outside the feasible region is given label 0 (not feasible).

In this way, multiple features are matched to a single label.
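The labeling rule described above can be sketched in a few lines of Python. The constraints of the original problem are not reproduced in this text, so the inequalities used here (2*x1 + x2 <= 10, x1 + x2 <= 8, x2 <= 7, with x1, x2 >= 0) are assumptions chosen only for illustration, and `is_feasible` is a hypothetical helper name:

```python
def is_feasible(x1, x2):
    """Return 1 if (x1, x2) satisfies every assumed constraint, else 0."""
    # NOTE: these constraints are illustrative assumptions,
    # not taken from the article itself.
    constraints = [
        2 * x1 + x2 <= 10,
        x1 + x2 <= 8,
        x2 <= 7,
        x1 >= 0,
        x2 >= 0,
    ]
    return int(all(constraints))


# Two features go in, one label comes out
points = [(1, 2), (4, 5), (6, 1)]
labels = [is_feasible(x1, x2) for x1, x2 in points]
print(labels)  # prints [1, 0, 0]
```

Whatever the real constraints are, the idea is the same: a point gets label 1 only when all constraints hold at once, so both features are needed to decide the label.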

Training set and test set

From the results of the graphical method, we can obtain the features and labels of the following 10 samples.

Five of them have label 0 and five have label 1; together these 10 samples make up the source data set:

Sample  Feature x1  Feature x2  Label
1       1           2           1
2       4           5           0
3       2           1           1
4       4           2           1
5       6           1           0
6       3           3           1
7       5           2           0
8       4           5           0
9       2           7           0
10      2           6           1

 

As we learned in the introduction, the first two steps of machine learning are:

Learning the knowledge: mathematical modeling on a large amount of training data lets the machine "learn" a pattern in the data.

Reviewing in time: the test data is run through the model that has been built to verify whether the learned pattern is correct.

Here we have only 10 samples. Generally, 70% to 80% of the data is used for mathematical modeling; this part of the data is called the training set.

The remaining 20% to 30% of the data is used to verify whether the model labels the data correctly; this part of the data is called the test set.
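As a small illustration of these percentages, the sizes of the two sets can be computed from the number of samples and a training ratio. This is only a sketch; `split_sizes` and its `train_ratio` parameter are hypothetical names introduced here:

```python
def split_sizes(n_samples, train_ratio=0.8):
    """Return (n_train, n_test) for a given training ratio."""
    # Round to the nearest whole sample; the rest goes to the test set
    n_train = round(n_samples * train_ratio)
    return n_train, n_samples - n_train


print(split_sizes(10))       # (8, 2)
print(split_sizes(10, 0.7))  # (7, 3)
```

With the 10 samples of this section, an 80% ratio gives 8 training samples and 2 test samples.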

In machine learning, the following 4 variables are conventionally used for the training set, the test set, and their features and labels:

X_train is the training set features, and y_train is the training set labels.

X_test is the test set features, and y_test is the test set labels.

Note: X is uppercase because it is a matrix (one sample can have multiple features); y is lowercase because it is a vector (one sample has only one label).
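The matrix/vector distinction shows up directly in the array shapes. A minimal sketch, assuming NumPy and using the first three samples of the table above:

```python
import numpy as np

# Features: one row per sample, one column per feature (x1, x2)
X = np.array([[1, 2], [4, 5], [2, 1]])
# Labels: one value per sample
y = np.array([1, 0, 1])

print(X.shape)  # (3, 2) -> 2-D matrix: 3 samples, 2 features
print(y.shape)  # (3,)   -> 1-D vector: 3 labels
```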

Following the principle of 80% training set and 20% test set, we divide the table above into 4 parts: samples 1 to 8 provide X_train and y_train, and samples 9 and 10 provide X_test and y_test.

Python implementation of the training set and test set

The code is as follows:

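import numpy as np
# Source data matrix: each row is [x1, x2, label]
S = np.array([[1,2,1],[4,5,0],[2,1,1],[4,2,1],[6,1,0],[3,3,1],[5,2,0],[4,5,0],[2,7,0],[2,6,1]])
print('Source data matrix:\n',S)
# X_train, training set feature matrix: first 8 rows, every column except the last
X_train = S[:8, :-1]
print('X_train (training set feature matrix):\n',X_train)
# y_train, training set label vector: first 8 rows, last column
y_train = S[:8, -1]
print('y_train (training set label vector):\n',y_train)
# X_test, test set feature matrix: remaining rows, every column except the last
X_test = S[8:, :-1]
print('X_test (test set feature matrix):\n',X_test)
# y_test, test set label vector: remaining rows, last column
y_test = S[8:, -1]
print('y_test (test set label vector):\n',y_test)

Output:

Source data matrix:
 [[1 2 1]
 [4 5 0]
 [2 1 1]
 [4 2 1]
 [6 1 0]
 [3 3 1]
 [5 2 0]
 [4 5 0]
 [2 7 0]
 [2 6 1]]
X_train (training set feature matrix):
 [[1 2]
 [4 5]
 [2 1]
 [4 2]
 [6 1]
 [3 3]
 [5 2]
 [4 5]]
y_train (training set label vector):
 [1 0 1 1 0 1 0 0]
X_test (test set feature matrix):
 [[2 7]
 [2 6]]
y_test (test set label vector):
 [0 1]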
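The split above takes the first 8 rows as the training set and the last 2 rows as the test set. A common variation, which the original code does not use, is to shuffle the rows before splitting so that the result does not depend on the order of the table. A minimal sketch, assuming NumPy's `default_rng` with an arbitrary seed of 0:

```python
import numpy as np

# Same source data matrix as above: each row is [x1, x2, label]
S = np.array([[1, 2, 1], [4, 5, 0], [2, 1, 1], [4, 2, 1], [6, 1, 0],
              [3, 3, 1], [5, 2, 0], [4, 5, 0], [2, 7, 0], [2, 6, 1]])

rng = np.random.default_rng(0)         # fixed seed (arbitrary) for repeatability
shuffled = S[rng.permutation(len(S))]  # rows in random order, labels stay attached

n_train = 8                            # 80% of the 10 samples
X_train, y_train = shuffled[:n_train, :-1], shuffled[:n_train, -1]
X_test, y_test = shuffled[n_train:, :-1], shuffled[n_train:, -1]

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
# (8, 2) (8,) (2, 2) (2,)
```

Because whole rows are shuffled before the columns are separated, each feature pair keeps its own label.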

Summary

In real life, a label often depends on several features at once.

The part of the source data used for mathematical modeling is called the training set; it generally accounts for 70% to 80% of the source data.

The part of the source data used to verify whether the model labels the data correctly is called the test set; it generally accounts for 20% to 30% of the source data.

In machine learning, the following 4 variables are conventionally used for the training set, the test set, and their features and labels:

X_train is the training set features, and y_train is the training set labels.

X_test is the test set features, and y_test is the test set labels.

Note: X is uppercase because it is a matrix (one sample can have multiple features); y is lowercase because it is a vector (one sample has only one label).

 


If you found this chapter helpful, feel free to follow, comment, and like! Follows and Stars on Github are welcome too!


Origin blog.csdn.net/yty_7/article/details/105038648