Fundamental Concepts of Machine Learning
Data
- The famous iris data set: https://en.wikipedia.org/wiki/Iris_flower_data_set
Iris setosa, Iris versicolor, Iris virginica
Here is the data for iris:
- The data as a whole is called a data set
- Each row of data is called a sample
- Except for the last column, each column expresses a feature of the sample
- The last column is called the label
The i-th sample row is written $X^{(i)}$, also called the feature vector. The j-th feature value of the i-th sample is written $X^{(i)}_j$. The label of the i-th sample is written $y^{(i)}$.
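The terms above can be illustrated with a minimal Python sketch. It assumes a six-row excerpt of the iris data set (two rows per species) stored as plain tuples; the names `rows`, `X`, `y`, `i` and `j` are illustrative choices, not part of the original notes, and the indices are zero-based.

```python
# Six rows excerpted from the iris data set (the full set has 150 samples).
# Columns: sepal length, sepal width, petal length, petal width (cm), species.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
]

# Every column except the last is a feature; the last column is the label.
X = [list(row[:-1]) for row in rows]  # one feature vector per sample
y = [row[-1] for row in rows]         # one label per sample

i, j = 2, 0  # zero-based sample and feature indices (illustrative)
print("number of samples:", len(X))
print("features per sample:", len(X[0]))
print("feature vector of sample i:", X[i])
print("feature j of sample i:", X[i][j])
print("label of sample i:", y[i])
```

Here `X[i]` plays the role of the feature vector of the i-th sample, `X[i][j]` is its j-th feature value, and `y[i]` is its label.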
To make the features easy to visualize, we keep only the first two: sepal length on the horizontal axis and sepal width on the vertical axis.
Draw the following image:
Each sample is represented as a point in the coordinate system. With three features, the samples can be represented in three-dimensional space. Similarly, with 1000 features, they can be represented in 1000-dimensional space. The space in which the samples are drawn is called the feature space.
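As a small sketch of this idea, the snippet below treats each sample as a point whose number of coordinates equals the number of features, then keeps only the first two coordinates to obtain points that can be drawn in a plane. The three sample rows are taken from the iris data set; the variable names are illustrative assumptions.

```python
# Three iris samples: sepal length, sepal width, petal length, petal width (cm).
samples = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]

# The dimension of the feature space equals the number of features per sample.
dimension = len(samples[0])

# Keeping only the first two features projects each sample onto a 2D plane:
# (sepal length, sepal width) -> (horizontal axis, vertical axis).
points_2d = [(s[0], s[1]) for s in samples]

print("feature space dimension:", dimension)
print("2D points:", points_2d)
```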
Once the sample points are plotted, it is easy to draw a straight line such that the red samples lie on one side and the blue samples lie on the other.
The essence of a classification task is to partition the feature space, and the same holds in high-dimensional spaces.
The iris data has 4 features, so the full analysis takes place in a 4-dimensional feature space.
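The idea of separating two groups with a straight line can be sketched as follows. The line is written as `w[0]*x + w[1]*y + b = 0`, and the sign of the left-hand side tells which side a point falls on. The weights `w` and offset `b` here are hand-picked assumptions that happen to separate these four example points; they are not learned from data and are not taken from the original notes.

```python
# (sepal length, sepal width) of four iris samples, with their species.
points = [
    ((5.1, 3.5), "setosa"),
    ((4.9, 3.0), "setosa"),
    ((7.0, 3.2), "versicolor"),
    ((6.4, 3.2), "versicolor"),
]

# Hand-picked line: length - width - 2.2 = 0 (an assumption for illustration).
w = (1.0, -1.0)
b = -2.2

def side_of_line(point, w, b):
    """Return +1 or -1 depending on which side of the line the point lies."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score > 0 else -1

for point, species in points:
    print(point, species, side_of_line(point, w, b))
```

With these values, both setosa points fall on the negative side and both versicolor points on the positive side. In a feature space with more dimensions, the same sign test applies with one weight per feature.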
Features can be very abstract
- In an image, each pixel is a feature
- A 28*28 image has 28*28=784 features
- A color image has even more features, because each pixel carries several channel values
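The pixel count above can be checked with a short sketch that flattens an image into a feature vector. The all-zero pixel values are placeholders, and the color case assumes three channels per pixel (RGB), which is an assumption not stated in the notes.

```python
height, width = 28, 28

# Grayscale image: one value per pixel (placeholder zeros).
gray = [[0 for _ in range(width)] for _ in range(height)]
gray_features = [pixel for row in gray for pixel in row]

# Color image, assuming 3 channels (R, G, B) per pixel.
color = [[(0, 0, 0) for _ in range(width)] for _ in range(height)]
color_features = [value for row in color for pixel in row for value in pixel]

print("grayscale features:", len(gray_features))
print("color features:", len(color_features))
```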