Python Machine Learning (3): Feature Preprocessing, the Iris Case (Classification), Linear Regression, the Cost Function, Gradient Descent, and Univariate Linear Regression with NumPy and sklearn

K-Nearest Neighbor (KNN) Algorithm

feature preprocessing

Data preprocessing addresses the fact that features come in different scales (dimensions) and that the data contains outliers; the data needs to be transformed into a stable form. Properly processed data trains the model better and reduces errors.

standardization

Standardization of datasets is a common requirement for most machine learning algorithms implemented in scikit-learn. Individual features may perform poorly if they do not look more or less like a standard normal distribution (mean 0 and variance 1). Standardization processes the data so that it is, or is close to, normally distributed.
In practice we often ignore the shape of the feature's distribution and apply the formula $X' = \frac{x - \text{mean}}{\sigma}$: subtracting the mean centers the data (de-meaning), and dividing by the standard deviation scales it.
Disadvantage of not doing this: when there are many features and the variance of one feature is particularly large, that large variance dominates the learning algorithm, causing the results to deviate from our expectations.

sklearn.preprocessing.MinMaxScaler(feature_range=(0,1)...)
MinMaxScaler.fit_transform(X)
X: data in numpy array format [n_samples, n_features]

Data preprocessing is done with the preprocessing module; MinMaxScaler is the class used for normalization (min-max scaling).

  • example:
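A minimal sketch of how MinMaxScaler might be applied; the small array below is made up purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data: 3 samples, 2 features with very different ranges (made up for illustration)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaler = MinMaxScaler(feature_range=(0, 1))  # (0, 1) is also the default range
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# each column is rescaled into [0, 1]:
# [[0.   0. ]
#  [0.5  0.5]
#  [1.   1. ]]
```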
sklearn.perprocessing.StandardScaler()
StandardScaler.fit_transform(X)
X: data in numpy array format [n_samples, n_features]

StandardScaler is also used for standardization. With the default parameters with_mean=True and with_std=True, it centers the data (removes the mean) and scales it by the standard deviation.
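A similar sketch for StandardScaler, again with made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data for illustration
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()            # with_mean=True, with_std=True by default
X_std = scaler.fit_transform(X)
print(scaler.mean_)                  # per-feature mean: [ 2. 20.]
print(X_std)                         # each column now has mean 0 and unit standard deviation
```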

The Iris Case – Classification

Implementation steps (a code sketch of these steps follows the list):

  • Get the data
  • Basic data processing
  • Feature engineering
  • Machine learning (model training)
  • Model evaluation

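A minimal sketch of these steps on the iris dataset; the split ratio, random_state and n_neighbors below are assumptions, not necessarily the values used in the original screenshots:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# 1. get the data
iris = load_iris()

# 2. basic data processing: split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22)

# 3. feature engineering: standardization
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)   # reuse the training-set statistics on the test set

# 4. machine learning: train a KNN classifier
estimator = KNeighborsClassifier(n_neighbors=5)
estimator.fit(x_train, y_train)

# 5. model evaluation
print("accuracy:", estimator.score(x_test, y_test))
```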
The accuracy obtained from model evaluation is 0.767. To build a model with higher accuracy, look for a better n_neighbors; candidate values might be {1, 3, 5, 7, ...}. Looping over these values by hand during training would create many separate classifiers; cross-validation with grid search can automate this.

Cross-validation

Cross-validation, also called rotation estimation, splits the dataset into several parts (for example 10). Each time a model is trained, most of the parts are used for training and a small part is held out for testing, and this is repeated until every part has served as the test set. The results from each round are then averaged to give the returned result.
What you get is an average result, which is more reliable than a single result; for example, the average of ten exam scores is more stable than the score from one occasional exam.
Purpose: To get a reliable and stable model

grid search

Grid search, simply put, means manually specifying the parameter values you want to try for a model; the program then exhaustively runs every combination for you.

cross validation, grid search api

sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None) performs an exhaustive search over the specified parameter values of an estimator
estimator: the estimator object
param_grid: estimator parameters (dict), e.g. {"n_neighbors": [1, 3, 5]}
cv: number of cross-validation folds
fit: input the training data
score: accuracy

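A minimal sketch of cross-validated grid search with KNN on the iris data; the candidate values and cv=5 are assumptions based on the text above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# try several n_neighbors values with 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
estimator = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
estimator.fit(x_train, y_train)

print("best parameter:", estimator.best_params_)
print("best cross-validation score:", estimator.best_score_)
print("test-set accuracy:", estimator.score(x_test, y_test))
```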
Ways to choose parameters:

  • Cross-validation with grid search is only suitable for small data volumes, because the method requires a large amount of computation
  • Random search
  • Bayesian tuning

K Nearest Neighbor Algorithm - Regression

The KNN algorithm can solve not only classification problems but also regression problems. The most commonly used evaluation metric is the mean squared error: the sum of the squared differences between the predicted and actual values, divided by the number of samples: $MSE = \frac{1}{m}\sum_{i=1}^{m}\left(f(x_i)-y_i\right)^2$

Example: analyze house rental data — given a one-bedroom flat, how much rent could it fetch?
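The original rent-prediction code appeared only as screenshots; the sketch below is a stand-in with a small made-up dataset (area in square meters vs. monthly rent), just to illustrate KNeighborsRegressor and the mean squared error:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# made-up training data: house area (square meters) -> monthly rent
x_train = np.array([[30], [45], [60], [75], [90], [110]])
y_train = np.array([2000, 2800, 3500, 4300, 5000, 6000])

model = KNeighborsRegressor(n_neighbors=3)
model.fit(x_train, y_train)

# predict the rent of a one-bedroom flat of about 40 square meters
print("predicted rent:", model.predict([[40]]))

# mean squared error on the training data, just to show the metric
y_pred = model.predict(x_train)
print("MSE:", mean_squared_error(y_train, y_pred))
```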
The value obtained here is larger than the mean squared error from the earlier model evaluation; adding one more feature did not make the model more accurate.
We have the feature data of houses and their prices. After training on the house features, new features are passed to the model to obtain predicted prices. There is an error between the predicted price and the true value, and the goal is to find the best fitting line.

linear regression

Galton found that if the parents are tall, the children tend to be tall, and if the parents are short, the children tend to be short; but if both parents are unusually tall or unusually short, the children's height tends toward the average. Pearson collected the height records of nearly a thousand family members and found that in the group with tall fathers, the average height of the offspring was lower than that of their fathers, while in the group with short fathers, the average height of the offspring was higher than that of their fathers. In this way, the heights of tall and short offspring "regress" toward the average height, returning to the middle. The most basic regression algorithm is linear regression.

Linear regression definition

An analytical method that uses a regression equation (function) to model the relationship between one or more independent variables (eigenvalues) and a dependent variable (target value).
Example: given real-estate data, predict the house price.
The data points roughly regress to a straight line $y = kx + b$, where k is the slope and b is the intercept. If you want to buy a 100-square-meter house, you can roughly predict that the price is about 430,000. The size of the house is the feature, the price is the target (label), and linear regression is a supervised learning algorithm.
Each row is one sample, and there are m samples in total; size is the feature and price is the label.
The linear regression workflow is as follows: the training set is used to train the algorithm, which produces a model; new data is then passed to the model to predict the target value.
$θ_0$ is the intercept and $θ_1$ is the slope; the fitted straight line is a simple linear regression model with a single input variable, also called a univariate (unary) linear regression model. The hypothesis of the model can be written as $h(x) = θ_0 + θ_1 x$.

cost function

Goal: Find the best fitting line that minimizes the error, that is, the minimum cost function.
For each given point there is a true value $y^{(i)}$, and on the line there is a corresponding predicted value $h(x^{(i)})$. The difference $h(x^{(i)}) - y^{(i)}$ may be positive or negative, so the squared errors are summed and divided by the number of sample points to obtain the mean squared error; an extra factor of one half is included for the convenience of later calculations. The formula is:

$$J(θ_0, θ_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2$$

The parameters $θ_0, θ_1$ are unknown; the goal is to find the $θ_0, θ_1$ that minimize $J(θ_0, θ_1)$. To simplify, set the intercept $θ_0 = 0$; the goal then becomes finding only the $θ_1$ that minimizes $J(θ_1)$.
Let us look at the relationship between $θ_1$ and the cost function $J(θ_1)$.
Take three sample points. With $θ_0 = 0$ and $θ_1 = 1$, the hypothesis is $h(x) = x$; substituting into the formula gives $J(θ_1) = 0$, so the cost is 0.

If $θ_0 = 0$ and $θ_1 = 0.5$, then $h(x) = \frac{x}{2}$; substituting into the formula gives a cost of $J(θ_1) \approx 0.58$.

If $θ_0 = 0$ and $θ_1 = 0$, then $h(x) = 0$; the line coincides with the x-axis, and the errors (the vertical distances from the points down to the line) are large.

Each $θ_1$ corresponds to one $h(x)$ function: $θ_1 = 1$ gives the line passing through all three points, $θ_1 = 0.5$ gives a line lying below the three points, and $θ_1 = 0$ gives the line coinciding with the x-axis.
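A small sketch of this worked example, assuming the three sample points are (1, 1), (2, 2) and (3, 3), which is consistent with the numbers above:

```python
import numpy as np

# assumed sample points (1,1), (2,2), (3,3)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def J(theta1):
    # cost function with theta0 fixed at 0, so h(x) = theta1 * x
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for theta1 in (1.0, 0.5, 0.0):
    print(f"theta1 = {theta1}: J = {J(theta1):.2f}")
# theta1 = 1.0: J = 0.00
# theta1 = 0.5: J = 0.58
# theta1 = 0.0: J = 2.33
```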
Adding $θ_0$ back makes the picture three-dimensional, and the value of the cost function is the height of the surface. On the same contour line the values of $J(θ_0, θ_1)$ are equal; each closed curve in the contour plot is one such line, and the smallest circle corresponds to the bottom of the surface.
Pick a point on the contour plot at random, say with $θ_0$ about 800 and $θ_1$ about -1.2. The corresponding line slopes downward and has little to do with the scatter points, so the fit is poor, which means the cost function is not at its minimum; the minimum lies at the center of the contours.
At a point with $θ_0$ about 380 and $θ_1$ equal to 0, the intercept is about 380 and the slope is 0, so the line is parallel to the x-axis; the fit is still poor and the cost function is still not minimal.
If instead the point is near the center, with $θ_0$ about 220 and $θ_1$ about 0.12, the intercept is about 220 and the slope is 0.12, giving an upward-sloping line that fits the data reasonably well; the cost function is then very close to its lowest point, a minimum.
On one contour line the values of $J(θ_0, θ_1)$ are all equal, and the task is to find the $θ_0, θ_1$ that make the error smallest and the fit best. Trying points one by one would take a long time, and with more features the dimensionality grows and the surface can no longer be visualized to help find the answer. The goal is for the program to find the $θ_0, θ_1$ at the minimum of the cost function automatically, and for that we can use the gradient descent method.

Gradient Descent

Gradient descent is an algorithm that minimizes the cost function. It is very commonly used and appears in many fields and algorithms; here we apply it to the cost function to obtain the optimal solution.
Goal: find $θ_0, θ_1$ that minimize $J(θ_0, θ_1)$.
Implementation: at the start we do not know what values $θ_0, θ_1$ should take, so we first initialize them and then keep changing $θ_0, θ_1$ to make $J(θ_0, θ_1)$ smaller, until $J(θ_0, θ_1)$ reaches a minimum or a local minimum, at which point the iteration stops. In essence it is a loop that keeps updating $θ_0, θ_1$ until the optimum is found.
Picture the cost surface as a mountain range, where $J(θ_0, θ_1)$ is the height; finding the minimum of the cost function means finding the bottom of a valley. Walking from a high point toward the lowest point, at every step you look around and re-determine which direction to go, until you reach a local lowest point. Starting off in different directions can lead to different local minima.

Mathematical Applications of Gradient Descent

The update rule is

$$θ_j := θ_j - α\,\frac{\partial}{\partial θ_j} J(θ_0, θ_1), \qquad j = 0, 1$$

By repeatedly changing $θ_0, θ_1$ we keep adjusting the direction until the optimal point is found. The learning rate $α$ controls how far each step of gradient descent moves. When $α$ is large, we walk a long way before re-adjusting the direction, so gradient descent is faster; when $α$ is small, each step is small and gradient descent is slower. The term that follows $α$ is the partial derivative with respect to $θ_j$.

Derivation

During the computation, $θ_0$ and $θ_1$ must be updated simultaneously (compute both new values from the old ones, then assign them together). If one is held fixed while the other changes, one direction has already settled before the other is adjusted, which is not suitable for finding the optimum. Updating both at the same time is equivalent to searching in every direction at once for the optimal point.
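A sketch of what "simultaneous update" means in code; grad0 and grad1 are placeholder functions standing in for the two partial derivatives, and the numbers are arbitrary:

```python
# placeholder gradients, standing in for dJ/dtheta0 and dJ/dtheta1
def grad0(t0, t1):
    return t0 + t1 - 1.0

def grad1(t0, t1):
    return t0 - t1 + 0.5

theta0, theta1, alpha = 0.0, 0.0, 0.1

# correct: both gradients are evaluated at the old (theta0, theta1),
# then both parameters are assigned together
temp0 = theta0 - alpha * grad0(theta0, theta1)
temp1 = theta1 - alpha * grad1(theta0, theta1)
theta0, theta1 = temp0, temp1

# incorrect: theta0 has already been changed by the time grad1 is evaluated
# theta0 = theta0 - alpha * grad0(theta0, theta1)
# theta1 = theta1 - alpha * grad1(theta0, theta1)
```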
Plot $J(θ_1)$ as a curve with $θ_1$ on the x-axis (here $θ_0$ is 0). The derivative is the slope of the curve at a point. If the slope at the current point is positive, then after the subtraction $θ_1$ decreases and moves to the left along the x-axis, walking down the curve and iterating toward the optimum, the lowest point of the $J(θ_1)$ curve.
If instead the slope at the point is negative, subtracting a negative value means $θ_1$ gradually increases and moves to the right along the x-axis, iterating until it reaches the optimal point.

learning rate

When $α$ is relatively small, the step size is small and movement is slow, like walking downhill in tiny steps; it takes many steps to reach the lowest point. With a small learning rate, gradient descent is very slow.
When $α$ is relatively large, the step size is large and the slope has a big effect, so $θ_1$ also changes a lot. If the point is already very close to the optimum and $α$ is large, one step may jump right over the lowest point and land in a region far away on the other side; the value of the cost function then swings back and forth over a large range, making it hard to find the optimal point.
Suppose at the first point the slope is positive, so the point moves to the left along the x-axis. At the second point the slope is still positive but smaller, so the update is smaller and the movement is no longer as fast. At the third point the slope has decreased further, and the step is smaller still.
There is no need to deliberately shrink $α$: because the learning rate is multiplied by the derivative, the step size adjusts itself as the derivative gets smaller near the minimum. Whatever $α$ is chosen, the process still moves toward a local optimum; the choice mainly affects how fast it moves.

By adjusting the slope (derivative) and the learning rate we adjust $θ_j$ and control how $θ_j$ moves toward a minimum point; this is the iterative process of applying gradient descent to the cost function.
Substituting $h(x)$ into the formula, we can differentiate the cost function and take the partial derivatives with respect to $θ_0$ and $θ_1$ separately.
The partial derivative with respect to $θ_0$ is:

$$\frac{\partial}{\partial θ_0} J(θ_0, θ_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)$$

The partial derivative with respect to $θ_1$ is:

$$\frac{\partial}{\partial θ_1} J(θ_0, θ_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right) x^{(i)}$$

Using these two derivatives it is convenient to iterate $θ_0$ and $θ_1$; m is the number of samples, and each sum runs over all m samples.

Non-convex and convex functions

A non-convex function has local optima and can yield several different optimal solutions; a convex function has no local optima other than the single global optimum. With linear regression, gradient descent converges to the global optimum because there is no other optimum to get stuck in.

Implementing univariate linear regression with gradient descent in numpy

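The original NumPy code appeared only as screenshots; below is a minimal sketch of what such an implementation might look like. The file name data.csv, the learning rate and the printing interval are assumptions; only the 50 iterations and the zero initialization are taken from the text.

```python
import numpy as np

# hypothetical data file: one feature column x and one target column y
data = np.genfromtxt("data.csv", delimiter=",")
x, y = data[:, 0], data[:, 1]

lr = 0.0001        # learning rate alpha (assumed)
b, k = 0.0, 0.0    # intercept and slope, initialized to 0
epochs = 50        # number of iterations mentioned in the text
m = float(len(x))

def compute_loss(b, k):
    # cost function: mean squared error with the extra 1/2 factor
    return np.sum((k * x + b - y) ** 2) / (2 * m)

for i in range(epochs + 1):
    if i % 10 == 0:
        print(f"iteration {i}: b = {b:.4f}, k = {k:.4f}, loss = {compute_loss(b, k):.4f}")
    # partial derivatives of the cost with respect to b (theta_0) and k (theta_1)
    grad_b = np.sum(k * x + b - y) / m
    grad_k = np.sum((k * x + b - y) * x) / m
    # simultaneous update of both parameters
    b = b - lr * grad_b
    k = k - lr * grad_k
```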
At iteration 0 the intercept is b = 0, the slope is k = 0, and the loss is 2782.55. After 50 iterations the current best fitting line has b = 0.0305, slope k = 1.478, and loss 56.3248. Too many iterations hurt efficiency, so the number of iterations can be reduced.

Implementing univariate linear regression with sklearn

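A minimal sklearn sketch, assuming the same hypothetical data.csv as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

data = np.genfromtxt("data.csv", delimiter=",")   # hypothetical file, as above
x = data[:, 0].reshape(-1, 1)   # sklearn expects a 2-D feature array
y = data[:, 1]

model = LinearRegression()
model.fit(x, y)                 # fits by least squares, no manual iteration needed
print("intercept b =", model.intercept_)
print("slope k =", model.coef_[0])
```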

sklearn fits the best result directly, without manual gradient-descent iterations.


Origin blog.csdn.net/hwwaizs/article/details/128736477