One, understanding the Iris flower data set
- The Iris flower data set is a classic multivariate data set introduced by Sir Ronald Fisher in 1936, and it is often used as a sample for discriminant analysis.
- The data set contains 50 samples from each of three Iris species (Iris setosa, Iris virginica and Iris versicolor), 150 in total. Each sample has 4 features: sepal length, sepal width, petal length and petal width, all in centimeters.
View data set
Category description
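The contents of the data set and its categories can be inspected with a short sketch like the one below. It assumes the copy of the data bundled with scikit-learn (loaded through load_iris); the variable name class_counts is only illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Shape of the feature matrix: samples x features
print(iris.data.shape)
# Names of the four features (sepal/petal length and width, in cm)
print(iris.feature_names)
# Names of the three classes; the labels 0, 1, 2 index into this array
print(iris.target_names)
# Number of samples in each class
class_counts = np.bincount(iris.target)
print(class_counts)
```
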
Two, LogisticRegression
- Logistic regression explained
Logistic regression handles regression problems in which the dependent variable is categorical. The most common case is a binary (binomial) outcome, and it can also handle multi-class problems. Despite its name, it is actually a classification method.
- Using the LogisticRegression model in sklearn
① Import the model
    from sklearn.linear_model import LogisticRegression  # import the logistic regression model
② fit() training
Call the fit(x, y) method to train the model, where x holds the features of the data and y holds the class labels.
    clf = LogisticRegression()
    print(clf)
    clf.fit(train_feature, label)
③ predict() prediction
Use the trained model to predict on a data set and return the predicted labels.
    predict['label'] = clf.predict(predict_feature)
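The three steps above can be put together in a minimal sketch. It assumes the Iris data as the training set, and the two rows in predict_feature are hypothetical measurements chosen only for illustration; max_iter is raised as a precaution because the features are not scaled here.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
train_feature = iris.data   # features: one row per flower
label = iris.target         # class labels 0, 1, 2

# Step 1 and 2: create the model and train it
clf = LogisticRegression(max_iter=1000)
clf.fit(train_feature, label)

# Step 3: predict the class of new measurements (hypothetical values, in cm)
predict_feature = np.array([
    [5.1, 3.5, 1.4, 0.2],   # short, narrow petals
    [6.7, 3.0, 5.2, 2.3],   # long, wide petals
])
predicted = clf.predict(predict_feature)
print(predicted)
print(iris.target_names[predicted])
```
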
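- LogisticRegression model parameter description
The signature below, with its default values, comes from the older scikit-learn release used in this article; since version 0.22 the default solver is 'lbfgs'.
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)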
① Regularization selection parameter (penalty)
The penalty parameter can be "l1" or "l2", corresponding to L1 and L2 regularization respectively; the default is L2. When tuning, if the main goal is to reduce over-fitting, L2 regularization is generally enough. If the model still over-fits with L2, that is, it still predicts poorly on new data, consider L1 regularization. L1 regularization is also useful when the model has many features and you want the coefficients of unimportant features driven to zero, so that the coefficient vector becomes sparse.
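The sparsity effect of L1 regularization can be seen in a small sketch. It assumes a synthetic binary data set from make_classification (not the Iris data) and an arbitrary regularization strength C=0.05; depending on the scikit-learn version, passing penalty may emit a deprecation warning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem: 20 features, only 3 of them informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Same regularization strength for both models; liblinear supports l1 and l2
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print('zero coefficients with L1:', n_zero_l1)
print('zero coefficients with L2:', n_zero_l2)
```

L2 shrinks coefficients toward zero but rarely makes them exactly zero, so the L1 count is the one to watch when feature selection is the goal.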
② Optimization algorithm selection parameter (solver)
The solver parameter determines how the logistic regression loss function is optimized. There are 4 algorithms to choose from: liblinear, lbfgs, newton-cg, sag.
Note:
Choosing the solver
With L2 regularization, any of {'newton-cg', 'lbfgs', 'liblinear', 'sag'} can be used. With L1 regularization, only 'liblinear' can be used out of these four.
This is because the L1-regularized loss function is not continuously differentiable (the L1 term has no derivative at zero), while the three algorithms {'newton-cg', 'lbfgs', 'sag'} all require continuous first or second derivatives of the loss function.
③ Classification method selection parameter (multi_class)
The multi_class parameter determines the classification scheme. There are two values to choose from, ovr (one-vs-rest) and multinomial, and the default is ovr. The two differ mainly for multi-class logistic regression.
Note:
If ovr is selected, all four optimization methods liblinear, newton-cg, lbfgs and sag can be used. If multinomial is selected, only newton-cg, lbfgs and sag can be used.
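The two schemes can be compared with the sketch below. Because the multi_class argument has been deprecated in recent scikit-learn releases, this sketch does not pass it; instead it assumes OneVsRestClassifier as a stand-in for ovr, and relies on recent releases fitting a multinomial model when LogisticRegression with the lbfgs solver sees more than two classes (older releases default to ovr here).

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target

# One-vs-rest: one binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression(solver='lbfgs')).fit(X, y)
# A single model over all three classes
single = LogisticRegression(solver='lbfgs').fit(X, y)

print('binary models inside OvR:', len(ovr.estimators_))
print('coefficient matrix of the single model:', single.coef_.shape)
print('OvR training accuracy: %.2f%%' % (100 * ovr.score(X, y)))
print('single-model training accuracy: %.2f%%' % (100 * single.score(X, y)))
```
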
④ Class weight parameter (class_weight)
The class_weight parameter sets the weight of each class in the classification model. It can be left unset, in which case weights are not considered, that is, all classes have the same weight. If you set it, you can either choose balanced to let the library compute the class weights, or supply the weight of each class yourself. For example, for a binary model with classes 0 and 1, class_weight={0:0.9, 1:0.1} gives class 0 a weight of 90% and class 1 a weight of 10%. With class_weight='balanced', the library computes the weights from the training sample counts: the more samples a class has, the lower its weight, and the fewer samples it has, the higher its weight.
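How balanced turns sample counts into weights can be checked with a small sketch. It assumes toy labels with a 90/10 split and uses scikit-learn's compute_class_weight helper; the manual line applies the formula n_samples / (n_classes * count_per_class) given in the scikit-learn documentation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels: 90 samples of class 0, 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights as computed by the library for class_weight='balanced'
balanced = compute_class_weight(class_weight='balanced', classes=classes, y=y)
# The same weights computed by hand from the sample counts
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), balanced.tolist())))
print(manual)
```

The rarer class 1 receives the larger weight, which matches the description above.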
⑤ Sample weight parameter (sample_weight)
When the classes are imbalanced, the training sample is not an unbiased estimate of the overall population, which may reduce the model's predictive ability. Adjusting the sample weights is one way to address this.
There are two ways to adjust the sample weights:
The first is to use balanced in class_weight.
The second is to pass a weight for each sample through sample_weight when calling the fit function.
Note:
If both methods are used, the effective weight of a sample is class_weight*sample_weight.
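The product rule in the note can be checked numerically. The sketch below assumes a synthetic binary data set and randomly drawn, hypothetical per-sample weights; it fits one model with both kinds of weight and another with only sample_weight pre-multiplied by each sample's class weight, then compares the coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.RandomState(0)
sample_weight = rng.uniform(0.5, 2.0, size=len(y))  # hypothetical per-sample weights
class_weight = {0: 0.9, 1: 0.1}

# Model A: class_weight and sample_weight used together
both = LogisticRegression(class_weight=class_weight, solver='lbfgs')
both.fit(X, y, sample_weight=sample_weight)

# Model B: only sample_weight, already multiplied by the class weight of each sample
combined = sample_weight * np.array([class_weight[label] for label in y])
product = LogisticRegression(solver='lbfgs')
product.fit(X, y, sample_weight=combined)

# Largest difference between the two sets of coefficients
max_diff = np.abs(both.coef_ - product.coef_).max()
print(max_diff)
```
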
Three, implementing linear multi-class classification
(1) Classify using sepal length and width as features
- Import related packages
    # import the required packages
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    from sklearn import datasets
    from sklearn import preprocessing
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
- Get the data set
    # load the required data set
    iris = datasets.load_iris()
    # one row per sample, four columns in total; each column corresponds to an entry in feature_names
    X = iris.data
    print(X)
    # class label of each row, taking values in [0, 1, 2]
    Y = iris.target
    print(Y)
- Process the data
    # keep only the first two columns (sepal length and width), then standardize them
    X = X[:, :2]
    X = StandardScaler().fit_transform(X)
    print(X)
- Train the model
    lr = LogisticRegression()  # logistic regression model
    lr.fit(X, Y)  # fit the regression parameters to the data [X, Y]
- Draw the classification regions
    N, M = 500, 500  # number of sample points along each axis
    x1_min, x1_max = X[:, 0].min(), X[:, 0].max()  # range of column 0
    x2_min, x2_max = X[:, 1].min(), X[:, 1].max()  # range of column 1
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    x1, x2 = np.meshgrid(t1, t2)  # grid of sample points
    x_test = np.stack((x1.flat, x2.flat), axis=1)  # test points
    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    y_hat = lr.predict(x_test)  # predicted class of every grid point
    y_hat = y_hat.reshape(x1.shape)  # reshape to match the grid
    plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)  # show the predicted regions
    plt.scatter(X[:, 0], X[:, 1], c=Y.ravel(), edgecolors='k', s=50, cmap=cm_dark)
    plt.xlabel('sepal length')
    plt.ylabel('sepal width')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.grid()
    plt.show()
Draw result
- Predict with the model
    y_hat = lr.predict(X)
    Y = Y.reshape(-1)
    result = y_hat == Y
    print(y_hat)
    print(result)
    acc = np.mean(result)
    print('Accuracy: %.2f%%' % (100 * acc))
Forecast result
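The accuracy above is measured on the same data the model was trained on. As a complementary sketch, the Pipeline imported earlier can be combined with a held-out split; the 70/30 split, the random_state value and the step names 'scale' and 'lr' are assumptions made for this illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data[:, :2]  # sepal length and width
Y = iris.target

# Keep 30% of the samples aside for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0, stratify=Y)

# Standardization is fitted on the training part only, then the model is trained
model = Pipeline([
    ('scale', StandardScaler()),
    ('lr', LogisticRegression()),
])
model.fit(X_train, Y_train)

train_acc = model.score(X_train, Y_train)
test_acc = model.score(X_test, Y_test)
print('training accuracy: %.2f%%' % (100 * train_acc))
print('held-out accuracy: %.2f%%' % (100 * test_acc))
```
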
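(2) Classify using petal length and width as features
The method is the same as above. Only the data processing changes: take the last two features instead of the first two (then standardize as before), and label the plot axes 'petal length' and 'petal width'.
    X = iris.data[:, 2:]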
Plotting results
Predicting results
Of course, you can also use all four features together (in the data processing step) and classify in the same way, although the two-dimensional region plot no longer applies in that case.
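The three choices of features can be compared side by side with a sketch like the following. It assumes the same steps as above (standardize, fit, score on the training data); the dictionary names feature_sets and accuracy are only illustrative.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
Y = iris.target

feature_sets = {
    'sepal': iris.data[:, :2],     # sepal length and width
    'petal': iris.data[:, 2:],     # petal length and width
    'all four': iris.data,         # every feature
}

accuracy = {}
for name, X in feature_sets.items():
    X = StandardScaler().fit_transform(X)
    lr = LogisticRegression().fit(X, Y)
    accuracy[name] = lr.score(X, Y)  # accuracy on the training data, as in the steps above
    print('%s: %.2f%%' % (name, 100 * accuracy[name]))
```
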
Four, summary
By implementing linear multi-class classification, we have mainly learned the workflow for using logistic regression. It consists of three steps: importing the model, training the model, and predicting the results. Judging from the final predictions, the accuracy of the model is relatively high. Comparing the two results, classification by petal features is more accurate than classification by sepal features.