Machine Learning: Linear Multi-Classification of the Iris Data Set

One, Understanding the Iris data set

  • The Iris data set is a classic multivariate data set introduced by Sir Ronald Fisher in 1936 as an example for discriminant analysis.
  • The data set contains 50 samples of each of three Iris species (Iris setosa, Iris virginica, and Iris versicolor), and each sample has four features: the length and width of the sepals and the length and width of the petals, all in centimeters.
    [Figure: view of the data set]
    [Figure: category descriptions]

Two, LogisticRegression

  1. Logistic Regression
    Logistic Regression handles problems where the dependent variable is a categorical variable. The most common case is the binary (binomial) problem, but it can also handle multi-class problems; it is in fact a classification method.
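    As a quick illustration, a minimal sketch of the idea: the model turns a linear score into a probability with the sigmoid function in the binary case, and with softmax in the multi-class case.

    import numpy as np

    def sigmoid(z):
        # binary case: map a linear score to a probability in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(scores):
        # multi-class case: map one score per class to a probability distribution
        e = np.exp(scores - np.max(scores))    # subtract the max for numerical stability
        return e / e.sum()

    print(sigmoid(0.0))                        # 0.5, i.e. the decision boundary
    print(softmax(np.array([2.0, 1.0, 0.1])))  # three probabilities summing to 1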

  2. Using the LogisticRegression model in sklearn
    ① Import the model

    from sklearn.linear_model import LogisticRegression  # import the logistic regression model
    

    ② Train with fit()
    Call the fit(x, y) method to train the model, where x holds the feature values of the data and y holds the class labels.

    clf = LogisticRegression()
    print(clf)                      # show the model and its parameters
    clf.fit(train_feature, label)   # train on the features and labels
    

    ③ Predict with predict()
    Use the trained model to predict on a data set; it returns the predicted labels.

    predict['label'] = clf.predict(predict_feature)  # e.g. write predictions into a DataFrame column
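    Putting ①–③ together, a minimal runnable sketch on the iris data (train_feature, label, and predict_feature above are placeholder names):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    # split features and labels into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=0)

    clf = LogisticRegression(max_iter=200)  # raise max_iter so the solver converges
    clf.fit(X_train, y_train)               # ② train
    print(clf.predict(X_test[:5]))          # ③ predict the first five test samples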
    
  3. Description of the LogisticRegression model parameters (shown with their default values)

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
    

    ① Regularization parameter (penalty)
    The penalty parameter can take the values "l1" and "l2", corresponding to L1 and L2 regularization respectively; the default is L2 regularization. When tuning, if the main goal is to fight over-fitting, L2 regularization is usually enough. If you choose L2 and the model still over-fits, i.e. prediction is still poor, you can consider L1 regularization. L1 is also useful when the model has many features and you want unimportant feature coefficients driven to zero, making the coefficient vector sparse.
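    A small sketch of the effect (assuming the liblinear solver, which supports both penalties): L1 drives some coefficients exactly to zero, while L2 only shrinks them.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    for penalty in ('l1', 'l2'):
        # a small C means strong regularization, which makes the contrast visible
        clf = LogisticRegression(penalty=penalty, solver='liblinear', C=0.1)
        clf.fit(X, y)
        print(penalty, (clf.coef_ == 0).sum(), 'zero coefficients')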
    ② Optimization algorithm parameter (solver)
    The solver parameter determines which optimization method is used for the logistic regression loss function. There are four algorithms to choose from: liblinear, lbfgs, newton-cg, and sag.
    Note:

    The choice of solver depends on the penalty:
    when penalty is L2 regularization, the available algorithms are {'newton-cg', 'lbfgs', 'liblinear', 'sag'}; when penalty is L1 regularization, only 'liblinear' can be chosen.
    This is because the L1-regularized loss function is not continuously differentiable, while the three optimization algorithms {'newton-cg', 'lbfgs', 'sag'} all require the first or second continuous derivative of the loss function.

    ③ Classification strategy parameter (multi_class)
    The multi_class parameter determines the classification strategy. It can take two values, ovr and multinomial; the default is ovr. The two differ mainly in the multi-class case.
    Note:

    With ovr, all four loss optimization methods (liblinear, newton-cg, lbfgs, and sag) can be chosen. With multinomial, only newton-cg, lbfgs, and sag are available.
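    A sketch comparing the two strategies (assuming an older scikit-learn matching the parameter list above; in recent versions the multi_class parameter is deprecated and multinomial behaviour is the default):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    # one-vs-rest: one binary classifier per class; any of the four solvers works
    ovr = LogisticRegression(multi_class='ovr', solver='liblinear').fit(X, y)
    # multinomial: a single softmax model; requires newton-cg, lbfgs, or sag
    mn = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                            max_iter=200).fit(X, y)
    print(ovr.score(X, y), mn.score(X, y))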

    ④ Class weight parameter (class_weight)
    The class_weight parameter specifies the weights of the classes in the classification model. It can be left out, in which case weights are not considered, i.e. all classes have the same weight. If you do supply it, you can either pass balanced to let the library compute the class weights, or enter the weight of each class yourself. For example, for a binary model with classes 0 and 1, we can set class_weight={0:0.9, 1:0.1}, giving class 0 a weight of 90% and class 1 a weight of 10%. If class_weight is balanced, the library computes the weights from the training sample counts: the more samples a class has, the lower its weight, and the fewer samples, the higher its weight.
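    A sketch of both styles on a synthetic imbalanced binary problem (the 0.9/0.1 weights are the ones from the example above):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # a synthetic binary problem with roughly 90% class 0 and 10% class 1
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    manual = LogisticRegression(class_weight={0: 0.9, 1: 0.1}).fit(X, y)
    auto = LogisticRegression(class_weight='balanced').fit(X, y)  # weights from class frequencies
    print(manual.score(X, y), auto.score(X, y))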
    ⑤ Sample weight parameter (sample_weight)
    Because of sample imbalance, the sample may not be an unbiased estimate of the population, which can reduce the model's predictive ability. You can try to mitigate this by adjusting the sample weights.
    There are two ways to adjust sample weights:
    the first is to use balanced in class_weight;
    the second is to adjust the weight of each sample through sample_weight when calling the fit function.
    Note:

    If both methods are used, the effective weight of a sample is class_weight * sample_weight.
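    A minimal sketch of the second method, passing per-sample weights to fit:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    # up-weight the samples of class 0 three-fold, purely for illustration
    weights = np.where(y == 0, 3.0, 1.0)
    clf = LogisticRegression(max_iter=200)
    clf.fit(X, y, sample_weight=weights)
    print(clf.score(X, y))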

Three, Implementing linear multi-classification

(1) Classification using sepal length and width as features

  1. Import the required packages
    # import the required packages
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    from sklearn import datasets
    from sklearn import preprocessing
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    
  2. Load the data set
    # load the iris data set
    iris=datasets.load_iris()
    # each row has four columns, one per entry of feature_names
    X=iris.data
    print(X)
    # the class label of each row, taking values in [0, 1, 2]
    Y=iris.target
    print(Y)
    
  3. Process the data
    # keep only sepal length and width (the first two columns) as features
    X = X[:, :2]
    # standardize the features
    X = StandardScaler().fit_transform(X)
    print(X)
    
  4. Train the model
    lr = LogisticRegression()   # logistic regression model
    lr.fit(X, Y)                # estimate the model parameters from the data [X, Y]
    
  5. Plot the classification regions
    N, M = 500, 500     # number of sample points along each axis
    x1_min, x1_max = X[:, 0].min(), X[:, 0].max()   # range of column 0
    x2_min, x2_max = X[:, 1].min(), X[:, 1].max()   # range of column 1
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    x1, x2 = np.meshgrid(t1, t2)                    # generate the grid of sample points
    x_test = np.stack((x1.flat, x2.flat), axis=1)   # test points
    
    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    y_hat = lr.predict(x_test)                      # predicted class of each grid point
    y_hat = y_hat.reshape(x1.shape)                 # reshape to match the grid
    plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)    # show the predicted regions
    plt.scatter(X[:, 0], X[:, 1], c=Y.ravel(), edgecolors='k', s=50, cmap=cm_dark)
    plt.xlabel('sepal length')
    plt.ylabel('sepal width')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.grid()
    plt.show()
    
    [Figure: decision regions for the sepal features]
  6. Evaluate the model
    y_hat = lr.predict(X)    # predict on the training data
    Y = Y.reshape(-1)
    result = y_hat == Y      # element-wise comparison with the true labels
    print(y_hat)
    print(result)
    acc = np.mean(result)    # fraction of correct predictions
    print('Accuracy: %.2f%%' % (100 * acc))
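    The same number can be obtained in one call with the model's score method, continuing from the lr, X, and Y defined above:

    print('Accuracy: %.2f%%' % (100 * lr.score(X, Y)))  # mean accuracy on (X, Y)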
    
    [Figure: prediction results and accuracy]

(2) Classification using petal length and width as features

The method is the same as above; only the data-processing step changes, taking the last two feature columns instead:

X = X[:, 2:]   # keep only petal length and width

[Figure: decision regions for the petal features]
[Figure: prediction results and accuracy]
Of course, you can also combine all four feature values and classify in the same way; see the sketch below.
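For reference, a sketch using all four features (the decision regions can no longer be drawn in two dimensions, but the accuracy can still be computed):

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)   # all four standardized features
Y = iris.target

lr = LogisticRegression()
lr.fit(X, Y)
print('Accuracy: %.2f%%' % (100 * np.mean(lr.predict(X) == Y)))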

Four, Summary

By implementing linear multi-classification we have mainly walked through the workflow of logistic regression, which consists of three steps: importing the model, training it, and predicting. Judging from the final predictions, the accuracy of the model is fairly high and meets the requirements. Comparing the two runs, classification on the petal features is more accurate than classification on the sepal features.

