Linear classification of the breast cancer dataset (breastCancer) and the iris dataset (iris) with logistic regression and stochastic gradient descent


Linear Regression and Logistic Regression

Linear regression predicts the value of a continuous variable. It assumes a linear relationship between the dependent variable and the independent variables, and it requires the dependent variable to be a continuous numeric value. The features of the training data serve as the independent variables, and the target value of each training sample serves as the dependent variable. The model is trained on a large number of samples by repeatedly adjusting the weights (the parameters) of the linear function, so that the function gives the expected result for each training example as closely as possible. During training, the squared error, that is, the squared difference between the predicted value and the actual value, is used as the cost function, and the parameters are optimized by continuously reducing the cost toward its minimum. The resulting model is a continuous function that can be drawn as a straight line in two-dimensional space; as the number of features increases, it becomes a hyperplane in the corresponding higher-dimensional space.
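The training loop described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name fit_linear_regression, the learning rate, the iteration count, and the synthetic data are all made up for the example and are not part of the original code.

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.1, n_iters=2000):
    """Fit y ~ X @ w + b by batch gradient descent on the mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y  # prediction minus actual value
        # Gradients of mean(residual**2) with respect to w and b
        grad_w = 2.0 * (X.T @ residual) / n_samples
        grad_b = 2.0 * residual.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data (an assumption for the demo): y = 3*x0 - 2*x1 + 0.5 plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 0.5 + rng.normal(scale=0.01, size=200)

w, b = fit_linear_regression(X, y)
print("weights:", w, "bias:", b)
```

Each pass lowers the squared-error cost a little, which is the "continuously reduce the cost" step in the paragraph above.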
Logistic regression is generally used for classification and expresses how likely something is to happen, because its output lies between 0 and 1 and can be read as a probability. Take binary classification as an example, with one class labeled 0 and the other labeled 1. The ideal mapping would be a step function of the linear score x: y = 0 when x < 0, y = 0.5 when x = 0, and y = 1 when x > 0. Such a function is not continuous, so the sigmoid function is used as a smooth approximation. It makes the relationship between the dependent variable and the independent variables nonlinear and limits the output to the range between 0 and 1. Logistic regression can therefore be seen as linear regression with a sigmoid function applied to its output, and it uses a logarithmic function (the log loss) as the cost function.
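As a small illustration of the two ingredients named above, the following sketch defines a sigmoid and a log loss with NumPy. The helper names sigmoid and log_loss, the clipping constant eps, and the sample scores are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    """Smooth replacement for the step function; maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p, eps=1e-12):
    """Mean logarithmic cost for labels in {0, 1} and predicted probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

scores = np.array([-2.0, 0.0, 2.0])   # linear scores x (made-up values)
probs = sigmoid(scores)               # probabilities of class 1
labels = (probs >= 0.5).astype(int)   # threshold at 0.5 to get a class

print(probs, labels)
print(log_loss(np.array([0, 1, 1]), probs))  # labels that agree with the scores
print(log_loss(np.array([1, 0, 0]), probs))  # labels that disagree: larger cost
```

A score of 0 maps to exactly 0.5, matching the midpoint of the step function in the text, and the cost grows when confident predictions are wrong.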
Linear regression generally solves regression problems, that is, predicting specific continuous values. It is applied to fitting data, and the trained model is then used to predict values that did not appear in the training data.


StandardScaler processing

Processing the dataset with StandardScaler differs from processing it with MinMaxScaler.

StandardScaler standardizes each feature using its mean and variance. Viewing the source code in PyCharm shows the description "Standardize features by removing the mean and scaling to unit variance". The function computes z = (x - u) / s, where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation of the training samples (or one if with_std=False). After this operation each feature is centered around 0 with a variance of 1, which makes the data more concentrated. The parameters of the training set, that is, its mean and variance, can also be used to standardize the test set, so that the feature values of all samples are converted to the same scale. It is commonly used with algorithms that assume normally distributed data, such as regression.
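The formula can be checked by hand on a toy array. This is a minimal sketch; the array values and the variable names demo_train, demo_test, and manual are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for the demo): 3 training samples, 2 features
demo_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
demo_test = np.array([[2.0, 30.0]])

scaler = StandardScaler().fit(demo_train)    # learns u and s from the training set only
train_std = scaler.transform(demo_train)
test_std = scaler.transform(demo_test)       # reuses the training mean and std

# Same computation by hand: z = (x - u) / s
u = demo_train.mean(axis=0)
s = demo_train.std(axis=0)                   # population std (ddof=0), as StandardScaler uses
manual = (demo_train - u) / s

print(np.allclose(train_std, manual))
print(train_std.mean(axis=0), train_std.std(axis=0))
```

The test sample is transformed with the training statistics, so its standardized values need not have mean 0 or variance 1.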
MinMaxScaler rescales each feature of the dataset to a specified range, [0, 1] by default. Viewing the source code in PyCharm shows the description "Transform features by scaling each feature to a given range". The function computes X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)). This is a linear transformation of the original data, and the data is generally normalized to [0, 1]. The method is only appropriate when the data is distributed within a bounded range. It can speed up model convergence and may improve model accuracy, and it is commonly used for neural networks.
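A matching sketch for MinMaxScaler, again on made-up toy values (demo_train, demo_test, and manual are illustrative names), shows the formula and what happens to a test sample that lies outside the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (an assumption for the demo)
demo_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
demo_test = np.array([[4.0, 5.0]])   # deliberately outside the training min/max

scaler = MinMaxScaler(feature_range=(0, 1)).fit(demo_train)
train_mm = scaler.transform(demo_train)
test_mm = scaler.transform(demo_test)

# Same computation by hand: (X - min) / (max - min), per feature
x_min = demo_train.min(axis=0)
x_max = demo_train.max(axis=0)
manual = (demo_train - x_min) / (x_max - x_min)

print(np.allclose(train_mm, manual))
print(test_mm)   # values fall outside [0, 1] because the sample exceeds the training range
```

This is one reason the method suits data with a bounded range: values beyond the training minimum and maximum are mapped outside the target interval.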


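# Preprocess the dataset with StandardScaler (z-score standardization)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)  # learn the mean and standard deviation from the training set only
X_train_scaler = scaler.transform(X_train)
X_test_scaler = scaler.transform(X_test)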

# Alternative: preprocess the dataset with MinMaxScaler (min-max normalization)
# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler().fit(X_train)
# X_train_scaler = scaler.transform(X_train)
# X_test_scaler = scaler.transform(X_test)

Linear classification

The code uses two classification methods, LogisticRegression and SGDClassifier (SGD stands for stochastic gradient descent).

# Train and test with the LogisticRegression classifier
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train_scaler, y_train)
y_pred_lr = lr.predict(X_test_scaler)

# Train and test with the SGDClassifier classifier
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier()
sgd.fit(X_train_scaler, y_train)
y_pred_sgd = sgd.predict(X_test_scaler)
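The snippets above omit the data loading and the train/test split. One way the pieces might fit together is sketched below. The split ratio test_size=0.3, the seed random_state=0, and the use of accuracy_score are assumptions rather than the original script; features 20 and 29 are taken from the results section.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data and keep two feature dimensions so the boundary can be drawn in 2D
X, y = load_breast_cancer(return_X_y=True)
X = X[:, [20, 29]]

# Assumed split settings; the original values are not shown in the post
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training set, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaler = scaler.transform(X_train)
X_test_scaler = scaler.transform(X_test)

lr = LogisticRegression().fit(X_train_scaler, y_train)
sgd = SGDClassifier().fit(X_train_scaler, y_train)

acc_lr = accuracy_score(y_test, lr.predict(X_test_scaler))
acc_sgd = accuracy_score(y_test, sgd.predict(X_test_scaler))
print(f"LogisticRegression accuracy: {acc_lr:.3f}")
print(f"SGDClassifier accuracy: {acc_sgd:.3f}")
```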

Each time the code is run, the position of the linear decision boundary and the classification accuracy are different. The LogisticRegression classifier is trained and tested on the training and test samples produced earlier by the train_test_split function. That function has a random_state parameter, which acts as a random number seed: if it is set to an integer, the split into training set and test set is reproducible. Once this parameter is set, the accuracy of the LogisticRegression classifier stays the same from run to run, but the accuracy of the SGDClassifier still changes. SGDClassifier is a stochastic gradient descent classifier: each update uses a single sample taken in random order from the training set, so the cost function may already fall within the acceptable range before all of the training data has been used. The resulting weights, and therefore the accuracy, differ each time.
This does not mean that the performance of the classifier is unstable, because the accuracy stays within a certain range, and the randomness of the updates can help limit overfitting.
The average accuracy over several runs can be used to judge the performance of the model. A single run may be affected by noise points: depending on how the dataset happens to be split, noisy samples can cause overfitting or lower the measured accuracy.
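The behavior described above can be checked with a short experiment. This is a sketch under assumptions: the split settings, the helper accuracies, and the number of runs are made up, and the random_state argument of SGDClassifier (its own seed, separate from the one in train_test_split) is used here even though the original post does not mention it.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = X[:, [20, 29]]

# Fixed seed: the same training and test sets on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

def accuracies(make_model, n_runs=5):
    """Train a fresh model n_runs times and return the test accuracy of each run."""
    return [
        make_model(i).fit(X_train_s, y_train).score(X_test_s, y_test)
        for i in range(n_runs)
    ]

lr_acc = accuracies(lambda i: LogisticRegression())
sgd_free = accuracies(lambda i: SGDClassifier())                 # unseeded: may vary
sgd_fixed = accuracies(lambda i: SGDClassifier(random_state=0))  # seeded: repeatable
sgd_mean = np.mean(accuracies(lambda i: SGDClassifier(random_state=i)))

print("LogisticRegression:", lr_acc)
print("SGD, unseeded:", sgd_free)
print("SGD, random_state=0:", sgd_fixed)
print("SGD, mean over seeds:", sgd_mean)
```

Averaging over several seeds, as in the last line, gives a steadier estimate of the classifier's performance than any single run.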

Breast cancer dataset (breastCancer)

Screenshots of the runs with higher accuracy follow, using the features at dimensions 20 and 29:

[Figure: classification results using features 20 and 29]

Classification accuracy when using the features at dimensions 20 and 25:

[Figure: classification results using features 20 and 25]

Iris dataset (iris)

When using dimension 0 ('sepal length (cm)') and dimension 2 ('petal length (cm)'), the results are as follows:

[Figure: classification results using sepal length and petal length]

When using dimension 0 ('sepal length (cm)') and dimension 1 ('sepal width (cm)'), the results are as follows:

[Figure: classification results using sepal length and sepal width]

When using dimension 1 ('sepal width (cm)') and dimension 2 ('petal length (cm)'), the results are as follows:

[Figure: classification results using sepal width and petal length]

When using dimension 2 ('petal length (cm)') and dimension 3 ('petal width (cm)'), the results are as follows:

[Figure: classification results using petal length and petal width]
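The four feature pairs above could be compared in a single loop, sketched below. The split settings (test_size=0.3, random_state=0, stratified) and the results dictionary are assumptions for the illustration, and only LogisticRegression is shown; the accuracies it prints are not the ones from the original screenshots.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
pairs = [(0, 2), (0, 1), (1, 2), (2, 3)]  # feature index pairs used in this section

results = {}
for i, j in pairs:
    X = iris.data[:, [i, j]]
    # Assumed split settings; stratify keeps the three classes balanced
    X_train, X_test, y_train, y_test = train_test_split(
        X, iris.target, test_size=0.3, random_state=0, stratify=iris.target
    )
    scaler = StandardScaler().fit(X_train)
    clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
    names = (iris.feature_names[i], iris.feature_names[j])
    results[names] = clf.score(scaler.transform(X_test), y_test)

for names, acc in results.items():
    print(names, f"{acc:.3f}")
```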

Origin blog.csdn.net/qq_48068259/article/details/127891355