Classification practice with the LogisticRegression algorithm in Python

Basic concepts

Let me briefly introduce two concepts in machine learning

1. Loss function

The loss function is one of the most basic and critical elements in machine learning; its role is to measure how good the model's predictions are.
Let's use a simple example to illustrate it:

Suppose we model a company's sales and obtain both the actual values and the model's predictions. The gap between the two is what the loss function measures; it can be expressed, for example, with the absolute loss function:

L(Y, f(X)) = |Y - f(X)|, i.e. the absolute value of the difference between the actual value Y and the prediction f(X)

Different models use different loss functions. For example, the squared loss function can be used instead of the absolute loss function:

L(Y, f(X)) = (Y - f(X))^2, i.e. the square of the difference between the actual value Y and the prediction f(X)

The loss function is a good tool for reflecting the gap between the model and the actual data, and understanding it makes it easier to follow the optimization tools introduced later (gradient descent, etc.). In many complicated problems, the hardest part is often writing a suitable loss function.
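To make this concrete, here is a minimal sketch (with made-up sales figures, not data from this article) that computes both losses with numpy:

import numpy as np

y_actual = np.array([120.0, 135.0, 150.0, 160.0])  # hypothetical actual monthly sales Y
y_pred = np.array([118.0, 140.0, 143.0, 166.0])    # hypothetical model predictions f(X)

absolute_loss = np.abs(y_actual - y_pred)   # L(Y, f(X)) = |Y - f(X)|
squared_loss = (y_actual - y_pred) ** 2     # L(Y, f(X)) = (Y - f(X))^2

print("mean absolute loss:", absolute_loss.mean())
print("mean squared loss:", squared_loss.mean())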

2. Regularization

Regularization means placing explicit constraints on the model to avoid overfitting. The specific principles of regularization are not described here; interested readers can refer to the article "Intuitive understanding of the L1 and L2 regularization terms in machine learning".

Introduction to the algorithm

LogisticRegression

The prediction formula of the model is as follows:

y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + … + w[p]*x[p] + b > 0

This formula looks very similar to the linear regression formula. Although LogisticRegression has "regression" in its name, it is a classification algorithm, not a regression algorithm, and should not be confused with LinearRegression. In this formula we do not return the weighted sum of the features directly; instead we threshold the prediction at 0: if the value is less than 0 we predict class -1, and if it is greater than 0 we predict class +1. This prediction rule is common to all linear models used for classification.
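As a minimal sketch of this decision rule (with made-up weights w, intercept b and a single sample x, not a trained model):

import numpy as np

w = np.array([0.4, -1.2, 0.7])   # hypothetical weights w[0..p]
b = 0.05                         # hypothetical intercept
x = np.array([1.0, 0.3, 2.0])    # one sample with p+1 = 3 feature values

score = np.dot(w, x) + b         # w[0]*x[0] + w[1]*x[1] + ... + w[p]*x[p] + b
label = 1 if score > 0 else -1   # threshold at 0: > 0 means class +1, otherwise class -1
print(score, label)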

Data Sources

Fetal Health Classification: https://www.kaggle.com/andrewmvd/fetal-health-classification


The data contains fetal heart rate, fetal movement, uterine contraction and other feature values taken from cardiotocography exams, and what we need to do is classify fetal health (fetal_health) from these feature values.


The data set contains 2126 records of features extracted from cardiotocogram (CTG) exams, which were then classified by three expert obstetricians into 3 classes, represented by numbers: 1 - normal, 2 - suspect, 3 - pathological.

Data mining

1. Import third-party libraries

import pandas as pd
import numpy as np
import winreg
from sklearn.model_selection import train_test_split  # for splitting the data into training and test sets
from sklearn.linear_model import LogisticRegression   # the algorithm module
from sklearn.metrics import accuracy_score            # the scoring module

As usual, the first step is to import, in turn, each module needed for modeling.

2. Read the file

import winreg
real_address = winreg.OpenKey(winreg.HKEY_CURRENT_USER, r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders')
file_address = winreg.QueryValueEx(real_address, "Desktop")[0]
file_address += '\\'
file_origin = file_address + "源数据-分析\\fetal_health.csv"  # absolute path of the source data file in a folder on the desktop (fetal_health.csv as downloaded from the Kaggle link below)
health = pd.read_csv(file_origin)  # https://www.kaggle.com/andrewmvd/fetal-health-classification

Because every time you download data you would otherwise have to move the file into the Python root directory or read it from the download folder, which is troublesome, I build an absolute path to the desktop through the winreg library. That way I only need to download the data to the desktop, or paste it into a specific folder on the desktop, to read it, and it won't get mixed up with other data.
In fact, up to this step we are just going through the motions; basically every data mining project starts like this, so there is nothing much to say.
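Before cleaning, a quick sanity check of the dataset described earlier (2126 records with a fetal_health target taking the values 1, 2 and 3); this is just a sketch, and the exact output depends on your copy of the data:

print(health.shape)                           # number of rows and columns
print(health["fetal_health"].value_counts())  # how many records fall into each of the 3 classes
print(health.head())                          # first few records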

3. Clean the data

Find missing values:
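A minimal way to run this check on the health DataFrame read above:

print(health.isnull().sum())          # number of missing values per column
print(health.isnull().values.any())   # True if any value anywhere is missing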
From the results of this check, there are no missing values in the data.

4. Modeling

train = health.drop(["fetal_health"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(train, health["fetal_health"], random_state=1)
### A random seed is fixed so that, if further operations are needed later, train and test stay the same split

The columns are split into features and the target value, and the data is divided into a training set and a test set.

logistic = LogisticRegression(penalty='l2', C=1, solver='lbfgs', max_iter=1000)
logistic.fit(X_train, y_train)
print("Logistic training set score: " + str(accuracy_score(y_train, logistic.predict(X_train))))
print("Logistic test set score: " + str(accuracy_score(y_test, logistic.predict(X_test))))

We introduce the LogisticRegression algorithm, set its parameters in turn, fit the model, and then score the accuracy on the training and test sets.

The accuracy of the model comes out to about 88%.

5. Parameter explanation

Here we only explain a few important parameters; readers can explore the others on their own.

LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

1. penalty: the choice of regularization term. There are two main types of regularization, L1 and L2, with L2 selected by default. (Compared with ridge and lasso, the two linear algorithms whose regularization is fixed in advance, the free choice of regularization is a major advantage of this algorithm.)

2. C: the inverse of the regularization strength (regularization coefficient λ); it must be a positive floating-point number. As with support vector machines, smaller values specify stronger regularization; the default is 1.0.

3. solver: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'. Default: 'liblinear' in older scikit-learn versions ('lbfgs' from version 0.22 onward).
liblinear: implemented with the open-source liblinear library; it uses coordinate descent internally to iteratively optimize the loss function.
lbfgs: a quasi-Newton method that uses the second-derivative matrix of the loss function, i.e. the Hessian, to iteratively optimize it.
newton-cg: also a member of the Newton family; it likewise uses the Hessian of the loss function for iterative optimization.
sag: stochastic average gradient descent, a variant of gradient descent. Unlike ordinary gradient descent, it uses only part of the samples to compute the gradient in each iteration, which is suitable when there is a lot of sample data.
saga: a stochastic optimization variant of sag with linear convergence that also supports the L1 penalty.

4. max_iter: the maximum number of iterations allowed for the solver to converge. (Larger values give the solver more room to converge; once the solver has converged, increasing it further makes no difference.)
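To see how these parameters are combined in practice, here is a rough sketch (not from the original post) that compares several solvers on the train/test split built earlier; sag and saga may emit a ConvergenceWarning on unscaled data like this:

for s in ['newton-cg', 'lbfgs', 'sag', 'saga']:   # all of these support the default L2 penalty
    model = LogisticRegression(penalty='l2', C=1, solver=s, max_iter=5000)
    model.fit(X_train, y_train)
    print(s, accuracy_score(y_test, model.predict(X_test)))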

6. Tuning

1. When using this algorithm for classification, a ConvergenceWarning occasionally appears, which roughly means that the maximum number of iterations was not enough for the chosen solver to converge. In that case, besides switching to another solver, we can also increase the maximum number of iterations to solve the problem.
After increasing max_iter, not only does the warning disappear, but the accuracy of the model also improves compared with the previous run. Of course, increasing the maximum number of iterations will not improve the accuracy indefinitely: once it is large enough for the chosen solver to converge, the accuracy stops improving.
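A sketch of how to check this programmatically rather than reading the warning off the console; it reuses the split from the modeling step, and the exact scores will differ from the screenshots:

import warnings
from sklearn.exceptions import ConvergenceWarning

for n in (100, 1000, 5000):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", ConvergenceWarning)
        model = LogisticRegression(penalty='l2', C=1, solver='lbfgs', max_iter=n).fit(X_train, y_train)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    print("max_iter =", n,
          "| converged:", not warned,
          "| test accuracy:", accuracy_score(y_test, model.predict(X_test)))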

2. The parameter C is the inverse of the regularization strength: the smaller the value, the stronger the regularization.
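A quick way to observe this is to sweep C and compare training and test accuracy (a sketch reusing the same split; exact numbers will differ from the screenshot):

for C in (0.001, 0.01, 0.1, 1, 10, 100):
    model = LogisticRegression(penalty='l2', C=C, solver='lbfgs', max_iter=1000).fit(X_train, y_train)
    print("C =", C,
          "train:", accuracy_score(y_train, model.predict(X_train)),
          "test:", accuracy_score(y_test, model.predict(X_test)))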
3. The solver parameter

I think the biggest advantage of the solver parameter is that a single parameter lets you distinguish binary from multi-class problems and pick an optimizer suited to each.
Briefly, the two types of problem: deciding whether a girl likes you, where the only possible answers are "likes" and "does not like", is a binary classification problem. Of course that feels a bit blunt; pure hope or pure despair is not good for our physical and mental health. If instead she can tell me that she likes me a little, is indifferent, or does not like me at all, that is a multi-class problem, and compared with the previous one it may call for a different solution.

Model the binary classification problem as follows:

health = health[health["fetal_health"] != 3]  # drop class 3 and keep only 1 and 2, turning the earlier multi-class problem into a binary one
train = health.drop(["fetal_health"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(train, health["fetal_health"], random_state=1)
logistic = LogisticRegression(penalty='l1', C=1, solver='liblinear')
logistic.fit(X_train, y_train)
print("Logistic training set score: " + str(accuracy_score(y_train, logistic.predict(X_train))))
print("Logistic test set score: " + str(accuracy_score(y_test, logistic.predict(X_test))))

Running this gives the training and test scores for the binary model.
To summarize the applicable scope of the solvers: liblinear supports the L1 and L2 penalties but handles multi-class problems only one-vs-rest; newton-cg, lbfgs and sag support only the L2 penalty but can optimize a true multinomial (multi-class) model; saga supports L1, L2 and elasticnet and also handles the multinomial case.
One thing to explain here is that the solver settings that can handle multi-class problems can also be used for binary classification, but the effect is usually not as good.
Compared with the model built with the previous settings, the accuracy of the model has decreased.
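For reference, a sketch of what that comparison might look like in code, applying a solver from the multinomial-capable group (lbfgs) to the same binary data prepared above; the exact scores depend on your data and scikit-learn version:

logistic_lbfgs = LogisticRegression(penalty='l2', C=1, solver='lbfgs', max_iter=1000)
logistic_lbfgs.fit(X_train, y_train)
print("lbfgs training set score: " + str(accuracy_score(y_train, logistic_lbfgs.predict(X_train))))
print("lbfgs test set score: " + str(accuracy_score(y_test, logistic_lbfgs.predict(X_test))))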

Summary

So far we have used several linear algorithms (lasso, ridge, least squares). The differences between these algorithms come down to the following two points:

1. How a particular combination of coefficients (w) and intercept (b) is judged to fit the training data well or badly (the loss function).
2. Whether to use regularization, and which regularization method to use.

Different algorithms use different ways of measuring how well the model fits the training data. For technical mathematical reasons, it is not possible to adjust w and b so as to directly minimize the number of misclassifications the algorithm produces. So for our purposes, and for many applications, the choice made in the first point above is often not that important.

One advantage of linear models is that training is very fast, and prediction is also very fast. The model scales to very large data sets and also works well with sparse data. If your data contains hundreds of thousands or even millions of samples, it is worth looking at the solver='sag' option of LogisticRegression and Ridge, which can be faster than the default on large data.

Another advantage of linear models is that it is relatively easy to understand how a prediction is made, using the formulas we have seen for classification and regression. Unfortunately, it is often not entirely clear why the coefficients are what they are. This problem is especially pronounced if your data set contains highly correlated features; in that case the coefficients may be hard to interpret.

There are still many places here that could be done better. Suggestions from readers are welcome, and I hope to meet some friends to discuss these topics together.


Original post: blog.csdn.net/weixin_43580339/article/details/112277248