Introductory Research on Machine Learning (14) - Logistic Regression

Table of Contents

 

Overview

Sigmoid function

Loss function

Optimizing the loss

Corresponding sklearn API

Instance

Data set characteristics

Summary


Overview

Logistic regression is a classification model used to estimate the probability that something belongs to a particular class. It is often used in scenarios such as:

Ad click-through rate: whether an ad was clicked

Spam detection: whether an email is spam

Disease diagnosis: whether a patient is sick

Financial fraud: whether a transaction is fraudulent

Fake accounts: whether an account is fake

What these scenarios have in common: they are all binary classification problems, with a positive example and a negative example.

The principle of logistic regression: the model is a function constructed by taking the output of linear regression as the input of the Sigmoid function.

Sigmoid function

Also known as the logistic function or activation function, it is expressed by the formula:

g(z) = 1 / (1 + e^{-z})

Its graph is an S-shaped curve whose values lie in (0, 1).

Logistic regression uses the output of linear regression as the input of the Sigmoid function.

Substituting the output of linear regression, \theta^T x, into the position of z in the formula above, it becomes:

h_\theta(x) = 1 / (1 + e^{-\theta^T x})

Here x is the feature vector of a sample and \theta is the weight vector we need to solve for. The output of this function is a probability value in (0, 1). With the default threshold of 0.5, this threshold determines which category a sample is assigned to: the class labeled 1 is the positive example, and the other class, labeled 0, is the negative example.
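To make the mapping concrete, here is a minimal NumPy sketch of the Sigmoid function and the thresholded prediction described above (the names sigmoid and predict are my own, not from the original post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    # X: (n_samples, n_features) feature matrix, theta: (n_features,) weight vector
    proba = sigmoid(X @ theta)               # probability of the positive class
    return (proba >= threshold).astype(int)  # 1 = positive example, 0 = negative example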

Loss function

In linear regression, the weights and bias are obtained by minimizing the loss function, that is, by minimizing the sum of squared differences between the predicted values and the true values. For logistic regression, the weights and bias in the Sigmoid function are likewise obtained by minimizing a loss function, but here the true labels are 1 and 0 while the model outputs a probability, so the squared error is no longer a suitable comparison. The log-likelihood loss is therefore introduced:

J(\theta) = -\sum_{i=1}^{m} [ y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i)) ]

Here y_i is the true label and h_\theta(x_i) is the predicted probability.

When y_i = 1 the sample is a positive example, and its loss reduces to -\log h_\theta(x_i): the closer the predicted probability is to 1, the closer the loss is to 0.

When y_i = 0 the sample is a negative example, and its loss reduces to -\log(1 - h_\theta(x_i)): likewise, the closer the predicted probability is to 0, the closer the loss is to 0.
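As a quick sanity check on this formula, here is a minimal NumPy sketch (my own, not from the original post); log_loss is an illustrative name, and a small epsilon avoids taking log(0):

import numpy as np

def log_loss(y_true, y_proba, eps=1e-12):
    # y_true: array of 0/1 labels; y_proba: predicted probabilities h_theta(x)
    y_proba = np.clip(y_proba, eps, 1 - eps)   # avoid log(0)
    return -np.sum(y_true * np.log(y_proba) + (1 - y_true) * np.log(1 - y_proba))

# A confident correct prediction gives a small loss,
# a confident wrong prediction gives a large loss.
print(log_loss(np.array([1, 0]), np.array([0.9, 0.1])))   # about 0.21
print(log_loss(np.array([1, 0]), np.array([0.1, 0.9])))   # about 4.61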

Optimizing the loss

As in Machine Learning Introductory Research (12) - Linear Regression, the gradient descent optimization algorithm is used to reduce the value of the loss function; this increases the predicted probability for samples whose true class is 1 and decreases it for samples whose true class is 0.
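For reference, here is a minimal batch gradient descent sketch for this loss (my own sketch, not code from the post). The gradient of the log-likelihood loss above with respect to \theta is X^T (h_\theta(X) - y), which gives the update used below:

import numpy as np

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    # X: (m, n) feature matrix, y: (m,) array of 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        proba = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x) for every sample
        grad = X.T @ (proba - y)                    # gradient of the log-likelihood loss
        theta -= lr * grad / m                      # step along the average gradient
    return theta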

Corresponding sklearn API

LogisticRegression(penalty='l2', dual=False, tol=1e-4, C=1.0,
                 fit_intercept=True, intercept_scaling=1, class_weight=None,
                 random_state=None, solver='warn', max_iter=100,
                 multi_class='warn', verbose=0, warm_start=False, n_jobs=None,
                 l1_ratio=None)

The meanings of the parameters are as follows:

penalty: Regularization choice; the options are 'l1' and 'l2', and the default is 'l2'. Generally l2 is enough. If the model still overfits with l2 regularization, i.e. the prediction performance is poor, consider l1. In addition, if the model has many features and you want some unimportant feature coefficients to be 0 so that the model is sparse, l1 can also be used. Used in conjunction with solver.

dual: Default False. Only applicable when penalty is l2 and solver is liblinear. When the number of samples is greater than the number of features, the default False is usually preferred.

tol: The tolerance for deciding when to stop iterating; default 1e-4.

C: The reciprocal of the regularization coefficient \lambda; default 1.

fit_intercept: Whether to fit an intercept; default True.

intercept_scaling: Only effective when solver is liblinear and fit_intercept is True.

class_weight: The weights of the classes. It can be given explicitly, e.g. class_weight={0: 0.8, 1: 0.2}, meaning class 0 gets a weight of 0.8 and class 1 a weight of 0.2. Its main uses are:

The first is when misclassifying one class is costly. For example, when classifying whether a user is legitimate, the usual practice is to prefer flagging a legitimate user as illegitimate and then verifying manually; in this case the weight of the illegitimate-user class can be increased appropriately.

The second is highly imbalanced samples. For example, there are 10,000 legitimate-user records but only 5 illegitimate users. Without weighting, the model is likely to classify everything as legitimate, so class_weight='balanced' can be chosen to automatically increase the weight of the illegitimate-user samples (see the snippet after this parameter list).

In addition, for highly imbalanced samples, besides adjusting class weights through class_weight, the sample weights can also be adjusted through the sample_weight parameter when calling the fit method.

random_state: Random seed.

solver: Used in conjunction with penalty.

If penalty is l2, the options are:

newton-cg: a member of the Newton-method family; uses the second-derivative (Hessian) matrix of the loss function to iteratively optimize it.

lbfgs: a quasi-Newton method; also uses the Hessian matrix of the loss function to iteratively optimize it.

sag: stochastic average gradient descent.

The above three are suitable for larger data sets and support both one-vs-rest (OvR) and many-vs-many (MvM) multiclass logistic regression.

If the sample size is very large, e.g. more than 100,000, sag is the first choice, but it cannot be used with l1 regularization.

The above three also require the loss function to have continuous first or second derivatives.

liblinear: the open-source liblinear library, which internally uses coordinate descent to optimize the loss function. Suitable for small data sets; for multiclass problems it only supports OvR, not MvM.

If penalty is l1, only liblinear can be used.

max_iter: Maximum number of iterations.

multi_class: Choice of multiclass scheme.

ovr: one-vs-rest.

multinomial: many-vs-many.

For binary logistic regression there is no difference between the two; the difference matters for multiclass logistic regression.

ovr: treats multiclass logistic regression as a set of binary problems. For the classification decision of class K, the samples of class K are taken as positive examples and all other samples as negative examples, which gives the classification model for class K; the other classes are handled in the same way.

MvM: if the model has T classes, two classes are chosen from the T classes each time, denoted T1 and T2; all samples labeled T1 or T2 are put together, with T1 as positive examples and T2 as negative examples, and binary logistic regression is run to obtain the model parameters. In total T(T-1)/2 classifiers are trained.

verbose: Verbosity of logging. 0: do not output the training process; 1: output occasionally.

warm_start: Whether to warm-start; default False. If True, the next call to fit reuses the solution of the previous call as initialization.

n_jobs: Number of parallel jobs; default 1. -1: use as many jobs as there are CPU cores.

l1_ratio: The Elastic-Net mixing parameter; only used when penalty is 'elasticnet'.
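To tie a few of these parameters together, here is a small illustrative instantiation (my own example, not from the original post):

from sklearn.linear_model import LogisticRegression

# l2 penalty solved with liblinear, stronger regularization (smaller C),
# and 'balanced' class weights for an imbalanced data set
estimator = LogisticRegression(penalty="l2", solver="liblinear",
                               C=0.5, class_weight="balanced", max_iter=200)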

Instance

Cancer classification prediction: benign or malignant

Data set characteristics

There are 699 samples with 11 columns of data. The first column is the sample code number, the other columns are medical features related to the tumor, and the last column is the tumor class. There are 16 missing values in the data, marked with ?.

The data set is available at:

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

The corresponding field descriptions are at:

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

We can see the data structure through Jupyter as follows:
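A minimal sketch of loading the data and peeking at its structure (my own snippet, not from the original post):

import pandas as pd

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
column_names = ["Sample code number", "Clump Thickness", "Uniformity of Cell Size",
                "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size",
                "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"]
data = pd.read_csv(path, names=column_names)
print(data.shape)    # (699, 11)
print(data.head())   # the first few rows, including the Class column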

For the final classification result, 2 means benign and 4 means malignant.

Let's look at the code:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

def regression():

    # 1) Load the data set
    path = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
    column_names = ["Sample code number", "Clump Thickness", "Uniformity of Cell Size",
                    "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size",
                    "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"]
    data = pd.read_csv(path, names=column_names)
    # 2) Handle missing values: replace ? with NaN, then drop those rows
    data = data.replace(to_replace="?", value=np.nan)
    data.dropna(inplace=True)
    # 3) Split the data set
    # The last column is the target; the columns between the id and the target are the features
    x = data.iloc[:, 1:-1]
    y = data["Class"]
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)
    # 4) Standardize the features
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 5) Estimator
    estimator = LogisticRegression()
    estimator.fit(x_train, y_train)
    # 6) Predict
    y_predict = estimator.predict(x_test)
    # print("Predicted values:", y_predict)
    print("Prediction matches true value:", y_test == y_predict)
    print("Logistic regression weights w:", estimator.coef_)
    print("Logistic regression bias b:", estimator.intercept_)
    # 7) Evaluate the model
    # Note: the labels are 2 (benign) and 4 (malignant), so each misclassification
    # contributes (2 - 4)^2 = 4 to this squared error.
    error = mean_squared_error(y_test, y_predict)
    print("Logistic regression error:", error)
    return

if __name__ == "__main__":
    regression()

The output is as follows:

Logistic regression weights w: [[1.166673   0.1206053  0.72963858 0.60897593 0.10861572 1.47922335
  0.7462081  0.79426579 0.87710322]]
Logistic regression bias b: [-0.97797973]
Logistic regression error: 0.0935672514619883

We can see that one weight is output for each feature, and the error value is already very small.
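If you want to see which feature each weight corresponds to, one option (my own addition, assuming it is placed inside regression() after fitting) is:

    # pair each learned weight with its feature name (the columns between the id and Class)
    for name, weight in zip(column_names[1:-1], estimator.coef_[0]):
        print(name, ":", round(weight, 4))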

Summary

At the end of 2019 I started learning machine-learning-related content, and it has been quite fun. Because of a busy project bid in the middle, I had no time to study for a while, so in the few days before the holiday I wrote up and summarized the notes from those earlier days. After coming back from the Spring Festival holiday, I must try to squeeze out time to keep studying.

Come on!!!
