Rookie Notes Python3 - Machine Learning (2) Logistic Regression Algorithm

References

Python Machine Learning, Chapter 3: A Tour of Machine Learning Classifiers Using Scikit-learn

introduction

When we do classification, the feature values in the samples are generally distributed over the real numbers, but what we often want is a probability-like value in [0, 1]. Also, to prevent features from interfering with each other because their scales differ too much (for example, when one feature value is very large while the others are small), we need to normalize the data. In other words, before doing machine learning we need to map the feature matrix from R into [0, 1]. When the mapping used is the sigmoid function, we call such a machine learning algorithm logistic regression.
PS: Logistic regression is for classification!!! Not for linear regression! The inverse of the sigmoid function is called the logit function, which is where the name "logistic regression" comes from; it has nothing to do with logic...
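
As a quick sanity check (my own addition, not from the original post), the logit function, logit(p) = ln(p / (1 - p)), really is the inverse of the sigmoid:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # logit(p) = ln(p / (1 - p)), the inverse of the sigmoid
    return np.log(p / (1.0 - p))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(np.allclose(logit(sigmoid(z)), z))  # True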

sigmoid function

$$\phi(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^{T}x$$

The key feature of this function is that its domain is all of R while its range lies in (0, 1). At the
same time, phi(z) represents the probability that y = 1, and the probability that y = 0 is 1 - phi(z).
A figure illustrates this:

#!/usr/bin/python
# -*- coding: utf8 -*-
import matplotlib.pyplot as plt
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


z = np.arange(-10, 10, 0.1)
p = sigmoid(z)
plt.plot(z, p)
# Draw a vertical line; if x is not given, it defaults to 0
plt.axvline(x=0, color='k')
plt.axhspan(0.0, 1.0, facecolor='0.7', alpha=0.4)
# Draw horizontal lines; if y is not given, it defaults to 0
plt.axhline(y=1, ls='dotted', color='0.4')
plt.axhline(y=0, ls='dotted', color='0.4')
plt.axhline(y=0.5, ls='dotted', color='k')
plt.ylim(-0.1, 1.1)
# Set the y-axis ticks
plt.yticks([0.0, 0.5, 1.0])
plt.ylabel(r'$\phi (z)$')
plt.xlabel('z')
ax = plt.gca()
ax.grid(True)
plt.show()
 
 

Logistic regression algorithm

  • Basic principle
    The logistic regression algorithm is very similar to the Adaline adaptive linear neuron algorithm; the difference is that the activation function changes from the **identity mapping y = z** to y = sigmoid(z)
 
 
  • The loss function in logistic regression
    Recall the cost function used in the gradient-descent model Adaline, the sum-of-squared-errors function:
    $$J(w) = \frac{1}{2}\sum_{i}\left(y^{(i)} - \phi(z^{(i)})\right)^{2}$$

    This is the loss function of linear regression, but once the activation is the sigmoid, the gradient of this
    squared-error cost gets very close to zero whenever phi(z) saturates near 0 or 1, which makes training difficult.
    The logistic regression loss function is therefore defined differently, as the log-likelihood loss function (cross entropy), given below.
    Ps: every log here is actually ln (the natural logarithm).

    $$J(w) = \sum_{i}\left[-y^{(i)}\log\left(\phi(z^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - \phi(z^{(i)})\right)\right]$$

Where does this loss function come from? Maximum Likelihood
First define the likelihood function (each sample is considered independent):
$$L(w) = P(y \mid x; w) = \prod_{i} P\left(y^{(i)} \mid x^{(i)}; w\right) = \prod_{i} \left(\phi(z^{(i)})\right)^{y^{(i)}} \left(1 - \phi(z^{(i)})\right)^{1 - y^{(i)}}$$

The likelihood function can be regarded as a conditional probability.
For the concept of a likelihood function, please refer to kevinGao's blog:

 

http://www.cnblogs.com/kevinGaoblog/archive/2012/03/29/2424346.html

According to the idea of maximum likelihood, the parameters that maximize the likelihood function are the most reasonable. We want to maximize the likelihood function, but this form is still awkward to work with; after all, it is a product of many terms, so let's take its logarithm:

$$l(w) = \log L(w) = \sum_{i}\left[y^{(i)}\log\left(\phi(z^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - \phi(z^{(i)})\right)\right]$$

Well, now we know: the weight vector w that maximizes l is the most reasonable one,
so we define the J function as J = -l:

$$J(w) = -l(w) = \sum_{i}\left[-y^{(i)}\log\left(\phi(z^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - \phi(z^{(i)})\right)\right]$$
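
As a small numerical illustration (my own addition; the labels and predicted probabilities below are made up), this J(w) is easy to evaluate with NumPy for a toy batch:

import numpy as np

# made-up labels and sigmoid outputs for five samples
y = np.array([1, 0, 1, 1, 0])
phi = np.array([0.9, 0.2, 0.7, 0.95, 0.1])

# J(w) = sum_i [ -y*log(phi) - (1-y)*log(1-phi) ]
J = np.sum(-y * np.log(phi) - (1 - y) * np.log(1 - phi))
print(J)  # a small positive number; it shrinks as phi matches y better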

For a better understanding, let's look at the loss function for a single sample:

$$J(\phi(z), y; w) = -y\log\left(\phi(z)\right) - (1 - y)\log\left(1 - \phi(z)\right)$$

$$J(\phi(z), y; w) =
\begin{cases}
-\log\left(\phi(z)\right) & \text{if } y = 1 \\
-\log\left(1 - \phi(z)\right) & \text{if } y = 0
\end{cases}$$
Taking y = 1 as an example: as the predicted value phi(z) approaches the correct value 1, J converges to 0.
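
A tiny sketch of that behaviour (my own addition; the probability values are made up):

import numpy as np

# y = 1 case: the per-sample loss is -log(phi(z))
for phi in [0.5, 0.9, 0.99, 0.999]:
    print(phi, -np.log(phi))
# the closer phi gets to 1, the closer the loss gets to 0;
# conversely, -log(phi) blows up as phi approaches 0, punishing confident mistakes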

 

  • The weight update
    is the same as in gradient descent, following the formula
    $$w := w + \Delta w, \qquad \Delta w = -\eta \nabla J(w)$$

The partial derivative works out to:

$$\frac{\partial J}{\partial w_j} = -\sum_{i}\left(y^{(i)} - \phi(z^{(i)})\right)x_j^{(i)}$$

So we have the formula for the weight update, which is
exactly the same as Adaline's.
Is it unexpected? Surprised or not?

$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_{i}\left(y^{(i)} - \phi(z^{(i)})\right)x_j^{(i)}$$

This means that when we write a separate LogisticRegression class, we only need to redefine the activation function phi from the Adaline class.
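
To make that concrete, here is a minimal, self-contained sketch (my own illustration, not the book's exact code; the class name LogisticRegressionGD and the attribute names eta, n_iter and w_ are just assumptions for the example). The update in fit is literally the Adaline rule, with the sigmoid swapped in as the activation:

import numpy as np

class LogisticRegressionGD(object):
    """Gradient-descent logistic regression; the update rule is the same as
    Adaline's, only the activation is the sigmoid instead of the identity."""

    def __init__(self, eta=0.01, n_iter=50):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes over the training set

    def activation(self, z):
        # the only real difference from Adaline: phi(z) = sigmoid(z)
        return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])  # w_[0] is the bias
        for _ in range(self.n_iter):
            output = self.activation(np.dot(X, self.w_[1:]) + self.w_[0])
            errors = y - output
            # same update as Adaline: Delta w = eta * X^T (y - phi(z))
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
        return self

    def predict(self, X):
        z = np.dot(X, self.w_[1:]) + self.w_[0]
        return np.where(self.activation(z) >= 0.5, 1, 0)

Training would then look like LogisticRegressionGD(eta=0.05, n_iter=100).fit(X, y) on a binary (0/1-labelled) subset of the data.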

 

practice

Let's use the Iris dataset for practice, building on the previous chapter, where we implemented the Perceptron with sklearn.

#!/usr/bin/python
# -*- coding: utf8 -*-
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
# in older scikit-learn versions this lived in sklearn.cross_validation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from PDC import plot_decision_regions  # helper from the previous chapter
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np

iris = datasets.load_iris()
x = iris.data[:, [2, 3]]   # petal length and petal width
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

Ir = LogisticRegression(C=1000.0, random_state=0)
Ir.fit(X_train_std, y_train)

X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X=X_combined_std, y=y_combined,
                      classifier=Ir, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.savefig('Iris.png')
plt.show()

print(X_test_std[0, :])
# predict_proba expects a 2D array, hence the reshape
a = Ir.predict_proba(X_test_std[0, :].reshape(1, -1))
print(a)
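
A brief aside (my own note, reusing Ir and X_test_std from the script above): predict_proba returns one row per sample with one column per class, each row summing to 1, and predict simply picks the most probable class:

proba = Ir.predict_proba(X_test_std[:3, :])  # probabilities for the first 3 test samples
print(proba.sum(axis=1))                     # each row sums to 1
print(proba.argmax(axis=1))                  # matches Ir.predict(X_test_std[:3, :]), since the classes here are 0, 1, 2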
 
 

Overfitting, Underfitting and Regularization

Overfitting and underfitting are two common problems in machine learning

  • Overfitting
    is commonly described as overthinking. In order to fit the training set well, the model uses too many parameters and becomes very complicated; even noise and errors get fitted. Although such a model reproduces the training set well, it is particularly unreliable for predicting on new data. We say such a model has high variance.
  • Underfitting
    is, correspondingly, thinking too simply. The model is too simple and is also unreliable on the prediction dataset. We say such a model has high bias.
 
 
  • Regularization
    To prevent overfitting, regularization is a commonly used method. Regularization, simply put, introduces an additional bias to reduce the influence of some extreme weight values.
    The most common regularization is L2 regularization, which adds a term like this to the end of the loss function:
    $$\frac{\lambda}{2}\lVert w\rVert^{2} = \frac{\lambda}{2}\sum_{j} w_j^{2}$$

    Lambda is called the regularization parameter,
    so the loss function becomes:
    $$J(w) = \sum_{i}\left[-y^{(i)}\log\left(\phi(z^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - \phi(z^{(i)})\right)\right] + \frac{\lambda}{2}\lVert w\rVert^{2}$$
Ir = LogisticRegression(C=1000.0,random_state=0)

The parameter C in the LogisticRegression class comes from a related convention in support vector machines (SVM), which will not be expanded on here; what matters is that C is the inverse of the regularization parameter, C = 1/lambda, so a smaller C means stronger regularization.

 
 

The final form of the loss function:

$$J(w) = C\left[\sum_{i}\left(-y^{(i)}\log\left(\phi(z^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - \phi(z^{(i)})\right)\right)\right] + \frac{1}{2}\lVert w\rVert^{2}$$
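
For completeness, a small sketch (my own addition; the function name and the lam parameter are just illustrative) of how the L2 term would change the gradient-descent update used earlier: the gradient of (lambda/2)*||w||^2 is simply lambda*w, and by convention the bias term is not regularized:

import numpy as np

def l2_regularized_step(w, X, y, eta=0.01, lam=0.1):
    """One gradient-descent step on the L2-regularized logistic loss.
    w[0] is the bias and is left out of the penalty."""
    phi = 1.0 / (1.0 + np.exp(-(np.dot(X, w[1:]) + w[0])))
    errors = y - phi
    w = w.copy()
    # the L2 penalty contributes lam * w to the gradient, hence the extra term here
    w[1:] += eta * (X.T.dot(errors) - lam * w[1:])
    w[0] += eta * errors.sum()
    return w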
  • The effect of the C value on the fit
    Set 10 raised to each power from -5 to 4 as the C value, and let's look at the effect on the weight coefficients (code below):
weights, params = [], []
for c in range(-5, 5):
    lr = LogisticRegression(C=10**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])   # coefficients for the second class
    params.append(10**c)

weights = np.array(weights)
plt.plot(params, weights[:, 0], label='petal length')
plt.plot(params, weights[:, 1], linestyle='--', label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')
plt.show()
 


Author: Lingyu Zhenren
Link: https://www.jianshu.com/p/9db03938ea72
Source: Jianshu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
