【Machine Learning】2 Logistic Regression

1 Logistic Regression

  • It is a classification algorithm, despite having "regression" in its name

1.1 Differences between Logistic Regression and Linear Regression

| Logistic Regression | Linear Regression |
| --- | --- |
| classification algorithm | regression algorithm |
| $0 \le h_\theta(x) \le 1$ | $h_\theta(x)$ can be $>1$ or $<0$ |

1.2 Model

  • Hypothesis
$$
\begin{aligned}
h_\theta(x) &= P(y=1 \mid x;\theta) = g(\theta^T x) && \text{(estimated probability that $y=1$, given $x$, parameterized by $\theta$)}\\
g(z) &= \frac{1}{1+e^{-z}} && \text{(Sigmoid Function / Logistic Function)}
\end{aligned}
$$
    [Figure: the Sigmoid Function curve]

Suppose we predict:
"$y=1$" if $h_\theta(x) \ge 0.5$ (i.e. $\theta^T x \ge 0$)
"$y=0$" if $h_\theta(x) < 0.5$ (i.e. $\theta^T x < 0$)

import numpy as np

def sigmoid(z):
    # Sigmoid / logistic function g(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))
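The 0.5 threshold above maps directly to code; a minimal sketch reusing the sigmoid helper (the predict name and array shapes are illustrative, not from the original notes):

import numpy as np

def predict(theta, X):
    # Predict y = 1 where h_theta(x) >= 0.5, i.e. theta^T x >= 0; otherwise y = 0
    return (sigmoid(X @ theta) >= 0.5).astype(int)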
  • Parameters: $\theta$
  • Decision Boundary: a property of the hypothesis and its parameters, not of the training set (e.g. for $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)$ with $\theta=(-3,1,1)^T$, we predict $y=1$ whenever $x_1+x_2\ge3$, so the boundary is the line $x_1+x_2=3$)
  • Cost Function (the squared-error cost used for linear regression would make $J(\theta)$ non-convex here, so the following log cost is used instead):
$$
\begin{aligned}
J(\theta)&=\frac{1}{m}\sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)})\\
\mathrm{Cost}(h_\theta(x),y)&=\begin{cases}-\log(h_\theta(x))&\text{if }y=1\\-\log(1-h_\theta(x))&\text{if }y=0\end{cases}\\
\text{Equivalent form:}\quad \mathrm{Cost}(h_\theta(x),y)&=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))&&\text{($y=1$ or $0$)}
\end{aligned}
$$
import numpy as np

def cost(theta, X, y):
    # Logistic-regression cost J(theta): mean of the per-example log cost
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply(1 - y, np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / len(X)
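A quick usage sketch of the cost function above (the toy numbers are illustrative; X already contains a leading column of ones for the intercept term):

import numpy as np

# Toy data: a column of ones (intercept) plus one feature; labels y in {0, 1}
X = np.array([[1.0, 0.5],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([[0], [0], [1], [1]])
theta = np.zeros(2)
print(cost(theta, X, y))  # about 0.693 = -log(0.5), since theta = 0 predicts 0.5 everywhere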
  • Goal (objective function): $\mathop{\text{minimize}}\limits_{\theta}\, J(\theta)$

1.3 Gradient Descent for $J(\theta)$

  • repeat {
        $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x_j^{(i)}$
    (simultaneously update all $\theta_j$)
    }
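A minimal vectorized sketch of this update loop in Python, reusing the sigmoid helper above (the function name gradient_descent and the default alpha/iters values are illustrative, not from the original notes):

import numpy as np

def gradient_descent(X, y, theta, alpha=0.01, iters=1000):
    # Batch gradient descent: simultaneously update all theta_j on every iteration
    m = len(X)
    for _ in range(iters):
        error = sigmoid(X @ theta) - y              # h_theta(x^(i)) - y^(i) for every example
        theta = theta - (alpha / m) * (X.T @ error)
    return theta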

1.4 Advanced Optimization (alternatives to Gradient Descent)

  • Optimization Algorithms:
  1. Conjugate gradient
  2. BFGS (Broyden–Fletcher–Goldfarb–Shanno)
  3. L-BFGS (limited-memory BFGS)
  • Advantages:
  1. No need to manually pick $\alpha$
  2. Often faster than gradient descent
  • Disadvantages: more complex
Octave code:
function [jVal, gradient] = costFunction(theta)
	jVal = [...code to compute J(theta)...];
	gradient = [...code to compute derivative of J(theta)...];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);

[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
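For comparison, a rough Python counterpart using SciPy's general-purpose optimizer; this is an illustrative sketch, not part of the original notes (it assumes X, with an intercept column, and y are already defined, and reuses the sigmoid and cost functions above):

import numpy as np
from scipy.optimize import minimize

def gradient(theta, X, y):
    # Vectorized gradient of the logistic-regression cost J(theta)
    theta = np.asarray(theta).ravel()
    error = sigmoid(np.asarray(X) @ theta) - np.asarray(y).ravel()
    return np.asarray(X).T @ error / len(X)

initial_theta = np.zeros(2)   # counterpart of zeros(2,1) in the Octave snippet
result = minimize(cost, initial_theta, args=(X, y), jac=gradient,
                  method='BFGS', options={'maxiter': 100})
opt_theta, function_val = result.x, result.fun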

2 Multi-class Classification: One-vs-all

  • One-versus-all Classification / One-versus-rest
  • Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$
  • On a new input $x$, make a prediction by picking the class $i$ that maximizes $h_\theta^{(i)}(x)$, i.e. $\mathop{\text{max}}\limits_{i}\, h_\theta^{(i)}(x)$

In other words, when solving a multi-class classification problem, each classifier separates a single class A from all the remaining classes, which are treated together as one class B.
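A minimal one-vs-all sketch in Python (illustrative only; it reuses the sigmoid helper and the gradient_descent sketch from section 1.3, and assumes integer class labels 0..K-1):

import numpy as np

def train_one_vs_all(X, y, num_classes, alpha=0.1, iters=1000):
    # Fit one logistic-regression classifier per class: class i vs. all the rest
    all_theta = np.zeros((num_classes, X.shape[1]))
    for i in range(num_classes):
        y_i = (y == i).astype(float)   # relabel: class i -> 1, every other class -> 0
        all_theta[i] = gradient_descent(X, y_i, np.zeros(X.shape[1]), alpha, iters)
    return all_theta

def predict_one_vs_all(all_theta, X):
    # For each example, pick the class whose classifier outputs the highest probability
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)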

3 References

Andrew Ng (吴恩达), Machine Learning, Coursera
Huang Haiguang (黄海广), Machine Learning Notes
