Logistic regression study notes
This article is for personal study and understanding only.
Logistic Regression

Logistic regression is used to solve binary classification problems. The output of a regression model is continuous; the output of a classification model is discrete.
- Logistic regression = linear regression + sigmoid function
- Linear regression: $z = w \cdot x + b$
- Sigmoid function: $y = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(w \cdot x+b)}}$
- Logistic regression loss function (the smaller the loss, the better the model; training is the optimization process that minimizes the loss):

$$C = -\left[ y \ln a + (1-y) \ln (1-a) \right]$$
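The linear part, the sigmoid, and a single-sample prediction can be sketched in Python (function names such as `predict_proba` are my own for illustration, not from any library):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)): squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    # linear regression part z = w*x + b, then the sigmoid
    return sigmoid(w * x + b)

# probability that a sample with x = 2.0 belongs to the positive class
p = predict_proba(2.0, w=1.5, b=-1.0)  # sigmoid(2.0), roughly 0.88
```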
Loss function

$$\text{cost} = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1-\hat{p}) & \text{if } y = 0 \end{cases}$$
The loss for a single sample is expressed as:

$$\text{cost} = -y \log(\hat{p}) - (1-y) \log(1-\hat{p})$$
The loss function over all samples (summed and averaged):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + \left(1-y^{(i)}\right) \log\left(1-\hat{p}^{(i)}\right) \right]$$

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\sigma\left(X_b^{(i)} \theta\right)\right) + \left(1-y^{(i)}\right) \log\left(1-\sigma\left(X_b^{(i)} \theta\right)\right) \right]$$
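As a sketch, this averaged loss can be computed directly from the formula (the variable names are my own; `X_b` is the feature matrix with a leading column of ones for the intercept):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X_b, y):
    # J(theta) = -(1/m) * sum over i of
    #   y_i * log(p_hat_i) + (1 - y_i) * log(1 - p_hat_i)
    p_hat = sigmoid(X_b @ theta)
    return -np.mean(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))

# two samples; each row of X_b starts with 1 for the intercept term
X_b = np.array([[1.0, 0.5], [1.0, -1.5]])
y = np.array([1.0, 0.0])
loss = log_loss(np.zeros(2), X_b, y)  # theta = 0 gives p_hat = 0.5, so loss = ln 2
```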
The above formula has no closed-form solution, but $J(\theta)$ is a convex function (any local minimum is also the global minimum), so it can be solved numerically by the gradient descent method.
Gradient Descent
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\sigma\left(X_b^{(i)} \theta\right)\right) + \left(1-y^{(i)}\right) \log\left(1-\sigma\left(X_b^{(i)} \theta\right)\right) \right]$$

$$\nabla J(\theta) = \begin{pmatrix} \frac{\partial J(\theta)}{\partial \theta_0} \\ \frac{\partial J(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{pmatrix}$$
First, differentiate the sigmoid function:

$$\sigma(t) = \frac{1}{1+e^{-t}} = \left(1+e^{-t}\right)^{-1}$$

$$\sigma'(t) = -\left(1+e^{-t}\right)^{-2} \cdot e^{-t} \cdot (-1) = \left(1+e^{-t}\right)^{-2} \cdot e^{-t}$$
Going one layer further, differentiate $\log \sigma(t)$:

$$\begin{aligned} (\log \sigma(t))' &= \frac{1}{\sigma(t)} \cdot \sigma'(t) = \frac{1}{\left(1+e^{-t}\right)^{-1}} \cdot \left(1+e^{-t}\right)^{-2} \cdot e^{-t} \\ &= \left(1+e^{-t}\right)^{-1} \cdot e^{-t} \end{aligned}$$

$$\begin{aligned} (\log \sigma(t))' &= \left(1+e^{-t}\right)^{-1} \cdot e^{-t} = \frac{e^{-t}}{1+e^{-t}} = \frac{1+e^{-t}-1}{1+e^{-t}} = 1 - \frac{1}{1+e^{-t}} \\ &= 1 - \sigma(t) \end{aligned}$$
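The closed form for $\sigma'(t)$ derived above can be checked numerically against a finite-difference approximation; a small sketch:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_prime(t):
    # closed form from the derivation: (1 + e^(-t))^(-2) * e^(-t),
    # which also equals sigma(t) * (1 - sigma(t))
    return (1.0 + np.exp(-t)) ** -2 * np.exp(-t)

# central-difference check at a few points
h = 1e-6
for t in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2.0 * h)
    assert abs(numeric - sigmoid_prime(t)) < 1e-8
```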
$$\frac{d\left(y^{(i)} \log \sigma\left(X_b^{(i)} \theta\right)\right)}{d \theta_j} = y^{(i)} \left(1-\sigma\left(X_b^{(i)} \theta\right)\right) \cdot X_j^{(i)}$$

Similarly, for $\log(1-\sigma(t))$:

$$\begin{aligned} (\log(1-\sigma(t)))' &= \frac{1}{1-\sigma(t)} \cdot (-1) \cdot \sigma'(t) = -\frac{1}{1-\sigma(t)} \cdot \left(1+e^{-t}\right)^{-2} \cdot e^{-t} \\ &= -\frac{1+e^{-t}}{e^{-t}} \cdot \left(1+e^{-t}\right)^{-2} \cdot e^{-t} \\ &= -\left(1+e^{-t}\right)^{-1} = -\sigma(t) \end{aligned}$$

$$\frac{d\left(\left(1-y^{(i)}\right) \log\left(1-\sigma\left(X_b^{(i)} \theta\right)\right)\right)}{d \theta_j} = \left(1-y^{(i)}\right) \cdot \left(-\sigma\left(X_b^{(i)} \theta\right)\right) \cdot X_j^{(i)}$$

Adding the two terms gives:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(\sigma\left(X_b^{(i)} \theta\right) - y^{(i)}\right) X_j^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) X_j^{(i)}$$
where $\hat{y}^{(i)}$ is the predicted value.
$$\nabla J(\theta) = \begin{pmatrix} \partial J / \partial \theta_0 \\ \partial J / \partial \theta_1 \\ \partial J / \partial \theta_2 \\ \vdots \\ \partial J / \partial \theta_n \end{pmatrix} = \frac{1}{m} \cdot \begin{pmatrix} \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)} \theta\right)-y^{(i)}\right) \\ \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)} \theta\right)-y^{(i)}\right) \cdot X_1^{(i)} \\ \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)} \theta\right)-y^{(i)}\right) \cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)} \theta\right)-y^{(i)}\right) \cdot X_n^{(i)} \end{pmatrix} = \frac{1}{m} \cdot \begin{pmatrix} \sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right) \\ \sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right) \cdot X_1^{(i)} \\ \sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right) \cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right) \cdot X_n^{(i)} \end{pmatrix}$$
Vectorizing the above formula (I haven't fully worked through this derivation myself):
$$\nabla J(\theta) = \frac{1}{m} \cdot X_b^T \cdot \left(\sigma\left(X_b \theta\right) - y\right)$$
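The vectorized form is short to implement, and it can be checked against the component-wise sums; a sketch (variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X_b, y):
    # nabla J(theta) = (1/m) * X_b^T @ (sigma(X_b @ theta) - y)
    m = X_b.shape[0]
    return X_b.T @ (sigmoid(X_b @ theta) - y) / m

# compare against the per-component sums from the column-vector formula
X_b = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.2])
vec = gradient(theta, X_b, y)
err = sigmoid(X_b @ theta) - y
loop = np.array([np.sum(err * X_b[:, j]) for j in range(2)]) / 3.0
```

The two results agree, which is exactly what the vectorization claims.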
Gradient Descent
Gradient descent is an optimization algorithm suited to functions with a unique extreme point. For functions without a unique extreme point, it can be run multiple times with randomized initial points.

The learning rate $\eta$ affects how quickly the optimal solution is found; if its value is poorly chosen, the optimum may not be reached at all. $\eta$ is a hyperparameter of the gradient descent method.
Goal: make $J(\theta) = \operatorname{MSE}(y, \hat{y})$ as small as possible.
$$\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2 = \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - \theta_0 - \theta_1 X_1^{(i)} - \theta_2 X_2^{(i)} - \ldots - \theta_n X_n^{(i)}\right)^2$$
$$\nabla J(\theta) = \begin{pmatrix} \partial J / \partial \theta_0 \\ \partial J / \partial \theta_1 \\ \partial J / \partial \theta_2 \\ \vdots \\ \partial J / \partial \theta_n \end{pmatrix} = \frac{1}{m} \begin{pmatrix} \sum_{i=1}^{m} 2\left(y^{(i)} - X_b^{(i)} \theta\right) \cdot (-1) \\ \sum_{i=1}^{m} 2\left(y^{(i)} - X_b^{(i)} \theta\right) \cdot \left(-X_1^{(i)}\right) \\ \sum_{i=1}^{m} 2\left(y^{(i)} - X_b^{(i)} \theta\right) \cdot \left(-X_2^{(i)}\right) \\ \vdots \\ \sum_{i=1}^{m} 2\left(y^{(i)} - X_b^{(i)} \theta\right) \cdot \left(-X_n^{(i)}\right) \end{pmatrix} = \frac{2}{m} \begin{pmatrix} \sum_{i=1}^{m} \left(X_b^{(i)} \theta - y^{(i)}\right) \\ \sum_{i=1}^{m} \left(X_b^{(i)} \theta - y^{(i)}\right) \cdot X_1^{(i)} \\ \sum_{i=1}^{m} \left(X_b^{(i)} \theta - y^{(i)}\right) \cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m} \left(X_b^{(i)} \theta - y^{(i)}\right) \cdot X_n^{(i)} \end{pmatrix}$$
$$\theta_i = \theta_i - \eta \frac{\partial J\left(\theta_0, \theta_1, \cdots, \theta_n\right)}{\partial \theta_i}$$
Small worked example of the gradient descent method (including a Python program)
Gradient descent method 1
Gradient descent method 2
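The original program is not reproduced above; a minimal sketch of batch gradient descent for logistic regression on toy data, using the update rule $\theta_i = \theta_i - \eta \, \partial J / \partial \theta_i$ (the data and function names here are my own, for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_iters=1000):
    # prepend a column of ones so that theta_0 plays the role of the bias b
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    theta = np.zeros(X_b.shape[1])
    m = X_b.shape[0]
    for _ in range(n_iters):
        # nabla J(theta) = (1/m) * X_b^T @ (sigma(X_b @ theta) - y)
        grad = X_b.T @ (sigmoid(X_b @ theta) - y) / m
        theta -= eta * grad  # theta = theta - eta * gradient
    return theta

# toy 1-D data: positive class when x > 2.5
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
theta = fit_logistic(X, y, eta=0.5, n_iters=5000)
X_b = np.hstack([np.ones((6, 1)), X])
preds = (sigmoid(X_b @ theta) >= 0.5).astype(float)
```

On this separable toy set the learned decision boundary ends up near $x = 2.5$, so the predictions match the labels.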
Appendix

## Reference Learning
Sigmoid Function Analysis
Logistic Regression Learned in ten minutes, easy to understand (including the Spark solution process) [Bilibili]