Machine Learning and High-Dimensional Information Retrieval - Note 3 - Logistic Regression and Related Examples

3. Logistic Regression

When talking about logistic regression, the general setting is that we have data points $X \in \mathbb{R}^{p}$ and an output variable $Y \in \{-1,+1\}$. This is a so-called binary classification problem.

The task is to find a function $f$ in a predefined function class $\mathcal{F}$ such that $\operatorname{sign} f(X)$ predicts $Y$ as well as possible. A commonly used loss function to measure the "accuracy" of a prediction function is motivated by the number of misclassifications, i.e. whether the sign of $f(X)$ disagrees with the sign of the true output $Y$. We use the 0-1 loss function

$$L_{0,1}(Y, f(X))= \begin{cases}1 & \text{if } Y \operatorname{sign} f(X) \leq 0 \\ 0 & \text{otherwise}\end{cases} \tag{3.1}$$

for this purpose.

To find the best prediction function $f \in \mathcal{F}$ for a given set of training samples $\left(x_{i}, y_{i}\right)_{i=1, \ldots, n}$, the goal is to find the $f \in \mathcal{F}$ that minimizes the empirical expected loss

$$\frac{1}{n} \sum_{i=1}^{n} L_{0,1}\left(y_{i}, f\left(x_{i}\right)\right) \tag{3.2}$$
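As a quick illustration (not part of the original notes; the toy data below is made up), the empirical 0-1 loss (3.2) of an affine classifier $f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b$ can be evaluated directly with NumPy:

import numpy as np

def empirical_01_loss(w, b, X, y):
    # empirical 0-1 loss (3.2); X is (n, p), y has entries in {-1, +1}
    margins = y * (X @ w + b)        # y_i * f(x_i); a sample counts as misclassified iff this is <= 0
    return np.mean(margins <= 0)

# hypothetical toy data, only to show the call
X = np.array([[1.0, 2.0], [-1.0, -0.5], [0.5, -2.0]])
y = np.array([1, -1, -1])
print(empirical_01_loss(np.array([1.0, 1.0]), 0.0, X, y))  # prints 0.0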

However, finding the minimum of this problem is numerically infeasible. Even if we only consider the class of affine functions $\mathcal{F}_{\text{aff}}$, the problem remains hard to solve numerically because the loss function is neither continuous nor convex. To make it tractable, we approximate the (non-continuous, non-convex) function $L_{0,1}$ by a convex loss function.

Supplement: Convexity

Definition 3.1

A set $\mathcal{C} \subset \mathbb{R}^{n}$ is convex if for any pair of elements $\mathbf{x}_{1}, \mathbf{x}_{2} \in \mathcal{C}$ the point $t \mathbf{x}_{2}+(1-t) \mathbf{x}_{1}$ is also an element of $\mathcal{C}$ for all $t \in[0,1]$. A function $f: \mathcal{C} \rightarrow \mathbb{R}$ is called convex if $t f\left(\mathbf{x}_{2}\right)+(1-t) f\left(\mathbf{x}_{1}\right) \geq f\left(t \mathbf{x}_{2}+(1-t) \mathbf{x}_{1}\right)$ for all $\mathbf{x}_{1}, \mathbf{x}_{2} \in \mathcal{C}$ and $t \in[0,1]$. It is called strictly convex if the inequality is strict.

Example: $f: \mathbb{R}^{+} \rightarrow \mathbb{R}, x \mapsto 1/x$ is convex.

Theorem 3.2

If $f, g$ are convex, then the following functions are also convex:

  • $h=\max(f, g)$
  • $h=f+g$
  • $h=g \circ f$, if $g$ is non-decreasing


Figure 3.1: The $L_{0,1}$ loss and the log loss, with the horizontal axis $t := y f(x)$
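Since the image itself is not reproduced here, the following short matplotlib sketch (not from the original notes) recreates the comparison of Figure 3.1, plotting the 0-1 loss and the log loss $\log(1+e^{-t})$ against the margin $t = y f(x)$:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(-4, 4, 401)
loss_01 = (t <= 0).astype(float)   # L_{0,1} as a function of the margin t
log_loss = np.log1p(np.exp(-t))    # log(1 + e^{-t})

plt.plot(t, loss_01, label=r'$L_{0,1}$')
plt.plot(t, log_loss, label=r'$\log(1+e^{-t})$')
plt.xlabel(r'$t := y\,f(x)$')
plt.ylabel('loss')
plt.legend()
plt.show()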

Theorem 3.3

A local minimum of a strictly convex function coincides with its global minimum. If it exists, it is unique.

For the problem at hand this means the following: we choose $f$ to be affine (i.e. $f(\mathbf{x})=\mathbf{w}^{\top} \mathbf{x}+b$, which is convex by definition), and as loss function we use the convex log loss $\ell(t)=\log\left(1+e^{-t}\right)$. The combination of the two is convex. To see this, compute the second derivatives and study the Hessian matrix; it turns out to have only non-negative eigenvalues. The reason for choosing the log loss is that it can be interpreted as a convex approximation of the $L_{0,1}$ loss, as shown in Figure 3.1. Given some training data $\left\{\left(\mathbf{x}_{i}, y_{i}\right)\right\}_{i=1, \ldots, n}$, we can find the optimal parameters $\mathbf{w}$ and $b$ by solving the optimization problem

$$\min_{\mathbf{w} \in \mathbb{R}^{p}, b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} \log\left(1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)\right). \tag{3.3}$$
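As a minimal sketch (assuming NumPy arrays with one sample per row and labels in $\{-1,+1\}$; this is not code from the notes), the objective (3.3) can be evaluated in a numerically stable way with np.logaddexp, which computes $\log(e^{a}+e^{b})$:

import numpy as np

def logistic_objective(w, b, X, y):
    # average logistic loss (3.3); X is (n, p), y has entries in {-1, +1}
    margins = y * (X @ w + b)
    # log(1 + exp(-m)) == logaddexp(0, -m), stable even for large |m|
    return np.mean(np.logaddexp(0.0, -margins))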

Convexity of the cost function

For simplicity, we only consider the cost function for linear $f$; the extension to affine $f$ is straightforward. Let us write it as
$$F(\mathbf{w})=\sum_{i=1}^{n} \log\left(1+\exp\left(-y_{i} \mathbf{w}^{\top} \mathbf{x}_{i}\right)\right).$$

We also use the auxiliary function $g(z)=1/\left(1+e^{-z}\right)$, for which $g^{\prime}(z)=g(z)(1-g(z))$. The first and second partial derivatives of $F$ are therefore
$$\frac{\partial}{\partial w^{(j)}} F(\mathbf{w})=-\sum_{i} y_{i} x_{i}^{(j)}\left(1-g\left(y_{i} \mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)$$

and

$$\frac{\partial^{2}}{\partial w^{(j)} \partial w^{(k)}} F(\mathbf{w})=\sum_{i} y_{i}^{2} x_{i}^{(j)} x_{i}^{(k)} g\left(y_{i} \mathbf{w}^{\top} \mathbf{x}_{i}\right)\left(1-g\left(y_{i} \mathbf{w}^{\top} \mathbf{x}_{i}\right)\right),$$

where $y_{i}^{2}=1$. To show that the Hessian is positive semidefinite, we need to show that $a^{\top} \nabla^{2} F\, a \geq 0$ for all $a$. We define the auxiliary quantities $P_{i}=g\left(y_{i} \mathbf{w}^{\top} \mathbf{x}_{i}\right)\left(1-g\left(y_{i} \mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)$ and $\rho_{i}^{(j)}=x_{i}^{(j)} \sqrt{P_{i}}$. Then

$$a^{\top} \nabla^{2} F\, a=\sum_{i} \sum_{j, k} a_{j} a_{k} x_{i}^{(j)} x_{i}^{(k)} P_{i}=\sum_{i} a^{\top} \rho_{i} \rho_{i}^{\top} a \geq 0.$$

This holds more generally for the composition of any convex function with an affine function.

In the following, we provide a probabilistic interpretation of this optimization problem. First, note that the conditional probability of observing $Y=y$ given $\mathbf{x}$ is defined as

$$\operatorname{Pr}(Y=y \mid \mathbf{x})=\exp(-\ell(y, f(\mathbf{x})))=\frac{1}{1+\exp\left(-y\left(\mathbf{w}^{\top} \mathbf{x}+b\right)\right)}. \tag{3.4}$$

To find a solution of (3.3), the usual approach is to use gradient-based methods. In the simplest form, gradient descent, we compute the gradient at a given iterate and then take a step in the direction of the negative gradient. The gradient of the function $F(\mathbf{w}, b)=\sum_{i} \log\left(1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)\right)$ is determined by computing the partial derivatives

$$\begin{aligned} \frac{\partial}{\partial w^{(k)}} F(\mathbf{w}, b)&=\sum_{i=1}^{n} \frac{\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)}{1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)}\left(-y_{i} x_{i}^{(k)}\right) \\ &=\sum_{i=1}^{n} \frac{1}{1+\exp\left(y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)}\left(-y_{i} x_{i}^{(k)}\right) \\ &=\sum_{i \mid y_{i}=1} \frac{1}{1+\exp\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)}\left(-x_{i}^{(k)}\right)+\sum_{i \mid y_{i}=-1} \frac{1}{1+\exp\left(-\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)}\, x_{i}^{(k)} \end{aligned}$$

The coefficients $1/\left(1+\exp\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)$ and $1/\left(1+\exp\left(-\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)\right)$ are the probabilities of an erroneous prediction (cf. (3.4)).

Therefore, when we take a step in the direction of the negative gradient, we move "against" these errors. This is why gradient descent methods are also called error-driven methods: the errors of the current model (here given by the weights $(\mathbf{w}, b)$) are used to improve it. The negative gradient points in the direction that reduces the errors of the current model.

In summary:

Logistic regression is a supervised classification method. Its decision function is affine, and the loss is measured by $L(y, f(\mathbf{x}))=\log\left(1+e^{-y f(\mathbf{x})}\right)$. The optimal parameters $\mathbf{w}^{\star}, b^{\star}$ are found by minimizing the empirical expected loss, that is, $\min_{\mathbf{w}, b} \frac{1}{N} \sum_{i} \log\left(1+e^{-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)}\right)$. Once the optimal $\mathbf{w}^{\star}, b^{\star}$ are determined, we can classify a new data point $\mathbf{x}_{\text{new}}$ by computing $\operatorname{sign}\left(\mathbf{w}^{\star \top} \mathbf{x}_{\text{new}}+b^{\star}\right)$. We can also compute the probability that this classification is correct via equation (3.4).
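A minimal sketch of this decision rule (hypothetical variable names, not from the notes): the predicted label is the sign of the affine score, and (3.4) gives the probability that this label is correct.

import numpy as np

def predict(w_star, b_star, x_new):
    # returns the predicted label in {-1, +1} and the probability (3.4) that it is correct
    score = float(w_star @ x_new + b_star)
    label = 1 if score >= 0 else -1
    prob = 1.0 / (1.0 + np.exp(-label * score))
    return label, prob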

3.1 An Alternative Approach to Logistic Regression

First, note that the name logistic regression can be misleading: logistic regression is in fact not a regression method but a classification method. The previous section was mostly driven by optimization; here we present a more statistical approach to logistic regression.

Example:
We try to predict the probability of death given certain parameters. Let $x_{1}$ be a person's age, $x_{2}$ the gender (0 for male, 1 for female) and $x_{3}$ the cholesterol level. We assume that these values can be combined linearly to obtain a real number that is somehow related to the probability of death,

$$w_{0}+w_{1} x_{1}+w_{2} x_{2}+w_{3} x_{3}=\mathbf{w}^{\top} \mathbf{x}$$

where $\mathbf{x}=\left[1, x_{1}, x_{2}, x_{3}\right]^{\top}$ and $\mathbf{w}=\left[w_{0}, w_{1}, w_{2}, w_{3}\right]^{\top}$. The values $w_{i}$ are called weights and $w_{0}$ is called the bias. The resulting value lies in $\mathbb{R}$. To turn it into a probability, we need a function $\sigma$ that squashes this value into the interval $[0,1]$. One function that accomplishes this is the logistic function

$$\sigma(a)=\frac{1}{1+e^{-a}} \tag{3.5}$$

The resulting model is $P(\text{death} \mid \mathbf{x})=\sigma\left(\mathbf{w}^{\top} \mathbf{x}\right)$.


More generally, we consider training data for a binary classification problem, $D=\left\{\left(\mathbf{x}_{1}, z_{1}\right), \ldots,\left(\mathbf{x}_{n}, z_{n}\right)\right\}$ with $\mathbf{x}_{i} \in \mathbb{R}^{d}$ and $z_{i} \in\{0,1\}$, and model the dependence between input and output variables as $z_{i} \sim \operatorname{Bernoulli}\left(\sigma\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)$, where we assume that the $z_{i}$ are independent.

To train this model, we need to find the maximum likelihood estimate of $\mathbf{w}$ given $D$, that is

$$\mathbf{w}_{\mathrm{MLE}}=\arg \max_{\mathbf{w}} \operatorname{Pr}(D \mid \mathbf{w})$$

with

$$\operatorname{Pr}(D \mid \mathbf{w})=\prod_{i=1}^{n} \operatorname{Pr}\left(z_{i} \mid \mathbf{x}_{i}, \mathbf{w}\right)=\prod_{i=1}^{n} \sigma\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)^{z_{i}}\left(1-\sigma\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)^{1-z_{i}} \tag{3.6}$$

For optimization purposes, it is common to use the negative $\log$ of the above conditional probability, i.e.
$$L(\mathbf{w})=-\log \operatorname{Pr}(D \mid \mathbf{w})=-\sum_{i=1}^{n}\left[z_{i} \log\left(\sigma\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)+\left(1-z_{i}\right) \log\left(1-\sigma\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)\right]. \tag{3.7}$$
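For concreteness, a small sketch of evaluating (3.7) (assuming one sample per row and labels $z_i \in \{0,1\}$; not from the original notes). It uses the identities $\log \sigma(m)=-\log(1+e^{-m})$ and $\log(1-\sigma(m))=-\log(1+e^{m})$ that also appear in the exercise below:

import numpy as np

def negative_log_likelihood(w, X, z):
    # L(w) from (3.7); X is (n, d) with one sample per row, z has entries in {0, 1}
    m = X @ w
    # z * log(1 + e^{-m}) + (1 - z) * log(1 + e^{m}), evaluated stably with logaddexp
    return np.sum(z * np.logaddexp(0.0, -m) + (1 - z) * np.logaddexp(0.0, m))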

The corresponding gradient (with respect to $\mathbf{w}$) is given by

$$\mathbf{g}=\nabla_{\mathbf{w}} L(\mathbf{w})=\sum_{i=1}^{n}\left(\sigma\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)-z_{i}\right) \mathbf{x}_{i}=\mathbf{X}\left(\sigma\left(\mathbf{X}^{\top} \mathbf{w}\right)-\mathbf{z}\right) \tag{3.8}$$

where $\mathbf{X}=\left[\mathbf{x}_{1}, \ldots, \mathbf{x}_{n}\right] \in \mathbb{R}^{d \times n}$. The corresponding Hessian matrix $\mathbf{H}$ with respect to $\mathbf{w}$ is given by

$$\mathbf{H}=\nabla_{\mathbf{w}}^{2} L(\mathbf{w})=\mathbf{X} \mathbf{B} \mathbf{X}^{\top} \tag{3.9}$$

where $\mathbf{B}=\operatorname{diag}\left(\sigma\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)\left(1-\sigma\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)\right) \in \mathbb{R}^{n \times n}$. The Hessian matrix $\mathbf{H}$ is positive semidefinite, hence $L$ is convex.
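A sketch of (3.8) and (3.9) in matrix form, following the convention of the text that $\mathbf{X} \in \mathbb{R}^{d \times n}$ holds one sample per column (illustrative code, not from the notes):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient_and_hessian(w, X, z):
    # gradient (3.8) and Hessian (3.9); X is (d, n), z has entries in {0, 1}
    p = sigmoid(X.T @ w)           # sigma(X^T w), one probability per sample
    g = X @ (p - z)                # eq. (3.8)
    B = np.diag(p * (1.0 - p))     # diagonal weight matrix
    H = X @ B @ X.T                # eq. (3.9), positive semidefinite
    return g, H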

Assume that the Hessian is invertible. Then the Newton iteration takes the form

$$\begin{aligned} \mathbf{w}_{t+1} &=\mathbf{w}_{t}-\mathbf{H}^{-1} \mathbf{g} \\ &=\mathbf{w}_{t}-\left(\mathbf{X} \mathbf{B} \mathbf{X}^{\top}\right)^{-1} \mathbf{X}\left(\sigma\left(\mathbf{X}^{\top} \mathbf{w}_{t}\right)-\mathbf{z}\right) \\ &=\left(\mathbf{X} \mathbf{B} \mathbf{X}^{\top}\right)^{-1} \mathbf{X} \mathbf{B} \mathbf{r}_{t} \end{aligned} \tag{3.10}$$

where $\mathbf{r}_{t}=\mathbf{X}^{\top} \mathbf{w}_{t}-\mathbf{B}^{-1}\left(\sigma\left(\mathbf{X}^{\top} \mathbf{w}_{t}\right)-\mathbf{z}\right)$. This is the solution of the weighted least squares problem $\arg \min_{\mathbf{w}} \sum_{i} b_{i}\left(r_{i}-\mathbf{w}^{\top} \mathbf{x}_{i}\right)^{2}$.
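The Newton update (3.10) can then be sketched as follows (assuming the Hessian is invertible, as in the text; np.linalg.solve is used instead of forming $\mathbf{H}^{-1}$ explicitly):

import numpy as np

def newton_step(w, X, z):
    # one Newton / IRLS iteration (3.10); X is (d, n), z has entries in {0, 1}
    p = 1.0 / (1.0 + np.exp(-(X.T @ w)))   # sigma(X^T w)
    g = X @ (p - z)                        # gradient (3.8)
    H = X @ np.diag(p * (1.0 - p)) @ X.T   # Hessian (3.9)
    return w - np.linalg.solve(H, g)       # w_{t+1} = w_t - H^{-1} g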

Exercise: Prove that equation (3.7) and equation (3.3) are equivalent up to the scalar factor $1/n$.


Proof

First, note that $\log\left(\sigma\left(\mathbf{w}^{\top} \mathbf{x}\right)\right)=-\log\left(1+\exp\left(-\mathbf{w}^{\top} \mathbf{x}\right)\right)$ and $\log\left(1-\sigma\left(\mathbf{w}^{\top} \mathbf{x}\right)\right)=-\log\left(1+\exp\left(\mathbf{w}^{\top} \mathbf{x}\right)\right)$. Therefore, equation (3.7) is equivalent to

$$L(\mathbf{w})=\sum_{i=1}^{n} z_{i} \log\left(1+\exp\left(-\mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)+\left(1-z_{i}\right) \log\left(1+\exp\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)\right) \tag{3.11}$$

Since $z_{i}$ is either 0 or 1, we can rewrite the sum as

$$L(\mathbf{w})=\sum_{z_{i}=1} \log\left(1+\exp\left(-\mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)+\sum_{z_{i}=0} \log\left(1+\exp\left(\mathbf{w}^{\top} \mathbf{x}_{i}\right)\right). \tag{3.12}$$

Now, comparing the labels $z_{i}$ with the labels $y_{i}$ in (3.3), we see that $z_{i}=1 \Leftrightarrow y_{i}=1$ and $z_{i}=0 \Leftrightarrow y_{i}=-1$, so

$$\begin{aligned} L(\mathbf{w}) &=\sum_{y_{i}=1} \log\left(1+\exp\left(-y_{i} \mathbf{w}^{\top} \mathbf{x}_{i}\right)\right)+\sum_{y_{i}=-1} \log\left(1+\exp\left(-y_{i} \mathbf{w}^{\top} \mathbf{x}_{i}\right)\right) \\ &=\sum_{i=1}^{n} \log\left(1+\exp\left(-y_{i} \mathbf{w}^{\top} \mathbf{x}_{i}\right)\right), \end{aligned} \tag{3.13}$$

which coincides with equation (3.3) up to the factor $1/n$.

3.2 Linear Separability and Logistic Regression

Note that logistic regression can overfit on linearly separable training sets. When the two classes $1$ and $-1$ are linearly separable, we can find a hyperplane $\left(\mathbf{w}_{s}, b_{s}\right)$ such that the following inequality holds:

$$y_{i}\left(\mathbf{w}_{s}^{\top} \mathbf{x}_{i}+b_{s}\right)>0 \quad \forall i. \tag{3.14}$$
Consider the following theorem.

Theorem 3.4

For a linearly separable, non-empty training set, the loss function

$$F(\mathbf{w}, b)=\sum_{i=1}^{n} \log\left(1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)\right) \tag{3.15}$$

has no global minimum in $\mathbb{R}^{p+1}$.


Proof
First, let us characterize a global minimum of $F$: it is a point $\left(\mathbf{w}^{*}, b^{*}\right) \in \mathbb{R}^{p+1}$ such that

$$F(\mathbf{w}, b) \geq F\left(\mathbf{w}^{*}, b^{*}\right) \quad \forall(\mathbf{w}, b) \in \mathbb{R}^{p+1} \tag{3.16}$$
holds. If the training set is non-empty, the loss function is strictly positive. Therefore, the minimum value of $F$ would be some positive number $\varepsilon$, i.e.

$$F\left(\mathbf{w}^{*}, b^{*}\right)=\varepsilon>0. \tag{3.17}$$

We now prove the non-existence of a global minimum by contradiction. Suppose there exist a point $\left(\mathbf{w}_{s}, b_{s}\right)$ and a real number $\varepsilon>0$ such that the following conditions hold:

$$y_{i}\left(\mathbf{w}_{s}^{\top} \mathbf{x}_{i}+b_{s}\right)>0 \;\; \forall i \quad \text{and} \quad F(\mathbf{w}, b) \geq \varepsilon \;\; \forall(\mathbf{w}, b) \in \mathbb{R}^{p+1}. \tag{3.18}$$

For each $i$, define the scalar value
$$\xi_{i}=y_{i}\left(\mathbf{w}_{s}^{\top} \mathbf{x}_{i}+b_{s}\right) \tag{3.19}$$

and consider the functions

$$f_{i}(h)=\log\left(1+\exp\left(-h \xi_{i}\right)\right) \tag{3.20}$$

It is easy to see that $\xi_{i}$ is strictly positive for every $i$, so $f_{i}(h)$ approaches 0 as $h$ approaches $\infty$. The same holds if we consider the sum over $i$, i.e.

$$\lim_{h \rightarrow \infty} \sum_{i=1}^{n} f_{i}(h)=\lim_{h \rightarrow \infty} F\left(h \mathbf{w}_{s}, h b_{s}\right)=0. \tag{3.21}$$

In other words, for any $\varepsilon>0$ we can find a real number $\eta>0$ such that for any $h \geq \eta$ the following inequality holds:
$$F\left(h \mathbf{w}_{s}, h b_{s}\right)<\varepsilon \tag{3.22}$$

This directly contradicts assumption (3.18), since we can choose $\mathbf{w}=h \mathbf{w}_{s}$ and $b=h b_{s}$ with $h \geq \eta$.

The proof shows that once a separating hyperplane has been found, the value of the loss function can always be decreased further by increasing the magnitude of the hyperplane parameters. Note that this is true for any hyperplane that separates the classes. In practice, an optimization algorithm may pick a hyperplane with a "bad" position and orientation and keep increasing the magnitude of its parameters until the maximum number of iterations is reached.

To prevent this, one usually penalizes the magnitude of $(\mathbf{w}, b)$ by introducing a regularizer, e.g. by fixing a real constant $\lambda>0$ and modifying the original cost function to

$$\tilde{F}(\mathbf{w}, b) = F(\mathbf{w}, b)+\lambda\left(\|\mathbf{w}\|^{2}+b^{2}\right) \tag{3.23}$$
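A small sketch of the regularized cost (3.23), again assuming labels in $\{-1,+1\}$ (illustrative, not from the notes). The regularizer adds $\lambda(\|\mathbf{w}\|^{2}+b^{2})$ to the loss and, correspondingly, $2\lambda\mathbf{w}$ and $2\lambda b$ to the gradients:

import numpy as np

def regularized_cost(w, b, X, y, lam):
    # F~(w, b) from (3.23); X is (n, p), y has entries in {-1, +1}, lam > 0
    margins = y * (X @ w + b)
    return np.sum(np.logaddexp(0.0, -margins)) + lam * (w @ w + b * b)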

3.3 Additional Material on Logistic Regression

Task 1. Consider the binary classification problem of assigning a label $y \in\{-1,1\}$ to data samples by means of logistic regression. We are given labeled training data $\left\{\left(\mathbf{x}_{1}, y_{1}\right), \ldots,\left(\mathbf{x}_{N}, y_{N}\right)\right\}$. To recap, the loss function is given by

$$L(\mathbf{w}, b)=\sum_{i=1}^{N} \log\left(1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)\right)$$

3.3.1 The Gradient $\nabla_{\mathbf{w}, b} L$

Solution: According to the chain rule, we have

$$\begin{aligned} \nabla_{b} L &=\sum_{i=1}^{N} \frac{\nabla_{b}\left(1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)\right)}{1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)} \\ &=-\sum_{i=1}^{N} y_{i} \frac{\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)}{1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)} \\ &=-\sum_{i=1}^{N} \frac{y_{i}}{1+\exp\left(y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)} \end{aligned}$$

Accordingly, we get

$$\begin{aligned} \nabla_{\mathbf{w}} L &=\sum_{i=1}^{N} \frac{1}{1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)} \nabla_{\mathbf{w}}\left(1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)\right) \\ &=-\sum_{i=1}^{N} y_{i} \frac{\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)}{1+\exp\left(-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)} \mathbf{x}_{i} \\ &=-\sum_{i=1}^{N} \frac{y_{i}}{1+\exp\left(y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)\right)} \mathbf{x}_{i} \end{aligned}$$

3.3.2 Global Minimum of the Loss Function

Assume that the two classes of the training set are linearly separable, that is, there exist a weight vector $\mathbf{w}_{s} \in \mathbb{R}^{p}$ and a bias $b_{s}$ such that
$$y_{i}\left(\mathbf{w}_{s}^{\top} \mathbf{x}_{i}+b_{s}\right)>0 \quad \forall i$$
holds. Prove that under this assumption the loss function has no global minimum $\left(\mathbf{w}^{*}, b^{*}\right) \in \mathbb{R}^{p+1}$.

Solution:
A global minimum of $L$ is a pair $\left(\mathbf{w}^{*}, b^{*}\right) \in \mathbb{R}^{p+1}$ such that
$$L(\mathbf{w}, b) \geq L\left(\mathbf{w}^{*}, b^{*}\right) \quad \forall(\mathbf{w}, b) \in \mathbb{R}^{p+1}$$
holds. Moreover, for a non-empty training set, $L$ is strictly positive, so we can conclude that
$$L\left(\mathbf{w}^{*}, b^{*}\right)=\varepsilon>0$$
Assume that such a point exists. Let us define
$$z_{i}=y_{i}\left(\mathbf{w}_{s}^{\top} \mathbf{x}_{i}+b_{s}\right)$$
Note that $z_{i}$ is strictly positive for every $i$. Consider the function
$$f(h)=\sum_{i=1}^{N} \log\left(1+\exp\left(-h z_{i}\right)\right)$$
Since every summand approaches 0 as $h$ approaches $\infty$, so does $f(h)$, that is,
$$\lim_{h \rightarrow \infty} f(h)=0$$
Observe the equality
$$f(h)=L\left(h \mathbf{w}_{s}, h b_{s}\right)$$
This means that for any $\varepsilon>0$ we can find an $h$ in $\mathbb{R}$ and set $(\mathbf{w}, b)=\left(h \mathbf{w}_{s}, h b_{s}\right)$ such that
$$L(\mathbf{w}, b)<\varepsilon$$
holds, which contradicts the assumption that $\left(\mathbf{w}^{*}, b^{*}\right)$ with $L\left(\mathbf{w}^{*}, b^{*}\right)=\varepsilon$ is a global minimum.

Note that $\left(\mathbf{w}_{s}, b_{s}\right)$ is not necessarily optimal in any sense. Depending on the algorithm, this may result in a "non-ideal" hyperplane whose parameters keep growing in magnitude.

3.3.3 Overfitting

To avoid the situation in 3.3.2, one can penalize the norm of $(\mathbf{w}, b)$ by adding a squared-norm regularizer. Consider the modified loss function
$$\tilde{L}(\mathbf{w}, b)=L(\mathbf{w}, b)+\lambda\left(\|\mathbf{w}\|^{2}+b^{2}\right)$$
where $\lambda>0$ is a real-valued constant. Compute the gradient $\nabla_{\mathbf{w}, b} \tilde{L}$.

Solution:
Due to the linearity of differentiation, we have
$$\nabla_{b} \tilde{L}=\nabla_{b} L+2 \lambda b$$
and
$$\nabla_{\mathbf{w}} \tilde{L}=\nabla_{\mathbf{w}} L+2 \lambda \mathbf{w}$$

3.4 Logistic Regression Example

The following exercise builds on the extracted feature representation, but we will implement logistic regression manually, that is, by minimizing $F(\mathbf{w})$ from Chapter 3 of the lecture notes instead of using a pre-built classifier. To do this, make sure the variables train, test, train_data_features and test_data_features (from the feature extraction) are loaded into your IPython shell.

a) Write a Python function logistic_gradient that expects the training set matrix X_train, the ground-truth label vector y_train and the current weight vector $\mathbf{w}$ as its input and returns the gradient $\mathbf{g}$. For the mathematical definitions, please refer to the handout.

b) Write a Python function find_w that expects a training set matrix X_train, a ground-truth label vector y_train, a step size alpha and a maximum number of iterations max_it, and determines the optimal logistic regression weight vector w_star by performing gradient descent, i.e. by calling logistic_gradient in each iteration. Make sure to incorporate the affine offset $w_{0}=b$ into your model.

c) The data set at hand is quite large. Applying standard gradient descent may cause Python to throw a MemoryError exception. To avoid this, we will use a variant of stochastic gradient descent that has proven successful in training deep neural networks. In minibatch learning, each iteration of the algorithm is replaced by a so-called epoch. In each epoch, the training set is randomly divided into equal-sized subsets, called minibatches. For each minibatch, the gradient is computed and a gradient step is applied using only the samples in that minibatch. An epoch ends when a gradient step has been executed for every minibatch. Modify find_w to enable minibatch learning: replace max_it by n_epochs and add the parameter n_minibatch to the function definition. Note: pay attention to the normalization of the gradient and the loss function.

d) Write a function classify_log that takes a weight vector w and a test set matrix X_test, classifies the samples via logistic regression, and returns a label vector y_test. Test your implementation of find_w and classify_log on train_data_features and test_data_features over 10 epochs, with a minibatch size of 100 and a step size of alpha=1.

e) Logistic regression is prone to overfitting. To prevent this, a regularizer can be used. Adjust your implementation so that it does not minimize $F(\mathbf{w})$ but instead minimizes
$$F(\mathbf{w})+\lambda\|\mathbf{w}\|^{2},$$
where $\lambda$ is a non-negative regularization parameter. Test your implementation with $\lambda=10^{-3}$.

3.4.1 Implementation in Python

The relevant Python code is as follows:

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re
import nltk
import matplotlib.pyplot as plt

nltk.download('stopwords')  # Download text data sets, including stop words
from nltk.corpus import stopwords  # Import the stop word list
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import roc_auc_score as AUC


# function for preprocessing the data
def review_prepro(data, remove_stopwords=False):
    stops = stopwords.words('english')
    # remove HTML tags
    review_text = BeautifulSoup(data, 'lxml').get_text()
    # remove non-letters and numbers
    letters_only = re.sub('[^a-zA-Z]',
                          ' ',
                          review_text)
    # make all characters lower case and split the documents into single words
    words = letters_only.lower().split()

    if remove_stopwords:
        # remove stop words
        meaningful_words = [w for w in words if not w in stops]
        # return concatenated single string
        return ' '.join(meaningful_words)
    else:
        # or don't and concatenate to single string
        return ' '.join(words)


def classify_log(w, X_test):
    w0 = w[0]
    y_test = sigmoid(w0 + np.dot(X_test.T, w[1:]))
    return y_test


def train_data_prep(train, vectorizer):
    """
    preprocess the training data using the bag of words in Sklearn
    :param train:
    :return: processed training data
    """
    # load train data
    num_reviews = train['review'].size
    clean_train_reviews = []
    for i in range(num_reviews):
        if (i + 1) % 1000 == 0:
            print('\r Review {} of {} - Training'.format(i + 1, num_reviews), end="")
        clean_train_reviews.append(review_prepro(train['review'][i], remove_stopwords=True))

    # fit the vectorizer to the data
    train_data_features = vectorizer.fit_transform(clean_train_reviews)
    # convert to numpy array
    train_data_features = train_data_features.toarray()

    return train_data_features


def test_data_prep(test, vectorizer):
    """
    preprocess the testing data from the raw input
    :param test:
    :return: processed testing data
    """
    num_test_reviews = test['review'].size
    clean_test_reviews = []
    for i in range(num_test_reviews):
        if (i + 1) % 1000 == 0:
            print('\r Review {} of {}'.format(i + 1, num_test_reviews), end='')
        clean_test_reviews.append(review_prepro(test['review'][i], remove_stopwords=True))

    test_data_features = (vectorizer.transform(clean_test_reviews)).toarray()
    return test_data_features


def sigmoid(x):
    # https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/
    z = np.exp(-np.abs(x))
    return np.where(x >= 0.0, 1.0 / (1.0 + z), z / (1.0 + z))


def logistic_gradient(x_train, y_train, w, reg=0.0):
    # gradient of the averaged logistic loss; x_train has one sample per column,
    # y_train holds labels in {-1, +1}; the regularizer reg*||w||^2 contributes 2*reg*w
    g = -np.dot(x_train * y_train, sigmoid(-np.dot(w, x_train) * y_train)) / x_train.shape[1] + 2.0 * reg * w
    return g


def find_w(x_train, y_train, alpha, n_epochs, n_minibatch, reg=0.0):
    """
    Using this function to find the best w.
    :param x_train: the training data, which has the shapes [samples, features]
    :param y_train: the training label, which has the shapes [outputs,]
    :param alpha: step constant
    :param n_epochs: training epochs
    :param n_minibatch: we divided the training data into some small training set to accelerate the training speed.
    :param reg: the factor of the regularizer
    :return: the best weight w_star
    """
    x_train = x_train.T
    x_pre = np.ones((x_train.shape[0] + 1, x_train.shape[1]))
    x_pre[1:, :] = x_train
    w = np.ones((x_pre.shape[0],))
    print('x_pre shape ={}, w shape = {}'.format(x_pre.shape, w.shape))
    loss_ = []
    for k in range(n_epochs):
        loss = np.sum(-np.log(sigmoid(y_train * np.dot(w, x_pre)))) / x_pre.shape[1] + reg * np.sum(w ** 2)
        loss_.append(loss)
        rp = np.random.permutation(x_pre.shape[1])
        print('In {} iteration, the loss = {}'.format(k, loss))
        for it in range(x_pre.shape[1] // n_minibatch):
            delta_w = logistic_gradient(x_pre[:, rp[it * n_minibatch: (it + 1) * n_minibatch]],
                                        y_train[rp[it * n_minibatch:(it + 1) * n_minibatch]], w, reg=reg)
            w = w - alpha * delta_w
    return w,loss_


if __name__ == "__main__":
    # load the data
    train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
    test = pd.read_csv('labeledTestData.tsv', header=0, delimiter="\t", quoting=3)

    # download the stopwords
    stops = set(stopwords.words('english'))
    vectorizer = CountVectorizer(analyzer='word',
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words=stops,
                                 max_features=5000)

    # process the data
    train_data_features = train_data_prep(train, vectorizer)
    test_data_features = test_data_prep(test, vectorizer)
    y_train = (np.array(train['sentiment'].values) - 0.5) * 2

    # (1) Testing implementation without regularizers
    W, W_loss = find_w(train_data_features, y_train, 0.8, 20, 100)
    y_pred = classify_log(W, test_data_features.T)
    y_test = test['sentiment'].values
    auc = AUC(y_test, y_pred)
    print('AUC score after 20 epochs:', auc)

    # (2) Testing implementation with regularizers to mitigate overfitting
    w, w_loss = find_w(train_data_features, y_train, 1, 20, 100, 1e-3)
    y_pred = classify_log(w, test_data_features.T)
    auc = AUC(y_test, y_pred)
    print('AUC score after 20 epochs:', auc)

    # plot
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlabel('Iteration k')
    ax.set_ylabel('loss')
    ax.plot(W_loss, marker='.', label='without regularizers')
    ax.plot(w_loss, marker='.', label='with regularizers')
    ax.legend()
    plt.show()

The output is:

x_pre shape =(5001, 20000), w shape = (5001,)
In 0 iteration, the loss = 48.15060140614601
In 1 iteration, the loss = 1.9271375244901945
In 2 iteration, the loss = 1.1439895026827396
In 3 iteration, the loss = 0.8040511998178164
In 4 iteration, the loss = 0.637276530141581
In 5 iteration, the loss = 0.5275615696559821
In 6 iteration, the loss = 0.4531238182560532
In 7 iteration, the loss = 0.39835683526227206
In 8 iteration, the loss = 0.3585725270966355
In 9 iteration, the loss = 0.3208458216978429
In 10 iteration, the loss = 0.2989468342660105
In 11 iteration, the loss = 0.27127014988315357
In 12 iteration, the loss = 0.26380395477375235
In 13 iteration, the loss = 0.23870379629686295
In 14 iteration, the loss = 0.22874535039441457
In 15 iteration, the loss = 0.22097269464251518
In 16 iteration, the loss = 0.21411424494393824
In 17 iteration, the loss = 0.20234804193010983
In 18 iteration, the loss = 0.19146457065053046
In 19 iteration, the loss = 0.18977771952092898
AUC score after 20 epochs: 0.9201902430149848

x_pre shape =(5001, 20000), w shape = (5001,)
In 0 iteration, the loss = 53.15160140614601
In 1 iteration, the loss = 4.850572401123086
In 2 iteration, the loss = 3.7359586329298007
In 3 iteration, the loss = 3.2139682944306305
In 4 iteration, the loss = 2.907177612020046
In 5 iteration, the loss = 2.681058512521938
In 6 iteration, the loss = 2.5166307885635284
In 7 iteration, the loss = 2.392833861337298
In 8 iteration, the loss = 2.294087922661967
In 9 iteration, the loss = 2.2217974363161987
In 10 iteration, the loss = 2.161302804041983
In 11 iteration, the loss = 2.113409966477404
In 12 iteration, the loss = 2.0736801266948683
In 13 iteration, the loss = 2.061541837823529
In 14 iteration, the loss = 2.016740023535874
In 15 iteration, the loss = 1.9973296285347977
In 16 iteration, the loss = 1.9823663669782907
In 17 iteration, the loss = 1.967645885882539
In 18 iteration, the loss = 1.9572636195387196
In 19 iteration, the loss = 1.9538835403945283
AUC score after 20 epochs: 0.921975893241498

(Figure: training loss per epoch for the runs with and without the regularizer.)


Origin blog.csdn.net/qq_37266917/article/details/121772921