[Machine Learning] 7 Support Vector Machines

1 Optimization Objective

  • Hypothesis:
    $h_\theta(x)=\begin{cases} 1 & \text{if } \theta^Tx \ge 0\\ 0 & \text{otherwise} \end{cases}$
  • Cost Function:
    $J(\theta)=C\sum_{i=1}^m\left[y^{(i)}\text{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\text{cost}_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$
    $C$ plays a role similar to $\frac{1}{\lambda}$.
  • Goal:
    $\mathop{\text{minimize}}\limits_{\theta} J(\theta)$
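The cost terms $\text{cost}_1$ and $\text{cost}_0$ are not spelled out above; assuming the usual piecewise-linear hinge costs from the lectures, a minimal NumPy sketch of $J(\theta)$ could look like this (the helper names and the convention that $x_0=1$, so $\theta_0$ is unregularized, are my own choices):

```python
# A minimal sketch (assuming NumPy) of J(theta), using the standard
# piecewise-linear "hinge" costs for cost_1 and cost_0.
import numpy as np

def cost1(z):
    # Cost used when y = 1: zero once z >= 1, grows linearly below that.
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Cost used when y = 0: zero once z <= -1, grows linearly above that.
    return np.maximum(0.0, 1.0 + z)

def svm_cost(theta, X, y, C):
    """J(theta), with X including a leading column of ones (x_0 = 1)."""
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)   # sum over j = 1..n
    return data_term + reg_term
```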

2 Large Margin Intuition

(Figure: the piecewise-linear curves of $\text{cost}_1(z)$ and $\text{cost}_0(z)$ plotted against $z=\theta^Tx$.)

  • If $y^{(i)}=1$, we want $\theta^Tx \ge 1$ (not just $\ge 0$).
    If $y^{(i)}=0$, we want $\theta^Tx \le -1$ (not just $\le 0$).
    This is illustrated numerically in the sketch below.
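A tiny numeric illustration, reusing the hinge costs sketched in section 1 (values are arbitrary):

```python
# The cost only reaches zero once theta'x clears the +/-1 margin,
# not merely the 0 decision threshold.
import numpy as np

def cost1(z): return np.maximum(0.0, 1.0 - z)
def cost0(z): return np.maximum(0.0, 1.0 + z)

for z in (0.5, 1.0, 1.5):
    print("y=1, z =", z, "cost1 =", cost1(z))    # 0.5 -> 0.5, then 0.0 from z = 1.0 on
for z in (-0.5, -1.0, -1.5):
    print("y=0, z =", z, "cost0 =", cost0(z))    # -0.5 -> 0.5, then 0.0 from z = -1.0 on
```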

3 Mathematical Principles

$u=\begin{bmatrix} u_1\\u_2\end{bmatrix},\quad v=\begin{bmatrix} v_1\\v_2\end{bmatrix}$

  • $u^Tv = p\cdot||u|| = u_1v_1+u_2v_2$
    $p$: the (signed) length of the projection of $v$ onto $u$
  • $u^Tv = v^Tu$
  • $||u||=\sqrt{u_1^2+u_2^2}$
  • $\theta_0=0$: this means the decision boundary must pass through the origin $(0,0)$
  • The SVM essentially minimizes $||\theta||$:
    $\begin{aligned} \mathop{\text{minimize}}\limits_{\theta}\ & \frac{1}{2}\sum_{j=1}^n\theta_j^2\\ \text{s.t.}\ & p^{(i)}\cdot||\theta|| \ge 1 && \text{if } y^{(i)}=1\\ & p^{(i)}\cdot||\theta|| \le -1 && \text{if } y^{(i)}=0 \end{aligned}$
  • The projections $p^{(i)}$ must be large enough to obtain a large margin (a numerical check of the projection identity follows below).
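A minimal NumPy check of the identity $u^Tv = p\cdot||u||$, using arbitrary example vectors:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([2.0, 1.0])

inner = u @ v                    # u1*v1 + u2*v2 = 10
norm_u = np.linalg.norm(u)       # sqrt(u1^2 + u2^2) = 5
p = inner / norm_u               # signed projection length of v onto u = 2

print(inner, p * norm_u)         # both are 10.0
```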

4 Kernels

  • $h_\theta(x)=\theta_1f_1+\theta_2f_2+\cdots+\theta_nf_n$
  • Given $x$, compute new features $f$ depending on proximity to landmarks $l$
  • Kernel: $f_i=\text{similarity}(x,l^{(i)})$
  • Choose the landmarks to be the training examples themselves: $l^{(1)}=x^{(1)}, l^{(2)}=x^{(2)}, \cdots, l^{(m)}=x^{(m)}$. Each example $x^{(i)}$ is then mapped to the feature vector below (a code sketch follows this list):
    $f^{(i)}=\begin{bmatrix} f_0^{(i)}=1\\ f_1^{(i)}=\text{sim}(x^{(i)},l^{(1)})\\ f_2^{(i)}=\text{sim}(x^{(i)},l^{(2)})\\ \vdots\\ f_i^{(i)}=\text{sim}(x^{(i)},l^{(i)})=e^0=1\\ \vdots\\ f_m^{(i)}=\text{sim}(x^{(i)},l^{(m)}) \end{bmatrix}$
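A minimal sketch of this feature mapping (assuming NumPy; the similarity function used is the Gaussian kernel defined in section 4.1 below, and the helper names and toy data are illustrative):

```python
# Each training example doubles as a landmark, and feature f_j measures
# the similarity of x to landmark l^(j).
import numpy as np

def gaussian_similarity(x, l, sigma2=1.0):
    # exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))

def feature_vector(x, landmarks, sigma2=1.0):
    # f = [1, sim(x, l^(1)), ..., sim(x, l^(m))]
    return np.array([1.0] + [gaussian_similarity(x, l, sigma2) for l in landmarks])

X = np.array([[1.0, 2.0], [2.0, 0.5], [4.0, 4.0]])   # toy training set, m = 3
landmarks = X                                        # l^(i) = x^(i)
f1 = feature_vector(X[0], landmarks)
print(f1)   # f1[1] == 1.0 because x^(1) is its own landmark
```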

4.1 Gaussian Kernel

  • $f_i=\text{similarity}(x,l^{(i)})=\exp\left(-\frac{||x-l^{(i)}||^2}{2\sigma^2}\right)=\exp\left(-\frac{\sum_{j=1}^n(x_j-l_j^{(i)})^2}{2\sigma^2}\right)$
    Despite its form, it has no real connection to the normal distribution.
  • If $x \approx l^{(i)}$, then $f_i \approx 1$.
    If $x$ is far from $l^{(i)}$, then $f_i \approx 0$.
  • Perform feature scaling before using the Gaussian kernel.
  • The bandwidth $\sigma^2$ must be chosen; its effect is illustrated below and discussed in section 6.
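A small numeric illustration (toy values) of how $\sigma^2$ controls how quickly the similarity $f_i$ decays with distance from the landmark:

```python
import numpy as np

def gaussian_similarity(x, l, sigma2):
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))

l      = np.array([1.0, 1.0])
x_near = np.array([1.0, 1.1])
x_far  = np.array([5.0, 5.0])

for sigma2 in (0.5, 1.0, 4.0):
    print(sigma2,
          round(gaussian_similarity(x_near, l, sigma2), 3),   # stays close to 1
          round(gaussian_similarity(x_far,  l, sigma2), 3))   # close to 0, grows with sigma^2
```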

4.2 Linear Kernel

  • A linear kernel means not using a kernel at all.
  • Predict "$y=1$" if $\theta^Tx \ge 0$ (the original features are used directly).

5 SVM with Kernels

  • Hypothesis: given $x$, compute features $f\in\mathbb{R}^{m+1}$; predict "$y=1$" if $\theta^Tf \ge 0$.
  • Training:
    $\mathop{\text{min}}\limits_{\theta}\; C\sum_{i=1}^m\left[y^{(i)}\text{cost}_1(\theta^Tf^{(i)})+(1-y^{(i)})\text{cost}_0(\theta^Tf^{(i)})\right]+\theta^TM\theta$
    where $M$ is a matrix that depends on the chosen kernel.

6 Effect of the Parameters $C$ and $\sigma^2$

  • A large $C$ (similar to a small $\lambda$) may lead to overfitting: high variance.
    A small $C$ (similar to a large $\lambda$) may lead to underfitting: high bias.
  • A large $\sigma^2$ makes the features $f_i$ vary more smoothly: lower variance, higher bias.
    A small $\sigma^2$ makes the features $f_i$ vary less smoothly: lower bias, higher variance.

7 Using an SVM

  • Use an SVM software package to solve for the parameters $\theta$ (a minimal example follows this list).
  • Need to specify:
    (1) Choice of parameter C C C
    (2) Choice of kernel (similarity function)
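A hedged, minimal example of handing the optimization to an off-the-shelf package (assuming scikit-learn and NumPy are installed; toy data and parameter values are illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

# Toy training set: y = 1 in one small cluster, 0 in another.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [4.0, 4.0], [4.2, 3.8], [3.9, 4.1]])
y = np.array([1, 1, 1, 0, 0, 0])

# 'rbf' is scikit-learn's name for the Gaussian kernel; its gamma
# parameter corresponds to 1 / (2 * sigma^2) in the notation above.
sigma2 = 1.0
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / (2.0 * sigma2))
clf.fit(X, y)

print(clf.predict([[1.1, 1.0], [4.1, 4.0]]))   # expected: [1 0]
```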

7.1 Other Choices of Kernel

  • Not all similarity functions make valid kernels.
    A kernel needs to satisfy a technical condition called Mercer's Theorem so that the optimization routines in SVM packages run correctly and do not diverge.
  • Many off-the-shelf kernels:
    (1) Polynomial kernel
    (2) String kernel
    (3) Chi-square kernel
    (4) Histogram intersection kernel

7.2 Multi-class classification

  • Train $K$ SVMs, one to distinguish $y=i$ from the rest, for $i=1,2,\cdots,K$, obtaining $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(K)}$. Pick the class $i$ with the largest ${(\theta^{(i)})}^Tx$ (a sketch is given below).

  • Or use a package whose multi-class functionality is already built in.
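A minimal one-vs-rest sketch of the scheme above (assuming scikit-learn with a linear kernel; data and class labels are toy values):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.2, 0.1],
              [3.0, 0.0], [3.1, 0.2],
              [0.0, 3.0], [0.1, 3.2]])
y = np.array([0, 0, 1, 1, 2, 2])          # K = 3 classes

# Train K binary SVMs, one per class (class i vs. the rest).
classifiers = [SVC(C=1.0, kernel="linear").fit(X, (y == i).astype(int))
               for i in range(3)]

# Predict by picking the class whose classifier scores highest,
# the analogue of choosing the largest (theta^(i))^T x above.
x_new = np.array([[2.9, 0.1]])
scores = [clf.decision_function(x_new)[0] for clf in classifiers]
print(int(np.argmax(scores)))             # expected: 1
```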

8 Logistic Regression vs. Support Vector Machines

$n$: number of features
$m$: number of training examples

Choosing between them based on $n$ and $m$:

  • $n \gg m$: logistic regression, or an SVM without a kernel
  • $n$ small, $m$ intermediate: an SVM with a Gaussian kernel
  • $n$ small, $m$ large: create/add more features, then use logistic regression or an SVM without a kernel
  • A neural network may be very slow to train in the situations above.
  • A key strength of the SVM is that its cost function is convex, so there are no local minima.

9 References

Andrew Ng, Machine Learning, Coursera
Huang Haiguang (黄海广), Machine Learning course notes
