Deep Learning: Hung-yi Lee, Lecture 1 (2020)

Notes on Hung-yi Lee's deep learning course.

Predicting a Pokémon's combat power (CP)

Regression

  • Market Forecast: predict tomorrow's stock prices
  • Self-driving car: predict the steering-wheel angle
  • Recommendation: probability of purchase (recommender systems)

$f(x) = y$, where the input $x$ is a Pokémon and the output $y$ is its CP after evolution.

$x_{cp}$: CP before evolution; $x_s$: species; $x_{hp}$: hit points; $x_w$: weight; $x_h$: height.

Step 1. Model

A set of functions $f_1, f_2, f_3, \dots$ → a Model

Linear model:

$$y = b + w \cdot x_{cp}$$

$w$ and $b$ are parameters (they can take any value). More generally:

$$y = b + \sum_i w_i x_i$$

$x_i$: an attribute of the input $X$ (a feature)

$w_i$: weight

$b$: bias
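
As a minimal sketch, the linear model is just an affine function of one feature; the parameter values below are arbitrary placeholders, not fitted values:

```python
# Linear model: y = b + w * x_cp
def linear_model(x_cp, w, b):
    """Predict CP after evolution from CP before evolution."""
    return b + w * x_cp

# Hypothetical parameters; each (w, b) pair picks one function from the model
print(linear_model(100.0, w=2.7, b=10.0))  # -> 280.0
```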

Step 2. Goodness of function

Function inputs: $x^1, x^2, x^3, \dots$

Function outputs (scalars): $\hat{y}^1, \hat{y}^2, \hat{y}^3, \dots$

Loss function $L$:

  • input : a function
  • output : how bad it is

$$L(f) = L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - \left(b + w \cdot x_{cp}^n\right)\right)^2$$


Each squared term is an estimation error: the true value $\hat{y}^n$ minus the $y$ estimated by the input function. $L$ thus measures how good a particular pair of $w$ and $b$ is.
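
A sketch of computing this loss in Python. The ten $(x_{cp}^n, \hat{y}^n)$ pairs below are hypothetical placeholders, not the lecture's actual data:

```python
import numpy as np

# Hypothetical training pairs: CP before evolution -> CP after evolution
x_cp  = np.array([ 94., 112., 130., 160., 185., 210., 240., 275., 300., 320.])
y_hat = np.array([240., 280., 330., 390., 450., 500., 580., 650., 700., 740.])

def loss(w, b):
    """L(w, b): sum of squared estimation errors over the 10 examples."""
    return np.sum((y_hat - (b + w * x_cp)) ** 2)

print(loss(w=2.0, b=50.0))
```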

Step 3. Gradient Descent

Best function (pick the "best" function):

$$f^* = \arg\min_f L(f)$$

$$w^*, b^* = \arg\min_{w,b} L(w,b) = \arg\min_{w,b} \sum_{n=1}^{10}\left(\hat{y}^n - \left(b + w \cdot x_{cp}^n\right)\right)^2$$

First consider a loss function $L(w)$ with a single parameter $w$:

$$w^* = \arg\min_w L(w)$$
Any differentiable loss allows the gradient to be propagated back, so gradient descent can be applied.

  • (Randomly) pick an initial value $w^0$
  • Compute the derivative at $w^0$ and step against it, scaled by $\eta$:

$$\left.\frac{dL}{dw}\right|_{w=w^0} \qquad\Longrightarrow\qquad w^1 = w^0 - \eta \left.\frac{dL}{dw}\right|_{w=w^0}$$

$\eta$ is called the learning rate.

  • Repeat for many iterations

Gradient descent may end up at a local optimum, not the global optimum.
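
A sketch of this single-parameter loop, reusing the hypothetical x_cp and y_hat arrays above with $b$ fixed at 0 for simplicity; the learning rate is an arbitrary choice:

```python
# One-parameter gradient descent on L(w) = sum((y_hat - w * x_cp)^2)
eta = 1e-6   # learning rate (hypothetical value)
w = 0.0      # (randomly) picked initial value w^0

for step in range(10_000):  # many iterations
    dL_dw = np.sum(2 * (y_hat - w * x_cp) * (-x_cp))  # dL/dw at current w
    w = w - eta * dL_dw                               # w := w - eta * dL/dw

print(w)
```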

How about two parameters?

$$w^*, b^* = \arg\min_{w,b} L(w, b)$$

  • (Randomly) pick initial values $w^0, b^0$
  • Compute the partial derivatives and update both parameters:

$$\left.\frac{\partial L}{\partial w}\right|_{w=w^0,\,b=b^0} \qquad \left.\frac{\partial L}{\partial b}\right|_{w=w^0,\,b=b^0}$$

$$w^1 = w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0,\,b=b^0} \qquad b^1 = b^0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^0,\,b=b^0}$$

$$\nabla L = \begin{bmatrix} \dfrac{\partial L}{\partial w} \\[6pt] \dfrac{\partial L}{\partial b} \end{bmatrix} \quad \text{(the gradient)}$$

For linear regression, the loss function $L$ is convex, so there are no local optima.

Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:

$$L(w,b) = \sum_{n=1}^{10}\left(\hat{y}^n - \left(b + w \cdot x_{cp}^n\right)\right)^2$$

$$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(\hat{y}^n - \left(b + w \cdot x_{cp}^n\right)\right)\left(-x_{cp}^n\right)$$

$$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2\left(\hat{y}^n - \left(b + w \cdot x_{cp}^n\right)\right)(-1)$$
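
Plugging these analytic partial derivatives into the update loop gives a sketch like the following (hypothetical data and learning rate as above; at this raw feature scale $b$ converges slowly, hence the many iterations):

```python
# Two-parameter gradient descent using the analytic partial derivatives
eta = 1e-6
w, b = 0.0, 0.0  # (randomly) picked initial values w^0, b^0

for step in range(1_000_000):
    err = y_hat - (b + w * x_cp)        # estimation errors
    dL_dw = np.sum(2 * err * (-x_cp))   # dL/dw
    dL_db = np.sum(2 * err * (-1.0))    # dL/db
    w, b = w - eta * dL_dw, b - eta * dL_db  # simultaneous update

print(w, b)
```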

Model Selection

$$\begin{aligned}
\text{Model 1}&: y = b + w_1 x \\
\text{Model 2}&: y = b + w_1 x + w_2 x^2 \\
\text{Model 3}&: y = b + w_1 x + w_2 x^2 + w_3 x^3 \\
&\;\;\vdots
\end{aligned}$$
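
As a sketch of comparing these models, one can build polynomial features and fit each by least squares (using numpy's solver here instead of gradient descent, purely for brevity; data as above):

```python
# Fit Models 1..3 by building polynomial features [1, x, x^2, ...]
for degree in (1, 2, 3):
    X = np.vstack([x_cp ** d for d in range(degree + 1)]).T  # design matrix
    params, *_ = np.linalg.lstsq(X, y_hat, rcond=None)
    train_err = np.sum((y_hat - X @ params) ** 2)
    print(f"Model {degree}: training error = {train_err:.1f}")
```

np.linalg.lstsq solves the same argmin in closed form; since the loss is convex, gradient descent would reach the same solution.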

A more complex model always fits the training data at least as well, but does not always lead to better performance on testing data. This is overfitting.

Let's collect more data. It turns out there are hidden factors the previous model ignores, such as the species of the Pokémon.

Back to Step 1: Redesign the Model

$x_s$ = species of $x$

$$x \;\longrightarrow\;
\begin{cases}
y = b_1 + w_1 \cdot x_{cp} & \text{if } x_s = \text{Pidgey} \\
y = b_2 + w_2 \cdot x_{cp} & \text{if } x_s = \text{Weedle} \\
y = b_3 + w_3 \cdot x_{cp} & \text{if } x_s = \text{Caterpie} \\
y = b_4 + w_4 \cdot x_{cp} & \text{if } x_s = \text{Eevee}
\end{cases}
\;\longrightarrow\; y$$
This piecewise model is still a single linear model, written with indicator functions:

$$y = b_1\,\delta(x_s = \text{Pidgey}) + w_1\,\delta(x_s = \text{Pidgey})\,x_{cp} + \dots + b_4\,\delta(x_s = \text{Eevee}) + w_4\,\delta(x_s = \text{Eevee})\,x_{cp}$$

$$\delta(x_s = \text{Pidgey}) =
\begin{cases}
1 & \text{if } x_s = \text{Pidgey} \\
0 & \text{otherwise}
\end{cases}$$
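
A sketch of the indicator trick in Python; the parameter values are hypothetical:

```python
# Indicator (delta) features: one (b_i, w_i) pair per species
SPECIES = ("Pidgey", "Weedle", "Caterpie", "Eevee")

def predict(x_cp, x_s, b, w):
    """y = sum_i delta(x_s == species_i) * (b_i + w_i * x_cp)"""
    y = 0.0
    for i, species in enumerate(SPECIES):
        delta = 1.0 if x_s == species else 0.0  # indicator function
        y += delta * (b[i] + w[i] * x_cp)
    return y

# Hypothetical per-species parameters
b = [10.0, 5.0, 8.0, 20.0]
w = [2.7, 1.5, 1.8, 0.9]
print(predict(100.0, "Pidgey", b, w))  # only the Pidgey terms are nonzero
```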

Are there any other hidden factors?

Back to Step 2: Regularization

$$y = b + \sum_i w_i x_i$$

$$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i\right)\right)^2 + \lambda \sum_i (w_i)^2$$

training error + regularization term

The bias $b$ does not affect how smooth the function is, so it is left out of the regularization term.

  • Functions with smaller $w_i$ are better: the smaller the $w_i$, the smoother the function.
  • The larger $\lambda$ is, the less the training error counts in $L$.

Larger $\lambda$ gives a smoother function, but the function must not be made too smooth.
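
A sketch of the regularized loss, reusing the hypothetical single-feature data from above (so the sum over $i$ has one term); the parameter values are arbitrary:

```python
# Regularized loss: squared training error + lambda * sum of squared weights
def regularized_loss(w, b, lam):
    train_err = np.sum((y_hat - (b + w * x_cp)) ** 2)
    return train_err + lam * w ** 2  # note: b is not regularized

for lam in (0.0, 1.0, 100.0, 10_000.0):
    print(lam, regularized_loss(w=2.2, b=30.0, lam=lam))
```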

Why are smooth functions preferred?

A smooth function is less sensitive to noise in its input: if some noise corrupts the input $x_i$ at testing time, a smoother function is affected less.

Where do the errors come from?

  • bias
  • variance

A simpler model is less influenced by the particular sample of training data.

  • simple model → small variance, large bias (underfitting)
  • complex model → large variance, small bias (overfitting)

The function space of a complex model contains that of the simpler model.

For large bias, redesign your model:

  • add more features as input
  • a more complex model

What can be done about large variance?

  • more data (collect real data, or generate synthetic data): very effective, but not always practical
  • regularization


Reposted from blog.csdn.net/weixin_46489969/article/details/125054594