Deep Learning - Li Hongyi's first lesson 2020


Predict Pokémon's combat power

Regression

  • Stock market forecast - predict tomorrow's stock price
  • Self-driving car - predict the steering wheel angle
  • Recommendation - predict purchase likelihood (recommendation system)

$f(x) = y$, where $x$ is a Pokémon and $y$ is its CP (combat power) after evolution.

$x_{cp}$: combat power before evolution, $x_s$: species, $x_{hp}$: health points, $x_w$: weight, $x_h$: height

Step 1. Model

A model is a set of candidate functions $f_1, f_2, f_3, \dots$

Linear model:

$$y = b + w \cdot x_{cp}$$

$w$ and $b$ are parameters (they can take any value). More generally:

$$y = b + \sum_i w_i x_i$$

  • $x_i$: an attribute (feature) of the input $x$
  • $w_i$: weight
  • $b$: bias
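As a minimal sketch (the feature and parameter values below are made up for illustration), the linear model is just a weighted sum plus a bias:

```python
# Linear model: y = b + sum_i w_i * x_i, using plain Python lists.
def linear_model(x, w, b):
    """Predict y from feature vector x, weight vector w, and bias b."""
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))

# Single-feature case y = b + w * x_cp, with illustrative numbers:
y = linear_model([100.0], [0.9], 10.0)  # x_cp = 100, w = 0.9, b = 10
```

Every choice of `w` and `b` gives a different function from this set; the next two steps pick the best one.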

Step 2. Goodness of function

Training data: function inputs $x^1, x^2, x^3, \dots$ and the corresponding scalar outputs $\hat{y}^1, \hat{y}^2, \hat{y}^3, \dots$

Loss function L :

  • input : a function
  • output : how bad it is

$$L(f) = L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$$

Each summand is an estimation error: the squared difference between the true $\hat{y}^n$ and the value the function estimates. $L$ thus measures how good (or bad) a given pair $(w, b)$ is.
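The loss translates directly to code; this sketch assumes the training pairs are given as two parallel lists:

```python
def loss(w, b, xs, ys):
    """L(w, b): sum of squared errors of y = b + w*x over the training pairs."""
    return sum((y_hat - (b + w * x)) ** 2 for x, y_hat in zip(xs, ys))

# A perfect fit (data from y = 2x + 1) gives zero loss:
L = loss(2.0, 1.0, [1.0, 2.0], [3.0, 5.0])
```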

Step 3. Gradient Descent

Best function (pick the ''best'' function):

$$f^* = \arg\min_f L(f)$$

$$w^*, b^* = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$$

First consider a loss function $L(w)$ with a single parameter $w$:

$$w^* = \arg\min_w L(w)$$
As long as $L$ is differentiable with respect to $w$, gradient descent can be applied:

  • (Randomly) pick an initial value $w^0$
  • Compute the derivative and update:

$$\frac{dL}{dw}\bigg|_{w=w^0}, \qquad w^1 = w^0 - \eta\,\frac{dL}{dw}\bigg|_{w=w^0}$$

$\eta$ is called the learning rate.

  • Repeat for many iterations

Gradient descent may stop at a local optimum, not the global optimum.
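The update rule above can be sketched in a few lines; the toy loss $L(w) = (w-3)^2$ below is made up for illustration (it is convex, with its minimum at $w = 3$):

```python
def gradient_descent_1d(dL_dw, w0, eta, steps):
    """Repeat w <- w - eta * dL/dw for a fixed number of iterations."""
    w = w0
    for _ in range(steps):
        w = w - eta * dL_dw(w)
    return w

# Toy convex loss L(w) = (w - 3)^2, derivative 2*(w - 3); minimum at w = 3.
w_star = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, steps=100)
```

With $\eta = 0.1$ each step shrinks the distance to the minimum by a factor of $0.8$, so `w_star` is essentially 3 after 100 iterations; a too-large $\eta$ would make the iterates diverge instead.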

How about two parameters?

$$w^*, b^* = \arg\min_{w,b} L(w, b)$$

  • (Randomly) pick initial values $w^0, b^0$
  • Compute the partial derivatives and update:

$$\frac{\partial L}{\partial w}\bigg|_{w=w^0,\,b=b^0}, \qquad \frac{\partial L}{\partial b}\bigg|_{w=w^0,\,b=b^0}$$

$$w^1 = w^0 - \eta\,\frac{\partial L}{\partial w}\bigg|_{w=w^0,\,b=b^0}, \qquad b^1 = b^0 - \eta\,\frac{\partial L}{\partial b}\bigg|_{w=w^0,\,b=b^0}$$

$$\nabla L = \begin{bmatrix} \frac{\partial L}{\partial w} \\ \frac{\partial L}{\partial b} \end{bmatrix} \quad \text{(gradient)}$$

In linear regression, the loss function $L$ is convex, so there is no local optimum other than the global one.

Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:

$$L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$$

$$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)\left(-x_{cp}^n\right)$$

$$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)(-1)$$
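Plugging these partial derivatives into the two-parameter update gives a complete fitting loop. A minimal sketch; the training data below (generated from $y = 2x + 1$) and the step size are made up for illustration:

```python
def fit_linear(xs, ys, eta=0.01, steps=50000):
    """Gradient descent on L(w, b) using the closed-form partial derivatives."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dL_dw = sum(2 * (y - (b + w * x)) * (-x) for x, y in zip(xs, ys))
        dL_db = sum(2 * (y - (b + w * x)) * (-1) for x, y in zip(xs, ys))
        w -= eta * dL_dw
        b -= eta * dL_db
    return w, b

# Data sampled from y = 2x + 1; gradient descent recovers w ≈ 2, b ≈ 1.
w, b = fit_linear([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
```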

Model Selection

$$\text{Model 1}: \; y = b + w_1 x_{cp}$$

$$\text{Model 2}: \; y = b + w_1 x_{cp} + w_2 (x_{cp})^2$$

$$\text{Model 3}: \; y = b + w_1 x_{cp} + w_2 (x_{cp})^2 + w_3 (x_{cp})^3$$

$$\dots$$

A more complex model does not always lead to better performance on testing data. This is overfitting.

Let's collect more data. There are hidden factors the previous model ignores: the species of the Pokémon.

Back to Step 1: Redesign the Model

Let $x_s$ = species of $x$. The model branches on the species:

$$\text{if } x_s = \text{Pidgey}: \; y = b_1 + w_1 \cdot x_{cp}$$

$$\text{if } x_s = \text{Weedle}: \; y = b_2 + w_2 \cdot x_{cp}$$

$$\text{if } x_s = \text{Caterpie}: \; y = b_3 + w_3 \cdot x_{cp}$$

$$\text{if } x_s = \text{Eevee}: \; y = b_4 + w_4 \cdot x_{cp}$$

This is still a linear model, written with indicator functions:

$$y = b_1\,\delta(x_s = \text{Pidgey}) + w_1\,\delta(x_s = \text{Pidgey})\,x_{cp} + \dots + b_4\,\delta(x_s = \text{Eevee}) + w_4\,\delta(x_s = \text{Eevee})\,x_{cp}$$

$$\delta(x_s = \text{Pidgey}) = \begin{cases} 1, & \text{if } x_s = \text{Pidgey} \\ 0, & \text{otherwise} \end{cases}$$
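The indicator formulation translates directly to code. A sketch; the per-species parameter values below are hypothetical, not the fitted values from the lecture:

```python
def delta(cond):
    """Indicator function: 1 if the condition holds, 0 otherwise."""
    return 1.0 if cond else 0.0

def predict_cp(x_s, x_cp, params):
    """y = sum over species s of delta(x_s = s) * (b_s + w_s * x_cp)."""
    return sum(delta(x_s == s) * (b + w * x_cp) for s, (w, b) in params.items())

# Hypothetical (w, b) pairs for two species:
params = {"Pidgey": (2.0, 10.0), "Eevee": (1.5, 5.0)}
y = predict_cp("Pidgey", 5.0, params)  # only the Pidgey branch is active
```

For each input, exactly one indicator is 1 and the rest are 0, so only that species' $(w, b)$ pair contributes, yet the whole expression is still linear in the parameters.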

Are there any other hidden factors?

Back to Step 2: Regularization

$$y = b + \sum_i w_i x_i$$

$$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i\right)\right)^2 + \lambda \sum_i (w_i)^2$$

$L$ = training error + regularization term

The bias $b$ has nothing to do with the smoothness of the function, so it is not included in the regularization term.

  • Functions with smaller $w_i$ are smoother, hence preferred.
  • A larger $\lambda$ makes the training error count for less in $L$.

The larger $\lambda$ is, the smoother the learned function becomes, but it must not be too smooth.

Why are smooth functions preferred?

A smooth function is less sensitive to noise in the input: if some noise corrupts the input $x_i$ at testing time, a smooth function's output changes less.
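The regularized loss can be sketched for the multi-feature model; the data layout (one feature list per example) and the numbers are assumptions for illustration:

```python
def regularized_loss(w, b, X, ys, lam):
    """Sum of squared errors plus lambda * sum_i w_i^2 (bias b is not regularized)."""
    error = sum((y - (b + sum(wi * xi for wi, xi in zip(w, x)))) ** 2
                for x, y in zip(X, ys))
    return error + lam * sum(wi ** 2 for wi in w)

# With a perfect fit, the error term is zero and only the penalty remains:
L = regularized_loss([1.0], 0.0, [[2.0]], [2.0], lam=0.5)
```

Note that $b$ appears only in the error term, matching the point above that the bias is excluded from regularization.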

Where do the errors come from?

  • bias
  • variance

A simpler model is less influenced by the sampled data.

  • simple model → small variance, large bias ( underfitting )
  • complex model → large variance, small bias ( overfitting )

A more complex model contains the simpler ones as special cases.

For large bias, redesign your model:

  • add more features as input
  • a more complex model

What to do about large variance?

  • more data (collect real data, generate fake data) —— very effective, but not always practical
  • regularization


Origin blog.csdn.net/weixin_46489969/article/details/125054594