Li Hongyi Deep Learning Course
Predict Pokémon's combat power
Regression
- Market forecast - predict tomorrow's stock price
- Self-driving car - predict the steering wheel angle
- Recommendation - predict purchase likelihood (recommendation system)
$f(x) = y$, where $x$ is a Pokémon and $y$ is its CP after evolution
$x_{cp}$: combat power (CP) before evolution; $x_s$: species; $x_{hp}$: health points; $x_w$: weight; $x_h$: height
Step 1. Model
A set of functions ——→ Model ($f_1, f_2, f_3, \dots$)

Linear model:
$y = b + w \cdot x_{cp}$

$w$ and $b$ are parameters (they can take any values).

$y = b + \sum_i w_i x_i$

$x_i$: an attribute (feature) of the input $x$, i.e. one of its properties
$w_i$: weight
$b$: bias
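As a concrete illustration, the linear model $y = b + w \cdot x_{cp}$ can be sketched in Python. The particular $w$ and $b$ values below are made up for illustration, not fitted to any data:

```python
# Linear model y = b + w * x_cp: each (w, b) pair picks out one
# function f from the model family {f1, f2, f3, ...}.
def linear_model(x_cp, w, b):
    """Predict CP after evolution from CP before evolution."""
    return b + w * x_cp

# Two different parameter choices = two different functions.
f1 = lambda x: linear_model(x, w=2.0, b=10.0)
f2 = lambda x: linear_model(x, w=1.5, b=-5.0)

print(f1(100.0))  # 210.0
print(f2(100.0))  # 145.0
```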
Step 2. Goodness of function
function input: $x^1, x^2, x^3, \dots$
function output (scalar): $\hat{y}^1, \hat{y}^2, \hat{y}^3, \dots$
Loss function L :
- input : a function
- output : how bad it is
$L(f) = L(w,b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$
The estimation error is the difference between the true value $\hat{y}^n$ and the $y$ estimated by the input function. Summing the squared errors measures how good or bad a given pair of $w$ and $b$ is.
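The loss above can be sketched directly. The ten real Pokémon are replaced here by three made-up $(x_{cp}, \hat{y})$ pairs, so the numbers are assumptions for illustration only:

```python
# L(w, b) = sum over examples of (y_hat^n - (b + w * x_cp^n))^2
def loss(w, b, xs, ys):
    return sum((y - (b + w * x)) ** 2 for x, y in zip(xs, ys))

xs = [10.0, 20.0, 30.0]        # x_cp: CP before evolution (toy data)
ys = [25.0, 45.0, 65.0]        # y_hat: observed CP after evolution

print(loss(2.0, 5.0, xs, ys))  # 0.0 -- this (w, b) fits the toy data exactly
print(loss(1.0, 0.0, xs, ys))  # 2075.0 -- a worse (w, b), larger loss
```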
Step 3. Gradient Descent
Best function (pick the ''best'' function):

$f^* = \arg\min_f L(f)$

$w^*, b^* = \arg\min_{w,b} L(w,b) = \arg\min_{w,b} \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$
Consider a loss function $L(w)$ with one parameter $w$:

$w^* = \arg\min_w L(w)$
As long as $L$ is differentiable, its derivative can be computed and used for gradient descent.
- (Randomly) pick an initial value $w^0$
- Compute the derivative and update:

$\left.\frac{dL}{dw}\right|_{w=w^0}$

$w^1 = w^0 - \eta \left.\frac{dL}{dw}\right|_{w=w^0}$

$\eta$ is called the 'learning rate'.
- Repeat for many iterations
Gradient descent may stop at a local optimum rather than the global optimum.
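The one-parameter update rule can be sketched on a toy loss $L(w) = (w - 3)^2$, whose minimum $w^* = 3$ is known in closed form (the loss and learning rate are assumptions for illustration):

```python
# Gradient descent on L(w) = (w - 3)^2.
def dL_dw(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

eta = 0.1                    # learning rate
w = 0.0                      # (randomly) picked initial value w^0
for _ in range(100):         # many iterations
    w = w - eta * dL_dw(w)   # w^{t+1} = w^t - eta * dL/dw

print(w)  # converges towards 3.0
```

With a suitable learning rate the iterates approach the minimizer; too large an $\eta$ would make them diverge instead.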
How about two parameters?
$w^*, b^* = \arg\min_{w,b} L(w,b)$
- (Randomly) pick initial values $w^0, b^0$
- Compute the partial derivatives and update:

$\left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}, \quad \left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$

$w^1 = w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}$

$b^1 = b^0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$

$\nabla L = \begin{bmatrix} \frac{\partial L}{\partial w} \\ \frac{\partial L}{\partial b} \end{bmatrix}$ is called the gradient.
In linear regression, the loss function $L$ is convex, so there is no local optimum.
Formulas for $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:

$L(w,b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$

$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)(-x_{cp}^n)$

$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)(-1)$
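The two-parameter update with the partial-derivative formulas above can be sketched as follows. The toy data is generated from $y = 2x + 5$ (an assumption for illustration, not the Pokémon data), so gradient descent should recover $w \approx 2$, $b \approx 5$:

```python
# Gradient descent for w and b using the analytical partial derivatives.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 5.0 for x in xs]   # toy data from y = 2x + 5

w, b = 0.0, 0.0                    # initial values w^0, b^0
eta = 0.01                         # learning rate
for _ in range(50000):             # many iterations
    # dL/dw = sum 2*(y_hat - (b + w*x)) * (-x)
    grad_w = sum(2.0 * (y - (b + w * x)) * (-x) for x, y in zip(xs, ys))
    # dL/db = sum 2*(y_hat - (b + w*x)) * (-1)
    grad_b = sum(2.0 * (y - (b + w * x)) * (-1.0) for x, y in zip(xs, ys))
    w -= eta * grad_w
    b -= eta * grad_b

print(round(w, 3), round(b, 3))    # close to 2.0 and 5.0
```

Because this loss is convex, any small enough learning rate converges to the same global optimum.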
Model Selection
Model 1: $y = b + w_1 x$
Model 2: $y = b + w_1 x + w_2 x^2$
Model 3: $y = b + w_1 x + w_2 x^2 + w_3 x^3$
...
A more complex model does not always lead to better performance on testing data. This is overfitting.
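This model-selection effect can be sketched by fitting polynomials of increasing degree to noisy toy data (the data, noise level, and degrees are all assumptions for illustration). Training error can only go down as the model gets more complex, while test error often rises again:

```python
# Fit polynomials of degree 1, 3, 7 to 10 noisy points from y = 2x + 5
# and compare training vs. test mean squared error.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + 5.0 + rng.normal(0.0, 0.3, 10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2.0 * x_test + 5.0 + rng.normal(0.0, 0.3, 10)

train_errs, test_errs = {}, {}
for degree in (1, 3, 7):
    coef = np.polyfit(x_train, y_train, degree)   # least-squares fit
    train_errs[degree] = float(np.mean((np.polyval(coef, x_train) - y_train) ** 2))
    test_errs[degree] = float(np.mean((np.polyval(coef, x_test) - y_test) ** 2))
    print(degree, train_errs[degree], test_errs[degree])
```

Since each polynomial family contains the lower-degree ones, the training error is non-increasing in the degree; only the test error reveals the overfitting.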
Let's collect more data. There are hidden factors influencing the output that the previous model ignores: the species of the Pokémon.
Back to Step 1: Redesign the Model
$x_s$ = the species of $x$

$x$ ——→

if $x_s$ = Pidgey: $y = b_1 + w_1 \cdot x_{cp}$
if $x_s$ = Weedle: $y = b_2 + w_2 \cdot x_{cp}$
if $x_s$ = Caterpie: $y = b_3 + w_3 \cdot x_{cp}$
if $x_s$ = Eevee: $y = b_4 + w_4 \cdot x_{cp}$

——→ $y$
$y = b_1\,\delta(x_s = \text{Pidgey}) + w_1\,\delta(x_s = \text{Pidgey})\,x_{cp} + \dots + b_4\,\delta(x_s = \text{Eevee}) + w_4\,\delta(x_s = \text{Eevee})\,x_{cp}$

$\delta(x_s = \text{Pidgey}) = \begin{cases} 1, & \text{if } x_s = \text{Pidgey} \\ 0, & \text{otherwise} \end{cases}$
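The species-conditional model written with $\delta$ indicators is still one linear model overall, as this sketch shows (the $w_i$, $b_i$ values are made up for illustration):

```python
# delta(x_s = species) is 1 for the matching species and 0 otherwise,
# so exactly one (b_s + w_s * x_cp) term survives in the sum.
def delta(x_s, species):
    return 1.0 if x_s == species else 0.0

SPECIES = ["Pidgey", "Weedle", "Caterpie", "Eevee"]
B = {"Pidgey": 7.0, "Weedle": 10.0, "Caterpie": 12.0, "Eevee": 30.0}
W = {"Pidgey": 2.7, "Weedle": 2.0, "Caterpie": 1.5, "Eevee": 1.1}

def predict(x_s, x_cp):
    # y = sum over species s of delta(x_s = s) * (b_s + w_s * x_cp)
    return sum(delta(x_s, s) * (B[s] + W[s] * x_cp) for s in SPECIES)

print(predict("Eevee", 100.0))   # about 140.0 (= 30 + 1.1 * 100)
```

Because the indicators are themselves features, the whole expression is still linear in the parameters $w_i, b_i$, so gradient descent applies unchanged.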
Are there any other hidden factors?
Back to Step 2: Regularization
$y = b + \sum_i w_i x_i$

$L = \sum_n \left(\hat{y}^n - (b + \sum_i w_i x_i)\right)^2 + \lambda \sum_i (w_i)^2$
training error + regularization
$b$ has nothing to do with the smoothness of the function, so $b$ is not included in the regularization term.

- Functions with smaller $w_i$ are preferred: the smaller the $w_i$, the smoother the function.
- The larger $\lambda$, the less weight the training error carries in $L$.
- The larger $\lambda$, the smoother the resulting function; but it must not be too smooth, or the model underfits.
Why are smooth functions preferred? A smooth function is less sensitive to noise in the input: if some noise corrupts the input $x_i$ at testing time, the output of a smooth function is less affected.
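The shrinking effect of $\lambda$ can be sketched in a simplified one-parameter case with $b$ fixed at 0 (a simplification for illustration; the data is toy data generated from $y = 2x$). Minimizing $\sum_n (\hat{y}^n - w x^n)^2 + \lambda w^2$ over $w$ has a closed form:

```python
# Closed-form minimizer of sum (y - w*x)^2 + lam * w^2 (with b fixed at 0):
# setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + lam).
def ridge_w(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]             # toy data from y = 2x

for lam in (0.0, 10.0, 100.0):
    print(lam, ridge_w(xs, ys, lam))   # w shrinks as lambda grows
```

With $\lambda = 0$ the fit recovers $w = 2$ exactly; increasing $\lambda$ pulls $w$ towards 0, i.e. towards a flatter, smoother function.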
Where do the errors come from?
- bias
- variance
A simpler model is less influenced by the sampled data.
- simple model → small variance, large bias ( underfitting )
- complex model → large variance, small bias ( overfitting )
More complex model families contain the simpler ones as special cases.
For bias, redesign your model:
- add more features as input
- a more complex model
What can we do about large variance?
- more data (collect real data, generate fake data) —— very effective, but not always practical
- regularization