PRML Notes 1 - Polynomial Regression and Probability Theory from the Introduction

Introduction

  The introduction opens by pointing out that "the problem of finding patterns in data is a fundamental one". The author gives handwritten digit recognition as an example: handwriting varies enormously from writer to writer, so how can we recognize the digits correctly?
  One might suggest writing a program with hand-crafted rules. However, there will always be handwriting that the rules do not cover, and every time a new unrecognized style appears another rule must be added, leading to a proliferation of rules and exceptions. For the two versions of the digit 2 below, for example, it is hard to set such rules by hand.
[Figure: two handwritten examples of the digit 2]
  Put another way, we want a mapping that takes an input image to the corresponding digit. A mapping built entirely by hand could become enormous, so we look for a less labor-intensive and more effective approach, namely machine learning.
  In the machine learning approach, a set of $N$ digit images $\{x_1,\cdots,x_N\}$, called the training set, is used to tune the parameters of the mapping. The digit corresponding to each image is known in advance, and its category label is represented by a target vector $t$. The machine learning method can be viewed as a function $f(x)$ that takes an image $x$ as input and produces a vector $y$ as output, where the output vector $y$ is encoded in the same way as the target vector $t$.

  1. Training phase: determine the exact form of the function $f(x)$ using the digit images and their known labels.
  2. Test phase: use $f(x)$ to predict labels for new images (the test set).
  3. Generalization problem: predictive performance on data outside the training set.
      Additionally, the original images may need to be transformed into a new variable space (preprocessing), for example converting each image into a vector; if the training images are preprocessed, the test images must be preprocessed in the same way.
    Applications in which the training data consist of input vectors together with corresponding target vectors are called supervised learning problems. When the training data consist of a set of input vectors $x$ without any corresponding target values, the problem is called unsupervised learning. Reinforcement learning is concerned with finding suitable actions to take in a given situation so as to maximize a reward.
      The introduction then presents the three most important tools used throughout the book: probability theory, decision theory, and information theory.

Polynomial Curve Fitting

  The training data are generated from the function $\sin(2\pi x)$: inputs are sampled uniformly on $[0,1]$, and Gaussian noise is added to the corresponding function values. The training set consists of $N$ observations of $x$, written $\boldsymbol{x}\equiv(x_1,\cdots,x_N)^T$, together with the observed target values $\boldsymbol{t}\equiv(t_1,\cdots,t_N)^T$.
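As a concrete sketch of this data-generation step (NumPy assumed; the sample size $N$, noise level, and seed below are arbitrary choices for illustration, not values from the book):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
N = 10                           # number of training points (assumed)
noise_std = 0.3                  # std. dev. of the additive Gaussian noise (assumed)

x = rng.uniform(0.0, 1.0, size=N)                                # inputs sampled uniformly on [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=N)   # noisy observations of sin(2*pi*x)
```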
  Now suppose we have only the training inputs $\boldsymbol{x}$ and the observations $\boldsymbol{t}$: how do we predict the target value $\hat{t}$ for a new input $\hat{x}$? Because of the noise in the sampling process, $\hat{t}$ is uncertain for a given $\hat{x}$. The author fits the data with a polynomial function
$$y(x,\boldsymbol{w})=w_0+w_1x+w_2x^2+\cdots+w_Mx^M=\sum_{j=0}^{M}w_jx^j$$
where $M$ is the order of the polynomial and $x^j$ denotes $x$ raised to the power $j$. The polynomial coefficients $w_0,\cdots,w_M$ are collected into a vector $\boldsymbol{w}$. Note: although the polynomial function $y(x,\boldsymbol{w})$ is a nonlinear function of $x$, it is a linear function of the coefficients $\boldsymbol{w}$.
  To find a suitable value of $\boldsymbol{w}$ we minimize an error function, for example the sum of squared differences between the prediction $y(x_n,\boldsymbol{w})$ at each data point $x_n$ and the corresponding target value $t_n$:
$$E(\boldsymbol{w})=\frac{1}{2}\sum_{n=1}^N\{y(x_n,\boldsymbol{w})-t_n\}^2$$
This quantity is non-negative, and if every point were predicted exactly the error function would be $E(\boldsymbol{w})=0$.
The error function is a quadratic function of $\boldsymbol{w}$, so setting its derivative with respect to $\boldsymbol{w}$ to zero yields a unique solution $\boldsymbol{w}^*$, the minimizer of the error function.
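Because $E(\boldsymbol{w})$ is quadratic in $\boldsymbol{w}$, the minimizer can be computed in closed form. A minimal sketch, reusing the `x` and `t` generated above and solving the least-squares problem with NumPy (the helper names are my own, not from the book):

```python
import numpy as np

def design_matrix(x, M):
    """Design matrix whose columns are x**0, x**1, ..., x**M."""
    return np.vander(np.atleast_1d(x), M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Return w* minimizing the sum-of-squares error E(w)."""
    w_star, *_ = np.linalg.lstsq(design_matrix(x, M), t, rcond=None)
    return w_star

def predict(x_new, w):
    """Evaluate y(x, w) for scalar or array inputs."""
    return design_matrix(x_new, len(w) - 1) @ w

w_star = fit_polynomial(x, t, M=3)
```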
  The remaining problem is how to choose the polynomial order $M$, which determines the length of the vector $\boldsymbol{w}$.

  1. $M=0$: a horizontal line.
  2. $M=1$: a straight line.
  3. $M=3$: the fitted curve is reasonably close to the underlying function $\sin(2\pi x)$.
  4. $M=9$: the fitted curve oscillates wildly; it passes exactly through all the points, the error function is 0, and the model overfits.
      One might expect that the larger $M$ is, the better the fit, so why does the curve oscillate? Intuitively, the author explains, a polynomial with a large $M$ is too flexible and tunes itself to the random noise on the target values. (My own take: the noise gets fitted along with the signal, hence the large oscillations.) However, for a given model complexity, overfitting becomes less severe as the size of the training set increases. The unsatisfying aspect of this observation is that we would then have to limit the number of parameters according to the size of the available training set, whereas it is more reasonable to choose the complexity of the model according to the complexity of the problem being solved. A method mentioned later, the Bayesian approach, adapts the effective number of parameters automatically to the size of the data set.
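To make the overfitting effect concrete, one can compare the training error for several orders, e.g. via the root-mean-square error $E_{RMS}=\sqrt{2E(\boldsymbol{w}^*)/N}$ used in PRML. A sketch reusing the helpers defined above:

```python
def rms_error(x, t, w):
    """Root-mean-square error E_RMS = sqrt(2 E(w) / N)."""
    return np.sqrt(np.mean((predict(x, w) - t) ** 2))

for M in (0, 1, 3, 9):
    w_ml = fit_polynomial(x, t, M)
    print(f"M = {M}: training E_RMS = {rms_error(x, t, w_ml):.4f}")
# The training error shrinks as M grows and is numerically close to zero at M = 9 for N = 10 points,
# even though the M = 9 curve oscillates wildly between the data points.
```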

Probability Theory

  Sum rule: $p(X)=\sum_Y p(X,Y)$
  Product rule: $p(X,Y)=p(Y|X)p(X)$
According to the product rule and the symmetry $p(X,Y)=p(Y,X)$, the relationship between the two conditional probabilities follows:
$$p(X,Y)=p(Y,X)\\ p(Y|X)p(X)=p(X|Y)p(Y)\\ p(Y|X)=\frac{p(X|Y)p(Y)}{p(X)}$$
This is Bayes' theorem. Using the sum rule, the denominator can be written out, giving:
$$p(Y|X)=\frac{p(X|Y)p(Y)}{\sum_Y p(X,Y)}$$
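A tiny numeric illustration of the sum rule, the product rule, and Bayes' theorem for two discrete variables; the joint table is made up purely for this example:

```python
import numpy as np

# Made-up joint distribution p(X, Y): X in {0, 1} (rows), Y in {0, 1, 2} (columns); entries sum to 1.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

p_x = p_xy.sum(axis=1)              # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)              # sum rule: p(Y) = sum_X p(X, Y)
p_y_given_x = p_xy / p_x[:, None]   # product rule rearranged: p(Y|X) = p(X, Y) / p(X)
p_x_given_y = p_xy / p_y[None, :]   # p(X|Y) = p(X, Y) / p(Y)

# Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X) reproduces the conditional computed directly.
assert np.allclose(p_x_given_y * p_y[None, :] / p_x[:, None], p_y_given_x)
```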
  For the parameter $\boldsymbol{w}$ in the polynomial curve fitting example, before observing the data we have some assumptions about $\boldsymbol{w}$, expressed as a prior probability $p(\boldsymbol{w})$. The effect of the observed data $D=\{t_1,\cdots,t_N\}$ is expressed through the conditional probability $p(D|\boldsymbol{w})$. Bayes' theorem applied to the curve fitting problem reads
$$p(\boldsymbol{w}|D)=\frac{p(D|\boldsymbol{w})p(\boldsymbol{w})}{p(D)}$$
It lets us evaluate, through the posterior probability $p(\boldsymbol{w}|D)$, the uncertainty in $\boldsymbol{w}$ after $D$ has been observed. The quantity $p(D|\boldsymbol{w})$ on the right-hand side, called the likelihood function, is a very interesting quantity: it expresses how probable the observed data are for a given setting of $\boldsymbol{w}$, and its integral with respect to $\boldsymbol{w}$ need not equal one. The posterior, likelihood, and prior are all viewed as functions of $\boldsymbol{w}$, and the normalizing constant is $p(D)=\int p(D|\boldsymbol{w})p(\boldsymbol{w})\,d\boldsymbol{w}$. Frequentists and Bayesians take different views of $\boldsymbol{w}$: the frequentist regards $\boldsymbol{w}$ as a fixed parameter, typically estimated by maximum likelihood (with error bars obtained by methods such as the bootstrap), while the Bayesian expresses the uncertainty in $\boldsymbol{w}$ through a probability distribution over $\boldsymbol{w}$.
  Next, maximum likelihood estimation for the Gaussian distribution is considered. For a single real-valued variable $x$, the Gaussian distribution is defined as
$$\mathcal{N}(x|\mu,\sigma^2)=\frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}$$
Intuitively, this is the distribution of $x$ given the mean $\mu$ and the variance $\sigma^2$; here the mean and variance are unknown and must be estimated.
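The density formula translates directly into code; a one-function sketch (NumPy assumed):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) as defined above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(gaussian_pdf(0.0, 0.0, 1.0))   # 1/sqrt(2*pi) ≈ 0.3989
```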
  Now suppose we have a data set of observations $\boldsymbol{t}=(t_1,\cdots,t_N)^T$ drawn independently from a Gaussian distribution. Given the mean and variance, the joint probability of the observations in $\boldsymbol{t}$ is
$$p(\boldsymbol{t}|\mu,\sigma^2)=\prod_{n=1}^{N}\mathcal{N}(t_n|\mu,\sigma^2)$$
Maximizing this likelihood gives the maximum likelihood solution for the mean,
$$\mu_{ML}=\frac{1}{N}\sum_{n=1}^{N}t_n$$
and the maximum likelihood solution for the variance,
$$\sigma^2_{ML}=\frac{1}{N}\sum_{n=1}^{N}(t_n-\mu_{ML})^2$$
It can be shown that $\mathbb{E}[\sigma^2_{ML}]=\left(\frac{N-1}{N}\right)\sigma^2$, so maximum likelihood systematically underestimates the variance (the effect matters less for large $N$). In practical problems involving models with many parameters, this bias of maximum likelihood becomes much more serious, and it lies at the root of the overfitting problem encountered in the polynomial curve fitting example above.
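A short simulation (NumPy assumed; the true mean, true variance, and $N$ below are arbitrary) of both estimators, together with an empirical check of the $(N-1)/N$ bias factor:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma2_true, N = 0.5, 2.0, 5    # arbitrary choices for the illustration

def ml_estimates(t):
    mu_ml = t.mean()                         # (1/N) * sum_n t_n
    sigma2_ml = np.mean((t - mu_ml) ** 2)    # (1/N) * sum_n (t_n - mu_ML)^2
    return mu_ml, sigma2_ml

# Averaging sigma2_ML over many independent data sets approaches ((N-1)/N) * sigma2_true,
# i.e. maximum likelihood systematically underestimates the variance.
sigma2_ml_samples = [
    ml_estimates(rng.normal(mu_true, np.sqrt(sigma2_true), size=N))[1]
    for _ in range(100_000)
]
print(np.mean(sigma2_ml_samples), (N - 1) / N * sigma2_true)   # both close to 1.6
```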
  Returning to the curve fitting problem, we have $N$ inputs $\boldsymbol{x}=(x_1,\cdots,x_N)^T$ and their corresponding target values $\boldsymbol{t}=(t_1,\cdots,t_N)^T$. Assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution whose mean is $y(x,\boldsymbol{w})$:
$$p(t|x,\boldsymbol{w},\beta)=\mathcal{N}(t|y(x,\boldsymbol{w}),\beta^{-1})$$
where $\beta$ is the precision, the reciprocal of the variance of the distribution, so $\beta^{-1}$ is the variance. The unknown parameters $\boldsymbol{w}$ and $\beta$ are determined by maximum likelihood; the likelihood function is
$$p(\boldsymbol{t}|\boldsymbol{x},\boldsymbol{w},\beta)=\prod_{n=1}^{N}\mathcal{N}(t_n|y(x_n,\boldsymbol{w}),\beta^{-1})=\prod_{n=1}^{N}\left(\frac{\beta}{2\pi}\right)^{\frac{1}{2}}\exp\left\{-\frac{\beta}{2}\bigl(t_n-y(x_n,\boldsymbol{w})\bigr)^2\right\}$$
(again, $\beta^{-1}$ is the variance). Taking the logarithm of the likelihood function:
$$\ln p(\boldsymbol{t}|\boldsymbol{x},\boldsymbol{w},\beta)=-\sum_{n=1}^{N}\frac{(t_n-y(x_n,\boldsymbol{w}))^2}{2\beta^{-1}}-\sum_{n=1}^{N}\ln(2\pi)^{\frac{1}{2}}-\sum_{n=1}^{N}\ln\beta^{-\frac{1}{2}}=-\frac{\beta}{2}\sum_{n=1}^{N}(t_n-y(x_n,\boldsymbol{w}))^2-\frac{N}{2}\ln(2\pi)+\frac{N}{2}\ln\beta$$
Since $\{y(x_n,\boldsymbol{w})-t_n\}^2=\{t_n-y(x_n,\boldsymbol{w})\}^2$, the log-likelihood function is
$$\ln p(\boldsymbol{t}|\boldsymbol{x},\boldsymbol{w},\beta)=-\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n,\boldsymbol{w})-t_n\}^2+\frac{N}{2}\ln\beta-\frac{N}{2}\ln(2\pi)$$
To find the maximum likelihood solution $\boldsymbol{w}_{ML}$ we differentiate with respect to $\boldsymbol{w}$. The last two terms do not depend on $\boldsymbol{w}$, so only $-\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n,\boldsymbol{w})-t_n\}^2$ needs to be maximized; since $\beta$ only rescales this term, this is equivalent to minimizing $\frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\boldsymbol{w})-t_n\}^2$. Interestingly, we are back at the original minimization of the sum-of-squares error function: under the assumption of Gaussian noise, the sum-of-squares error function arises naturally from maximizing the likelihood. At this point we have obtained the parameters $\boldsymbol{w}_{ML}$.
  The maximum likelihood method also gives the variance $\beta_{ML}^{-1}$:
$$\beta_{ML}^{-1}=\frac{1}{N}\sum_{n=1}^{N}\{y(x_n,\boldsymbol{w}_{ML})-t_n\}^2$$
As with the Gaussian distribution above, $\boldsymbol{w}_{ML}$ must be determined first and $\beta_{ML}^{-1}$ afterwards. We can now make predictions for a new value of $x$; the prediction is expressed as a probability distribution over $t$, the predictive distribution:
$$p(t|x,\boldsymbol{w}_{ML},\beta_{ML})=\mathcal{N}(t|y(x,\boldsymbol{w}_{ML}),\beta_{ML}^{-1})$$
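Continuing the curve-fitting sketches above (same `x`, `t`, `fit_polynomial`, and `predict`; the order $M=3$ and the query point are arbitrary), the noise variance $\beta_{ML}^{-1}$ and the predictive distribution for a new input could be computed as:

```python
M = 3
w_ml = fit_polynomial(x, t, M)                 # same as the least-squares solution w*
residuals = predict(x, w_ml) - t
beta_ml_inv = np.mean(residuals ** 2)          # (1/N) * sum_n {y(x_n, w_ML) - t_n}^2

x_new = 0.25                                   # arbitrary new input
mean_new = predict(x_new, w_ml)[0]             # predictive mean y(x_new, w_ML)
print(f"p(t | x={x_new}) = N(t | {mean_new:.3f}, variance {beta_ml_inv:.3f})")
```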

Original post: blog.csdn.net/zhuzheqing/article/details/128915531