A deeper understanding of estimation in linear models (2): estimation based on the likelihood function

Updated: 2019.10.31

1 Introduction

  In the previous article we discussed the estimation of \(\beta\) and \(\sigma\) from the perspective of the loss function. This article changes flavor and takes a more statistical point of view, starting from the likelihood function to discuss the estimation of \(\beta\) and \(\sigma\). Along the way we will also see that, under different assumptions, the loss function changes accordingly.

2. Assumptions about \(\varepsilon\)

  In the previous article ( estimation based on the loss function ), we mentioned that for a linear model we often use the Gauss-Markov assumptions:

  1. \(E(\varepsilon) = 0\)
  2. \(Cov(\varepsilon) = \sigma^2 I_n\)

  In practice, however, the homoscedasticity assumption is not always satisfied. A fuller classification of the assumptions on \(\varepsilon\) has three cases:

  1. Homoscedastic, with mutually uncorrelated random errors: \(Cov(\varepsilon) = \sigma^2 I_n\)
  2. Heteroscedastic, but with mutually uncorrelated random errors: \(Cov(\varepsilon) = diag(\sigma_1^2, \sigma_2^2, \cdots, \sigma_n^2)\)
  3. Heteroscedastic, with correlated random errors:
    \[Cov(\varepsilon) = \begin{pmatrix} \sigma_{11}^2 & cov(\varepsilon_1, \varepsilon_2) & \cdots & cov(\varepsilon_1, \varepsilon_n)\\ cov(\varepsilon_2, \varepsilon_1) & \sigma_{22}^2 & \cdots & cov(\varepsilon_2, \varepsilon_n)\\ \vdots & \vdots & & \vdots\\ cov(\varepsilon_n, \varepsilon_1) & cov(\varepsilon_n, \varepsilon_2) & \cdots & \sigma_{nn}^2 \end{pmatrix}\]

  In this case we write \(Cov(\varepsilon) = \Sigma\).

3. Estimation based on the likelihood function

  Previously we estimated the parameters from the perspective of the loss function, but virtually every loss function should correspond to a distribution, namely the one whose likelihood function that loss maximizes.
  We know that, given X, the likelihood function is \(L(\theta; Y, X) = P_{\theta}(Y_1 = y_1, Y_2 = y_2, \cdots, Y_n = y_n)\). If \(Y_1, Y_2, \cdots, Y_n\) are independent, then \(L(\theta; Y, X) = \prod_{i=1}^n P(Y_i = y_i)\). In the discrete case this further becomes \(L(\theta; Y, X) = \prod_{i=1}^n p_i(\theta)\); in the continuous case it can be written as \(L(\theta; Y, X) = \prod_{i=1}^n f(y_i; \theta)\).
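
  To make the definition concrete, here is a minimal numpy sketch (the data, values, and function name are my own illustration, not from the original article) that evaluates the logarithm of such a product of normal densities for an independent sample:

```python
import numpy as np

def gaussian_log_likelihood(y, mu, sigma):
    """log prod_i f(y_i) = sum_i log f(y_i) for independent y_i ~ N(mu, sigma^2)."""
    return np.sum(-np.log(np.sqrt(2 * np.pi) * sigma)
                  - (y - mu) ** 2 / (2 * sigma ** 2))

# toy sample of 5 observations around a common mean of 1.0
y = np.array([0.9, 1.1, 1.0, 1.3, 0.8])
print(gaussian_log_likelihood(y, mu=1.0, sigma=0.2))
```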

3.1 Estimation under Assumption 1

  If Assumption 1 is satisfied, \(cov(\varepsilon) = \sigma^2 I_n\), and we add a normality assumption, so that \(\varepsilon_i \sim N(0, \sigma^2)\), then \(y_i = x_i\beta + \varepsilon_i \sim N(x_i\beta, \sigma^2)\), and the likelihood function is:
\begin{equation}
\begin{split}
L(\beta, \sigma^2, Y, X) & = \prod_{i=1}^n f(y_i)\\
& = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} e^{- \frac{(y_i - x_i\beta)^2}{2\sigma^2}}\\
& = (\frac{1}{\sqrt{2\pi}\sigma})^n e^{- \frac{1}{2\sigma^2} \displaystyle \sum_{i=1}^n(y_i - x_i\beta)^2}
\end{split}
\end{equation}

  It can be seen that the likelihood function contains the term \(\sum_{i=1}^n (y_i - x_i\beta)^2\), which is exactly the quadratic loss discussed previously. So under Assumption 1 the quadratic loss we used before is indeed the right form.
  Usually, to simplify the calculation, we take the logarithm of the likelihood function:

\begin{equation}
\begin{split}
lnL(\beta, \sigma^2, Y, X) & = -nln(\sqrt{2\pi}\sigma)- \frac{1}{2 \sigma^2} \sum_{i=1}^n(y_i - x_i\beta)^2
\end{split}
\end{equation}

  Write \(G(\beta, \sigma^2) = nln(\sqrt{2\pi}\sigma) + \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - x_i\beta)^2\); maximizing the likelihood function is then equivalent to finding \(min \hspace{1mm} G(\beta, \sigma^2)\).

  Taking the partial derivative of \(G(\beta, \sigma^2)\) with respect to \(\beta\) and setting it to zero:

\begin{equation}
\begin{split}
\frac {\partial G(\beta, \sigma^2)}{\partial \beta}
&= 0 - \frac{1}{2 \sigma^2}2 \displaystyle \sum_{i=1}^n (y_i - x_i\beta)x_i\\
& = -\frac{1}{\sigma^2} \displaystyle \sum_{i=1}^n (x_iy_i - x_i^2\beta) = 0
\end{split}
\\
=> \displaystyle \sum_{i=1}^n (x_iy_i - x_i^2\beta) = 0 => \displaystyle \sum_{i=1}^n x_iy_i = \displaystyle \sum_{i=1}^n x_i^2\beta\\
=> X^TY = X^TX\beta => \hat \beta = (X^TX)^{-1}X^TY
\end{equation}

  Taking the partial derivative of \(G(\beta, \sigma^2)\) with respect to \(\sigma\) and setting it to zero:

\begin{equation}
\begin{split}
\frac {\partial G(\beta, \sigma^2)}{\partial \sigma}
&= n\frac{1}{\sqrt{2\pi}\sigma}\sqrt{2\pi} - \frac{2}{2\sigma^3}\sum_{i=1}^n(y_i - x_i\beta)^2 \\
& = \frac{n}{\sigma} - \frac{1}{\sigma^3}\sum_{i=1}^n(y_i - x_i\beta)^2 = 0
\end{split}
\\
=> \frac{1}{\sigma^3}\sum_{i=1}^n(y_i - x_i\beta)^2 = \frac{n}{\sigma}
=> \hat \sigma^2 = \frac{\displaystyle \sum_{i=1}^n(y_i - x_i\beta)^2}{n}
\end{equation}

  We can see that the likelihood function yields estimates of both \(\beta\) and \(\sigma\) at the same time, whereas estimation based on the loss function only produced an estimate of \(\beta\); the estimate of \(\sigma\) came from a separate theoretical argument.

  • Tip: here \(x_i\beta\) involves the true \(\beta\), not its estimate, i.e. it represents the true fitted value, so the degrees of freedom differ (this is slightly different from \(\hat \sigma^2 = \frac{SSE}{n-p}\)).
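
  The two closed-form estimates above can be checked with a short numpy sketch (synthetic data under Assumption 1 plus normality; the variable names are illustrative, not from the article). Note how the MLE of \(\sigma^2\) divides by \(n\), while the usual unbiased version divides by \(n - p\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# beta_hat = (X^T X)^{-1} X^T Y, solved without forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
sigma2_mle = residuals @ residuals / n              # MLE: divide by n
sigma2_unbiased = residuals @ residuals / (n - p)   # usual SSE / (n - p)

print(beta_hat, sigma2_mle, sigma2_unbiased)
```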

3.2 Estimation under Assumption 2

  If Assumption 2 holds, \(Cov(\varepsilon) = diag(\sigma_1^2, \sigma_2^2, \cdots, \sigma_n^2)\), and we add a normality assumption, so that \(\varepsilon_i \sim N(0, \sigma^2_{ii})\), then \(y_i = x_i\beta + \varepsilon_i \sim N(x_i\beta, \sigma^2_{ii})\), and the likelihood function is:

\begin{equation}
\begin{split}
L(\beta, \sigma^2, Y, X) & = \prod_{i=1}^n f(y_i)\\
& = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma_{ii}} e^{- \frac{(y_i - x_i\beta)^2}{2\sigma^2_{ii}}}\\
& = (\frac{1}{\sqrt{2\pi}})^n \prod_{i=1}^n(\frac{1}{\sigma_{ii}}) e^{- \frac{1}{2} \displaystyle \sum_{i=1}^n(\frac {y_i - x_i\beta}{\sigma_{ii}})^2}
\end{split}
\end{equation}

  We can see that under Assumption 2 the core of the likelihood function has changed; it is no longer \(\sum_{i=1}^n(y_i - x_i\beta)^2\). By the earlier reasoning, the loss function used under Assumption 2 should change as well: it should be the standardized quadratic loss \(\displaystyle \sum_{i=1}^n (\frac{y_i - x_i\beta}{\sigma_{ii}})^2\), which is also known as weighted least squares.
  The log-likelihood function is:
\begin{equation}
\begin{split}
lnL(\beta, \sigma^2, Y, X) & = -nln(\sqrt{2\pi}) - \sum_{i=1}^n ln\sigma_{ii} - \frac{1}{2} \displaystyle \sum_{i=1}^n(\frac{y_i - x_i\beta}{\sigma_{ii}})^2
\end{split}
\end{equation}

  Write \(G(\beta, \sigma_{ii}^2) = nln(\sqrt{2\pi}) + \sum_{i=1}^n ln\sigma_{ii} + \frac{1}{2} \displaystyle \sum_{i=1}^n(\frac{y_i - x_i\beta}{\sigma_{ii}})^2\); maximizing the likelihood function is then equivalent to finding \(min \hspace{1mm} G(\beta, \sigma_{ii}^2)\).
  Taking the partial derivative of \(G(\beta, \sigma_{ii}^2)\) with respect to \(\beta\) and setting it to zero:

\begin{equation}
\begin{split}
\frac {\partial G(\beta, \sigma_{ii}^2)}{\partial \beta}
&= 0 + 0 - \frac{1}{2}2 \displaystyle \sum_{i=1}^n (\frac {y_i - x_i\beta}{\sigma_{ii}})\frac{x_i}{\sigma_{ii}}\\
& = - \displaystyle \sum_{i=1}^n (\frac {x_iy_i - x_i^2\beta}{\sigma_{ii}^2}) = 0
\end{split}
\\
=> \displaystyle \sum_{i=1}^n (\frac {x_iy_i}{\sigma_{ii}^2}) = \displaystyle \sum_{i=1}^n (\frac {x_i^2\beta}{\sigma_{ii}^2}) \\
=> X_c^TY_c = X_c^TX_c\beta => \hat \beta = (X_c^TX_c)^{-1}X_c^TY_c
\end{equation}

  Write \(X_c = (\frac{x_1}{\sigma_{11}}, \frac{x_2}{\sigma_{22}}, \cdots, \frac{x_n}{\sigma_{nn}})^T, Y_c = (\frac{y_1}{\sigma_{11}}, \frac{y_2}{\sigma_{22}}, \cdots, \frac{y_n}{\sigma_{nn}})^T\).
  Taking the partial derivative of \(G(\beta, \sigma_{ii}^2)\) with respect to \(\sigma_{ii}\), using \(\sigma_{11}\) as an example:

\begin{equation}
\begin{split}
\frac {\partial G(\beta, \sigma_{ii}^2)}{\partial \sigma_{11}}
&= 0 + \frac{1}{\sigma_{11}} - \frac{1}{2}2\frac{(y_1 - x_1\beta)^2}{\sigma_{11}^3} \\
& = \frac{1}{\sigma_{11}} - \frac{(y_1 - x_1\beta)^2}{\sigma_{11}^3} = 0
\end{split}
\\
=> \frac{1}{\sigma_{11}} = \frac{(y_1 - x_1\beta)^2}{\sigma_{11}^3}
=> \hat \sigma_{11}^2 = (y_1 - x_1\beta)^2
\end{equation}

  Similarly, \(\hat \sigma_{ii}^2 = (y_i - x_i\beta)^2\).
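
  A minimal sketch of this weighted least squares estimate, assuming the \(\sigma_{ii}\) are known (in practice they usually have to be estimated or modeled); the synthetic data here are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0])
sigma_ii = rng.uniform(0.1, 1.0, size=n)        # known per-observation standard deviations
y = X @ beta_true + rng.normal(scale=sigma_ii)  # heteroscedastic, uncorrelated errors

# X_c, Y_c: every row divided by its own sigma_ii
Xc = X / sigma_ii[:, None]
yc = y / sigma_ii

# beta_hat = (X_c^T X_c)^{-1} X_c^T Y_c
beta_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
print(beta_hat)
```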

3.3 Estimation under Assumption 3

  If Assumption 3 holds, \(Cov(\varepsilon) = \Sigma\), and we add a normality assumption, so that \(\varepsilon\) follows a multivariate normal distribution, \(\varepsilon \sim N_n(0, \Sigma)\), then \(Y = X\beta + \varepsilon \sim N_n(X\beta, \Sigma)\), and the likelihood function is:

\begin{equation}
\begin{split}
L(\beta, \Sigma, Y, X) & = P(Y_1 = y_1, Y_2 = y_2, \cdots, Y_n = y_n) = P(Y=y)\\
& = \frac{1}{(\sqrt{2\pi})^n|\Sigma|^{\frac{1}{2}}}e ^{- \frac{1}{2}(Y - X\beta)^T \Sigma^{-1} (Y - X\beta)}
\end{split}
\end{equation}

  where \(|\Sigma|\) is the determinant of \(\Sigma\).
  We can see that under Assumption 3 the core of the likelihood function changes once more. Based on this assumption, the loss function to use is \((Y - X\beta)^T \Sigma^{-1}(Y - X\beta)\). The log-likelihood function is:
\[lnL(\beta, \Sigma, Y, X) = -nln(\sqrt{2\pi}) - \frac{1}{2}ln|\Sigma| - \frac{1}{2}(Y - X\beta)^T \Sigma^{-1}(Y - X\beta)\]
  Write \(G(\beta, \Sigma) = nln(\sqrt{2\pi}) + \frac{1}{2}ln|\Sigma| + \frac{1}{2}(Y - X\beta)^T \Sigma^{-1}(Y - X\beta)\); maximizing the likelihood function is then equivalent to finding \(min \hspace{1mm} G(\beta, \Sigma)\).
  Taking the partial derivative of \(G(\beta, \Sigma)\) with respect to \(\beta\) and setting it to zero:

\begin{equation}
\begin{split}
\frac {\partial G(\beta, \Sigma)}{\partial \beta}
&= 0 + 0 - \frac{1}{2}2 X^T \Sigma^{-1} (Y - X\beta)\\
& = X^T \Sigma^{-1}(X\beta - Y) = 0
\end{split}
\\
=> X^T \Sigma^{-1}X\beta = X^T \Sigma^{-1}Y \\
=> \hat \beta = (X^T \Sigma^{-1} X)^{-1}X^T \Sigma^{-1} Y
\end{equation}

  Taking the partial derivative of \(G(\beta, \Sigma)\) with respect to \(\Sigma\), using matrix differentials:

\begin{equation}
\begin{split}
\mathrm{d}G &= \frac{1}{2}|\Sigma|^{-1}\mathrm{d}|\Sigma| - \frac{1}{2}(Y - X\beta)^T \Sigma^{-1}\mathrm{d}\Sigma \Sigma^{-1}(Y - X\beta)\\
& = \frac{1}{2}tr(\Sigma^{-1}\mathrm{d}\Sigma) - tr(\frac{1}{2}(Y - X\beta)^T \Sigma^{-1}\mathrm{d}\Sigma \Sigma^{-1}(Y - X\beta))\\
& = \frac{1}{2}tr(\Sigma^{-1}\mathrm{d}\Sigma) - tr(\frac{1}{2}\Sigma^{-1}(Y - X\beta)(Y - X\beta)^T \Sigma^{-1}\mathrm{d}\Sigma)\\
& = tr(\frac{1}{2}(\Sigma^{-1} - \Sigma^{-1}(Y - X\beta)(Y - X\beta)^T \Sigma^{-1})\mathrm{d}\Sigma)
\end{split}
\\
=> \frac{\partial G}{\partial \Sigma} = \frac{1}{2}(\Sigma^{-1} - \Sigma^{-1}(Y - X\beta)(Y - X\beta)^T \Sigma^{-1})^T = 0 \\
=> \Sigma^{-1}(Y - X\beta)(Y - X\beta)^T \Sigma^{-1} = \Sigma^{-1} \\
=> \hat \Sigma = (Y - X\beta)(Y - X\beta)^T
\end{equation}
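
  A sketch of this generalized least squares estimate of \(\beta\), assuming \(\Sigma\) is known and positive definite; the AR(1)-style covariance below is just an illustrative choice, not something from the article:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -0.5])

# example covariance with correlated errors: Sigma_ij = 0.7^|i-j|
idx = np.arange(n)
Sigma = 0.7 ** np.abs(idx[:, None] - idx[None, :])
y = X @ beta_true + rng.multivariate_normal(np.zeros(n), Sigma)

# beta_hat = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} Y
Sigma_inv = np.linalg.inv(Sigma)
beta_hat = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
print(beta_hat)
```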

4. Optimality of the estimates

  In the article on estimation based on the loss function we discussed the optimality of the estimates. Now that the assumptions and the loss function have changed, do the estimates still enjoy good properties?
  For Assumption 3,
\begin{equation}
\begin{split}
L_3(\beta) & = (Y - X\beta)^T \Sigma^{-1}(Y - X\beta)\\
& = (Y - X\beta)^T \Sigma^{-\frac{1}{2}} \Sigma^{-\frac{1}{2}}(Y - X\beta)\\
& = (\Sigma^{-\frac{1}{2}}Y - \Sigma^{-\frac{1}{2}}X\beta)^T(\Sigma^{-\frac{1}{2}}Y - \Sigma^{-\frac{1}{2}}X\beta)\\
& = (Y^* - X^*\beta)^T(Y^* - X^*\beta)
\end{split}
\end{equation}

  where \(\Sigma^{-\frac{1}{2}}Y - \Sigma^{-\frac{1}{2}}X\beta\) is written as \(Y^* - X^*\beta\). Since the estimate obtained from \(L_1(\beta) = (Y - X\beta)^T(Y - X\beta)\) has good properties, the estimate obtained from \(L_3(\beta) = (Y^* - X^*\beta)^T(Y^* - X^*\beta)\) should also have good properties.
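
  A quick numerical check of this argument (a sketch on synthetic data of my own choosing): whitening with \(\Sigma^{-\frac{1}{2}}\) and running ordinary least squares on \(Y^*, X^*\) reproduces the GLS estimate from Section 3.3.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = rng.normal(size=(n, 2))
idx = np.arange(n)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
y = X @ np.array([1.0, 2.0]) + rng.multivariate_normal(np.zeros(n), Sigma)

# Sigma^{-1/2} via the eigendecomposition of the symmetric positive definite Sigma
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

# ordinary least squares on the whitened data Y*, X*
X_star, y_star = Sigma_inv_half @ X, Sigma_inv_half @ y
beta_whitened = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

# generalized least squares formula from Section 3.3
Sigma_inv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

print(np.allclose(beta_whitened, beta_gls))  # expected: True
```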

5. When each assumption applies

  Why do we usually assume that a linear model satisfies Assumption 1? In fact, under Assumption 2 there are n + p parameters to estimate (n different \(\sigma_{ii}\) plus p coefficients \(\beta_i\)), yet we only have n samples, so we run out of degrees of freedom; and under Assumption 3 there are even more parameters to estimate (\(\frac{n^2 + n}{2} + p\) of them; for example, with n = 100 and p = 3 that is 5053 parameters from only 100 observations). In this situation the estimation is basically impossible to carry out, and even if it could be done, the estimates would not necessarily be unique.

  Faced with this situation, we usually have to increase the sample size, for instance by measuring each individual m times to obtain mn data points; of course, the model then becomes a mixed model. Thus Assumptions 2 and 3 are better suited to longitudinal data (panel data in economics, repeated measures in psychology, multilevel data in sociology).
