All of Statistics Chapter 5

Contents of this chapter

  • 5.1 Introduction
  • 5.2 Types of convergence
  • 5.3 The Law of Large Numbers
  • 5.4 The Central Limit Theorem
  • 5.5 Delta method

Since translation may blur the meaning of some terms, the key terms of this chapter are listed here:

1. The Law of Large Numbers

2. The Central Limit Theorem

3. Large sample theory

4. Limit theory

5. Asymptotic theory

6. Slutzky's theorem

7. The Weak Law of Large Numbers (WLLN)

8. Multivariate central limit theorem

5.1 Introduction

One of the most interesting aspects of probability theory is the behavior of sequences of random variables. This part of probability theory is called large sample theory, limit theory, or asymptotic theory. The most basic question is: what is the limiting behavior of a sequence of random variables X1, X2, ...? Since statistics and data mining are all about collecting data, it is natural to ask what happens as we gather more and more data.

In calculus, we say that a sequence of real numbers x_n converges to a limit x if, for every \varepsilon > 0, there is a number N such that |x_n - x| < \varepsilon whenever n > N. In probability theory, convergence is a little more subtle. Going back to calculus for a moment: if x_n = x for all n, then obviously \lim_{n\rightarrow \infty}x_n = x. Now consider the probabilistic analogue of this example. Suppose X1, X2, ... is a sequence of independent random variables, each with a N(0,1) distribution. Since these random variables all have the same distribution, we are tempted to say that Xn "converges" to X \sim N(0,1). But this is not quite right, because for all n, \mathbb{P}(X_n = X) = 0 (the probability that two distinct continuous random variables are exactly equal is 0).

Here is another example. Consider X1, X2, ... where X_n \sim N(0,1/n). Intuitively, as n becomes large, Xn concentrates near 0, so we would like to say that Xn converges to 0. But \mathbb{P}(X_n = 0) = 0 for all n. Clearly we need a tool for discussing this kind of convergence in a rigorous way; this chapter develops the appropriate methods.

There are two main perspectives in this chapter, informally stated as follows:

  1. The law of large numbers states that the sample mean \bar{X}_n = n^{-1}\Sigma X_i converges in probability to the expectation \mu = \mathbb{E}(X_i), which means that \bar{X}_n is close to μ with high probability
  2. The central limit theorem states that \sqrt{n}(\bar{X}_n-\mu) converges in distribution to a normal distribution. This means that when n is large enough, the sample mean is approximately normally distributed

5.2 Types of convergence

Two main types of convergence are defined as follows:

5.1 Definition

Let X1, X2, ... be a sequence of random variables, and let X be another random variable. Let F_n be the CDF of Xn, and let F be the CDF of X.

  1. Xn converges to X in probability, denoted X_n \overset{P}{\to} X, if for any \varepsilon > 0, \mathbb{P}(|X_n-X|>\varepsilon) \to 0 as n \to \infty.
  2. Xn converges to X in distribution, denoted X_n \rightsquigarrow X, if \underset{n\to\infty}{\lim}F_n(t) =F(t) at every t at which F is continuous.

When the limiting random variable has a point mass distribution, we change the notation slightly. If \mathbb{P}(X=c) = 1 and X_n \overset P \to X, then we write X_n \overset P \to c. Similarly, for convergence in distribution we write X_n \rightsquigarrow c.

There is one more type of convergence, introduced mainly because it is very useful for proving convergence in probability.

5.2 Definition

If \mathbb{E}(X_n-X)^2 \to 0 as n \to \infty, then Xn is said to converge to X in quadratic mean (mean square), denoted X_n \overset{qm} \to X.

Similarly, if X has a point mass distribution at c, we write X_n \overset {qm} \to c.

5.3 Example

Suppose X_n \sim N(0,1/n). Intuitively, Xn becomes concentrated near 0, so we would like to say that Xn converges to 0. Let us check whether this is correct. Let F be the distribution function of a point mass at 0. Note that \sqrt{n}X_n\sim N(0,1), and let Z denote a standard normal random variable. For t<0, F_n(t) = \mathbb{P}(X_n<t) = \mathbb{P}(\sqrt nX_n < \sqrt n t) = \mathbb{P}(Z < \sqrt n t) \to 0, because \sqrt n t \to - \infty. For t>0, F_n(t)=\mathbb{P}(X_n < t)= \mathbb{P}(\sqrt n X_n < \sqrt n t) = \mathbb{P}(Z < \sqrt n t) \to 1, because \sqrt n t \to \infty.

Therefore, for all t \neq 0, F_n(t) \to F(t). So Xn converges to 0 in distribution.

Note that F_n(0)=1/2 \neq F(0)=1, so convergence fails at t=0. This does not matter: t = 0 is not a continuity point of F, and the definition of convergence in distribution only requires convergence at continuity points.

Now consider convergence in probability. For any \varepsilon > 0, Markov's inequality gives, as n \to \infty,

\mathbb{P}(|X_n|>\varepsilon) =\mathbb{P}(|X_n|^2>\varepsilon^2) \leq \frac{\mathbb{E}(X_n^2)}{\varepsilon^2}=\frac{1/n}{\varepsilon^2}\to 0

Therefore Xn converges to 0 in probability: X_n \overset{P} \to 0.
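The following is a minimal simulation sketch of this example (my own illustration, not from the text); the choice of \varepsilon = 0.1 and of the sample sizes is arbitrary. It estimates \mathbb{P}(|X_n| > \varepsilon) by Monte Carlo and shows the probability shrinking toward 0 as n grows, consistent with the Markov bound 1/(n\varepsilon^2).

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
for n in [10, 100, 1000, 10000]:
    draws = rng.normal(loc=0.0, scale=np.sqrt(1.0 / n), size=100_000)  # X_n ~ N(0, 1/n)
    print(n, np.mean(np.abs(draws) > eps))  # Monte Carlo estimate of P(|X_n| > eps)
```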

The following theorem gives the relationships between these types of convergence.

5.4 Theorem

The following relationships hold:

  1. X_n \overset{qm} \to X implies X_n \overset{P} \to X
  2. X_n \overset{P} \to X implies that Xn converges to X in distribution
  3. If Xn converges to X in distribution and \mathbb{P}(X=c)=1, then Xn converges to X in probability: X_n \overset{P} \to X

In general the reverse implications do not hold, except as stated in the third point.

Proof. We begin with the first point. Suppose X_n \overset{qm} \to X and fix \varepsilon > 0. By Markov's inequality,

\mathbb{P}(|X_n-X| > \varepsilon) = \mathbb{P}(|X_n-X|^2>\varepsilon^2) \leq \frac{\mathbb{E}(|X_n-X|^2)}{\varepsilon^2} \to 0

Next, the second point. This argument is a little more involved and may be skipped on a first reading. Fix \varepsilon > 0 and let x be a continuity point of F. Then

F_n(x) = \mathbb{P}(X_n \leq x)\\\\ =\mathbb{P}(X_n\leq x,X \leq x + \varepsilon)+\mathbb{P}(X_n \leq x,X > x+\varepsilon) \\\\ \leq \mathbb{P}(X \leq x+\varepsilon) + \mathbb{P}(|X_n - X| > \varepsilon)\\\\ =F(x+\varepsilon)+\mathbb{P}(|X_n-X| > \varepsilon)

Also,

F(x-\varepsilon) =\mathbb{P}(X \leq x -\varepsilon)\\\\ =\mathbb{P}(X \leq x -\varepsilon,X_n \leq x )+\mathbb{P}(X \leq x -\varepsilon,X_n > x)\\\\ \leq F_n(x)+\mathbb{P}(|X_n-X| > \varepsilon)

Therefore,

F(x-\varepsilon) - \mathbb{P}(|X_n-X| > \varepsilon) \leq F_n(x) \leq F(x+\varepsilon) +\mathbb{P}(|X_n-X| > \varepsilon)

Taking the limit as n \to \infty gives F(x-\varepsilon) \leq \liminf_{n\to \infty} F_n(x) \leq \limsup_{n\to \infty} F_n(x) \leq F(x+\varepsilon)

This holds for every \varepsilon > 0. Letting \varepsilon \to 0 and using the continuity of F at x gives \lim_n F_n(x)=F(x).

Finally, the third point. Fix \varepsilon > 0. Then

\mathbb{P}(|X_n-c| > \varepsilon) =\mathbb{P}(X_n < c-\varepsilon)+\mathbb{P}(X_n > c+ \varepsilon)\\\\ \leq \mathbb{P}(X_n \leq c-\varepsilon)+\mathbb{P}(X_n > c+ \varepsilon)\\\\ =F_n(c-\varepsilon)+1-F_n(c+\varepsilon)\\\\ \to F(c-\varepsilon)+1-F(c+\varepsilon)\\\\ =0+1-1=0

We now show by counterexample that the reverse implications do not hold.

Convergence in probability does not imply convergence in quadratic mean: let U \sim Unif(0,1), and define X_n =\sqrt{n}I_{(0,1/n)}(U). Then

\mathbb{P}(|X_n| > \varepsilon) = \mathbb{P}(\sqrt n I_{(0,1/n)}(U) > \varepsilon) = \mathbb{P}(0 \leq U < 1/n) = 1/n \to 0. Thus X_n \overset{P} \to 0. But for all n, \mathbb{E}(X_n^2)=n\int_0^{1/n}du=1, so Xn does not converge in quadratic mean.

Convergence in distribution does not imply convergence in probability: let X \sim N(0,1) and set X_n =-X for n=1,2,3,\dots. Then X_n \sim N(0,1), so Xn and X have the same distribution function for every n. Therefore \lim _n F_n(x) = F(x) for all x, and Xn converges to X in distribution. But \mathbb{P}(|X_n-X| > \epsilon) = \mathbb{P}(|2X| > \epsilon) = \mathbb{P}(|X| > \epsilon/2) \neq 0, so Xn does not converge to X in probability.

Warning: one might think that X_n \overset{P} \to b implies \mathbb{E}(X_n) \to b. This is not true. Let X_n be a random variable with \mathbb{P}(X_n=n^2)=1/n and \mathbb{P}(X_n=0) = 1-(1/n). Now \mathbb{P}(|X_n| < \varepsilon) = \mathbb{P}(X_n = 0) =1-(1/n) \to 1, so X_n \overset{P} \to 0. But \mathbb{E}(X_n) = [n^2\times(1/n)]+[0\times (1-(1/n))] = n, so \mathbb{E}(X_n) \to \infty.
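A small simulation sketch of this warning (my own, not from the text; the threshold 0.01 and the sample sizes are arbitrary): the fraction of nonzero draws goes to 0, so X_n \overset{P}{\to} 0, while the sample mean tracks n.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.01
for n in [10, 100, 1000]:
    x_n = np.where(rng.random(1_000_000) < 1.0 / n, float(n) ** 2, 0.0)  # X_n = n^2 w.p. 1/n, else 0
    print(n, np.mean(np.abs(x_n) > eps), x_n.mean())  # P(|X_n| > eps) -> 0 while E(X_n) ~ n
```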

5.5 Theorem

Let Xn, X, Yn, Y be random variables and let g be a continuous function.

  1. If X_n \overset{P} \to X and Y_n \overset{P} \to Y, then X_n+Y_n \overset{P} \to X+Y
  2. If X_n \overset{qm} \to X and Y_n \overset{qm} \to Y, then X_n+Y_n \overset{qm} \to X+Y
  3. If Xn converges to X in distribution and Yn converges to c in distribution, then Xn+Yn converges to X+c in distribution
  4. If X_n \overset{P} \to X and Y_n \overset{P} \to Y, then X_nY_n\overset{P}\to XY
  5. If Xn converges to X in distribution and Yn converges to c in distribution, then XnYn converges to cX in distribution
  6. If X_n \overset{P} \to X, then g(X_n) \overset{P} \to g(X)
  7. If Xn converges to X in distribution, then g(Xn) converges to g(X) in distribution

Parts 3 and 5 are known as Slutzky's theorem. It is worth noting that Xn converging to X in distribution and Yn converging to Y in distribution do not, in general, imply that Xn+Yn converges to X+Y in distribution.
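A hedged numerical illustration of that last remark (my own standard counterexample, not from the text; Z is assumed standard normal): take X_n = Z and Y_n = -Z. Each converges in distribution to N(0,1), yet X_n + Y_n is identically 0 rather than the N(0,2) one might expect from adding independent limits, because convergence in distribution says nothing about the joint dependence.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=100_000)
x_n, y_n = z, -z                 # X_n = Z and Y_n = -Z
print(np.var(x_n), np.var(y_n))  # both close to 1, so each sequence looks N(0,1)
print(np.var(x_n + y_n))         # exactly 0, not the variance 2 of an independent sum
```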

5.3 The Law of Large Numbers

Now we come to one of the crowning achievements of probability theory: the Law of Large Numbers. This theorem says that the mean of a large sample is close to the mean of the distribution. For example, if you toss a fair coin many times, the proportion of heads will be close to 1/2. Let us now state this more precisely.

Assume that X1, X2, ..., Xn are IID with \mu =\mathbb{E}(X_1) and \sigma^2=\mathbb{V}(X_1), and let \bar{X}_n=n^{-1}\Sigma X_i. Recall that \mathbb{E}(\bar{X}_n) = \mu and \mathbb{V}(\bar{X}_n)= \sigma^2/n.

5.6 Theorem

The Weak Law of Large Numbers (WLLN)

If X1, X2, ..., Xn are independent and identically distributed, then \bar{X}_n \overset{P} \to \mu.

Interpretation of the WLLN: as n increases, the distribution of \bar{X}_n concentrates around μ.

Proof: assume that \sigma < \infty. This assumption is not required, but it simplifies the proof. Using Chebyshev's inequality we get:

\mathbb{P}(|\bar{X}_n-\mu| > \varepsilon) \leq \frac{\mathbb{V}(\bar{X}_n)}{\varepsilon^2}=\frac{\sigma^2}{n\varepsilon^2}, which tends to 0 as n tends to infinity.

5.7 Example

Consider tossing a coin for which the probability of heads is p. Let X_i \in \{0,1\} be the outcome of the i-th toss. Then p=\mathbb{P}(X_i=1)=E(X_i), and the proportion of heads after n tosses is \bar{X}_n. According to the law of large numbers, \bar{X}_n converges to p in probability. This does not mean that \bar{X}_n will numerically equal p; it means that, when n is large, the distribution of \bar{X}_n is tightly concentrated around p. Suppose p=1/2. How large should n be so that \mathbb{P}(0.4 \leq \bar{X}_n \leq 0.6) \geq 0.7? First, \mathbb{E}(\bar{X}_n) = p = 1/2 and \mathbb{V}(\bar{X}_n)=\sigma^2/n=p(1-p)/n=1/(4n). From Chebyshev's inequality:

\mathbb{P}(0.4 \leq \bar{X}_n \leq 0.6) =\mathbb{P}(|\bar{X}_n-\mu| \leq 0.1)\\\\ =1-\mathbb{P}(|\bar{X}_n-\mu| > 0.1)\\\\ \geq 1-\frac{1}{4n(0.1)^2}\\\\ =1-\frac{25}{n}

Thus, if n = 84, the right-hand side is greater than 0.7.
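A quick check of this example (a sketch of my own, not from the text): with n = 84 fair-coin tosses, Chebyshev guarantees \mathbb{P}(0.4 \leq \bar{X}_n \leq 0.6) \geq 1 - 25/84 \approx 0.70, while simulation shows the true probability is much higher, since Chebyshev is a conservative bound.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 84
xbar = rng.binomial(n, 0.5, size=200_000) / n  # simulated sample proportions of heads
print(1 - 25 / n)                              # Chebyshev lower bound, about 0.702
print(np.mean((xbar >= 0.4) & (xbar <= 0.6)))  # simulated probability, well above the bound
```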

5.4 The Central Limit Theorem

The law of large numbers says that the distribution of \bar{X}_n piles up near \mu. But it does not help us approximate probability statements about \bar{X}_n; for that we need the central limit theorem.

Assume that X1, ..., Xn are IID with mean \mu and variance \sigma^2. The central limit theorem says that \bar{X}_n has a distribution that is approximately normal with mean \mu and variance \sigma^2/n. This is striking because it requires nothing more than the existence of a mean and a variance.

5.8 Theorem

The Central Limit Theorem (CLT). Let X1, ..., Xn be independent and identically distributed with mean \mu and variance \sigma^2. Let \bar{X}_n=n^{-1}\Sigma_{i=1}^nX_i. Then

Z_n=\frac{\bar{X}_n-\mu}{\sqrt{\mathbb{V}(\bar{X}_n)}}=\frac{\sqrt n(\bar{X}_n-\mu)}{\sigma} converges in distribution to Z \sim N(0,1).

In other words, \underset {n\to \infty }\lim \mathbb{P}(Z_n \leq z) = \Phi(z) = \int _{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx

Interpretation: probability statements about \bar{X}_n can be approximated using a normal distribution. It is the probability statements we are approximating, not the random variable itself.

Besides saying that the distribution of Zn converges to N(0,1), there are several equivalent notations for this statement; they all mean the same thing.

5.9 Example

Suppose that the number of errors per program follows a Poisson distribution with mean 5. We have 125 programs. Let X1, ..., X125 be the numbers of errors in these programs. We want to approximate \mathbb{P}(\bar{X}_n < 5.5).

Let \mu = E(X_1) = \lambda = 5 and \sigma^2 = \mathbb{V}(X_1) = \lambda =5. Then \mathbb{P}(\bar{X}_n < 5.5 ) = \mathbb{P}(\frac{\sqrt n (\bar{X}_n - \mu)}{\sigma} < \frac{\sqrt n (5.5 - \mu)}{\sigma} ) \approx \mathbb{P}(Z < 2.5) = 0.9938
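A hedged check of this example (an assumed helper script, not part of the text): compare the CLT approximation \mathbb{P}(Z < 2.5) \approx 0.9938 with a direct Monte Carlo estimate of \mathbb{P}(\bar{X}_n < 5.5), using the fact that a sum of 125 independent Poisson(5) counts is Poisson(625).

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam = 125, 5.0
total = rng.poisson(lam * n, size=200_000)  # sum of 125 iid Poisson(5) counts is Poisson(625)
xbar = total / n
print(np.mean(xbar < 5.5))                  # Monte Carlo estimate; the CLT approximation gives 0.9938
```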

The central limit theorem tells us that Z_n=\sqrt n (\bar{X}_n-\mu)/\sigma is approximately N(0,1). In practice, however, we rarely know \sigma. Later we will estimate \sigma using:

S_n^2=\frac{1}{n-1}\overset{n}{\underset {i=1}\Sigma}(X_i-\bar{X}_n)^2

This leads to the following question: does the central limit theorem still hold if we use S_n in place of \sigma? The answer is yes.

5.10 Theorem

Under the same conditions as the CLT,

\frac{\sqrt n (\bar{X}_n -\mu)}{S_n} converges in distribution to N(0,1)
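A minimal sketch of Theorem 5.10 (my own, with X_i assumed Exponential(1) so that \mu = 1): the studentized statistic \sqrt n(\bar{X}_n - \mu)/S_n behaves approximately like N(0,1) even though \sigma has been replaced by S_n.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20_000
x = rng.exponential(1.0, size=(reps, n))                         # mu = 1 for Exponential(1)
t = np.sqrt(n) * (x.mean(axis=1) - 1.0) / x.std(axis=1, ddof=1)  # S_n uses the n - 1 denominator
print(t.mean(), t.var())                                         # roughly 0 and 1
print(np.mean(t <= 1.645))                                       # roughly Phi(1.645) = 0.95
```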

You may be wondering how accurate this normal approximation is. The answer is given by the Berry-Esseen theorem.

5.11 Theorem (The Berry-Esseen Inequality)

Assume \mathbb{E}|X_1|^3 < \infty. Then \underset z {\sup}\,|\mathbb{P}(Z_n \leq z)-\Phi(z)| \leq \frac{33}{4}\frac{\mathbb{E}|X_1 - \mu|^3}{\sqrt n\,\sigma ^3}
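A small numerical sketch of this bound (my own, assuming X_i \sim Bernoulli(0.5), so \mu = 0.5, \sigma = 0.5, and \mathbb{E}|X_1-\mu|^3 = 0.125): the bound works out to 8.25/\sqrt{n}, which illustrates that it is conservative and only becomes informative for fairly large n.

```python
import numpy as np

sigma = 0.5
third_abs_moment = 0.125  # E|X_1 - mu|^3 = 0.5^3 for Bernoulli(0.5)
for n in [100, 10_000, 1_000_000]:
    bound = (33 / 4) * third_abs_moment / (np.sqrt(n) * sigma**3)
    print(n, bound)       # 8.25 / sqrt(n)
```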

There is also a multivariate version of the central limit theorem

5.12 Theorem (Multivariate central limit theorem)

Let X1,...Xn be independent and identically distributed vectors, where Xi is:

X_i=\begin{pmatrix} X_{1i}\\ X_{2i}\\ \vdots\\ X_{ki} \end{pmatrix}

The mean μ is:

\mu=\begin{pmatrix} \mu_1\\ \mu_2\\ \vdots \\ \mu_k \end{pmatrix}=\begin{pmatrix} \mathbb{E}(X_{1i})\\ \mathbb{E}(X_{2i})\\ \vdots \\ \mathbb{E}(X_{ki}) \end{pmatrix}

and the variance matrix is Σ.

Let \bar{X} = \begin{pmatrix} \bar{X}_1\\ \bar{X}_2\\ \vdots\\ \bar{X}_k \end{pmatrix}, where \bar{X}_j=n^{-1}\overset n {\underset {i=1}\Sigma }X_{ji}. Then \sqrt n(\bar{X} -\mu) converges in distribution to N(0,\Sigma).

5.5 Delta method

If the limiting distribution of Yn is normal, the Delta method gives a way to find the limiting distribution of g(Y_n), where g is a smooth function.

5.13 Theorem (Delta method)

Suppose that \frac{\sqrt n (Y_n -\mu)}{\sigma} converges in distribution to N(0,1) and that g is a differentiable function with g'(\mu) \neq 0. Then \frac{\sqrt n( g(Y_n) - g(\mu))}{|g'(\mu)|\sigma} converges in distribution to N(0,1).

In other words, Y_n \approx N(\mu,\frac{\sigma^2}{n}) implies g(Y_n) \approx N(g(\mu),(g'(\mu))^2\ \frac{\sigma^2}{n})

5.14 Example

Let X1, ..., Xn be independent and identically distributed with finite mean μ and finite variance σ². According to the central limit theorem, \sqrt n (\bar X_n -\mu )/\sigma converges in distribution to N(0,1). Let W_n=e^{\bar X_n}. Then W_n=g(\bar X_n), where g(s)=e^s and g'(s)=e^s. According to the Delta method, W_n \approx N(e^\mu,e^{2\mu}\sigma^2/n).
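A simulation sketch of this example (my own, with X_i assumed Exponential(1) so that \mu = \sigma^2 = 1): the simulated center and spread of W_n = e^{\bar X_n} should be close to the delta-method values e^{\mu} and e^{2\mu}\sigma^2/n.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 200, 20_000
xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)  # mu = 1, sigma^2 = 1 for Exponential(1)
w = np.exp(xbar)                                          # W_n = exp(Xbar_n)
print(w.mean(), np.e)                                     # center close to e^mu = e
print(w.var(), np.e**2 / n)                               # spread close to e^{2 mu} sigma^2 / n
```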

The delta method also has a multivariate version

5.15 Theorem

Let Y_n=(Y_{n1},\dots,Y_{nk}) be a sequence of random vectors such that

\sqrt n (Y_n -\mu ) converges in distribution to N(0,\Sigma).

Let g:\mathbb{R}^k \to \mathbb{R}, and define

\triangledown g(y)=\begin{pmatrix} \frac{\partial g}{\partial y_1}\\ \vdots\\ \frac{\partial g}{\partial y_k} \end{pmatrix}

Let \triangledown _\mu denote \triangledown g(y) evaluated at y=\mu, and assume that no element of \triangledown _\mu is 0. Then

\sqrt n (g(Y_n)-g(\mu)) converges in distribution to N(0,\triangledown _\mu^T\Sigma\triangledown _\mu)

5.16 Example

Let \begin{pmatrix} X_{11}\\ X_{21} \end{pmatrix},\begin{pmatrix} X_{12}\\ X_{22} \end{pmatrix},\dots, \begin{pmatrix} X_{1n}\\ X_{2n} \end{pmatrix} be IID random vectors with mean \mu=(\mu_1,\mu_2)^T and variance Σ. Let \bar X_1 = \frac{1}{n}\overset n {\underset{i=1}\Sigma}X_{1i} and \bar X_2 = \frac{1}{n}\overset n {\underset{i=1}\Sigma}X_{2i}, and define Y_n=\bar X_1 \bar X_2. Then Y_n=g(\bar X_1,\bar X_2), where g(s_1,s_2)=s_1s_2. According to the central limit theorem,

\sqrt n \begin{pmatrix} \bar X_1 - \mu_1\\ \bar X_2 - \mu_2 \end{pmatrix} converges in distribution to N(0,Σ).

Now \triangledown g(s)=\begin{pmatrix} \frac{\partial g}{\partial s_1}\\ \frac{\partial g}{\partial s_2} \end{pmatrix}=\begin{pmatrix} s_2\\ s_1 \end{pmatrix}, and \triangledown_\mu^T\Sigma\triangledown_\mu=(\mu_2\ \ \mu_1)\begin{pmatrix} \sigma_{11} & \sigma_{12}\\ \sigma_{21} & \sigma_{22} \end{pmatrix}\begin{pmatrix} \mu_2\\ \mu_1 \end{pmatrix}=\mu_2^2\sigma_{11}+2\mu_1\mu_2\sigma_{12}+\mu_1^2\sigma_{22}

Therefore, \sqrt n (\bar X_1 \bar X_2 - \mu_1\mu_2) converges in distribution to N(0,\mu_2^2\sigma_{11}+2\mu_1\mu_2\sigma_{12}+\mu_1^2\sigma_{22})
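A hedged sketch of this example (my own, with the data assumed bivariate normal and a particular \mu and \Sigma chosen for illustration): the simulated variance of \sqrt n(\bar X_1 \bar X_2 - \mu_1\mu_2) should be close to \mu_2^2\sigma_{11}+2\mu_1\mu_2\sigma_{12}+\mu_1^2\sigma_{22}.

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([2.0, 3.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
n, reps = 200, 10_000
data = rng.multivariate_normal(mu, Sigma, size=(reps, n))  # reps data sets of n bivariate draws
xbar = data.mean(axis=1)                                   # (reps, 2) array of sample mean vectors
stat = np.sqrt(n) * (xbar[:, 0] * xbar[:, 1] - mu[0] * mu[1])
predicted = mu[1]**2 * Sigma[0, 0] + 2 * mu[0] * mu[1] * Sigma[0, 1] + mu[0]**2 * Sigma[1, 1]
print(stat.var(), predicted)                               # simulated vs. delta-method variance
```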

End of this chapter

Not translated here: bibliographic remarks, appendix, and exercises.

Origin blog.csdn.net/xiaowanbiao123/article/details/133301048