Chapter 5 (Limit Theorems): Markov and Chebyshev Inequalities

This post is a set of reading notes on *Introduction to Probability*.

Limit Theorems

  • In this chapter, we discuss some fundamental issues related to the asymptotic behavior of sequences of random variables. Our principal context involves a sequence $X_1, X_2, \ldots$ of independent identically distributed random variables with mean $\mu$ and variance $\sigma^2$.
  • Let
    $$S_n = X_1 + \cdots + X_n$$
    be the sum of the first $n$ of them. Limit theorems are mostly concerned with the properties of $S_n$ and related random variables as $n$ becomes very large.
    $$\mathrm{var}(S_n) = \mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n) = n\sigma^2$$
    Thus, the distribution of $S_n$ spreads out as $n$ increases, and cannot have a meaningful limit.
  • The situation is different if we consider the sample mean
    $$M_n = \frac{X_1 + \cdots + X_n}{n} = \frac{S_n}{n}$$
    We have
    $$E[M_n] = \mu, \qquad \mathrm{var}(M_n) = \frac{\sigma^2}{n}$$
    In particular, the variance of $M_n$ decreases to zero as $n$ increases, so the bulk of the distribution of $M_n$ becomes concentrated near $\mu$.
    • This phenomenon is the subject of certain laws of large numbers, which generally assert that the sample mean $M_n$ (a random variable) converges to the true mean $\mu$ (a number), in a precise sense.
    • These laws provide a mathematical basis for the loose interpretation of an expectation $E[X] = \mu$ as the average of a large number of independent samples drawn from the distribution of $X$.
  • We will also consider a quantity which is intermediate between $S_n$ and $M_n$:
    $$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$
    It can be seen that
    $$E[Z_n] = 0, \qquad \mathrm{var}(Z_n) = 1$$
    Since the mean and the variance of $Z_n$ remain unchanged as $n$ increases, its distribution neither spreads out nor shrinks to a point. The central limit theorem is concerned with the asymptotic shape of the distribution of $Z_n$, and asserts that it approaches the standard normal distribution.
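
To get a feel for these three scaling behaviors, here is a small simulation sketch (my own, not from the book; the exponential distribution and all variable names are arbitrary choices) that estimates the variances of $S_n$, $M_n$, and $Z_n$ for increasing $n$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0                      # mean and standard deviation of Exponential(1)

for n in [10, 100, 1000]:
    # 10000 independent realizations of X_1 + ... + X_n
    X = rng.exponential(scale=1.0, size=(10000, n))
    S_n = X.sum(axis=1)
    M_n = S_n / n
    Z_n = (S_n - n * mu) / (sigma * np.sqrt(n))
    print(f"n={n:5d}  var(S_n)={S_n.var():9.1f}  "
          f"var(M_n)={M_n.var():.5f}  var(Z_n)={Z_n.var():.3f}")

# var(S_n) grows roughly like n*sigma^2, var(M_n) shrinks like sigma^2/n,
# and var(Z_n) stays near 1, matching the formulas above.
```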

Markov Inequality

If a random variable $X$ can only take nonnegative values, then
$$P(X \geq a) \leq \frac{E[X]}{a}, \qquad \text{for all } a > 0$$

  • Loosely speaking, it asserts that if a *nonnegative* random variable has a small mean, then the probability that it takes a large value must also be small.

  • To justify the Markov inequality, let us fix a positive number $a$ and consider the random variable $Y_a$ defined by
    $$Y_a = \begin{cases} 0, & \text{if } X < a \\ a, & \text{if } X \geq a \end{cases}$$
    It is seen that the relation $Y_a \leq X$ always holds and, therefore,
    $$E[X] \geq E[Y_a] = a\,P(X \geq a)$$
    from which the Markov inequality follows.
    For example, if $X$ is uniformly distributed on $[0, 4]$, so that $E[X] = 2$, the Markov inequality gives $P(X \geq 2) \leq 1$, $P(X \geq 3) \leq 2/3$, and $P(X \geq 4) \leq 1/2$, while the exact probabilities are $1/2$, $1/4$, and $0$, respectively.
  • We see that the bounds provided by the Markov inequality can be quite loose.
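
The looseness is easy to see numerically. Below is a minimal Monte Carlo sketch (my own illustration, not from the text) comparing the Markov bound $E[X]/a$ with the estimated tail probability for a uniform random variable on $[0, 4]$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 4.0, size=1_000_000)   # nonnegative random variable with E[X] = 2
EX = X.mean()

for a in [2.0, 3.0, 4.0]:
    exact = (X >= a).mean()                  # Monte Carlo estimate of P(X >= a)
    markov = EX / a                          # Markov bound E[X]/a
    print(f"a={a}:  P(X >= a) ~ {exact:.3f}   Markov bound = {markov:.3f}")

# The bound always holds, but it is loose: for a = 2 it gives 1.0 against a true value of 0.5.
```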

Chebyshev Inequality

If $X$ is a random variable with mean $\mu$ and variance $\sigma^2$, then
$$P(|X - \mu| \geq c) \leq \frac{\sigma^2}{c^2}, \qquad \text{for all } c > 0$$

  • Loosely speaking, it asserts that if a random variable has small variance, then the probability that it takes a value far from its mean is also small.
  • Note that the Chebyshev inequality does not require the random variable to be nonnegative.
  • An alternative form of the Chebyshev inequality is obtained by letting $c = k\sigma$, where $k$ is positive, which yields
    $$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$
    Thus, the probability that a random variable takes a value more than $k$ standard deviations away from its mean is at most $1/k^2$.

  • To justify the Chebyshev inequality, we consider the nonnegative random variable $(X - \mu)^2$ and apply the Markov inequality with $a = c^2$. We obtain
    $$P\bigl((X - \mu)^2 \geq c^2\bigr) \leq \frac{E[(X - \mu)^2]}{c^2} = \frac{\sigma^2}{c^2}$$
    The event $(X - \mu)^2 \geq c^2$ is identical to the event $|X - \mu| \geq c$, so this is exactly the Chebyshev inequality.
  • For a similar derivation that bypasses the Markov inequality, assume for simplicity that $X$ is a continuous random variable, introduce the function
    $$g(x) = \begin{cases} 0, & \text{if } |x - \mu| < c \\ c^2, & \text{if } |x - \mu| \geq c \end{cases}$$
    note that $(x - \mu)^2 \geq g(x)$ for all $x$, and write
    $$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\, dx \geq \int_{-\infty}^{\infty} g(x) f_X(x)\, dx = c^2 P(|X - \mu| \geq c)$$

  • The Chebyshev inequality tends to be more powerful than the Markov inequality (the bounds that it provides are more accurate), because it also uses information on the variance of $X$.
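
A quick comparison of the two bounds on the same tail event (again a sketch of my own; the exponential distribution is an arbitrary choice with $\mu = \sigma^2 = 1$, and the Chebyshev bound is applied via $P(X \geq a) \leq P(|X - \mu| \geq a - \mu)$ for $a > \mu$):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=1.0, size=1_000_000)   # Exponential(1): mu = 1, sigma^2 = 1
mu, var = 1.0, 1.0

for a in [2.0, 3.0, 5.0]:
    exact = (X >= a).mean()                       # Monte Carlo estimate of P(X >= a)
    markov = mu / a                               # Markov: P(X >= a) <= E[X]/a
    chebyshev = var / (a - mu) ** 2               # Chebyshev: P(|X - mu| >= a - mu) <= var/(a - mu)^2
    print(f"a={a}:  exact ~ {exact:.4f}   Markov = {markov:.3f}   Chebyshev = {chebyshev:.3f}")

# For larger a the Chebyshev bound is tighter than the Markov bound here,
# though both sit far above the true (exponentially small) tail.
```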

Example 5.3. Upper Bounds in the Chebyshev Inequality.

  • When $X$ is known to take values in a range $[a, b]$, we claim that $\boldsymbol{\sigma^2 \leq (b - a)^2/4}$. Thus, if $\sigma^2$ is unknown, we may use the bound $(b - a)^2/4$ in place of $\sigma^2$ in the Chebyshev inequality, and obtain
    $$P(|X - \mu| \geq c) \leq \frac{(b - a)^2}{4c^2}, \qquad \text{for all } c > 0$$
  • To verify our claim, note that for any constant $\gamma$, we have
    $$E[(X - \gamma)^2] = E[X^2] - 2E[X]\gamma + \gamma^2$$
    and the above quadratic is minimized when $\gamma = E[X]$. It follows that
    $$\sigma^2 = E\bigl[(X - E[X])^2\bigr] \leq E[(X - \gamma)^2], \qquad \text{for all } \gamma$$
    By letting $\gamma = (a + b)/2$, we obtain
    $$\sigma^2 \leq E\left[\left(X - \frac{a + b}{2}\right)^2\right] = E[(X - a)(X - b)] + \frac{(b - a)^2}{4} \leq \frac{(b - a)^2}{4}$$
    where the last inequality holds because $(X - a)(X - b) \leq 0$ whenever $X$ takes values in $[a, b]$. The bound is satisfied with equality when $X$ is the random variable that takes the two extreme values $a$ and $b$ with equal probability $1/2$.
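
A small numerical check of the claim (my own sketch; the particular distributions are arbitrary examples supported on $[0, 1]$):

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 0.0, 1.0
bound = (b - a) ** 2 / 4                          # claimed maximum variance: 0.25

uniform = rng.uniform(a, b, size=1_000_000)       # spread evenly over [a, b]
u_shaped = rng.beta(0.5, 0.5, size=1_000_000)     # pushes mass toward the endpoints
two_point = rng.choice([a, b], size=1_000_000)    # extreme values with probability 1/2 each

for name, x in [("uniform", uniform), ("beta(0.5, 0.5)", u_shaped), ("two-point", two_point)]:
    print(f"{name:15s} var = {x.var():.4f}   bound = {bound}")

# The two-point distribution attains the bound; the other distributions fall below it.
```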

Problem 2. The Chernoff bound.
The Chernoff bound is a powerful tool that relies on the transform associated with a random variable, and provides bounds on the probabilities of certain tail events.

  • (a) Show that the inequality
    $$P(X \geq a) \leq e^{-sa} M(s)$$
    holds for every $a$ and every $s \geq 0$, where $M(s) = E[e^{sX}]$ is the transform associated with the random variable $X$, assumed to be finite in a small open interval containing $s = 0$.
  • (b) Show that the inequality
    $$P(X \leq a) \leq e^{-sa} M(s)$$
    holds for every $a$ and every $s \leq 0$.
  • (c) Show that the inequality
    $$P(X \geq a) \leq e^{-\phi(a)}$$
    holds for every $a$, where
    $$\phi(a) = \max_{s \geq 0}\bigl(sa - \ln M(s)\bigr)$$
    (A numerical illustration of $\phi(a)$ is sketched right after this problem statement.)
  • (d) Show that if $a > E[X]$, then $\phi(a) > 0$.
  • (e) Apply the result of part (c) to obtain a bound for $P(X \geq a)$, for the case where $X$ is a standard normal random variable and $a > 0$.
  • (f) Let $X_1, X_2, \ldots$ be independent random variables with the same distribution as $X$. Show that for any $a > E[X]$, we have
    $$P\left(\frac{1}{n}\sum_{i=1}^n X_i \geq a\right) \leq e^{-n\phi(a)}$$
    so that the probability that the sample mean exceeds the mean by a certain amount decreases exponentially with $n$.
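
Before working through the solution, here is a sketch of how $\phi(a)$ can be computed in practice (my own illustration; the exponential distribution, whose transform is $M(s) = 1/(1 - s)$ for $s < 1$, is an arbitrary choice and not part of the problem):

```python
import numpy as np

# Chernoff exponent phi(a) = max_{s >= 0} (s*a - ln M(s)) for an Exponential(1)
# random variable, approximated by a grid search over s in [0, 1).
def phi(a, s_grid=np.linspace(0.0, 0.999, 10_000)):
    log_M = -np.log(1.0 - s_grid)                # ln M(s) = -ln(1 - s) for this distribution
    return np.max(s_grid * a - log_M)

for a in [1.5, 2.0, 3.0]:
    # closed form for this distribution: phi(a) = a - 1 - ln(a), attained at s = 1 - 1/a
    print(f"a={a}:  grid phi(a) = {phi(a):.4f}   closed form = {a - 1 - np.log(a):.4f}")
```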

SOLUTION

  • (a) Given some $a$ and $s \geq 0$, consider the random variable $Y_a$ defined by
    $$Y_a = \begin{cases} 0, & \text{if } X < a \\ e^{sa}, & \text{if } X \geq a \end{cases}$$
    It is seen that the relation
    $$Y_a \leq e^{sX}$$
    always holds. Thus,
    $$M(s) = E[e^{sX}] \geq E[Y_a] = P(X \geq a)\, e^{sa}$$
    from which we obtain
    $$P(X \geq a) \leq e^{-sa} M(s)$$
  • (b) We define $Y_a$ by
    $$Y_a = \begin{cases} e^{sa}, & \text{if } X \leq a \\ 0, & \text{if } X > a \end{cases}$$
    Since $s \leq 0$, the relation
    $$Y_a \leq e^{sX}$$
    always holds. Thus,
    $$M(s) = E[e^{sX}] \geq E[Y_a] = P(X \leq a)\, e^{sa}$$
    from which we obtain
    $$P(X \leq a) \leq e^{-sa} M(s)$$
  • (c) Since the inequality from part (a) is valid for every $s \geq 0$, we obtain
    $$\begin{aligned}P(X \geq a) &\leq \min_{s \geq 0}\bigl(e^{-sa} M(s)\bigr) \\ &= \min_{s \geq 0} e^{-(sa - \ln M(s))} \\ &= e^{-\max_{s \geq 0}(sa - \ln M(s))} \\ &= e^{-\phi(a)}\end{aligned}$$
  • (d) For $s = 0$, we have
    $$sa - \ln M(s) = 0$$
    since $M(0) = 1$. Furthermore,
    $$\frac{d}{ds}\bigl(sa - \ln M(s)\bigr)\bigg|_{s=0} = a - \frac{1}{M(s)} \cdot \frac{d}{ds} M(s)\bigg|_{s=0} = a - E[X] > 0$$
    Since the function $sa - \ln M(s)$ is zero and has a positive derivative at $s = 0$, it must be positive when $s$ is positive and small. It follows that the maximum $\phi(a)$ of the function $sa - \ln M(s)$ over all $s \geq 0$ is also positive.
  • (e) For a standard normal random variable $X$, we have $M(s) = e^{s^2/2}$. Therefore, $sa - \ln M(s) = sa - s^2/2$, which is maximized over $s \geq 0$ at $s = a$ (recall that $a > 0$), giving $\phi(a) = a^2/2$. Thus,
    $$P(X \geq a) \leq e^{-a^2/2}$$
  • (f) Let $Y = X_1 + \cdots + X_n$. Using the result of part (c), we have
    $$P\left(\frac{1}{n}\sum_{i=1}^n X_i \geq a\right) = P(Y \geq na) \leq e^{-\phi_Y(na)}$$
    where, since the transform of a sum of $n$ independent copies of $X$ is $M_Y(s) = \bigl(M(s)\bigr)^n$,
    $$\begin{aligned}\phi_Y(na) &= \max_{s \geq 0}\bigl(nsa - \ln M_Y(s)\bigr) \\ &= \max_{s \geq 0}\bigl(nsa - \ln M(s)^n\bigr) \\ &= n\max_{s \geq 0}\bigl(sa - \ln M(s)\bigr) \\ &= n\phi(a)\end{aligned}$$
    Thus,
    $$P\left(\frac{1}{n}\sum_{i=1}^n X_i \geq a\right) \leq e^{-n\phi(a)}$$
    Note that when $a > E[X]$, part (d) asserts that $\phi(a) > 0$, so the probability of interest decreases exponentially with $n$.
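
A minimal Monte Carlo check of this exponential decay (my own sketch, not from the text): with standard normal samples, part (e) gives $\phi(a) = a^2/2$, so the bound on the sample-mean tail is $e^{-na^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(4)
a = 0.5                          # threshold above E[X] = 0
phi_a = a ** 2 / 2               # Chernoff exponent for the standard normal (part (e))

for n in [10, 20, 40]:
    # 300,000 realizations of the sample mean of n standard normal variables
    M_n = rng.standard_normal((300_000, n)).mean(axis=1)
    estimate = (M_n >= a).mean()
    bound = np.exp(-n * phi_a)
    print(f"n={n:2d}  P(sample mean >= {a}) ~ {estimate:.2e}   bound e^(-n*phi) = {bound:.2e}")

# Both the estimate and the bound fall off exponentially in n, with the bound always above the estimate.
```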

Problem 3. Jensen inequality.
A twice differentiable real-valued function $f$ of a single variable is called convex if its second derivative $(d^2 f/dx^2)(x)$ is nonnegative for all $x$ in its domain of definition.

  • (a) Show that if $f$ is twice differentiable and convex, then the first-order Taylor approximation of $f$ is an underestimate of the function, that is,
    $$f(a) + (x - a)\frac{df}{dx}(a) \leq f(x)$$
    for every $a$ and $x$.
  • (b) Show that if $f$ has the property in part (a), and if $X$ is a random variable, then
    $$f(E[X]) \leq E[f(X)]$$

SOLUTION

  • (a) Since $f$ is convex, its derivative $df/dx$ is nondecreasing, so $\frac{df}{dx}(t) \geq \frac{df}{dx}(a)$ for $t \geq a$ (the case $x < a$ is handled symmetrically). Hence
    $$f(x) = f(a) + \int_a^x \frac{df}{dx}(t)\, dt \geq f(a) + \int_a^x \frac{df}{dx}(a)\, dt = f(a) + (x - a)\frac{df}{dx}(a)$$
  • (b) Since the inequality from part (a) is valid for every possible value $x$ of the random variable $X$, we obtain
    $$f(a) + (X - a)\frac{df}{dx}(a) \leq f(X)$$
    We now choose $a = E[X]$ and take expectations, to obtain
    $$f(E[X]) + (E[X] - E[X])\frac{df}{dx}(E[X]) \leq E[f(X)]$$
    and therefore
    $$f(E[X]) \leq E[f(X)]$$
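
As a final sanity check (my own sketch; the uniform distribution and the two convex functions are arbitrary choices), the Jensen inequality $f(E[X]) \leq E[f(X)]$ can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(0.0, 1.0, size=1_000_000)   # bounded support keeps E[f(X)] finite

for name, f in [("x^2", np.square), ("exp", np.exp)]:
    lhs = f(X.mean())                        # f(E[X]), with E[X] estimated by the sample mean
    rhs = f(X).mean()                        # Monte Carlo estimate of E[f(X)]
    print(f"f = {name:4s}:  f(E[X]) ~ {lhs:.3f}   E[f(X)] ~ {rhs:.3f}")

# In both cases f(E[X]) <= E[f(X)], as the Jensen inequality asserts.
```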

Reposted from blog.csdn.net/weixin_42437114/article/details/113944888