Statistics (1) Probability
Update history

| date | content |
|---|---|
| 2023-9-24 | Changed "intersection" to "union" in the third axiom that a probability must satisfy |
To learn machine learning I am relearning statistics and recording the whole process. This article is a study note, translated mainly from *All of Statistics* by Larry Wasserman.
Contents of this Chapter
1.1 Introduction
1.2 Sample space and events
1.3 Probability
1.4 Probability in finite sampling space
1.5 Independent events
1.6 Conditional probability
1.7 Bayes’ theorem
Because translating English terms into Chinese often blurs their meaning, the key English terms of this chapter are listed below; working with the English terms directly usually gives a deeper understanding.
Key terms:
- sample space
- event
- sample outcome
- realization
- element
- disjoint
- mutually exclusive
- indicator function
- probability distribution
- probability measure
- axiom
- frequency
- degree of belief
- frequentist
- Bayesian
- lemma
- theorem
- equally likely
- uniform probability distribution
- combinatorial methods
- independent events
- Venn diagram
- conditional probability
- expert system
- Bayes' net (Bayesian network)
- the Law of Total Probability
- prior probability
- posterior probability
1.1 Introduction
Probability is the mathematical language for quantifying uncertainty. This chapter introduces the basic concepts of probability theory, starting with the sample space, the set of all possible outcomes.
1.2 Sample space and events
The sample space Ω is the set of all possible outcomes of a random experiment. Points ω in Ω are called sample outcomes, realizations, or elements.
Subsets of Ω are called events.
1.1 Example
If we toss a coin twice, then Ω = {HH, HT, TH, TT}. The event that the first toss is heads is A = {HH, HT}.
1.2 Example
Let ω be the outcome of a measurement of some physical quantity, such as temperature. Then Ω = ℝ = (−∞, ∞). One might object that defining Ω as ℝ is inaccurate, since temperature has a lower limit, but there is usually no harm in taking the sample space larger than necessary.
The event that the temperature is greater than 10 but less than or equal to 23 can be expressed as A = (10, 23].
1.3 Example
If we toss a coin forever, the sample space is the infinite set

Ω = {ω = (ω1, ω2, ω3, …) : ωi ∈ {H, T}}

Let E be the event that the first head appears on the third toss. Then E can be expressed as:

E = {(ω1, ω2, ω3, …) : ω1 = T, ω2 = T, ω3 = H, ωi ∈ {H, T} for i > 3}
Symbols for complement, union, intersection and difference
Given an event A, its complement is

A^c = {ω ∈ Ω : ω ∉ A}

Formally, A^c is read "not A".
Then the complement of Ω is ∅.
The union of events A and B is

A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B or ω belongs to both}

which can be read as "A or B".
If A1, A2, … is a sequence of sets, their union is

⋃_{i=1}^∞ Ai = {ω ∈ Ω : ω ∈ Ai for at least one i}
The intersection of events A and B is

A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}

read "A and B"; the intersection is sometimes written AB or (A, B).
If A1, A2, … is a sequence of sets, their intersection is

⋂_{i=1}^∞ Ai = {ω ∈ Ω : ω ∈ Ai for all i}
The set difference is defined by

A − B = {ω : ω ∈ A, ω ∉ B}
If every element of A is also contained in B, we write A ⊂ B or, equivalently, B ⊃ A.
If A is a finite set, then |A| represents the number of elements in A
The notation is summarized below:

| symbol | meaning |
|---|---|
| Ω | sample space (true event) |
| ω | outcome (point, element) |
| A | event (subset of Ω) |
| A^c | complement of A (not A) |
| A ∪ B | union (A or B) |
| A ∩ B or AB | intersection (A and B) |
| A − B | set difference (in A but not in B) |
| A ⊂ B | set inclusion |
| ∅ | null event (always false) |
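These set operations map directly onto Python's built-in `set` type; a minimal sketch using the two-coin sample space of Example 1.1 (the variable names are mine):

```python
# Sample space for two coin tosses (Example 1.1)
omega = {"HH", "HT", "TH", "TT"}

A = {"HH", "HT"}  # first toss is heads
B = {"HH", "TH"}  # second toss is heads

print(sorted(omega - A))  # complement A^c: ['TH', 'TT']
print(sorted(A | B))      # union A ∪ B: ['HH', 'HT', 'TH']
print(sorted(A & B))      # intersection A ∩ B: ['HH']
print(sorted(A - B))      # difference A − B: ['HT']
print(A <= omega)         # inclusion A ⊂ Ω: True
print(len(A))             # |A| = 2
```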
mutually exclusive, disjoint
If Ai ∩ Aj = ∅ whenever i ≠ j, we say that A1, A2, … are disjoint, or mutually exclusive.
For example, A1=(0,1],A2=(1,2],A3=(2,3],… are disjoint.
A partition of Ω is a sequence of disjoint sets A1, A2, A3, … satisfying

⋃_{i=1}^∞ Ai = Ω
Indicator function
The indicator function of A is defined as

I_A(ω) = 1 if ω ∈ A, and I_A(ω) = 0 if ω ∉ A
Monotone increasing and monotone decreasing
If A1 ⊂ A2 ⊂ A3 ⊂ ⋯, we say the sequence A1, A2, A3, … is monotone increasing, and we define

lim_{n→∞} An = ⋃_{i=1}^∞ Ai

If A1 ⊃ A2 ⊃ A3 ⊃ ⋯, the sequence is monotone decreasing, and we define

lim_{n→∞} An = ⋂_{i=1}^∞ Ai

In either case we write An → A, where A is the limit above.
1.4 Example
Let Ω = ℝ and let Ai = [0, 1/i) for i = 1, 2, …. The sequence is monotone decreasing, ⋃i Ai = [0, 1), and lim An = ⋂i Ai = {0}. If instead Ai = (0, 1/i), then ⋃i Ai = (0, 1) and lim An = ⋂i Ai = ∅.
1.3 Probability
We will assign a real number P(A) to event A, called the probability of A. We call this P a probability distribution or probability measure.
1.5 Definition of probability distribution or probability measure
A function P that assigns a real number P(A) to every event A is called a probability distribution or a probability measure if it satisfies the following three axioms:
- Axiom 1: P(A) ≥ 0 for every A
- Axiom 2: P(Ω) = 1
- Axiom 3: If A1, A2, … are disjoint, then

  P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai)
There are many interpretations of P(A); the two most common are frequency and degree of belief. In the frequency interpretation, P(A) is the long-run proportion of repetitions of the experiment in which A is true.
For example, saying that the probability of heads is 1/2 means that, as the number of tosses grows, the proportion of heads tends to 1/2.
An infinitely long, unpredictable sequence of tosses whose proportion of heads tends to a constant is an idealization, much as a straight line is an idealization in geometry.
In the degree-of-belief interpretation, P(A) measures an observer's strength of belief that A is true.
Under either interpretation, we require P to satisfy the three axioms above. The difference between the interpretations will not matter much until we deal with statistical inference, where the two interpretations give rise to two schools of inference: the frequentist and the Bayesian. We return to this in Chapter 11.
Many properties of P can be derived from these three axioms, for example:
- P(∅) = 0
- A ⊂ B ⟹ P(A) ≤ P(B)
- 0 ≤ P(A) ≤ 1
- P(A^c) = 1 − P(A)
- If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
In the following lemma (Lemma), a less obvious property is given
1.6 Lemma
For any events A and B, they satisfy the following formula
P(A∪B) = P(A) + P(B) - P(AB)
Proof: Write A ∪ B as a union of disjoint events:

A ∪ B = (A ∩ B^c) ∪ (A ∩ B) ∪ (A^c ∩ B)

By Axiom 3,

P(A ∪ B) = P(A ∩ B^c) + P(A ∩ B) + P(A^c ∩ B)

Since P(A) = P(A ∩ B^c) + P(A ∩ B) and P(B) = P(A^c ∩ B) + P(A ∩ B), substituting P(A ∩ B^c) = P(A) − P(AB) and P(A^c ∩ B) = P(B) − P(AB) gives

P(A ∪ B) = P(A) + P(B) − P(AB). ∎
1.7 Example
Toss two coins, let H1 be the event that the first toss is heads and H2 the event that the second toss is heads. If all outcomes are equally likely, then P(H1∪H2) = P(H1) + P(H2) − P(H1H2) = 1/2 + 1/2 − 1/4 = 3/4
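This can be verified by brute-force enumeration; a small sketch (names are mine) that checks Lemma 1.6 on the two-coin sample space:

```python
from fractions import Fraction
from itertools import product

# All four equally likely outcomes of two tosses
outcomes = [a + b for a, b in product("HT", repeat=2)]  # ['HH','HT','TH','TT']

def prob(event):
    """P(A) = |A| / |Ω| for equally likely outcomes, as an exact fraction."""
    return Fraction(sum(o in event for o in outcomes), len(outcomes))

H1 = {o for o in outcomes if o[0] == "H"}  # first toss heads
H2 = {o for o in outcomes if o[1] == "H"}  # second toss heads

# Lemma 1.6: P(H1 ∪ H2) = P(H1) + P(H2) − P(H1 H2)
lhs = prob(H1 | H2)
rhs = prob(H1) + prob(H2) - prob(H1 & H2)
print(lhs, rhs)  # 3/4 3/4
```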
1.8 Theorem (continuity of probabilities)
If An → A as n → ∞ (that is, the sequence An is monotone and A is its limit), then P(An) → P(A).
Proof: Suppose the An are monotone increasing, so A1 ⊂ A2 ⊂ ⋯ and A = ⋃_{i=1}^∞ Ai. Define B1 = A1 and, for n ≥ 2, Bn = {ω ∈ An : ω ∉ Ai for i < n}. The Bn are disjoint, An = ⋃_{i=1}^n Bi for every n, and ⋃_{i=1}^∞ Bi = ⋃_{i=1}^∞ Ai = A. By Axiom 3,

P(An) = P(⋃_{i=1}^n Bi) = Σ_{i=1}^n P(Bi)

and therefore

lim_{n→∞} P(An) = lim_{n→∞} Σ_{i=1}^n P(Bi) = Σ_{i=1}^∞ P(Bi) = P(⋃_{i=1}^∞ Bi) = P(A)

The monotone decreasing case follows by taking complements. ∎
1.4 Probability on Finite Sample Spaces
Suppose the sample space is finite, Ω = {ω1, …, ωn}. For example, if we toss a die twice, the sample space Ω = {(i, j) : i, j ∈ {1, …, 6}} has 36 elements. If each outcome is equally likely, then P(A) = |A|/36, where |A| denotes the number of elements in A.
The probability that the sum of the dice is 11 is 2/36, since only two outcomes, (5, 6) and (6, 5), correspond to this event.
If Ω is finite and every outcome is equally likely, then
P(A)=|A|/|Ω|
This is called: **uniform probability distribution**.
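The formula P(A) = |A|/|Ω| is easy to apply by enumerating the sample space; a minimal sketch for the two-dice example above (variable names are mine):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # all 36 equally likely rolls
A = [w for w in omega if sum(w) == 11]        # rolls whose sum is 11

p = Fraction(len(A), len(omega))              # P(A) = |A| / |Ω|
print(A)   # [(5, 6), (6, 5)]
print(p)   # 1/18 (= 2/36)
```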
To compute such probabilities we need to count the points in the event A. Methods for counting points are called combinatorial methods. We need not delve into these in any great detail, but a few counting facts will be useful later.
Given n objects, the number of ways of ordering (permuting) them is n! = n(n−1)(n−2)⋯3·2·1, where we define 0! = 1.
We also define

(n choose k) = n! / (k!(n−k)!)

read "n choose k", which counts the number of distinct ways of choosing k objects from n.
For example, if we have a class of 20 people and need to choose a committee of 3, the number of possible choices is

(20 choose 3) = 20! / (3!·17!) = 1140
We also note the following properties:

(n choose 0) = (n choose n) = 1  and  (n choose k) = (n choose n−k)
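These counting facts are available directly in Python's standard library; a quick sketch reusing the committee example above:

```python
from math import comb, factorial

# n! permutations of n objects, with 0! = 1
print(factorial(4))   # 24
print(factorial(0))   # 1

# "20 choose 3": a 3-person committee from a class of 20
print(comb(20, 3))    # 1140

# Symmetry: choosing k to include = choosing n−k to leave out
print(comb(20, 3) == comb(20, 17))   # True
print(comb(20, 0) == comb(20, 20) == 1)  # True
```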
1.5 Independent Events
If we toss a coin twice, the probability that both tosses come up heads is (1/2)(1/2) = 1/4. We multiply the probabilities because we regard the two tosses as independent events.
The formal definition of an independent event is as follows:
1.9 Definition of independent events
If P(AB) = P(A) * P(B), then A and B are independent, written as A∐B
More generally, a set of events {Ai : i ∈ I} is independent if

P(⋂_{i∈J} Ai) = ∏_{i∈J} P(Ai)

for every finite subset J of I.
If A and B are not independent, we write A ∐ B with a slash through the symbol.
Independence can arise in two distinct ways. First, we may explicitly assume it: in two coin tosses, for example, we assume the tosses are independent, reflecting the fact that the coin has no memory of the first toss.
Second, we may verify that P(AB) = P(A)P(B) and conclude that A and B are independent. For example, in one roll of a fair die, let A = {2, 4, 6} and B = {1, 2, 3, 4}. Then A ∩ B = {2, 4}, and P(AB) = 2/6 = 1/3 = (1/2)(2/3) = P(A)P(B), so A and B are independent. Here we did not assume independence; it simply happens to hold.
Suppose A and B are disjoint and each has positive probability. Can they be independent?
No: P(A)P(B) > 0, yet P(AB) = P(∅) = 0.
This example shows that, except in this special case, it is impossible to judge independence by looking at a Venn diagram.
1.10 Example
Toss a fair coin 10 times. Let A = "at least one head", and let Tj be the event that tails occurs on the jth toss. Then

P(A) = 1 − P(A^c) = 1 − P(all tails) = 1 − P(T1 T2 ⋯ T10)
     = 1 − P(T1)P(T2)⋯P(T10)   (by independence)
     = 1 − (1/2)^10 ≈ .999
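The computation in Example 1.10 can be reproduced exactly with rational arithmetic (a minimal sketch; names are mine):

```python
from fractions import Fraction

# P(at least one head in 10 tosses) = 1 − P(all tails).
# Independence lets us multiply: P(T1 T2 ⋯ T10) = (1/2)^10.
p_all_tails = Fraction(1, 2) ** 10
p_at_least_one_head = 1 - p_all_tails
print(p_at_least_one_head)         # 1023/1024
print(float(p_at_least_one_head))  # 0.9990234375
```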
1.11 Example
Two people take turns trying to make a shot. The first succeeds with probability 1/3 on each attempt, and the second with probability 1/4. What is the probability that the first person succeeds before the second?
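One way to answer (this derivation is mine, not a translation of the book's solution): let P be the probability that the first shooter wins. Conditioning on the first round, either the first shooter succeeds at once (probability 1/3), or both miss (probability (2/3)(3/4) = 1/2) and the game restarts, so P = 1/3 + (1/2)P, giving P = 2/3. A sketch that checks this closed form against a Monte Carlo simulation (parameters are mine):

```python
import random
from fractions import Fraction

p1, p2 = Fraction(1, 3), Fraction(1, 4)

# Closed form from conditioning on the first round:
# P = p1 + (1 - p1)(1 - p2) P  =>  P = p1 / (1 - (1-p1)(1-p2))
p_first_wins = p1 / (1 - (1 - p1) * (1 - p2))
print(p_first_wins)  # 2/3

# Monte Carlo check
random.seed(0)
wins, trials = 0, 100_000
for _ in range(trials):
    while True:
        if random.random() < 1 / 3:   # first shooter succeeds
            wins += 1
            break
        if random.random() < 1 / 4:   # second shooter succeeds
            break
print(wins / trials)  # close to 0.667
```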
A summary of independence:
- A and B are independent if P(AB) = P(A)P(B).
- Independence is sometimes assumed and sometimes derived.
- Disjoint events with positive probability are not independent.
1.6 Conditional Probability
If P(B) > 0, we define the probability that A occurs given that B has occurred as follows:
1.12 Definition of conditional probability
If P(B)>0, then, under the condition that B occurs, the conditional probability of A is:
P(A|B) = P(AB)/P(B)
Think of P(A|B) as the fraction of times A occurs among those occasions on which B occurs.
For any fixed B with P(B) > 0, P(·|B) is itself a probability measure, i.e. it satisfies the three axioms: P(A|B) ≥ 0, P(Ω|B) = 1, and if A1, A2, … are disjoint, then

P(⋃_{i=1}^∞ Ai | B) = Σ_{i=1}^∞ P(Ai|B)
However, P(A|B ∪ C) = P(A|B) + P(A|C) does not hold in general: the rules of probability apply to the events on the left of the bar, not to those on the right.
In general P(A|B) ≠ P(B|A) either, and people are often confused by this. For example, the probability of having spots given that you have measles is 1, but the probability of having measles given that you have spots is not 1.
In that example the difference between P(A|B) and P(B|A) is obvious, but there are cases where it is subtle. Confusing the two is common in legal trials and is sometimes called the prosecutor's fallacy.
1.13 Example
Suppose a test for a disease D reports either + or −, with the following joint probabilities:

|  | D | D^c |
|---|---|---|
| + | .009 | .099 |
| − | .001 | .891 |
According to the definition of conditional probability, we have
P(+|D) = P(+ ∩ D)/P(D) = .009/(.009 + .001) = .9
P(−|D^c) = P(− ∩ D^c)/P(D^c) = .891/(.891 + .099) = .9
The test appears accurate: sick people test positive with probability .9, and healthy people test negative with probability .9.
Now suppose you take the test and it comes back positive. What is the probability that you have the disease? Most people answer .90; the correct answer is

P(D|+) = P(D ∩ +)/P(+) = .009/(.009 + .099) ≈ .08
The lesson here is that you need to work out your answers numerically, not through intuition
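The arithmetic in Example 1.13 can be double-checked with exact fractions (a minimal sketch; the dictionary layout is mine):

```python
from fractions import Fraction

# Joint probabilities from the table above, as exact fractions
joint = {("+", "D"): Fraction(9, 1000), ("+", "Dc"): Fraction(99, 1000),
         ("-", "D"): Fraction(1, 1000), ("-", "Dc"): Fraction(891, 1000)}

p_D    = joint[("+", "D")] + joint[("-", "D")]     # P(D)
p_Dc   = joint[("+", "Dc")] + joint[("-", "Dc")]   # P(D^c)
p_plus = joint[("+", "D")] + joint[("+", "Dc")]    # P(+)

print(joint[("+", "D")] / p_D)     # P(+ | D)   = 9/10
print(joint[("-", "Dc")] / p_Dc)   # P(− | D^c) = 9/10
print(joint[("+", "D")] / p_plus)  # P(D | +)   = 1/12 ≈ .08
```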
The following lemma comes directly from the definition of conditional probability
1.14 Lemma
If A and B are independent events, then P(A|B) = P(A). Also, for any pair of events A and B,

P(AB) = P(A|B)P(B) = P(B|A)P(A)

The lemma gives another interpretation of independence: knowing that B has occurred does not change the probability of A. The formula P(AB) = P(A)P(B|A) is particularly useful when computing certain probabilities.
1.15 Example
Draw two cards from a deck, without replacement. Let A be the event that the first draw is the ace of clubs, and let B be the event that the second draw is the queen of diamonds. Then P(AB) = P(A)P(B|A) = (1/52) × (1/51).
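A one-line check of this product with exact rational arithmetic (names are mine):

```python
from fractions import Fraction

# P(AB) = P(A) P(B|A): ace of clubs first, then queen of diamonds
p_A = Fraction(1, 52)          # one specific card out of 52
p_B_given_A = Fraction(1, 51)  # one specific card out of the remaining 51
p_AB = p_A * p_B_given_A
print(p_AB)  # 1/2652
```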
A summary of conditional probability:
- If P(B) > 0, then P(A|B) = P(AB)/P(B).
- For fixed B, P(·|B) satisfies the axioms of probability; for fixed A, P(A|·) in general does not.
- In general, P(A|B) ≠ P(B|A).
- A and B are independent if and only if P(A|B) = P(A).
1.7 Bayes' Theorem
Bayes' theorem is the basis of expert systems and Bayes' nets, which are introduced in Chapter 17. First we need a preliminary result.
1.16 The Law of Total Probability
Let A1, …, Ak be a partition of Ω. Then, for any event B,

P(B) = Σ_{i=1}^k P(B|Ai)P(Ai)

Proof: Define Cj = B ∩ Aj. The Cj are disjoint and B = ⋃_{j=1}^k Cj, so

P(B) = Σ_j P(Cj) = Σ_j P(B ∩ Aj) = Σ_j P(B|Aj)P(Aj)

where the last equality uses the definition of conditional probability, P(B ∩ Aj) = P(B|Aj)P(Aj). ∎
1.17 Bayes' Theorem
Let A1, …, Ak be a partition of Ω with P(Ai) > 0 for each i. If P(B) > 0, then, for each i = 1, …, k,

P(Ai|B) = P(B|Ai)P(Ai) / Σ_{j=1}^k P(B|Aj)P(Aj)

1.18 Remark
We call P(Ai) the prior probability of Ai and P(Ai|B) the posterior probability of Ai.
Proof: Applying the definition of conditional probability twice,

P(Ai|B) = P(Ai ∩ B)/P(B) = P(B|Ai)P(Ai)/P(B)

and applying the law of total probability to the denominator P(B) gives the stated formula. ∎
1.19 Example
I divide my email into three categories: A1 = "spam", A2 = "low priority", A3 = "high priority". From previous experience I find P(A1) = .7, P(A2) = .2, P(A3) = .1.
Of course, .7 + .2 + .1 = 1. Let B be the event that an email contains the word "free". From previous experience, P(B|A1) = .9, P(B|A2) = .01, P(B|A3) = .01.
Note: .9 + .01 + .01 ≠ 1, and there is no reason it should be, since these are conditioned on different events.
Now I receive an email containing the word "free". What is the probability that it is spam?
By Bayes' theorem:

P(A1|B) = (.9 × .7) / ((.9 × .7) + (.01 × .2) + (.01 × .1)) ≈ .995
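The same computation, written out with the law of total probability and Bayes' theorem in exact arithmetic (a minimal sketch; names are mine):

```python
from fractions import Fraction

# Priors and likelihoods from Example 1.19
prior  = {"spam": Fraction(7, 10), "low": Fraction(2, 10), "high": Fraction(1, 10)}
p_free = {"spam": Fraction(9, 10), "low": Fraction(1, 100), "high": Fraction(1, 100)}

# Law of total probability: P(B) = Σ_i P(B|Ai) P(Ai)
p_B = sum(p_free[c] * prior[c] for c in prior)

# Bayes' theorem: P(Ai|B) = P(B|Ai) P(Ai) / P(B)
posterior = {c: p_free[c] * prior[c] / p_B for c in prior}
print(posterior["spam"])         # 210/211
print(float(posterior["spam"]))  # about 0.995
```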
End of Chapter 1.
The bibliographic remarks, appendix, and exercises are not translated.