Hands-on Deep Learning - Basics of Probability Theory

Basic Probability Theory

Suppose we roll a die and want to know the odds of seeing a 1 rather than another number. If the die is fair, then all six outcomes {1,…,6} are equally likely to occur, so we can say that 1 occurs with probability 1/6.

In real life, however, for a real die received from the factory, we need to check it for flaws. The only way to check the die is to throw it many times and record the outcomes. For each die, we will observe a value in {1,…,6}. For each value, a natural approach is to divide the number of times it occurs by the total number of throws; this ratio is an estimate of the probability of that event. The law of large numbers tells us that as the number of throws increases, this estimate gets closer and closer to the true underlying probability.
In statistics, we call the process of drawing examples from a probability distribution sampling. Broadly speaking, a distribution can be thought of as an assignment of probabilities to events. A distribution that assigns probabilities to a set of discrete choices is called a multinomial distribution.
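To make this concrete, here is a minimal simulation sketch (NumPy is assumed; the number of throws is chosen arbitrarily): we draw many throws of a fair die from a multinomial distribution and see that the frequency estimates approach 1/6.

```python
import numpy as np

# A fair die assigns probability 1/6 to each of the six outcomes.
fair_probs = np.ones(6) / 6

# Draw 10,000 throws at once; counts[i] is how often face i+1 appeared.
counts = np.random.multinomial(10_000, fair_probs)

# Dividing by the total number of throws gives the estimated probabilities.
estimates = counts / counts.sum()
print(estimates)  # each entry should be close to 1/6 ≈ 0.1667
```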

Axioms of Probability Theory

When dealing with dice rolls, we call the set S = {1, 2, 3, 4, 5, 6} the sample space or outcome space, where each element is an outcome. An event is a set of outcomes from a given sample space. For example, "seeing a 5" ({5}) and "seeing an odd number" ({1, 3, 5}) are both valid events when rolling a die. Note that if the outcome of a random experiment lies in event A, then event A has occurred. That is, if a 3 is rolled, then since 3 ∈ {1, 3, 5}, we can say that the event "see an odd number" has occurred.
Probability can be thought of as a function that maps sets to real values. In a given sample space S, the probability of an event A, denoted P(A), satisfies the following properties:

  • For any event A, its probability is never negative, that is, P(A)≥0;
  • The probability of the entire sample space is 1, that is, P(S)=1;
  • For any countable sequence A1, A2, … of mutually exclusive events (with Ai ∩ Aj = ∅ for all i ≠ j), the probability that any one of them happens equals the sum of their individual probabilities, i.e., $P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$.

With this system of axioms, we can avoid any philosophical arguments about randomness; instead, we can reason rigorously in mathematical language.
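As an illustration of the axioms (a small sketch; the helper `prob` is hypothetical), we can compute event probabilities on the fair-die sample space by counting outcomes and check the properties above:

```python
# Every outcome of a fair die is equally likely, so P(A) = |A| / |S| for any event A ⊆ S.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    return len(event) / len(sample_space)

odd, even = {1, 3, 5}, {2, 4, 6}
print(prob(set()) >= 0)                          # axiom 1: probabilities are non-negative
print(prob(sample_space))                        # axiom 2: P(S) = 1.0
print(prob(odd | even), prob(odd) + prob(even))  # axiom 3: disjoint events add: 1.0 1.0
```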

Random Variables

In our random experiment of rolling a die, we introduced the concept of a random variable. A random variable can be almost any quantity, and it takes one value from a set of possibilities in a random experiment.
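A quick sketch (the names X and Y are hypothetical) shows a random variable as a mapping from an outcome of the experiment to a number:

```python
import random

# One roll of a fair die; a random variable maps this outcome to a number.
outcome = random.choice([1, 2, 3, 4, 5, 6])
X = outcome                        # the face value itself
Y = 1 if outcome % 2 == 1 else 0   # indicator of the event "see an odd number"
print(outcome, X, Y)
```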

Joint Probability

The first is called the joint probability P(A = a, B = b). Given any values a and b, the joint probability answers the question: what is the probability that A = a and B = b hold simultaneously? Note that for any values of a and b, P(A = a, B = b) ≤ P(A = a). This must be the case: for A = a and B = b to happen together, A = a has to happen and B = b also has to happen (and vice versa). Therefore, A = a and B = b occurring together is no more likely than A = a or B = b occurring alone.
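One way to see this inequality (a simulation sketch; the events A = 1, B = 6 and the sample size are arbitrary choices) is to estimate a joint probability and a marginal probability from simulated pairs of rolls:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
A = rng.integers(1, 7, n)  # first die
B = rng.integers(1, 7, n)  # second die

p_joint = np.mean((A == 1) & (B == 6))  # estimate of P(A=1, B=6)
p_a = np.mean(A == 1)                   # estimate of P(A=1)
print(p_joint, p_a, p_joint <= p_a)     # the joint estimate never exceeds the marginal
```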

Conditional Probability

The inequality of the joint probability brings us to an interesting ratio: 0 ≤ P(A = a, B = b) / P(A = a) ≤ 1. We call this ratio the conditional probability and denote it P(B = b ∣ A = a): it is the probability of B = b, given that A = a has occurred.
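Continuing in the same spirit (a sketch; the variable B and the conditioning event are invented purely for illustration), the conditional probability is just the ratio of two estimated probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
A = rng.integers(1, 7, n)
B = A + rng.integers(1, 7, n)  # B is the sum of the first die and a second roll

# P(B=7 | A=1) estimated as P(A=1, B=7) / P(A=1)
p_joint = np.mean((A == 1) & (B == 7))
p_cond = p_joint / np.mean(A == 1)
print(p_cond)  # close to 1/6, since B=7 given A=1 requires the second roll to be 6
```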

Bayes' theorem

Using the definition of conditional probability, we can derive one of the most useful equations in statistics: Bayes' theorem. By the multiplication rule, P(A, B) = P(B ∣ A) P(A). By symmetry, P(A, B) = P(A ∣ B) P(B). Solving for one of the conditional probabilities, and assuming P(B) > 0, we get
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},$$
where P(A, B) is a joint distribution and P(A ∣ B) is a conditional distribution. These distributions can be evaluated at particular values A = a, B = b.
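As a numeric sketch of the formula (all probabilities below are made up solely to exercise Bayes' theorem):

```python
# Hypothetical numbers, chosen only to exercise the formula.
p_b_given_a = 0.9   # P(B | A)
p_a = 0.2           # P(A)
p_b = 0.3           # P(B), assumed positive

p_a_given_b = p_b_given_a * p_a / p_b   # Bayes' theorem
print(p_a_given_b)  # 0.6
```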

Marginalization

In order to sum over event probabilities, we need the sum rule: the probability of B amounts to considering all possible choices of A and summing the joint probabilities over all of them:
$$P(B) = \sum_{A} P(A, B).$$
This is also called marginalization. The probability or distribution resulting from marginalization is called a marginal probability or a marginal distribution.
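A small sketch of the sum rule (the joint table below is hypothetical) sums the joint probabilities over one variable to obtain a marginal:

```python
import numpy as np

# Hypothetical joint distribution P(A, B) over 2 values of A and 3 values of B.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])   # rows: A, columns: B; entries sum to 1

p_b = joint.sum(axis=0)  # sum rule: P(B) = sum over A of P(A, B)
p_a = joint.sum(axis=1)  # likewise, P(A) = sum over B of P(A, B)
print(p_b, p_a)
```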

Independence

Another pair of useful notions is dependence and independence. If two random variables A and B are independent, then the occurrence of an event of A tells us nothing about the occurrence of an event of B. Statisticians usually write this as A ⊥ B. From Bayes' theorem, it immediately follows that P(A ∣ B) = P(A). In all other cases we call A and B dependent. For example, two successive rolls of a die are independent of each other.
Since P(A ∣ B) = P(A, B) / P(B) = P(A) is equivalent to P(A, B) = P(A) P(B), two random variables are independent if and only if their joint distribution is the product of their individual distributions. Likewise, given another random variable C, the random variables A and B are conditionally independent if and only if P(A, B ∣ C) = P(A ∣ C) P(B ∣ C). This is written A ⊥ B ∣ C.
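To check independence empirically (a simulation sketch; the particular events are arbitrary), compare the estimated joint probability with the product of the estimated marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
A = rng.integers(1, 7, n)  # first roll
B = rng.integers(1, 7, n)  # second roll, generated independently

p_joint = np.mean((A == 2) & (B == 5))
p_prod = np.mean(A == 2) * np.mean(B == 5)
print(p_joint, p_prod)  # nearly equal, consistent with A ⊥ B
```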

Expectation and Variance

To summarize the key characteristics of probability distributions, we need some measures. The expectation (or average) of a random variable X is denoted
$$E[X] = \sum_{x} x\, P(X = x).$$
When the input of a function f(x) is a random variable drawn from the distribution P, the expected value of f(x) is
$$E_{x \sim P}[f(x)] = \sum_{x} f(x)\, P(x).$$
In many cases, we want to measure by how much the random variable X deviates from its expectation. This can be quantified by the variance:
$$\mathrm{Var}[X] = E\big[(X - E[X])^2\big] = E[X^2] - E[X]^2.$$
The square root of the variance is called the standard deviation. The variance of a function of a random variable measures how much the function value deviates from the expectation of the function, as different values x are sampled from the distribution of the random variable:
$$\mathrm{Var}[f(x)] = E\big[(f(x) - E[f(x)])^2\big].$$
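As a closing sketch with the fair die as the running example (plain NumPy, no book-specific API assumed), we can compute the expectation and variance exactly from the distribution and compare them with sample estimates:

```python
import numpy as np

values = np.arange(1, 7)
probs = np.full(6, 1 / 6)          # fair die

expectation = np.sum(values * probs)                    # E[X] = 3.5
variance = np.sum((values - expectation) ** 2 * probs)  # Var[X] = 35/12 ≈ 2.92
print(expectation, variance)

samples = np.random.choice(values, size=100_000, p=probs)
print(samples.mean(), samples.var())  # sample estimates agree closely
```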

Origin: blog.csdn.net/qq_52118067/article/details/122783207