Summary of probability and statistics knowledge points in machine learning

Introduction

When learning advanced material, it is crucial to understand the basic concepts. Why? Because basic knowledge is the foundation upon which you build everything else. If you keep stacking more on top of a weak foundation, it will eventually collapse, and you end up unable to fully understand anything you learn. So let's try to understand the fundamentals in depth.

In this series, I will explain probability and statistics in machine learning, drawing mainly from the textbook "Computer Vision: Models, Learning, and Inference" by Dr. Simon J.D. Prince. This textbook is great and to the point. In this series of articles, I will summarize it and add some additional ideas so that you can easily grasp these concepts.

Introduction to Probability and Random Variables

I'm sure you've learned about probability at some point. We also use it unconsciously when making decisions in real life. If you think you have the best chance of succeeding in the decision you're about to make, you'll do it. Otherwise, you won't. This is an interesting area of research, but it can sometimes be tricky. So, in this part of the article, let's review what probability is and introduce the concept of a "random variable".

Let's say you have a card in your hand. And you are about to throw the card to the ground. What is the probability that the card will be face up when lying on the ground?

[Figure: a card being thrown to the ground]

Probabilities are often expressed as % in real life (e.g. 80% chance of rain), but when we deal with probabilities in mathematics we usually express them in decimals (e.g. 0.5 for 50%). Probability measures the likelihood of an event occurring, expressed as a number between 0 and 1.

Random Variable (RV): a variable that represents the outcome of interest. Coming back to the card-throwing example above, the random variable x represents the state of a card. It has only 2 states: face up (x = 1) or face down (x = 0). If you're familiar with programming, it's like any other variable you would use; the RV can take a new state every time an event occurs. So if you throw a card 5 times, x might be [0, 0, 1, 0, 1]. Since the card landed face down (x = 0) 3 times out of 5 trials, the (empirical) probability that a card is face down is Pr(x=0) = 3/5 = 0.6. On the other hand, the card landed face up (x = 1) 2 times out of 5, so Pr(x=1) = 2/5 = 0.4. Note that the probabilities of all states (x = 0 and x = 1) sum to 1, which is always the case when you sum over all possible states.
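To make this concrete, here is a minimal Python sketch that computes these empirical probabilities from the hypothetical 5-throw sequence above:

```python
from collections import Counter

# Hypothetical outcomes of 5 card throws: 1 = face up, 0 = face down
outcomes = [0, 0, 1, 0, 1]

counts = Counter(outcomes)  # {0: 3, 1: 2}
probs = {state: n / len(outcomes) for state, n in counts.items()}

print(probs[0])             # Pr(x=0) = 3/5 = 0.6
print(probs[1])             # Pr(x=1) = 2/5 = 0.4
print(sum(probs.values()))  # sums to 1 over all states
```

Note that these are frequencies from a small sample, not the card's true probabilities; with more throws the estimates would stabilize.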

Introduction to Probability Density Functions (PDF)

Now you know what probability and random variables are. Let’s talk about the probability density function (often abbreviated as PDF, not a file type you often use!).

Remember when I showed you that the sum over all states is always 1? It's important to keep this in mind, because it's a property we will rely on repeatedly.

If I were to visualize an example of throwing cards in a bar chart, it would look like this:

[Figure: bar chart with Pr(x=0) = 0.6 and Pr(x=1) = 0.4]

This is the discrete version of the probability density function, called a probability mass function (PMF). If we assume the width of each bar is 1, then the area of the bar for Pr(x=0) is 0.6 * 1 = 0.6, and likewise the area for Pr(x=1) is 0.4. Note that the sum of all areas is always 1. The same applies in the continuous case, where the sum becomes an integral.

Let us review here what continuous and discrete probability distribution functions are.

[Figure: the same distribution shown as a continuous curve and as discrete bars]


Both plots represent the same distribution. The only difference is whether it is continuous or discrete. In data science, especially when we work with data through programming, you are more likely to be dealing with discrete data that has multiple rows and columns, with each cell containing one data point.

Introduction to Joint Probability

We discussed probability and probability density functions, as well as continuous and discrete data representations. Now let's dig a little deeper into probability and talk about "joint probability".

Let's look at a simple example. Suppose you have 2 random variables x and y, where x represents whether it rains and y represents whether you have an umbrella. Assume you know the probability of each event:

[Figure: table of individual probabilities for rain (x) and umbrella (y)]

For now, treat these two events as independent of each other. But we want to know the probability of them happening simultaneously. This is where "joint probability" comes into play.

Let's give an example. What is the probability that it rains and you have an umbrella? (Thank goodness you have one!) This is case 1: the joint probability Pr(x=1, y=1), where x=1 means it rains and y=1 means you have an umbrella. Case 2 is the worst-case scenario: it rains and you didn't bring an umbrella. That joint probability is Pr(x=1, y=0).

[Figure: joint probability table Pr(x, y) for the rain/umbrella combinations]

So through the example above, I hope you have some idea of what joint probability is. In more general terms, joint probability is how likely it is that two (or more!) events occur together at the same point in time.

To give you another perspective, let's try to visualize joint probability. Suppose you have 2 random variables (x and y) and want an intuitive picture of their joint probability. It might look like the example below.

[Figure: contour plot of a joint density Pr(x, y)]

Think of it like a contour map. The darker areas (closer to black) are lower, and the lighter the color (closer to yellow), the higher the altitude. Essentially, you are reading a 3D surface from a 2D map.

So, what does this tell us when it comes to joint probabilities? One important thing is that the entire area enclosed by the black box is always equal to 1. Remember when we discussed the probability density function (PDF)? The area must always be 1, and it's the same here. Since it is probability, the sum of all possible joint probabilities here must be 1. Another aspect is height. You can see there's a yellowish section near the center, slightly to the right. This is where the joint probability of Pr(x, y) is highest, meaning that a particular pair of x and y yields a higher probability for a particular event.
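As a sketch, we can store a discrete joint distribution in a Python dict keyed by (x, y) pairs. The numbers below are invented purely for illustration, but any valid joint distribution must sum to 1 over all pairs:

```python
# Hypothetical joint probabilities for rain (x) and umbrella (y).
# These values are made up for illustration; a real table would come
# from data or from modeling assumptions.
joint = {
    (0, 0): 0.35,  # no rain, no umbrella
    (0, 1): 0.15,  # no rain, umbrella
    (1, 0): 0.10,  # rain, no umbrella (the worst case!)
    (1, 1): 0.40,  # rain, umbrella
}

# Every valid joint distribution sums to 1 over all (x, y) pairs.
print(sum(joint.values()))
```

This is the discrete analogue of the "total area equals 1" property of the contour plot.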


Introduction to Marginal Probability

Okay, now you know joint probability. Let's talk about "marginalization." Marginalization is a way of going from joint probabilities back to the ordinary single-variable probabilities we started with. Assume you only have the joint probabilities:

[Figure: joint probability table Pr(x, y)]

What we are interested in now are the individual probabilities, such as Pr(x=1) or Pr(y=0). How do we calculate them from these joint probabilities?

It's easy. Here are the formulas for the continuous and discrete cases:

Discrete case: Pr(x) = Σ_y Pr(x, y)

Continuous case: Pr(x) = ∫ Pr(x, y) dy

We simply sum (or integrate) over all possible states of the random variable we are not interested in. Let's see an example to understand this. Suppose you want Pr(x=0), but you only have the joint probabilities Pr(x=0, y=0) and Pr(x=0, y=1). To get Pr(x=0), you can perform marginalization:

Pr(x=0) = Pr(x=0, y=0) + Pr(x=0, y=1)

So what is the intuition behind this calculation? Marginalization considers all possible scenarios for the state you are interested in. Say you want Pr(x=0), but you only have the joint probabilities Pr(x=0, y=0) and Pr(x=0, y=1). If you think about it, for the particular state we're interested in (x=0), there are only two cases: either y=0 or y=1. That's why, by adding Pr(x=0, y=0) and Pr(x=0, y=1), we cover every possible case involving x=0 and thus recover its probability.
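Here is a minimal Python sketch of marginalization (the joint values are made up for illustration); it just sums out the variable we don't care about:

```python
# Hypothetical joint probabilities Pr(x, y), invented for illustration.
joint = {
    (0, 0): 0.35,
    (0, 1): 0.15,
    (1, 0): 0.10,
    (1, 1): 0.40,
}

# Marginalize out y: Pr(x) = sum over y of Pr(x, y)
pr_x = {}
for (x, y), p in joint.items():
    pr_x[x] = pr_x.get(x, 0.0) + p

print(pr_x[0])  # Pr(x=0) = 0.35 + 0.15 = 0.5
print(pr_x[1])  # Pr(x=1) = 0.10 + 0.40 = 0.5
```

The marginals themselves again sum to 1, as any valid distribution must.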

Introduction to Conditional Probability

Now the topic is going to get a little more complicated, but at the same time more interesting! Let’s talk about a very important concept first, called “conditional probability”.

Let's go back to the rain and umbrella example. Previously, we treated the two events as independent, meaning that one did not affect the other. But if there is (hypothetically) a chance of rain in the afternoon, your chances of bringing an umbrella increase. This is where conditional probability comes into play.

[Figure: rain and umbrella, now with one event influencing the other]

So the conditional probability looks like this:

Pr(x | y = y*) = Pr(x, y = y*) / Pr(y = y*)

What does this equation mean?

Going back to the joint probability contour plot, we can see why this equation makes sense. I've explained that the joint probability looks like the contour plot on the left. Conditional probability restricts it to a given condition (y = y*). In this case, you are not considering all possible y's, but only the specific y = y* on which to base your decision (for example, whether to bring an umbrella).

[Figure: a slice through the contour plot at y = y*]

Given y = y*, we can simply take a slice from the contour plot and use it as our conditional probability. But there is one problem: the area of the slice is not 1. Why? Remember when I explained earlier that the sum of all possible joint probabilities over this region must be 1? Slicing won't give you 1, because you're ignoring all the other possible joint probability cases. Therefore, to make this slice a PDF, we need to normalize it so that its area becomes 1.

Area = ∫ Pr(x, y = y*) dx

In the continuous case, the area is expressed by the integral above. Using it, we can calculate the conditional probability like this:

Pr(x | y = y*) = Pr(x, y = y*) / ∫ Pr(x, y = y*) dx

So that's why the simplified form of conditional probability looks like this:

Pr(x | y) = Pr(x, y) / Pr(y)
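As a quick numerical sketch (using made-up joint values again), conditioning is just dividing the joint probability by the marginal probability of the condition:

```python
# Hypothetical joint probabilities Pr(x, y), invented for illustration.
joint = {
    (0, 0): 0.35,
    (0, 1): 0.15,
    (1, 0): 0.10,
    (1, 1): 0.40,
}

# Marginal Pr(y=1) = Pr(x=0, y=1) + Pr(x=1, y=1)
pr_y1 = joint[(0, 1)] + joint[(1, 1)]  # 0.55

# Conditional: Pr(x | y=1) = Pr(x, y=1) / Pr(y=1)
cond = {x: joint[(x, 1)] / pr_y1 for x in (0, 1)}

print(cond[1])             # Pr(x=1 | y=1) = 0.40 / 0.55, about 0.727
print(sum(cond.values()))  # the normalized slice sums to 1
```

Dividing by Pr(y=1) is exactly the normalization step: it rescales the slice so it becomes a valid distribution.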


Introduction to Bayes' Rule

The last topic of today’s article is another very important concept called “Bayes’ Rule”. I'm pretty sure you've heard it before. Let’s take a closer look to understand this concept.

The formula itself is simple. But what does this formula mean, and how do we get it from what we already know?

Pr(y | x) = Pr(x | y) Pr(y) / Pr(x)

First, let's try to understand the formula itself. So the well-known formula above can be rewritten based on what we have already learned.

Pr(y | x) = Pr(x | y) Pr(y) / Σ_y Pr(x | y) Pr(y)   (with an integral over y in the continuous case)

Doesn't this look familiar? Yes! It is just like the conditional probability formula: we normalize so that the result is a valid distribution, a PDF whose total is 1.

So, to break down Bayes' rule: it is a way of calculating the "posterior" from 3 terms: the "likelihood," the "prior," and the "evidence."

Posterior: What we know about y after observing x

Likelihood: The tendency to observe a particular value of x given a particular value of y

Prior: What we know about y before observing x

Evidence: A normalizing constant that ensures the left-hand side is a valid distribution

This is useful because even when the posterior is difficult to compute directly, it is often easy to compute via the likelihood, prior, and evidence. The same goes for the other terms.

[Figure: Bayes' rule with the posterior, likelihood, prior, and evidence labeled]

I hope you understand the formula now. Let's see how to derive this formula from what we already know.

Pr(x, y) = Pr(y | x) Pr(x) = Pr(x | y) Pr(y)

⇒ Pr(y | x) = Pr(x | y) Pr(y) / Pr(x)

Basically, we can derive Bayes' theorem from the definition of conditional probability. This is an important concept, be sure to take some time to understand it!
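Here is a small numerical sketch of Bayes' rule with invented numbers: let y be whether it rains and x be whether we see a passer-by carrying an umbrella. Observing an umbrella should raise our belief in rain:

```python
# Invented numbers for illustration: y = rain (0/1), x = umbrella seen (0/1).
prior = {0: 0.7, 1: 0.3}       # Pr(y): prior belief about rain
likelihood = {0: 0.2, 1: 0.9}  # Pr(x=1 | y): umbrella seen given rain state

# Evidence: Pr(x=1) = sum over y of Pr(x=1 | y) * Pr(y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Posterior: Pr(y | x=1) = Pr(x=1 | y) * Pr(y) / Pr(x=1)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

print(posterior[1])             # belief in rain rises after seeing an umbrella
print(sum(posterior.values()))  # the posterior is a valid distribution
```

The division by the evidence is the same normalization we saw for conditional probability: it makes the posterior sum to 1.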

Introduction to Independence

Let's talk about "independence" first. Remember the joint probability we discussed earlier? Simply put, joint probability is the probability that two or more events occur simultaneously. If the events are independent, the joint probability is simply the product of each event's probability. And do you remember conditional probability, where the probability of an event depends on another event? If the events are independent, the conditional probability is no longer affected by the condition; it simply becomes the event's own probability.

Pr(x, y) = Pr(x) Pr(y)

Pr(x | y) = Pr(x),   Pr(y | x) = Pr(y)

For example, suppose you want to know the probability that Bob wins the tennis competition and Amanda wins the speech contest. You are betting with your friends on whether they will win or lose (don't do this in real life!). Both competitions are held at the same time on the same day. According to his match statistics, Bob's chance of winning is 0.35 (= 35%). Amanda is an excellent speaker and has won 5 contests in a row, so let's assume her chance is about 0.95 (= 95%). These events are independent in that one does not affect the other. So if you bet on both of them winning, the joint probability is 0.35 * 0.95 = 0.3325 (= 33.25%). On the other hand, if you bet that Bob loses and Amanda wins, the joint probability is (1 - 0.35) * 0.95 = 0.6175 (= 61.75%). So you would decide to place a bet on this combination…
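The betting example above is a direct application of the product rule for independent events; sketched in Python:

```python
# Probabilities from the example: the events are independent,
# so the joint probability is just the product.
p_bob_wins = 0.35
p_amanda_wins = 0.95

p_both_win = p_bob_wins * p_amanda_wins                     # 0.35 * 0.95
p_bob_loses_amanda_wins = (1 - p_bob_wins) * p_amanda_wins  # 0.65 * 0.95

print(p_both_win)               # about 0.3325
print(p_bob_loses_amanda_wins)  # about 0.6175
```

If the events were not independent, we would instead need the conditional probability: Pr(x, y) = Pr(x | y) Pr(y).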

Introduction to Expectation

Expected value or expectation is one of the most important fundamental concepts in probability and statistics, so spend some time and effort understanding this part!

If you want to think about it in a simple way, the expected value (or expectation) is just the mean, i.e. the average.

As a simple example, consider a random variable X and its corresponding probability P. The expected value can be expressed as follows:

E[X] = Σ_i x_i Pr(X = x_i)

In general, the expected value is the sum over all possible values of the random variable, each weighted by its probability (or, in the continuous case, an integral weighted by the probability density).

For convenience, we usually use the Greek character μ (mu) to write the expected value.

μ = E[X]

Well, it might not be clear yet, so let me give you a concrete example.

[Figure: a worked example of an expected-value calculation]

I hope you now understand the concept of expectation. Let's take a look at its properties. Properties are rules that a concept obeys, and understanding them often makes more advanced concepts (such as variance, covariance, skewness, etc.) easier to grasp.

E[k] = k for any constant k

E[kX] = k E[X]

E[X + Y] = E[X] + E[Y]

E[XY] = E[X] E[Y] if X and Y are independent
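A fair six-sided die gives an easy check of both the definition of expectation and the linearity properties (this example is my own, not from the text):

```python
# Fair six-sided die: each face has probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# Definition: E[X] = sum of x * Pr(X = x)
mu = sum(x * p for x, p in zip(faces, probs))
print(mu)  # about 3.5

# Linearity: E[aX + b] = a * E[X] + b
a, b = 2, 1
lhs = sum((a * x + b) * p for x, p in zip(faces, probs))
print(lhs)  # about 8.0, which equals a * mu + b
```

Computing E[aX + b] both directly and via a * E[X] + b gives the same answer, which is exactly what linearity promises.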

Introduction to variance and covariance

After looking at expectation, it's time to move on to the next exciting topic: variance! It is often abbreviated as "Var" or "var" in code. Variance is a measure that tells us how spread out our data is around the mean. The equation looks like this:

Var[X] = E[(X − μ)²] = E[X²] − μ²

Using the properties of expectation we just learned, the equation can be rewritten in a different form. The first form says that the variance is obtained by subtracting the mean from the random variable, squaring the result (so that negative and positive deviations don't cancel), and then taking the expected value.

Now comes "covariance" (often abbreviated to "Cov" or "cov" when writing code). Covariance is actually another very important concept. It will come up from time to time whenever you study more advanced topics. It basically tells how two random variables move together. If one random variable increases its value and the other random variable also increases its value, the covariance is high. If one random variable increases but the other does not change, the covariance is low. It might be hard to understand from these words, so I'll show you the equation first and let me imagine it.

Cov[X, Y] = E[(X − μ_X)(Y − μ_Y)] = E[XY] − E[X] E[Y]

The first expression is the defining equation for covariance. One interesting consequence (using the expectation properties we discussed) is that covariance can be rewritten in the second form, which shows that the covariance is 0 if the random variables X and Y are independent, since then E[XY] = E[X]E[Y].
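Here is a minimal sketch (with toy data of my own) computing variance and covariance straight from their definitions, and checking the identity Var[X] = E[X²] − E[X]²:

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    # Var[X] = E[(X - mu)^2], computed over the sample
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])

def covariance(xs, ys):
    # Cov[X, Y] = E[(X - mu_x) * (Y - mu_y)], computed over the sample
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y = 2x, so y moves perfectly with x

print(variance(x))       # 2.0
print(covariance(x, y))  # 4.0 (positive: x and y move together)

# Identity from the expectation properties: Var[X] = E[X^2] - E[X]^2
print(mean([v ** 2 for v in x]) - mean(x) ** 2)  # 2.0 again
```

Because y is just 2x, the covariance is positive; if y moved opposite to x it would be negative.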

Now let me visualize the covariance so you can better understand what it represents!

[Figure: scatter plots illustrating positive, negative, and near-zero covariance]

Introduction to Standard Deviation

Now let's talk about another important statistical measure called "standard deviation". It is sometimes abbreviated as "Std" or "std" in code. Standard deviation is very similar to variance in that it also quantifies the amount of variation in the data. As an equation, it looks like this:

σ = √(Var[X]) = √(E[(X − μ)²])

One thing to keep in mind is using this measure on non-normal distributions. It still works, but some properties that hold for normally distributed data no longer apply. For example, if your data is normally distributed, one standard deviation around the mean covers approximately 68% of the data, two standard deviations cover about 95%, and three standard deviations cover about 99.7%. This is often referred to as the 68–95–99.7 rule, and this useful property only applies to the normal distribution.

[Figure: normal curve with the 68%, 95%, and 99.7% regions shaded]
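We can sanity-check the 68–95–99.7 rule empirically by sampling from a normal distribution; a sketch using only the standard library:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible
data = [random.gauss(0, 1) for _ in range(100_000)]

mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

fracs = {}
for k in (1, 2, 3):
    # fraction of samples within k standard deviations of the mean
    fracs[k] = sum(abs(d - mu) <= k * sigma for d in data) / len(data)
    print(k, round(fracs[k], 3))  # roughly 0.683, 0.954, 0.997
```

On non-normal data the same computation runs fine, but the 68/95/99.7 fractions no longer hold, which is exactly the caveat above.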

Introduction to Skewness and Kurtosis

In contrast to the concepts explained above, you may not have heard of "skewness" or "kurtosis." But if you think of these statistics as k-th standardized moments, it becomes clear that skewness and kurtosis are just further statistical measures in the same family.

Skewness (3rd standardized moment): E[((X − μ) / σ)³]

Kurtosis (4th standardized moment): E[((X − μ) / σ)⁴]

Here, "σ" is the standard deviation.

First, let’s talk about “skewness”. Simply put, skewness is a statistical measure that tells us how skewed a distribution is to the right or left. Let me give you a visual example.

[Figure: negatively and positively skewed distributions]

You can remember what skewness means by the direction of the longer tail. As shown by the blue arrow, if the longer tail tends to the negative direction, it has negative skewness and vice versa.

Now about "kurtosis". Kurtosis is a measure of the sharpness of a distribution. The higher the kurtosis, the sharper the peak.

[Figure: distributions with low and high kurtosis]
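Here is a sketch (with toy data of my own) computing skewness and kurtosis as the 3rd and 4th standardized moments over a sample:

```python
def standardized_moment(data, k):
    """k-th standardized moment over the sample: E[((X - mu) / sigma)^k]."""
    n = len(data)
    mu = sum(data) / n
    sigma = (sum((x - mu) ** 2 for x in data) / n) ** 0.5
    return sum(((x - mu) / sigma) ** k for x in data) / n

symmetric = [-2, -1, 0, 1, 2]
right_tailed = [1, 1, 1, 2, 10]  # long tail on the right

print(standardized_moment(symmetric, 3))     # ~0: no skew
print(standardized_moment(right_tailed, 3))  # > 0: positive (right) skew
print(standardized_moment(right_tailed, 4))  # kurtosis, inflated by the long tail
```

The symmetric sample has (near-)zero skewness, while the sample with a long right tail has positive skewness and a larger 4th moment, matching the figures above.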

Summary

Probability: A measure of the likelihood of an event occurring, expressed as a number between 0 and 1.

Random Variable (RV): A variable that represents the possible values of a random outcome, used just like any other variable.

Probability Density Function (PDF): A function expressing the density of a continuous random variable, where the area under the curve gives the probability.

Joint probability: The probability that two or more random variables take particular values simultaneously.

Marginalization: A method of summing (or integrating) out the random variables you are not interested in, to obtain probabilities over fewer random variables.

Conditional probability: The probability of an event given (by assumption, presumption, assertion, or evidence) that another event has occurred.

Bayes' rule: A rule that relates the posterior probability to the likelihood, prior, and evidence.

Independence: If two random variables are independent, their joint probability is the product of their individual probabilities. Likewise, the conditional probability reduces to the event's own probability, regardless of the condition.

Expectation: One of the most important statistical measures. You can think of it as an average of the random variable's values weighted by their probabilities.

Standard deviation: A measure used to quantify the variability of the underlying distribution of the data you are looking at.

Variance: Another way of quantifying the variability of the underlying distribution (the square of the standard deviation).

Skewness: How skewed the distribution is to one side (right or left).

Kurtosis: A measure of how sharp the peak of a data distribution is.


Reprinted from: blog.csdn.net/qq_39312146/article/details/134651139