[Machine Learning Core Summary] What is EM (the Expectation-Maximization Algorithm)?

In real life, an apple is 100% an apple, and a pear is 100% a pear.

But many things in life are better described by probability distributions, such as what share of people are married or what share have jobs.

What if we want to find out what percentage of the population smokes marijuana? It is hard to get truthful answers to sensitive questions, but probability can be used to anonymize the survey. Alongside question 1, "Do you smoke marijuana?", we add question 2, "Does your phone number end in an even digit?" Each participant flips a coin and answers question 1 on heads and question 2 on tails.

The survey is conducted by telephone, so the proportion of phone numbers ending in an even digit is already known. With a large enough sample, the coin flips split the participants almost evenly between questions 1 and 2. Even though we don't know which question any individual answered, we can still easily estimate the proportion of marijuana users in the population. This is the magic of probability.
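The arithmetic behind this estimate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a fair coin, a known even-digit rate of 0.5, and a made-up true rate of 0.2 used to simulate answers. The function names are hypothetical and do not come from the original article.

```python
import random

def estimate_sensitive_rate(yes_fraction, even_rate=0.5, heads_rate=0.5):
    """Solve P(yes) = heads_rate * p + (1 - heads_rate) * even_rate for p."""
    return (yes_fraction - (1 - heads_rate) * even_rate) / heads_rate

def simulate_survey(true_rate, n, even_rate=0.5, seed=0):
    """Simulate n participants and return the overall fraction of 'yes' answers."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        if rng.random() < 0.5:                 # heads: answer question 1
            yes += rng.random() < true_rate
        else:                                  # tails: answer question 2
            yes += rng.random() < even_rate
    return yes / n

# Assumed true rate of 0.2, chosen only for this demonstration.
observed = simulate_survey(true_rate=0.2, n=200_000)
estimate = estimate_sensitive_rate(observed)
print(f"observed yes fraction: {observed:.3f}, estimated rate: {estimate:.3f}")
```

Only the pooled fraction of "yes" answers is needed; no individual's question or answer has to be recorded.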

Now let us change question 2 slightly, replacing the phone number question with an event whose probability is unknown, such as "Do you smoke?" Can we still infer the proportion of marijuana smokers?

The answer is still yes, but this time we change the survey method. We hand the same question to each group of five people and invite them to answer, and we record only their answers, not which question was asked. Anonymity is preserved, and we end up with groups of answers without knowing which question each group belongs to.

This is where the EM algorithm comes in.

Steps of the EM Algorithm

  1. Random initialization. If we don't know which question a group answered, we can't estimate the proportions of smokers and marijuana users; and if we don't know these two proportions, we can't guess which question a group of answers belongs to. To break this circle, start by assigning a random value to each of the two proportions.
  2. Next, use these values to work backwards and estimate how likely it is that each group of answers belongs to each of the two questions. This step estimates the unknown variable, namely the expectation of the question assignment, so it is called the E-step.
  3. Then use these likelihoods to re-estimate the proportions of smokers and marijuana smokers. Because the new values are the ones that make the observed answers most likely, this is called the M-step.
  4. Repeat steps 2 and 3: use the new proportions to re-estimate which question each group of answers belongs to, then use those estimates to update the proportions again. Stop once the estimates have settled at relatively stable values.
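The four steps above can be sketched in Python as EM for a mixture of two binomials. This is a minimal sketch, not the article's own implementation. It assumes each group of five answered question A or question B with equal probability, that only the number of "yes" answers per group is recorded, and that the starting values 0.6 and 0.3 stand in for the random initialization. The function names and the simulated rates 0.1 and 0.4 are hypothetical.

```python
import random
from collections import Counter

def _clamp(p, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid zero likelihoods."""
    return min(max(p, eps), 1 - eps)

def em_two_questions(yes_counts, group_size=5, p_a=0.6, p_b=0.3,
                     max_iters=1000, tol=1e-10):
    """Estimate the two 'yes' rates from per-group yes counts (assumed 50/50 mix)."""
    hist = Counter(yes_counts)          # number of groups with k yes answers
    n_groups = len(yes_counts)
    for _ in range(max_iters):
        groups_a = yes_a = yes_b = 0.0
        for k, n in hist.items():
            # E-step: probability that a group with k yes answers got question A
            like_a = p_a ** k * (1 - p_a) ** (group_size - k)
            like_b = p_b ** k * (1 - p_b) ** (group_size - k)
            w = like_a / (like_a + like_b)
            groups_a += n * w
            yes_a += n * w * k
            yes_b += n * (1 - w) * k
        # M-step: weighted fraction of yes answers attributed to each question
        groups_b = n_groups - groups_a
        new_a = _clamp(yes_a / (group_size * groups_a))
        new_b = _clamp(yes_b / (group_size * groups_b))
        converged = abs(new_a - p_a) < tol and abs(new_b - p_b) < tol
        p_a, p_b = new_a, new_b
        if converged:
            break
    return p_a, p_b

def simulate_groups(rate_a, rate_b, n_groups, group_size=5, seed=0):
    """Simulate groups that all answer the same, unrecorded question."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_groups):
        rate = rate_a if rng.random() < 0.5 else rate_b
        counts.append(sum(rng.random() < rate for _ in range(group_size)))
    return counts

counts = simulate_groups(rate_a=0.1, rate_b=0.4, n_groups=20_000)
low, high = sorted(em_two_questions(counts))
print(f"estimated rates: {low:.3f} and {high:.3f}")
```

EM recovers two rates but cannot say which rate belongs to which question, because the questions were never recorded. That labeling has to come from outside knowledge, which is why the example sorts the two estimates.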

In this way, we obtain approximate proportions of smokers and marijuana smokers in the population. Does this process look familiar? K-means follows the same steps: 1. random initialization, 2. repeated adjustment, 3. gradual convergence. In fact, K-means can be viewed as a special case of the EM algorithm in which each sample is assigned entirely to one cluster instead of being given a probability. In the fruit example, the goal of K-means is to obtain two center coordinates and thereby separate pears and apples into two groups. The EM algorithm goes further and estimates the distribution that the samples follow, so in addition to clustering it helps us recognize more pears and apples.
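To make the comparison concrete, here is a rough one-dimensional K-means sketch with two centers. It is written to mirror the E-step/M-step loop and is not taken from the article. The sample values are made up (think of some measured feature of six fruits), and starting the centers at the smallest and largest value is an assumption made to keep the example deterministic.

```python
def kmeans_1d(values, max_iters=100):
    """Two-center K-means on a list of numbers; returns the two centers."""
    c0, c1 = min(values), max(values)   # assumed deterministic initialization
    for _ in range(max_iters):
        # Hard assignment: each value belongs entirely to its nearest center
        group0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        group1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        if not group0 or not group1:
            break
        # Update: each center moves to the mean of its group
        new0 = sum(group0) / len(group0)
        new1 = sum(group1) / len(group1)
        if (new0, new1) == (c0, c1):
            break
        c0, c1 = new0, new1
    return c0, c1

# Hypothetical measurements for six fruits, two visibly separate groups.
samples = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
center_low, center_high = kmeans_1d(samples)
print(center_low, center_high)
```

The assignment step here is all-or-nothing, whereas the E-step of EM gives each sample a probability of belonging to each group. That difference is why K-means returns only centers, not a full distribution.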

Origin blog.csdn.net/RuanJian_GC/article/details/131544178