What is EM (the Expectation-Maximization Algorithm)
In real life, an apple is 100% an apple, and a pear is 100% a pear.
But many things in life are a matter of probability, such as how many people are married or how many people have jobs.
What if we want to find out what percentage of the population smokes marijuana? It is difficult to get honest answers to sensitive questions. Here, probability can be used to anonymize the survey. In addition to question 1, "Do you smoke marijuana?", a second question is asked: "Does your phone number end in an even digit?" Each participant flips a coin in private and answers question 1 for heads and question 2 for tails, without revealing which question was answered.
The survey is conducted by telephone, so the proportion of phone numbers ending in an even digit is already known. As long as the sample is large enough, the coin flips make the numbers of people answering questions 1 and 2 nearly equal. Even though we don't know which question any individual answered, we can still easily estimate the proportion of marijuana users in the population: the overall share of "yes" answers is roughly the average of the two proportions, so the marijuana proportion is about twice the "yes" share minus the known even-digit proportion. This is the magic of probability.
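The coin-flip survey can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the true marijuana rate (0.12), the even-digit rate (0.5), the sample size, and the function names `simulate_survey` and `estimate_marijuana_rate` are all made up for the example and do not come from any real survey.

```python
import random

def simulate_survey(n, p_marijuana, p_even, seed=0):
    """Simulate n anonymous answers and return the number of "yes" answers."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        if rng.random() < 0.5:
            # Heads: the participant answers question 1 (marijuana)
            yes += rng.random() < p_marijuana
        else:
            # Tails: the participant answers question 2 (even last digit)
            yes += rng.random() < p_even
    return yes

def estimate_marijuana_rate(yes, n, p_even):
    # P(yes) = 0.5 * p_marijuana + 0.5 * p_even, solved for p_marijuana
    return 2 * yes / n - p_even

n = 100_000                      # hypothetical sample size
yes = simulate_survey(n, p_marijuana=0.12, p_even=0.5)
estimate = estimate_marijuana_rate(yes, n, p_even=0.5)
print(round(estimate, 3))        # should land close to the assumed 0.12
```

Only the total number of "yes" answers is used, so no individual answer can be tied to a question.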
Now let us change question 2 slightly, replacing the phone-number question with an event of unknown probability, such as "Do you smoke?" Can we still infer the proportion of marijuana smokers?
The answer is still yes, but this time we change the survey method. Each group of five people is given the same question, and we record only their answers, not which question they were asked. Anonymity is preserved, and we end up with groups of answers without knowing which question each group belongs to.
This is where the EM algorithm comes in.
Steps of the EM Algorithm
- Random initialization. If we don't know which question a group of answers belongs to, we can't estimate the proportions of smokers and marijuana users; and if we don't know these two proportions, we can't guess which question a group of answers belongs to. To break this circle, we first assign a random value to each of the two proportions.
- Next, use these values to work backwards and estimate how likely it is that each group of answers belongs to each of the two questions. This step estimates the unknown variable, namely the expectation of which question a group belongs to, so it is called the E-step.
- Then use these likelihoods to re-estimate the proportions of smokers and marijuana smokers. Since this step picks the proportions that make the recorded answers most likely, that is, it maximizes the likelihood, it is called the M-step.
- Next, repeat the E-step and the M-step: use the new proportions to estimate how likely each group of answers is to belong to each question, then use those likelihoods to re-estimate the proportions. Stop once the estimates become relatively stable.
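The steps above can be sketched as a small Python program. It is a minimal illustration, not a definitive implementation, and it rests on assumptions that are not in the text: each group is taken to be equally likely to have received either question, the "true" rates (0.4 and 0.1) and the number of groups are invented for the simulation, the starting values are fixed rather than random so the run is reproducible, and the name `em_two_questions` is hypothetical.

```python
import random
from collections import Counter

def em_two_questions(yes_counts, group_size, p_a, p_b, n_iter=1000, tol=1e-10):
    """EM for groups of answers that each come from one of two unknown questions.

    yes_counts[i] is the number of "yes" answers in group i.
    p_a and p_b are the initial guesses for the two "yes" rates.
    Assumption: each group got question A or B with equal probability.
    """
    counts = Counter(yes_counts)   # groups with the same "yes" count behave identically
    n_groups = len(yes_counts)
    for _ in range(n_iter):
        # E-step: probability that a group with k "yes" answers got question A
        # (the binomial coefficient is the same for A and B, so it cancels)
        resp_a = {}
        for k in counts:
            like_a = p_a ** k * (1 - p_a) ** (group_size - k)
            like_b = p_b ** k * (1 - p_b) ** (group_size - k)
            resp_a[k] = like_a / (like_a + like_b)
        # M-step: re-estimate each rate as a weighted share of "yes" answers
        weight_a = sum(resp_a[k] * c for k, c in counts.items())
        weight_b = n_groups - weight_a
        yes_a = sum(resp_a[k] * c * k for k, c in counts.items())
        yes_b = sum((1 - resp_a[k]) * c * k for k, c in counts.items())
        new_a = yes_a / (weight_a * group_size)
        new_b = yes_b / (weight_b * group_size)
        done = abs(new_a - p_a) < tol and abs(new_b - p_b) < tol
        p_a, p_b = new_a, new_b
        if done:                   # estimates are stable, stop iterating
            break
    return p_a, p_b

# Simulated data: hypothetical rates, 20000 groups of five people each
rng = random.Random(0)
true_rates = (0.4, 0.1)
group_size = 5
yes_counts = []
for _ in range(20000):
    p = rng.choice(true_rates)     # the question this group received (not recorded)
    yes_counts.append(sum(rng.random() < p for _ in range(group_size)))

p_a, p_b = em_two_questions(yes_counts, group_size, p_a=0.6, p_b=0.4)
print(round(p_a, 3), round(p_b, 3))
```

Note that the procedure only recovers two rates; which of them belongs to "smoking" and which to "marijuana" has to be decided from outside knowledge, since the questions themselves were never recorded.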
In this way, we obtain approximate proportions of smokers and marijuana smokers in the population. Does this process look familiar? K-means follows the same steps: 1. random initialization, 2. repeated adjustment, 3. gradual approximation. In fact, K-means can be viewed as a special case of the EM algorithm in which each sample is assigned entirely to one cluster instead of being given a probability for each. The goal of K-means is to obtain the cluster centers, here two of them, thereby separating pears and apples as two things. The EM algorithm also estimates the distribution that the samples follow, so in addition to clustering it helps us recognize further pears and apples.
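For comparison, here is a minimal two-cluster K-means on one-dimensional data. It is a sketch under assumptions: the fruit weights, the starting centers, and the name `kmeans_1d` are invented for illustration. The loop has the same shape as EM, but the assignment step is hard: each point belongs entirely to its nearest center.

```python
def kmeans_1d(points, c1, c2, n_iter=100):
    """Two-cluster K-means on 1-D points, starting from centers c1 and c2."""
    for _ in range(n_iter):
        # Assignment step: each point goes entirely to its nearest center
        near_1 = [x for x in points if abs(x - c1) <= abs(x - c2)]
        near_2 = [x for x in points if abs(x - c1) > abs(x - c2)]
        if not near_1 or not near_2:
            break                  # a cluster became empty; keep the current centers
        # Update step: move each center to the mean of its points
        new_1 = sum(near_1) / len(near_1)
        new_2 = sum(near_2) / len(near_2)
        if new_1 == c1 and new_2 == c2:
            break                  # centers no longer move
        c1, c2 = new_1, new_2
    return c1, c2

# Hypothetical fruit weights in grams
weights = [150, 160, 155, 170, 210, 220, 205, 215]
c1, c2 = kmeans_1d(weights, c1=150, c2=220)
print(c1, c2)
```

K-means returns only the two centers, whereas the EM sketch for the survey also yields, for every group, a probability of belonging to each question.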