Level 16: Naive Bayes, As Simple As It Is Effective (Artificial Intelligence Course, Little Elephant Academy)


 

Overview of this level

 

In this level we look at a new classification algorithm: Naive Bayes. It is an enduring classic in the field of text classification.

 

Whenever you encounter a text classification problem, Naive Bayes should be the first method you think of; it is a very reliable baseline for text classification.

 

Don't look down on it just because it is simple. In practice, a simple and effective model is often exactly what we need.

 

Introduction to the application scenario

 

Let's first look at an example of a spam email.

If you see "link", "purchase", "latest" and other keywords in the email, you can think that this is probably a spam email. In fact, many mail filtering systems filter out spam in this way.

 

So could we simply build a vocabulary of such keywords and mark any email that contains them as spam?

 

This is actually a feasible approach and can serve as a baseline. But here is the problem: an email that contains a few advertising words is not necessarily spam, and an email without any of them is not necessarily normal. Each word carries a tendency, but no single word is decisive.

 

How can we combine the evidence from all these words in a statistical way, so that we can finally predict the probability that an email is spam or normal? The answer is Naive Bayes!

 

Principles of the Naive Bayes algorithm

 

Let's use a concrete example to illustrate how Naive Bayes performs classification.

Suppose we have collected 24 normal emails and 12 spam emails as training data.

 

Using the naive Bayes model generally requires two steps. First, count each word's contribution to an email being spam or normal; for example, P("advertisement" | spam) and P("advertisement" | normal) are the probabilities of the keyword "advertisement" appearing in spam and in normal emails, respectively. Second, use these statistics to make a prediction for a new email.

 

Here, the first step is called the Naive Bayes training process, and the second step is the testing process.

 

For convenience, we assume that each email contains 10 words. Therefore, in the training data, there are 240 words in normal emails and 120 words in spam emails.

 

It is easy to compute the prior probabilities of normal mail and spam: P(normal) = 24/36 = 2/3 and P(spam) = 12/36 = 1/3.

Next, we separately count the frequency of each word in normal emails and spam emails:

So far we have calculated the probability of each word in different categories and the respective proportions of spam and normal emails.
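The actual word counts appear in a course figure that is not reproduced in this text, but the computation itself is easy to sketch. Below is a minimal Python sketch of the training step, using a tiny made-up corpus (the emails and counts are assumptions for illustration, not the course's data):

```python
from collections import Counter

# Toy training data: each email is a list of words with a "spam" / "normal" label.
# The emails below are invented for illustration; the course uses 24 normal and 12 spam emails.
emails = [
    (["latest", "make money", "method", "link", "purchase"], "spam"),
    (["meeting", "schedule", "project", "report", "deadline"], "normal"),
    (["purchase", "latest", "discount", "link", "offer"], "spam"),
    (["lunch", "project", "meeting", "notes", "schedule"], "normal"),
]

# Prior probabilities P(spam) and P(normal): just the class proportions.
labels = [label for _, label in emails]
priors = {c: labels.count(c) / len(labels) for c in set(labels)}

# Word counts per class, used to estimate the conditional probabilities P(word | class).
word_counts = {c: Counter() for c in priors}
total_words = dict.fromkeys(priors, 0)
for words, label in emails:
    word_counts[label].update(words)
    total_words[label] += len(words)

def p_word_given_class(word, c):
    """P(word | class), estimated by simple counting (no smoothing yet)."""
    return word_counts[c][word] / total_words[c]

print(priors)                                  # e.g. {'spam': 0.5, 'normal': 0.5}
print(p_word_given_class("latest", "spam"))    # 2 / 10 = 0.2
```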

 

After counting these probabilities, how do we use this information to make predictions?

 

Bayes' Theorem

 

Here we need a very famous theorem: Bayes' theorem. We will use it to further decompose the conditional probability above and finally obtain the prediction.

 

Before introducing Bayes' theorem, let's first review the multiplication formula of probability: P(A, B) = P(A | B) P(B) = P(B | A) P(A).

 

Rearranging the multiplication formula gives: P(B | A) = P(A | B) P(B) / P(A).

 

This is the Bayes formula. Its value lies in the fact that when the quantity we want, P(B | A), is hard to obtain directly, the formula lets us express it in terms of P(A | B), P(B), and P(A), which are much easier to estimate.

 

For example, let event A be the text content of a given email, and let event B be "the email is spam". The value we want is P(B | A): the probability that the email is spam given its content.

 

By Bayes' theorem, we convert this into P(A | B) and P(B), two probabilities that we have already obtained above by simple counting.
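Written out explicitly for the email example, the conversion reads:

```latex
P(\text{spam} \mid \text{email content})
  = \frac{P(\text{email content} \mid \text{spam}) \, P(\text{spam})}
         {P(\text{email content})}
```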

 

After understanding the Bayes theorem, the remaining steps are relatively clear. Next, we use a complete example to demonstrate the use of the naive Bayes algorithm to predict whether an email is spam.

 

The first step: the training process

 

The training process of the naive Bayes algorithm really just amounts to some simple counting. Suppose we have computed the probability values we need from the data set as follows:

When the statistics are completed, the training is completed.

 

Step 2: the prediction process

 

Suppose we now receive a new email. For convenience, assume its content is very simple, containing only three words: "latest", "make money", "method". Is this email spam? Let's use the training results to do the math.

 

Probability that the email is spam: P(spam | "latest make money method")

Probability that the email is normal: P(normal | "latest make money method")

By comparing these two probability values, we can decide whether the email is spam, and the classification task is complete.

 

Let's analyze the denominator first. After applying Bayes' theorem, both probability values share the same denominator: once the content of the email is fixed, P("latest make money method") is a constant, so it can be ignored in the comparison. Now for the numerator: how do we compute P("latest make money method" | spam)?

 

p("the latest way to make money"|spam) can be abstracted into , the calculation of this probability is actually not difficult in mathematics, it is a simple multiplication formula:

The conditional independence assumption

 

However, in practical applications it is very hard to estimate the probabilities on the right-hand side of that equation. To simplify the calculation, we introduce the conditional independence assumption: given the class, the words are assumed to be independent of one another, so the formula simplifies to: P(x_1, x_2, x_3 | spam) = P(x_1 | spam) · P(x_2 | spam) · P(x_3 | spam).

In the previous example, the conditional independence assumption means assuming that the words "latest", "make money", and "method" are independent of each other given the class. In other words, the model ignores the order of the words and only considers how often each word appears.

 

Obviously, this is not a rigorous or fully reasonable assumption, which is exactly why the algorithm is called "naive". Nevertheless, it achieves very good results in practice and has become a classic algorithm in the field of text classification.

 

With this groundwork laid, let's finish the earlier example:

Now all the probability values have been obtained. Substituting them into the formula and comparing the two results:

Therefore, according to the naive Bayes algorithm, we judge the email "latest make money method" to be spam.
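The actual probability table lives in the course's figures, so the numbers in this minimal sketch are placeholders (all values below are made up); it only illustrates how the final naive Bayes comparison works:

```python
# Placeholder probabilities; the course's real values are shown in its figures.
p_spam, p_normal = 1 / 3, 2 / 3
p_word_given_spam = {"latest": 0.10, "make money": 0.15, "method": 0.05}
p_word_given_normal = {"latest": 0.02, "make money": 0.01, "method": 0.03}

email = ["latest", "make money", "method"]

# Naive Bayes score: prior times the product of per-word conditional probabilities.
# The shared denominator P(email content) is the same for both classes, so we drop it.
score_spam, score_normal = p_spam, p_normal
for word in email:
    score_spam *= p_word_given_spam[word]
    score_normal *= p_word_given_normal[word]

prediction = "spam" if score_spam > score_normal else "normal"
print(score_spam, score_normal, prediction)   # the larger score wins
```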

 

Summary

 

After learning this, let's summarize the two important knowledge points we have learned so far:

 

  • Bayes' formula: P(B | A) = P(A | B) P(B) / P(A)

Here, P(B) is the proportion of positive and negative samples counted from the data set, also known as the prior probability; P(B | A) is the classification probability we ultimately want, also known as the posterior probability.

 

  • Conditional independence assumption:

We assume that all words are conditionally independent of each other given the class, which turns a complex conditional probability into a simple product of probabilities: P(x_1, x_2, ..., x_n | c) = P(x_1 | c) · P(x_2 | c) · ... · P(x_n | c).

 

Laplacian smoothing

 

Now a question arises: when so many probabilities are multiplied together, there is a potential risk that if any single probability is 0, the result of the whole calculation is 0.

 

Imagine that just because one word happens not to appear in the training data set, the whole result becomes 0; this is obviously unreasonable. To avoid this situation, we introduce a commonly used remedy: Laplacian smoothing, also called add-1 smoothing.

 

Add-1 smoothing, as the name suggests, avoids zero probabilities by artificially adding 1 to the count of every word.

 

Following the previous example, suppose the spam emails contain 120 words in total, among which there are n distinct words, namely {"latest", "make money", "method", ...}. The statistics are as follows:

Obviously, the probabilities of these n words must add up to 1: P(w_1 | spam) + P(w_2 | spam) + ... + P(w_n | spam) = 1.

 

Let's introduce the idea of ​​Laplacian smoothing to deal with these probability values.

 

Think about it: after introducing Laplacian smoothing, the word frequency statistics above change to the following. Is there any problem with this?

Of course there is a problem!

 

Simply adding 1 to every numerator makes the n word probabilities sum to more than 1: (count(w_1) + 1)/120 + ... + (count(w_n) + 1)/120 = 1 + n/120 > 1.

 

How do we fix this? You have probably already guessed: we not only add 1 to each numerator, we also add n to the denominator, and the problem is solved: (count(w_1) + 1)/(120 + n) + ... + (count(w_n) + 1)/(120 + n) = (120 + n)/(120 + n) = 1.

That is all there is to Laplacian smoothing. Easy to understand, isn't it?
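As a concrete sketch of the idea (the counts below are made up for illustration, not the course's 120-word table):

```python
from collections import Counter

def smoothed_probs(word_counts, vocab):
    """Add-1 (Laplace) smoothing: add 1 to every word count and add n = |vocab|
    to the total, so the smoothed probabilities still sum to 1."""
    total = sum(word_counts.values())
    n = len(vocab)
    return {w: (word_counts.get(w, 0) + 1) / (total + n) for w in vocab}

# Toy spam word counts, invented for illustration.
spam_counts = Counter({"latest": 12, "make money": 18, "method": 6})
vocab = set(spam_counts) | {"meeting"}        # "meeting" never appears in spam

probs = smoothed_probs(spam_counts, vocab)
print(probs["meeting"])            # small but nonzero instead of 0
print(sum(probs.values()))         # still sums to 1 (up to floating-point rounding)
```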

 

Handling continuous features

 

So far, the naive Bayes model we have discussed has focused on text classification. Each feature is a word, so the features can be understood as discrete, which is exactly why the conditional probabilities can be estimated by counting occurrences.

 

Can the naive Bayes model also handle continuous features, such as a person's height, a temperature, or an age?

 

Of course it can! If a feature is continuous, we can use a Gaussian distribution to describe its conditional probability.

 

The Gaussian distribution is particularly well suited to modelling the real world: for many quantities, once we observe enough data, the measurements approximately follow a Gaussian distribution.

 

For example, if we measure the heights of 10,000 men across the country, the measurements will largely follow a Gaussian distribution. And the more people we measure, the closer the empirical distribution will be to a Gaussian. This is the main reason we like to fit real-life data with a Gaussian distribution.

 

Given a batch of data, how do we use Gaussian distribution for fitting?

 

The probability density function of the Gaussian distribution is: f(x) = 1 / (σ√(2π)) · exp(-(x - μ)² / (2σ²))

 

Here, μ is the mean and σ is the standard deviation; these are the two values we need to estimate.
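Estimating these two values from data just means computing the sample mean and standard deviation. A minimal sketch (the height values below are made up):

```python
import numpy as np

# Toy 1-D feature values for one class (e.g. heights in cm), invented for illustration.
heights = np.array([172.0, 168.5, 181.2, 175.4, 169.9, 178.3])

mu = heights.mean()        # estimated mean μ
sigma = heights.std()      # estimated standard deviation σ

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Density of observing a height of 174 cm under the fitted Gaussian.
print(mu, sigma, gaussian_pdf(174.0, mu, sigma))
```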

 

Let's return to the topic of Naive Bayes. Now that we know how to fit continuous data with a Gaussian distribution, we know how to handle continuous features in the data.

 

The procedure is as follows:

  • Group the training samples by category.
  • For each category, fit an independent Gaussian distribution to its feature values.

 

After fitting the distributions, we can evaluate each conditional probability and predict the category of any new input.
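The course does not give code here; as one possible illustration (the library choice and the toy data are assumptions), scikit-learn's GaussianNB performs exactly this per-class, per-feature Gaussian fitting:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy continuous features, e.g. [height in cm, age in years]; labels 0 / 1.
# The values are invented for illustration only.
X = np.array([[178, 34], [165, 29], [182, 41], [158, 23], [171, 37], [160, 26]])
y = np.array([1, 0, 1, 0, 1, 0])

model = GaussianNB()       # fits one Gaussian per class and per feature
model.fit(X, y)

print(model.predict([[175, 30]]))         # predicted class for a new sample
print(model.predict_proba([[175, 30]]))   # posterior probability of each class
```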

 

That is all we will say about Gaussian Naive Bayes here.

 

To sum up, Naive Bayes itself is best suited to text classification problems. Whenever you later face a problem related to text, try Naive Bayes: at the very least it serves as a very reliable baseline.

 

Congratulations, you have completed another level. Do you feel your machine learning skills improving noticeably? That's it for today, bye~

 


 

Origin blog.csdn.net/qq_34409973/article/details/115245299