[Machine Learning] An example that perfectly explains the Naive Bayes classifier

The simplest solution is often the most powerful one, and Naive Bayes is a good example of that. Despite the tremendous progress in machine learning over the past few years, Naive Bayes has proven to be not only simple but also fast, accurate, and reliable. It has been used successfully in many projects, and it is especially helpful for natural language processing (NLP) problems.

Naive Bayes is a family of probabilistic algorithms that use probability theory and Bayes' theorem to predict the category of a sample, such as a news article or a customer review. They are probabilistic, which means they compute the probability of each category for a given sample and then output the category with the highest probability. They obtain these probabilities using Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that may be related to that event.
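In code, that decision rule is just an argmax over the class probabilities. Here is a minimal sketch (the `class_probability` helper is hypothetical and stands in for the Bayes computation developed in the rest of this article):

```python
def predict(text, classes, class_probability):
    # Score every class and return the one with the highest probability.
    # `class_probability(text, c)` is a hypothetical helper returning P(c | text).
    return max(classes, key=lambda c: class_probability(text, c))
```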

We will use an algorithm called Multinomial Naive Bayes. We will walk through the algorithm as applied to NLP with an example, so by the end you will know not only how the method works, but also why it works. Finally, we will look at some advanced techniques that can make Naive Bayes comparable to more complex machine learning algorithms such as SVMs and neural networks.

A simple example

Let's see how this works in practice with a simple example. Suppose we are building a classifier that says whether a text is about sports or not. Our training set has 5 sentences:

Text                            Category
A great game                    Sports
The election was over           Not sports
Very clean match                Sports
A clean but forgettable game    Sports
It was a close election         Not sports
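Since the later steps all refer back to this data, it helps to have it in code form. Here it is as a plain Python list of (text, category) pairs (the variable name `training_set` is just a choice for these sketches):

```python
# The five training sentences from the table above, as (text, category) pairs.
training_set = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]
```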

Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the sentence "A very close game" is Sports and the probability that it is Not sports.

Mathematically, what we want is P(Sports | a very close game), the probability that the category of the sentence "A very close game" is Sports.

But how do we calculate these probabilities?

Feature engineering

When creating a machine learning model, the first thing we need to decide is what to use as features. For example, if we were classifying people's health, the features might be a person's height, weight, gender, and so on. We would exclude things that are useless to the model, such as a person's name or favorite color.

In this case, we don't even have numerical features; we only have words. We need to somehow convert this text into numbers we can do calculations on.

So what do we do? The usual approach is to use word frequencies. That is, we ignore word order and sentence construction and treat each document as a bag of words. Our features will be the counts of these words. Although this may seem overly simplistic, it works surprisingly well.
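For example, here is a minimal way to turn a sentence into word counts in Python, ignoring order and lowercasing so that "A" and "a" count as the same word (a sketch, not the only possible tokenization):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase the text, split on whitespace, and count each word.
    return Counter(text.lower().split())

print(bag_of_words("A clean but forgettable game"))
# Counter({'a': 1, 'clean': 1, 'but': 1, 'forgettable': 1, 'game': 1})
```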

Bayes' theorem

Bayes' theorem is useful when working with conditional probabilities (as we do here), because it gives us a way to reverse them:

P(A | B) = P(B | A) × P(A) / P(B)

In our case we have P(Sports | a very close game), so using this theorem we can reverse the conditional probability:

P(Sports | a very close game) = P(a very close game | Sports) × P(Sports) / P(a very close game)

Since for our classifier we are just trying to find out which category has the greater probability, we can discard the divisor and simply compare

P(a very close game | Sports) × P(Sports)   with   P(a very close game | Not Sports) × P(Not Sports)
This makes things easier, because we can actually calculate these probabilities! Just count how many times the sentence "A very close game" appears in the Sports samples of the training set and divide by the total, to get P(a very close game | Sports).

There is a problem, though: "A very close game" does not appear anywhere in our training set, so this probability is zero. The model won't be very useful unless every sentence we want to classify appears in our training set.

Being Naive

We assume that every word in a sentence is independent of the other words. This means that we no longer look at entire sentences, but at individual words. We write P(a very close game) as: P(a very close game) = P(a) × P(very) × P(close) × P(game). This assumption is very strong, but very useful. It is what allows the model to work well with small amounts of data, or with data that may be mislabeled. The next step is to apply this to what we had before:

P(a very close game | Sports) = P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports)

And now all of these individual words actually appear in our training set, so we can compute these probabilities by counting!

Calculating the probabilities

The process of calculating probabilities is really just the process of counting in our training set.

First, we compute the prior probability of each class: for a given sentence in our training set, the probability that it is Sports is P(Sports) = 3/5, and P(Not Sports) = 2/5. Then, to calculate P(game | Sports), we count how many times the word "game" appears in the Sports samples (2) and divide by the total number of words in the Sports samples (11). Therefore, P(game | Sports) = 2/11.
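Here is a short sketch of that counting step, reusing the `training_set` and `bag_of_words` snippets from above; it reproduces P(Sports) = 3/5 and P(game | Sports) = 2/11:

```python
from collections import Counter
from fractions import Fraction

# Pool the word counts of all sentences in each category,
# and count how many sentences each category has.
word_counts = {"Sports": Counter(), "Not sports": Counter()}
sentence_counts = Counter()
for text, category in training_set:
    sentence_counts[category] += 1
    word_counts[category] += bag_of_words(text)

prior_sports = Fraction(sentence_counts["Sports"], len(training_set))
p_game_given_sports = Fraction(word_counts["Sports"]["game"],
                               sum(word_counts["Sports"].values()))
print(prior_sports)         # 3/5
print(p_game_given_sports)  # 2/11
```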

However, we run into a problem: "close" does not appear in any Sports sample! That means P(close | Sports) = 0. This is rather inconvenient, because we are going to multiply it with the other probabilities, so P(a | Sports) × P(very | Sports) × 0 × P(game | Sports) equals 0. Doing this would give us no information at all, so we have to find a way around it.

How do we do that? By using something called Laplace smoothing: we add 1 to every count, so it is never zero. To balance this, we add the number of possible words to the divisor, so the result will never be greater than 1. In our case the possible words are ["a", "great", "very", "over", "it", "but", "game", "election", "close", "clean", "the", "was", "forgettable", "match"].
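A quick check of that vocabulary size, again building on the earlier snippets:

```python
# Collect every distinct word that appears anywhere in the training set.
vocabulary = set()
for text, _ in training_set:
    vocabulary.update(bag_of_words(text))
print(len(vocabulary))  # 14
```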

Since the number of possible words is 14, applying Laplace smoothing we get P(game | Sports) = (2 + 1) / (11 + 14) = 3/25. The full results are as follows:

Word     P(word | Sports)       P(word | Not Sports)
a        (2 + 1) / (11 + 14)    (1 + 1) / (9 + 14)
very     (1 + 1) / (11 + 14)    (0 + 1) / (9 + 14)
close    (0 + 1) / (11 + 14)    (1 + 1) / (9 + 14)
game     (2 + 1) / (11 + 14)    (0 + 1) / (9 + 14)
Now we just multiply all the probabilities together and see which is bigger:

P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports) × P(Sports)
= 3/25 × 2/25 × 1/25 × 3/25 × 3/5 ≈ 2.76 × 10⁻⁵

P(a | Not Sports) × P(very | Not Sports) × P(close | Not Sports) × P(game | Not Sports) × P(Not Sports)
= 2/23 × 1/23 × 2/23 × 1/23 × 2/5 ≈ 0.57 × 10⁻⁵
Perfect! Our classifier assigns "A very close game" to the Sports class.
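Putting all of the pieces together, here is a self-contained sketch of the whole calculation: multinomial Naive Bayes with Laplace smoothing, written from scratch for this toy data. It reproduces the numbers above, roughly 2.76 × 10⁻⁵ for Sports versus 0.57 × 10⁻⁵ for Not sports:

```python
from collections import Counter

training_set = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]

def tokenize(text):
    return text.lower().split()

# Gather the counts we need: sentences per class, word counts per class, vocabulary.
class_counts = Counter(category for _, category in training_set)
word_counts = {category: Counter() for category in class_counts}
vocabulary = set()
for text, category in training_set:
    words = tokenize(text)
    word_counts[category].update(words)
    vocabulary.update(words)

def score(text, category):
    # P(category) multiplied by P(word | category) for every word, with Laplace
    # smoothing: add 1 to each word count, add |vocabulary| to the divisor.
    prior = class_counts[category] / len(training_set)
    total_words = sum(word_counts[category].values())
    likelihood = 1.0
    for word in tokenize(text):
        likelihood *= (word_counts[category][word] + 1) / (total_words + len(vocabulary))
    return prior * likelihood

sentence = "A very close game"
for category in class_counts:
    print(category, score(sentence, category))  # ~2.76e-05 for Sports, ~5.72e-06 for Not sports

print(max(class_counts, key=lambda c: score(sentence, c)))  # Sports
```

In a real project, multiplying many small probabilities risks numerical underflow, which is why implementations usually sum log-probabilities instead.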

Advanced techniques

A lot can be done to improve this basic model. The following techniques can make Naive Bayes comparable to more advanced methods.

  • Removing stopwords. These are common words that do not really add anything to the classification, such as "a", "able", "either", "else", "ever" and so on. So for our purposes, "The election was over" would become "election over", and "a very close game" would become "very close game".
  • Lemmatizing words. This means grouping together different inflections of the same word. So "election", "elections", "elected" and so on would be grouped together and counted as more occurrences of the same word.
  • Using n-grams. Instead of counting single words as we did here, we could count sequences of words, such as "clean match" and "close election".
  • Using TF-IDF. Instead of just counting frequencies, we could do something more advanced, such as penalizing words that appear frequently in most of the texts. A sketch combining several of these ideas follows this list.
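One possible way to try these ideas in practice is scikit-learn, which is not mentioned in the original article, so treat this as an illustrative sketch rather than the author's setup (lemmatization would need an extra tokenizer, e.g. from NLTK or spaCy, and is omitted here):

```python
# Stopword removal, n-grams (unigrams and bigrams) and TF-IDF weighting via
# scikit-learn, feeding a multinomial Naive Bayes model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["A great game", "The election was over", "Very clean match",
         "A clean but forgettable game", "It was a close election"]
labels = ["Sports", "Not sports", "Sports", "Sports", "Not sports"]

model = make_pipeline(
    TfidfVectorizer(stop_words="english",   # drop common words such as "a" and "the"
                    ngram_range=(1, 2)),    # count single words and word pairs
    MultinomialNB(),
)
model.fit(texts, labels)
print(model.predict(["A very close game"]))
```

On a toy dataset this small the extra machinery won't show its value; these techniques pay off on realistic corpora with thousands of documents.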


This article was recommended by Beiyou @爱可可-爱生活 and translated by the Alibaba Cloud Yunqi Community.
The original title of the article is "A practical explanation of a Naive Bayes classifier".

Author: Bruno Stecanella, machine learning enthusiast. Translator: Yuan Hu. Reviewer:
This article is an abridged translation; for more details, please see the original text.
