Machine Learning in Action(4) —— Classifying with probability theory: naïve Bayes

 


1.Hard decisions and probability theory

Hard decision —— what class does this instance belong to? —— result: right or wrong.

Probability theory —— what is the best guess about the class of this instance, and what is the probability estimate of that best guess? —— result: a best guess together with its probability.

 

2.Bayesian decision theory —— choosing the decision with the highest probability.

Bayesian probability —— allows prior knowledge and logic to be applied to uncertain statements.

Frequency probability —— only draws conclusions from data and doesn’t allow for logic and prior knowledge.

 

3.Conditional probability

Formal definition of conditional probability —— with an example:
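For events A and B with P(B) > 0, the standard definition is:

P(A | B) = P(A and B) / P(B)

As a small illustrative example: suppose 7 stones are split into two buckets and bucket B holds 3 stones, 1 of which is gray. Then P(gray | bucket B) = P(gray and bucket B) / P(bucket B) = (1/7) / (3/7) = 1/3.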

Bayes' rule —— tells us how to swap the symbols in a conditional probability statement. If we have P(c | x) but want P(x | c), we can find it with the following formula:
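P(x | c) = P(c | x) P(x) / P(c)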

4.Classifying with conditional probability and Bayesian theory

Description:

When given a new point, we want to estimate which class it belongs to.

 

We can calculate the probability of this point belonging to each class and, following Bayesian decision theory, choose the class with the highest probability as the label assigned to that point. That means we need to calculate and compare P( Ci | x, y): given a point identified by the coordinates (x, y), what is the probability that it came from class Ci?

 

Applying Bayes' rule, we can convert P( Ci | x, y) to another form:
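P( Ci | x, y) = P(x, y | Ci) P(Ci) / P(x, y)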

So we can calculate the unknown term on the left from three known terms on the right.

 

What's more, if we just want to compare the probabilities of two classes, calculating the numerator on the right is enough: for a given instance the denominator has the same value for every class, so we can simply compare the numerators.

 

Thus the Bayesian classification rule is:
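If P( C1 | x, y) > P( C2 | x, y), the class is C1;

If P( C2 | x, y) > P( C1 | x, y), the class is C2.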

 

 

5.Document classification with naïve Bayes

Description:

We have some emails; according to the presence or absence of certain specific words in each email, we can classify whether an email is spam or not.

The same approach can classify any type of text, such as message board discussions, news stories, etc.

 

We are going to use individual words as features and look for the presence or absence of each word.

 

How many features should we maintain, and how do we get those features?

 

We need data samples to train the algorithm. How many samples do we need?

In order to generate good probability distributions, we need enough data samples. Suppose we need N samples for each feature and we have M features; then we need N^M samples in total, so the required number grows very quickly.

 

Naïve Bayes:

To reduce the size of the required dataset, we make two naïve assumptions about the training samples.

First, assume independence among features; then the N^M samples we needed are reduced to M·N (with 1,000 features, N^1000 becomes 1000·N).

Second, assume that every feature is equally important.

These are the assumptions made by naïve Bayes.

 

Implementation:

1.Get features from text

Tokens or words:

A token is any combination of characters: not only words, but also strings of other forms such as URLs and IP addresses. When splitting up the text, we want information of this kind to appear as a whole rather than as separate, meaningless fragments and symbols.

How to represent features:

Reduce every piece of text to a vector over the vocabulary, where 1 represents that the token is present in the document and 0 represents that it is not. In other words, we transform lists of text into vectors of numbers.

 

How to do it?

First: find features —— the vocabulary list

We go through all the words in all of our documents and decide on the vocabulary, the set of words we will consider as features.

Problems during parsing the text file:

(1) get rid of punctuation

(2) get rid of empty strings

(3) convert strings to all lowercase

(4) filter out unwanted words (for example, very short tokens)

Annotations about functions in Python:

1. The regular expression r'\W*'

   documentation: https://docs.python.org/3/library/re.html

2. Lambda expressions

3. A very simple way to distill the unique elements from a list: just pass the list of items to the set constructor. To distill the unique elements from several lists, construct a set for each list and union them with the | operator.
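A minimal sketch combining these pieces: splitting on the regular expression, lowercasing, filtering, and building the vocabulary with set unions. The function names parse_text and create_vocab_list are illustrative, and the length-based filter is only one possible filtering rule.

import re

def parse_text(big_string):
    """Split text on non-word characters, lowercase the tokens,
    and drop empty strings and very short tokens."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

def create_vocab_list(documents):
    """Collect the unique words across all documents (the vocabulary)."""
    vocab = set()
    for doc in documents:
        vocab = vocab | set(doc)   # union of two sets with the | operator
    return list(vocab)

docs = [parse_text("My dog has flea problems, help please."),
        parse_text("Stop posting stupid garbage, please stop!")]
print(create_vocab_list(docs))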

 

Second: transform text into a vector

We need to transform each individual document into a vector built from our vocabulary. We take the vocabulary list and a document, and output a vector of 1s and 0s representing whether each word from the vocabulary is present in the document.

Thus, for every text we create a vector the same length as the vocabulary list, so every vector has the same length.
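A minimal sketch of this conversion (the name set_of_words_to_vec is illustrative; it assumes the vocabulary list built above):

def set_of_words_to_vec(vocab_list, input_doc):
    """Return a 0/1 vector as long as the vocabulary: 1 if the vocabulary
    word appears in the document, 0 otherwise."""
    vec = [0] * len(vocab_list)
    for word in input_doc:
        if word in vocab_list:
            vec[vocab_list.index(word)] = 1
    return vec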

Annotations about functions in Python:

Difference between append() and extend()
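In short:

a = [1, 2]
a.append([3, 4])   # a is now [1, 2, [3, 4]] (the whole list is added as a single element)
b = [1, 2]
b.extend([3, 4])   # b is now [1, 2, 3, 4]   (the elements are added one by one)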

 

2.Train the algorithm:

Calculating probabilities from word vectors. Recall that our goal is to classify a new example; according to Bayes' rule we can calculate P( Ci | w) = P(w | Ci) P(Ci) / P(w) for each class Ci and choose the class with the highest probability as the label for the new example. So the goal of training is to estimate the terms on the right-hand side from the training dataset. How can we get P(w | Ci)? This is where the naïve assumption comes in. If we expand w into its individual features, we can rewrite it as P(w0, w1, w2, ..., wN | Ci). The assumption that all the words are independent of one another, something called conditional independence, lets us calculate this probability as P(w0 | Ci) P(w1 | Ci) P(w2 | Ci) ... P(wN | Ci).

 

Notice that the trainingDataSet argument of this training function is a list of vectors of the same length —— the length of the vocabulary list.

For each class, we loop over all the documents belonging to that class and maintain a count vector, the length of the vocabulary, recording how often each word occurs; each time a word occurs in a document, we add 1 to that word's count.

For every word in a document, it contributes either 1 or 0 to the count of the corresponding entry in the vocabulary list.

We maintain one count vector per class, not a single count vector shared by all the documents.
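A minimal sketch of such a training step for two classes, 0 and 1, using the 0/1 word vectors from above; the name train_nb and the exact return values are illustrative:

import numpy as np

def train_nb(train_matrix, class_labels):
    """train_matrix: list/array of 0/1 word vectors; class_labels: 0/1 labels.
    Returns the word-probability vectors P(w | C0), P(w | C1) and the prior P(C1)."""
    num_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_c1 = sum(class_labels) / float(num_docs)   # prior probability of class 1
    count_c0 = np.zeros(num_words)               # per-class word counts
    count_c1 = np.zeros(num_words)
    total_c0 = 0.0
    total_c1 = 0.0
    for i in range(num_docs):
        if class_labels[i] == 1:
            count_c1 += train_matrix[i]
            total_c1 += sum(train_matrix[i])
        else:
            count_c0 += train_matrix[i]
            total_c0 += sum(train_matrix[i])
    p_w_c0 = count_c0 / total_c0                 # element-wise division of NumPy arrays
    p_w_c1 = count_c1 / total_c1
    return p_w_c0, p_w_c1, p_c1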

 

Annotations about functions in Python:

A NumPy array can be divided by a float, which can't be done with a regular Python list. A NumPy array can also be divided by another array (or list), and the division is element-wise.
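For example:

import numpy as np

arr = np.array([1, 2, 3])
print(arr / 2.0)                   # [0.5 1.  1.5]  (element-wise division by a float)
print(arr / np.array([2, 2, 2]))   # [0.5 1.  1.5]  (element-wise division by an array)
# [1, 2, 3] / 2.0 with a plain Python list raises a TypeError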

 

 

3.Test the algorithm:

Two problems and solutions:

In order to lessen the impact of a single zero probability wiping out the whole product for a class, we initialize the occurrence count of every word in the vocabulary to 1.

In order to avoid the underflow and round-off problems caused by multiplying many small probabilities, we switch to the natural logarithm and turn the product into a sum of logarithms.
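A minimal sketch of both fixes applied to the training step above, plus the classification step that compares sums of logarithms; the names train_nb_smoothed and classify_nb are illustrative:

import numpy as np

def train_nb_smoothed(train_matrix, class_labels):
    """Like train_nb above, but counts start at 1 (totals at 2.0) so no word
    probability is ever 0, and the returned word probabilities are logarithms."""
    num_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_c1 = sum(class_labels) / float(num_docs)
    count_c0 = np.ones(num_words)   # initialize every word's count to 1
    count_c1 = np.ones(num_words)
    total_c0 = 2.0                  # matching initialization of the denominators
    total_c1 = 2.0
    for i in range(num_docs):
        if class_labels[i] == 1:
            count_c1 += train_matrix[i]
            total_c1 += sum(train_matrix[i])
        else:
            count_c0 += train_matrix[i]
            total_c0 += sum(train_matrix[i])
    return np.log(count_c0 / total_c0), np.log(count_c1 / total_c1), p_c1

def classify_nb(word_vec, log_p_w_c0, log_p_w_c1, p_c1):
    """The sum of logs replaces the product P(w0|Ci)P(w1|Ci)...P(wN|Ci)P(Ci)."""
    log_p1 = np.sum(word_vec * log_p_w_c1) + np.log(p_c1)
    log_p0 = np.sum(word_vec * log_p_w_c0) + np.log(1.0 - p_c1)
    return 1 if log_p1 > log_p0 else 0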

 

Main steps:

# Transform all the example emails from raw text into lists of tokens.

# Use the token lists from the first step to build the vocabulary list (the features).

# Using the vocabulary list, represent every example as a word vector.

# Hold-out cross validation —— randomly select some vectors as the training dataset and keep the remainder as the testing dataset (see the sketch after this list).

# Use the training dataset to train the algorithm (calculate the probabilities).

# Use the testing dataset to test the algorithm (calculate the error rate).
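A minimal sketch of these steps, assuming the helpers sketched earlier (parse_text, create_vocab_list, set_of_words_to_vec, train_nb_smoothed, classify_nb) and leaving the actual email loading out; all names are illustrative:

import random
import numpy as np

def spam_test(doc_list, class_list, num_test=10):
    """doc_list: token lists for all emails; class_list: their 0/1 labels (1 = spam)."""
    vocab_list = create_vocab_list(doc_list)
    indices = list(range(len(doc_list)))
    test_idx = random.sample(indices, num_test)              # hold out num_test unique docs
    train_idx = [i for i in indices if i not in test_idx]

    train_matrix = [set_of_words_to_vec(vocab_list, doc_list[i]) for i in train_idx]
    train_labels = [class_list[i] for i in train_idx]
    log_p0, log_p1, p_c1 = train_nb_smoothed(np.array(train_matrix), train_labels)

    errors = 0
    for i in test_idx:
        vec = np.array(set_of_words_to_vec(vocab_list, doc_list[i]))
        if classify_nb(vec, log_p0, log_p1, p_c1) != class_list[i]:
            errors += 1
    return errors / float(num_test)                          # error rate on the held-out set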

 

 

 

Annotations about functions in Python:

1. Difference between extend() and append()

2. Difference between random.sample(population,k ) and numpy.random.sample()

random.sample(population, k) will get k unique values from the population.
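For example:

import random
import numpy as np

print(random.sample(range(50), 10))   # 10 unique integers drawn from 0..49
print(np.random.sample(3))            # 3 random floats in the half-open interval [0.0, 1.0)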

In addition: to get a good estimate of our classifier's true error rate, we should repeat this test multiple times and take the average error rate.

Reposted from blog.csdn.net/qq_39464562/article/details/81068161