Machine Learning Practical Tutorial (7): Naive Bayes

1. Introduction

Naive Bayes is a supervised learning algorithm for classification problems such as customer churn prediction, investment screening, and credit rating, including multi-class problems. Its advantages are that it is simple, easy to understand, and efficient to train, and in certain domains its classification performance is comparable to decision trees and neural networks. However, because the algorithm rests on the assumption that the features are conditionally independent, and on a normality assumption for continuous features, its accuracy suffers when those assumptions do not hold.

2. Naive Bayes Theory

We divide the sample space into several cases that are easier to analyze.

  • The total probability formula (from cause to effect) examines the probability of event B in each case and combines them into the overall probability of B.
  • Bayes' formula (from effect back to cause) examines the conditional probability of each case given that event B has occurred.

Conditional Probability

Formula Derivation

First we need to understand conditional probability, which is the probability that event A occurs given that event B has occurred, written P(A|B).

From the Venn diagram it is clear that when event B has occurred, the probability that event A also occurs is P(A∩B) divided by P(B):

P(A|B) = P(A∩B) / P(B)

In the same way, the probability of B given that A has occurred is

P(B|A) = P(A∩B) / P(A)

so

P(A∩B) = P(B|A)P(A)

and therefore

P(A|B) = P(B|A)P(A) / P(B)

which is the calculation formula for conditional probability.

Calculation case

Example 1: Roll a fair die.

  1. What is the probability that the number showing is even?
  2. Given that the number showing is even, what is the probability that it is greater than 4?

Solution: P(even) = 3/6 = 1/2. Let B = "the number is even" and A = "the number is greater than 4"; then A∩B = {6}, so P(A∩B) = 1/6 and P(A|B) = (1/6)/(1/2) = 1/3.

Example 2: The probability that a certain animal lives to age 20 is 0.7, and the probability that it lives to age 25 is 0.56. Find the probability that an animal that is now 20 years old will live to age 25.

Solution: Since living to 25 implies living to 20, P(live to 25 ∩ live to 20) = P(live to 25) = 0.56, so P(live to 25 | live to 20) = 0.56 / 0.7 = 0.8.

Total Probability Formula

First we must understand the "division of the sample space" (also called a "complete event group").
A set of events A1, A2, …, An is a complete event group of the sample space Ω if it satisfies:

  1. A1, A2, …, An are mutually exclusive;
  2. A1 ∪ A2 ∪ … ∪ An = Ω.

In short, the events are pairwise mutually exclusive and their union is the entire sample space (a certain event).
The probability of B over the whole of Ω is then:

P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + … + P(B|An)P(An)

Formula Derivation

Assume the sample space S is the union of two events A and A' (the original figure showed A in red and A' in green, together making up S).
In this case, event B can be split into two parts:

B = (B∩A) ∪ (B∩A')

that is,

P(B) = P(B∩A) + P(B∩A')

From the previous section we know that P(B∩A) = P(B|A)P(A), so

P(B) = P(B|A)P(A) + P(B|A')P(A')

This is the total probability formula. It says that if A and A' form a division of the sample space, then the probability of event B equals the sum, over A and A', of the probability of each event multiplied by the conditional probability of B given that event.

Calculation case

Example 1: A batch of products of the same model is produced by three factories: 30% by factory 1, 50% by factory 2, and 20% by factory 3. The defective rates of the three factories are 2%, 1%, and 1% respectively. What is the probability that a randomly chosen product is defective?

Solution: by the total probability formula,
P(defective) = 0.3×0.02 + 0.5×0.01 + 0.2×0.01 = 0.006 + 0.005 + 0.002 = 0.013.
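The same calculation can be written as a short Python check (a minimal sketch using only the numbers given above):

shares = [0.30, 0.50, 0.20]        # P(factory 1), P(factory 2), P(factory 3)
defect_rates = [0.02, 0.01, 0.01]  # P(defective | factory i)

# total probability formula: P(defective) = sum of P(factory i) * P(defective | factory i)
p_defective = sum(s * d for s, d in zip(shares, defect_rates))
print(p_defective)  # ≈ 0.013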
Example 2: An electronics manufacturer uses components supplied by three different component makers. Past records give each supplier's share of the stock and its defective rate (shown in a table in the original post). The components from the three suppliers are thoroughly mixed in the warehouse with no distinguishing marks.

  1. Pick a component at random from the warehouse and find the probability that it is defective;
  2. Pick a component at random from the warehouse. Given that it is defective, find the probability that it came from each of the three suppliers.

(The first part is again the total probability formula; the second is Bayes' formula. The worked solution was shown in the original figure.)

Bayesian Inference

Bayesian decision making

Suppose we have a data set consisting of two classes of points, distributed as in the original figure (red dots for class 1, blue triangles for class 2).
We use p1(x,y) to denote the probability that the data point (x,y) belongs to class 1 and p2(x,y) to denote the probability that it belongs to class 2. For a new data point (x,y), its class can be determined by the following rule:

  1. If p1(x,y) > p2(x,y), the point belongs to class 1;
  2. If p1(x,y) < p2(x,y), the point belongs to class 2.

In other words, we choose the class with the higher probability. This is the core idea of Bayesian decision theory: pick the decision with the highest probability. With the core idea understood, what remains for a Bayesian classifier is how to compute the probabilities p1 and p2.

Bayesian derivation

Rearranging the conditional probability formula gives the following form:

P(A|B) = P(A) · P(B|A) / P(B)

where P(B) can be expanded with the total probability formula. Here P(A) is called the "prior probability": our judgment of the probability of event A before event B is observed. P(A|B) is called the "posterior probability": our re-evaluation of the probability of A after B has occurred. P(B|A)/P(B) is called the likelihood ratio, an adjustment factor that pulls the estimated probability closer to the true probability. Conditional probability can therefore be read as:

posterior probability = prior probability × adjustment factor

This is what Bayesian inference means. We first estimate a "prior probability", and then add the experimental results to see whether the experiment enhances or weakens the "prior probability", thereby obtaining a "posterior probability" that is closer to the truth.

If the adjustment factor P(B|A)/P(B) > 1, the prior probability is strengthened and event A becomes more likely; if it equals 1, event B gives no information about event A; if it is less than 1, the prior probability is weakened and event A becomes less likely.

Calculation case

To deepen our understanding of Bayesian inference, let's take an example.
There are two identical bowls. Bowl 1 contains 30 fruit candies and 10 chocolate candies; bowl 2 contains 20 fruit candies and 20 chocolate candies. A bowl is chosen at random and one candy is drawn from it, which turns out to be a fruit candy. What is the probability that this fruit candy came from bowl 1?

We assume that H1 represents bowl No. 1 and H2 represents bowl No. 2. Since the two bowls are the same, P(H1)=P(H2), that is, before taking out the fruit candy, the two bowls have the same probability of being selected. Therefore, P(H1)=0.5, we call this probability "prior probability", that is, before doing the experiment, the probability of coming from bowl No. 1 is 0.5.

Assume further that E represents drawing a fruit candy. The question then becomes: given E, what is the probability that the candy came from bowl 1, i.e. find P(H1|E). We call this probability the "posterior probability", the correction to P(H1) after event E is observed.

According to the conditional probability formula,

P(H1|E) = P(E|H1)P(H1) / P(E)

We already know P(H1) = 0.5, and P(E|H1), the probability of drawing a fruit candy from bowl 1, equals 30÷(30+10) = 0.75. It remains to find P(E). By the total probability formula,

P(E) = P(E|H1)P(H1) + P(E|H2)P(H2) = 0.75×0.5 + 0.5×0.5 = 0.625

Plugging the numbers into the formula gives

P(H1|E) = 0.75×0.5 / 0.625 = 0.6

So the probability that the candy came from bowl 1 is 0.6. In other words, after the fruit candy is drawn, the likelihood of H1 is strengthened (from 0.5 to 0.6).

Now consider a further question. When applying this algorithm, if we do not need the exact class probability (the P(H1|E) = 0.6 above) but only the class itself, i.e. which bowl the candy came from, do we still need to compute P(E) with the total probability formula? No: we only need to compare P(H1|E) and P(H2|E) and take the larger one, and since both share the same denominator P(E), it suffices to compare the numerators P(E|H1)P(H1) and P(E|H2)P(H2). To reduce computation, the total probability formula is therefore unnecessary in actual implementations.
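As a quick sanity check, here is a minimal Python sketch of the candy-bowl calculation, including the shortcut of comparing only the numerators:

p_h1, p_h2 = 0.5, 0.5      # priors: the two bowls are equally likely to be picked
p_e_h1 = 30 / 40           # P(fruit candy | bowl 1)
p_e_h2 = 20 / 40           # P(fruit candy | bowl 2)

# full posterior, using the total probability formula for the denominator
p_e = p_e_h1 * p_h1 + p_e_h2 * p_h2
print(p_e_h1 * p_h1 / p_e)            # 0.6

# shortcut: the denominator is shared, so comparing numerators is enough
print(p_e_h1 * p_h1 > p_e_h2 * p_h2)  # True -> bowl 1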

Example: Analysis of past data shows that when a machine is well adjusted, the product pass rate is 90%, and when a certain fault occurs, the pass rate drops to 30%. Each morning when the machine is started, the probability that it is well adjusted is 75%. Given that the first product of the morning is a qualified one, what is the probability that the machine is well adjusted?

Solution: P(adjusted | qualified) = 0.75×0.9 / (0.75×0.9 + 0.25×0.3) = 0.675 / 0.75 = 0.9.

Example: The incidence of liver cancer among residents of a certain area is 0.0004, and the alpha-fetoprotein test is used for screening. Medical research shows the test can misdiagnose: 99% of people who have liver cancer test positive, while 99.9% of people who do not have liver cancer test negative. If someone's test result is positive, what is the probability that they really have liver cancer?

Solution: P(cancer | positive) = 0.0004×0.99 / (0.0004×0.99 + 0.9996×0.001) ≈ 0.000396 / 0.0013956 ≈ 0.284. Because the disease is rare, a positive result still corresponds to only about a 28% chance of having it.
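The numbers are easy to verify in a couple of lines:

p_cancer = 0.0004
p_pos_given_cancer = 0.99
p_pos_given_healthy = 1 - 0.999   # 0.1% false-positive rate

p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy
print(p_cancer * p_pos_given_cancer / p_pos)   # ≈ 0.284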

Naive Bayes

Naive Bayes is a simple but extremely powerful predictive modeling algorithm. It is called Naive Bayes because it assumes that each feature is independent.
When several features are used together to predict a class, the independence assumption lets the joint probability be decomposed into a product of per-feature probabilities for that class.
For example, for an email flagged as spam we may consider:

  • the probability that it mentions "real estate",
  • the probability that it mentions "loan",
  • the probability that it mentions both "real estate" and "loan".

Under the independence assumption the last one simplifies to the product of the first two: P(real estate, loan) = P(real estate) × P(loan).

The Naive Bayes model therefore consists of two kinds of probabilities:

  1. the probability of each class, P(Cj);
  2. the conditional probability of each attribute given the class, P(Ai|Cj).

Formula Derivation

According to Bayes' formula, with a feature vector X = (X1, X2, …, Xn) and class label Y:

P(Y|X) = P(Y) · P(X|Y) / P(X)

Because X is multi-dimensional and the features are assumed conditionally independent given Y, this becomes

P(Y|X1,X2,…,Xn) = P(Y) · P(X1|Y)P(X2|Y)…P(Xn|Y) / (P(X1)P(X2)…P(Xn))

For example, let the class Y be male or female and let the features X be (height, weight). To decide whether a person with a given height and weight is male or female, we compare which class is more probable using the formula above. If the priors are equal (P(male) = P(female) = 0.5) and the denominator P(X1)P(X2)…P(Xn) is the same for both classes, we only need to compare the products P(X1|Y)P(X2|Y)…P(Xn|Y); the class with the larger product is the predicted result, as the short sketch below shows.
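To make this concrete, here is a minimal sketch with purely illustrative (made-up) priors and per-feature conditional probabilities; it simply multiplies them per class and picks the larger product:

from math import prod

priors = {"male": 0.5, "female": 0.5}
cond_probs = {
    "male":   [0.5, 0.5, 0.25],   # P(X1|male), P(X2|male), P(X3|male) - illustrative numbers
    "female": [0.1, 0.5, 0.5],
}

# naive Bayes score: P(Y) * product of P(Xi|Y); the common denominator is dropped
scores = {c: priors[c] * prod(cond_probs[c]) for c in priors}
print(scores)                       # {'male': 0.03125, 'female': 0.0125}
print(max(scores, key=scores.get))  # 'male'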

Calculation case

The discrete data is as follows:
(The original post showed a small table of 8 samples, 4 male and 4 female, with categorical height, weight, and shoe-size values.)
Question: a person has tall height, medium weight, and medium shoe size. Is this person male or female?
X1: height
X2: weight
X3: shoe size
Y: class, where Y1 is male and Y2 is female; Yj denotes an unknown class.

P(Yj|X1,X2,X3) = P(Yj) · P(X1|Yj)P(X2|Yj)P(X3|Yj) / (P(X1)P(X2)P(X3))

Since the prior probabilities of the two classes are equal and the denominator is the same for both, only the numerators are compared:

P(X1X2X3|Yj) = P(X1|Yj)P(X2|Yj)P(X3|Yj)

For class j=1 (Y1 = male):

P(X1|Y1) = 2/4     i.e., among the male samples (rows 1-4), the probability that height is tall (rows 1-2)
P(X2|Y1) = 2/4
P(X3|Y1) = 1/4
P(X1|Y1)P(X2|Y1)P(X3|Y1) = 2/4 × 2/4 × 1/4 = 1/16

For class j=2 (Y2 = female):

P(X1|Y2) = 0        i.e., among the female samples (rows 5-8), the probability that height is tall (none)
P(X2|Y2) = 2/4
P(X3|Y2) = 2/4
P(X1|Y2)P(X2|Y2)P(X3|Y2) = 0 × 2/4 × 2/4 = 0

Since 1/16 > 0, the male score is larger, so we conclude that a person who is tall, of medium weight, and with medium shoe size is male.
In this case the judgment is simple because the given features are discrete values. What if the features are continuous?

1. Discrete: the random variable takes a finite (or countably infinite) number of distinct values; the total probability of 1 is distributed over those possible values according to some rule.
2. Continuous: the values of the random variable X cannot be listed one by one; X can take any point in some interval of the real line.

(The original post showed a second table with continuous values for height, weight, and shoe size.)
Question: height 180 cm, weight 120, shoe size 41 — is this person male or female?
The formula is the same as above, but height, weight, and shoe size are now continuous variables, so the probabilities cannot be obtained by counting. Instead we assume that, within each class, height, weight, and shoe size are normally distributed; the mean and standard deviation are estimated from the samples, which gives the density function of each normal distribution. The density can then be evaluated at any point and used in place of the conditional probability. For example, if male height has mean 179.5 and standard deviation 3.697, the value of the density at 180 is about 0.1069.

Python implementation

#%%
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_excel('连续性.xlsx', sheet_name="Sheet1", index_col=0)
# compute the mean and standard deviation of each feature for males and females
df2 = df.groupby("性别").agg(["mean", "std"])
print(df2)

#%%

male_high_mean = df2.loc["男", "身高"]["mean"]
male_high_std = df2.loc["男", "身高"]["std"]

male_weight_mean = df2.loc["男", "体重"]["mean"]
male_weight_std = df2.loc["男", "体重"]["std"]

male_code_mean = df2.loc["男", "鞋码"]["mean"]
male_code_std = df2.loc["男", "鞋码"]["std"]

# norm.pdf(x, loc, scale) takes the mean and the standard deviation (not the variance)
# density of a height of 180 under the male class
male_high = stats.norm.pdf(180, male_high_mean, male_high_std)
# density of a weight of 120 under the male class
male_weight = stats.norm.pdf(120, male_weight_mean, male_weight_std)
# density of a shoe size of 41 under the male class
male_code = stats.norm.pdf(41, male_code_mean, male_code_std)
fz = male_high * male_weight * male_code
print(fz)

female_high_mean = df2.loc["女", "身高"]["mean"]
female_high_std = df2.loc["女", "身高"]["std"]

female_weight_mean = df2.loc["女", "体重"]["mean"]
female_weight_std = df2.loc["女", "体重"]["std"]

female_code_mean = df2.loc["女", "鞋码"]["mean"]
female_code_std = df2.loc["女", "鞋码"]["std"]

# densities of the three features under the female class
female_high = stats.norm.pdf(180, female_high_mean, female_high_std)
female_weight = stats.norm.pdf(120, female_weight_mean, female_weight_std)
female_code = stats.norm.pdf(41, female_code_mean, female_code_std)
ffz = female_high * female_weight * female_code
print(ffz)

if fz > ffz:
    print("男性")
else:
    print("女性")
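The same comparison can also be delegated to sklearn's GaussianNB, which internally fits one normal distribution per feature and per class. A minimal sketch, assuming the same 连续性.xlsx file and column names as above:

from sklearn.naive_bayes import GaussianNB
import pandas as pd

df = pd.read_excel('连续性.xlsx', sheet_name="Sheet1", index_col=0)
X = df[["身高", "体重", "鞋码"]]
y = df["性别"]

clf = GaussianNB().fit(X, y)
x_new = pd.DataFrame([[180, 120, 41]], columns=["身高", "体重", "鞋码"])
print(clf.predict(x_new))        # predicted class, e.g. ['男']
print(clf.predict_proba(x_new))  # probability of each class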

3. Practice: An Abusive-Comment Filter

TF-IDF feature vector

TF-IDF principle

A TF-IDF feature vector is a way to convert text data into a numerical representation: each dimension corresponds to a word, and each sample (i.e., a text) is represented as a vector.

TF-IDF is a commonly used weighting technique for information retrieval and text mining.

TF-IDF (Term Frequency-Inverse Document Frequency) is an algorithm commonly used in information retrieval and text analysis. It measures how important a word is to a single document within a collection of texts.

TF (Term Frequency) refers to word frequency, which represents the number of times the word appears in a certain text. IDF (Inverse Document Frequency) refers to the inverse document frequency, which is used to measure the frequency of occurrence of the word in the entire text collection, that is, how many texts the word appears in. The more common a word is in a text collection, the lower its inverse document frequency is, indicating that the word is less important in distinguishing different texts.

The TF-IDF value is therefore the product of the term frequency (TF) and the inverse document frequency (IDF). Within a single text, the higher a word's TF-IDF value, the more important the word is to that text and the more representative it is of the text's topic; across the whole collection, a high TF-IDF value means the word distinguishes one text from the others well.

The TF-IDF algorithm has been widely used in information retrieval, text classification, keyword extraction and other fields.

Suppose we have the following two texts:

  1. The quick brown fox jumps over the lazy dog.
  2. The brown fox is quick and the blue dog is lazy.

First, these texts need to be preprocessed: punctuation and stop words (the, is, and, …) are removed, and each text is converted into a list of words. For these two texts, the resulting word lists might be:

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
['brown', 'fox', 'quick', 'blue', 'dog', 'lazy']

Next, we need to calculate the number of times each word appears in each sample, that is, term frequency (TF, term frequency). This process can be implemented using the CountVectorizer class. Taking the first sample as an example, its word frequency vector is:

[1, 1, 1, 1, 1, 1]

That is, 'quick', 'brown', 'fox', 'jumps', 'lazy', and 'dog' all appear once in the first sample.
Next, the inverse document frequency (IDF) is computed to measure the importance of each word. The IDF formula is:

idf(t) = log( N / df(t) )

where N is the total number of documents (here 2 texts) and df(t) is the number of documents containing the word t. Computing this for the two texts above gives the IDF value of each word (in the word order of the first list):

[0.0, 0.0, 0.0, 0.6931471805599453, 0.0, 0.0]

'jumps' appears in only one of the two documents, so its IDF value is log(2/1) = 0.6931, while 'quick', 'brown', 'fox', 'lazy', and 'dog' appear in both documents, so their IDF values are log(2/2) = 0.

Finally, the TF vector of each text is multiplied element-wise by the IDF vector to obtain the TF-IDF feature vector. For the first text, the TF-IDF vector is:

[0.0, 0.0, 0.0, 0.6931471805599453, 0.0, 0.0]

'jumps' appears once in the first text and in only one of the two documents, so its TF-IDF value is 1 × log(2/1) = 0.6931; the TF-IDF values of the other words are all 0.

By analogy, the TF-IDF feature vectors of all texts can be obtained. It should be noted that the feature vector dimensions of each text are usually the same, so all texts need to be traversed when calculating TF-IDF.
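For comparison, here is how this can be computed with sklearn (a small sketch). Note that TfidfVectorizer by default uses a smoothed IDF and L2-normalizes each row, so the numbers will not match the hand calculation exactly, but the pattern is the same: the words that appear in only one document ('jumps', 'blue') get the largest weights.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The brown fox is quick and the blue dog is lazy.",
]

tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())   # use get_feature_names() on older sklearn versions
print(X.toarray().round(3))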

The difference between TfidfVectorizer and CountVectorizer

TfidfVectorizer calculates the importance of words in the text, that is, the TF-IDF value.
CountVectorizer only counts the number of times a word appears in the text.

The following is an illustration of the difference between TfidfVectorizer and CountVectorizer using real data:

Suppose we have the following three documents:

  • Document A: The weather is clear, the temperature is moderate, and the sun is shining brightly.
  • Document B: The weather is cloudy, with moderate temperatures and occasional light rain.
  • Document C: The weather is cloudy, the temperature is low, and there is rain.

We can use TfidfVectorizer and CountVectorizer to perform feature vectorization on these three documents and obtain their word frequency matrices.

Specifically, using CountVectorizer you can get the following word frequency matrix:

vocabulary Document A Document B Document C
weather 1 1 1
temperature 1 1 1
suitable 1 1 0
sunny 1 0 0
partly cloudy 0 1 0
Occasionally light rain 0 1 0
cloudy day 0 0 1
On the low side 0 0 1
It rains 0 0 1

Using TfidfVectorizer, you can get the following word frequency matrix:

vocabulary Document A Document B Document C
weather 0.00 0.00 0.58
temperature 0.42 0.42 0.42
suitable 0.58 0.58 0.00
sunny 0.81 0.00 0.00
partly cloudy 0.00 0.81 0.00
Occasionally light rain 0.00 0.81 0.00
cloudy day 0.00 0.00 0.58
On the low side 0.00 0.00 0.81
It rains 0.00 0.00 0.58

As can be seen from the matrices above, CountVectorizer only considers how often each word occurs in the current document, while TfidfVectorizer considers both the frequency of a word in the current document and (inversely) how many other documents contain that word, so TfidfVectorizer better reflects the differences between documents.

TfidfVectorizer and CountVectorizer instances

CountVectorizer counts word frequency

#%%

postingList=['my my dog has flea problems help please',                # tokenized documents
            'maybe not take him to dog park stupid',
             'my dalmation is so cute I love him',
             'stop posting stupid worthless garbage',
             'mr licks ate my steak how to stop him',
             'quit buying worthless dog food stupid']
classVec = [0,1,0,1,0,1]                                               # class label vector: 1 = abusive, 0 = not abusive

from sklearn.feature_extraction.text import CountVectorizer
# initialize CountVectorizer and extract text features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(postingList)

# print the feature vectors and the corresponding words
print(X.toarray())
print(vectorizer.get_feature_names())

output

[[0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 2 0 0 1 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0]
 [0 0 1 1 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0]
 [0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1]]
['ate', 'buying', 'cute', 'dalmation', 'dog', 'flea', 'food', 'garbage', 'has', 'help', 'him', 'how', 'is', 'licks', 'love', 'maybe', 'mr', 'my', 'not', 'park', 'please', 'posting', 'problems', 'quit', 'so', 'steak', 'stop', 'stupid', 'take', 'to', 'worthless']

Note that the deduplicated words become the feature columns; each row corresponds to a document, and each value is the number of times that feature word appears in the document.

TfidfVectorizer computes TF-IDF

#%%
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize TfidfVectorizer (English stop words are removed)
tvectorizer = TfidfVectorizer(stop_words='english')

# transform the text data into TF-IDF vectors
X_train = tvectorizer.fit_transform(postingList)
# print the feature vectors and the corresponding words
print(X_train.toarray())
print(tvectorizer.get_feature_names())

output

[[0.         0.         0.         0.         0.37115593 0.53611046
  0.         0.         0.53611046 0.         0.         0.
  0.         0.         0.         0.53611046 0.         0.
  0.         0.         0.        ]
 [0.         0.         0.         0.         0.40249409 0.
  0.         0.         0.         0.         0.         0.58137639
  0.         0.58137639 0.         0.         0.         0.
  0.         0.40249409 0.        ]
 [0.         0.         0.57735027 0.57735027 0.         0.
  0.         0.         0.         0.         0.57735027 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.51136725 0.         0.         0.         0.
  0.         0.         0.51136725 0.         0.         0.
  0.41932846 0.3540259  0.41932846]
 [0.46262479 0.         0.         0.         0.         0.
  0.         0.         0.         0.46262479 0.         0.
  0.46262479 0.         0.         0.         0.         0.46262479
  0.37935895 0.         0.        ]
 [0.         0.46468841 0.         0.         0.32170956 0.
  0.46468841 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.46468841 0.
  0.         0.32170956 0.38105114]]
['ate', 'buying', 'cute', 'dalmation', 'dog', 'flea', 'food', 'garbage', 'help', 'licks', 'love', 'maybe', 'mr', 'park', 'posting', 'problems', 'quit', 'steak', 'stop', 'stupid', 'worthless']

Train a Naive Bayes Classifier

We first get the term vector

import numpy as np
postingList=['my dog has flea problems help please',                # tokenized documents
            'maybe not take him to dog park stupid',
             'my dalmation is so cute I love him',
             'stop posting stupid worthless garbage',
             'mr licks ate my steak how to stop him',
             'quit buying worthless dog food stupid']
classVec = np.array([0,1,0,1,0,1])

from sklearn.feature_extraction.text import CountVectorizer
# initialize CountVectorizer and extract text features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(postingList)

# print the feature vectors and the corresponding words
v=np.array(X.toarray())
print(v)
fn=np.array(vectorizer.get_feature_names())
print(fn)

output


[[0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0]
 [0 0 1 1 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0]
 [0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1]]
['ate' 'buying' 'cute' 'dalmation' 'dog' 'flea' 'food' 'garbage' 'has'
 'help' 'him' 'how' 'is' 'licks' 'love' 'maybe' 'mr' 'my' 'not' 'park'
 'please' 'posting' 'problems' 'quit' 'so' 'steak' 'stop' 'stupid' 'take'
 'to' 'worthless']

Next, we can train a naive Bayes classifier through the term vectors.

"""
  Train on the word-count vectors and class labels to obtain the conditional probability of each
  feature under each class, as well as the prior probability of the corresponding class.
  When classifying a document with a Bayesian classifier, we compute the product of several
  probabilities to obtain the probability that the document belongs to a class,
  i.e. p(w0|1)p(w1|1)p(w2|1).
  If any one of these probabilities is 0, the whole product becomes 0, which is unreasonable.
  To reduce this effect, the count of every word can be initialized to 1 and the denominator
  increased accordingly. This technique is called Laplace smoothing (also known as add-1 smoothing),
  a commonly used smoothing method that exists precisely to solve the zero-probability problem;
  see the Laplace smoothing reference below for details.
"""
def trainData(vecList,classVec):
    # prior probability P(insult class); P(non-insult class) = 1 - P(insult class)
    PϹ侮辱类先验Ͻ=np.sum(classVec)/len(classVec)
    # rows where classVec == 0 (non-insult class)
    vec0=vecList[np.where(classVec==0)]
    # rows where classVec == 1 (insult class)
    vec1=vecList[np.where(classVec==1)]
    # Laplace smoothing factor a = 1: add 1 to every count and 2 to the denominator
    a=1
    # add 1 to every count
    vec0=np.add(vec0,a)
    vec1=np.add(vec1,a)
    # conditional probability of each feature under each class (denominator increased by 2)
    PϹ特征l非侮辱类Ͻ=np.sum(vec0,axis=0)/(np.sum(vec0)+a*2)
    PϹ特征l侮辱类Ͻ=np.sum(vec1,axis=0)/(np.sum(vec1)+a*2)
    return PϹ特征l侮辱类Ͻ,PϹ特征l非侮辱类Ͻ,PϹ侮辱类先验Ͻ

PϹ特征l侮辱类Ͻ,PϹ特征l非侮辱类Ͻ,PϹ侮辱类先验Ͻ=(trainData(v,classVec))
print(PϹ特征l非侮辱类Ͻ,PϹ特征l侮辱类Ͻ,PϹ侮辱类先验Ͻ)

Output:

[0.03389831 0.02542373 0.03389831 0.03389831 0.03389831 0.03389831
 0.02542373 0.02542373 0.03389831 0.03389831 0.04237288 0.03389831
 0.03389831 0.03389831 0.03389831 0.02542373 0.03389831 0.05084746
 0.02542373 0.02542373 0.03389831 0.02542373 0.03389831 0.02542373
 0.03389831 0.03389831 0.03389831 0.02542373 0.02542373 0.03389831
 0.02542373]

 [0.02631579 0.03508772 0.02631579 0.02631579 0.04385965 0.02631579
 0.03508772 0.03508772 0.02631579 0.02631579 0.03508772 0.02631579
 0.02631579 0.02631579 0.02631579 0.03508772 0.02631579 0.02631579
 0.03508772 0.03508772 0.02631579 0.03508772 0.02631579 0.03508772
 0.02631579 0.02631579 0.03508772 0.05263158 0.03508772 0.03508772
 0.04385965] 
0.5

Reference for Laplace smoothing concepts and examples: https://github.com/lzeqian/machinelearntry/tree/master/sklearn_bayes/%E6%8B%89%E6%99%AE%E6%8B%89%E6%96%AF%E5%B9%B3%E6%BB%91

  • PϹ特征l非侮辱类Ͻ is P(feature | non-insult class), the per-feature numerator of the conditional probability for the non-insult class. The fifth feature is dog, so P(dog | non-insult class) is 0.03389831.
  • PϹ特征l侮辱类Ͻ is P(feature | insult class), the per-feature numerator of the conditional probability for the insult class. The fifth feature is dog, so P(dog | insult class) is 0.04385965.
  • PϹ侮辱类先验Ͻ is the prior probability of the insult class (0.5 here). These values can be read directly off the arrays, as shown below.
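For example, the values quoted in the bullets above can be read off the trained arrays, using the feature names fn returned by CountVectorizer:

idx = np.where(fn == "dog")[0][0]   # column index of the feature 'dog'
print(PϹ特征l非侮辱类Ͻ[idx])           # ≈ 0.0339
print(PϹ特征l侮辱类Ͻ[idx])             # ≈ 0.0439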

Classify using the trained model

After obtaining P(feature | insult class), P(feature | non-insult class), and the prior P(insult class) (with P(non-insult class) = 1 − P(insult class)), we can use them, together with the word vector of a new piece of text, to determine its class.

'''
  Note that the conditional probability is the product, over the words that actually appear, of each
  word's probability under the given class. For example, with per-word probabilities
  you    are     a     dog       as               b
  0.001 0.0005  0.03   0.666     0.3             0.99

  and an input vector of
  1       1      1      1        0               0

  the conditional probability for the insult class is
  P(you|insult)*P(are|insult)*P(a|insult)*P(dog|insult)
  Multiplying many small decimals can underflow, so we take logarithms:
  log(P(you|insult)*P(are|insult)*P(a|insult)*P(dog|insult)) = log(P(you|insult))+log(P(are|insult))+log(P(a|insult))+log(P(dog|insult))
  To obtain this sum directly, take the log of all feature probabilities, multiply element-wise by the
  input vector, and sum the result.
'''
def classResult(wordVec,PϹ特征l侮辱类Ͻ,PϹ特征l非侮辱类Ͻ,PϹ侮辱类先验Ͻ):
    PϹ侮辱类Ͻ=np.sum(np.log(PϹ特征l侮辱类Ͻ)*wordVec)+np.log(PϹ侮辱类先验Ͻ)
    PϹ非侮辱类Ͻ=np.sum(np.log(PϹ特征l非侮辱类Ͻ)*wordVec)+np.log(1-PϹ侮辱类先验Ͻ)
    return 1 if PϹ侮辱类Ͻ>PϹ非侮辱类Ͻ else 0
# test text
text=["you are a dog"]
testX = vectorizer.transform(text)
testV=np.array(testX.toarray())
print(classResult(testV,PϹ特征l侮辱类Ͻ,PϹ特征l非侮辱类Ͻ,PϹ侮辱类先验Ͻ))

Note that multiplying many small decimals can underflow, which is why the log function is used. Theoretical reference: https://github.com/lzeqian/machinelearntry/blob/master/sklearn_bayes/%E4%B8%8B%E6%BA%A2%E5%87%BA/%E4%B9%98%E7%A7%AF%E7%BB%93%E6%9E%9C%E5%8F%96%E8%87%AA%E7%84%B6%E5%AF%B9%E6%95%B0%E9%98%B2%E6%AD%A2%E4%B8%8B%E6%BA%A2%E5%87%BA.png

4. Naive Bayes Classification with sklearn

The Naive Bayes classifier suits classification problems with modest feature dimensionality and plenty of training samples. It assumes that, given the class, all features are mutually independent, so building the classifier only requires estimating, class by class and feature by feature, the distribution of the training samples; this yields each class's conditional densities and greatly reduces the number of parameters to estimate. In other words, the joint probability of observing features x1, x2, …, xn given the sample's class equals the product of the probabilities of the individual features.

In scikit-learn, there are three Naive Bayes classification algorithm classes. They are GaussianNB, MultinomialNB and BernoulliNB.

  • GaussianNB is Naive Bayes where the features are assumed to follow a Gaussian (normal) distribution within each class.
  • MultinomialNB is Naive Bayes where the features are assumed to follow a multinomial distribution.
  • BernoulliNB is Naive Bayes where the features are assumed to follow a Bernoulli distribution.

If most of the sample features are continuous values, GaussianNB is the better choice.
If most of the sample features are multi-valued discrete counts, MultinomialNB is more appropriate.
If the sample features are binary discrete values or very sparse multi-valued discrete values, BernoulliNB should be used.
The model explained by hand earlier in this article corresponds to the multinomial case.

For news classification, which is a multi-class problem, we can use MultinomialNB() to complete the task; the use of the other two classes is not expanded on here and can be studied on your own. MultinomialNB assumes the conditional distribution of the features is multinomial, with the smoothed estimate

P(Xj = xjl | Y = Ck) = (m_kjl + λ) / (m_k + n_j·λ)

where P(Xj = xjl | Y = Ck) is the conditional probability that the j-th feature of the k-th class takes its l-th value, m_k is the number of training samples of class k, m_kjl is the number of class-k samples whose j-th feature equals xjl, n_j is the number of values the j-th feature can take, and λ is a constant greater than 0, usually 1, which gives Laplace smoothing (other values are also possible).

Next, let's take a look at the MultinomialNB class, which has only 3 parameters:

MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

Parameter description is as follows:

  • alpha: floating point optional parameter, the default is 1.0, which actually adds Laplacian smoothing, which is λ in the above formula. If this parameter is set to 0, no smoothing is added;
  • fit_prior: Boolean optional parameter, defaults to True. The Boolean parameter fit_prior indicates whether to consider the prior probability. If it is false, all sample category outputs have the same category prior probability. Otherwise, you can use the third parameter class_prior to enter the prior probability yourself, or do not enter the third parameter class_prior and let MultinomialNB calculate the prior probability from the training set samples. The prior probability at this time is P(Y=Ck)=mk /m. Where m is the total number of training set samples, and mk is the number of training set samples whose output is the kth category.
  • class_prior: Optional parameter, default is None.
In addition, MultinomialNB provides the usual estimator methods (fit, partial_fit, predict, predict_proba, predict_log_proba, score), the most important of which are discussed below.

An important method of MultinomialNB is partial_fit. It is generally used when the training set is too large to load into memory at once: we split the training set into several parts and call partial_fit repeatedly to learn the training set step by step, which is very convenient. GaussianNB and BernoulliNB have similar methods.

After fitting the data with fit or partial_fit, we can make predictions using one of three methods: predict, predict_log_proba, and predict_proba. predict is the most commonly used; it directly returns the predicted class for each test sample. predict_proba instead returns the predicted probability of each class for each test sample; the class with the largest probability is exactly the class returned by predict. predict_log_proba is similar to predict_proba but returns the logarithm of each class probability; again, the class with the largest log-probability is the class returned by predict. For further details, refer to the official documentation.
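A minimal sketch of partial_fit and the three prediction methods, using a small made-up count matrix (purely illustrative, not the news data used later):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy count data: 4 samples, 3 features, 2 classes
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 3],
              [1, 0, 4]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0)
# partial_fit must be told the full set of classes on the first call
clf.partial_fit(X[:2], y[:2], classes=np.array([0, 1]))
clf.partial_fit(X[2:], y[2:])

x_new = np.array([[1, 1, 2]])
print(clf.predict(x_new))            # predicted class
print(clf.predict_proba(x_new))      # probability of each class
print(clf.predict_log_proba(x_new))  # logarithm of each class probability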

Using sklearn to Classify Sina News

Data loading

Example source: https://cuijiahua.com/blog/2017/11/ml_5_bayes_2.html
The following are the categories of news classification

C000008	财经
C000010	IT
C000013	健康
C000014	体育
C000016	旅游
C000020	教育
C000022	招聘
C000023	文化
C000024	军事

The article data consists of multiple text files under each category directory (the original post showed a screenshot of the directory layout).

Data set download: https://github.com/lzeqian/machinelearntry/tree/master/sklearn_bayes/%E6%96%B0%E9%97%BB%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE
To load the data set, the words of each article have to become individual features, so the text must first be segmented into words; jieba is used for this.

Word segmentation and cleanup

With the data set ready, let's get straight to the point: split the Chinese text into words with the following code:

import os
import jieba
'''
  判断字符串是否为数字,清理包括:1,1.5,023,34%等特别的数字字符串
'''
def isNumber(num):
    if(num.isdigit() or num.isnumeric() or num.isdecimal()):
        return True
    if num.endswith('%'):
        num_str = num[:-1]  # 去掉百分号
        return isNumber(num_str)
    try:
        _ = float(num)
        return True
    except ValueError:
        return False
    return False
'''
  将某个字符串通过jieba分词后通过空格拼接,因为CountVectorizer统计词频传入的是带空格的字符串
'''
def wordToVec(word):
        word_cut = jieba.cut(word, cut_all = False) 
        filtered_words = filter(lambda w: w is not None and len(w.strip()) > 0 and not isNumber(w.strip()), list(word_cut))  # 过滤掉空字符串
        word_list=" ".join(filtered_words)
        return word_list
'''
 读取新闻分类数据/Sample目录下的所有数据
'''
def TextProcessing(folder_path):
    folder_list = os.listdir(folder_path)                        #查看folder_path下的文件
    data_list = []                                                #训练集
    class_list = []
 
    #遍历每个子文件夹
    for folder in folder_list:
        new_folder_path = os.path.join(folder_path, folder)        #根据子文件夹,生成新的路径
        files = os.listdir(new_folder_path)                        #存放子文件夹下的txt文件的列表
        j = 1
        #遍历每个txt文件
        for file in files:
            if j > 100:                                            #每类txt样本数最多100个
                break
            with open(os.path.join(new_folder_path, file), 'r', encoding = 'utf-8') as f:    #打开txt文件
                raw = f.read()
            word_list=wordToVec(raw)
            data_list.append(word_list)
            class_list.append(folder)
            j += 1
    print("词条行:",data_list)
    print("分类:",class_list)
    return data_list,class_list
    

Vectorize with CountVectorizer and, for the first 50 feature columns, print the words sorted by how often they occur:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
if __name__ == '__main__':
    #文本预处理
    folder_path = './新闻分类数据/Sample'                #训练集存放地址
    data_list1,class_list1=TextProcessing(folder_path)
    stop_words="";
    with open(os.path.join("./新闻分类数据", "stopwords_cn.txt"), 'r', encoding = 'utf-8') as f:    #打开txt文件
                stop_words = f.read()
    stop_words_array=stop_words.split("\n") 
    #除了停止词外,单个字母的都会被自动过滤掉
    vectorizer = CountVectorizer(stop_words=stop_words_array)
    X = vectorizer.fit_transform(data_list1)
    fn=np.array(vectorizer.get_feature_names())
    print("特征列:",fn)
    v=np.array(X.toarray())
    print("词条向量:\n",v)
    
    top=50
    wordcount=v.sum(axis=0)[0:top]
    print("获取单词出现次数:",wordcount)
    print("排序索引:",np.argsort(wordcount)[::-1])
    print("排序特征:",fn[np.argsort(wordcount)[::-1]])
    print("排序词频:",wordcount[np.argsort(wordcount)[::-1]])

Output:

特征列: ['04vs' '110min' '125min' ... '龙岗' '龙江' '龙珠']
词条向量:
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
获取单词出现次数: [ 1  1  1  1  1  1  2  1  1  1  1  1  3  2  1  2  6  6  5  1  1  1  1  1
  3  1  1  1  1  2  4  2  1  1  1  7  2  1  1  1  2  1  1  1  1  1  1  1
 10  5]
排序索引: [48 35 16 17 49 18 30 12 24 40  6 15 13 31 29 36  4  5  3  7 44  8  9 10
 11 47  2 14  1 46 45 20 19 39 38 37 41 34 33 32 42 28 27 26 25 43 23 22
 21  0]
排序特征: ['ceo' 'bbc' 'ak' 'an' 'cfo' 'and' 'ax' 'ac' 'armed' 'bittorrent' '3g'
 'ah' 'academic' 'a股' 'aw' 'bbn' '3d' '3dmax' '16i' '5140i' 'brings'
 '80mb' '95min' 'ab' 'abc' 'cbs' '125min' 'adj' '110min' 'career'
 'brothers' 'anti' 'answer' 'bennett' 'begins' 'be' 'bjeea' 'band' 'b09'
 'b06' 'bot' 'availwidth' 'availheight' 'assessment' 'army' 'bravo' 'area'
 'are' 'applications' '04vs']
排序词频: [10  7  6  6  5  5  4  3  3  2  2  2  2  2  2  2  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1]

Split the data into a training set and a test set. Note that the data must be shuffled first: the samples were read category by category, i.e. sorted by class, so taking 20% without shuffling could remove an entire category from the training set, leaving that class untrained and lowering the accuracy.

    from sklearn.utils import shuffle
    from sklearn.model_selection import train_test_split
    X, y = shuffle(v, class_list1, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Test the accuracy of the classifier

    classifier = MultinomialNB().fit(X_train, y_train)
    test_accuracy = classifier.score(X_test, y_test)
    print(test_accuracy)

Output: 0.7222222222222222
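As an optional experiment (not part of the original run), the CountVectorizer step can be swapped for the TfidfVectorizer introduced earlier; whether this improves on the 0.72 accuracy depends on the data:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # same pipeline as above, but weighting words by TF-IDF instead of raw counts
    tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words_array)
    Xt = tfidf_vectorizer.fit_transform(data_list1).toarray()
    Xt, yt = shuffle(Xt, class_list1, random_state=42)
    Xt_train, Xt_test, yt_train, yt_test = train_test_split(Xt, yt, test_size=0.2, random_state=42)
    tfidf_classifier = MultinomialNB().fit(Xt_train, yt_train)
    print(tfidf_classifier.score(Xt_test, yt_test))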

Finally, enter an arbitrary string and predict its category:

    v1 = vectorizer.transform([wordToVec("身体是革命的本钱")]).toarray()
    print(classifier.predict(v1))

Output: ['C000020'], i.e. Education


Origin blog.csdn.net/liaomin416100569/article/details/129282823