Naive Bayes Algorithm (1)

1.1 Brief introduction to Naive Bayes

Naive Bayes (NB) is one of the most widely used classification algorithms. It handles both binary and multi-class classification. Early spam filtering, for example, was built on Naive Bayes classifiers.

The Naive Bayes assumption (also called the conditional independence assumption) states that, given the class y, the features are mutually independent.

The purpose of the assumption is to simplify computation.

Although this simplification somewhat reduces the classification accuracy of the Bayesian approach, in real application scenarios it greatly reduces the computational complexity of the Bayesian method.
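
Formally, for a sample with features $x_1, x_2, \dots, x_n$ and class $y$, the conditional independence assumption can be written as:

$$P(x_1, x_2, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$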

1.2 Review of basic probability


1. Definition of probability

  • Probability measures the likelihood that an event occurs
    • e.g., the probability that a tossed coin lands heads up
  • P(X): takes values in [0, 1]

2. Case: Determine how much the goddess likes you

Before defining these two kinds of probability (joint and conditional), let's use an example to calculate some results:

| # | Profession | Body shape | Goddess likes? |
|---|------------|------------------|-----|
| 1 | Programmer | Overweight | No |
| 2 | Product manager | Well-proportioned | Yes |
| 3 | Programmer | Well-proportioned | Yes |
| 4 | Programmer | Overweight | Yes |
| 5 | Designer | Well-proportioned | No |
| 6 | Designer | Overweight | No |
| 7 | Product manager | Well-proportioned | Yes |

Questions are as follows:

  1. What is the probability that the goddess likes someone?
  2. What is the probability of being a programmer and having a well-proportioned body shape?
  3. Given that the goddess likes someone, what is the probability that he is a programmer?
  4. Given that the goddess likes someone, what is the probability that he is a programmer and overweight?

The calculated results are:

P(likes) = 4/7
P(programmer, well-proportioned) = 1/7 (joint probability)
P(programmer | likes) = 2/4 = 1/2 (conditional probability)
P(programmer, overweight | likes) = 1/4

3. Joint probability, conditional probability and mutual independence

  • Joint probability: the probability that several events all occur at the same time
    • Written as: P(A, B)
  • Conditional probability: the probability that event A occurs given that another event B has already occurred
    • Written as: P(A|B)
  • Mutual independence: if P(A, B) = P(A)P(B), then events A and B are said to be mutually independent
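
To make these definitions concrete, here is a minimal Python sketch that re-derives the goddess-example results by counting rows of the table above (the row data mirrors that table; this is an illustration, not library code):

```python
# The seven samples from the goddess example: (profession, body shape, likes)
samples = [
    ("programmer", "overweight", False),
    ("product manager", "well-proportioned", True),
    ("programmer", "well-proportioned", True),
    ("programmer", "overweight", True),
    ("designer", "well-proportioned", False),
    ("designer", "overweight", False),
    ("product manager", "well-proportioned", True),
]
n = len(samples)

# P(likes): marginal probability
p_likes = sum(likes for _, _, likes in samples) / n                      # 4/7

# P(programmer, well-proportioned): joint probability
p_joint = sum(prof == "programmer" and shape == "well-proportioned"
              for prof, shape, _ in samples) / n                         # 1/7

# P(programmer | likes): conditional probability, counted within the liked rows
liked = [s for s in samples if s[2]]
p_cond = sum(prof == "programmer" for prof, _, _ in liked) / len(liked)  # 2/4 = 1/2

# Independence check: is P(A, B) == P(A) * P(B) on this table?
p_prog = sum(prof == "programmer" for prof, _, _ in samples) / n
p_shape = sum(shape == "well-proportioned" for _, shape, _ in samples) / n
print(p_joint, p_prog * p_shape)  # 1/7 vs 12/49: not equal, so not independent
```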

4. Bayes' formula

4.1 Introduction to formulas

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

It will be much clearer if we change the expression form, as follows:

$$P(\text{category} \mid \text{feature}) = \frac{P(\text{feature} \mid \text{category}) \, P(\text{category})}{P(\text{feature})}$$

P(category | feature) is exactly the quantity we want; once we can compute it, our task is complete.

According to the Naive Bayes assumption: P(feature | category) = P(feature 1 | category) × P(feature 2 | category) × … × P(feature n | category).
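
Combining Bayes' formula with this factorization, and noting that the denominator P(feature) is the same for every category, the Naive Bayes classifier simply picks the category with the largest numerator:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$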

4.2 Example analysis

Let me start with an example problem.

The given data is as follows:

(table of 12 training samples; each sample has four features, handsome or not, good personality or not, tall or short, motivated or not, and a label: marry / not marry)

Now the question: a boy and a girl are friends, and the boy wants to propose to the girl. The boy's four features are: not handsome, bad personality, short, and not motivated. Should the girl marry him or not?

This is a typical classification problem. Converted into a mathematical problem, it means comparing P(marry | not handsome, bad personality, short, not motivated) with P(not marry | not handsome, bad personality, short, not motivated); whichever probability is larger gives the answer, marry or not marry!

Here we plug in Bayes' formula:

$$P(\text{marry} \mid \text{not handsome, bad personality, short, not motivated}) = \frac{P(\text{not handsome, bad personality, short, not motivated} \mid \text{marry}) \, P(\text{marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

We need P(marry | not handsome, bad personality, short, not motivated). We cannot read it off directly, but Bayes' formula transforms it into three quantities that are easy to estimate: P(not handsome, bad personality, short, not motivated | marry), P(not handsome, bad personality, short, not motivated), and P(marry). (Why these can be estimated is explained below.) Converting the unknown quantity into quantities we can compute is exactly what solves our problem!

The probability of marrying is:

$$P(\text{marry} \mid \text{not handsome, bad personality, short, not motivated}) = \frac{P(\text{not handsome, bad personality, short, not motivated} \mid \text{marry}) \cdot P(\text{marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

$$= \frac{P(\text{not handsome} \mid \text{marry}) \cdot P(\text{bad personality} \mid \text{marry}) \cdot P(\text{short} \mid \text{marry}) \cdot P(\text{not motivated} \mid \text{marry}) \cdot P(\text{marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

The probability of not marrying is:

$$P(\text{not marry} \mid \text{not handsome, bad personality, short, not motivated}) = \frac{P(\text{not handsome, bad personality, short, not motivated} \mid \text{not marry}) \cdot P(\text{not marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

$$= \frac{P(\text{not handsome} \mid \text{not marry}) \cdot P(\text{bad personality} \mid \text{not marry}) \cdot P(\text{short} \mid \text{not marry}) \cdot P(\text{not motivated} \mid \text{not marry}) \cdot P(\text{not marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

We notice that the two posteriors share exactly the same denominator, so to compare P(marry | features) with P(not marry | features) we only need to compare the numerators.

Incidentally, factorizing the likelihood into a product of per-feature probabilities, as above, is precisely the conditional independence assumption; this is where the "naive" in Naive Bayes comes from.

So how are these three quantities obtained?

They are obtained by counting in the known training data. The detailed solution of this example is given below.

Recall that our required formula is as follows:

$$P(\text{marry} \mid \text{not handsome, bad personality, short, not motivated}) = \frac{P(\text{not handsome, bad personality, short, not motivated} \mid \text{marry}) \, P(\text{marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

Then I only need to find P(not handsome, bad personality, short, not motivated | marry) and P(marry). Let me work out these probabilities one by one; comparing the two class scores at the end gives the final result.

Since P(not handsome, bad personality, short, not motivated | marry) = P(not handsome | marry) × P(bad personality | marry) × P(short | marry) × P(not motivated | marry), I just need to count each of these conditional probabilities separately to obtain the left-hand side!

Let’s organize the above formula as follows:

$$P(\text{marry} \mid \text{not handsome, bad personality, short, not motivated}) = \frac{P(\text{not handsome, bad personality, short, not motivated} \mid \text{marry}) \cdot P(\text{marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

$$= \frac{P(\text{not handsome} \mid \text{marry}) \cdot P(\text{bad personality} \mid \text{marry}) \cdot P(\text{short} \mid \text{marry}) \cdot P(\text{not motivated} \mid \text{marry}) \cdot P(\text{marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

Below I compute each statistic one by one. (When the amount of data is large, by the law of large numbers the relative frequency approximates the probability; this is just a small example, so we simply count.)

P(marry) = ?

Counting the training samples: 6 of the 12 samples are labeled "marry", so P(marry) = 6/12 = 1/2.

P(not handsome | marry) = ?

Of the 6 "marry" samples, 3 are "not handsome", so P(not handsome | marry) = 3/6 = 1/2.

P(bad personality | marry) = ?

Of the 6 "marry" samples, 1 has a bad personality, so P(bad personality | marry) = 1/6.

P(short | marry) = ?

Of the 6 "marry" samples, 1 is short, so P(short | marry) = 1/6.

P(not motivated | marry) = ?

Of the 6 "marry" samples, 1 is not motivated, so P(not motivated | marry) = 1/6.

At this point, every term needed for P(not handsome, bad personality, short, not motivated | marry) has been found; we just substitute the values.

$$P(\text{marry} \mid \text{not handsome, bad personality, short, not motivated}) = \frac{P(\text{not handsome, bad personality, short, not motivated} \mid \text{marry}) \cdot P(\text{marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

$$= \frac{P(\text{not handsome} \mid \text{marry}) \cdot P(\text{bad personality} \mid \text{marry}) \cdot P(\text{short} \mid \text{marry}) \cdot P(\text{not motivated} \mid \text{marry}) \cdot P(\text{marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

∝ P(not handsome, bad personality, short, not motivated | marry) × P(marry)

= P(not handsome | marry) × P(bad personality | marry) × P(short | marry) × P(not motivated | marry) × P(marry)
= 1/2 × 1/6 × 1/6 × 1/6 × 1/2 = 1/864

Next we compute P(not marry | not handsome, bad personality, short, not motivated) by exactly the same method. To aid understanding, I will walk through it again. First the formula:

$$P(\text{not marry} \mid \text{not handsome, bad personality, short, not motivated}) = \frac{P(\text{not handsome, bad personality, short, not motivated} \mid \text{not marry}) \cdot P(\text{not marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

$$= \frac{P(\text{not handsome} \mid \text{not marry}) \cdot P(\text{bad personality} \mid \text{not marry}) \cdot P(\text{short} \mid \text{not marry}) \cdot P(\text{not motivated} \mid \text{not marry}) \cdot P(\text{not marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

Below I will also perform statistical calculations one by one.

P(not marry) = ?

Counting the training samples: 6 of the 12 samples are labeled "not marry", so P(not marry) = 6/12 = 1/2.

P(not handsome | not marry) = ?

Of the 6 "not marry" samples, 1 is "not handsome", so P(not handsome | not marry) = 1/6.

P(bad personality | not marry) = ?

Of the 6 "not marry" samples, 3 have a bad personality, so P(bad personality | not marry) = 3/6 = 1/2.

P(short | not marry) = ?

All 6 of the "not marry" samples are short, so P(short | not marry) = 6/6 = 1.

P(not motivated | not marry) = ?

Of the 6 "not marry" samples, 3 are not motivated, so P(not motivated | not marry) = 3/6 = 1/2.

Then according to the formula:

$$P(\text{not marry} \mid \text{not handsome, bad personality, short, not motivated}) = \frac{P(\text{not handsome, bad personality, short, not motivated} \mid \text{not marry}) \cdot P(\text{not marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

$$= \frac{P(\text{not handsome} \mid \text{not marry}) \cdot P(\text{bad personality} \mid \text{not marry}) \cdot P(\text{short} \mid \text{not marry}) \cdot P(\text{not motivated} \mid \text{not marry}) \cdot P(\text{not marry})}{P(\text{not handsome, bad personality, short, not motivated})}$$

∝ P(not handsome, bad personality, short, not motivated | not marry) × P(not marry)

= P(not handsome | not marry) × P(bad personality | not marry) × P(short | not marry) × P(not motivated | not marry) × P(not marry)
= 1/6 × 1/2 × 1 × 1/2 × 1/2 = 1/48

Obviously 1/48 > 1/864, so

P(not marry | not handsome, bad personality, short, not motivated) > P(marry | not handsome, bad personality, short, not motivated)

So, based on the Naive Bayes algorithm, the answer we give the girl is: do not marry!
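
As a sanity check, here is a minimal Python sketch that reproduces the comparison, plugging in the conditional probabilities counted above (only the numerators are compared, since the shared denominator cancels):

```python
# Class priors and per-feature conditional probabilities counted from the 12 samples
p_marry = 6 / 12
likelihood_marry = {"not handsome": 3/6, "bad personality": 1/6,
                    "short": 1/6, "not motivated": 1/6}

p_not_marry = 6 / 12
likelihood_not_marry = {"not handsome": 1/6, "bad personality": 3/6,
                        "short": 6/6, "not motivated": 3/6}

features = ["not handsome", "bad personality", "short", "not motivated"]

# Numerators of the two posteriors (the shared denominator cancels out)
score_marry = p_marry
score_not_marry = p_not_marry
for f in features:
    score_marry *= likelihood_marry[f]
    score_not_marry *= likelihood_not_marry[f]

print(score_marry)      # 1/864 ≈ 0.00116
print(score_not_marry)  # 1/48  ≈ 0.02083
print("marry" if score_marry > score_not_marry else "not marry")  # not marry
```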

4.3 Article classification calculation

Requirement: using the first four training samples (documents), determine whether the fifth document belongs to the class China.

| Document | Words | Belongs to class China? |
|----------|-------|-------------------------|
| 1 (train) | Chinese Beijing Chinese | Yes |
| 2 (train) | Chinese Chinese Shanghai | Yes |
| 3 (train) | Chinese Macao | Yes |
| 4 (train) | Tokyo Japan Chinese | No |
| 5 (test) | Chinese Chinese Chinese Tokyo Japan | ? |

  • Calculation:

P(C | Chinese, Chinese, Chinese, Tokyo, Japan)
= P(Chinese, Chinese, Chinese, Tokyo, Japan | C) × P(C) / P(Chinese, Chinese, Chinese, Tokyo, Japan)
∝ P(Chinese | C)^3 × P(Tokyo | C) × P(Japan | C) × P(C)

# The denominator is the same for both classes, so it need not be computed.

# First, compute the probabilities for the China class:
P(Chinese | C) = 5/8
P(Tokyo | C) = 0/8
P(Japan | C) = 0/8

# Next, compute the probabilities for the not-China class:
P(Chinese | ¬C) = 1/3
P(Tokyo | ¬C) = 1/3
P(Japan | ¬C) = 1/3

Because P(Tokyo | C) = P(Japan | C) = 0, the whole product for the China class would collapse to zero, so we apply the Laplace smoothing formula:

$$P(F_i \mid C) = \frac{N_i + \alpha}{N + \alpha m}$$

where $N_i$ is the number of times word $F_i$ appears in documents of class $C$, $N$ is the total number of words in class $C$, $\alpha$ is the smoothing coefficient (usually 1), and $m$ is the number of distinct words in the training set.

In this example, m is 6: there are 6 distinct words in the training set, and these form the vocabulary. If a word appears in the test set but not in the vocabulary, it is simply ignored in the calculation.

# Decide whether this document belongs to the China class:

First, the smoothed probabilities for the China class:
    P(Chinese | C) = 5/8 --> (5+1)/(8+6) = 6/14
    P(Tokyo | C) = 0/8 --> (0+1)/(8+6) = 1/14
    P(Japan | C) = 0/8 --> (0+1)/(8+6) = 1/14

Next, the smoothed probabilities for the not-China class:
    P(Chinese | ¬C) = 1/3 --> (1+1)/(3+6) = 2/9
    P(Tokyo | ¬C) = 1/3 --> (1+1)/(3+6) = 2/9
    P(Japan | ¬C) = 1/3 --> (1+1)/(3+6) = 2/9

With the priors P(C) = 3/4 and P(¬C) = 1/4:

P(C | d5) ∝ (6/14)^3 × (1/14) × (1/14) × (3/4) ≈ 0.0003
P(¬C | d5) ∝ (2/9)^3 × (2/9) × (2/9) × (1/4) ≈ 0.0001

Since 0.0003 > 0.0001, the fifth document is classified as China.

1.3 Case: Sentiment Analysis of Product Reviews

Learning objective

  • Implementing sentiment analysis of product reviews using Naive Bayes API

1. API introduction

    • sklearn.naive_bayes.MultinomialNB(alpha = 1.0)
      • Naive Bayes classifier for multinomial models
      • alpha: Laplace smoothing coefficient
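
A minimal usage sketch with a toy term-frequency matrix (the data is made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term-frequency matrix: 4 documents x 3 words, with binary labels
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 2],
              [0, 2, 1]])
y = np.array([1, 1, 0, 0])

clf = MultinomialNB(alpha=1.0)  # alpha: Laplace smoothing coefficient
clf.fit(X, y)
print(clf.predict(np.array([[1, 0, 1]])))        # predicted class
print(clf.predict_proba(np.array([[1, 0, 1]])))  # class probabilities
```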

2. Sentiment analysis of product reviews

(preview of the book-review dataset: a 内容 (content) column holding the review text and a 评价 (rating) column marked 好评 / 差评)

2.1 Step Analysis

  • 1) Get data
  • 2) Basic data processing
    • 2.1) Take out the content column and analyze the data
    • 2.2) Judgment criteria
    • 2.3) Select stop words
    • 2.4) Process the content and convert it into a standard format
    • 2.5) Count the number of words
    • 2.6) Prepare training set and test set
  • 3) Model training
  • 4) Model evaluation

2.2 Code implementation

```python
import pandas as pd
import numpy as np
import jieba
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
```
  • 1) Get data
```python
# Load the data
data = pd.read_csv("./data/书籍评价.csv", encoding="gbk", index_col=0)
data
```
  • 2) Basic data processing
```python
# 2.1) Take out the content column and inspect the data
content = data["内容"]
content.head()

# 2.2) Define the label -- 1 = positive review (好评); 0 = negative review (差评)
data.loc[data.loc[:, '评价'] == "好评", "评论标号"] = 1  # mark positive reviews as 1
data.loc[data.loc[:, '评价'] == '差评', '评论标号'] = 0  # mark negative reviews as 0

# data.head()


# 2.3) Select stop words
# Load the stop-word list
stopwords = []
with open('./data/stopwords.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for tmp in lines:
        line = tmp.strip()
        stopwords.append(line)
# stopwords  # inspect the resulting list

# Deduplicate the stop-word list
stopwords = list(set(stopwords))  # deduplicated, kept as a list
print(stopwords)

# 2.4) Process the content column into the standard space-separated format
comment_list = []
for tmp in content:
    # Segment the text; cut_all defaults to False, so jieba.lcut uses accurate mode
    seg_list = jieba.lcut(tmp)
    seg_str = ' '.join(i for i in seg_list if i != ' ')  # join the tokens into one string
    comment_list.append(seg_str)  # collect the documents as a list of strings
# print(comment_list)  # inspect comment_list

# 2.5) Count word frequencies
# CountVectorizer converts the documents into a term-frequency matrix
con = CountVectorizer(stop_words=stopwords)  # the default tokenizer only keeps tokens of two or more characters
# Count the words
X = con.fit_transform(comment_list)  # fit_transform counts the occurrences of each word
name = con.get_feature_names()  # returns the vocabulary (use get_feature_names_out() in newer scikit-learn)
print(X.toarray())  # toarray() shows the term-frequency matrix
print(name)

# 2.6) Prepare the training and test sets
# Use the first 10 rows as the training set and the last 3 rows as the test set
labels = data['评论标号'].values
x_train = X.toarray()[:10, :]
y_train = labels[:10]
# Prepare the test set
x_test = X.toarray()[10:, :]
y_test = labels[10:]
```
  • 3) Model training
```python
# Build the Naive Bayes classifier
mb = MultinomialNB(alpha=1.0)  # alpha is optional (default 1.0): the Laplace/Lidstone smoothing parameter
# Train on the training data
mb.fit(x_train, y_train)
# Predict on the test data
y_predict = mb.predict(x_test)
# Show predicted vs. true labels
print('Predicted:', y_predict)
print('True:', y_test)
```
  • 4) Model evaluation
```python
mb.score(x_test, y_test)
```
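
For a closer look at the three test predictions, per-class precision, recall, and the confusion matrix can also be printed. A minimal sketch using standard `sklearn.metrics` calls, reusing `y_test` and `y_predict` from above:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision / recall / F1 for the test reviews
print(classification_report(y_test, y_predict))
# Rows: true classes; columns: predicted classes
print(confusion_matrix(y_test, y_predict))
```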

