[Study Notes] [Machine Learning] 9. Naive Bayes (Basics of Probability, Joint Probability, Conditional Probability, Bayesian Formula, Sentiment Analysis)

  1. video link
  2. Dataset download address: no download required

Learning objectives:
  • Explain conditional probability and joint probability
  • Explain the Bayes formula and its relation to the feature-independence assumption
  • Memorize the Bayes formula
  • Know the Laplace smoothing coefficient
  • Apply the Bayes formula to compute probabilities
  • Use Naive Bayes to perform sentiment analysis on product reviews

1. Introduction to Naive Bayes Algorithm

The Naive Bayes algorithm is mainly used for classification.

Q : So what is the difference from the previously learned KNN algorithm?
A : Naive Bayes algorithm and KNN algorithm are both classification algorithms, but there are some differences between them:

  • The Naive Bayes algorithm is a classification method based on Bayes' theorem and the assumption of conditional independence between features
  • The KNN algorithm is an instance-based learning method that classifies by measuring the distances between feature vectors


Naive Bayes judges the category according to probability, which is different from the KNN and decision-tree approaches we learned before.

The KNN algorithm classifies by measuring the distance between different feature values. Its idea is: if most of the k most similar samples in the feature space (that is, the nearest neighbors in the feature space) of a sample belong to a certain category, then the sample also belongs to this category.

The decision tree algorithm is to classify by constructing a decision tree. It starts from the root node, tests a certain feature of the instance, and assigns the instance to its child nodes according to the test results; each child node corresponds to a value of the feature. Each child node is then tested in this manner until either all instances are correctly classified or there are no more features to choose from.

2. Probability Basics Review

Learning objectives:

  • Understand the concepts of joint probability, conditional probability, and mutual independence
  • Know the Bayes formula
  • Know the Laplace smoothing coefficient

2.1 Definition of probability

Probability is a numerical value used to measure the likelihood of an event occurring. It is usually represented by a real number between [0, 1], where 0 means that the event cannot happen, and 1 means that the event must happen. There are many ways to define probability, the most common of which are classical probability and frequency probability.

  • Classical probability: in a finite sample space where every elementary event is equally likely, the probability of an event equals the number of elementary events it contains divided by the total number of elementary events in the sample space.
  • Frequency probability: in a large number of repeated experiments, when the frequency of an event stabilizes, that stable value is the probability of the event.

In short: Probability is the likelihood that something will happen .

For example, when a coin is tossed, the probability that it lands heads up is $P(X)$, whose value lies in $[0, 1]$.
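
The frequency view can be illustrated with a quick simulation. A minimal sketch (the number of flips is arbitrary):

import random

# Estimate P(heads) by frequency: flip a fair coin many times and watch
# the observed frequency stabilise around the true probability 0.5.
random.seed(0)

n_flips = 100_000
heads = sum(random.random() < 0.5 for _ in range(n_flips))

print("Observed frequency of heads:", heads / n_flips)  # close to 0.5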

2.2 Case: Judging girls' liking for boys

Before talking about these two probabilities, let's use an example to calculate some results:

Sample   Profession   Figure              Liked by girls?
1        programmer   overweight          dislike
2        product      well-proportioned   like
3        programmer   well-proportioned   like
4        programmer   overweight          like
5        artist       well-proportioned   dislike
6        artist       overweight          dislike
7        product      well-proportioned   like

Questions are as follows:

  1. The probability of being liked by a girl?
  2. What is the probability of being a programmer by profession and having a well-proportioned body shape?
  3. Under the condition that girls like it, what is the probability of being a programmer?
  4. Under the condition that girls like it, what is the probability of being a programmer and being overweight?

The result of the calculation is:

$$
\begin{aligned}
P(\text{like}) &= \frac{\text{number of people who are liked}}{\text{total number of people}} = \frac{4}{7} \\
P(\text{programmer},\ \text{well-proportioned})_{\text{joint probability}} &= \frac{\text{number satisfying both conditions}}{\text{total number of people}} = \frac{1}{7} \\
P(\text{programmer} \mid \text{like})_{\text{conditional probability}} &= \frac{\text{number who are liked and are programmers}}{\text{number who are liked}} = \frac{2}{4} = \frac{1}{2} \\
P(\text{programmer},\ \text{overweight} \mid \text{like})_{\text{joint conditional probability}} &= \frac{\text{number who are liked, are programmers and are overweight}}{\text{number who are liked}} = \frac{1}{4}
\end{aligned}
$$

  • Joint Probability : Indicates the probability of two events occurring at the same time.
  • Conditional Probability : The probability of an event occurring given the occurrence of another event.
  • Joint conditional probability : the probability that two events occur at the same time, given that another event has occurred.

$P(\text{like})$ above is a simple (marginal) probability.
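
These four values can also be reproduced programmatically. A small sketch using pandas (the English column names are our own, chosen to mirror the table above):

import pandas as pd

# The 7-sample table from this section.
df = pd.DataFrame({
    "job":   ["programmer", "product", "programmer", "programmer", "artist", "artist", "product"],
    "shape": ["overweight", "well-proportioned", "well-proportioned", "overweight",
              "well-proportioned", "overweight", "well-proportioned"],
    "like":  ["dislike", "like", "like", "like", "dislike", "dislike", "like"],
})

p_like = (df["like"] == "like").mean()                                                    # 4/7
p_prog_fit = ((df["job"] == "programmer") & (df["shape"] == "well-proportioned")).mean()  # 1/7

liked = df[df["like"] == "like"]
p_prog_given_like = (liked["job"] == "programmer").mean()                                 # 2/4 = 1/2
p_prog_over_given_like = ((liked["job"] == "programmer") &
                          (liked["shape"] == "overweight")).mean()                        # 1/4

print(p_like, p_prog_fit, p_prog_given_like, p_prog_over_given_like)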


Thinking question : When Xiao Ming is a product manager and is overweight, how to calculate the probability that Xiao Ming is liked by a girl?

That is, $P(\text{like} \mid \text{product},\ \text{overweight}) = \ ?$

In this example we cannot compute the probability that Xiao Ming is liked, because there is no information about Xiao Ming in the table. We can only compute probabilities from the data given in the table; we cannot directly read off the probability for someone who is not in it.

Moreover, in the given table everyone whose profession is product is well-proportioned; there is no overweight product manager at all.

At this point we need to use Naive Bayesian to solve it. Before explaining the Bayesian formula, first review the concepts of joint probability, conditional probability and mutual independence.

2.3 Joint probability, conditional probability and mutual independence

  • Joint probability : the probability that multiple events (conditions) all hold at the same time.
    • Written as: $P(A, B)$
  • Conditional probability : the probability that event A occurs given that another event B has already occurred.
    • Written as: $P(A \mid B)$
  • Mutually independent : if $P(A, B) = P(A)P(B)$, then events A and B are said to be mutually independent (a quick numerical check on the table from section 2.2 follows below).
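
As a quick numerical check on the 7-sample table above (a sketch; the two events are chosen for illustration):

# A = "profession is programmer", B = "figure is overweight" in the 7-sample table.
p_a = 3 / 7      # programmers: samples 1, 3, 4
p_b = 3 / 7      # overweight:  samples 1, 4, 6
p_ab = 2 / 7     # programmer AND overweight: samples 1, 4

print(p_ab, p_a * p_b)   # 2/7 vs 9/49 -> not equal, so A and B are not independent in this sample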

2.4 Bayesian formula

2.4.1 Formula introduction

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Applied to document classification, the formula becomes:

$$P(C \mid W) = \frac{P(W \mid C)\,P(C)}{P(W)}$$

where:

  • $W$ is the feature vector of the given document (word-frequency statistics provided by the document to be classified)
  • $C$ is the document category

2.4.2 Case calculation

Then the thinking problem can be solved by applying the Bayesian formula:

$$
\begin{aligned}
P(C \mid W) &= \frac{P(W \mid C)\,P(C)}{P(W)} \\
P(\text{like} \mid \text{product},\ \text{overweight}) &= \frac{P(\text{product},\ \text{overweight} \mid \text{like})\,P(\text{like})}{P(\text{product},\ \text{overweight})}
\end{aligned}
$$

In the formula above, $P(\text{product},\ \text{overweight} \mid \text{like})$ and $P(\text{product},\ \text{overweight})$ are both 0, so the result cannot be computed. This is because our sample is too small to be representative.

In real life there are certainly people who are product managers and overweight, so $P(\text{product},\ \text{overweight})$ cannot be 0. Moreover, the event "the profession is product manager" and the event "being overweight" are generally considered independent, yet with only 7 samples $P(\text{product},\ \text{overweight}) = P(\text{product})\,P(\text{overweight})$ does not hold. Naive Bayes can help us solve this problem.

Naive Bayes, simply put, is the Bayes formula combined with the assumption that features are mutually independent; it is called "naive" precisely because of this independence assumption. Solving the thinking question in the Naive Bayes way then gives:

$$
\begin{aligned}
P(\text{product},\ \text{overweight}) &= P(\text{product}) \times P(\text{overweight}) = \frac{2}{7} \times \frac{3}{7} = \frac{6}{49} \\
P(\text{product},\ \text{overweight} \mid \text{like}) &= P(\text{product} \mid \text{like}) \times P(\text{overweight} \mid \text{like}) = \frac{1}{2} \times \frac{1}{4} = \frac{1}{8} \\
P(\text{like} \mid \text{product},\ \text{overweight}) &= \frac{P(\text{product},\ \text{overweight} \mid \text{like}) \times P(\text{like})}{P(\text{product},\ \text{overweight})} = \frac{\frac{1}{8} \times \frac{4}{7}}{\frac{6}{49}} = \frac{7}{12}
\end{aligned}
$$
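
The same result can be checked with a few lines of arithmetic. A minimal sketch of the hand calculation above:

# Naive Bayes estimate of P(like | product, overweight) from the 7-sample table,
# assuming "profession" and "figure" are independent given the class.
p_like = 4 / 7
p_product_given_like = 2 / 4        # both product managers in the table are liked
p_overweight_given_like = 1 / 4     # one of the four liked people is overweight

# Naive assumption for the denominator: P(product, overweight) = P(product) * P(overweight)
p_product = 2 / 7
p_overweight = 3 / 7

posterior = (p_product_given_like * p_overweight_given_like * p_like) / (p_product * p_overweight)
print(posterior)   # 7/12 ≈ 0.583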

If we now apply this formula to the scenario of article classification, it can be understood as:

$$P(C \mid F_1, F_2, \ldots) = \frac{P(F_1, F_2, \ldots \mid C)\,P(C)}{P(F_1, F_2, \ldots)}$$

where $C$ can be any of the different categories.

The formula is divided into three parts:

  1. $P(C)$: the probability of each document category (number of documents of that category / total number of documents)
  2. $P(W \mid C)$: the probability of the features (words appearing in the document to be classified) given the category
    • Calculation: $P(F_1 \mid C) = \frac{N_i}{N}$ (computed on the training documents)
      • $N_i$ is the number of times word $F_1$ appears in all documents of category $C$
      • $N$ is the total number of word occurrences in all documents of category $C$
  3. $P(F_1, F_2, \ldots)$: the probability of each word appearing in the document to be classified

When comparing the probabilities of two categories for the same document, the denominator is identical, so we only need to compare the numerators to see which category has the higher probability.

2.4.3 Article classification calculation

Requirements: Through the first four training samples (articles), determine whether the fifth article belongs to the China category.

              Document ID   Words in document                     Belongs to C = China?
Training set  1             Chinese Beijing Chinese               Yes
              2             Chinese Chinese Shanghai              Yes
              3             Chinese Macao                         Yes
              4             Tokyo Japan Chinese                   No
Test set      5             Chinese Chinese Chinese Tokyo Japan   ?

Calculation results:

$$
\begin{aligned}
P(C \mid \text{Chinese}, \text{Chinese}, \text{Chinese}, \text{Tokyo}, \text{Japan})
&= \frac{P(\text{Chinese}, \text{Chinese}, \text{Chinese}, \text{Tokyo}, \text{Japan} \mid C)\,P(C)}{P(\text{Chinese}, \text{Chinese}, \text{Chinese}, \text{Tokyo}, \text{Japan})} \\
&= \frac{P(\text{Chinese} \mid C)^3 \times P(\text{Tokyo} \mid C) \times P(\text{Japan} \mid C) \times P(C)}{P(\text{Chinese})^3 \times P(\text{Tokyo}) \times P(\text{Japan})}
\end{aligned}
$$

We need to decide whether this article belongs to the China category; since the denominator is the same either way, only the numerators need to be compared.

First calculate the probability of being a China class:

$$
\begin{aligned}
& P(\text{Chinese} \mid C) = 5/8 \\
& P(\text{Tokyo} \mid C) = 0/8 \\
& P(\text{Japan} \mid C) = 0/8
\end{aligned}
$$

Then calculate the probability of not being a China class:

$$
\begin{aligned}
& P(\text{Chinese} \mid \bar{C}) = 1/3 \\
& P(\text{Tokyo} \mid \bar{C}) = 1/3 \\
& P(\text{Japan} \mid \bar{C}) = 1/3
\end{aligned}
$$

(Here $\bar{C}$ denotes the "not China" category.)

Question : In the example above, $P(\text{Tokyo} \mid C)$ and $P(\text{Japan} \mid C)$ are both 0, which is unreasonable. If many entries in the word-frequency table are 0, the computed results are all likely to be zero.

The main reason is that the number of samples is too small, which does not have universality and regularity.

Solution : Laplace smoothing coefficient

$$P(F_1 \mid C) = \frac{N_i + \alpha}{N + \alpha m}$$

where:

  • $\alpha$ is a specified coefficient, usually 1
  • $m$ is the number of distinct feature words in the training documents (duplicates counted only once)

Now recompute, with Laplace smoothing, whether the article belongs to the China category.

First, the (unnormalized) posterior of the China category, about $\frac{3}{4} \times \left(\frac{6}{14}\right)^3 \times \frac{1}{14} \times \frac{1}{14} \approx 0.0003$:

$$
\begin{aligned}
& P(\text{Chinese} \mid C) = \frac{5}{8} \rightarrow \frac{5 + 1}{8 + 6} = \frac{6}{14} \\
& P(\text{Tokyo} \mid C) = \frac{0}{8} \rightarrow \frac{0 + 1}{8 + 6} = \frac{1}{14} \\
& P(\text{Japan} \mid C) = \frac{0}{8} \rightarrow \frac{0 + 1}{8 + 6} = \frac{1}{14}
\end{aligned}
$$

$m$ is the number of distinct feature words in the training set (counted without repetition); here $m = 6$.

Then the (unnormalized) posterior of the non-China category, about $\frac{1}{4} \times \left(\frac{2}{9}\right)^5 \approx 0.0001$:

$$
\begin{aligned}
& P(\text{Chinese} \mid \bar{C}) = \frac{1}{3} \rightarrow \frac{1+1}{3+6} = \frac{2}{9} \\
& P(\text{Tokyo} \mid \bar{C}) = \frac{1}{3} \rightarrow \frac{1+1}{3+6} = \frac{2}{9} \\
& P(\text{Japan} \mid \bar{C}) = \frac{1}{3} \rightarrow \frac{1+1}{3+6} = \frac{2}{9}
\end{aligned}
$$

Since $0.0003 > 0.0001$, we conclude that the article is more likely to belong to the China category.
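
The whole worked example can be reproduced in a few lines of Python. A minimal from-scratch sketch (variable names are our own; sklearn's MultinomialNB would give the same ranking):

from collections import Counter

# Training documents (tokenised) and labels from the table above.
docs = [
    ("Chinese Beijing Chinese".split(),  "China"),
    ("Chinese Chinese Shanghai".split(), "China"),
    ("Chinese Macao".split(),            "China"),
    ("Tokyo Japan Chinese".split(),      "not China"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

alpha = 1                                         # Laplace smoothing coefficient
vocab = {w for words, _ in docs for w in words}   # 6 distinct words -> m = 6
classes = {c for _, c in docs}

scores = {}
for c in classes:
    class_docs = [words for words, label in docs if label == c]
    prior = len(class_docs) / len(docs)           # P(C)
    counts = Counter(w for words in class_docs for w in words)
    total = sum(counts.values())                  # N: total word count in class c
    # Multiply the smoothed conditional probabilities of the test words.
    score = prior
    for w in test_doc:
        score *= (counts[w] + alpha) / (total + alpha * len(vocab))
    scores[c] = score

print(scores)                         # ~0.0003 for "China", ~0.0001 for "not China"
print(max(scores, key=scores.get))    # "China"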


Summary :

  • Probability [understand]: the possibility of an event happening
  • Joint probability [know]: Contains multiple conditions, and the probability that all conditions are true at the same time
  • Conditional probability [know]: the probability of event A occurring under the condition that another event B has occurred
  • Bayes formula [master]: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

3. Case: Product Review Sentiment Analysis

Learning objective:

  • Use the Naive Bayes API to perform sentiment analysis on product reviews

3.1 API introduction

sklearn.naive_bayes.MultinomialNB(alpha=1.0)
  • Role : sklearn.naive_bayes.MultinomialNB is a Naive Bayes classifier suitable for classification with discrete features, such as word counts in text classification. The multinomial distribution normally requires integer feature counts; in practice, fractional counts such as tf-idf can also work.
  • Parameters :
    • alpha: float or array-like of shape (n_features,), optional, default 1.0. Additive (Laplace/Lidstone) smoothing parameter (set alpha=0 and force_alpha=True for no smoothing).
    • fit_prior: Boolean, optional, default is True. Whether to learn class prior probabilities. If false, a uniform prior will be used.
    • class_prior: array of shape (n_classes,), optional, default None. class prior probability. If specified, prior probabilities are not adjusted according to the data.
  • Methods :
    • fit(X, y[, sample_weight]): fit the Naive Bayes classifier according to X and y (a minimal usage sketch follows below).
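
A minimal usage sketch on a toy word-count matrix (the data and class labels here are made up for illustration):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: each row is a document, each column a word count.
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 0, 2],
              [0, 1, 3]])
y = np.array(["China", "China", "not China", "not China"])

clf = MultinomialNB(alpha=1.0)   # alpha is the Laplace smoothing coefficient
clf.fit(X, y)

print(clf.predict(np.array([[3, 0, 1]])))        # predicted class for a new count vector
print(clf.predict_proba(np.array([[3, 0, 1]])))  # posterior probability of each class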

3.2 Sentiment Analysis of Product Reviews


The dataset is created as follows:

{
    
    "Unnamed: 0":{
    
    "0":0,"1":1,"2":2,"3":3,"4":4,"5":5,"6":6,"7":7,"8":8,"9":9,"10":10,"11":11,"12":12},"\u5185\u5bb9":{
    
    "0":"\u4ece\u7f16\u7a0b\u5c0f\u767d\u7684\u89d2\u5ea6\u770b\uff0c\u5165\u95e8\u6781\u4f73\u3002","1":"\u5f88\u597d\u7684\u5165\u95e8\u4e66\uff0c\u7b80\u6d01\u5168\u9762\uff0c\u9002\u5408\u5c0f\u767d\u3002","2":"\u8bb2\u89e3\u5168\u9762\uff0c\u8bb8\u591a\u5c0f\u7ec6\u8282\u90fd\u6709\u987e\u53ca\uff0c\u4e09\u4e2a\u5c0f\u9879\u76ee\u53d7\u76ca\u532a\u6d45\u3002","3":"\u524d\u534a\u90e8\u5206\u8bb2\u6982\u5ff5\u6df1\u5165\u6d45\u51fa\uff0c\u8981\u8a00\u4e0d\u70e6\uff0c\u5f88\u8d5e","4":"\u770b\u4e86\u4e00\u904d\u8fd8\u662f\u4e0d\u4f1a\u5199\uff0c\u6709\u4e2a\u6982\u5ff5\u800c\u5df2","5":"\u4e2d\u89c4\u4e2d\u77e9\u7684\u6559\u79d1\u4e66\uff0c\u96f6\u57fa\u7840\u7684\u770b\u4e86\u4f9d\u65e7\u770b\u4e0d\u61c2","6":"\u5185\u5bb9\u592a\u6d45\u663e\uff0c\u4e2a\u4eba\u8ba4\u4e3a\u4e0d\u9002\u5408\u6709\u5176\u5b83\u8bed\u8a00\u7f16\u7a0b\u57fa\u7840\u7684\u4eba","7":"\u7834\u4e66\u4e00\u672c","8":"\u9002\u5408\u5b8c\u5b8c\u5168\u5168\u7684\u5c0f\u767d\u8bfb\uff0c\u6709\u5176\u4ed6\u8bed\u8a00\u7ecf\u9a8c\u7684\u53ef\u4ee5\u53bb\u770b\u522b\u7684\u4e66\u3002","9":"\u57fa\u7840\u77e5\u8bc6\u5199\u7684\u633a\u597d\u7684!","10":"\u592a\u57fa\u7840","11":"\u7565_\u55e6\u3002\u3002\u9002\u5408\u5b8c\u5168\u6ca1\u6709\u7f16\u7a0b\u7ecf\u9a8c\u7684\u5c0f\u767d","12":"\u771f\u7684\u771f\u7684\u4e0d\u5efa\u8bae\u4e70"},"\u8bc4\u4ef7":{
    
    "0":"\u597d\u8bc4","1":"\u597d\u8bc4","2":"\u597d\u8bc4","3":"\u597d\u8bc4","4":"\u5dee\u8bc4","5":"\u5dee\u8bc4","6":"\u5dee\u8bc4","7":"\u5dee\u8bc4","8":"\u5dee\u8bc4","9":"\u597d\u8bc4","10":"\u5dee\u8bc4","11":"\u5dee\u8bc4","12":"\u5dee\u8bc4"}}

Copy the JSON above, save it locally, and then read it with the pd.read_json() method (a sketch follows).
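
A minimal reading sketch, assuming the JSON was saved as ./data/书籍评价.json (the file name and location are an assumption):

import pandas as pd

# Read the saved JSON into a DataFrame.
data = pd.read_json("./data/书籍评价.json", encoding="utf-8")
print(data.head())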

3.2.1 Step analysis

  1. Get the data
  2. Basic data processing
    1. Take out the content column and analyze the data
    2. Determine the judgment criteria (labels)
    3. Select stop words
    4. Convert the content to a standard format
    5. Count the words
    6. Prepare the training and test sets
  3. Model training
  4. Model evaluation

3.2.2 Code implementation

import pandas as pd
import numpy as np
import jieba
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

3.2.2.1 Get data

# 1. Get the data
data = pd.read_csv("./data/书籍评价.csv", encoding='gbk')

3.2.2.2 Basic data processing

# 2. Basic data processing
## 2.1 Take out the content column and analyze the data
content = data["内容"]
content.head()
0              从编程小白的角度看,入门极佳。
1            很好的入门书,简洁全面,适合小白。
2    讲解全面,许多小细节都有顾及,三个小项目受益匪浅。
3          前半部分讲概念深入浅出,要言不烦,很赞
4             看了一遍还是不会写,有个概念而已
Name: 内容, dtype: object
## 2.2 Select stop words
# Load the stop words
stopwords = []
with open("./data/stopwords.txt", 'r', encoding="utf-8") as f:
    lines = f.readlines()
    for tmp in lines:
        line = tmp.strip()
        stopwords.append(line)
stopwords
['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',

 ...

 '当着',
 '形成',
 '彻夜',
 '彻底',
 '彼',
 '彼时',
 ...]

Complete list of stop words: https://blog.csdn.net/Dorisi_H_n_q/article/details/82114913

## 2.3 Convert the content column to a standard format
comment_lst = []
for tmp in content:
    print("原始数据:", tmp)
    
    # Use jieba to segment the text (turn each sentence into individual words)
    seg_lst = jieba.cut(tmp, cut_all=False)
    print("切割后的数据", seg_lst)
    
    # Join the segmented words into a comma-separated string
    seg_str = ','.join(seg_lst)  
    print("拼接后的字符串:", seg_str)
    comment_lst.append(seg_str)
    print()
comment_lst
原始数据: 从编程小白的角度看,入门极佳。
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAD070>
拼接后的字符串: 从,编程,小白,的,角度看,,,入门,极佳,。

原始数据: 很好的入门书,简洁全面,适合小白。
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAD4D0>
拼接后的字符串: 很,好,的,入门,书,,,简洁,全面,,,适合,小白,。

原始数据: 讲解全面,许多小细节都有顾及,三个小项目受益匪浅。
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCADD90>
拼接后的字符串: 讲解,全面,,,许多,小,细节,都,有,顾及,,,三个,小,项目,受益匪浅,。

原始数据: 前半部分讲概念深入浅出,要言不烦,很赞
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAC7B0>
拼接后的字符串: 前半部,分讲,概念,深入浅出,,,要言不烦,,,很赞

原始数据: 看了一遍还是不会写,有个概念而已
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAD700>
拼接后的字符串: 看,了,一遍,还是,不会,写,,,有个,概念,而已

原始数据: 中规中矩的教科书,零基础的看了依旧看不懂
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAC580>
拼接后的字符串: 中规中矩,的,教科书,,,零,基础,的,看,了,依旧,看不懂

原始数据: 内容太浅显,个人认为不适合有其它语言编程基础的人
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAC0B0>
拼接后的字符串: 内容,太,浅显,,,个人,认为,不,适合,有,其它,语言,编程,基础,的,人

原始数据: 破书一本
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAC040>
拼接后的字符串: 破书,一本

原始数据: 适合完完全全的小白读,有其他语言经验的可以去看别的书。
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAD2A0>
拼接后的字符串: 适合,完完全全,的,小白读,,,有,其他,语言,经验,的,可以,去,看,别的,书,。

原始数据: 基础知识写的挺好的!
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCADE70>
拼接后的字符串: 基础知识,写,的,挺,好,的,!

原始数据: 太基础
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAC120>
拼接后的字符串: 太,基础

原始数据: 略_嗦。。适合完全没有编程经验的小白
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAC430>
拼接后的字符串: 略,_,嗦,。,。,适合,完全,没有,编程,经验,的,小白

原始数据: 真的真的不建议买
切割后的数据 <generator object Tokenizer.cut at 0x00000200FBCAC350>
拼接后的字符串: 真的,真的,不,建议,买




['从,编程,小白,的,角度看,,,入门,极佳,。',
 '很,好,的,入门,书,,,简洁,全面,,,适合,小白,。',
 '讲解,全面,,,许多,小,细节,都,有,顾及,,,三个,小,项目,受益匪浅,。',
 '前半部,分讲,概念,深入浅出,,,要言不烦,,,很赞',
 '看,了,一遍,还是,不会,写,,,有个,概念,而已',
 '中规中矩,的,教科书,,,零,基础,的,看,了,依旧,看不懂',
 '内容,太,浅显,,,个人,认为,不,适合,有,其它,语言,编程,基础,的,人',
 '破书,一本',
 '适合,完完全全,的,小白读,,,有,其他,语言,经验,的,可以,去,看,别的,书,。',
 '基础知识,写,的,挺,好,的,!',
 '太,基础',
 '略,_,嗦,。,。,适合,完全,没有,编程,经验,的,小白',
 '真的,真的,不,建议,买']
## 2.4 Count the words
# Instantiate the CountVectorizer object
cv = CountVectorizer(stop_words=stopwords)

# Count word occurrences
X = cv.fit_transform(comment_lst)  # count how many times each word appears
name = cv.get_feature_names_out()  # get all feature words (the vocabulary) of the bag of words
print(X.toarray())  # view the word-frequency matrix
print(name)
[[0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
  0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
  0]
 [0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1
  1]
 [0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
  0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0]
 [0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  0]
 [0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0
  0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
  0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0
  0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0
  0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0
  0]]


['一本' '一遍' '三个' '中规中矩' '依旧' '入门' '内容' '分讲' '前半部' '受益匪浅' '基础' '基础知识' '完完全全'
 '小白' '小白读' '建议' '很赞' '教科书' '有个' '极佳' '概念' '浅显' '深入浅出' '看不懂' '真的' '破书'
 '简洁' '细节' '经验' '编程' '要言不烦' '角度看' '讲解' '语言' '适合' '项目' '顾及']
## 2.5 Prepare the training and test sets (there are only 13 samples in total, so we split them manually instead of using sklearn's splitting method)
# Prepare the training set
x_train = X.toarray()[:10, :]
y_train = data["评价"][:10]

# Prepare the test set
x_test = X.toarray()[10:, :]
y_test = data["评价"][10:]

print("训练集:\r\n", x_train, "\r\n", y_train)
print("测试集:\r\n", x_test, "\r\n", y_test)
训练集:
[[0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
0]
[0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1
1]
[0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
[0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0]
[0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0
0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0
0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]] 

0    好评
1    好评
2    好评
3    好评
4    差评
5    差评
6    差评
7    差评
8    差评
9    好评
Name: 评价, dtype: object


测试集:
[[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0
0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0
0]] 

10    差评
11    差评
12    差评
Name: 评价, dtype: object

3.2.2.3 Model Training

# 3. Model training
# Build the Naive Bayes classifier
mnb = MultinomialNB(alpha=1)  # alpha is the Laplace smoothing coefficient

# Train the model
mnb.fit(x_train, y_train)

# Predict on the test data
y_pred = mnb.predict(x_test)

# Show the predictions and the ground truth
print("预测值:", y_pred)
print("真实值:\r\n", y_test)
预测值:['差评' '差评' '差评']

真实值:
10    差评
11    差评
12    差评
Name: 评价, dtype: object

3.2.2.4 Model evaluation

score = mnb.score(x_test, y_test)
print("模型准确率为:", score * 100, "%")
模型准确率为: 100.0 %

In the case above the target values are Chinese strings. For larger projects it is better to encode the targets as numbers (see the sketch below).
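
For example, the labels could be converted with sklearn's LabelEncoder. A minimal sketch (not part of the case code above):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(["好评", "好评", "差评", "差评"])

print(le.classes_)                      # the label -> integer mapping (sorted label order)
print(y_encoded)                        # integer-coded targets to use as y_train / y_test
print(le.inverse_transform(y_encoded))  # decode predictions back to the original labels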


Application case: Baidu AI sentiment analysis


Summary :

  • Naive Bayes Classification
    • API: sklearn.naive_bayes.MultinomialNB(alpha=1.0)
      • alpha: Laplace smoothing coefficient

4. Summary of Naive Bayes Algorithm

4.1 Advantages and disadvantages of Naive Bayes

4.1.1 Advantages

  • The Naive Bayes model originates from classical mathematical theory and has stable classification performance.
  • It is not very sensitive to missing data and the algorithm is relatively simple; it is often used for text classification.
  • It achieves high classification accuracy and is fast.
  • It is easy to implement and quickly predicts the classes of a dataset.
  • It performs well in multi-class prediction.
  • When the feature-independence assumption holds, it can outperform models such as logistic regression.
  • It requires less training data.
  • It handles discrete input variables better than numerical (continuous) ones.

4.1.2 Disadvantages

  • Because of the attribute-independence assumption, it does not work well when the features are correlated.
  • Prior probabilities must be estimated, and they often depend on assumptions; since many prior models are possible, prediction can sometimes be poor because of a badly chosen prior.
  • It assumes that all features are independent among each other, which is not always true in real life. This assumption may affect the accuracy of the algorithm.
  • It may not work well for datasets with continuous features.

4.2 Naive Bayes content summary

4.2.1 Principle

Naive Bayesian method is a classification method based on Bayesian theorem and the independent assumption of feature conditions .

  • For a given item $x$ to be classified, the posterior probability distribution is computed with the learned model
  • That is, the probability of each target category given that this item appears; the class with the largest posterior probability is taken as the category of $x$.

Q : What is prior and what is posterior?
A : A prior probability is a measure of our uncertainty about a hypothesis before considering new evidence. It is usually based on previous experience or background knowledge.

Posterior probability is a measure of our uncertainty about a hypothesis after considering new evidence. It is calculated from Bayes' theorem, combining prior probabilities and the likelihood of new evidence.

As an example, suppose you want to know whether it will rain tomorrow. You know that there is a 30% chance of rain this season, so your prior probability is 30%. Then you look at the weather forecast and realize that it says there is a 90% chance of rain tomorrow. You can use Bayes' theorem to combine these two pieces of information and calculate the posterior probability that it will rain tomorrow.

4.2.2 Where is Naive Bayes "naive"?

In calculating the conditional probability distribution $P(X = x \mid Y = c_k)$, Naive Bayes introduces a strong conditional-independence assumption: when $Y$ is given, the values of the individual feature components of $X$ are mutually independent.

The Naive Bayes algorithm is called "naive" because it assumes that all features are independent of each other. That means, it assumes that each feature has an independent impact on the classification result, independent of other features.

However, in real life, this assumption does not always hold. There may be correlations between features, which may affect the accuracy of the algorithm. Still, Naive Bayes performs well in many applications.

4.2.3 Why introduce conditional independence assumption?

To avoid the combinatorial-explosion and sample-sparsity problems that arise when applying Bayes' theorem directly.

Write the class-conditional probability distribution as:

$$P(X = x \mid Y = c_k) = P\left(X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)}, \ldots, X^{(n)} = x^{(n)} \mid Y = c_k\right)$$

where $x^{(j)}$ can take $S_j$ possible values, $j = 1, 2, \ldots, n$, and $Y$ can take $K$ possible values, so the number of parameters is $K \prod_{j=1}^{n} S_j$; that is, the conditional probability distribution has an exponential number of parameters.


The Naive Bayesian algorithm introduces conditional independence assumptions to simplify calculations. According to Bayesian theorem, we need to calculate the posterior probability of each category, and then select the category with the largest posterior probability as the prediction result. To calculate the posterior probability, we need to calculate the joint probability distribution, i.e. the probability that all features occur at the same time.

If the assumption of conditional independence is not introduced, the calculation of the joint probability distribution needs to consider the correlation between all features, which will make the calculation very complicated. However, if we assume that all features are conditionally independent, the joint probability distribution can be reduced to the product of the probabilities of each feature, which greatly simplifies the calculation.
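
Written out, the conditional-independence assumption factorizes the class-conditional probability into a product over the individual features, and prediction then picks the class with the largest resulting (unnormalized) posterior:

$$P(X = x \mid Y = c_k) = \prod_{j=1}^{n} P\left(X^{(j)} = x^{(j)} \mid Y = c_k\right)$$

$$y = \arg\max_{c_k}\; P(Y = c_k) \prod_{j=1}^{n} P\left(X^{(j)} = x^{(j)} \mid Y = c_k\right)$$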

Although the conditional independence assumption does not always hold in real life, the Naive Bayes algorithm still performs well in many applications.

4.2.4 What to do if an estimated conditional probability P(X|Y) is 0?

When estimating the conditional probability P(X|Y), if the probability is 0, the entire posterior probability will be 0. To avoid this, smoothing techniques can be used to adjust the probability estimates.

In simple terms, introduce a smoothing parameter $\lambda$ into the estimate (the smoothed estimate is written out below):

  • when $\lambda = 0$, it is the ordinary maximum likelihood estimate
  • when $\lambda = 1$, it is Laplace smoothing
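
Concretely, the smoothed (Bayesian) estimate of the conditional probability is usually written as follows, where $S_j$ is the number of possible values of the $j$-th feature and $I(\cdot)$ is the indicator function:

$$P_\lambda\left(X^{(j)} = a_{jl} \mid Y = c_k\right) = \frac{\sum_{i=1}^{N} I\left(x_i^{(j)} = a_{jl},\ y_i = c_k\right) + \lambda}{\sum_{i=1}^{N} I\left(y_i = c_k\right) + S_j \lambda}$$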

4.2.5 Why is the attribute independence assumption difficult to hold in actual situations, but Naive Bayes can still achieve better results?

Before people use the classifier, the first step (and the most important step) is often feature selection. The purpose of this process is to exclude collinearity between features, that is, to select relatively independent features.

For classification tasks, as long as the conditional probabilities of each category are sorted correctly, the correct classification can be obtained without precise probability values. If inter-attribute dependencies affect all categories equally, or if the effects of dependencies can cancel each other out, the attribute conditional independence assumption reduces computational complexity without negatively impacting performance.

Although the attribute independence assumption of the Naive Bayesian algorithm is difficult to hold in practice, it can still achieve good results in many applications. This is mainly because the goal of the Naive Bayes algorithm is to select the class with the largest posterior probability, not to estimate the posterior probability accurately.

Even when there are correlations between attributes, the Naive Bayes algorithm is still able to rank the categories correctly most of the time. This means that the class with the largest posterior probability is usually still the correct class, even if the absolute value of the posterior probability is not exact.

Also, Naive Bayes algorithms generally perform well with large amounts of data. When the amount of data is large enough, the algorithm can learn enough information from the data to make up for the lack of attribute independence assumptions.

4.2.6 Difference between Naive Bayes and Logistic Regression (LR)

Both Naive Bayes and Logistic Regression (LR) are commonly used classification algorithms, but there are some important differences between them.

  • Model type: Naive Bayes is a generative model that tries to learn the probability distribution of each class and the relationship between each feature and class. Logistic regression is a discriminative model that attempts to learn a decision boundary to directly distinguish between different classes.
  • Assumption: Naive Bayes assumes that all features are conditionally independent, which makes the calculation simple, but may also affect the accuracy of the algorithm. Logistic regression does not have this assumption and thus can better handle cases where there are correlations between features.
  • Data requirements: Naive Bayes generally requires less data to achieve better results, while logistic regression requires more data to accurately estimate model parameters.

Difference 1 : Naive Bayes is a generative model, while LR is a discriminative model.

Naive Bayes:

  • From the existing samples, Bayesian estimation is used to learn the prior probability $P(Y)$ and the conditional probability $P(X \mid Y)$, from which the joint distribution $P(X, Y)$ is obtained
  • Finally, Bayes' theorem is used to solve for $P(Y \mid X)$

Logistic regression LR:

  • Directly estimates the conditional probability $P(Y \mid X)$ by maximizing the log-likelihood function

There are two main strategies for approaching machine learning from a probabilistic framework:

  • The first: given $x$, model $P(c \mid x)$ directly to predict $c$; this yields "discriminative models"
  • The second: first model the joint probability distribution $P(x, c)$, and then derive $P(c \mid x)$ from it; this yields "generative models"

Obviously, the logistic regression and decision trees introduced earlier, as well as the BP neural networks and support vector machines to be learned later, all fall into the category of discriminative models.

For generative models, one considers $P(c \mid x) = \frac{P(x, c)}{P(x)}$.

Difference two :

  • Naive Bayes is based on a strong conditional-independence assumption (given the class $Y$, the individual feature variables take their values independently of each other)
  • LR does not require this

Difference three :

  • Naive Bayes is suitable for scenarios with small data sets
  • While LR is suitable for large-scale data sets

Further explanation :

The former is a generative model, and the latter is a discriminative model. The difference between the two is the difference between a generative model and a discriminative model.

First, Naive Bayes obtains the prior probability $P(Y)$ and the conditional probability $P(X \mid Y)$ from known samples. For a given instance it computes the joint probability and then the posterior probability. In other words, it tries to model how the data was generated and then classifies accordingly: the instance is assigned to whichever category is most likely to have generated it.

Advantages :

  • Converges faster when sample size increases
  • It is also applicable when hidden variables exist

Disadvantages :

  • Takes a long time to compute
  • Needs more samples
  • Wastes computing resources

In contrast, LR does not care about the class proportions in the sample or the probability of features appearing within a class; it directly gives the formula of the prediction model. It assumes each feature has a weight, and the training data are used to update the weights $w$, yielding the final expression.

Advantages :

  • Direct predictions tend to be more accurate
  • Simplify the problem
  • It can reflect the distribution of data and the difference characteristics of categories
  • Applicable to the identification of more categories

Disadvantages :

  • slow convergence
  • Not suitable for cases with hidden variables


Origin blog.csdn.net/weixin_44878336/article/details/131155619