Basics of Machine Learning: Classification Algorithms (5) - Principles of the Naive Bayes Algorithm

1. Naive Bayes Algorithm

1. What is the Naive Bayes classification method?
With the KNN algorithm used earlier, classification directly produces a single result. Naive Bayes classification, by contrast, produces a probability value for each category; for example, a sample may be assigned a certain probability for each of six categories.

Another example is classifying articles into three categories: after Naive Bayes classifies each sample, you get a probability for each of the three categories, and the category with the highest probability is taken as the final result.

2. Basics of probability

1. Definition of probability
Probability is defined as the likelihood that an event occurs. For example: if you toss a coin, what is the probability that it lands heads?

2. Value range
P(X) takes values in [0, 1]:
if P(X) = 0, the event is impossible; if P(X) = 1, the event is certain.

3. Worked example: does the goddess like him?

It is known that Xiao Ming is a product manager and is overweight. Will the goddess like him?
There are two features, occupation and body type, and the target value is whether the goddess likes him; this is a binary classification problem.

4. Questions
(1) What is the probability that the goddess likes someone?
There are 7 samples, and the goddess likes 4 of them:
P(like) = 4/7

(2) What is the probability that a person is a programmer by occupation and well-proportioned in body type?
P(programmer, well-proportioned) = 1/7
-- a joint probability

(3) Given that the goddess likes someone, what is the probability that he is a programmer?
P(programmer | like) = 2/4
-- a conditional probability

(4) Given that the goddess likes someone, what is the probability that he is a programmer and overweight?
P(programmer, overweight | like) = 1/4
-- both a joint probability and a conditional probability
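
The counts above can be checked with plain Python. The 7-sample table appears only as an image in the original post, so the table in this sketch is a hypothetical reconstruction that is consistent with every probability computed in this section:

samples = [
    # (occupation, body type, goddess likes?) -- hypothetical reconstruction
    ("programmer", "overweight", False),
    ("product", "well-proportioned", True),
    ("programmer", "well-proportioned", True),
    ("programmer", "overweight", True),
    ("designer", "well-proportioned", False),
    ("designer", "overweight", False),
    ("product", "well-proportioned", True),
]

n = len(samples)
likes = [s for s in samples if s[2]]

# P(like) = 4/7
print("P(like) =", len(likes) / n)
# P(programmer, well-proportioned) = 1/7 (joint probability)
print("P(programmer, well-proportioned) =",
      sum(1 for o, b, _ in samples
          if o == "programmer" and b == "well-proportioned") / n)
# P(programmer | like) = 2/4 (conditional probability)
print("P(programmer | like) =",
      sum(1 for o, _, _ in likes if o == "programmer") / len(likes))
# P(programmer, overweight | like) = 1/4
print("P(programmer, overweight | like) =",
      sum(1 for o, b, _ in likes
          if o == "programmer" and b == "overweight") / len(likes))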

3. Joint probability, conditional probability and mutual independence

1. Joint probability: the probability that several events all occur at the same time.
Notation: P(A, B)
Property: if A and B are independent of each other, P(A, B) = P(A)P(B)
Examples: P(programmer, well-proportioned), P(programmer, overweight | like)

2. Conditional probability: the probability that event A occurs given that another event B has already occurred.
Notation: P(A | B)
Property: if A1 and A2 are conditionally independent given B, P(A1, A2 | B) = P(A1 | B)P(A2 | B)
Examples: P(programmer | like), P(programmer, overweight | like)

3. Mutual independence: if P(A, B) = P(A)P(B), then events A and B are said to be independent of each other.
Example: in the goddess data, are "programmer" and "well-proportioned" independent of each other?
P(programmer, well-proportioned) = 1/7
P(programmer) = 3/7
P(well-proportioned) = 4/7
Since P(programmer) * P(well-proportioned) = 3/7 * 4/7 = 12/49 ≠ 1/7, "programmer" and "well-proportioned" are not independent of each other.
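
The same check takes two lines of exact arithmetic (a sketch using the standard-library fractions module):

from fractions import Fraction as F

# P(programmer) * P(well-proportioned) vs P(programmer, well-proportioned)
print(F(3, 7) * F(4, 7), "vs", F(1, 7))   # 12/49 vs 1/7
print(F(3, 7) * F(4, 7) == F(1, 7))       # False -> not independent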

4. It is known that Xiao Ming is a product manager and is overweight. Will the goddess like him?
The goal is to find P(like | product, overweight) = ?
This is where Bayes' formula comes in.

4. Bayes’ formula

1. Formula

P(C | W) = P(W | C) * P(C) / P(W)

Note: W stands for the feature values and C for the category.

2. Solving Xiao Ming's problem
Numerator: P(product, overweight | like) * P(like)
Denominator: P(product, overweight)

Computed directly from the 7 samples, P(product, overweight | like) and P(product, overweight) are both 0, so the expression cannot be evaluated. This is because the sample size is too small to be representative: in real life there are certainly people who are both product managers and overweight, so P(product, overweight) cannot really be 0. Moreover, the events "occupation is product manager" and "body type is overweight" are generally considered independent.

Naive Bayes solves exactly this problem. Simply put, Naive Bayes is the Bayes formula plus the assumption that features are independent of each other; the algorithm is "naive" precisely because of this independence assumption. Under that assumption:
P(product, overweight) = P(product) * P(overweight)

Solving the problem the Naive Bayes way:
P(product, overweight) = P(product) * P(overweight) = 2/7 * 3/7 = 6/49
P(product, overweight | like) = P(product | like) * P(overweight | like) = 1/2 * 1/4 = 1/8
P(like) = 4/7

Numerator / Denominator = (1/8) * (4/7) / (6/49) = 196/336 = 7/12

So P(like | product, overweight) = 7/12 ≈ 0.58.
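
As a quick sanity check, the same arithmetic in Python (a minimal sketch; the standard-library fractions module keeps every value exact):

from fractions import Fraction as F

p_like = F(4, 7)
p_product_given_like = F(1, 2)
p_overweight_given_like = F(1, 4)
p_product = F(2, 7)
p_overweight = F(3, 7)

# Naive independence assumption applied to numerator and denominator
numerator = p_product_given_like * p_overweight_given_like * p_like
denominator = p_product * p_overweight
print(numerator / denominator)  # 7/12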

5. Summary of Naive Bayes Algorithm

1. The KNN algorithm can be summed up in one sentence: determine my category from my neighbors' categories.

2. Naive Bayes algorithm = naive assumption (features are independent) + Bayes formula

6. Application scenarios

1. The defining characteristic of Naive Bayes is the assumption that features are independent of each other. It is often used for text classification and text sentiment analysis.

2. To turn articles into data that machine learning can process, words are used as features.

3. Using words as features carries an implicit assumption that words are independent of each other, which is why the naive assumption fits text so naturally; see the short example below.
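
As an illustration, scikit-learn's CountVectorizer turns documents into word-count features. The two sentences are made up for the example, and get_feature_names_out requires scikit-learn 1.0 or later:

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, purely for illustration
docs = ["life is short, i like python",
        "life is too long, i dislike python"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # each distinct word becomes one feature
print(X.toarray())                         # word counts per document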

7. Text classification case

1. Case
The training set (shown as an image in the original post; reconstructed here from the counts used below) contains four documents, each labeled with whether it belongs to the China category (C):

Document 1: Chinese Beijing Chinese -> C
Document 2: Chinese Chinese Shanghai -> C
Document 3: Chinese Macao -> C
Document 4: Tokyo Japan Chinese -> not C

Test document: Chinese Chinese Chinese Tokyo Japan

2. Formula
P(C | F1, F2, ...) = P(F1, F2, ... | C) * P(C) / P(F1, F2, ...)
where, under the naive assumption, P(F1, F2, ... | C) = P(F1 | C) * P(F2 | C) * ...

3. Calculation
(1) P(C | Chinese, Chinese, Chinese, Tokyo, Japan)
Given that the article contains these 5 words, what is the probability that it belongs to the China category?
Numerator: P(Chinese, Chinese, Chinese, Tokyo, Japan | C) * P(C)
Denominator: P(Chinese, Chinese, Chinese, Tokyo, Japan)
The denominator is the same for both categories, so only the numerators need to be compared.
P(C) = 3/4
P(Chinese, Chinese, Chinese, Tokyo, Japan | C) = P(Chinese | C)^3 * P(Tokyo | C) * P(Japan | C)
P(Chinese | C) = 5/8
P(Tokyo | C) = 0/8 (a zero appears because the sample is too small, so the Laplace smoothing coefficient has to be introduced)
After applying the Laplace smoothing coefficient (alpha = 1, vocabulary of 6 distinct words):
P(Chinese | C) = (5+1) / (8+1*6) = 6/14 = 3/7
P(Tokyo | C) = (0+1) / (8+1*6) = 1/14
P(Japan | C) = (0+1) / (8+1*6) = 1/14

Numerator for C: (3/7)^3 * (1/14) * (1/14) * (3/4) = (27/343) * (1/196) * (3/4) = 81/268912 ≈ 0.000301

(2) P(not C | Chinese, Chinese, Chinese, Tokyo, Japan)
The probability that the article does not belong to the China category.
Numerator: P(Chinese, Chinese, Chinese, Tokyo, Japan | not C) * P(not C)
Denominator: P(Chinese, Chinese, Chinese, Tokyo, Japan)
Again, only the numerator is needed.
P(not C) = 1/4
P(Chinese, Chinese, Chinese, Tokyo, Japan | not C) = P(Chinese | not C)^3 * P(Tokyo | not C) * P(Japan | not C)
After applying the Laplace smoothing coefficient:
P(Chinese | not C) = (1+1) / (3+1*6) = 2/9
P(Tokyo | not C) = (1+1) / (3+1*6) = 2/9
P(Japan | not C) = (1+1) / (3+1*6) = 2/9

Numerator for not C: (2/9)^3 * (2/9) * (2/9) * (1/4) = (32/59049) * (1/4) = 8/59049 ≈ 0.000135

(3) Comparing the two numerators, the score for C (≈ 0.000301) is greater than the score for not C (≈ 0.000135), so the test document is classified into the China category. Note that the priors P(C) = 3/4 and P(not C) = 1/4 must be included in the products: comparing the likelihoods alone (≈ 0.000402 vs ≈ 0.000542) would flip the comparison and wrongly put the document outside the China category.
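
The whole computation can be verified with a short script; this is a sketch that hard-codes the word counts from the reconstructed training set above and uses exact fractions:

from fractions import Fraction as F

alpha = 1   # Laplace smoothing coefficient
vocab = 6   # distinct words: Chinese, Beijing, Shanghai, Macao, Tokyo, Japan

def smoothed(count, total):
    # P(word | class) with Laplace smoothing: (Ni + alpha) / (N + alpha * m)
    return F(count + alpha, total + alpha * vocab)

# Class C (China): 8 word occurrences in total; class not-C: 3 in total
p_c, p_not_c = F(3, 4), F(1, 4)
score_c = smoothed(5, 8) ** 3 * smoothed(0, 8) * smoothed(0, 8) * p_c
score_not_c = smoothed(1, 3) ** 3 * smoothed(1, 3) * smoothed(1, 3) * p_not_c
print(score_c, float(score_c))           # 81/268912 ≈ 0.000301
print(score_not_c, float(score_not_c))   # 8/59049 ≈ 0.000135
print("China" if score_c > score_not_c else "not China")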

8. Laplace smoothing coefficient

1. Purpose:
To prevent any estimated class-conditional probability from being 0.

2. Formula
P(F1 | C) = (Ni + alpha) / (N + alpha * m)
where Ni is the number of times feature F1 appears in documents of category C, N is the total number of feature occurrences in documents of category C, alpha is the smoothing coefficient (usually 1), and m is the number of distinct features in the training set (the vocabulary size).

9. API

1. sklearn.naive_bayes.MultinomialNB(alpha=1.0)
Naive Bayes classification
alpha: Laplace smoothing coefficient
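
A minimal usage sketch with made-up word-count features (three documents, four words):

from sklearn.naive_bayes import MultinomialNB

# Toy word-count features: 3 documents x 4 words (made-up numbers)
X = [[2, 1, 0, 0],
     [2, 0, 1, 0],
     [0, 0, 1, 2]]
y = [1, 1, 0]  # 1 = belongs to the category, 0 = does not

clf = MultinomialNB(alpha=1.0)  # alpha is the Laplace smoothing coefficient
clf.fit(X, y)
print(clf.predict([[3, 0, 0, 2]]))        # predicted class
print(clf.predict_proba([[3, 0, 0, 2]]))  # probability for each class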

10. Case: classifying news into 20 categories

1. Data
The sklearn 20 newsgroups dataset, loaded with fetch_20newsgroups.

2. Step analysis
(1) Obtain the data
(2) Split the dataset
(3) Feature engineering: text feature extraction (TF-IDF)
(4) Naive Bayes estimator workflow
(5) Model evaluation

3. Code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def KNN_iris():
    """
    Classify the iris dataset with the KNN algorithm
    """
    # 1. Obtain the data
    iris = load_iris()
    print("iris.data:\n", iris.data)
    print("iris.target:\n", iris.target)
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    # Standardize the test set with the training set's mean and standard deviation.
    # The test set must be scaled with the same mean and standard deviation as the
    # training set; fit() computes them, so having fit on the training set we only
    # call transform() on the test set instead of recomputing its own statistics.
    x_test = transfer.transform(x_test)
    # 4. KNN estimator
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: compare the predicted values with the true values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None
 
def KNN_iris_gscv():
    """
    Classify the iris dataset with KNN, adding grid search and cross-validation
    """
    # 1. Obtain the data
    iris = load_iris()
    print("iris.data:\n", iris.data)
    print("iris.target:\n", iris.target)
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    # Standardize the test set with the training set's mean and standard deviation
    # (see the comment in KNN_iris above).
    x_test = transfer.transform(x_test)
    # 4. KNN estimator
    estimator = KNeighborsClassifier()
    # Add grid search and cross-validation
    # Parameter grid
    param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: compare the predicted values with the true values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # Best parameters: best_params_
    print("Best parameters:\n", estimator.best_params_)
    # Best cross-validated score: best_score_
    print("Best score:\n", estimator.best_score_)
    # Best estimator: best_estimator_
    print("Best estimator:\n", estimator.best_estimator_)
    # Cross-validation results: cv_results_
    print("Cross-validation results:\n", estimator.cv_results_)
    return None

def nb_news():
    """
    Classify news articles with the Naive Bayes algorithm
    """
    # 1. Obtain the data
    news = fetch_20newsgroups(subset="all")
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)
    # 3. Feature engineering: text feature extraction with TF-IDF
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Naive Bayes estimator
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: compare the predicted values with the true values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None
 
if __name__ == "__main__":
    # Code 1: classify the iris dataset with KNN
    #KNN_iris()
    # Code 2: classify the iris dataset with KNN, plus grid search and cross-validation
    #KNN_iris_gscv()
    # Code 3: classify news with Naive Bayes
    nb_news()

4. Run results

y_predict:
 [11 17 15 ...  9  2  4]
Direct comparison of true and predicted values:
 [ True  True  True ...  True  True  True]
Accuracy:
 0.8497453310696095

11. Summary of Naive Bayes Algorithm

1. Advantages:
The Naive Bayes model originates from classical mathematical theory and gives stable classification performance.
It is not very sensitive to missing data, and the algorithm is relatively simple; it is often used for text classification.
It classifies accurately and quickly.

2. Disadvantages:
Because it assumes that sample attributes are independent of each other, it performs poorly when the feature attributes are correlated.
 

Source: blog.csdn.net/csj50/article/details/132494973