Machine Learning: Classification Prediction Based on Naive Bayes

Table of contents

1. Introduction and environment preparation

Introduction:

environment:

2. Practical exercises

2.1 Using the Wine dataset for Bayesian classification

1. Data import

2. Model training

3. Model prediction

2.2 Simulating Discrete Datasets – Bayesian Classification

1. Data import and analysis

2. Model training and prediction

 3. Principle Analysis

Naive Bayes Algorithm

Advantages and disadvantages:


1. Introduction and environment preparation

Introduction:

Naive Bayes (NB) is a machine learning classification algorithm based on Bayes' theorem. It assumes that the input features are independent of each other and have the same impact on the classification result, which is why it is called "naive".

Specifically, it classifies an input sample by combining prior probabilities with conditional probabilities: the prior probability is the probability of each class appearing in the whole dataset, and the conditional probability describes the distribution of the sample's features given a particular class.

In practical applications, Naive Bayes is often used for tasks such as text classification and spam filtering; it is fast to compute and not very sensitive to the amount of data.

Bayes' formula was proposed by the British mathematician Thomas Bayes:

P(B|A) = P(A,B) / P(A) = P(A|B) P(B) / P(A)

where:

p(A,B): the probability that event A and event B occur at the same time.

p(B): the probability that event B occurs, called the prior probability; p(A): the probability that event A occurs.

p(B|A): the probability that event B occurs given that event A has occurred, called the posterior probability.

p(A|B): the probability that event A occurs given that event B has occurred.

We can sum up the Bayesian idea in one sentence: many things in the world are connected; call them event A and event B. People often use an event that has already happened to infer the probability of the thing they actually want to know.
For example, when a doctor makes a diagnosis, he judges which disease the patient has from symptoms such as the tongue coating and the heartbeat. The patient only cares about which disease it is, while the doctor reasons from the events that have
already occurred. This is exactly Bayesian thinking: A is the symptom that has been observed, and what we want is the probability of each disease B_i given that A has occurred, that is, P(B_i|A).
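As a tiny numerical illustration of this (the numbers below are made up purely for demonstration), the doctor's reasoning is just Bayes' formula applied once:

p_disease = 0.01                 # prior P(B): fraction of patients with the disease
p_symptom_given_disease = 0.90   # P(A|B): probability of the symptom when the disease is present
p_symptom_given_healthy = 0.10   # P(A|not B): probability of the symptom without the disease
# total probability of observing the symptom, P(A)
p_symptom = p_symptom_given_disease * p_disease + p_symptom_given_healthy * (1 - p_disease)
# posterior P(B|A) from Bayes' formula
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(round(p_disease_given_symptom, 4))  # about 0.0833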

environment:

PyCharm is recommended, with the following suggested versions:

1. python3.7
2. numpy >= '1.16.4'
3. sklearn >= '0.23.1'

2. Practical exercises

All of the datasets used below ship with sklearn and need no extra download. Here is a brief introduction to sklearn's datasets.

scikit-learn (sklearn) provides several ready-made datasets that can be used for machine learning tasks and demonstrations. The built-in datasets that can be loaded or fetched directly include the following:

  1. Iris dataset (load_iris)
  2. Diabetes dataset (load_diabetes)
  3. Handwritten digits dataset (load_digits)
  4. Wine dataset (load_wine)
  5. Breast Cancer Wisconsin dataset (load_breast_cancer)
  6. Linnerud physical training dataset (load_linnerud)
  7. Boston Housing dataset (load_boston, removed in recent versions)
  8. Olivetti faces dataset (fetch_olivetti_faces)
  9. 20 newsgroups text dataset (fetch_20newsgroups)
  10. Labeled Faces in the Wild dataset (fetch_lfw_people / fetch_lfw_pairs)
  11. Forest covertypes dataset (fetch_covtype)
  12. RCV1 news text dataset (fetch_rcv1)
  13. KDD Cup 1999 network intrusion detection dataset (fetch_kddcup99)
  14. California Housing dataset (fetch_california_housing)

In addition to these ready-made data sets, sklearn also provides some tools to generate artificial data sets, such as make_classification, make_regression and make_blobs.
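For example, a small artificial classification dataset can be generated like this (a minimal sketch; the parameter values are arbitrary):

from sklearn.datasets import make_classification

# 100 samples, 5 features (3 of them informative), 2 classes
X_demo, y_demo = make_classification(n_samples=100, n_features=5, n_informative=3,
                                     n_classes=2, random_state=0)
print(X_demo.shape, y_demo.shape)  # (100, 5) (100,)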

2.1 Using the Wine dataset for Bayesian classification

1. Data import

import warnings
warnings.filterwarnings('ignore')
import numpy as np
# Load the Wine dataset
from sklearn import datasets
# Import the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We need to calculate two kinds of probabilities: the conditional probabilities P(X^{(i)} = x^{(i)} \mid Y = c_k) and the prior probability of each class, P(Y = c_k).
Inspecting the training data shows that the features are continuous numerical values, so we assume each feature follows a Gaussian distribution and choose Gaussian Naive Bayes for the classification.
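As a quick optional sanity check before fitting, we can look at the empirical class frequencies, which approximate the priors P(Y=c_k), and at the continuous feature values that motivate the Gaussian assumption:

# class frequencies in the training set (empirical priors)
print(np.bincount(y_train) / len(y_train))
# per-feature means of the continuous features
print(X_train.mean(axis=0).round(2))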

2. Model training

# Fit a Gaussian Naive Bayes classifier
clf = GaussianNB(var_smoothing=1e-8)
clf.fit(X_train, y_train)

3. Model prediction

# Evaluation
y_pred = clf.predict(X_test)
acc = np.sum(y_test == y_pred) / X_test.shape[0]
print("Test Acc : %.3f" % acc)

# Prediction
y_proba = clf.predict_proba(X_test[:1])
print(clf.predict(X_test[:1]))
print("Predicted class probabilities:", y_proba)

Gaussian Naive Bayes assumes that each feature follows a Gaussian distribution: a random variable X with mathematical expectation μ and variance σ² is said to be Gaussian (normally) distributed. For each feature, within each class, we estimate μ with the sample mean and σ² with the sample variance.
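To make this concrete, here is a small sketch that estimates the parameters for class 0 by hand and compares the means with what the fitted model stores (GaussianNB keeps the per-class feature means in its theta_ attribute):

# manual Gaussian parameter estimates for class 0
mask = (y_train == 0)
mu_hat = X_train[mask].mean(axis=0)   # estimate of μ for each feature
var_hat = X_train[mask].var(axis=0)   # estimate of σ² for each feature
# the model's stored means agree with the manual estimates
print(np.allclose(mu_hat, clf.theta_[0]))  # True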

From the prediction output in the example above, class 0 has the largest posterior probability, so class 0 is taken as the final prediction.
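Equivalently, the predicted label is simply the class whose posterior column is largest (a small sketch using the probabilities computed above):

# pick the class with the highest posterior probability for the first test sample
best_class = clf.classes_[np.argmax(y_proba, axis=1)]
print(best_class)  # same as clf.predict(X_test[:1])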

It is easy to generalize from this example: simply change the dataset that is imported. Next, let us look at classification on a discrete dataset.

2.2 Simulating Discrete Datasets – Bayesian Classification

1. Data import and analysis

import numpy as np
# Naive Bayes for categorical (discrete) features
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

Randomly generate some data for training; feel free to modify it yourself.

# Simulated data
rng = np.random.RandomState(1)
# Randomly generate 500 samples with 100 dimensions; each feature is an integer in [0, 10]
X = rng.randint(11, size=(500, 100))
y = np.array([1, 2, 3, 4, 5] * 100)
data = np.c_[X, y]
# Shuffle X and y together, row by row
rng.shuffle(data)
X = data[:, :-1]
y = data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

X and y are the feature matrix and the target vector.

X is a 2D array of shape (500, 100) containing 500 samples, each with 100 features. The features are random integers drawn uniformly from 0 to 10 (inclusive) using numpy's randint function.

y is a 1D array of shape (500,) containing 500 target values. The values 1, 2, 3, 4, 5 are each repeated 100 times and correspond one-to-one to the samples in the feature matrix. In a machine learning task, the feature matrix and the target vector are used to train and test a model so that it can make predictions on new, unseen data.
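A quick optional sanity check confirms the shapes, the feature range, and the class labels after shuffling and splitting:

print(X_train.shape, X_test.shape)  # (400, 100) (100, 100)
print(X.min(), X.max())             # features are integers in [0, 10]
print(np.unique(y))                 # classes 1, 2, 3, 4, 5 (100 samples each)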

2. Model training and prediction

All of the features here are discrete, so we use the Naive Bayes classifier designed for categorical features.

clf = CategoricalNB(alpha=1)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("Test Acc : %.3f" % acc)
# Test on a random sample and inspect the prediction: Bayes selects the class with the highest probability.
# Since the data are random, readers may see a different result when running this.
x = rng.randint(5, size=(1, 100))
print(clf.predict_proba(x))
print(clf.predict(x))

From the output on the test sample you can see that Bayes selects the class with the highest predicted probability as the result. Since the data are randomly generated, the class you obtain when you run this may differ.

Also note the parameter alpha in the first line. When alpha=0 the estimate is the maximum likelihood estimate. Usually alpha=1 is used; this is Laplace smoothing, also called Bayesian estimation. The reason for it is that with maximum likelihood estimation, if some feature value never appears in the training data, its estimated conditional probability is 0, which makes the whole product (and hence the posterior) 0; Bayesian estimation (smoothing) is introduced to avoid this.
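Written out, the smoothed (Bayesian) estimate of a conditional probability for a categorical feature is the standard Laplace-smoothed count ratio, stated here for reference:

P_\alpha(X^{(j)}=a_{jl} \mid Y=c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl}, y_i=c_k) + \alpha}{\sum_{i=1}^{N} I(y_i=c_k) + S_j \alpha}

where S_j is the number of possible values of feature j. With \alpha=0 this reduces to the maximum likelihood estimate, and \alpha=1 gives Laplace smoothing.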

The accuracy on the test data is not meaningful here, because the data are randomly generated and there is no real relationship between features and labels to learn; it only serves as an example.

 3. Principle Analysis

Naive Bayes Algorithm

Naive Bayes method = Bayes theorem + feature conditional independence.

The input space \mathcal{X} \subseteq R^n is a set of n-dimensional vectors and the output space is the label set y = \{c_1, c_2, ..., c_K\}. X and Y are random variables on these spaces, and P(X,Y) is their joint probability distribution. The training data set

T = \{(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\}

is generated independently and identically distributed from P(X,Y).

The original article goes into more detail; I will not repeat it here, you can follow the link at the end of this post.

To sum up: when classifying, Naive Bayes computes, for each class, the conditional probability of the sample's features (treating the features as conditionally independent given the class) and selects the class with the highest posterior probability as the prediction.
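Written as formulas, the conditional independence assumption and the resulting classification rule are:

P(X=x \mid Y=c_k) = \prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)

y = \arg\max_{c_k} P(Y=c_k) \prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)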

Advantages and disadvantages:

Advantages:
The Naive Bayes algorithm is based directly on the classic Bayes formula and rests on a solid mathematical foundation. It performs well when the amount of data is small, and it supports incremental training when the amount of data is large. Because Naive Bayes goes from prior probabilities to posterior probabilities, the model is easy to interpret.
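For example, incremental training is available in sklearn through partial_fit; here is a minimal sketch on the Wine data from section 2.1 (splitting the training set into two chunks is arbitrary):

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf_inc = GaussianNB()
# the first call to partial_fit must be told the full set of classes
clf_inc.partial_fit(X_train[:100], y_train[:100], classes=np.unique(y))
# later calls update the running per-class means and variances
clf_inc.partial_fit(X_train[100:], y_train[100:])
print("Incremental Test Acc : %.3f" % clf_inc.score(X_test, y_test))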

Disadvantages:
In theory the Naive Bayes model has the smallest error rate compared with other classification methods, but this is often not the case in practice, because the model assumes the attributes are mutually independent given the output class. That assumption rarely holds in real applications: when the number of attributes is large or the correlations between attributes are strong, classification performance suffers, and Naive Bayes works best when the attributes are only weakly correlated. To address this, algorithms such as semi-naive Bayes make moderate improvements by modeling part of the correlations; for example, to keep the computation manageable, each attribute is assumed to depend on at most one other attribute. Another option is to remove feature correlation with dimensionality reduction (for example PCA) and then apply Naive Bayes, as sketched below.
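As a sketch of that last point, decorrelating the features with PCA before Gaussian Naive Bayes only needs a small pipeline (the number of components chosen here is arbitrary):

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
# project onto (roughly) uncorrelated principal components, then apply Gaussian NB
model = make_pipeline(PCA(n_components=5), GaussianNB())
print("Mean CV accuracy: %.3f" % cross_val_score(model, X, y, cv=5).mean())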


Original: Machine learning entry algorithm (2): Classification prediction based on Naive Bayes, from Ting's Artificial Intelligence Blog on CSDN

Origin blog.csdn.net/m0_62237233/article/details/130142251