Hebei University of Technology Data Mining Experiment 4 Bayesian Decision Classification Algorithm

1. Experimental purpose

(1) Become familiar with the Naive Bayes decision algorithm.
(2) Query the AllElectronics customer database to obtain the prior probabilities and class conditional probabilities.
(3) Write a program that classifies the sample set with the Naive Bayes algorithm, run it on the task-related data, and debug the experiment.
(4) Write an experiment report.

2. Experimental principles

1. Prior probability and class conditional probability

Prior probability: the prior probability is defined as the ratio of the number of training samples (tuples) belonging to class $C_i$, denoted $N_i$, to the total number of samples $N$; it is written $P(C_i)=\frac{N_i}{N}$.
Class conditional probability: the class conditional probability is defined as the ratio of the number of training samples (tuples) that belong to class $C_i$ and have feature vector $X$, denoted $n_i$, to the number of samples of class $C_i$, $N_i$; it is written $P(X|C_i)=\frac{n_i}{N_i}$.
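
For example, with hypothetical counts (not taken from the actual data.txt): if the training set held $N=14$ tuples of which $N_1=9$ belong to class $C_1$ (buys_computer = yes), then $P(C_1)=\frac{9}{14}\approx 0.643$; and if $n_1=2$ of those 9 tuples have AGE = youth, then $P(\text{AGE=youth}|C_1)=\frac{2}{9}\approx 0.222$.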

2. Bayesian decision-making

The Bayesian decision (classification) method assigns a sample (tuple) $X$ to class $C_i$ if and only if

$$P(X|C_i)P(C_i) > P(X|C_j)P(C_j), \quad \text{for } 1 \leq j \leq m,\ j \neq i,$$

where $m$ is the number of classes into which the training sample set is divided.
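
As a hypothetical illustration: if $P(X|C_1)P(C_1)=0.028$ and $P(X|C_2)P(C_2)=0.007$, the rule assigns $X$ to class $C_1$, since $0.028 > 0.007$.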

3. Experimental content and procedures

1. Experimental content

Use a Bayesian classifier to classify a known feature vector X:

  1. Using the class-labeled training sample set (tuples) from the AllElectronics customer database, write a program to compute the prior probability P(Ci) and the class conditional probability P(X|Ci); point out the function and implementation of the key code in the experiment report;
  2. Write a program that classifies the feature vector X with the Bayesian classification method; point out the functions and implementations of the key program fragments in the experiment report;
  3. Use the test samples to estimate the classification error rate;
  4. Draw a block diagram of the program or routine in your lab report.

2. Experimental steps

This classification problem is to determine whether a customer is inclined to buy a computer: C1 corresponds to buys_computer = yes and C2 corresponds to buys_computer = no, so it is a binary classification problem. The experimental steps are as follows:

  1. Determine the feature attributes and their divisions: browse the given database and identify the feature attributes and how their values are divided;
  2. Obtain the training samples, i.e., the class-labeled training sample set (tuples) from the given AllElectronics customer database;
  3. Compute the prior probability of each class in the training samples: P(Ci), i = 1, 2;
  4. Compute the class conditional probability in the training samples: let the feature (attribute) vector be X, and compute P(X|Ci), i = 1, 2;
  5. Classify with the classifier; a minimal sketch of steps 3-5 follows this list.
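
The following is a minimal sketch of steps 3-5 under the naive independence assumption. It is an illustrative outline, written against a pandas DataFrame with the column names used in the experiment code of Section 5; the function names fit_naive_bayes and classify are hypothetical, not the experiment's actual program.

import pandas as pd

def fit_naive_bayes(df, features=("AGE", "INCOME", "STUDENT", "CREDIT"), target="BUY"):
    # Step 3: prior P(Ci) as the relative frequency of each class
    priors = df[target].value_counts(normalize=True).to_dict()
    # Step 4: per-attribute conditional P(x_k | Ci), estimated within each class
    cond = {}
    for label, part in df.groupby(target):
        cond[label] = {f: part[f].value_counts(normalize=True).to_dict() for f in features}
    return priors, cond

def classify(x, priors, cond):
    # Step 5: score(Ci) = P(Ci) * prod_k P(x_k | Ci); unseen values count as zero
    scores = {}
    for label, prior in priors.items():
        p = prior
        for f, v in x.items():
            p *= cond[label][f].get(v, 0.0)
        scores[label] = p
    return max(scores, key=scores.get)

For example, classify({"AGE": "youth", "INCOME": "medium", "STUDENT": "yes", "CREDIT": "fair"}, priors, cond) would return the predicted BUY label.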

3. Program block diagram

[Figure: program block diagram]

4. Experimental samples

data.txt
[Figure: contents of data.txt]

5. Experimental code

#!/usr/bin/env python  
# -*- coding: utf-8 -*-
#
# Copyright (C) 2021 #
# @Time    : 2022/5/30 21:26
# @Author  : Yang Haoyuan
# @Email   : [email protected]
# @File    : Exp4.py
# @Software: PyCharm

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
import argparse

parser = argparse.ArgumentParser(description='Exp4')
parser.add_argument('--mode', type=str, choices=["KFold", "train", "test"])
parser.add_argument('--k', type=int, default=7)
parser.add_argument('--AGE', type=str, choices=["youth", "middle_aged", "senior"])
parser.add_argument('--INCOME', type=str, choices=["high", "medium", "low"])
parser.add_argument('--STUDENT', type=str, choices=["yes", "no"])
parser.add_argument('--CREDIT', type=str, choices=["excellent", "fair"], default="fair")
args = parser.parse_args()
print(args)


# Load the dataset
def loadDataset(filename):
    dataSet = []
    with open(filename, 'r') as file_to_read:
        while True:
            lines = file_to_read.readline()  # read one line at a time
            if not lines:
                break
            p_tmp = [str(i) for i in lines.split(sep="\t")]
            p_tmp[len(p_tmp) - 1] = p_tmp[len(p_tmp) - 1].strip("\n")
            dataSet.append(p_tmp)

    return pd.DataFrame(dataSet, columns=["AGE", "INCOME", "STUDENT", "CREDIT", "BUY"])


# Count the total number of samples and the size of each class
def count_total(data):
    count = {}
    group_df = data.groupby(["BUY"])

    count["yes"] = group_df.size()["yes"]
    count["no"] = group_df.size()["no"]

    total = count["yes"] + count["no"]
    return count, total


# Compute the prior probability of each class
def cal_base_rates(categories, total):
    rates = {}
    for label in categories:
        priori_prob = categories[label] / total
        rates[label] = priori_prob
    return rates


# Compute the class conditional probability of each feature value
def f_prob(data, count):
    likelihood = {'yes': {}, 'no': {}}

    # For every feature column, count its values jointly with BUY (yes, no);
    # a (value, label) combination absent from the training data counts as zero
    features = {
        'AGE': ["youth", "middle_aged", "senior"],
        'INCOME': ["high", "medium", "low"],
        'STUDENT': ["yes", "no"],
        'CREDIT': ["excellent", "fair"],
    }
    for feature, values in features.items():
        sizes = data.groupby([feature, 'BUY']).size()
        for value in values:
            for label in ['yes', 'no']:
                try:
                    c = sizes[value, label]
                except KeyError:
                    c = 0
                likelihood[label][value] = c / count[label]

    return likelihood


# Training
def train(train_data):
    # Get the size of each class and the total number of training samples
    count, total = count_total(train_data)
    # Compute the prior probabilities
    priori_prob = cal_base_rates(count, total)
    # Save the prior probabilities
    np.save("priori_prob.npy", priori_prob)
    # Compute the class conditional probabilities of the features
    feature_prob = f_prob(train_data, count)
    # Save the conditional probabilities
    np.save("feature_prob.npy", feature_prob)
    print("Training finished")


# Classifier
def NaiveBayesClassifier(AGE=None, INCOME=None, STUDENT=None, CREDIT=None):
    res = {}
    priori_prob = np.load('priori_prob.npy', allow_pickle=True).item()
    feature_prob = np.load('feature_prob.npy', allow_pickle=True).item()
    # Compute the probability of each class from the given features
    for label in ['yes', 'no']:
        prob = priori_prob[label]
        prob *= feature_prob[label][AGE] * feature_prob[label][INCOME] * feature_prob[label][STUDENT] \
                * feature_prob[label][CREDIT]
        res[label] = prob
    print("Predicted probabilities:", res)
    # Choose the class with the highest probability as the classification result
    res = sorted(res.items(), key=lambda kv: kv[1], reverse=True)
    return res[0][0]


# Testing
def test(test_data):
    correct = 0
    for idx, row in test_data.iterrows():
        pred = NaiveBayesClassifier(row["AGE"], row["INCOME"], row["STUDENT"], row["CREDIT"])
        if pred == row["BUY"]:
            correct = correct + 1
    return correct / test_data.shape[0]


# Run k-fold cross-validation
def KFoldEnabled():
    kf = KFold(n_splits=args.k)
    data_set = loadDataset("data.txt")
    corr = 0
    for train_idx, test_idx in kf.split(data_set):
        train_data = data_set.loc[train_idx]
        test_data = data_set.loc[test_idx]
        train(train_data)
        corr = corr + test(test_data)

    print("k-fold cross-validation accuracy: ", corr / args.k)


if __name__ == '__main__':
    if args.mode == "KFold":
        KFoldEnabled()
    if args.mode == "train":
        train_data = loadDataset("date.txt")
        train(train_data)
    if args.mode == "test":
        '''priori_prob = np.load('priori_prob.npy', allow_pickle=True).item()
        print("先验概率: ", priori_prob)
        feature_prob = np.load('feature_prob.npy', allow_pickle=True).item()
        print("类条件概率: ", feature_prob)'''
        ret = NaiveBayesClassifier(args.AGE, args.INCOME, args.STUDENT, args.CREDIT)
        print("预测结果: ", ret)

4. Experimental results

[Figure: training and testing with k-fold cross-validation, k = 7]
[Figure: training the model]
[Figure: model prediction]

5. Experimental analysis

This experiment mainly implements the Naive Bayes classification algorithm.

The Bayesian method is based on Bayes' theorem and uses probability and statistics to classify sample data sets. Because of its solid mathematical foundation, the Bayesian classification algorithm has a very low false positive rate. Its characteristic is to combine the prior probability with the posterior probability, which avoids both the subjective bias of using the prior probability alone and the overfitting that comes from using the sample information alone. Bayesian classification shows high accuracy on large data sets, and the algorithm itself is relatively simple.
The Naive Bayes algorithm simplifies the Bayesian approach by assuming that the attributes are conditionally independent of each other given the class label. In other words, no attribute carries more or less weight in the decision than any other. Although this simplification reduces the classification quality to some extent, it greatly reduces the complexity of the Bayesian method in practical application scenarios.
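
Formally, the independence assumption lets the class conditional probability factor into per-attribute terms: $P(X|C_i)=\prod_{k=1}^{n}P(x_k|C_i)$, where $x_1,\ldots,x_n$ are the attribute values of $X$; each factor can then be estimated from simple counts, which is exactly what the experiment code does.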

This simplification makes Naive Bayes classification more robust, but the assumption of independence between attributes also limits its classification performance.

The experimental data does not distinguish between a training set and a validation set, so to verify the performance of the Naive Bayes classifier on this data set I used k-fold cross-validation. The final classification accuracy was only 50%, which I attribute to the small size of the data set: the Naive Bayes classifier cannot be trained sufficiently for this task.
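
With so few samples, an additional problem is that an attribute-class combination may never appear in a training fold, producing a zero factor that wipes out the whole product (the c = 0 cases in the code above). A common remedy, not used in this experiment, is Laplace (add-one) smoothing; the sketch below uses hypothetical names.

# Hypothetical add-one (Laplace) smoothed estimate of P(value | class);
# n_values is the number of distinct values the attribute can take
def smoothed_prob(count_value_in_class, count_class, n_values):
    return (count_value_in_class + 1) / (count_class + n_values)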

I save the trained parameters as .npy files so that the classifier can load them directly when performing classification.
