Naive Bayes with Laplace Smoothing: Code Implementation

Computation steps:

For a sample x, compare the two unnormalized posteriors and predict whichever class scores higher:

P(好瓜|x) ∝ P(好瓜)P(色泽|好瓜)P(根蒂|好瓜)P(敲声|好瓜)P(纹理|好瓜)P(脐部|好瓜)P(触感|好瓜)
P(坏瓜|x) ∝ P(坏瓜)P(色泽|坏瓜)P(根蒂|坏瓜)P(敲声|坏瓜)P(纹理|坏瓜)P(脐部|坏瓜)P(触感|坏瓜)

Each conditional, e.g. P(色泽|好瓜), is estimated by counting how often that 色泽 value occurs among the good melons, with Laplace smoothing applied so that unseen values never get probability zero.
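As a concrete check of the smoothing formula (numbers read off the dataset loaded below: 8 good melons, 3 of which have 色泽 = 青绿, and 色泽 takes 3 distinct values):

```python
# Laplace-smoothed conditional: (count + 1) / (class size + number of distinct values).
# From the melon table below: 8 good melons, 3 with 色泽 = 青绿, 3 possible 色泽 values.
count, n_good, n_values = 3, 8, 3
p = (count + 1) / (n_good + n_values)
print(p)  # → 4/11 ≈ 0.3636
```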

Reading the data:

import pandas as pd

# melon2 = pd.read_csv('E:\\work\ml\\Python_Project_01\\sklearn_week\\week_10\\melon2.0.csv', index_col='编号')

melon2 = pd.DataFrame([["青绿", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", "是"],
                         ["乌黑", "蜷缩", "沉闷", "清晰", "凹陷", "硬滑", "是"],
                         ["乌黑", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", "是"],
                         ["青绿", "蜷缩", "沉闷", "清晰", "凹陷", "硬滑", "是"],
                         ["浅白", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", "是"],
                         ["青绿", "稍蜷", "浊响", "清晰", "稍凹", "软粘", "是"],
                         ["乌黑", "稍蜷", "浊响", "稍糊", "稍凹", "软粘", "是"],
                         ["乌黑", "稍蜷", "浊响", "清晰", "稍凹", "硬滑", "是"],
                         ["乌黑", "稍蜷", "沉闷", "稍糊", "稍凹", "硬滑", "否"],
                         ["青绿", "硬挺", "清脆", "清晰", "平坦", "软粘", "否"],
                         ["浅白", "硬挺", "清脆", "模糊", "平坦", "硬滑", "否"],
                         ["浅白", "蜷缩", "浊响", "模糊", "平坦", "软粘", "否"],
                         ["青绿", "稍蜷", "浊响", "稍糊", "凹陷", "硬滑", "否"],
                         ["浅白", "稍蜷", "沉闷", "稍糊", "凹陷", "硬滑", "否"],
                         ["乌黑", "稍蜷", "浊响", "清晰", "稍凹", "软粘", "否"],
                         ["浅白", "蜷缩", "浊响", "模糊", "平坦", "硬滑", "否"],
                         ["青绿", "蜷缩", "沉闷", "稍糊", "稍凹", "硬滑", "否"]],
                        columns=["色泽", "根蒂", "敲声", "纹理", "脐部", "触感", "好瓜"])

Split into good and bad melons:

m2_bad = melon2[melon2['好瓜'] == '否']
m2_good = melon2[melon2['好瓜'] == '是']

Compute the priors:

# Laplace-smoothed class priors (add 1 per class; 2 classes in the denominator)
p_good_priori = (len(m2_good) + 1) / (len(melon2) + 2)
p_bad_priori = (len(m2_bad) + 1) / (len(melon2) + 2)
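Plugging in the numbers (8 good and 9 bad melons, 17 in total), the smoothed priors work out to 9/19 and 10/19; a quick check:

```python
# Smoothed class priors: +1 per class in the numerator, +2 (number of classes) in the denominator.
n_good, n_bad = 8, 9
p_good_priori = (n_good + 1) / (n_good + n_bad + 2)
p_bad_priori = (n_bad + 1) / (n_good + n_bad + 2)
print(round(p_good_priori, 4), round(p_bad_priori, 4))  # → 0.4737 0.5263
```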

Feature value statistics:

# Laplace smoothing per feature for each class: a list holding one dict per feature
# Count the number of distinct values of each feature
feature_num = melon2.shape[-1] - 1  # number of feature columns (the last column is the label); feature order stays fixed throughout
features_name = []  # value set of each feature, taken over the full table so values absent from the good (or bad) subset are still covered
features_counts = []  # number of distinct values per feature: the denominator correction for Laplace smoothing
for ii in range(feature_num):
    features_name.append(set(melon2.iloc[:, ii]))
    features_counts.append(len(set(melon2.iloc[:, ii])))
  
features_name
[{'乌黑', '浅白', '青绿'},
 {'硬挺', '稍蜷', '蜷缩'},
 {'沉闷', '浊响', '清脆'},
 {'模糊', '清晰', '稍糊'},
 {'凹陷', '平坦', '稍凹'},
 {'硬滑', '软粘'}]

features_counts: [3, 3, 3, 3, 3, 2]

Compute P(*|好瓜):

# good-melon part
ps_feature_good = []
# first count the values of each feature
for ii in range(feature_num):
    ps_feature_good.append(dict(m2_good.iloc[:, ii].value_counts()))  # a Series is essentially a dict
# then turn the counts into Laplace-smoothed conditional probabilities
for ii in range(feature_num):
    for ff in features_name[ii]:  # get() below guards against values never seen in this class
        ps_feature_good[ii][ff] = (ps_feature_good[ii].get(ff, 0) + 1) / (len(m2_good) + features_counts[ii])

Compute P(*|坏瓜):

# bad-melon part
ps_feature_bad = []
# first count the values of each feature
for ii in range(feature_num):
    ps_feature_bad.append(dict(m2_bad.iloc[:, ii].value_counts()))
# then turn the counts into Laplace-smoothed conditional probabilities
for ii in range(feature_num):
    for ff in features_name[ii]:  # get() guards against values never seen in this class
        ps_feature_bad[ii][ff] = (ps_feature_bad[ii].get(ff, 0) + 1) / (len(m2_bad) + features_counts[ii])
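A useful sanity check: after Laplace smoothing, each per-feature table should sum to 1 over that feature's full value set. Verifying with the 色泽 counts among the 9 bad melons from the table above (浅白: 4, 青绿: 3, 乌黑: 2, and 3 possible values):

```python
# After smoothing, probabilities over a feature's full value set must sum to 1:
# sum over values of (count + 1) / (n_bad + n_values) = (n_bad + n_values) / (n_bad + n_values).
counts = {'浅白': 4, '青绿': 3, '乌黑': 2}  # 色泽 counts among the 9 bad melons
n_bad, n_values = 9, 3
probs = {v: (c + 1) / (n_bad + n_values) for v, c in counts.items()}
print(round(sum(probs.values()), 10))  # → 1.0
```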

The prediction function:

# prediction: multiply the factors for each class separately, then compare
def predict(features):
    p_good = p_good_priori
    for ii in range(feature_num):
        p_good *= ps_feature_good[ii][features[ii]]

    p_bad = p_bad_priori
    for ii in range(feature_num):
        p_bad *= ps_feature_bad[ii][features[ii]]

    return '是' if p_good > p_bad else '否'
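With only six features the raw product above is fine, but for longer feature vectors the product of many small probabilities can underflow. A common variant sums log-probabilities instead; a minimal self-contained sketch (with made-up toy tables, not the melon values):

```python
import math

# Toy prior and conditional-probability tables (hypothetical values, two features).
p_good_priori, p_bad_priori = 0.5, 0.5
ps_feature_good = [{'a': 0.6, 'b': 0.4}, {'x': 0.7, 'y': 0.3}]
ps_feature_bad = [{'a': 0.2, 'b': 0.8}, {'x': 0.4, 'y': 0.6}]

def predict_log(features):
    # Sum logs instead of multiplying raw probabilities, to avoid underflow.
    log_good = math.log(p_good_priori) + sum(
        math.log(ps_feature_good[i][f]) for i, f in enumerate(features))
    log_bad = math.log(p_bad_priori) + sum(
        math.log(ps_feature_bad[i][f]) for i, f in enumerate(features))
    return '是' if log_good > log_bad else '否'

print(predict_log(['a', 'x']))  # → 是  (0.5*0.6*0.7 = 0.21 beats 0.5*0.2*0.4 = 0.04)
```

Because log is monotonic, the comparison result is identical to the product version.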

Check the results on the training set:

# check predictions against the training labels
for idx in melon2.index:
    row = melon2.loc[idx]
    pred = predict(row.iloc[:feature_num].tolist())  # feature values as a plain list (avoids positional [] indexing on a labeled Series)
    label = row['好瓜']
    print(pred, label, pred == label)

Output:

是 是 True
是 是 True
是 是 True
是 是 True
是 是 True
是 是 True
否 是 False
是 是 True
否 否 True
否 否 True
否 否 True
否 否 True
是 否 False
否 否 True
是 否 False
否 否 True
否 否 True
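As a cross-check, scikit-learn's CategoricalNB with alpha=1.0 applies the same Laplace smoothing to the per-feature conditionals (though its class prior is the unsmoothed empirical frequency, so borderline samples may occasionally differ from the hand-rolled version above). A sketch on a small stand-in dataset, assuming scikit-learn is installed:

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the melon table: two categorical features, binary label.
X_raw = [['青绿', '蜷缩'], ['乌黑', '蜷缩'], ['浅白', '硬挺'], ['浅白', '稍蜷']]
y = ['是', '是', '否', '否']

enc = OrdinalEncoder()                      # map category strings to integer codes
X = enc.fit_transform(X_raw)
clf = CategoricalNB(alpha=1.0).fit(X, y)    # alpha=1.0 is Laplace smoothing
print(clf.predict(X))
```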

This is the code my teacher provided; the next post will walk through my own implementation. You're welcome to read it.

Origin blog.csdn.net/weixin_51756104/article/details/121239101