Machine Learning: Naive Bayes Text Classification in Python

1. Probability You Need to Know for Naive Bayes

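Conditional probability
Joint probability (multiplication rule)
Law of total probability
Naive Bayes formula
For a detailed treatment of the above, see the "Probability Basics" post.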
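For quick reference, the four formulas listed above can be written as follows. The notation (events A and B, a partition B_1, ..., B_n, a class y, and features x_1, ..., x_n) is chosen here for illustration and is not taken from the linked post.

```latex
% Conditional probability
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0 \]

% Joint probability (multiplication rule)
\[ P(A \cap B) = P(A \mid B)\, P(B) \]

% Law of total probability, where B_1, ..., B_n partition the sample space
\[ P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i) \]

% Naive Bayes formula: Bayes' theorem plus conditional independence of the features given y
\[ P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)} \]
```

The denominator in the last formula is the same for every class, so a classifier only needs to compare the numerators.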

2. Introduction to Naive Bayes

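With the k-nearest neighbors and decision tree algorithms covered earlier, every instance ends up assigned definitively to one class. Naive Bayes, introduced here, classifies with probabilities instead: it estimates how likely an instance is to belong to each class rather than committing outright to a single one.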
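To make the "probabilities rather than a hard label" point concrete, here is a minimal pure-Python sketch of a multinomial naive Bayes classifier with additive smoothing. The function names, the toy documents, and the English labels are all made up for illustration; this is not the scikit-learn implementation used later in the post.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Collect class counts and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab, alpha


def predict_proba(model, tokens):
    """Return P(class | tokens) for every class, normalized to sum to 1."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                smoothed = (word_counts[label][w] + alpha) / (total_words + alpha * len(vocab))
                score += math.log(smoothed)
        log_scores[label] = score
    # Normalize in log space to avoid underflow on long documents
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp_scores.values())
    return {k: v / z for k, v in exp_scores.items()}


# Hypothetical toy data: two documents per class
docs = [["house", "sale"], ["house", "lease"], ["food", "dinner"], ["food", "snack"]]
labels = ["real_estate", "real_estate", "catering", "catering"]

model = train(docs, labels)
probs = predict_proba(model, ["house", "dinner"])
print(probs)
```

The ambiguous document gets a probability for each class instead of a bare label, and a hard decision can still be taken by picking the class with the largest value.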

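Naive Bayes models are widely used for large-scale Internet text classification. Their strong assumption that features are conditionally independent given the class reduces the number of parameters the model has to estimate from exponential to linear in the number of features, which saves a great deal of memory and computation time. However, the same strong assumption means that training cannot take relationships between features into account, so the model performs poorly on classification tasks whose features are strongly correlated.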
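The drop from exponential to linear can be checked with a little arithmetic. The sketch below counts the free parameters of the class-conditional distribution for discrete features, with and without the independence assumption. The function names are hypothetical, and the class priors are left out because they are the same in both cases.

```python
def full_joint_params(n_features, n_classes, n_values=2):
    """Free parameters of P(x_1..x_n | y) with no independence assumption."""
    # Each class needs one probability per feature combination, minus one
    # because the probabilities must sum to 1.
    return n_classes * (n_values ** n_features - 1)


def naive_bayes_params(n_features, n_classes, n_values=2):
    """Free parameters when the features are conditionally independent given y."""
    # Each class needs (n_values - 1) probabilities per feature.
    return n_classes * n_features * (n_values - 1)


# Four classes, binary features (for example, word present or absent)
for n in (5, 10, 20):
    print(n, full_joint_params(n, 4), naive_bayes_params(n, 4))
```

With 20 binary features the unrestricted model already needs millions of parameters while the naive model needs 80, which is why the approach scales to vocabularies of many thousands of words.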

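Advantages: remains effective when little training data is available, and handles multi-class problems.

Disadvantages: assumes the features are mutually independent given the class, which real data rarely satisfies in full.
Applicable data type: nominal data.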

3. Naive Bayes Case Study

3.1 Case Background

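The company's development department supplied records for more than 200,000 companies, each with a company name and a description of its business scope. Roughly two thirds of the companies are already labeled with an industry; the rest carry no industry label and need to be classified by industry.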

3.2 The Naive Bayes Workflow

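The general naive Bayes workflow
Collect data: any method will do
Prepare data: Chinese word segmentation -> word vectors -> TF-IDF
Classify data: run the classifier
Optimize: validate the model trained on the training set against the test set, evaluate it, and improve it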
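To show what the "prepare data" step produces, here is a minimal pure-Python sketch of the textbook weighting tf * log(N / df) on documents that are already segmented into words. The function name and the toy documents are made up for illustration. Note that scikit-learn's TfidfVectorizer, used in the code further down, applies a smoothed idf and L2 normalization by default, so its numbers differ from this sketch.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return (vocab, rows), where rows[i][j] is the tf-idf weight of vocab[j] in document i."""
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts in this document
        rows.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


# Hypothetical documents, already split into words (the job jieba does for Chinese text)
docs = [["house", "sale", "house"], ["house", "lease"], ["food", "dinner"]]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

A term that occurs in every document would get weight 0 here, which is the same intuition behind dropping very frequent terms with max_df.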

3.3 Data and Code Directory Structure

I have not had time yet, so the data and the code directory have not been uploaded.

3.4 Code

# Model tuning: max_df and alpha
# -*- coding: utf-8 -*-
# @Time    : 2019/7/22 8:05 AM
# @Author  : Einstein Yang！！
# @Nickname : 穿着开裆裤上大学
# @FileName: bayes_formal.py
# @Software: PyCharm
# @PythonVersion: python3.5
# @Blog    : https://blog.csdn.net/weixin_41734687


import pandas as pd
# Bunch is imported from sklearn.utils because sklearn.datasets.base was removed in newer scikit-learn releases
from sklearn.utils import Bunch
from utilstool import *  # the author's local helper module; UtilsMysql presumably comes from here
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes
from sklearn import metrics

# Show every column and row when printing DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


class BayesTextType(object):
    def __init__(self):
        self.bunch = Bunch(target_name=[], label=[], id=[], contents=[], tdm=[], vocabulary={})
        self.pd_re = pd.DataFrame(columns=['id', 'manage_scope', 'company_name', 'industry'])
        self.mysql = UtilsMysql(host="localhost", port=3306, user="crawl", passwd="123456tt", db="yhouse_dev",
                                charset="utf8mb4")

    @staticmethod
    def _readfile(path):
        with open(path, "r", encoding="utf-8") as fp:
            content = fp.read()
        return content

    def vector_space(self):
        stopword_path = "./hlt_stop_words.txt"
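        # Load the stop-word list, one word per line
        stpwrdlst = BayesTextType._readfile(stopword_path).splitlines()
        # max_df: maximum document frequency. With max_df=0.5, a term that appears in more than 50% of the
        # documents is assumed to carry very little information and is dropped from the vocabulary.
        # min_df is rarely set, because its useful values are usually very small.
        # sublinear_tf: boolean, optional. Applies sublinear TF scaling, i.e. replaces tf with 1 + log(tf)
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.6)
        # Compute the TF-IDF weight of every term in every document; columns are ordered by term id
        self.bunch.tdm = vectorizer.fit_transform(self.bunch.contents)
        # Mapping from each term to its id (column index)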
        self.bunch.vocabulary = vectorizer.vocabulary_
        # print(self.bunch.vocabulary)

    def metrics_result(self):
        x_train, x_test, y_train, y_test = train_test_split(self.bunch.tdm, self.bunch.label, test_size=0.25)
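        # Train the classifier on the TF-IDF vectors and their class labels.
        # alpha is the additive (Laplace/Lidstone) smoothing parameter; it has nothing to do with iterations.
        # In the author's experience a small alpha (e.g. 0.001) scores better when the vocabulary is large,
        # and a larger alpha scores better when the vocabulary is small; tune it on held-out data.
        clf = MultinomialNB(alpha=1).fit(x_train, y_train)
        # Predict the class of every test document
        predicted = clf.predict(x_test)
        # Print every misclassified test sample
        for flabel, expct_cate in zip(y_test, predicted):
            if flabel != expct_cate:
                print("actual class:", flabel, " --> predicted class:", expct_cate)
        print("Prediction finished!")

        # Weighted precision, recall and F1 on the test set
        print('precision:{0:.3f}'.format(metrics.precision_score(y_test, predicted, average='weighted')))
        print('recall:{0:0.3f}'.format(metrics.recall_score(y_test, predicted, average='weighted')))
        print('f1-score:{0:.3f}'.format(metrics.f1_score(y_test, predicted, average='weighted')))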

    def bayes_text_type(self):
        # The target classes are fixed in advance
        target_name = ["房地产业", "旅行社", "餐饮业", "美容美发服务"]
        self.bunch.target_name = target_name
        select_sql = 'select id, manage_scope, company_name, industry from china_product_info where industry in ("房地产业", "旅行社", "餐饮业", "美容美发服务") ;'
        se_re = list(self.mysql.select_sql(select_sql))
        self.pd_re = pd.DataFrame(se_re, columns=['id', 'manage_scope', 'company_name', 'industry'])
        # Segment the text into words first; stop words are removed later by the TF-IDF vectorizer
        self.pd_re["contents"] = (self.pd_re["manage_scope"] +","+ self.pd_re["company_name"]).apply(lambda x: " ".join(jieba.cut(x)))
        self.bunch.label = self.pd_re['industry']
        self.bunch.contents = self.pd_re["contents"]
        # print(self.pd_re.head(10))
        # print(self.bunch.contents)
        self.vector_space()
        self.metrics_result()

    
if __name__ == '__main__':
    bayes_text_type = BayesTextType()
    bayes_text_type.bayes_text_type()
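
The header comment of the script mentions tuning max_df and alpha, while the code itself fixes them at 0.6 and 1. One way to search both together is scikit-learn's GridSearchCV over a Pipeline, sketched below. The toy corpus, the English labels, and the candidate values are assumptions made for illustration, not the author's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical corpus of already-segmented text (tokens joined by spaces, like the jieba output above)
docs = [
    "house sale property agent",
    "property lease apartment house",
    "apartment sale agent estate",
    "estate property development house",
    "restaurant food catering dinner",
    "catering food snack restaurant",
    "dinner restaurant noodle food",
    "snack noodle catering dinner",
]
labels = ["real_estate"] * 4 + ["catering"] * 4

# Chain the vectorizer and the classifier so both are refit inside every CV fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", MultinomialNB()),
])

# Candidate values are illustrative; "<step>__<param>" addresses a parameter of a pipeline step
param_grid = {
    "tfidf__max_df": [0.6, 1.0],
    "clf__alpha": [0.01, 0.1, 1.0],
}

# cv=2 only because the toy corpus is tiny; a real data set would normally use more folds
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_weighted")
search.fit(docs, labels)

print("best parameters:", search.best_params_)
print("prediction:", search.predict(["house property sale"])[0])
```

Putting the vectorizer inside the pipeline matters: max_df is then applied to the training part of each fold only, so the held-out documents do not influence the vocabulary.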