Machine learning in practice----Naive Bayes

1. Introduction

Environment: Windows 10, Jupyter Notebook, Python 3.6

 

The Naive Bayes algorithm is a supervised learning algorithm for classification problems, such as predicting customer churn, deciding whether a customer is worth investing in, and credit-rating assessment, including multi-class problems.

Advantages: simple and easy to understand, with high learning efficiency; in certain domains its classification performance is comparable to decision trees and neural networks.

However, because the algorithm assumes that the features are conditionally independent of each other and that continuous variables are normally distributed, its accuracy suffers to some extent when these assumptions do not hold.

Some advantages of Naive Bayesian inference:

  • It is a generative model that classifies by computing probabilities and can handle multi-class problems.
  • It performs well on small-scale data, is suitable for multi-class tasks and incremental training, and the algorithm is relatively simple.
Some disadvantages of Naive Bayesian inference:

  • It is sensitive to the representation of the input data.
  • The "naive" independence assumption causes some loss of accuracy.
  • Prior probabilities must be estimated, and the classification decision has an error rate.

Reference: https://blog.csdn.net/c406495762/article/details/77341116

 

2. Principle

Bayesian decision theory----Conditional probability----Bayesian inference----Example 1----Example 2 (Naive Bayes)

 

1. Bayesian decision theory

Naive Bayes is part of Bayesian decision theory, so it is worth quickly reviewing Bayesian decision theory before discussing Naive Bayes.

Suppose we have data points from two classes:

p1(x,y) represents the probability that the data point (x,y) belongs to category 1 (the category represented by the red dot in the figure),

p2(x,y) represents the probability that data point (x,y) belongs to category 2 (the category represented by the blue triangle in the figure)

  • If p1(x,y) > p2(x,y), then the category is 1
  • If p1(x,y) < p2(x,y), then the category is 2

This is the core idea of Bayesian decision theory: choose the class with the higher probability.

 

2. Conditional probability and total probability

Conditional probability: the probability that event A occurs given that event B has occurred, written P(A|B).

The derivation is as shown in the figure:
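
In place of the original figure, a standard derivation from the definition above is:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}

Eliminating P(A ∩ B) from the two equations gives Bayes' formula, and the law of total probability expands the denominator:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) = \sum_i P(B \mid A_i)\, P(A_i)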

 

3. Bayesian inference

  • P(A) is called "prior probability", which is our judgment on the probability of event A before event B occurs.
  • P(A|B) is called "posterior probability", which is our re-evaluation of the probability of event A after event B occurs.
  • P(B|A)/P(B) is called the "likelihood function"; it is an adjustment factor that makes the estimated probability closer to the true probability.
     

It can be understood as:

Posterior probability = prior probability x adjustment factor

That is Bayesian inference: we first estimate a "prior probability", then incorporate the experimental result to see whether it strengthens or weakens the "prior probability", thereby obtaining a "posterior probability" that is closer to the truth.

  • Likelihood function P(B|A)/P(B) > 1: the "prior probability" is strengthened, and event A becomes more likely;
  • Likelihood function = 1: event B gives no information about the likelihood of event A;
  • Likelihood function < 1: the "prior probability" is weakened, and event A becomes less likely.
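
Written out, the statement "posterior probability = prior probability × adjustment factor" is just Bayes' theorem:

P(A \mid B) = P(A) \cdot \frac{P(B \mid A)}{P(B)}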

 

4. Example 1

Prior probability:

Since the two bowls are the same, P(H1)=P(H2), that is, before taking out the sugar, the two bowls have the same probability of being selected.

Therefore, P(H1)=0.5, we call this probability "prior probability", that is, before doing the experiment, the probability of coming from bowl No. 1 is 0.5.

 

Posterior probability:

E represents the event of drawing a sugar candy, so the question becomes: given E, what is the probability that it came from bowl No. 1, i.e. find P(H1|E).

We call this probability "posterior probability", which is the correction to P(H1) after the event E occurs.
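
The counts from the original figure are not reproduced here, but the posterior follows directly from Bayes' formula and the law of total probability, with P(H_1) = P(H_2) = 0.5 from the prior above:

P(H_1 \mid E) = \frac{P(H_1)\, P(E \mid H_1)}{P(E)} = \frac{P(H_1)\, P(E \mid H_1)}{P(H_1)\, P(E \mid H_1) + P(H_2)\, P(E \mid H_2)}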

 

 

5. Example 2

Bayes and Naive Bayes are not the same concept; the difference lies in the word "naive". Naive Bayes additionally assumes that the features are conditionally independent given the class.

The basic method of a Bayesian classifier: based on statistics of the data and certain features, compute the probability of each class and assign the sample to the most probable class.
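
Concretely, for a sample with features x = (x_1, ..., x_n), the conditional-independence ("naive") assumption lets the classifier factor the likelihood and pick the class with the largest posterior (the denominator P(x) is the same for every class, so it can be dropped):

\hat{y} = \arg\max_{C_k} \; P(Y = C_k) \prod_{j=1}^{n} P(x_j \mid Y = C_k)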

 

3. Code to implement filtering of insulting remarks

1. Overall process:

Text segmentation----Vectorization----Word-frequency (conditional probability) computation----Posterior probability prediction

As shown below:

2. Generate a text dictionary and vectorize the text:

import numpy as np

# tokenized documents
postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],               
            ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
            ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
            ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
            ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
            ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
# labels: 1 = insulting, 0 = not insulting
classVec = [0,1,0,1,0,1] 

# convert a document into a word vector
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)                                     # vector of zeros, one slot per vocabulary word
    for word in inputSet:                                                # iterate over the document's words
        if word in vocabList:                                            # set the slot to 1 if the word is in the vocabulary
            returnVec[vocabList.index(word)] = 1
        else: print("the word: %s is not in my Vocabulary!" % word)
    return returnVec 

# build the vocabulary, a de-duplicated list of words
def createVocabList(dataSet):
    vocabSet = set([])                      # empty set, no duplicates
    for document in dataSet:               
        vocabSet = vocabSet | set(document) # union with each document's words
    return list(vocabSet)

myVocabList = createVocabList(postingList)
print('myVocabList:\n',myVocabList)

trainMat = []
for postinDoc in postingList:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
print('trainMat:\n', trainMat)

 The result is as follows:

myVocabList:
 ['take', 'him', 'stupid', 'problems', 'help', 'mr', 'please', 'not', 'dalmation', 'my', 'has', 'buying', 'worthless', 'licks', 'to', 'is', 'love', 'quit', 'food', 'park', 'so', 'ate', 'dog', 'steak', 'I', 'cute', 'garbage', 'posting', 'how', 'stop', 'flea', 'maybe']
trainMat:
 [[0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0], [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

3. Training, that is, estimating the following quantities from the training samples: 

  • p0V: the conditional probability (stored as a log) of each word given the non-insult class (class 0)
  • p1V: the conditional probability (stored as a log) of each word given the insult class (class 1)
  • pAb: the proportion of insult documents in the training set
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                            # number of training documents
    numWords = len(trainMatrix[0])                             # length of each document vector (vocabulary size)
    pAbusive = sum(trainCategory)/float(numTrainDocs)          # probability that a document is abusive
    
    '''
    Without Laplace smoothing:
    p0Num = np.zeros(numWords); 
    p1Num = np.zeros(numWords)   
    p0Denom = 0.0; 
    p1Denom = 0.0  
    '''
    # numpy.ones arrays: word counts initialized to 1 (Laplace smoothing)
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords) 
    # denominators initialized to 2 (Laplace smoothing)
    p0Denom = 2.0
    p1Denom = 2.0                            
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:                            # accumulate the counts needed for the abusive-class conditional probabilities P(w0|1),P(w1|1),P(w2|1)...
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:                                                # accumulate the counts needed for the non-abusive-class conditional probabilities P(w0|0),P(w1|0),P(w2|0)...
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num/p1Denom)                            # take logs to prevent underflow         
    p0Vect = np.log(p0Num/p0Denom)         
    return p0Vect,p1Vect,pAbusive                             # return the non-abusive-class log-probability vector, the abusive-class log-probability vector, and the abusive prior

p0V, p1V, pAb = trainNB0(trainMat, classVec)
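
In terms of formulas, with the set-of-words model and the smoothing used above, element i of p1V (and analogously p0V) and pAb are:

p1V_i = \log \frac{n_{i,1} + 1}{N_1 + 2}, \qquad pAb = \frac{\#\{\text{abusive documents}\}}{\#\{\text{documents}\}}

where n_{i,1} is the number of abusive documents containing word i and N_1 is the total number of word occurrences counted over all abusive documents. (Note that textbook Laplace smoothing would use the vocabulary size instead of 2 in the denominator; the code above simply uses 2.)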

4. Prediction:

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)        # element-wise multiply then sum; log(A*B) = log(A) + log(B), so log(pClass1) is added
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


def testingNB():
    testEntry = ['love', 'my', 'dalmation']
    # vectorize the text
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    if classifyNB(thisDoc,p0V,p1V,pAb):
        print(testEntry,'is classified as abusive')
    else:
        print(testEntry,'is classified as non-abusive')
    
    
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    if classifyNB(thisDoc,p0V,p1V,pAb):
        print(testEntry,'is classified as abusive')
    else:
        print(testEntry,'is classified as non-abusive')

testingNB()

result:

['love', 'my', 'dalmation'] is classified as non-abusive
['stupid', 'garbage'] is classified as abusive

 

4. Code to implement spam filtering

1. Process:

Text----Vectorization----Training----Prediction

Samples: 25 spam emails and 25 non-spam (ham) emails.

import re 
import numpy as np
import random

# tokenize: convert a long string into a list of lowercase tokens
def textParse(bigString):
    listOfTokens = re.split(r'\W+', bigString)                              # split on runs of non-alphanumeric characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]            # drop short tokens (e.g. single letters such as 'I') and lowercase the rest

# build the vocabulary
def createVocabList(dataSet):
    vocabSet = set([])                      # empty set, no duplicates
    for document in dataSet:               
        vocabSet = vocabSet | set(document) # union with each document's words
    return list(vocabSet)

# set-of-words vectorization
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)                                     # vector of zeros, one slot per vocabulary word
    for word in inputSet:                                                # iterate over the document's words
        if word in vocabList:                                            # set the slot to 1 if the word is in the vocabulary
            returnVec[vocabList.index(word)] = 1
        else: print("the word: %s is not in my Vocabulary!" % word)
    return returnVec                                                     # return the document vector

# bag-of-words vectorization
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)                                       # vector of zeros, one slot per vocabulary word
    for word in inputSet:                                                # iterate over the document's words
        if word in vocabList:                                            # increment the count if the word is in the vocabulary
            returnVec[vocabList.index(word)] += 1
    return returnVec                                                     # return the bag-of-words vector

Training:

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                            # number of training documents
    numWords = len(trainMatrix[0])                             # length of each document vector (vocabulary size)
    pAbusive = sum(trainCategory)/float(numTrainDocs)          # probability that a document belongs to class 1 (spam)
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)       # numpy.ones arrays: word counts initialized to 1 (Laplace smoothing)
    p0Denom = 2.0; p1Denom = 2.0                               # denominators initialized to 2 (Laplace smoothing)
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:                            # accumulate the counts needed for the class-1 conditional probabilities P(w0|1),P(w1|1),P(w2|1)...
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:                                                # accumulate the counts needed for the class-0 conditional probabilities P(w0|0),P(w1|0),P(w2|0)...
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num/p1Denom)                            # take logs to prevent underflow         
    p0Vect = np.log(p0Num/p0Denom)         
    return p0Vect,p1Vect,pAbusive                             # return the class-0 log-probability vector, the class-1 log-probability vector, and the class-1 prior

Prediction:

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)        # element-wise multiply then sum; log(A*B) = log(A) + log(B), so log(pClass1) is added
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):                                                  # iterate over the 25 txt files of each class
        # spam emails
        wordList = textParse(open('email/spam/%d.txt' % i, 'r').read())     # read each spam email and convert the string to a token list
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(1)                                                 # label spam emails with 1
        
        # ham emails
        wordList = textParse(open('email/ham/%d.txt' % i, 'r').read())      # read each non-spam email and convert the string to a token list
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(0)                                                 # label non-spam emails with 0   
    # vocabulary    
    vocabList = createVocabList(docList)                                    # build the vocabulary, without duplicates
    # print(vocabList)
    
    # split the dataset
    trainingSet = list(range(50)); testSet = []                             # index lists for the training set and the test set                       
    for i in range(10):                                                     # out of 50 emails, randomly pick 40 for training and 10 for testing
        randIndex = int(random.uniform(0, len(trainingSet)))                # pick a random index
        testSet.append(trainingSet[randIndex])                              # add it to the test-set indices
        del(trainingSet[randIndex])                                         # and remove it from the training-set indices
    trainMat = []; trainClasses = []                                        # training matrix and training label vector             
    for docIndex in trainingSet:                                            # iterate over the training set
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))       # add the set-of-words vector to the training matrix
        trainClasses.append(classList[docIndex])                            # add the label to the training label vector
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))  # train the Naive Bayes model
    errorCount = 0                                                          # misclassification counter
    for docIndex in testSet:                                                # iterate over the test set
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])           # set-of-words vector of the test document
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:    # if the classification is wrong
            errorCount += 1                                                 # increment the error counter
            print("Misclassified test document:", docList[docIndex])
    print('Error rate: %.2f%%' % (float(errorCount) / len(testSet) * 100))


spamTest()

result:

Misclassified test document: ['scifinance', 'now', 'automatically', 'generates', 'gpu', 'enabled', 'pricing', 'risk', 'model', 'source', 'code', 'that', 'runs', '300x', 'faster', 'than', 'serial', 'code', 'using', 'new', 'nvidia', 'fermi', 'class', 'tesla', 'series', 'gpu', 'scifinance', 'derivatives', 'pricing', 'and', 'risk', 'model', 'development', 'tool', 'that', 'automatically', 'generates', 'and', 'gpu', 'enabled', 'source', 'code', 'from', 'concise', 'high', 'level', 'model', 'specifications', 'parallel', 'computing', 'cuda', 'programming', 'expertise', 'required', 'scifinance', 'automatic', 'gpu', 'enabled', 'monte', 'carlo', 'pricing', 'model', 'source', 'code', 'generation', 'capabilities', 'have', 'been', 'significantly', 'extended', 'the', 'latest', 'release', 'this', 'includes']
Error rate: 10.00%

As you can see, most of the effort is spent on data processing.

The training algorithm uses the trainNB0() function established previously.

 

5. Sina News Classification Based on SKLearn Naive Bayes

1. Introduction to scikit-learn Naive Bayes

In scikit-learn, there are three Naive Bayes classification algorithm classes. They are GaussianNB, MultinomialNB and BernoulliNB.

  • GaussianNB: Naive Bayes that assumes the features follow a Gaussian (normal) distribution
  • MultinomialNB: Naive Bayes that assumes the features follow a multinomial distribution
  • BernoulliNB: Naive Bayes that assumes the features follow a Bernoulli distribution

News classification is a multi-class problem, so we can use MultinomialNB() to solve it.

MultinomialNB assumes that the conditional probability of each feature follows a multinomial distribution, with the smoothed estimate as follows:
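
(The image of the formula is not reproduced here; a common form of this smoothed estimate, consistent with the description below, is)

P(X_j = x_{jl} \mid Y = C_k) = \frac{m_{k,jl} + \lambda}{m_k + O_j \lambda}

where m_{k,jl} is the number of class-C_k training samples whose j-th feature takes its l-th value, and O_j is the number of possible values of the j-th feature (both symbols introduced here for the reconstruction).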

Here P(Xj = xjl | Y = Ck) is the conditional probability that the j-th feature takes its l-th value given the k-th class, mk is the number of training samples whose output is the k-th class, and λ is a constant greater than 0. Taking λ = 1 gives Laplace smoothing; other values can also be used.

 

2. MultinomialNB parameters:

The MultinomialNB class has only 3 parameters (a short usage sketch follows the list):

  • alpha: optional float, default 1.0. This is the Laplace (Lidstone) smoothing parameter, i.e. λ in the formula above; setting it to 0 disables smoothing.
  • fit_prior: optional Boolean, default True. It indicates whether to learn the class prior probabilities. If False, all classes get the same prior. Otherwise, you can either pass the priors yourself through the third parameter class_prior, or leave class_prior unset and let MultinomialNB estimate the priors from the training data, in which case P(Y=Ck) = mk/m, where m is the total number of training samples and mk is the number of training samples of the k-th class.
  • class_prior: optional parameter, default None. The class prior probabilities, used as described above.
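
As referenced above, here is a minimal usage sketch of these parameters; the toy word-count data and the expected prediction are made up for illustration only.

from sklearn.naive_bayes import MultinomialNB
import numpy as np

# hypothetical toy data: 6 documents x 4 word-count features, two classes
X = np.array([[2, 1, 0, 0],
              [3, 0, 1, 0],
              [1, 2, 0, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 2],
              [0, 0, 3, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = MultinomialNB(alpha=1.0,         # Laplace smoothing, i.e. lambda = 1
                    fit_prior=True,    # estimate the class priors from the data
                    class_prior=None)  # or pass them explicitly, e.g. [0.5, 0.5]
clf.fit(X, y)
print(clf.predict([[0, 0, 2, 1]]))         # should predict class 1 for this word-count vector
print(clf.predict_proba([[2, 1, 0, 0]]))   # posterior probabilities for each class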
     

 

3. Code

Process: text segmentation ---- feature selection (removing stop words, etc.) ---- removing high-frequency words ---- training classification ---- prediction

 

Data preprocessing: split the data into a training set and a test set

from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt
import os
import random
import jieba

# preprocess the raw data (txt files) into a training set, a test set, and a word list
def TextProcessing(folder_path, test_size = 0.2):
    folder_list = os.listdir(folder_path)                        # subfolders under folder_path, one per category
    data_list = []                                               # dataset samples
    class_list = []                                              # dataset labels

    # iterate over each subfolder
    for folder in folder_list:
        new_folder_path = os.path.join(folder_path, folder)      # path of the subfolder
        files = os.listdir(new_folder_path)                      # list of txt files in the subfolder

        j = 1
        # iterate over each txt file
        for file in files:
            if j > 100:                                          # at most 100 samples per category
                break
            with open(os.path.join(new_folder_path, file), 'r', encoding = 'utf-8') as f:    # open the txt file
                raw = f.read()

            word_cut = jieba.cut(raw, cut_all = False)           # precise mode, returns an iterable generator
            word_list = list(word_cut)                           # convert the generator to a list

            data_list.append(word_list)                          # add the sample
            class_list.append(folder)                            # the folder name is the label
            j += 1

    data_class_list = list(zip(data_list, class_list))           # zip samples and labels together
    random.shuffle(data_class_list)                              # shuffle data_class_list
    index = int(len(data_class_list) * test_size) + 1            # index at which to split train and test
    train_list = data_class_list[index:]                         # training set
    test_list = data_class_list[:index]                          # test set
    train_data_list, train_class_list = zip(*train_list)         # unzip the training set
    test_data_list, test_class_list = zip(*test_list)            # unzip the test set

    all_words_dict = {}                                          # word-frequency counts over the training set
    for word_list in train_data_list:
        for word in word_list:
            if word in all_words_dict.keys():
                all_words_dict[word] += 1
            else:
                all_words_dict[word] = 1

    # sort by frequency in descending order
    all_words_tuple_list = sorted(all_words_dict.items(), key = lambda f:f[1], reverse = True)
    all_words_list, all_words_nums = zip(*all_words_tuple_list)  # unzip
    all_words_list = list(all_words_list)                        # convert to a list
    return all_words_list, train_data_list, test_data_list, train_class_list, test_class_list

Stop words and removal of high-frequency words

# build the stop-word set
def MakeWordsSet(words_file):
    words_set = set()                                            # create an empty set
    with open(words_file, 'r', encoding = 'utf-8') as f:         # open the stop-word file
        for line in f.readlines():                               # read line by line
            word = line.strip()                                  # strip the newline
            if len(word) > 0:                                    # non-empty lines are added to words_set
                words_set.add(word)                               
    return words_set                                             # return the stop-word set

# build the feature-word list after removing the deleteN most frequent words
def words_dict(all_words_list, deleteN, stopwords_set = set()):
    feature_words = []                            
    n = 1
    for t in range(deleteN, len(all_words_list), 1):
        if n > 1000:                            # keep at most 1000 feature words
            break                               
        # a word is a feature word if it is not a number, not a stop word, and its length is between 2 and 4
        if not all_words_list[t].isdigit() and all_words_list[t] not in stopwords_set and 1 < len(all_words_list[t]) < 5:
            feature_words.append(all_words_list[t])
        n += 1
    return feature_words

Vectorization

# vectorize the training set and the test set
def TextFeatures(train_data_list, test_data_list, feature_words):
    def text_features(text, feature_words):                     # set the slot to 1 if the feature word appears in the text
        text_words = set(text)
        features = [1 if word in text_words else 0 for word in feature_words]
        return features
    train_feature_list = [text_features(text, feature_words) for text in train_data_list]
    test_feature_list = [text_features(text, feature_words) for text in test_data_list]
    return train_feature_list, test_feature_list                # return the feature vectors

Prediction

# train and evaluate with scikit-learn's Naive Bayes
def TextClassifier(train_feature_list, test_feature_list, train_class_list, test_class_list):
    classifier = MultinomialNB().fit(train_feature_list, train_class_list)
    test_accuracy = classifier.score(test_feature_list, test_class_list)
    return test_accuracy

if __name__ == '__main__':
    folder_path = './SogouC/Sample' 
    # split the dataset
    all_words_list, train_data_list, test_data_list, train_class_list, test_class_list = TextProcessing(folder_path, test_size=0.2)

    # stop-word set
    stopwords_file = './stopwords_cn.txt'
    stopwords_set = MakeWordsSet(stopwords_file)


    test_accuracy_list = []
    deleteNs = range(0, 1000, 20)                # 0 20 40 60 ... 980
    for deleteN in deleteNs:
        # feature words after removing the deleteN most frequent words
        feature_words = words_dict(all_words_list, deleteN, stopwords_set)
        # vectorize the dataset
        train_feature_list, test_feature_list = TextFeatures(train_data_list, test_data_list, feature_words)
        # predict
        test_accuracy = TextClassifier(train_feature_list, test_feature_list, train_class_list, test_class_list)
        test_accuracy_list.append(test_accuracy)

    plt.figure()
    plt.plot(deleteNs, test_accuracy_list)
    plt.title('Relationship of deleteNs and test_accuracy')
    plt.xlabel('deleteNs')
    plt.ylabel('test_accuracy')
    plt.show()

result (a plot of test_accuracy against deleteNs):

Again, most of the effort goes into data preparation; calling the scikit-learn functions themselves is very simple.

Source: https://blog.csdn.net/bailixuance/article/details/85060762