Machine Learning in Action, Chapter 4: Classifying with Probability Theory: Naive Bayes

Overview of Naive Bayes

Bayesian classification is the general name for a family of classification algorithms, all of which are based on Bayes' theorem. This chapter first introduces the foundation of these algorithms, Bayes' theorem, and then works through examples of the simplest member of the family: Naive Bayes classification.

Bayesian Theory & Conditional Probability

Bayesian theory

We now have a data set, which consists of two types of data. The data distribution is as shown in the figure below:

(Figure: data distribution of the two classes in the Naive Bayes example)

We now use p1(x,y) to represent the probability that data point (x,y) belongs to category 1 (the category represented by a dot in the figure), and use p2(x,y) to represent the probability that data point (x,y) belongs to category 2 (the category represented by the triangle in the figure), then for a new data point (x, y), the following rules can be used to determine its category:

  • If p1(x,y) > p2(x,y) , then the category is 1
  • If p2(x,y) > p1(x,y) , then the category is 2

In other words, we choose the category with the higher probability. This is the core idea of Bayesian decision theory: choose the decision with the highest probability.
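
In code, the rule is nothing more than picking the class whose probability is larger; a minimal sketch (the probability values below are made up for illustration):

def bayesDecision(p1, p2):
    """Return category 1 if p1 > p2, otherwise category 2."""
    return 1 if p1 > p2 else 2

print(bayesDecision(0.7, 0.3))   # prints 1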

Conditional Probability

If you are familiar with the p(x,y|c1) notation, you can skip this section.

There is a jar containing 7 stones, 3 of which are white and 4 of which are black. If a stone is picked at random from the jar, what is the probability that it is white? Since there are 7 equally likely stones to draw and 3 of them are white, the probability of drawing a white stone is 3/7. Likewise, the probability of drawing a black stone is 4/7. We use P(white) to denote the probability of drawing a white stone; its value is the number of white stones divided by the total number of stones.

(Figure: a jar of 7 stones, 3 white and 4 black)

If these 7 stones are placed in two buckets as shown in the picture below, how should the above probability be calculated?

(Figure: the 7 stones divided between bucket A and bucket B)

When calculating P(white) or P(black), knowing in advance which bucket the stone comes from changes the result. This is called conditional probability. Suppose we want the probability of drawing a white stone from bucket B; this probability is written P(white|bucketB), read as "the probability of drawing a white stone given that the stone comes from bucket B". It is easy to see that P(white|bucketA) is 2/4 and P(white|bucketB) is 1/3.

The formula for calculating conditional probability is as follows:

P(white|bucketB) = P(white and bucketB) / P(bucketB)

First, dividing the number of white stones in bucket B by the total number of stones in the two buckets gives P(white and bucketB) = 1/7. Second, since bucket B contains 3 of the 7 stones, P(bucketB) = 3/7. Therefore P(white|bucketB) = P(white and bucketB) / P(bucketB) = (1/7) / (3/7) = 1/3.

Another effective way to manipulate conditional probabilities is Bayes' criterion (Bayes' rule). It tells us how to swap the condition and the outcome in a conditional probability: if P(x|c) is known and P(c|x) is needed, it can be computed as follows:

P(c|x) = P(x|c) P(c) / P(x)
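
These numbers are easy to verify in Python. The sketch below uses the stone counts from the figure (2 white and 2 black in bucket A, 1 white and 2 black in bucket B) and then applies Bayes' criterion to flip the conditioning:

from fractions import Fraction

p_white_and_B = Fraction(1, 7)   # white stones in bucket B / all stones
p_B = Fraction(3, 7)             # stones in bucket B / all stones
p_white = Fraction(3, 7)         # all white stones / all stones

p_white_given_B = p_white_and_B / p_B                  # 1/3, as computed above
p_B_given_white = p_white_given_B * p_B / p_white      # Bayes' criterion: P(bucketB|white)
print(p_white_given_B, p_B_given_white)                # 1/3 1/3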

Use conditional probabilities to classify

Above we mentioned that Bayesian decision theory requires the calculation of two probabilities p1(x, y) and p2(x, y):

  • If p1(x, y) > p2(x, y), then it belongs to category 1;
  • If p2(x, y) > p1(x, y), then it belongs to category 2.

This is not all that Bayesian decision theory is about. The use of p1() and p2() is just to simplify the description as much as possible, but what really needs to be calculated and compared is p(c1|x, y) and p(c2|x, y). The specific meanings of these symbols are: Given a data point represented by x, y, what is the probability that the data point comes from category c1? What is the probability that the data point is from category c2? Note that these probabilities are not the same as the probabilities p(x, y|c1), but you can use Bayes' criterion to exchange conditions and outcomes in probabilities. Specifically, applying Bayes' criterion we get:

p(ci|x, y) = p(x, y|ci) p(ci) / p(x, y)

Using the above definitions, the Bayesian classification criterion can be defined as:

  • If P(c1|x, y) > P(c2|x, y), then it belongs to category c1;
  • If P(c2|x, y) > P(c1|x, y), then it belongs to category c2.

In document classification, the entire document (such as an email) is the instance, while certain elements within the email constitute features. We can observe the words that appear in the document and use each word as a feature, and the occurrence or absence of each word as the value of the feature, so that the number of features obtained will be as many as the number of words in the vocabulary .

We assume that the features are independent of one another. Independence here is meant in the statistical sense: the probability that a feature (word) appears is unrelated to whether it is adjacent to any other word. For example, the probability that the word "we" appears has nothing to do with whether it appears next to the word "I". This assumption is exactly what the word naive in Naive Bayes means. The other assumption in the Naive Bayes classifier is that every feature is equally important.

Note: Naive Bayes classifiers are usually implemented in one of two ways: one based on the Bernoulli model and the other based on the multinomial model. The former is used here. This implementation does not consider how many times a word appears in a document, only whether it appears at all, so in that sense it is equivalent to assuming that all words carry equal weight.
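
For comparison (this is outside the book's own code and assumes scikit-learn is installed), the same distinction exists in scikit-learn: BernoulliNB works on presence/absence features like the set-of-words vectors built below, while MultinomialNB uses word counts like the bag-of-words vectors used later. The tiny feature matrices here are invented purely for illustration:

from sklearn.naive_bayes import BernoulliNB, MultinomialNB

X_presence = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]   # 0/1: does word i appear in the document?
X_counts   = [[2, 0, 1], [0, 3, 1], [1, 1, 0]]   # how many times word i appears
y = [0, 1, 0]

print(BernoulliNB().fit(X_presence, y).predict([[1, 0, 0]]))   # predicted class label
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 0]]))   # predicted class label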

Naive Bayes scenario

An important application of machine learning is the automatic classification of documents.


Naive Bayes is an extension of the Bayesian classifier introduced above and is a commonly used algorithm for document classification. Below we will carry out some practical projects on Naive Bayes classification.

Naive Bayes Principle

How Naive Bayes works

Extract the terms from all documents and remove duplicates
Get all the document categories
Count the number of documents in each category
For each training document:
    For each category:
        If a term appears in the document --> increment the count for that term (a for loop or vector addition)
        Increment the total term count for that category
For each category:
    For each term:
        Divide the term's count by the total term count to get the conditional probability (P(term|category))
Return the conditional probability of the document belonging to each category (P(category|all terms in the document))

Naive Bayes development process

Collect data: any method can be used.
Prepare data: numeric or Boolean values are required.
Analyze data: with a large number of features, plotting the features is of little use; histograms work better.
Train the algorithm: compute the conditional probabilities of the different independent features.
Test the algorithm: compute the error rate.
Use the algorithm: a common application of Naive Bayes is document classification. A Naive Bayes classifier can be used in any classification setting, not necessarily text.

Characteristics of Naive Bayes algorithm

Pros: still effective with small amounts of data; can handle multi-class problems.
Cons: sensitive to how the input data is prepared.
Applicable data types: nominal data.

Naive Bayes project case

Project Case 1: Blocking insulting comments on community message boards

Project Overview

Build a quick filter to block abusive comments on online community message boards. If a comment uses negative or insulting language, the comment will be marked as inappropriate. Two categories are established for this problem: insulting category and non-insulting category, represented by 1 and 0 respectively.

Development Process
Collect data: any method can be used
Prepare data: build word vectors from the text
Analyze data: inspect the terms to make sure parsing is correct
Train the algorithm: compute probabilities from the word vectors
Test the algorithm: adjust the classifier to real-world conditions
Use the algorithm: classify comments on the community message board

Collect data: any method can be used

The data for this example is a small set of posts we construct ourselves:

def loadDataSet():
    '''
    Create a fake data set for the example
    :return: list of posts postingList, class labels classVec
    '''
    postingList=[['my','dog','has','flea','problems','help','please'],
                 ['maybe','not','take','him','to','dog','park','stupid'],
                 ['my','dalmation','is','so','cute','I','love','him'],
                 ['stop','posting','stupid','worthless','garbage'],
                 ['mr','licks','ate','my','steak','how','to','stop','him'],
                 ['quit','buying','worthless','dog','food','stupid']]
    classVec=[0,1,0,1,0,1]  # 1 means abusive text, 0 means a normal post
    return postingList,classVec

Prepare the data: Build word vectors from text

def createVocabList(dataSet):
    """
    Get the set of all words
    :param dataSet: the data set
    :return: the set of all words (i.e. a word list with no duplicates)
    """
    vocabSet=set([])
    for document in dataSet:
        # | takes the union of two sets
        vocabSet=vocabSet|set(document)
    return list(vocabSet)


def setOfWords2Vec(vocabList,inputSet):
    """
    Check whether each word appears; if a word appears, set its entry to 1
    :param vocabList: list of all words in the vocabulary
    :param inputSet: the input document (list of words)
    :return: a match list such as [0,1,0,1...], where 1 and 0 indicate whether each vocabulary word appears in the input document
    """
    # create a vector of the same length as the vocabulary, with every element set to 0
    returnVec=[0]*len(vocabList)
    # walk through all words in the document; if a vocabulary word appears, set the corresponding entry of the output vector to 1
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]=1
        else:
            print("the word:%s is not in my Vocabulary!"%word)
    return returnVec

Analyze data: Check terms to ensure parsing is correct

Run the functions and inspect the vocabulary list: it should contain no duplicate words. If needed, you can sort it.

>>> listOPosts, listClasses = bayes.loadDataSet()
>>> myVocabList = bayes.createVocabList(listOPosts)
>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 
'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 
'stupid', 'so', 'take', 'mr', 'steak', 'my']

Now verify the word-vector function. For example, which word is at index 2 of myVocabList? It should be help. That word appears in the first document but not in the fourth; check the vectors below to confirm.

>>> bayes.setOfWords2Vec(myVocabList, listOPosts[0])
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]

>>> bayes.setOfWords2Vec(myVocabList, listOPosts[3])
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

Training algorithm: Calculate probabilities from word vectors

Now we know whether each word appears in a document and which category the document belongs to. Next we rewrite Bayes' criterion, replacing the earlier x, y with w. The bold w indicates a vector, i.e. it consists of multiple values; in this example the number of values equals the number of words in the vocabulary.

p(ci|w) = p(w|ci) p(ci) / p(w)

We calculate this value for each class using the formula above, and then compare the two probability values.

Question: Why is P(w) not computed in the code implementation?

Answer: According to the formula, the denominator P(w) is the same for every class ci, so dividing by it does not change which class has the larger posterior. Since we only need to compare the two values to make a classification decision, it is enough to compare the numerators p(w|ci) p(ci).

First, the probability p(ci) is computed by dividing the number of documents in category i (abusive or not abusive) by the total number of documents. Next we compute p(w|ci), and this is where the naive Bayes assumption comes in. If w is expanded into its individual features, the probability becomes p(w0, w1, w2...wn | ci). Assuming all the words are independent of one another, which is known as the conditional independence assumption (for example, when two people A and B each throw a die, the outcomes do not affect each other: the probability that A throws a 2 and B throws a 3 is 1/6 * 1/6), we can compute the probability as p(w0|ci) p(w1|ci) p(w2|ci) ... p(wn|ci), which greatly simplifies the calculation.
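
A tiny numeric sketch of this factorization (the per-word probabilities below are made up only to show the arithmetic): for a two-word document w = (w0, w1), the class score is just the product of the per-word conditional probabilities and the class prior, and P(w) can be ignored because it is the same for both classes.

# made-up per-word probabilities for the two classes, for illustration only
p_w_given_c1 = {'stupid': 0.15, 'dog': 0.05}
p_w_given_c0 = {'stupid': 0.01, 'dog': 0.08}
p_c1, p_c0 = 0.5, 0.5

doc = ['stupid', 'dog']
score_c1 = p_c1
score_c0 = p_c0
for w in doc:
    score_c1 *= p_w_given_c1[w]   # p(w0|c1) * p(w1|c1) * p(c1)
    score_c0 *= p_w_given_c0[w]   # p(w0|c0) * p(w1|c0) * p(c0)
# p(w) is the same for both classes, so comparing the numerators is enough
print(1 if score_c1 > score_c0 else 0)   # prints 1 here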

Naive Bayes classifier training function

from numpy import zeros, ones, log, array  # NumPy functions used throughout bayes.py

def _trainNB0(trainMatrix, trainCategory):
    """
    Train on the data, original version
    :param trainMatrix: document-word matrix [[1,0,1,1,1....],[],[]...]
    :param trainCategory: category of each document [0,1,1,0....]; the list is as long as the number of rows of the word matrix, where 1 means the document is abusive and 0 means it is not
    :return:
    """
    # number of documents
    numTrainDocs = len(trainMatrix)
    # number of words
    numWords = len(trainMatrix[0])
    # probability of an abusive document: the number of 1s in trainCategory is the number of
    # abusive documents; dividing by the total number of documents gives the probability
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # build the word-count vectors
    p0Num = zeros(numWords) # [0,0,0,.....]
    p1Num = zeros(numWords) # [0,0,0,.....]

    # total number of words seen in each class
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        # is this an abusive document?
        if trainCategory[i] == 1:
            # if so, add its word vector to the abusive totals
            p1Num += trainMatrix[i] # [0,1,1,....] + [0,1,1,....] -> [0,2,2,...]
            # sum all elements of the vector, i.e. the total number of words in abusive documents
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # class 1 (abusive documents): the list [P(F1|C1),P(F2|C1),P(F3|C1),P(F4|C1),P(F5|C1)....]
    # i.e. the probability of each word given class 1
    p1Vect = p1Num / p1Denom # [1,2,3,5]/90 -> [1/90,...]
    # class 0 (normal documents): the list [P(F1|C0),P(F2|C0),P(F3|C0),P(F4|C0),P(F5|C0)....]
    # i.e. the probability of each word given class 0
    p0Vect = p0Num / p0Denom
    return p0Vect, p1Vect, pAbusive

Test the algorithm: adjust the classifier to real-world conditions

When using the Bayesian classifier to classify a document, we multiply many probabilities together to obtain the probability that the document belongs to a given class, i.e. we compute p(w0|1) * p(w1|1) * p(w2|1). If any one of these probabilities is 0, the whole product is 0. To lessen this effect, we initialize every word's occurrence count to 1 and initialize the denominators to 2 (the values 1 and 2 are chosen mainly so that neither numerator nor denominator is 0; they can be adjusted to suit the application).
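
A short sketch of this smoothing idea (the counts are invented for illustration; the book's code uses a denominator offset of 2, while the more common add-one/Laplace form divides by the class word count plus the vocabulary size):

# toy counts: how often each word appears in class-1 documents
wordCounts = {'stupid': 3, 'dog': 2, 'love': 0}
totalWords = sum(wordCounts.values())
vocabSize = len(wordCounts)

# without smoothing, p('love'|c1) = 0 and any product containing it collapses to 0
p_unsmoothed = {w: c / float(totalWords) for w, c in wordCounts.items()}

# add-one (Laplace) smoothing keeps every probability strictly positive
p_smoothed = {w: (c + 1) / float(totalWords + vocabSize) for w, c in wordCounts.items()}
print(p_unsmoothed['love'], p_smoothed['love'])   # 0.0 versus 0.125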

Another problem is underflow, caused by multiplying many very small numbers. When computing the product p(w0|ci) * p(w1|ci) * p(w2|ci) ... p(wn|ci), most of the factors are very small, so the program underflows or produces an incorrect answer (multiplying many very small numbers in Python eventually rounds to 0). One solution is to take the natural logarithm of the product: since ln(a * b) = ln(a) + ln(b), taking logarithms avoids the errors caused by underflow and floating-point rounding, and using the natural logarithm loses nothing.
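
A quick demonstration of the underflow problem and the log fix (NumPy is assumed, as in the rest of the chapter):

import numpy as np

probs = np.full(100, 1e-5)        # 100 small per-word probabilities
print(np.prod(probs))             # 0.0 -- the product underflows
print(np.sum(np.log(probs)))      # about -1151.3 -- the log-sum stays representable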

The figure below shows the curves of f(x) and ln(f(x)). They increase and decrease together over the same regions and reach their extreme values at the same points; although their values differ, the final decision is unaffected.

(Figure: curves of f(x) and ln(f(x)))

def trainNB0(trainMatrix, trainCategory):
    """
    Train on the data, improved version
    :param trainMatrix: document-word matrix
    :param trainCategory: category of each document
    :return:
    """
    # total number of documents
    numTrainDocs = len(trainMatrix)
    # total number of words
    numWords = len(trainMatrix[0])
    # probability of an abusive document
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # build the word-count vectors
    # p0Num: counts for the normal class
    # p1Num: counts for the abusive class
    p0Num = ones(numWords)  # initialized to 1 instead of 0 to avoid zero probabilities
    p1Num = ones(numWords)

    # total word counts per class; the 2.0 offsets the denominators (mainly to keep them
    # from being 0; the value can be adjusted based on the data)
    # p0Denom: total for the normal class
    # p1Denom: total for the abusive class
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            # accumulate the abusive word counts
            p1Num += trainMatrix[i]
            # accumulate the total number of abusive words per document
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # class 1 (abusive documents): the list [log(P(F1|C1)),log(P(F2|C1)),log(P(F3|C1)),...]
    p1Vect = log(p1Num / p1Denom)
    # class 0 (normal documents): the list [log(P(F1|C0)),log(P(F2|C0)),log(P(F3|C0)),...]
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

Use algorithms: Classify community message board comments

Naive Bayes classification function

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """
    Use the algorithm:
        # convert multiplication to addition
        multiplication: P(C|F1F2...Fn) = P(F1F2...Fn|C)P(C)/P(F1F2...Fn)
        addition: P(F1|C)*P(F2|C)....P(Fn|C)P(C) -> log(P(F1|C))+log(P(F2|C))+....+log(P(Fn|C))+log(P(C))
    :param vec2Classify: the vector to classify, e.g. [0,1,1,1,1...]
    :param p0Vec: class 0 (normal documents): the list [log(P(F1|C0)),log(P(F2|C0)),log(P(F3|C0)),...]
    :param p1Vec: class 1 (abusive documents): the list [log(P(F1|C1)),log(P(F2|C1)),log(P(F3|C1)),...]
    :param pClass1: the prior probability of class 1, the abusive class
    :return: class 1 or 0
    """
    # formula: log(P(F1|C))+log(P(F2|C))+....+log(P(Fn|C))+log(P(C))
    # Note that the comparison is made without dividing by the denominator of Bayes' criterion,
    # P(w) (the probability of this document among all documents), because P(w) is computed over
    # all documents, abusive or not, and is therefore the same for both classes.
    # The NumPy element-wise product multiplies the first elements of the two vectors, then the
    # second elements, and so on; vec2Classify * p1Vec associates each word with its probability.
    p1 = sum(vec2Classify * p1Vec) + log(pClass1) # P(w|c1) * P(c1), i.e. the numerator of Bayes' criterion
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1) # P(w|c0) * P(c0), i.e. the numerator of Bayes' criterion
    if p1 > p0:
        return 1
    else:
        return 0


def testingNB():
    """
    Test the Naive Bayes algorithm
    """
    # 1. load the data set
    listOPosts, listClasses = loadDataSet()
    # 2. build the vocabulary
    myVocabList = createVocabList(listOPosts)
    # 3. compute which words appear and build the data matrix
    trainMat = []
    for postinDoc in listOPosts:
        # an m*len(myVocabList) matrix containing only 0/1 entries
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    # 4. train on the data
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    # 5. test on new data
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)

Project Case 2: Using Naive Bayes to Filter Spam

Project Overview

This project implements one of the most famous applications of Naive Bayes: filtering spam email.

Development Process

Classify emails using Naive Bayes

Collect data: text files are provided
Prepare data: parse the text files into term vectors
Analyze data: inspect the terms to make sure parsing is correct
Train the algorithm: use the trainNB0() function built earlier
Test the algorithm: use Naive Bayes with hold-out cross-validation
Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified documents to the screen

Collect data: Provide text file

The content of the text file is as follows:

Hi Peter,

With Jose out of town, do you want to
meet once in a while to keep things
going and do some interesting stuff?

Let me know
Eugene

Prepare the data: Parse the text file into term vectors

Use regular expressions to split text

>>> mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
>>> import re
>>> regEx = re.compile('\\W+')
>>> listOfTokens = regEx.split(mySent)
>>> listOfTokens
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']
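
The split leaves behind empty strings and mixed case. The list comprehension below (the same cleanup that the textParse() function later applies) lower-cases the tokens and drops anything shorter than 3 characters:

>>> [tok.lower() for tok in listOfTokens if len(tok) > 2]
['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']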

Analyze data: Check terms to ensure parsing is correct

Training algorithm: Use the trainNB0() function we established previously

The trainNB0() function is the improved version already defined in Project Case 1 above and is reused here unchanged.

Test algorithm: Cross-validation using Naive Bayes

File parsing and complete spam test function

# split the text
def textParse(bigString):
    '''
    Desc:
        take a big string and parse it into a list of strings
    Args:
        bigString -- the big string
    Returns:
        the list of tokens, lower-cased, with tokens of fewer than 3 characters dropped
    '''
    import re
    # split the sentence with a regular expression; the separators are runs of any characters other than letters and digits
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    '''
    Desc:
        run the Bayesian spam classifier automatically.
    Args:
        none
    Returns:
        classifies each email in the test set; each misclassified email adds 1 to the error
        count, and the overall error rate is printed at the end.
    '''
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        # split and parse the data, label it as class 1 (spam)
        wordList = textParse(open('db/4.NaiveBayes/email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        # split and parse the data, label it as class 0 (ham)
        wordList = textParse(open('db/4.NaiveBayes/email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    # build the vocabulary
    vocabList = createVocabList(docList)
    trainingSet = range(50)
    testSet = []
    # hold out 10 emails at random for testing
    for i in range(10):
        # random.uniform(x, y) returns a random real number between x and y
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the errorCount is: ', errorCount
    print 'the testSet length is :', len(testSet)
    print 'the error rate is :', float(errorCount)/len(testSet)
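
Because the 10 test emails are drawn at random, the printed error rate differs from run to run. A rough way to average it, assuming spamTest() is modified to return float(errorCount)/len(testSet) instead of only printing it, is the sketch below:

# hypothetical averaging loop; assumes spamTest() returns the error rate
numRuns = 10
total = 0.0
for _ in range(numRuns):
    total += spamTest()
print('average error rate over %d runs: %f' % (numRuns, total / numRuns))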

Use the algorithm: Build a complete program to classify a set of documents and output the misclassified documents to the screen

Project Case 3: Using Naive Bayes Classifier to Obtain Regional Preferences from Personal Advertisements

Project Overview

Advertisers often want to know certain demographic information about a person so that they can better target their ads.

We will select some people from two cities in the United States and analyze the personal ads they post to see whether people in the two cities use different words in their ads. If they do, what are the words each group commonly uses? Can the words people use tell us something about what people in different cities care about?

Development Process
Collect data: collect content from RSS feeds; this requires building an interface to the RSS feeds
Prepare data: parse the text into term vectors
Analyze data: inspect the terms to make sure parsing is correct
Train the algorithm: use the trainNB0() function built earlier
Test the algorithm: observe the error rate to make sure the classifier is usable; the tokenizer can be modified to lower the error rate and improve the results
Use the algorithm: build a complete program wrapping everything; given two RSS feeds, the program displays the most common words

Collect data: Collect content from RSS sources. Here you need to build an interface for the RSS sources.

In other words, to read the RSS feeds we use the feedparser package. Browse the documentation at http://code.google.com/p/feedparser/, download and unpack feedparser, change into the unpacked folder, and run the following at the command prompt (not the Python prompt):

python setup.py install
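
On a current Python installation, feedparser can also be installed with pip (pip install feedparser). A minimal check that the package works, with the caveat that the Craigslist RSS URLs used later may no longer serve data, looks like this:

import feedparser

ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
print(len(ny['entries']))               # number of entries fetched
if ny['entries']:
    print(ny['entries'][0]['summary'])  # the text that textParse() will receive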

Prepare the data: Parse the text file into term vectors

Document Bag of Words Model

Treating the presence or absence of each word as a feature is known as the set-of-words model. If a word can appear more than once in a document, that carries information which mere presence or absence cannot express; this approach is called the bag-of-words model. In a bag of words each word can appear multiple times, whereas in a word set each word can appear only once. To accommodate the bag-of-words model, the function setOfWords2Vec() needs a slight modification; the modified function is called bagOfWords2VecMN().

The bag-of-words version of the code is given below. It is almost identical to setOfWords2Vec(); the only difference is that every time a word is encountered it increments the corresponding entry of the word vector instead of just setting it to 1.

def bagOfWords2VecMN(vocabList, inputSet):
    # create a vector with one entry per vocabulary word, all initialized to 0
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # increment the count instead of just setting the entry to 1
            returnVec[vocabList.index(word)] += 1
    return returnVec

The createVocabList() and textParse() functions defined in the earlier project cases are reused here unchanged.

Analyze data: Check terms to ensure parsing is correct

Training algorithm: Use the trainNB0() function we established previously

Again, the improved trainNB0() function defined in Project Case 1 is reused unchanged.

Test the algorithm: observe the error rate to make sure the classifier is usable. The tokenizer can be modified to reduce the error rate and improve the classification results.

# RSS feed classifier and frequent-word removal function
def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:  # for every word in the vocabulary
        freqDict[token]=fullText.count(token)  # count how many times the word occurs in the text
    sortedFreq=sorted(freqDict.iteritems(),key=operator.itemgetter(1),reverse=True)  # sort the dictionary by occurrence count, highest first
    return sortedFreq[:30]   # return the 30 most frequent words

def localWords(feed1,feed0):
    import feedparser
    docList=[];classList=[];fullText=[]
    minLen=min(len(feed1['entries']),len(feed0['entries']))
    for i in range(minLen):
        wordList=textParse(feed1['entries'][i]['summary'])   # visit one RSS entry at a time
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList=textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList=createVocabList(docList)
    top30Words=calcMostFreq(vocabList,fullText)
    for pairW in top30Words:
        if pairW[0] in vocabList:vocabList.remove(pairW[0])    # remove the most frequent words
    trainingSet=range(2*minLen);testSet=[]
    for i in range(20):
        randIndex=int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[];trainClasses=[]
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList,docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam=trainNB0(array(trainMat),array(trainClasses))
    errorCount=0
    for docIndex in testSet:
        wordVector=bagOfWords2VecMN(vocabList,docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam)!=classList[docIndex]:
            errorCount+=1
    print 'the error rate is:',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

The classifyNB() function defined in Project Case 1 is reused unchanged to classify each word vector.

Using algorithms: Build a complete program that encapsulates everything. Given two RSS feeds, the program will display the most commonly used common words

The function localWords() takes two parsed RSS feeds as parameters. The feeds are loaded outside the function because their content changes over time; reloading them fetches new data.

>>> reload(bayes)
<module 'bayes' from 'bayes.pyc'>
>>> import feedparser
>>> ny=feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf=feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList,pSF,pNY=bayes.localWords(ny,sf)
the error rate is: 0.2
>>> vocabList,pSF,pNY=bayes.localWords(ny,sf)
the error rate is: 0.3
>>> vocabList,pSF,pNY=bayes.localWords(ny,sf)
the error rate is: 0.55

To get an accurate estimate of the error rate, the experiment above should be repeated several times and the results averaged.

Next, we analyze the data to display the words most characteristic of each region.

You can sort the vectors pSF and pNY first, and then print them out in order. Add the following code to the file:

# display the most characteristic words for each region
def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[];topSF=[]
    for i in range(len(p0V)):
        if p0V[i]>-6.0:topSF.append((vocabList[i],p0V[i]))
        if p1V[i]>-6.0:topNY.append((vocabList[i],p1V[i]))
    sortedSF=sorted(topSF,key=lambda pair:pair[1],reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY=sorted(topNY,key=lambda pair:pair[1],reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

The function getTopWords() takes two RSS feeds as input, trains and tests a Naive Bayes classifier, and returns the probability vectors it used. It then creates two lists to hold (word, probability) tuples. Unlike earlier, where we returned only the top X words, here all words whose log-probability exceeds a threshold are kept, and the tuples are sorted by their conditional probability.

Save the bayes.py file and enter at the python prompt:

>>> reload(bayes)
<module 'bayes' from 'bayes.pyc'>
>>> bayes.getTopWords(ny,sf)
the error rate is: 0.55
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
how
last
man
...
veteran
still
ends
late
off
own
know
NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**
someone
meet
...
apparel
recalled
starting
strings

When the three lines of code that remove the most frequent words were commented out and the classification performance compared, the error rate with those lines removed was 54%, while keeping them gave an error rate of 70%. Observe also that the top 30 most frequent words account for roughly 30% of all word occurrences, while vocabList contains about 3000 words: a small fraction of the vocabulary makes up a large fraction of the text. The reason is that most language is redundant and structural. A common refinement is to remove, in addition to the most frequent words, the structural auxiliary words from a predefined list; such a list is called a stop word list.

The last words printed include many stop words. You can remove a fixed list of stop words and see how the results change; doing so also lowers the classification error rate.
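
As a sketch (the stop word list below is only a tiny hand-picked subset, not a standard list), the parsing function could filter stop words like this:

import re

# illustrative subset of English stop words; a real stop word list is much longer
stopWords = set(['the', 'and', 'for', 'you', 'your', 'with', 'this', 'that', 'are', 'have'])

def textParseNoStop(bigString):
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens
            if len(tok) > 2 and tok.lower() not in stopWords]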

Summary

For classification, using probabilities can be more effective than using hard rules. Bayesian probability and Bayes' criterion provide an effective way to estimate unknown probabilities from known values.
The amount of data required can be reduced by assuming conditional independence between features. The independence assumption means that the probability of one word does not depend on the other words in the document. This assumption is, of course, overly simple; that is why the method is called Naive Bayes. Even though the conditional independence assumption does not really hold, Naive Bayes is still an effective classifier.


  • Reference information comes from ApacheCN


Origin blog.csdn.net/diaozhida/article/details/84786277