100 Days of Machine Learning | Day 15: Naive Bayes

On Day 15 we start learning naive Bayes. First, a brief look at the great man himself, as a mark of respect.

Thomas Bayes was an English theologian, mathematician, statistician and philosopher. Born in London in 1702, he served as a minister, was elected a Fellow of the Royal Society in 1742, and died on April 7, 1761. Bayes was, together with Blaise Pascal, one of the two figures who most influenced the early development of probability theory and statistics.

Bayes worked mainly in probability theory. He was the first to apply inductive reasoning to the foundations of probability, and he laid the groundwork for Bayesian statistical theory, contributing to statistical inference and estimation through statistical decision functions. His work in this area, published in 1763, remains important to modern probability theory and mathematical statistics. His essay "An essay towards solving a problem in the doctrine of chances" appeared posthumously in 1763, and many of the terms Bayes used are still in use today. His main contribution to statistical inference is the concept of "inverse probability", which he proposed as a general method of inference and which is now known as Bayesian inference.

I. Review of probability basics

Independent events: in a single experiment, the occurrence of one event does not affect the probability of another; the two events have no relationship. If A1, A2, A3, ..., An are mutually independent, the probability that A1 through An all occur is:

P(A1 A2 ... An) = P(A1) P(A2) ... P(An)

Conditional probability: the probability that event B occurs given that event A has occurred, written:

P(B|A) = P(AB) / P(A)

Total probability formula: if the events A1, A2, A3, ..., An form a complete event group, i.e. they are pairwise mutually exclusive and their union is the whole sample space Ω, and P(Ai) > 0 for every i, then for any event B of the experiment:

P(B) = P(A1)P(B|A1) + P(A2)P(B|A2) + ... + P(An)P(B|An)
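
A quick worked example, borrowing the two-box setup used later in this post: if box A is chosen with probability 0.5 and contains 70% red balls, and box B is chosen with probability 0.5 and contains 30% red balls, then the total probability of drawing a red ball is P(red) = P(A)P(red|A) + P(B)P(red|B) = 0.5 × 0.7 + 0.5 × 0.3 = 0.5.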

For more probability background, interested readers can refer to these notes:

[Review] Probability notes 1: probability of independent events

[Review] Probability notes 2: the classical probability model

[Learn new] Probability notes 3: the geometric probability model

[Review] Probability notes 4: important probability formulas

[Review] Probability notes 5: probability distributions

II. Bayes' Theorem

Bayes' theorem (Bayes' rule): if B1, B2, ..., Bk are finitely many pairwise disjoint events with P(B1) + P(B2) + ... + P(Bk) = 1, and A is an observable event, then:

P(Bi|A) = P(Bi) P(A|Bi) / (P(B1) P(A|B1) + P(B2) P(A|B2) + ... + P(Bk) P(A|Bk))

This is the Bayesian formula, in which:

P(Bi) is the prior probability, i.e. the probability of the hypothesis before any new data is seen;

P(Bi|A) is the posterior probability, i.e. the probability of the hypothesis recomputed after the new data has been observed;

P(A|Bi) is the likelihood, i.e. the probability of observing the data under that hypothesis;

P(A) is the normalizing constant, i.e. the overall probability of observing the data under any hypothesis.

The proof is not complicated:

1. By the definition of conditional probability, the probability of event A occurring given event B is P(A|B) = P(AB) / P(B).

2. Likewise, the probability of event B occurring given event A is P(B|A) = P(AB) / P(A).

3. Combining the two equations gives P(A|B) P(B) = P(AB) = P(B|A) P(A).

4. Dividing both sides by P(A), provided P(A) is non-zero, yields Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A).

In words: given that A has occurred, the probability of B equals the probability of B times the probability of A given B, divided by the probability of A. By linking A and B, the theorem computes the probability of one event given that the other has occurred, i.e. it reasons backward from an observed result to its source (hence "inverse probability").

From the Bayes formula a whole body of theory and methods has been developed; this branch of statistics is called Bayesian statistics, and its way of thinking is completely different from that of the frequentist school.

Frequentist school: the object of study is the event itself, so the researcher can only obtain results by repeated trials. For example, to compute the probability of a coin landing heads up, we must keep tossing the coin; as the number of tosses tends to infinity, the frequency of heads tends to the probability of heads.

Bayesian school: the object of study is the observer's view of the event, so prior knowledge and collected information can be used to describe it, and evidence is then used to confirm or revise it. Take the coin toss again: knowing that the coin is uniform, the observer assigns a 50% degree of belief (a probability distribution) to the next toss coming up heads or tails. This belief may simply come from experience with ordinary coins. He can then toss the coin, say, 1000 times and use the outcomes as evidence to validate, and if necessary revise, the prior belief (for instance, if the coin turns out to be made of inconsistent material). In short, that is the procedure.

An example

Suppose there are two boxes, each containing 100 balls. Box A holds 70 red balls and 30 green balls; box B holds 30 red balls and 70 green balls. One box is chosen at random, a ball is drawn from it, its color is recorded, and it is put back into the same box. This is repeated 12 times, and the record shows 8 red balls and 4 green balls. The question: what is the probability that the chosen box is box A?

At the start, the prior probability of having chosen box A or box B is 50% each, because the box was chosen at random. That is:

P(A) = 0.5, P(B) = 1 - P(A);

Then, given that the first ball drawn is red, we should use this information to update the prior probability that box A was chosen:

P(A | red 1) = P(red | A) × P(A) / (P(red | A) × P(A) + P(red | B) × P(B))

P(red | A): the probability of drawing a red ball from box A

P(red | B): the probability of drawing a red ball from box B

So, given one red ball, the prior probability that box A was chosen is revised to:

P(A | red 1) = 0.7 × 0.5 / (0.7 × 0.5 + 0.3 × 0.5) = 0.7

That is, after one red ball appears, the priors for boxes A and B are revised to:

P(A) = 0.7, P(B) = 1 - P(A) = 0.3;

Repeating this update for all 8 red balls (each increasing the probability) and then for the 4 green balls (each decreasing it), the final probability that box A was chosen comes out to about 96.7%.

Python code to solve this problem:

def bayesFunc(pIsBox1, pBox1, pBox2):
    # one Bayesian update: posterior probability of "box A" after seeing one ball
    return (pIsBox1 * pBox1) / ((pIsBox1 * pBox1) + (1 - pIsBox1) * pBox2)

def redGreenBallProblem():
    pIsBox1 = 0.5
    # update for the 8 red balls
    for i in range(1, 9):
        pIsBox1 = bayesFunc(pIsBox1, 0.7, 0.3)
        print("After red %d > in box A: %f" % (i, pIsBox1))
    # update for the 4 green balls
    for i in range(1, 5):
        pIsBox1 = bayesFunc(pIsBox1, 0.3, 0.7)
        print("After green %d > in box A: %f" % (i, pIsBox1))

redGreenBallProblem()

Results are as follows:

After red 1 > in box A: 0.700000
After red 2 > in box A: 0.844828
After red 3 > in box A: 0.927027
After red 4 > in box A: 0.967365
After red 5 > in box A: 0.985748
After red 6 > in box A: 0.993842
After red 7 > in box A: 0.997351
After red 8 > in box A: 0.998863
After green 1 > in box A: 0.997351
After green 2 > in box A: 0.993842
After green 3 > in box A: 0.985748
After green 4 > in box A: 0.967365

Clearly, each red ball increases the probability that box A was chosen, and each green ball decreases it.
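
Because the draws are independent given the box, the same posterior can also be computed in one step from the likelihood of 8 red and 4 green draws:

P(A | 8 red, 4 green) = 0.5 × 0.7^8 × 0.3^4 / (0.5 × 0.7^8 × 0.3^4 + 0.5 × 0.3^8 × 0.7^4) ≈ 0.967

which matches the 96.7% obtained by the sequential updates.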

III. The naive Bayes algorithm

Naive Bayes is a classification method based on Bayes' theorem together with the assumption of conditional independence between features: it computes the probability of each class given the features and chooses the class with the largest probability, so it is a probability-based machine learning classifier. Because the goal is to assign known class labels, it belongs to supervised learning. It is called "naive" because it assumes the features are mutually independent and do not affect one another; this assumption keeps the method simple and easy to implement, and it greatly reduces the difficulty of computing the probabilities.
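
Concretely, the conditional-independence assumption means the posterior factorizes over the features:

P(y | x1, x2, ..., xm) ∝ P(y) × P(x1 | y) × P(x2 | y) × ... × P(xm | y)

and we predict the class y that maximizes this product.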

  1. The naive Bayes algorithm proceeds as follows (a runnable sketch follows this list):

    1) Let x = {a1, a2, ..., am} be the item to classify, where each ai is a feature attribute of x.

    2) Let the set of categories be C = {y1, y2, ..., yn}.

    3) Using the Bayes formula, compute P(y1 | x), P(y2 | x), ..., P(yn | x).

    4) If P(yk | x) = max{P(y1 | x), P(y2 | x), ..., P(yn | x)}, then x belongs to category yk.

  2. Gaussian naive Bayes (typically used when the feature attributes are continuous)

    As the procedure above shows, plain naive Bayes applies the Bayes formula directly, without any modification.

    When computing the conditional probabilities, discrete features can be handled with the law of large numbers (using frequencies in place of probabilities). But how do we handle continuous features? Here we generally use Gaussian naive Bayes.

    In Gaussian naive Bayes, when a feature attribute is continuous and assumed to follow a Gaussian distribution, its conditional probability can be computed directly from the Gaussian density formula:

![](https://img2018.cnblogs.com/blog/743008/201908/743008-20190805191943676-1417644618.png)

    In this case we only need to compute, for each class, the mean and standard deviation of each feature.

  3. Multinomial naive Bayes (generally used when the feature attributes are discrete)

    In multinomial naive Bayes, each feature attribute is assumed to follow a multinomial distribution; for each class y the model has a parameter vector θy = (θy1, ..., θyn), where n is the number of feature attributes and θyi is the probability P(xi | y).

  4. Bernoulli naive Bayes (generally used when there are many missing values)

    Like the multinomial model, the Bernoulli model applies to discrete features. The difference is that in the Bernoulli model each feature can only take the values 1 and 0 (in text classification, for example, a feature is 1 if a word occurs in the document and 0 otherwise).
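
To make steps 1) to 4) and the Gaussian variant concrete, here is a minimal from-scratch sketch (not sklearn's implementation; the function names and the small 1e-9 variance floor are my own choices, added to avoid division by zero):

import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the class priors and the per-class feature means/variances."""
    classes = np.unique(y)
    params = {}
    for c in classes:
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),       # P(y = c)
            "mean": Xc.mean(axis=0),         # per-feature mean under class c
            "var": Xc.var(axis=0) + 1e-9,    # per-feature variance (floored)
        }
    return params

def predict_gaussian_nb(X, params):
    """Steps 3)-4): compute log P(y) + sum_i log P(x_i | y) and take the arg-max."""
    preds = []
    for x in X:
        best_class, best_score = None, -np.inf
        for c, p in params.items():
            log_likelihood = -0.5 * np.sum(
                np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"]
            )
            score = np.log(p["prior"]) + log_likelihood
            if score > best_score:
                best_class, best_score = c, score
        preds.append(best_class)
    return np.array(preds)

# tiny usage example on toy data
X = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 3.9], [3.2, 4.1]])
y = np.array([0, 0, 1, 1])
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(np.array([[1.1, 2.0], [3.1, 4.0]]), params))  # -> [0 1]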

IV. Naive Bayes in practice

sklearn provides three different types of naive Bayes:

Gaussian: for classification problems where the features are assumed to be normally distributed.

Multinomial: for discrete count-valued features. In the text classification problem mentioned above, for example, we look not only at whether a word appears in the text but also at how many times it appears. If the total number of words is n and a given word appears m times, it is a bit like rolling a die n times and having that word come up m times.

Bernoulli: the features take only the values 0 (does not appear) and 1 (appears).
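
As a quick sketch of the two discrete variants in sklearn (the count matrix below is made-up toy data, only to show the API):

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# toy word-count matrix: 4 documents x 3 vocabulary words, with 2 classes
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 3, 2]])
y = np.array([0, 0, 1, 1])

mnb = MultinomialNB().fit(X_counts, y)                   # uses the counts themselves
bnb = BernoulliNB().fit((X_counts > 0).astype(int), y)   # only presence/absence

print(mnb.predict([[1, 0, 0]]))   # -> [0]
print(bnb.predict([[0, 1, 1]]))   # -> [1]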

A minimal scikit-learn starter

Example 1: classifying the iris dataset

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn import datasets

iris = datasets.load_iris()
gnb = GaussianNB()
scores = cross_val_score(gnb, iris.data, iris.target, cv=10)  # 10-fold cross-validation
print("Accuracy:%.3f" % scores.mean())

Output: Accuracy: 0.953

"San Francisco crime categorical predictor" of Example 2 Kaggle game

Background: in San Francisco the crime rate was once quite high, and many people experienced everything from violent crime down to petty theft and scratched cars. The local police worked hard to summarize the data and find ways to bring crime down. One challenge: given the time and place of a crime, determine as early as possible what type of crime it is likely to be, so that police resources can be allocated. Eventually they simply dumped 12 years of San Francisco crime reports onto Kaggle and said, in effect, "have at it, and let's see who can predict the crime type first." The crime reports include the date, a description, the day of the week, the police district, the resolution, the address, GPS coordinates, and so on. There are many classifiers to choose from for a classification problem, but since we have just covered naive Bayes, we will use it for practice.

(1) First, let's look at the data

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

train = pd.read_csv('/Users/liuming/projects/Python/ML数据/Kaggle旧金山犯罪类型分类/train.csv', parse_dates=['Dates'])
test = pd.read_csv('/Users/liuming/projects/Python/ML数据/Kaggle旧金山犯罪类型分类/test.csv', parse_dates=['Dates'])
train


The meaning of each column, in order:

Dates: date and time
Category: crime type, e.g. Larceny/Theft.
Descript: a more detailed description of the crime
DayOfWeek: day of the week
PdDistrict: police district
Resolution: outcome, e.g. "arrested" or "fled"
Address: the block where it happened
X and Y: GPS coordinates
The data in train.csv spans 12 years and contains nearly 900,000 records. As the listing above also shows, most of the fields are categorical, e.g. the crime type and the day of the week.
(2) Feature preprocessing
The LabelEncoder class in sklearn.preprocessing assigns integer codes to categories; we use it to encode the crime type. pandas' get_dummies() converts a variable into binary 0/1 indicator vectors; we use it to factorize the "district", "day of week" and "hour" features.

# encode the crime category (Category) with LabelEncoder
leCrime = preprocessing.LabelEncoder()
crime = leCrime.fit_transform(train.Category)   # 39 crime types
# factorize day of week, district and hour with get_dummies
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = train.Dates.dt.hour
hour = pd.get_dummies(hour)
# combine the features
trainData = pd.concat([hour, days, district], axis=1)  # concatenate the features horizontally
trainData['crime'] = crime   # append the 'crime' column
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
hour = test.Dates.dt.hour
hour = pd.get_dummies(hour)
testData = pd.concat([hour, days, district], axis=1)
trainData

After preprocessing, the training-set features look as shown in the figure below:

(3) Modeling

from sklearn.naive_bayes import BernoulliNB
import time

features = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION',
 'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']
X_train, X_test, y_train, y_test = train_test_split(trainData[features], trainData['crime'], train_size=0.6)
NB = BernoulliNB()
nbStart = time.time()
NB.fit(X_train, y_train)
nbCostTime = time.time() - nbStart
#print(X_test.shape)
propa = NB.predict_proba(X_test)   # X_test is 263415 x 17; for each sample this gives the probability of each of the 39 crime types
print("Naive Bayes model fit in %.2f seconds" % (nbCostTime))
predicted = np.array(propa)
logLoss = log_loss(y_test, predicted)
print("Naive Bayes log loss: %.6f" % logLoss)

Output:
Naive Bayes model fit in 0.55 seconds
Naive Bayes log loss: 2.582561
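
For reference, the multi-class log loss computed above by sklearn.metrics.log_loss is the average negative log-probability assigned to the true class:

logloss = -(1/N) Σ_i Σ_j y_ij · log(p_ij)

where y_ij is 1 if sample i belongs to class j (and 0 otherwise), and p_ij is the predicted probability that sample i belongs to class j; lower is better.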

Example 3: text classification - spam filtering

Collect the data: text files are provided
Prepare the data: parse the text files into token vectors
Analyze the data: inspect the tokens to make sure the parsing is correct
Train the algorithm: use the trainNB0() function built earlier
Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over a document set
Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones to the screen

Preparing the data: tokenizing text

Split the text with a regular expression, using any character that is not a letter or digit as the delimiter.

import re
mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
regEx = re.compile(r'\W+')   # split on runs of non-word characters
listOfTokens = regEx.split(mySent)
# keep tokens of length greater than 0 and convert them to lowercase
[tok.lower() for tok in listOfTokens if len(tok) > 0]
[out]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']

Tokenizing an email:

emailText = open('email/ham/6.txt').read()
listOfTokens = regEx.split(emailText)
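
The cross-validation test below relies on createVocabList(), setOfWords2Vec(), trainNB0() and classifyNB() from an earlier note (they follow the versions in Machine Learning in Action). In case they are not at hand, here is a minimal sketch of what they are assumed to do:

import numpy as np

def createVocabList(dataSet):
    # union of all tokens across documents
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    # 0/1 vector: did each vocabulary word appear in the document?
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    # estimate P(word|class) with Laplace smoothing, plus the class prior P(spam)
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)   # log-probabilities avoid underflow
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # compare log P(x|c) + log P(c) for the two classes
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0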

Testing the algorithm: cross-validation with naive Bayes

import re
import random
import numpy as np

def textParse(bigString):
    '''
    Parse a string: split on any non-word character,
    keep tokens longer than 2 characters, convert to lowercase.
    '''
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    '''
    Run the naive Bayes classifier over the spam/ham email set.
    '''
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        # read the spam folder and build feature and label vectors
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        # read the ham folder and build feature and label vectors
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    # build the vocabulary list
    vocabList = createVocabList(docList)
    # initialize the training set and test set
    trainingSet = list(range(50))
    testSet = []
    # randomly move 10 indices into the test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    # build the training matrix
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    # train the naive Bayes model
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    # test the naive Bayes model
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print('classification error', docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))

Because spamTest() builds its training and test sets randomly, the classification result may differ from run to run. When an error occurs, the function prints the token list of the misclassified document, so you can see exactly which document went wrong. The errors seen here are spam emails misclassified as normal email.

spamTest()
[out]
classification error ['benoit', 'mandelbrot', '1924', '2010', 'benoit', 'mandelbrot', '1924', '2010', 'wilmott', 'team', 'benoit', 'mandelbrot', 'the', 'mathematician', 'the', 'father', 'fractal', 'mathematics', 'and', 'advocate', 'more', 'sophisticated', 'modelling', 'quantitative', 'finance', 'died', '14th', 'october', '2010', 'aged', 'wilmott', 'magazine', 'has', 'often', 'featured', 'mandelbrot', 'his', 'ideas', 'and', 'the', 'work', 'others', 'inspired', 'his', 'fundamental', 'insights', 'you', 'must', 'logged', 'view', 'these', 'articles', 'from', 'past', 'issues', 'wilmott', 'magazine']
the error rate is:  0.1

spamTest()
[out]
the error rate is:  0.0

