Bayesian text classification

 

Naive Bayes Classification

 

Bayesian classification

Bayesian classification is the collective name for a family of classification algorithms, all of which are based on Bayes' theorem; hence they are collectively called Bayesian classifiers.

Naive Bayes classification is the simplest and one of the most common methods in this family.

Overview of Classification Problems

Classification problems are familiar to everyone; in daily life we classify things all the time.

For example, when you see a person, your brain subconsciously judges whether he is a student or a working adult;

you may often say to a friend on the street something like "that person looks rich". This, too, is a classification operation.

Mathematical description of classification

Mathematically, a classification problem can be defined as follows: given a category set C = {y1, y2, ..., yn} and an item set I = {x1, x2, ..., xm},

determine a mapping rule y = f(x) such that every xi is mapped to exactly one yj with yj = f(xi).

C is called the category set, and each of its elements is a category; I is called the item set (feature set), and each of its

elements is an item to be classified; f is called the classifier. The task of a classification algorithm is to construct the classifier f.

A classification algorithm is given features and must produce a category; this is the key to every classification problem.

Naive Bayes Classification

Bayesian formula

Rewriting the expression

The problem we want to solve is: given certain feature values, what is the probability that the sample belongs to a particular category?
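Bayes' formula, rewritten for the classification setting, gives exactly this quantity:

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\qquad\Longrightarrow\qquad
P(\text{category} \mid \text{features}) =
  \frac{P(\text{features} \mid \text{category})\,P(\text{category})}{P(\text{features})}
```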

Example analysis

The given data are as follows (a table of 12 labeled samples, each recording whether the man is handsome, whether his personality is good, whether he is tall, and whether he is motivated, plus the label marry / not marry):

The question now is: a boy wants to propose to his girlfriend. His four characteristics are: not handsome, bad personality, short, and not motivated. Should the girl marry him or not?

Turned into a math problem, this means comparing p(marry | not handsome, bad personality, short, not motivated) with p(not marry | not handsome, bad personality, short, not motivated); whichever probability is higher gives the answer of marrying or not marrying.

Apply Naive Bayes formula


Why the Naive Bayes algorithm is called "naive"

So I only need to find

p(not handsome, bad personality, short, not motivated | marry)

p(not handsome, bad personality, short, not motivated)

p(marry)

Below, I compute each of these probabilities, which gives the final result.

Assume that p(not handsome, bad personality, short, not motivated | marry)

   = p(not handsome | marry) * p(bad personality | marry) * p(short | marry) * p(not motivated | marry)


This is where the word "naive" in Naive Bayes comes from: the algorithm assumes that the individual features are mutually independent, and under that assumption the equation above holds.

This assumption makes the Naive Bayes method simple, but it sometimes sacrifices some classification accuracy.


Rearranging the formula above gives the following:
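Under the independence assumption, the rearranged formula reads:

```latex
P(\text{marry} \mid \text{not handsome}, \text{bad personality}, \text{short}, \text{not motivated}) =
  \frac{P(\text{not handsome} \mid \text{marry}) \,
        P(\text{bad personality} \mid \text{marry}) \,
        P(\text{short} \mid \text{marry}) \,
        P(\text{not motivated} \mid \text{marry}) \,
        P(\text{marry})}
       {P(\text{not handsome}, \text{bad personality}, \text{short}, \text{not motivated})}
```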


Our task is to determine which of marrying and not marrying has the higher probability under the given features. Since the two formulas share the same denominator, we only need to compare the two numerators.

p(marry) = ?

First, count the "marry" samples in the training data:

p(marry) = 6 / 12 (12 samples in total) = 1/2

p(bad personality | marry) = ? Counting the samples that satisfy this condition:

p(bad personality | marry) = 1/6

p(short | marry) = ? Counting the samples that satisfy this condition:

p(short | marry) = 1/6

p(not motivated | marry) = ? Counting the samples that satisfy this condition:

p(not motivated | marry) = 1/6

Likewise, counting gives p(not handsome | marry) = 1/2. Putting these together, the numerator is

(1/2) * (1/6) * (1/6) * (1/6) * (1/2) / same denominator

In the same way, compute p(not marry | not handsome, bad personality, short, not motivated).

p(marry | not handsome, bad personality, short, not motivated)
= (1/2) * (1/6) * (1/6) * (1/6) * (1/2) / denominator
= (1/864) / denominator

p(not marry | not handsome, bad personality, short, not motivated)

= ((1/6) * (1/2) * 1 * (1/2) * (1/2)) / denominator
= (1/48) / denominator


Therefore p(not marry | not handsome, bad personality, short, not motivated) > p(marry | not handsome, bad personality, short, not motivated).

So, according to the Naive Bayes algorithm, the answer we give the girl is: do not marry!


How did the above formula come about?

Venn diagram


As in the Venn diagram above, the rectangle represents a sample space, i.e. all possible outcomes of a random experiment.

In statistics, probability is denoted by the symbol P; the probability that event A occurs is written P(A).

The probability that events A and B occur simultaneously is written P(A∩B), or simply P(AB): the overlap of the two circles.

The probability that at least one of A or B occurs is written P(A∪B): the region covered by circle A and circle B together.

The probability that A occurs given that B has occurred is written P(A|B); this is the conditional probability. Geometrically, it is the area of the overlap of A and B divided by the area of B.

basic concepts of statistics

  • P(A): the probability that event A occurs
  • P(AB) or P(A∩B): the probability that events A and B occur simultaneously
  • P(A|B): the probability that event A occurs given that event B has occurred

Conditional probability formula

From the Venn diagram: P(AB) = P(B)P(A|B) = P(A)P(B|A)

P(A|B) = P(AB)/P(B) = P(A)P(B|A)/P(B)

Conditional probability is the basis for understanding the total probability formula and Bayes' formula. One way to think about it:

if P(A|B) is greater than P(A), then the occurrence of B makes A more likely.

The most essential change in conditional probability is that the sample space shrinks, from the whole original sample space down to the sample space defined by the given condition.

total probability formula

    1. An event group B1, B2, ... satisfies: B1, B2, ... are pairwise mutually exclusive, i.e. Bi ∩ Bj = ∅ for i ≠ j, i, j = 1, 2, ...,
    and P(Bi) > 0, i = 1, 2, ...;

    2. B1 ∪ B2 ∪ ... = Ω. Then the event group B1, B2, ... is called a partition of the sample space Ω.

    Let B1, B2, ... be a partition of the sample space Ω and let A be any event. Then:

    P(A) = P(AB1) + P(AB2) + ... + P(ABn)

         = P(A|B1)P(B1) + P(A|B2)P(B2) + ... + P(A|Bn)P(Bn)

    This is the formula of total probability.


    The significance of the total probability formula: when P(A) is hard to compute directly but P(Bi) and P(A|Bi) (i = 1, 2, ...) are easy to compute,

    P(A) can be obtained via the total probability formula.

    The idea is to decompose event A into several smaller events, compute the probability of each small event, and add them up to get the probability of A. When splitting

    A, we do not split A directly; instead we first find a partition B1, B2, ..., Bn of the sample space Ω, so that event

    A is decomposed into the n parts AB1, AB2, ..., ABn,

    i.e. A = AB1 + AB2 + ... + ABn. Each Bi that occurs may lead to A occurring with probability P(A|Bi), so by the addition rule

    P(A) = P(AB1) + P(AB2) + ... + P(ABn)

         = P(A|B1)P(B1) + P(A|B2)P(B2) + ... + P(A|Bn)P(Bn)

The probability of event A conditioned on event B is generally not the same as the probability of event B conditioned on event A; however, there is

a definite relationship between the two, and Bayes' theorem is the statement of that relationship.

 

The formula itself is unremarkable: it is just a derivation that combines the definition of conditional probability with the total probability formula. What it expresses, however, is very profound.

The total probability formula reasons from "cause" to "effect".

Bayes' formula does exactly the opposite: it studies the causes behind an observed result, i.e. how likely it is that the result was produced by a particular cause. It reasons from "effect" back to "cause".
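Concretely, for a partition B1, B2, ..., Bn of the sample space Ω and an observed event A, Bayes' theorem reads:

```latex
P(B_i \mid A) = \frac{P(A \mid B_i)\, P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j)\, P(B_j)}
```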

Example 1

A transmitter sends the signals "∪" and "—" with probabilities 0.6 and 0.4 respectively. Because the communication channel is subject to interference, when "∪" is sent,

the receiver receives "∪" and "—" with probabilities 0.8 and 0.2; when "—" is sent, the receiver receives "—" and "∪" with probabilities 0.9 and 0.1.

Find the probability that "∪" was indeed sent, given that the receiver received "∪".


p(sent ∪ | received ∪) = p(sent ∪) * p(received ∪ | sent ∪) / (p(sent ∪) * p(received ∪ | sent ∪) + p(sent —) * p(received ∪ | sent —))

==> 0.6*0.8 / (0.6*0.8 + 0.4*0.1) ≈ 0.923

Example 2

A school has 60% boys and 40% girls. Boys always wear trousers; half of the girls wear trousers and half wear skirts.

Suppose you are walking on campus and a student wearing trousers walks toward you (unfortunately you are highly nearsighted: you can only see that the student is wearing trousers,

but cannot tell the student's gender). Can you work out the probability that this student is a boy?

p(boy | trousers) = p(boy) * p(trousers | boy) / p(trousers)
 = 0.6 * 1 / (0.6*1 + 0.4*0.5) = 0.6/0.8 = 0.75

Engineering application derivation process:

  • p(y=ck) = probability of class y = number of samples of class y / total number of samples

  • p(xi|y) = number of samples of class y that contain feature xi / number of samples of class y (see the counting sketch below)
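A minimal counting sketch of these two estimates, assuming the training samples are given as (keyword list, category) pairs; the function and variable names are illustrative only:

```python
from collections import Counter, defaultdict

def train_counts(samples):
    """Estimate p(y) and p(xi|y) by simple counting.

    samples: list of (features, label) pairs, e.g. (["rate", "bond", ...], "bonds").
    """
    class_count = Counter()               # number of samples per class y
    feature_count = defaultdict(Counter)  # per class: samples containing feature xi

    for features, label in samples:
        class_count[label] += 1
        for xi in set(features):          # count each feature at most once per sample
            feature_count[label][xi] += 1

    total = sum(class_count.values())
    p_y = {y: n / total for y, n in class_count.items()}          # p(y = ck)
    p_x_given_y = {
        y: {xi: n / class_count[y] for xi, n in counts.items()}   # p(xi | y)
        for y, counts in feature_count.items()
    }
    return p_y, p_x_given_y
```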

p(xi|y) under different models:

  • Gaussian model: when a feature is a continuous variable, using the multinomial model leads to many P(xi|yk) = 0 (without smoothing), and even with smoothing the resulting conditional probabilities do not describe the real situation well. Therefore, continuous feature variables should be handled with the Gaussian model.

  • Multinomial

    The multinomial model is used when the features are discrete. It applies smoothing when computing the prior probability P(yk) and the conditional probability P(xi|yk). The formulas are:

    P(yk) = (Nyk + α) / (N + kα)

    where N is the total number of samples, k is the total number of categories, Nyk is the number of samples of category yk, and α is the smoothing value.

    P(xi|yk) = (Nyk,xi + α) / (Nyk + nα)

    where Nyk is the number of samples of category yk, n is the feature dimension, Nyk,xi is the number of samples of category yk whose i-th feature takes the value xi, and α is the smoothing value.

    When α=1, it is called Laplace smoothing, when 0<α<1, it is called Lidstone smoothing, and when α=0, no smoothing is performed.

    If smoothing is not performed, when the value xi of a certain dimension feature does not appear in the training sample, it will lead to P(xi|yk)=0, resulting in a posterior probability of 0. Adding smoothing can overcome this problem.

  • Bernoulli

    Like the multinomial model, the Bernoulli model is suitable for discrete features. The difference is that in the Bernoulli model each feature value can only be 1 or 0 (taking text classification as an example: if a word appears in the document, its feature value is 1, otherwise it is 0).

    In the Bernoulli model, the conditional probability P(xi|yk) is computed as:

    when the feature value xi is 1, P(xi|yk) = P(xi=1|yk);

    when the feature value xi is 0, P(xi|yk) = 1 − P(xi=1|yk).
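A small sketch of the smoothed estimates described above; `N`, `N_yk`, `N_yk_xi`, `k`, `n_dims`, and `alpha` mirror the symbols in the formulas and are assumptions about how the counts would be stored:

```python
def smoothed_prior(N_yk, N, k, alpha=1.0):
    """Multinomial prior: P(yk) = (N_yk + alpha) / (N + k * alpha)."""
    return (N_yk + alpha) / (N + k * alpha)

def multinomial_conditional(N_yk_xi, N_yk, n_dims, alpha=1.0):
    """Multinomial conditional: P(xi|yk) = (N_yk_xi + alpha) / (N_yk + n_dims * alpha)."""
    return (N_yk_xi + alpha) / (N_yk + n_dims * alpha)

def bernoulli_conditional(xi, p_xi1_given_yk):
    """Bernoulli model: features are 0/1, so use P(xi=1|yk) or its complement."""
    return p_xi1_given_yk if xi == 1 else 1.0 - p_xi1_given_yk
```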

Data underflow:

To prevent floating-point underflow caused by multiplying many very small factors (multiplying too many tiny numbers yields 0, or leads to incorrect classification):

when computing the product p(w0|ci) * p(w1|ci) * p(w2|ci) ... p(wn|ci), most factors are very small,

so the program underflows or produces an incorrect answer. (Try multiplying many tiny numbers in Python: after rounding you end up with 0.)

One solution is to take the natural logarithm of the product. In algebra, ln(a * b) = ln(a) + ln(b),

so taking logarithms avoids errors caused by underflow or floating-point rounding. Working with natural logarithms loses nothing for classification.
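A tiny illustration of the log trick, assuming the conditional probabilities p(wi|ci) and the prior p(ci) have already been estimated:

```python
import math

def log_posterior(word_probs, prior):
    """Sum of logarithms instead of a product of probabilities, to avoid underflow.

    word_probs: list of p(wi|ci) values for the words of one document.
    prior: p(ci) for the candidate class.
    """
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# The raw product may underflow to 0.0, but the log scores still compare correctly:
# pick the class ci whose log_posterior(...) is largest.
```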

process

data training

  • Hold-out method

    The steps of the hold-out method are simple: the dataset D is split directly into two mutually exclusive sets, one used as the training set S and the other as the test set T. After training the model on S, T is used to evaluate the test error as an estimate of the generalization error. The train/test split should keep the data distribution as consistent as possible, to avoid the extra bias introduced by the splitting process affecting the final result.

    A drawback of the hold-out method is that there is no ideal way to choose the ratio between the training set S and the test set T. If S contains the vast majority of the samples, the trained model is closer to the model trained on all of D, but because T is small the evaluation may not be stable and accurate; if T contains more samples, the difference between S and D grows, and the evaluated model may differ considerably from the model trained on D, reducing the fidelity of the evaluation.

  • cross-validation

    The "cross-validation method" first divides the dataset D into k mutually exclusive subsets of similar size. Then, each time the union of k-1 subsets is used as the training set, and the remaining subset is used as the test set, as shown in the following figure,

  • Bootstrap method

    What we wish to evaluate is a model trained on D. In the hold-out and cross-validation methods, however, part of the samples is reserved for testing, so the training set actually used for evaluation is smaller than D,

    which inevitably introduces some estimation bias due to the different training-set size. Cross-validation is less affected by the training-set size, but its computational cost is high. Is there a way to reduce

    the impact of the different training-set sizes while keeping the experimental evaluation efficient?

    The "bootstrap method" is a good solution. Given a dataset D containing m samples, we sample from it to generate a dataset D': each time, a sample is randomly picked from D and a copy of it is placed

    into D', and the sample is then put back into D, so it may be drawn again in later rounds; after repeating this process m times, we obtain a dataset D' containing m samples,

    which is the result of bootstrap sampling. We use D' as the training set and D \ D' (set subtraction) as the test set.

    Bootstrapping is useful when the dataset is small and it is hard to split it effectively into training and test sets; in addition, bootstrapping can generate multiple different training sets from the initial dataset, which benefits methods such as ensemble learning.

    However, the data produced by bootstrapping changes the distribution of the initial dataset, which introduces estimation bias. Therefore, when there is enough data, the hold-out and cross-validation methods are used more commonly.
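A rough standard-library sketch of the hold-out split and the bootstrap sampling just described; the 80/20 ratio and the fixed seed are illustrative assumptions:

```python
import random

def holdout_split(dataset, train_ratio=0.8, seed=42):
    """Hold-out method: split D into two mutually exclusive sets S (train) and T (test)."""
    data = dataset[:]                         # copy so the caller's list stays intact
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]

def bootstrap_sample(dataset, seed=42):
    """Bootstrap: draw m samples with replacement as D'; test on the never-drawn items."""
    rng = random.Random(seed)
    m = len(dataset)
    picked = [rng.randrange(m) for _ in range(m)]
    picked_set = set(picked)
    train = [dataset[i] for i in picked]                          # D'
    test = [dataset[i] for i in range(m) if i not in picked_set]  # D minus D'
    return train, test
```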

evaluate

Precision (correct rate)

Precision is defined with respect to the prediction results: it measures how many of the samples predicted as positive are truly positive. A sample predicted as positive

can arise in two ways: a positive sample predicted as positive, or a negative sample predicted as positive.

Precision = number of correct items retrieved / total number of items retrieved

Recall

Recall is defined with respect to the original samples: it measures how many of the positive samples were predicted correctly. There are likewise two

cases: a positive sample predicted as positive, or a positive sample predicted as negative.

Recall = number of correct items retrieved / number of correct items in the sample

F1

An evaluation metric that combines the two, reflecting overall performance:
F1 = precision * recall * 2 / (precision + recall)

example

Consider the following example:

A pond contains 1400 carp, 300 shrimp, and 300 turtles. Our goal is to catch carp.

We cast a big net and catch 700 carp, 200 shrimp, and 100 turtles. The metrics are then:

Precision = 700 / (700 + 200 + 100) = 70%

Recall = 700 / 1400 = 50%

F1 = 70% * 50% * 2 / (70% + 50%) = 58.3%

If instead we caught every carp, shrimp, and turtle in the pond, how would the metrics change?

Precision = 1400 / (1400 + 300 + 300) = 70%

Recall = 1400 / 1400 = 100%

F1 = 70% * 100% * 2 / (70% + 100%) = 82.35%

As we can see, precision measures the proportion of target results among everything that was captured; recall, as the name suggests, is the proportion of the target class that was recalled from the domain of interest;

and F1 is a metric that combines the two, reflecting overall performance.

Precision: P = TP/(TP+FP), in plain words, "how many of your predictions are correct".

Recall: R = TP/(TP+FN), in plain words, "how much of the positive class your predictions cover".

True positives TP = 700  [I wanted fish and got fish]

False positives FP = 300  [I wanted fish; everything else given to me counts as a false positive: 200 + 100]

False negatives FN = 700  [they were fish but were not given to me as fish: 1400 - 700]

True negatives TN = 300  [not fish, and correctly not given to me as fish: 300 + 300 - 200 - 100]

Precision = TP/(TP+FP) = 700/(700+300)

Recall = TP/(TP+FN) = 700/(700+700)
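The fishing example can be checked with a small helper; the TP/FP/FN counts are taken directly from the numbers above:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# First net: TP = 700 carp, FP = 200 shrimp + 100 turtles, FN = 700 carp left behind
print(precision_recall_f1(700, 300, 700))   # (0.7, 0.5, 0.5833...)

# Catching everything in the pond: TP = 1400, FP = 600, FN = 0
print(precision_recall_f1(1400, 600, 0))    # (0.7, 1.0, 0.8235...)
```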

Hands-on project: automatic article classification

  • Stage 1: obtaining sample data

    The crawler system assigns a category when it crawls articles of a specific category.

    There are currently 15 categories: stocks/foreign exchange/futures/gold/funds/technology/digital currency/collection/insurance/P2P/financial management/trust/bonds/banks/real estate

    The collected sample contains 5521 items: a training set of 4416 and test data of 1105.

  • data training

    Test data collection method: hold-out method, 80% training / 20% testing

    Precision: about 83% when taking the top two predicted categories, about 72% when taking only the top one

    Recall: -- not yet implemented

    F1: -- not yet implemented

    Sample set: pulled incrementally once a month; the data is synchronized incrementally, and once the evaluation shows no problems the model is put into production

    Training output: the probability of each word under each category P(xi|yj) and the category probability P(y)

  • Use of Classification

    Input: [article content]

    1. Extract keywords

    2. Accumulate the P(xi|yj) probabilities plus P(yj) (see the sketch at the end of this section)

    Output: [a list of all categories the article may belong to, with their probability values]

  • Continuous optimization and problems

    1. Improve the accuracy of sample classification / uniform distribution of data / sufficient number (very important premise)

    2. Increase the weight of keywords

    3. Refine the stop_word list; remove irrelevant words and words with very high frequency

    4. During training, extract a different number of keywords depending on article length: <1000 characters 10 tags, 1000~3000 characters 15 tags, >3000 characters 20 tags

    5. Articles with very little content are ignored and do not enter the training sample; currently articles under 300 characters are ignored

    6. Lowercase keywords to prevent case mismatch

  • Open questions

    1. In the underflow prevention described above, log2(x) is negative for x in (0,1), so the accumulated sum keeps getting smaller and smaller.

      The current workaround is to shift the curve and use log2(1+x) so that the values stay positive

2. Words that did not appear during training are not included in the computation at classification time, so it is possible that no category can be determined

3. Everything is hand-written; no open-source framework is used

4. The sample data is currently quite noisy
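A sketch of the classification step referenced in "Use of Classification" above: accumulate log probabilities over the extracted keywords and return a ranked list of categories. The model structure (`p_y`, `p_x_given_y`) follows the earlier counting sketch and is an assumption, not the project's actual code:

```python
import math

def classify(keywords, p_y, p_x_given_y, default_prob=1e-8):
    """Score every category for one article and return categories ranked by score.

    keywords: keywords extracted from the article content.
    p_y / p_x_given_y: priors and conditional probabilities produced by training.
    default_prob: fallback for words never seen with a category (an assumption;
                  with Laplace smoothing this case would already be covered).
    """
    scores = {}
    for y, prior in p_y.items():
        score = math.log(prior)                                      # log P(yj)
        for w in keywords:
            score += math.log(p_x_given_y[y].get(w, default_prob))   # log P(xi|yj)
        scores[y] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```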

Summary


  • Naive Bayes

    The Naive Bayes algorithm is a supervised classification algorithm built on the assumption that the feature values are mutually independent; this is also

    why it is called the "naive" Bayes algorithm.

    Bayes' formula + feature conditional-independence assumption = Naive Bayes
    
  • Procedure

    1. Prepare data / collect samples

    2. Train: compute p(yi) / p(wi|yi) with the Gaussian / Bernoulli / multinomial model

    3. Test and validate: hold-out / cross-validation / bootstrap; precision / recall / F1

    4. Use: simple accumulation (of log probabilities)
    
  • optimization

    1. Ensure the sample data is accurate and evenly distributed

    2. Prevent data underflow: take logarithms

    3. Prevent zero probabilities: Laplace smoothing

    4. Maintain the stop_word list

    5. Normalize the case of English words

    6. Extract a different number of keywords for articles of different lengths
    

 
