Machine Learning - Naive Bayesian Classification

1. The Bayes Principle

1.1 The background of Bayesian principle:

The Bayes principle was proposed by the British mathematician Thomas Bayes. He wrote a paper on inductive reasoning that directly influenced statistics for the next two centuries and is regarded as one of the famous papers in the history of science.

The Bayes principle comes from an article Bayes wrote to solve a problem called "inverse probability": how to make a mathematically sound guess when there is not much reliable evidence.

Inverse probability, understood literally, is the opposite of "forward probability".

Forward probability is easy to understand. For example, if you know that there are N balls in a bag, each either black or white, and M of them are black, then you know the probability of drawing a black ball when you reach in and take one. But this is a God's-eye view: we make the judgment only after knowing the whole picture. In real life it is usually difficult to know the whole picture. Bayes started from the practical situation and asked: if we do not know the ratio of black to white balls in the bag in advance, can we infer that ratio from the colors of the balls we draw? Another example: if you see that a person is always spending money, you will infer that the person is probably rich. Of course, this is not absolute. In other words, when you cannot observe the nature of a thing directly, you can rely on events related to it to make a judgment: the more often the related events occur, the more likely the underlying property is to hold.

1.2 Concepts and calculations involved in Bayesian principle:

This actually involves several concepts in Bayesian principle:

①Prior probability: the probability of an event judged from experience, before any new evidence is seen. The prior probability of X is written P(X), and the prior probability of Y is written P(Y).

②Posterior probability: the probability of a cause, inferred after the result has been observed.

③Conditional probability : the probability of event A occurring under the condition that another event B has occurred, expressed as P(A|B), read as "the probability of A occurring under the condition that B occurs".

④Likelihood function: the training process of a probabilistic model can be understood as parameter estimation, and the likelihood function measures how well a set of parameters explains the observed data; it is a function of the statistical parameters.

If there are two events A and B, and we know the probability P(B|A) that B occurs given that A has occurred, as well as the probabilities P(A) and P(B), then by Bayes' theorem we can calculate the probability P(A|B) that A occurs given that B has occurred. The formula is:

P(A|B) = P(B|A) · P(A) / P(B)

Expressed in another form:

P(A|B) = P(A∩B) / P(B), that is, the conditional probability equals the joint probability of A and B divided by the probability of B.

Example:

Suppose A is the event "tested positive", B1 is "has Bayes disease" (a fictitious illness used for this example), and B2 is "does not have Bayes disease".

Known: if a person has Bayes disease, the probability of testing positive is P(A|B1) = 99.9%; if a person does not have it, the probability of testing positive is P(A|B2) = 0.1%. In the population, the probability that a person has Bayes disease is P(B1) = 0.01%, and the probability of not having it is P(B2) = 99.99%.

The probability that a person both has Bayes disease and tests positive is the joint probability P(B1∩A), which can be computed as:

P(B1∩A) = P(A|B1) · P(B1) = 99.9% × 0.01%

which gives a probability of 0.00999%.

The probability that a person has Bayes disease given that the test is positive is P(B1|A), which can be obtained from:

P(B1|A) = P(A|B1) · P(B1) / [P(A|B1) · P(B1) + P(A|B2) · P(B2)] = 0.00999% / (0.00999% + 0.1% × 99.99%)

which gives a probability of about 9%.

That is: posterior probability = (conditional probability × prior probability) / total probability. In fact, the Bayes principle is precisely a method for solving for the posterior probability.
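The calculation above can be reproduced in a few lines of Python (a minimal sketch using only the probabilities assumed in the example):

# Fictitious "Bayes disease" example from the text
p_b1 = 0.0001         # prior: P(has the disease)
p_b2 = 0.9999         # prior: P(does not have the disease)
p_a_given_b1 = 0.999  # P(positive | disease)
p_a_given_b2 = 0.001  # P(positive | no disease)

# Joint probability of having the disease and testing positive
p_b1_and_a = p_a_given_b1 * p_b1                 # ≈ 0.0000999, i.e. 0.00999%

# Total probability of testing positive
p_a = p_a_given_b1 * p_b1 + p_a_given_b2 * p_b2

# Posterior: probability of having the disease given a positive test
print(p_b1_and_a, p_b1_and_a / p_a)              # ≈ 0.0000999, ≈ 0.09 (about 9%)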

The general Bayes formula is:

P(Bi|A) = P(A|Bi) · P(Bi) / Σj P(A|Bj) · P(Bj)

1.3 Bayesian classification:

A Bayes classifier applies this formula to classification: for a sample with feature vector x and a set of candidate categories, it computes the posterior probability P(c|x) = P(x|c) · P(c) / P(x) for each category c and assigns the sample to the category with the largest posterior.

2. Naive Bayes Classification

2.1 Principle introduction:

Naive Bayes is a simple yet extremely powerful predictive modeling algorithm. It is called "naive" because it assumes that the input variables are independent of one another. This is a strong assumption that real data rarely satisfies, yet the technique is still very effective for the vast majority of complex problems. Naive Bayes classification is a supervised classification algorithm that can perform binary or multi-class classification.

Naive Bayesian models consist of two types of probabilities:

①The probability P(Cj) of each category;

②the conditional probability P(Ai|Cj) of each attribute given the category.

Example: when we see a Black person on the street and are asked where he is from, nine times out of ten we will guess Africa, because that category has the highest probability. Of course, he may also be American or Asian, but in the absence of other available information we choose the category with the highest conditional probability. This is the basic idea behind Naive Bayes.

Naive Bayes is based on Bayes' theorem.

2.2 Mathematical expressions:

Under the independence assumption, the posterior probability of category c for a sample with features x = (x1, x2, …, xn) is

P(c|x) = P(c) · P(x1|c) · P(x2|c) · … · P(xn|c) / P(x)

Since P(x) is the same for every category, comparing the posteriors amounts to comparing P(c) · ∏ P(xi|c), and the category of the sample can be determined by choosing the c that maximizes this quantity. The same procedure is applied to every sample, so that each sample is assigned the category with the largest posterior probability.

2.3 Algorithm steps:

Step1: Find all the feature attributes featList that appear in the samples, and compute the probability Pi of each category.

Step2: Convert each sample's feature vector into a 0/1 vector of the same length as featList, where 1 means the corresponding attribute in featList appears in the sample and 0 means it does not.

Step3: For each category, compute the conditional probability of each feature.

Step4: For each sample, compute the probability of each category from the feature attributes that appear in it, and select the category with the largest probability as the prediction for that sample.

Step5: Classify all samples in the data set and compute the prediction accuracy.

The training process estimates the class prior probability P(c) from the training set D and estimates the conditional probability of each attribute for each class.

In practical applications, different models are used depending on whether the feature variables are discrete or continuous, and the parameters are estimated by maximum likelihood estimation; a from-scratch sketch of the five steps above is given below.
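To make the five steps concrete, here is a minimal from-scratch sketch in Python of a Bernoulli-style word-presence classifier. The toy data and helper names are invented for illustration; only featList comes from the steps above:

import numpy as np

# Toy word-presence data (hypothetical example, not from the article)
samples = [["cheap", "offer", "click"], ["meeting", "schedule"],
           ["cheap", "click", "now"], ["project", "schedule", "review"]]
labels = [1, 0, 1, 0]  # 1 = spam-like, 0 = normal

# Step 1: collect all feature attributes (featList) and compute each class prior
featList = sorted(set(w for s in samples for w in s))
classes = sorted(set(labels))
priors = {c: labels.count(c) / len(labels) for c in classes}

# Step 2: convert a sample into a 0/1 vector of the same length as featList
def to_vector(sample):
    return np.array([1 if feat in sample else 0 for feat in featList])

X = np.array([to_vector(s) for s in samples])
y = np.array(labels)

# Step 3: per-class conditional probability of each feature, with add-one smoothing
cond = {c: (X[y == c].sum(axis=0) + 1) / (len(X[y == c]) + 2) for c in classes}

# Step 4: compute each class's probability for a sample and keep the largest
def predict(sample):
    v = to_vector(sample)
    scores = {c: priors[c] * np.prod(np.where(v == 1, cond[c], 1 - cond[c]))
              for c in classes}
    return max(scores, key=scores.get)

# Step 5: classify all samples and compute the prediction accuracy
predictions = np.array([predict(s) for s in samples])
print("accuracy:", np.mean(predictions == y))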

2.4 Commonly used models and applications of Naive Bayes:
①Gaussian model: handles the case where the features are continuous variables.
②Multinomial model: the most common, requires the features to be discrete data (e.g. counts).
③Bernoulli model: requires the features to be discrete and Boolean, i.e. true/false or 1/0.
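These three variants correspond to GaussianNB, MultinomialNB, and BernoulliNB in scikit-learn. A minimal usage sketch on made-up toy data (the numbers are purely illustrative):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Gaussian model: continuous features (e.g. measurements)
X_cont = np.array([[1.2, 3.4], [0.8, 2.9], [5.1, 7.6], [4.9, 8.0]])
print(GaussianNB().fit(X_cont, y).predict([[5.0, 7.5]]))

# Multinomial model: discrete counts (e.g. word frequencies)
X_counts = np.array([[2, 0, 1], [3, 0, 0], [0, 4, 2], [0, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[0, 2, 1]]))

# Bernoulli model: Boolean 0/1 features (e.g. word presence)
X_bool = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
print(BernoulliNB().fit(X_bool, y).predict([[0, 1, 1]]))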

In some cases, because sample data are scarce, a particular feature value may never appear together with a particular category in the training set. The estimated probability would then be zero, which is not reasonable: we cannot assume an event is impossible just because it has not been observed. To avoid this, some "pseudo" samples are added to the counts, i.e. the estimates are smoothed, which is called Laplace smoothing.

If a feature is a continuous variable, we instead need to assume a form for the conditional probability distribution P(xi|c) and estimate its parameters from the training set. A common assumption is that, given the category, each continuous feature follows a normal distribution whose mean and standard deviation are learned from the training set. Such a model is called a Gaussian Naive Bayes classifier.
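As an illustration (the counts below are invented, not from the article), add-one smoothing turns a zero count into a small non-zero probability; in scikit-learn the alpha parameter of MultinomialNB plays this role:

# Laplace (add-one) smoothing: (count + 1) / (N + K), where N is the number of
# training samples in the class and K is the number of possible feature values.
# With count = 0, N = 100, K = 2 the smoothed estimate is 1/102 ≈ 0.0098 instead of 0.
count, N, K = 0, 100, 2
print((count + 1) / (N + K))

# In scikit-learn, MultinomialNB's alpha parameter controls this smoothing;
# alpha=1.0 corresponds to classic Laplace smoothing.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB(alpha=1.0)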

Multinomial Naive Bayes is usually used for text classification, where the features are the counts or frequencies of the words appearing in the text to be classified.

2.5 Why is Naive Bayes reliable?

Because of the strong conditional-independence assumption, Naive Bayes classifiers are quite simple, yet they still work remarkably well. What is the reason? First, before using any classifier, the first (and most important) step is usually feature selection, whose purpose is to remove collinearity between features and choose relatively independent ones. Second, assuming the features are mutually independent acts as an implicit regularization: ignoring the correlations between variables effectively reduces the variance of the Naive Bayes classifier. Although this may increase its bias, if the bias does not change the ordering of the posterior probabilities it has little effect on the classification result. For these reasons, Naive Bayes classifiers tend to achieve very good results in practice; Hand and Yu (2001) demonstrated this on a large amount of real data.

2.6 Advantages and disadvantages of Naive Bayes :

Advantages:

    (1) The Naive Bayes model originated in classical mathematical theory and has stable classification performance.

    (2) It performs well on small-scale data, can handle multi-class tasks, and is suitable for incremental training; in particular, when the data set is too large to fit in memory, it can be trained incrementally in batches.

    (3) It is not very sensitive to missing data, and the algorithm is relatively simple; it is often used for text classification.

Disadvantages:

    (1) In theory, the Naive Bayes model has the smallest error rate compared with other classification methods, but this is not always the case in practice, because the model assumes that the attributes are mutually independent given the output category, an assumption that often does not hold in real applications. When the number of attributes is large or the attributes are strongly correlated, the classification performance degrades; Naive Bayes performs best when the attribute correlations are small. For this reason there are moderately improved algorithms, such as semi-naive Bayes, that take partial dependencies into account.

    (2) The prior probability must be known, and it often depends on assumptions; since many different prior models can be assumed, prediction can sometimes be poor because of a badly chosen prior.

    (3) Because the classification is decided through the posterior probability, which is determined by the prior and the data, there is a certain error rate in the classification decision.

    (4) It is quite sensitive to the way the input data are represented.

3. Differences Between the Three

The Bayes principle, Bayes classification, and Naive Bayes are related but different concepts.

The Bayes principle is the broadest concept: it solves the problem of "inverse probability" in probability theory. Based on this theory, people designed Bayes classifiers, and Naive Bayes classification is one of the simplest and most commonly used Bayes classifiers. Naive Bayes is "simple" because it assumes the attributes are mutually independent, which constrains how well it matches real situations; if the attributes are correlated, the classification accuracy drops. Fortunately, in most cases Naive Bayes still classifies well.

  Their relationship is shown in Figure 1:

 

Figure 1: Relationship Diagram

4. Practical Application Scenarios

The Naive Bayes algorithm plays an important role in text recognition and image recognition: an unknown text or image can be assigned to a category according to existing classification rules. Naive Bayes is widely used in real life, for example in text classification, spam filtering, credit scoring, and phishing-website detection.

Application scenario 4.1-------English news classification:

The algorithm flow chart is shown in Figure 2 below:

 

Figure 2: English News Classification Flowchart

The steps are as follows:

1. Import the data set from sklearn.datasets (the classic 20-category news text data set, the 20 Newsgroups corpus, with about 20,000 news articles):

It can be seen that 18846 news articles were downloaded this time.

Output the first news, as shown in Figure 3:

 

Figure 3: First News

2. Split the data, randomly sampling part of it for testing the trained model.

3. Perform feature extraction with the CountVectorizer module in sklearn.feature_extraction.text.

Because these text data have no numerical scale and no explicit features, they must first be converted into vectors.

4. Import the Naive Bayes classification model (MultinomialNB) from sklearn.naive_bayes, estimate its parameters on the training data, and then make predictions on the vectorized test data.

5. Evaluate the classification using the accuracy, recall, precision, and F1 metrics:

The evaluation is shown in Figure 4:

 

Figure 4: Classification evaluation results

As can be seen from the figure above, the Naive Bayes model classifies the 4712 test news samples with an accuracy of 83.98%, and the average precision (0.86), recall (0.84), and F1 score (0.82) are all above 0.8, so the result is quite good.

Because of its strong assumption of conditional independence between features, the amount of computation is greatly reduced, along with resource consumption and time overhead, which is why Naive Bayes is widely used for massive text classification. In theory, the Naive Bayes model has the smallest error rate compared with other classification methods, but this is not always the case in practice, because the independence assumption often does not hold; when the number of attributes is large or the attributes are strongly correlated, the classification performance suffers.

Application scenario 4.2-------Spam filtering:

The flowchart is shown in Figure 5:

 

Figure 5: Flow chart of spam processing

The steps are as follows:

1. Get the dataset:

Normal e-mails are stored in ham_data.txt and spam e-mails in spam_data.txt; normal data are labeled 1, spam data are labeled 0, and the stop words are stored in stop_words.utf8.

It can be seen that the total number of samples is 10001; the content of one of the e-mails is shown in Figure 6:

 

Figure 6: Email message

2. Remove empty data.

3. Divide the training set and test set.

4. Standardize the data set: perform word segmentation and remove special symbols and stop words.

5. Text feature extraction:

The TfidfVectorizer() function is used; TfidfVectorizer performs feature extraction based on TF-IDF.

TF-IDF (term frequency and inverse document frequency, IDF):

If a word is relatively rare overall but appears many times in a given document, it probably reflects what that document is about, which is exactly the kind of keyword we need.

Term frequency (TF) = number of times a word appears in the document / total number of words in the document

Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1)); the larger this value, the rarer the word.

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

Therefore, keywords can be extracted effectively by computing TF-IDF.
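For instance (a made-up toy calculation, not from the article): if the word "rebate" appears 5 times in a 100-word e-mail, TF = 5/100 = 0.05; if the corpus contains 1,000 e-mails and 9 of them contain "rebate", IDF = log(1000 / (9 + 1)) = log(100) = 2 (using base-10 logarithms), so TF-IDF = 0.05 × 2 = 0.1. A very common word that appears in nearly every e-mail would have an IDF close to 0 and therefore a TF-IDF close to 0, even if its TF is high.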

6. Import the Bayes model and use multinomial Naive Bayes for classification.

7. Evaluate the model:

The evaluation result is shown in Figure 7:

 

Figure 7: Model Evaluation

The accuracy reaches 0.79, so the classification result is reasonably good.

Application scenario 4.3-------Chinese news classification:

The simplified flowchart is shown in Figure 8:

 

Figure 8: Chinese News Classification Flowchart

The steps are as follows:

1. Import the news data set and extract the category, theme, URL, content, and other information:

Print the first five news items (including category, theme, URL, content, and other information); the result is shown in Figure 8:

 

Figure 8: Chinese news information

2. Use the jieba library tokenizer for word segmentation.

3. Read the stop words list, remove stop words, and perform data cleaning:

Stop words include punctuation and symbols such as ! " # $ % & ' ( ) * + , - . /, digits such as 0 1 2 3 4 5, and common filler words such as "reporter", "year", "month", "day", "hour", "minute", and "second".

These words appear very frequently in news text but carry little useful information, so they need to be removed. The stop words to be cleaned up are stored in stopwords.txt.

Figure 9 shows the content before the stop words are removed, and Figure 10 shows the content after they are removed.

Figure 9: Content before removing stop words

 

Figure 10: Content after removing stop words

4. Divide the training set and test set for the news content:

Among them, the classification labels are label_mapping = {"Car": 1, "Finance": 2, "Technology": 3, "Health": 4, "Sports": 5, "Education": 6, "Culture": 7, "Military": 8, "Entertainment": 9, "Fashion": 0}

5. Extract text features from the training set:

Either CountVectorizer() (which computes word counts) or TfidfVectorizer() (which computes TF-IDF weights) can be used. In this application, TfidfVectorizer was found to work better, so it is used to convert the raw text into a TF-IDF feature matrix.

6. Perform Bayesian classification.

7. Output classification accuracy.

Model evaluation is shown in Figure 11

 

Figure 11: Model Evaluation

Analysis:

The classification accuracy reaches 0.8152, which shows that the method works well and has real application value. It can be seen that even a very simple algorithm can achieve unexpectedly good results as long as it is used sensibly and trained with a large amount of high-dimensional data.

Application 1. Code:


English news classification code; the data set is fetched from sklearn.datasets.

from sklearn.datasets import fetch_20newsgroups
news_data = fetch_20newsgroups(subset = 'all')
print("Number of news articles downloaded this time:",len(news_data.data))
print("Number of characters in the first article:",len(news_data.data[0]))
print(news_data.data[0])

x = news_data.data
y = news_data.target
from sklearn.model_selection import train_test_split
#from sklearn.cross_validation import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=33)

from sklearn.feature_extraction.text import CountVectorizer
# Vectorize the text data
vec = CountVectorizer()
x_train=vec.fit_transform(x_train)
x_test=vec.transform(x_test)

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB() # initialize the model
model.fit(x_train,y_train) # train the model by calling fit
y_predict = model.predict(x_test) # predict with the predict function


from sklearn.metrics import classification_report
print("Model score: %.2f" % (float(model.score(x_test,y_test))*100))
print(classification_report(y_test,y_predict,target_names= news_data.target_names))

Application 2. Code:

Spam filtering code; the data sets are ham_data.txt, spam_data.txt, and stop_words.utf8.

import numpy as np
from sklearn.model_selection import train_test_split

# Function to load the data
def get_data():
    with open("E://laji//ham_data.txt", encoding="utf8") as ham_f, open("E://laji//spam_data.txt", encoding="utf8") as spam_f:
        # normal e-mail data
        ham_data = ham_f.readlines()
        # spam e-mail data
        spam_data = spam_f.readlines()
        # normal data are labeled 1
        ham_label = np.ones(len(ham_data)).tolist()  # convert the array to a list
        # spam data are labeled 0
        spam_label = np.zeros(len(spam_data)).tolist()
        # combined data set
        corpus = ham_data + spam_data
        # combined labels
        labels = ham_label + spam_label
    return corpus, labels

# Function to split the data into training and test sets
def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    train_X, test_X, train_Y, test_Y = train_test_split(corpus, labels,
                                                        test_size=test_data_proportion, random_state=42)
    return train_X, test_X, train_Y, test_Y

# Function to remove empty documents
def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for doc, label in zip(corpus, labels):
        if doc.strip():
            filtered_corpus.append(doc)
            filtered_labels.append(label)

    return filtered_corpus, filtered_labels





# Load the data set
corpus, labels = get_data()
print("Total number of samples:", len(labels))
corpus, labels = remove_empty_docs(corpus, labels)
print('One of the samples:', corpus[10])
print('Label of that sample:', labels[10])
label_name_map = ["spam e-mail", "normal e-mail"]
print('Actual class:', label_name_map[int(labels[10])], label_name_map[int(labels[5900])])
# Split the data
train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus,
                                                                        labels,
                                                                        test_data_proportion=0.3)



# Normalization (text cleaning) functions
import re
import string
import jieba
# Load the stop-word list
with open("E://laji//stop_words.utf8", encoding="utf8") as f:
    stopword_list = f.readlines()
# Tokenize
def tokenize_text(text):
    tokens = jieba.cut(text)
    tokens = [token.strip() for token in tokens]
    return tokens
# Remove special characters
def remove_special_characters(text):
    tokens = tokenize_text(text)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))  # pattern matching punctuation characters
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
# Remove stop words
def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ''.join(filtered_tokens)
    return filtered_text
# Normalize the corpus
def normalize_corpus(corpus, tokenize=False):
    # list used to store the normalized documents
    normalized_corpus = []
    for text in corpus:
        # remove special characters
        text = remove_special_characters(text)
        # remove stop words
        text = remove_stopwords(text)
        normalized_corpus.append(text)
        if tokenize:
            text = tokenize_text(text)
            normalized_corpus.append(text)
    return normalized_corpus


# Normalize the training and test corpora
norm_train_corpus = normalize_corpus(train_corpus)
norm_test_corpus = normalize_corpus(test_corpus)


# Feature extraction


from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = TfidfVectorizer(min_df=1,
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

# Extract TF-IDF features from the normalized data
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)


# Import the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()


from sklearn import  metrics

# Function to report model performance metrics
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels,
                               predicted_labels),
        2))
    print('Precision:', np.round(
        metrics.precision_score(true_labels,
                                predicted_labels,
                                average='weighted'),
        2))
    print('Recall:', np.round(
        metrics.recall_score(true_labels,
                             predicted_labels,
                             average='weighted'),
        2))
    print('F1 score:', np.round(
        metrics.f1_score(true_labels,
                         predicted_labels,
                         average='weighted'),
        2))


# Train/predict/evaluate helper; the benefit is that any classifier can be plugged in
def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # fit the model
    classifier.fit(train_features, train_labels)
    # predict with the fitted model
    predictions = classifier.predict(test_features)
    # evaluate prediction performance
    get_metrics(true_labels=test_labels,
                predicted_labels=predictions)
    return predictions


# Multinomial Naive Bayes model based on TF-IDF features
print("Naive Bayes model based on TF-IDF features")
mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                                     train_features=tfidf_train_features,
                                                     train_labels=train_labels,
                                                     test_features=tfidf_test_features,
                                                     test_labels=test_labels)

Application 3. Code:

# Chinese news classification code; the data sets are val.txt and stopwords.txt


import pandas as pd
import numpy as np
import jieba

# Load the data
df_news = pd.read_table('E://xinwen//val.txt',names=['category','theme','URL','content'],encoding='utf-8')
df_news = df_news.dropna()  # drop records with missing data
df_news.head()              # view the first 5 rows

# Tokenize with the jieba tokenizer
# First convert the content column to a list
content = df_news.content.values.tolist()
# Print the fifth news item before word segmentation
print (content[4])
content_S = []
# Segment each news item by content
for line in content:
    current_segment = jieba.lcut(line)
    if len(current_segment) > 1 and current_segment != '\r\n': 
        content_S.append(current_segment)
# The fifth news item after word segmentation
content_S[4]

# Optional
# DataFrame processing
#df_content=pd.DataFrame({'content_S':content_S})
# view the first 5 rows
#df_content.head()



# Load the stop-word list
stopwords=pd.read_csv("E://xinwen//stopwords.txt",index_col=False,sep="\t",quoting=3,names=['stopword'], encoding='utf-8')
stopwords.head(20)
# Function to remove stop words
def drop_stopwords(contents,stopwords):
    contents_clean = []
    all_words = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
            all_words.append(str(word))
        contents_clean.append(line_clean)
    return contents_clean,all_words
    #print (contents_clean)
        
# Convert to lists
contents = content_S  # use the segmented content directly (the df_content DataFrame above is commented out)
stopwords = stopwords.stopword.values.tolist()
# Remove stop words and keep the useful words
contents_clean,all_words = drop_stopwords(contents,stopwords)

# Optional
# DataFrame processing
#df_content=pd.DataFrame({'contents_clean':contents_clean})
# print the first 5 news items with stop words removed
#df_content.head()



df_train=pd.DataFrame({'contents_clean':contents_clean,'label':df_news['category']})
label_mapping = {"汽车": 1, "财经": 2, "科技": 3, "健康": 4, "体育":5, "教育": 6,"文化": 7,"军事": 8,"娱乐": 9,"时尚": 0}
df_train['label'] = df_train['label'].map(label_mapping)
#df_train.head()

from sklearn.model_selection import train_test_split
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(df_train['contents_clean'].values, df_train['label'].values, random_state=1)


# Join each segmented training document back into a space-separated string
words = []
for line_index in range(len(x_train)):
    try:
        words.append(' '.join(x_train[line_index]))
    except:
        print (line_index)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Text feature extraction
vec = TfidfVectorizer(analyzer='word', max_features=4000,  lowercase = False)
#vec = CountVectorizer(analyzer='word', max_features=4000,  lowercase = False)
vec.fit(words)


# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(words), y_train)

# Join each segmented test document back into a space-separated string
test_words = []
for line_index in range(len(x_test)):
    try:
        test_words.append(' '.join(x_test[line_index]))
    except:
        print (line_index)


# Output the classification accuracy on the test set
print(classifier.score(vec.transform(test_words), y_test))


Origin blog.csdn.net/cangzhexingxing/article/details/124200619