Machine Learning in Action - Bayesian Algorithm - 24

Naive Bayes - News Classification

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
# load the full 20 Newsgroups dataset (training + test subsets combined)
news = fetch_20newsgroups(subset='all')
print(news.target_names)
print(len(news.data))
print(len(news.target))


print(len(news.target_names))


news.data[0]


print(news.target[0])
print(news.target_names[news.target[0]])


# randomly split the data into training and test sets (default 75% / 25%)
x_train,x_test,y_train,y_test = train_test_split(news.data,news.target)
# train = fetch_20newsgroups(subset='train')
# x_train = train.data
# y_train = train.target
# test = fetch_20newsgroups(subset='test')
# x_test = test.data
# y_test = test.target


from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

# the learned vocabulary and the document-term count matrix
print(cv.get_feature_names())
print(cv_fit.toarray())

print(cv_fit.toarray().sum(axis=0))


from sklearn import model_selection 
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer()
# bag-of-words count features for the training texts
cv_data = cv.fit_transform(x_train)
mul_nb = MultinomialNB()

# 3-fold cross-validation of multinomial Naive Bayes on the count features
scores = model_selection.cross_val_score(mul_nb, cv_data, y_train, cv=3, scoring='accuracy')  
print("Accuracy: %0.3f" % (scores.mean())) 

TfidfVectorizer uses a weighting scheme called Term Frequency-Inverse Document Frequency (TF-IDF), a statistic that measures how important a word is to a document within a corpus. Intuitively, it weighs a word's frequency in the current document against its frequency across the whole corpus. This normalizes the result and prevents words that appear very often yet carry little information about a particular document (for example, "a" and "and" in English) from dominating the representation of an instance.
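
As a minimal sketch of the statistic itself (not of TfidfVectorizer's full implementation), the snippet below computes TF-IDF by hand, assuming sklearn's default smoothed IDF formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization; the toy corpus is made up for illustration.

import math

# toy corpus, made up for illustration
docs = [["dog", "cat", "fish"], ["dog", "cat", "cat"], ["fish", "bird"]]
vocab = sorted({w for d in docs for w in d})
n = len(docs)

# document frequency: in how many documents each term appears
df = {t: sum(t in d for d in docs) for t in vocab}
# smoothed inverse document frequency (sklearn's default formula, assumed here)
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

def tfidf(doc):
    # raw term count times idf, then L2-normalize the document vector
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [round(v / norm, 3) for v in raw]

for d in docs:
    print(d, tfidf(d))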

from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transformer
vectorizer = TfidfVectorizer()
# tokenize and build the vocabulary
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode one document
vector = vectorizer.transform([text[0]])
# summarize the encoded document
print(vector.shape)
print(vector.toarray())


# create the transformer
vectorizer = TfidfVectorizer()
# tokenize and build the vocabulary on the training texts
tfidf_train = vectorizer.fit_transform(x_train)

scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy') 
print("Accuracy: %0.3f" % (scores.mean())) 


def get_stop_words():
    result = set()
    for line in open('stopwords_en.txt', 'r').readlines():
        result.add(line.strip())
    return result

# load the stop words
stop_words = get_stop_words()
# create the transformer, filtering out the stop words
vectorizer = TfidfVectorizer(stop_words=stop_words)


# use light Laplace smoothing (alpha=0.01)
mul_nb = MultinomialNB(alpha=0.01)

# tokenize and build the vocabulary on the training texts
tfidf_train = vectorizer.fit_transform(x_train)

scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy') 
print("Accuracy: %0.3f" % (scores.mean())) 


# vectorize the whole corpus, then split it into training and test sets
tfidf_data = vectorizer.fit_transform(news.data)
x_train,x_test,y_train,y_test = train_test_split(tfidf_data,news.target)

mul_nb.fit(x_train,y_train)
# accuracy on the training set
print(mul_nb.score(x_train, y_train))

# accuracy on the test set
print(mul_nb.score(x_test, y_test))


Bayesian spell checker

Spelling Checker Principle
Among all correctly spelled words, we want to find the correct word c that maximizes the conditional probability given the typed word w. That is, we solve:
P(c|w) = P(w|c) P(c) / P(w)
For example, if the typed word is w = appla, then apple and apply are candidate correct words c. Since P(w) is the same for both apple and apply, it can be dropped, leaving the quantity to maximize:
P(w|c) P(c)

P(c) is the probability that the correctly spelled word c appears in an English text, i.e. how likely c is to occur in an article.
We assume that the more often a word appears in the corpus, the more likely it is to be the intended correct spelling, so this probability can be replaced by the word's occurrence count. For example, P('the') is relatively high, while P('zxzxzxzyy') is close to 0 (assuming the latter is even a word).
P(w|c) is the probability of typing w when the user meant to type c, i.e. the probability of mistyping c as w.
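
A toy sketch of this argmax (with made-up frequencies, before the real dictionary is built below): P(c) is approximated by how often c occurs in a corpus, and P(w|c) is handled later simply by preferring candidates at smaller edit distance.

# hypothetical occurrence counts standing in for P(c)
counts = {'apple': 120, 'apply': 80}
# both candidates are at edit distance 1 from the typo 'appla'
candidates = ['apple', 'apply']
# among equally distant candidates, pick the more frequent word
print(max(candidates, key=lambda c: counts[c]))   # 'apple'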

import re
# read the corpus
text = open('big.txt').read()
# lowercase and keep only runs of the letters a-z
text = re.findall('[a-z]+', text.lower())
# word -> number of occurrences in the corpus
dic_words = {}
for t in text:
    dic_words[t] = dic_words.get(t,0) + 1
dic_words

Edit distance:
The edit distance between two words is the number of insertions (inserting a single letter), deletions (removing a single letter), transpositions (swapping two adjacent letters), and replacements (changing a single letter into another) needed to turn one word into the other.
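
For illustration, here is a small sketch that computes this edit distance (insertion, deletion, transposition, replacement) directly between two words; the tutorial itself only needs to enumerate the neighborhood of a word, not this function.

def edit_distance(a, b):
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # replacement (or match)
            # transposition of two adjacent letters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[len(a)][len(b)]

print(edit_distance('hello', 'hallo'))        # 1
print(edit_distance('something', 'soothing')) # 2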

# the alphabet
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# return the set of all strings at edit distance 1 from word
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion
apple = 'apple'
apple[0:0] + apple[1:]


# return the set of all strings at edit distance 2 from word
# (only those that are real dictionary words will later be kept as candidates)
def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))
e1 = edits1('something')
e2 = edits2('something')
len(e1) + len(e2)

The number of strings within edit distance 1 or 2 of 'something' reaches 114,818.
Optimization: keep only the strings that are actual dictionary words as candidates. With this restriction, edits2 effectively yields just 3 candidates for 'something': 'smoothing', 'something' and 'soothing'.
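
A sketch of that optimization (assuming the dic_words dictionary built from big.txt above): generate the distance-2 candidates but keep only those that are actual words, instead of materializing the whole edits2 set.

# only the distance-2 edits that are real dictionary words
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in dic_words)

print(known_edits2('something'))   # {'smoothing', 'something', 'soothing'}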

Solving for P(w|c): in reality, the probability of mistyping one vowel as another is higher than for consonants (people often type hello as hallo), the first letter of a word is less likely to be mistyped, and so on. For simplicity, a cruder rule is used here: a correct word at edit distance 0 takes priority over one at edit distance 1, which in turn takes priority over one at edit distance 2. In other words, hello is more likely to be mistyped as hallo (distance 1) than as halo (distance 2).

# keep only the words that exist in the dictionary
def known(words):
    w = set()
    for word in words:
        if word in dic_words:
            w.add(word)
    return w

# first generate candidates by edit distance, then pick the best-matching word
def correct(word):
    # get the candidate words
    # if known(set) is non-empty, candidates takes that set and the later calls are never evaluated
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or word
    # no similar word exists in the dictionary
    if word == candidates:
        return word
    # return the most frequent candidate
    max_num = 0
    for c in candidates:
        if dic_words[c] >= max_num:
            max_num = dic_words[c]
            candidate = c
    return candidate
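
A quick usage sketch (the actual outputs depend on the word counts in big.txt):

print(correct('appla'))      # 'apple' or 'apply', whichever is more frequent in the corpus
print(correct('smoothign'))  # 'smoothing', reachable by a single transposition
print(correct('zxzxzxzyy'))  # no dictionary word nearby, so it is returned unchanged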


Origin blog.csdn.net/qq_37978800/article/details/113877327