Python Data Mining, Part III: Spam SMS Text Classification

Part III: Data Mining - Text Classification

Text classification in general includes eight steps: data exploration analysis, data extraction, text preprocessing, word segmentation, removal of stop words, text vectorization, classifier training, and model assessment. The important Python libraries are numpy (arrays), pandas (processing structured data), matplotlib (plotting, including word clouds, for visual representation), and sklearn (a substantial library of classification and clustering algorithms).

1. Data Exploration Analysis

(1) Obtain a large number of unprocessed documents, each already marked with the category it belongs to.
(2) Assign each document a unique Id, and replace the word category labels with discrete numbers. For example, the categories ['normal message', 'spam message'] are represented by the discrete values [0, 1].
(3) Read these documents into an array with Id, document content, and label as the columns and one sample per row. Form: [[Id1, content1, label1], ..., [Id_n, content_n, label_n]]
Code Example:
import pandas as pd
data = pd.read_csv(csv_file_name, header=None) # read the csv file; header=None because the file contains no column names
data.columns = ['id', 'content', 'label']

1.1 Some methods for getting data out of a DataFrame:

  1. data.loc[] # gets data by row/column labels (a runnable sketch follows this list), for example:
    data.loc[0:2, 'content'] # gets rows 0, 1 and 2 of the content column. [Note]: with .loc, 0:2 includes row 2, which is different from ordinary slicing
    data.loc[[0,2], ['content','label']] # specify the rows and columns by lists
  2. data.iloc[] # indexing by position; the usage is identical to arrays
  3. data['label'] # gets the label column; the result is a one-dimensional array (a Series)
    data[['content','label']] # the result is all the data of the content and label columns
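As a concrete demonstration, here is a minimal runnable sketch with made-up data (the name demo and its contents are invented for illustration):

import pandas as pd

demo = pd.DataFrame({'id': [1, 2, 3],
                     'content': ['hello', 'win a prize now', 'see you soon'],
                     'label': [0, 1, 0]})
print(demo.loc[0:2, 'content']) # label-based: rows 0, 1 and 2 (the end point is included)
print(demo.loc[[0, 2], ['content', 'label']]) # rows and columns picked by lists
print(demo.iloc[0:2, 1]) # position-based: rows 0 and 1 only (the end point is excluded)
print(demo['label']) # a single column comes back as a Series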

1.2 Count the frequency of each label and draw a pie chart

data['label'].value_counts() # gets the frequency of each label value in this column; the result is returned as a Series

1.2.1 Draw a pie chart

num = data['label'].value_counts()
import matplotlib.pyplot as plt
plt.figure(figsize=(3,3)) # set a square 3*3 canvas
plt.pie(num, labels=['normal', 'spam']) # draw a pie chart; num is a Series, which has an index like an array and is used like a dictionary
plt.show()

2. Data Extraction

When the proportions of the different labels are unbalanced, use stratified sampling. For example, if label 0 appears 72,000 times while label 1 appears only 8,000 times, the model trained on this data becomes "lazy" (it can score well by always predicting the majority class).
data_normal = data.loc[data['label'] == 0].sample(1000, random_state=123) # randomly sample 1000 rows from all the data labeled 0 (normal)
data_bad = data.loc[data['label'] == 1].sample(1000, random_state=123) # randomly sample 1000 rows from all the data labeled 1 (spam)
data_new = pd.concat([data_normal, data_bad], axis=0) # concatenation is row-wise by default, so axis could be omitted
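To double-check the result (a small sketch, not part of the original steps), the label counts should now be equal, and shuffling the concatenated rows is a common extra precaution:

data_new['label'].value_counts() # should now show 1000 for each label
data_new = data_new.sample(frac=1, random_state=123).reset_index(drop=True) # shuffle the rows so the two classes are mixed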

3. Text Preprocessing

The content column contains placeholders like "xxx", some special character encodings, and punctuation such as commas and full stops. These are meaningless characters and need to be deleted.

To delete these special non-Chinese characters you need regular expressions. Regular expressions are essential knowledge for web crawlers: a regular expression is a conditional expression used to build matching rules, and those rules are then used to retrieve the substrings of a specified string that conform to them.
import re
afterDeleteSpecialWord = data_new['content'].apply(lambda x: re.sub('[^\u4E00-\u9FD5]+', '', x))
Here apply runs the anonymous function on each element of the Series (i.e. on each document's content string); x is the string passed in. re.sub replaces every substring matching the regular expression '[^\u4E00-\u9FD5]+' (i.e. every run of non-Chinese characters) with ''. In this regular expression, [] is an atom list, ^ inside it means "not", and \u4E00-\u9FD5 is the Unicode range of Chinese characters, so [^\u4E00-\u9FD5] matches any non-Chinese character; the trailing + means the atoms in the list are matched one or more times. There are many online resources on regular expression usage, so it is not explained in detail here.
After this treatment, the punctuation and special characters are gone.
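A quick sketch of the effect on a single made-up message (the string is invented for illustration):

import re
msg = '尊敬的用户xxx,您好!Call 400-123 now'
print(re.sub('[^\u4E00-\u9FD5]+', '', msg)) # prints '尊敬的用户您好' - everything non-Chinese is stripped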

4. Word Segmentation and Stop-Word Removal

The first step is to segment the content column into words. After segmentation, each element of the content column is a list. For example, if a content element was the sentence '我来到北京清华大学计算机学院' ("I came to the School of Computer Science at Tsinghua University in Beijing"), the segmentation result is ['我', '来到', '北京', '清华大学', '计算机', '学院'].
The second step is to remove stop words: first load the stop-word file, which stores a number of stop words, then remove from the first step's segmentation results any word that is present in the stop-word list.
The code is as follows:
import jieba # word segmentation library
with open('stopList.txt', 'r') as f:
    stop = f.read() # the result is one big string, which still contains special characters such as line breaks
stop = stop.split() # split on spaces and line breaks to get the stop-word list
stop = [' '] + stop # the split leaves no space in the list, but the space itself is a stop word, so add it back
jieba.load_userdict(path) # load the user-defined dictionary stored at the given path
after_segement = afterDeleteSpecialWord.apply(jieba.lcut) # segment each document into a word list
data_after = after_segement.apply(lambda x: [i for i in x if i not in stop]) # remove stop words
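Applied to the example sentence from the first step (a minimal sketch; the exact output depends on jieba's dictionary):

import jieba
print(jieba.lcut('我来到北京清华大学计算机学院')) # e.g. ['我', '来到', '北京', '清华大学', '计算机', '学院']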

4.1 Draw a Word Cloud

A word cloud is a visual representation of the word frequencies in the classified text, presented as an image: the more frequent a word, the larger its font; low-frequency words get small fonts.
import matplotlib.pyplot as plt # drawing tools
from wordcloud import WordCloud # word cloud library
import itertools # used here to flatten the two-dimensional word lists into one dimension
pic = plt.imread(picturePath) # picturePath is the path to some picture (not specified here); this line loads a background picture for the drawing board
wc = WordCloud(font_path=r'C:\Windows\Fonts\<font name>', background_color='white', mask=pic) # create a word cloud object. Windows system fonts live in the Fonts folder under Windows on the C drive. Because the statistics contain Chinese, choose a Chinese font rather than an English one; right-click the font file and open Properties to see the exact font name
num = pd.Series(list(itertools.chain(*list(data_after)))).value_counts() # word frequency statistics over all documents
wc.fit_words(num) # feed the word frequency statistics to the word cloud
plt.imshow(wc)
plt.show()

5. Text Vectorization

Text vectorization means the following: what we have at this point is a segmentation result in Chinese, but a computer cannot take Chinese directly as input data for classification; the text must be represented numerically. How to represent a document as a numeric vector is exactly the problem of text vectorization.
Common vectorized representations are the bag-of-words model, term frequency, TF-IDF, and word embeddings that take context into account.
Bag-of-words model: a word that appears in the document is set to 1, and every other word in the vocabulary is set to 0, without considering how many times it appears. A document can then be represented as an N-dimensional 0/1 vector, where N depends on the size of the vocabulary.
Term frequency: on top of the bag-of-words model, count how many times each word appears, instead of just recording a 1 when it appears.
TF-IDF: considers both a word's term frequency (TF) and its inverse document frequency (IDF). The inverse document frequency measures how rare the word is across all documents: the fewer documents the word appears in, the rarer and more discriminative it is, and the higher its inverse document frequency. IDF = log(total number of documents / (number of documents containing the word + 1)), and TF-IDF = TF * IDF.
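A small worked example of this formula (the numbers are invented for illustration): suppose there are 100 documents in total, the word '中奖' ("win a prize") appears in 4 of them, and its term frequency in the current document is 3. Then IDF = log(100 / (4 + 1)) = log(20) ≈ 3.0 (natural logarithm), and TF-IDF = 3 * 3.0 ≈ 9.0. A common word appearing in 99 of the 100 documents would instead get IDF = log(100 / (99 + 1)) = log(1) = 0, so it contributes nothing to the vector, which is exactly the intended discrimination effect.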
In sklearn's feature_extraction.text package, CountVectorizer and TfidfVectorizer can do this text vectorization work directly: one is based on term frequency, the other on TF-IDF.
from sklearn.feature_extraction.text import CountVectorizer
tmp = data_after.apply(lambda x: ' '.join(x)) # the vectorizers only count space-separated tokens, so the word lists from before must be converted into one big space-separated string per document
cv = CountVectorizer().fit(tmp) # build the vocabulary from the text data to be vectorized
vector_data = cv.transform(tmp) # vectorize; the result is a sparse matrix
vector_array = vector_data.toarray() # convert the sparse matrix into an array
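The TF-IDF variant follows the exact same pattern (a minimal sketch; note that sklearn's TfidfVectorizer uses a smoothed variant of the IDF formula given above):

from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer().fit(tmp) # build the vocabulary and compute the IDF values
tfidf_array = tv.transform(tmp).toarray() # each row is one document's TF-IDF vector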

6. Text Classification

The steps from here on are exactly the same as any ordinary machine-learning classification problem, so they are not covered in detail. With the structured data vector_array and the corresponding labels in hand, sklearn's various models can be used for training, testing, model assessment, and so on.
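As one possible ending (a minimal sketch under the assumption that data_new['label'] lines up row-for-row with vector_array; naive Bayes is just a common baseline for text, not the only choice):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

x_train, x_test, y_train, y_test = train_test_split(
    vector_array, data_new['label'], test_size=0.2, random_state=123) # 80/20 train/test split
model = MultinomialNB().fit(x_train, y_train) # train a naive Bayes classifier on the count vectors
print(accuracy_score(y_test, model.predict(x_test))) # accuracy on the held-out test set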
