Table of contents
1. Word Frequency Statistics – Corpus Construction
Text Mining: transforming textual information into usable knowledge. When classifying a large number of files, articles of different categories are usually saved in different folders.
Similarly, read the [text files] to be analyzed into [variables], then use a suitable data structure to hold these texts in memory for the next analysis step. This [in-memory variable] is the [corpus] we want to build.
[corpus]: the collection of all documents to be analyzed
import os
import os.path

filePaths = []
for root, dirs, files in os.walk("F:\\2.1 语料库\\2.1\\SogouC.mini\\Sample"):
    for name in files:
        # os.path.join() joins the directory path and the file name
        filePaths.append(os.path.join(root, name))
for root, dirs, files in os.walk("F:\\2.1 语料库\\2.1\\SogouC.mini\\Sample"):
    print(root)   # e.g. D:\学习资料\2.1 语料库\2.1\SogouC.mini\Sample\C000013
    print(dirs)
    print(files)  # e.g. ['10.txt', '11.txt', '12.txt', '13.txt', '14.txt', '15.txt', '16.txt', '17.txt', '18.txt', '19.txt']
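To make the behavior of os.walk concrete, here is a minimal, self-contained sketch that builds a throwaway directory tree and collects full file paths the same way as above (the folder and file names are invented for illustration):

```python
import os
import tempfile

# build a throwaway tree: <tmp>/C000013/10.txt and <tmp>/C000013/11.txt
base = tempfile.mkdtemp()
sub = os.path.join(base, "C000013")
os.mkdir(sub)
for name in ("10.txt", "11.txt"):
    with open(os.path.join(sub, name), "w", encoding="utf-8") as f:
        f.write("demo")

filePaths = []
for root, dirs, files in os.walk(base):
    for name in sorted(files):               # sort for a stable order
        filePaths.append(os.path.join(root, name))  # directory + file name

print(filePaths)
```

os.walk visits every subdirectory recursively, so nested category folders are picked up without extra code.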
read data
import codecs

filePaths = []
fileContents = []
for root, dirs, files in os.walk("F:\\2.1 语料库\\2.1\\SogouC.mini\\Sample"):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        f = codecs.open(filePath, 'r', 'utf-8')
        # call read() to load the content, then save it into fileContents
        fileContent = f.read()
        f.close()
        fileContents.append(fileContent)
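The read loop above can be exercised end to end on throwaway files; this sketch writes two small UTF-8 files (the names and texts are invented stand-ins for the Sogou sample articles) and reads them back with codecs.open:

```python
import codecs
import os
import tempfile

# throwaway files standing in for the sample articles
base = tempfile.mkdtemp()
texts = {"10.txt": "体育新闻", "11.txt": "财经新闻"}
for name, text in texts.items():
    with codecs.open(os.path.join(base, name), "w", "utf-8") as f:
        f.write(text)

filePaths, fileContents = [], []
for root, dirs, files in os.walk(base):
    for name in sorted(files):
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        f = codecs.open(filePath, "r", "utf-8")  # open with an explicit encoding
        fileContents.append(f.read())
        f.close()

print(fileContents)
```

Opening with an explicit encoding matters: Chinese text read with the wrong default encoding comes back garbled.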
Build the [corpus] as a DataFrame
import pandas

# organize the file contents into a data frame; this data frame is the corpus (corpos)
corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})
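Since the two lists were filled in the same loop, they line up row by row; a minimal sketch with invented toy data shows the resulting shape:

```python
import pandas

# toy stand-ins for the walked paths and file contents
filePaths = ["Sample/C000013/10.txt", "Sample/C000013/11.txt"]
fileContents = ["first article text", "second article text"]

# each list becomes a column; position i of each list forms row i
corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})
print(corpos.shape)  # (rows, columns)
```

One row per document and one column per attribute is the layout the later segmentation step relies on.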
Corpus: a data frame with two columns, file path and file content.
Summary: construction methods for the [corpus]:
[directory traversal]: os.walk(fileDir)  # fileDir is the directory path
[file read]: codecs.open(filePath, mode, encoding)
[path joining]: os.path.join(root, name)
2. Word frequency statistics – Chinese word segmentation
- install jieba
pip install jieba
- jieba.cut(content)  # content: the sentence to be segmented
  - returns segments, an iterable of the individual words
- jieba.add_word()
  - Optimize the segmentation result
  - Add custom words
    jieba.add_word(word)  # word: the word to add
    jieba.add_word("天罡北斗阵")
  - Import a custom dictionary
    jieba.load_userdict(filePath)  # filePath: path to the custom dictionary
    jieba.load_userdict("F:\\2.2 中文分词\\2.2\\金庸武功招式.txt")
- Specific steps
import jieba

filePaths, segments = [], []
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    segs = jieba.cut(fileContent)
    for seg in segs:
        if len(seg.strip()) > 0:
            segments.append(seg)
            filePaths.append(filePath)
# inspect the rows while iterating (do not reset filePaths/segments here,
# or the results collected above are lost)
for index, row in corpos.iterrows():
    print(row['filePath'])
    print(row['fileContent'])
new data frame
segDF = pandas.DataFrame({
    'filePath': filePaths,
    'segment': segments
})
iterrows(): iterates over the DataFrame as (index, Series) pairs.
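A small self-contained sketch of iterrows (the file names and texts are invented):

```python
import pandas

df = pandas.DataFrame({'filePath': ['a.txt', 'b.txt'],
                       'fileContent': ['hello', 'world']})

pairs = []
for index, row in df.iterrows():
    # row is a Series; index it by column name, like a dictionary
    pairs.append((index, row['filePath'], row['fileContent']))

print(pairs)
```

Each row arrives as a Series keyed by column name, which is why row['filePath'] and row['fileContent'] work in the loops below.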
- 2 After an article is segmented, each segment must carry one extra piece of information: which article it came from, so we record the [source] of each segment
- 2.1 Use the [data frame] traversal method corpos.iterrows() to iterate over each row of the [corpus]
- 2.2 Each row behaves like a [dictionary]: use the column name as the key to obtain the [file path] and [file content]
import jieba

segments = []
filePaths = []
for index, row in corpos.iterrows():
    # 2.1 fetch the contents of corpos row by row
    filePath = row['filePath']
    fileContent = row['fileContent']
    # 2.2 call jieba.cut() to segment the article
    segs = jieba.cut(fileContent)
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)
# 2.3 store the result in a [data frame]
segmentDataFrame = pandas.DataFrame({
    'segment': segments,
    'filePath': filePaths
})
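The whole segmentation step can be sketched end to end. Because jieba may not be available everywhere, a whitespace split stands in for jieba.cut below; the file names and texts are invented, and only the segment/source bookkeeping pattern is the point:

```python
import pandas

def fake_cut(text):
    # stand-in for jieba.cut: splits on whitespace instead of real Chinese segmentation
    return text.split()

corpos = pandas.DataFrame({
    'filePath': ['a.txt', 'b.txt'],
    'fileContent': ['one two', 'three four three']
})

segments, filePaths = [], []
for index, row in corpos.iterrows():
    for seg in fake_cut(row['fileContent']):
        segments.append(seg)            # the segment itself
        filePaths.append(row['filePath'])  # its source article

segmentDataFrame = pandas.DataFrame({'segment': segments,
                                     'filePath': filePaths})
print(len(segmentDataFrame))
```

Appending the file path once per segment is what lets the next chapter group word counts by article.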