2. Text Analysis

1. Word Frequency Statistics – Corpus Construction

Text mining: transforming textual information into usable knowledge. When classifying a large number of files, articles of different categories are usually saved in separate folders.

Similarly, we read the [text files] to be analyzed into [variables], then use suitable data structures to hold these text files in memory for the next step of analysis. This [in-memory variable] is the [corpus] we want to build.

[corpus]: the collection of all documents to be analyzed

import os
import os.path

filePaths = []
for root, dirs, files in os.walk("F:\\2.1 语料库\\2.1\\SogouC.mini\\Sample"):
    # os.path.join() splices a directory and a file name into a full path
    for name in files:
        filePaths.append(os.path.join(root, name))  # directory + file name

# Walk the tree again to see what os.walk yields at each level
for root, dirs, files in os.walk("F:\\2.1 语料库\\2.1\\SogouC.mini\\Sample"):
    print(root)   # e.g. F:\2.1 语料库\2.1\SogouC.mini\Sample\C000013
    print(dirs)   # list of subdirectory names under root
    print(files)  # e.g. ['10.txt', '11.txt', '12.txt', '13.txt', '14.txt', '15.txt', '16.txt', '17.txt', '18.txt', '19.txt']

Read the data

import codecs

filePaths = []
fileContents = []
for root, dirs, files in os.walk("F:\\2.1 语料库\\2.1\\SogouC.mini\\Sample"):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        f = codecs.open(filePath, 'r', 'utf-8')
        # call read() to load the file content and save it in fileContent
        fileContent = f.read()
        f.close()
        fileContents.append(fileContent)

Build the [corpus], which is stored as a DataFrame

import pandas

# Organize the collected file contents into a data frame; this data frame is the corpus (corpos)
corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})

[Figure: the corpos data frame, with filePath and fileContent columns]
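
A quick sanity check on the result (a minimal sketch, assuming corpos from the code above is still in the session):

print(corpos.shape)   # (number of documents, 2)
print(corpos.head())  # first few rows: filePath and fileContent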


Summary: construction of the [corpus]: os.walk(fileDir)  # fileDir is the [file path] to scan
[file reading]: codecs.open(filePath, mode, encoding)
splicing a file path: os.path.join(root, name)
The sketch below wraps these three calls together.
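
As a recap, here is a minimal sketch combining the three calls; build_corpus is a hypothetical helper name, not part of the original code:

import codecs
import os

import pandas

def build_corpus(fileDir):  # hypothetical helper; fileDir is the root directory to scan
    filePaths, fileContents = [], []
    for root, dirs, files in os.walk(fileDir):
        for name in files:
            filePath = os.path.join(root, name)  # splice directory + file name
            f = codecs.open(filePath, 'r', 'utf-8')
            fileContents.append(f.read())        # read the whole file
            f.close()
            filePaths.append(filePath)
    return pandas.DataFrame({'filePath': filePaths, 'fileContent': fileContents})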

2. Word Frequency Statistics – Chinese Word Segmentation

  1. Install jieba
pip install jieba
- jieba.cut(content)  # content: the sentence to be segmented
- returns segments, an iterable of the resulting tokens (see the sketch below)
- jieba.add_word()  # adds a custom word to the dictionary
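
A minimal sketch of jieba.cut on one sentence (the sample text is illustrative only):

import jieba

content = "我来到北京清华大学"   # illustrative sample sentence
segs = jieba.cut(content)        # returns a generator of tokens
print(list(segs))                # e.g. ['我', '来到', '北京', '清华大学']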
  2. Optimize the segmentation result
  • Add custom words
jieba.add_word(word)  # word: the term to add
jieba.add_word("天罡北斗阵")
  • Import a custom dictionary
jieba.load_userdict(filePath)  # filePath: path to the custom dictionary file
jieba.load_userdict("F:\\2.2 中文分词\\2.2\\金庸武功招式.txt")
  3. Specific steps
import jieba
filePaths, segments = [], []
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    segs = jieba.cut(fileContent)       # segment this article
    for seg in segs:
        if len(seg.strip()) > 0:        # skip whitespace-only tokens
            segments.append(seg)
            filePaths.append(filePath)

# Inspect the traversed rows; note: do not reset filePaths/segments here,
# or the data frame built below would be empty
for index, row in corpos.iterrows():
    print(row['filePath'])
    print(row['fileContent'])

Build a new data frame from the segmentation result:

segDF = pandas.DataFrame({
    'filePath': filePaths,
    'segment': segments
})

[Figure: the segDF data frame, with filePath and segment columns]
iterrows(): iterates over the DataFrame as (index, Series) pairs; see the sketch below.
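
A small sketch of what iterrows() yields, using a toy data frame rather than the real corpus:

import pandas

toy = pandas.DataFrame({'filePath': ['10.txt'], 'fileContent': ['some text']})
for index, row in toy.iterrows():
    print(index)            # the row label, here 0
    print(row['filePath'])  # row is a Series; columns are accessed like dictionary keys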

  2. After an article has been segmented, each token must carry one extra piece of information: which article it came from. That is, we need to record the [source] of every token.
    • 2.1 Use the [data frame] traversal method corpos.iterrows() to walk through each row of the [corpus].
    • 2.2 Treat each returned row as a [dictionary]: using the column name as the key, fetch the [file path] and [file content] the way you would fetch a dictionary value.
import jieba

segments = []
filePaths = []
for index, row in corpos.iterrows():
    # 2.1 fetch this row's values from corpos
    filePath = row['filePath']
    fileContent = row['fileContent']
    # 2.2 call jieba.cut() to segment the article
    segs = jieba.cut(fileContent)
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)
# 2.3 store the result in a [data frame]
segmentDataFrame = pandas.DataFrame({
    'segment': segments,
    'filePath': filePaths
})
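
With every token and its source stored in segmentDataFrame, the word frequencies themselves can be counted. A minimal sketch using pandas value_counts (one standard way; the name segStat is an assumption):

# Count how often each token occurs, most frequent first (segStat is a hypothetical name)
segStat = segmentDataFrame['segment'].value_counts()
print(segStat.head(10))  # top 10 tokens and their counts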

Origin blog.csdn.net/weixin_46713695/article/details/131353949