Text Mining Case: Text Mining and Word Cloud Drawing Based on the Content of CSDN Blog Articles


1. Corpus preparation
1. Get the article address

First, select the blogger to be analyzed and open their homepage.


The homepage address shown in the browser's address bar is:

https://blog.csdn.net/yt266666

You need to collect the addresses of all blog posts from the homepage:

  • When the number of articles is small, you can copy them manually
  • When the number of articles is large, parse the source code of the page

html = requests.get(url="https://blog.csdn.net/yt266666", headers=headers).text

Use this code to fetch the page source, or right-click on the page and choose "Save As".
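
The headers dict used in the request above is not defined in the original post; a minimal sketch, assuming a plain browser-style User-Agent is enough to fetch the page:

import requests

# Assumed headers: a common desktop User-Agent so the request is not
# rejected as an obvious script (the original headers are not shown).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

html = requests.get(url="https://blog.csdn.net/yt266666", headers=headers).text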


There are many ways to parse the page source code, including regular expressions, full-text search, BeautifulSoup, XPath, etc.

After parsing with XPath, the following results are obtained:

1	https://blog.csdn.net/yt266666/article/details/127559647
2	https://blog.csdn.net/yt266666/article/details/127539343
3	https://blog.csdn.net/yt266666/article/details/127511361
4	https://blog.csdn.net/yt266666/article/details/127474784
5	https://blog.csdn.net/yt266666/article/details/127453067
6	https://blog.csdn.net/yt266666/article/details/127452708
7	https://blog.csdn.net/yt266666/article/details/127427285
8	https://blog.csdn.net/yt266666/article/details/127405088
9	https://blog.csdn.net/yt266666/article/details/127401802
10	https://blog.csdn.net/yt266666/article/details/127394061
11	https://blog.csdn.net/yt266666/article/details/127377217
12	https://blog.csdn.net/yt266666/article/details/127344774
13	https://blog.csdn.net/yt266666/article/details/127334902
14	https://blog.csdn.net/yt266666/article/details/127334143
15	https://blog.csdn.net/yt266666/article/details/127306966
16	https://blog.csdn.net/yt266666/article/details/127284543
17	https://blog.csdn.net/yt266666/article/details/127271663
18	https://blog.csdn.net/yt266666/article/details/127269150
19	https://blog.csdn.net/yt266666/article/details/127268544
20	https://blog.csdn.net/yt266666/article/details/127268431

The main code for parsing the page source to obtain the article addresses is as follows:

from lxml import etree

parser = etree.HTMLParser(encoding="utf-8")
tree = etree.HTML(html, parser=parser)  # load the HTML document
blogList = tree.xpath('//div[@class="mainContent"]//a/@href')  # extract all link addresses
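
The XPath above collects every href under the article list, so the result may still need filtering to keep only article detail pages. A small, assumed post-processing step (not shown in the original):

articleUrls = [u for u in blogList if "/article/details/" in u]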
2. Get content by address

Pass each address in and extract all the text inside the blog content tag on the page:

def getText(blogUrl):
    html = getHTMLText(blogUrl)
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.HTML(html, parser=parser)  # load the HTML document
    html = tree.xpath('//div[@class="blog-content-box"]//text()')  # all text nodes in the content box
    html = "".join(html).replace('\n', '')  # join and drop newlines
    html = html.replace(" ", "")            # drop spaces
    return html

This approach is rough but barely usable.
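
The helper getHTMLText used above is not shown in the original; a minimal sketch, assuming it simply wraps requests.get with the same headers and returns the page source:

def getHTMLText(url):
    # Assumed implementation: fetch the page and return its source as text
    resp = requests.get(url, headers=headers, timeout=10)
    resp.encoding = "utf-8"
    return resp.text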

Then store the text in a local TXT file for easy reading:

with open('blogText.txt', 'a', encoding='utf-8') as fd:
    fd.write(blogText)
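
Putting the pieces together, the whole corpus can be collected with a simple loop over the parsed addresses; this loop is a sketch assumed from the functions above, not code shown in the original:

for blogUrl in blogList:
    blogText = getText(blogUrl)
    with open('blogText.txt', 'a', encoding='utf-8') as fd:  # append so all posts accumulate
        fd.write(blogText)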
2. Text Mining
1. Read text

open() returns a file object, and read() reads the entire document into a single string:

import re

text = open('blogText.txt', 'r', encoding='utf-8').read()
text = re.sub('[^\u4e00-\u9fa5]+', '', text)  # keep only Chinese characters

Notes on the read methods:

  • .read() reads the entire file at once; it is usually used to put the contents of the file into a single string variable.

  • .readlines() reads the whole file and splits it into a list of lines.

  • .readline() reads one line at a time and is usually much slower than .readlines().

  • .readline() should only be used when there is not enough memory to read the entire file at once.

2. Chinese word segmentation

Use full mode for word segmentation; the jieba dependency needs to be imported here:

import jieba

seg_list = jieba.cut(text, cut_all=True)  # full mode: list every possible word
print(list(seg_list))

The word segmentation result looks like this:

'类型', '输入', '二维', '矩阵', '向量', '或', '常数', '它们', '的', '值', '将', '决定', '图像', '的', '绘制', '位置', '如果', '坐标', '为', '常数', '数则', '绘制', '平行', '于', '平面', '的', '二维', '图形', '其他', '两个', '坐标', '同理', '如果', '与', '均', '为', '向量', '向量', '长度', '必须', '与', '内', '的', '矩阵', '相同', '如下', '所示', '旋转', '转角', '角度', '旋转', '观看', '角度', '定义', '观看', '方向', '的', '角度', '横向', '旋转', '图形', '观看', '看表', '表示', '纵向', '旋转', '图形', '观看', '从', '左', '至', '右', '原图', '背景', '方框', '方框框', '框框', '的', '类型', '默认', '只', '绘制', '背景', '面板', '类型', '有', '多种', '如', '图形', '全', '方框', '后面', '面板', '后面', '面板', '网格', '格网', '网格', '后面', '面板', '黑色', '背景', '网格', '黑色', '背景', '用户', '手动', '调整', '无边', '边框', '修改', '颜色', '该', '变量', '设置', '用于', '为', '图像', '添加', '颜色', '如果', '设置', '为', '空值', '将', '不会', '会生', '生成', '图像', '用于', '变量', '如果', '为', '而', '我们', '输入', '了', '参数', '那么', '将', 
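
Full mode deliberately lists every possible word, which is why overlapping fragments such as '方框框' and '框框' appear above. For comparison, jieba's default precise mode returns a non-overlapping segmentation (a brief aside, not part of the original post):

seg_list = jieba.cut(text, cut_all=False)  # precise mode (the default)
print(list(seg_list))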
3. Part of speech tagging

Tag each word with its part of speech based on the segmentation:

import jieba.posseg as pseg

words = pseg.cut(text[:50])  # tag only the first 50 characters as a demo
for word, label in words:
    print("词语:" + word + "\t\t\t" + "词性:" + label)
词语:背景			词性:n
词语:方框			词性:n
词语:修改			词性:v
词语:颜色			词性:n
词语:设置			词性:vn
词语:图例			词性:n
词语:数值			词性:n
词语:范围			词性:n
4. Remove stop words

Some words can appear in almost any article, so their presence is meaningless for our analysis and we remove them.

Here I added some stop words to the list:

StopWords = ['收藏', '中', '与', '我们', '专栏', '一个', '文章', '方法',
             '版权', '进行', '使用', '能够', '并', '对', '可以']
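
The original post does not show how the stop-word list is applied; a minimal sketch, assuming the segmented words are simply filtered before any counting:

# Keep only the segmented words that are not in the stop-word list
words = [w for w in jieba.cut(text) if w not in StopWords]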


5. Part-of-speech distribution analysis

The part-of-speech distribution is tabulated as follows:

Part of speech    Frequency    Proportion
noun              4709         0.384785
verb              3441         0.281588
adverb            561          0.045908
pronoun           500          0.040917
noun-verb         451          0.036907
adjective         370          0.030278
conjunction       327          0.026759
numeral           322          0.026350
preposition       295          0.024141
locative word     225          0.018412

With the data in hand, simply draw a bar chart for display:

(figure: bar chart of the part-of-speech frequency distribution)
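
A minimal matplotlib sketch of such a chart, using the counts from the table above (the original plotting code is not shown):

import matplotlib.pyplot as plt

# Counts taken from the part-of-speech table above
labels = ["noun", "verb", "adverb", "pronoun", "noun-verb", "adjective",
          "conjunction", "numeral", "preposition", "locative word"]
counts = [4709, 3441, 561, 500, 451, 370, 327, 322, 295, 225]

plt.figure(figsize=(10, 5))
plt.bar(labels, counts)
plt.xlabel("Part of speech")
plt.ylabel("Frequency")
plt.title("Part-of-speech distribution")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()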

6. High-frequency vocabulary analysis

(figure: high-frequency word statistics)
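
The original figure is not reproduced here; a minimal sketch of counting the high-frequency words with collections.Counter (the exact method used in the original is not shown):

from collections import Counter

# Count word frequencies after segmentation and stop-word filtering;
# single characters are dropped because they are rarely informative here
wordCounts = Counter(w for w in jieba.cut(text)
                     if w not in StopWords and len(w) > 1)
for word, count in wordCounts.most_common(20):
    print(word, count)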

7. Word cloud drawing

(figure: word cloud generated from the blog text)
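
The rendered cloud is not reproduced here; a minimal sketch using the wordcloud package, assuming a Chinese-capable font file such as simhei.ttf is available locally (the original drawing code is not shown):

from wordcloud import WordCloud

# jieba output must be space-separated so WordCloud can split it into tokens
segmented = " ".join(w for w in jieba.cut(text) if w not in StopWords)

wc = WordCloud(
    font_path="simhei.ttf",      # assumed: a font that supports Chinese characters
    background_color="white",
    width=800,
    height=600,
)
wc.generate(segmented)
wc.to_file("wordcloud.png")      # save the rendered word cloud to disk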


Origin blog.csdn.net/yt266666/article/details/127683788