1. Corpus preparation
1. Get the article address
First, select the blogger to analyze and open their homepage.
The homepage address shown at the top of the browser is:
https://blog.csdn.net/yt266666
You need to collect the addresses of all blog posts from the homepage:
- When there are only a few articles, you can copy the links manually
- When there are many articles, parse the page source code instead
html = requests.get(url="https://blog.csdn.net/yt266666", headers=headers).text
Use this code to fetch the page source, or right-click the page and choose "Save As".
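The `headers` variable in the snippet above is not defined in the original post; a minimal assumption is a plain browser User-Agent, for example:

```python
import requests

# Assumed request headers: a browser User-Agent string so the blog
# platform serves the normal HTML page to the script.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
```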
There are many ways to parse page source code, including regular expressions, plain string searching, BeautifulSoup, XPath, and so on.
After parsing with XPath, the following results are obtained:
1 https://blog.csdn.net/yt266666/article/details/127559647
2 https://blog.csdn.net/yt266666/article/details/127539343
3 https://blog.csdn.net/yt266666/article/details/127511361
4 https://blog.csdn.net/yt266666/article/details/127474784
5 https://blog.csdn.net/yt266666/article/details/127453067
6 https://blog.csdn.net/yt266666/article/details/127452708
7 https://blog.csdn.net/yt266666/article/details/127427285
8 https://blog.csdn.net/yt266666/article/details/127405088
9 https://blog.csdn.net/yt266666/article/details/127401802
10 https://blog.csdn.net/yt266666/article/details/127394061
11 https://blog.csdn.net/yt266666/article/details/127377217
12 https://blog.csdn.net/yt266666/article/details/127344774
13 https://blog.csdn.net/yt266666/article/details/127334902
14 https://blog.csdn.net/yt266666/article/details/127334143
15 https://blog.csdn.net/yt266666/article/details/127306966
16 https://blog.csdn.net/yt266666/article/details/127284543
17 https://blog.csdn.net/yt266666/article/details/127271663
18 https://blog.csdn.net/yt266666/article/details/127269150
19 https://blog.csdn.net/yt266666/article/details/127268544
20 https://blog.csdn.net/yt266666/article/details/127268431
The core code for parsing the page source and extracting the article addresses is as follows:
from lxml import etree

parser = etree.HTMLParser(encoding="utf-8")
tree = etree.HTML(html, parser=parser)  # parse the HTML document
blogList = tree.xpath('//div[@class="mainContent"]//a/@href')
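The href list extracted this way may also pick up links that are not article pages. A small, hedged filter (assuming every article URL contains `/article/details/`, as in the list above) keeps only the post addresses and drops duplicates:

```python
# Keep only links that point to article detail pages and remove duplicates,
# preserving the order in which they appear on the homepage.
articleUrls = []
for href in blogList:
    if "/article/details/" in href and href not in articleUrls:
        articleUrls.append(href)

for i, url in enumerate(articleUrls, start=1):
    print(i, url)
```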
2. Get content by address
Pass each article address in and extract all the text inside the blog content tag of that page:
def getText(blogUrl):
    # getHTMLText: a helper that downloads the page source (e.g. a thin wrapper around requests.get, as above)
    html = getHTMLText(blogUrl)
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.HTML(html, parser=parser)  # parse the HTML document
    # Collect every text node inside the blog content container
    html = tree.xpath('//div[@class="blog-content-box"]//text()')
    html = "".join(html).replace('\n', '')
    html = html.replace(" ", "")
    return html
This approach is crude, but it works well enough.
Then store the text in a local TXT file so it is easy to read later:
with open('blogText.txt', 'a', encoding='utf-8') as fd:
    fd.write(blogText)
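Putting the pieces together, a hedged driver loop (assuming the `articleUrls` list from the earlier sketch and the `getText` helper above) fetches every article and appends its text to the file:

```python
# Walk through every article address, extract its text, and append it
# to blogText.txt so the whole corpus ends up in one local file.
for url in articleUrls:
    blogText = getText(url)
    with open('blogText.txt', 'a', encoding='utf-8') as fd:
        fd.write(blogText)
```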
2. Text Mining
1. Read text
Use open to get a file object; read() loads the entire document into a single string, and a regular expression then strips out everything that is not a Chinese character:
import re

text = open('blogText.txt', 'r', encoding='utf-8').read()
text = re.sub('[^\u4e00-\u9fa5]+', '', text)  # keep Chinese characters only
The read methods:
- .read() reads the entire file at once; it is usually used to load the file's contents into a single string variable.
- .readlines() reads the whole file and splits its contents into a list of lines.
- .readline() reads one line at a time and is usually much slower than .readlines(); it should only be used when there is not enough memory to read the whole file at once.
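A quick sketch of the three calls against the blogText.txt file created earlier:

```python
# read(): the whole file as one string
with open('blogText.txt', 'r', encoding='utf-8') as fd:
    whole = fd.read()

# readlines(): the whole file as a list of line strings
with open('blogText.txt', 'r', encoding='utf-8') as fd:
    lines = fd.readlines()

# readline(): one line per call
with open('blogText.txt', 'r', encoding='utf-8') as fd:
    first_line = fd.readline()
```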
2. Chinese word segmentation
Segment the text with jieba in full mode (this requires importing the jieba package):
import jieba

seg_list = jieba.cut(text, cut_all=True)  # full mode: list every possible word
print(list(seg_list))
The segmentation output looks like this:
'类型', '输入', '二维', '矩阵', '向量', '或', '常数', '它们', '的', '值', '将', '决定', '图像', '的', '绘制', '位置', '如果', '坐标', '为', '常数', '数则', '绘制', '平行', '于', '平面', '的', '二维', '图形', '其他', '两个', '坐标', '同理', '如果', '与', '均', '为', '向量', '向量', '长度', '必须', '与', '内', '的', '矩阵', '相同', '如下', '所示', '旋转', '转角', '角度', '旋转', '观看', '角度', '定义', '观看', '方向', '的', '角度', '横向', '旋转', '图形', '观看', '看表', '表示', '纵向', '旋转', '图形', '观看', '从', '左', '至', '右', '原图', '背景', '方框', '方框框', '框框', '的', '类型', '默认', '只', '绘制', '背景', '面板', '类型', '有', '多种', '如', '图形', '全', '方框', '后面', '面板', '后面', '面板', '网格', '格网', '网格', '后面', '面板', '黑色', '背景', '网格', '黑色', '背景', '用户', '手动', '调整', '无边', '边框', '修改', '颜色', '该', '变量', '设置', '用于', '为', '图像', '添加', '颜色', '如果', '设置', '为', '空值', '将', '不会', '会生', '生成', '图像', '用于', '变量', '如果', '为', '而', '我们', '输入', '了', '参数', '那么', '将',
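Full mode lists every word it can find, which is why the output contains overlapping fragments such as '方框', '方框框', '框框'. jieba also has a precise mode (cut_all=False, the default), which is often the better choice for frequency analysis; a brief sketch:

```python
# Precise mode: each character belongs to exactly one output word,
# so the tokens no longer overlap.
seg_list = jieba.cut(text, cut_all=False)
words = list(seg_list)
print(words[:50])
```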
3. Part of speech tagging
Tag each word with its part of speech based on the segmentation result:
import jieba.posseg as pseg

words = pseg.cut(text[:50])
for word, label in words:
    print("词语:" + word + "\t\t\t" + "词性:" + label)
词语:背景 词性:n
词语:方框 词性:n
词语:修改 词性:v
词语:颜色 词性:n
词语:设置 词性:vn
词语:图例 词性:n
词语:数值 词性:n
词语:范围 词性:n
4. Remove stop words
Some words appear in almost any article, so their presence tells us nothing useful for this analysis, and we remove them.
I added a few such stop words to a list:
StopWords = ['收藏', '中', '与', '我们', '专栏', '一个', '文章', '方法',
             '版权', '进行', '使用', '能够', '并', '对', '可以']
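The code above only builds the stop-word list; a hedged sketch of actually applying it (assuming `words` holds the list of segmented words from step 2) could look like this:

```python
# Remove every token that appears in the stop-word list,
# keeping the remaining words for the analysis below.
filteredWords = [w for w in words if w not in StopWords]
print(filteredWords[:30])
```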
5. Part-of-speech distribution analysis
Counting the tags and tabulating the results gives the following:
| Part of speech | Frequency | Proportion |
|---|---|---|
| noun | 4709 | 0.384785 |
| verb | 3441 | 0.281588 |
| adverb | 561 | 0.045908 |
| pronoun | 500 | 0.040917 |
| verbal noun | 451 | 0.036907 |
| adjective | 370 | 0.030278 |
| conjunction | 327 | 0.026759 |
| numeral | 322 | 0.026350 |
| preposition | 295 | 0.024141 |
| locative word | 225 | 0.018412 |
Once the data is ready, simply draw a bar chart to display it:
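The original post does not show the plotting code, so here is a minimal sketch: it tallies the jieba part-of-speech tags with collections.Counter and draws the bar chart with matplotlib (both the tallying step and the library choice are assumptions, and the table above may group related tags while this counts raw tags):

```python
from collections import Counter

import jieba.posseg as pseg
import matplotlib.pyplot as plt

# Tag the whole cleaned text and count how often each part of speech occurs
posCounts = Counter(label for _, label in pseg.cut(text))
total = sum(posCounts.values())

# Keep the ten most common tags, mirroring the table above
top = posCounts.most_common(10)
for tag, count in top:
    print(tag, count, round(count / total, 6))

# Bar chart of the part-of-speech distribution
labels = [tag for tag, _ in top]
freqs = [count for _, count in top]
plt.figure(figsize=(8, 4))
plt.bar(labels, freqs)
plt.xlabel('part of speech')
plt.ylabel('frequency')
plt.title('Part-of-speech distribution')
plt.show()
```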