Crawling Fang Wenshan's lyrics collected by netizens on 'Sentence Fan' and computing word-frequency statistics

Requirements:

  1. Pick a topic that interests you.
  2. Write a crawler program in Python to crawl data on the chosen topic from the web.
  3. Perform text analysis on the crawled data to generate a word cloud.
  4. Explain the text analysis results.
  5. Write a complete blog post describing the implementation process, the problems encountered and their solutions, and the data-analysis ideas and conclusions.
  6. Finally, submit all of the crawled data together with the crawler and data-analysis source code.

  For this assignment, I crawled Fang Wenshan's lyrics from the website "Sentence Fan" (juzimi.com) to study the word frequencies in his lyrics and to see which of Mr. Fang's lines netizens like best.
  The main problem encountered while crawling is that the website checks the request headers, so a headers parameter declaring a User-Agent (UA) has to be added.
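  The check is easy to demonstrate by comparing two requests (a minimal sketch; exactly how the site rejects the library's default python-requests UA, e.g. with a 403 status, is an assumption):

import requests

url = 'http://www.juzimi.com/writer/%E6%96%B9%E6%96%87%E5%B1%B1'

# The site is assumed to reject the default python-requests User-Agent,
# while a browser-like UA passes the header check
print(requests.get(url).status_code)
print(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).status_code)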
  The full crawler and analysis code is shown below:

import jieba
import requests
from bs4 import BeautifulSoup


lyrics = ''
# The site rejects requests without a browser-like User-Agent, so declare one
headers = {
    'User-Agent': 'Mozilla/5.0'
}

# Fetch the first page to discover how many pages of sentences there are
resp = requests.get('http://www.juzimi.com/writer/%E6%96%B9%E6%96%87%E5%B1%B1', headers=headers)
resp.encoding = 'UTF-8'
print(resp.status_code)
soup = BeautifulSoup(resp.text, 'html.parser')

page_url = 'http://www.juzimi.com/writer/%E6%96%B9%E6%96%87%E5%B1%B1?page={}'
# The '.pager-last' element holds the number of the last page; fall back
# to a single page if the pager is missing
page_last = soup.select('.pager-last')
page_last = int(page_last[0].text) if page_last else 1

for i in range(0, page_last):
    print(i)
    resp = requests.get(page_url.format(i), headers=headers)
    resp.encoding = 'UTF-8'
    soup = BeautifulSoup(resp.text, 'html.parser')
    # Each '.xlistju' block holds one sentence collected by netizens
    for a in soup.select('.xlistju'):
        lyrics += a.text + ' '

# Save the crawled sentences to a file
with open('lyrics.txt', 'a+', encoding='UTF-8') as lyricFile:
    lyricFile.write(lyrics)

# Load the punctuation list and strip punctuation from the lyrics
with open('punctuation.txt', 'r', encoding='UTF-8') as punctuationFile:
    for punctuation in punctuationFile.readlines():
        # each line holds a single punctuation mark; [0] drops the newline
        lyrics = lyrics.replace(punctuation[0], ' ')

# Load the list of meaningless (stop) words
with open('meaningless.txt', 'r', encoding='UTF-8') as meaninglessFile:
    mLessSet = set(meaninglessFile.read().split('\n'))
mLessSet.add(' ')  # the spaces inserted between sentences are meaningless too

# Load reserved words and add them to jieba's dictionary so the segmenter
# keeps them intact instead of splitting them apart
with open('reservedWord.txt', 'r', encoding='UTF-8') as reservedWordFile:
    reservedWordSet = set(reservedWordFile.read().split('\n'))
    for reservedWord in reservedWordSet:
        jieba.add_word(reservedWord)

keywordList = list(jieba.cut(lyrics))
keywordSet = set(keywordList) - mLessSet  # drop the meaningless words from the word set
keywordDict = {}

# Build the word-frequency dictionary
for word in keywordSet:
    keywordDict[word] = keywordList.count(word)

# Sort the words by frequency, descending
keywordListSorted = list(keywordDict.items())
keywordListSorted.sort(key=lambda e: e[1], reverse=True)
# Write every word to word.txt, repeated once per occurrence, so that
# wordsift.org sizes each word by its frequency
with open('word.txt', 'a+', encoding='UTF-8') as wordFile:
    for topWordTup in keywordListSorted:
        print(topWordTup)
        for i in range(0, topWordTup[1]):
            wordFile.write(topWordTup[0] + '\n')
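
  For reference, the same frequency table can be built in a single pass with collections.Counter, avoiding one full list.count scan per distinct word (a sketch reusing the lyrics and mLessSet variables from the script above, not the code actually submitted):

import jieba
from collections import Counter

# Count every token in one pass, skipping the meaningless words
keywordCounter = Counter(w for w in jieba.cut(lyrics) if w not in mLessSet)
for word, freq in keywordCounter.most_common():
    print(word, freq)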

  The code above produces word.txt; paste its contents into https://wordsift.org/ to generate the word cloud, shown in the image below.
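
  Alternatively, the word cloud can be rendered locally with the third-party wordcloud package instead of pasting into wordsift.org (a minimal sketch; the font path is an assumption and must point to a font that contains Chinese glyphs, or the characters render as empty boxes):

from wordcloud import WordCloud

# font_path is an assumed local font file with Chinese glyph coverage
wc = WordCloud(font_path='simhei.ttf', width=800, height=600,
               background_color='white')
wc.generate_from_frequencies(keywordDict)  # keywordDict from the script above
wc.to_file('wordcloud.png')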
