1. Choose a topic or website that interests you. (No two students may choose the same one.)
https://www.bilibili.com/video/av22224421
2. Write a crawler program in Python to collect data on the chosen topic from the Internet.
3. Perform text analysis on the crawled data to generate a word cloud.
import requests
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
from bs4 import BeautifulSoup


def jieba_cut(sentence):
    """Segment a Chinese sentence into a list of words."""
    return list(jieba.cut(sentence))


if __name__ == '__main__':
    # Download the bullet-screen XML file for the video
    url = 'http://comment.bilibili.com/36773399.xml'
    page = requests.get(url)
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, 'html.parser')

    # Each bullet-screen comment is the text of a <d> element
    comments = '\n'.join(d.text for d in soup.find_all('d'))
    with open('bilibili.txt', 'w', encoding='utf-8') as f:
        f.write(comments)

    # Segment the saved text and count how often each word occurs
    word_counts = {}
    with open('bilibili.txt', 'r', encoding='utf-8') as f:
        words = jieba_cut(f.read())
    for word in set(words):
        word_counts[word] = words.count(word)

    # The image serves as both the shape mask and the color source
    mask = plt.imread(r'H:\129\wallhaven-627476.jpg')
    text = ' '.join(words)
    wc = WordCloud(
        width=1000,
        height=800,
        margin=2,
        background_color='white',  # Set the background color
        font_path=r'C:\Windows\Fonts\STZHONGS.TTF',  # A Chinese font is required; otherwise Chinese characters are drawn as empty boxes
        max_words=1000,  # Maximum number of words to display
        max_font_size=400,  # Maximum font size
        random_state=50,  # Seed for the random layout and colors, so the result is reproducible
        mask=mask,
    )
    wc.generate(text)
    image_colors = ImageColorGenerator(mask)
    wc.recolor(color_func=image_colors)  # Color the words using the mask image
    wc.to_file('cloudword.jpg')
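The extraction step in the script above can be tried offline before any page is requested. The following is only a minimal sketch: `SAMPLE_XML` is a made-up stand-in for the downloaded file, it assumes each comment is stored as the text of a `<d>` element (which is what the script relies on), and `extract_danmaku` is a hypothetical helper name. It uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Made-up sample that imitates the assumed layout of the bullet-screen file:
# one <d> element per comment. The attribute values are placeholders.
SAMPLE_XML = """<i>
  <d p="1.0,1,25,16777215">first comment</d>
  <d p="2.5,1,25,16777215">second comment</d>
  <d p="3.0,1,25,16777215"></d>
</i>"""


def extract_danmaku(xml_text):
    """Return the non-empty text of every <d> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [d.text.strip() for d in root.iter("d") if d.text and d.text.strip()]


comments = extract_danmaku(SAMPLE_XML)
print(comments)  # ['first comment', 'second comment']
```

Returning a list of comments, instead of one concatenated string, keeps the boundaries between comments so that words from neighbouring comments are not merged during segmentation.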
4. Explain the text analysis results.
5. Write a complete blog post describing the implementation process above, the problems encountered and their solutions, the data analysis approach, and the conclusions.
Open the video page, view the page source, and find the cid. Use the cid to open the bullet-screen XML file, then crawl the bullet-screen comments and save them to a text file. One small problem remains in the word-frequency statistics: I could not get the counting with a dictionary to work, and it is currently unresolved.
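One possible way around the word-frequency problem, offered only as a sketch, is `collections.Counter` from the standard library, which is a dictionary subclass built for counting. The helper name `word_frequencies`, the `min_length` filter, and the sample tokens are assumptions made for illustration; in the real script the tokens would come from the jieba segmentation.

```python
from collections import Counter


def word_frequencies(words, min_length=2):
    """Count words, skipping whitespace and tokens shorter than min_length.

    Hypothetical helper: the min_length filter is an assumption meant to
    drop single-character particles and punctuation from the counts.
    """
    cleaned = (w.strip() for w in words)
    return Counter(w for w in cleaned if len(w) >= min_length)


# Stand-in for the token list produced by the segmentation step
tokens = ["弹幕", "视频", "弹幕", " ", "的", "视频", "弹幕"]
freq = word_frequencies(tokens)
print(freq.most_common(2))  # [('弹幕', 3), ('视频', 2)]
```

Because a `Counter` behaves like an ordinary dictionary, the result could also be passed to `WordCloud.generate_from_frequencies` so that the cloud is drawn from these counts directly.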
6. Finally, submit all the crawled data, together with the source code for the crawler and the data analysis.