Crawler assignment - crawling the bullet-screen comments (danmaku) of Bilibili

1. Choose a topic or website that interests you. (No two students may choose the same one.)

https://www.bilibili.com/video/av22224421

 

2. Write a crawler program in Python to crawl data on the chosen topic from the Internet.

3. Perform text analysis on the crawled data to generate a word cloud.

import requests
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
from bs4 import BeautifulSoup

def jieba_cut(sentence):
    # Segment a Chinese string into a list of words with jieba.
    seg = jieba.cut(sentence)
    segList = []
    for i in seg:
        segList.append(i)
    return segList



if __name__ == '__main__':
    # Download the bullet-screen XML file for the video's cid.
    url = 'http://comment.bilibili.com/36773399.xml'
    page = requests.get(url)
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, "html.parser")
    # Each bullet-screen comment is stored in a <d> element.
    content = soup.find_all('d')
    comments = ''
    for i in content:
        comments = comments + i.text
    with open('bilibili.txt', 'w', encoding='utf-8') as f:
        f.write(comments)

    # Segment the saved text and count how often each word occurs.
    word_counts = {}
    with open('bilibili.txt', 'r', encoding='utf-8') as f:
        words = jieba_cut(f.read())
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1

    # The mask image sets the shape of the word cloud and, below, its colors.
    mask = plt.imread(r'H:\129\wallhaven-627476.jpg')

    text = ' '.join(words)
    wc = WordCloud(
        width=1000,
        height=800,
        margin=2,
        background_color='white',  # Background color
        font_path=r'C:\Windows\Fonts\STZHONGS.TTF',  # A Chinese font is required; otherwise Chinese characters are rendered as boxes
        max_words=1000,  # Maximum number of words to display
        max_font_size=400,  # Maximum font size
        random_state=50,  # Seed for the random generator, so the result is reproducible
        mask=mask,
    )
    wc.generate(text)
    image_colors = ImageColorGenerator(mask)

    # Recolor the words using the colors of the mask image, then save.
    wc.recolor(color_func=image_colors)
    wc.to_file('cloudword.jpg')
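
For the word-frequency step, one possible alternative is the standard library's collections.Counter. The following is only a minimal sketch, not part of the original program: the helper name count_words, its min_length parameter, and the hand-written token list (standing in for the output of jieba_cut) are all assumptions made for illustration.

```python
from collections import Counter


def count_words(tokens, min_length=2):
    """Count how often each token occurs, skipping very short tokens.

    Hypothetical helper: min_length=2 is an assumed cutoff that drops
    single characters, whitespace, and most punctuation.
    """
    cleaned = (t.strip() for t in tokens)
    return Counter(t for t in cleaned if len(t) >= min_length)


# Hand-written tokens standing in for the list returned by jieba_cut().
sample_tokens = ["哈哈", "前方", "高能", "哈哈", "的", " ", "高能", "哈哈"]
counts = count_words(sample_tokens)

# The two most frequent words and their counts.
top_words = counts.most_common(2)
print(top_words)
```

A Counter is a dict subclass, so it could also be passed to WordCloud.generate_from_frequencies in place of the space-joined text, if drawing the cloud directly from the counts is preferred.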

 

4. Explain the text analysis results.

 

5. Write a complete blog post describing the implementation process above, the problems encountered and their solutions, the data analysis ideas, and the conclusions.

 Open the video page, view the page source, and find the video's cid. Use the cid to open the bullet-screen XML file (here http://comment.bilibili.com/36773399.xml), then crawl the bullet-screen comments and save them to a text file. I ran into a small problem with the word-frequency statistics: I could not get the counting with a dictionary to work, and I have not resolved it yet.
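
The steps from cid to comment text can be sketched with the standard library alone. This is a rough illustration under assumptions: the helper names comment_url and extract_comments are hypothetical, the URL pattern is simply generalized from the one address used in this post, and SAMPLE_XML is a hand-written stand-in for a downloaded file (the values in its p attribute are made up and ignored here).

```python
import xml.etree.ElementTree as ET


def comment_url(cid):
    """Build the bullet-screen XML address for a cid (pattern assumed from this post)."""
    return f"http://comment.bilibili.com/{cid}.xml"


def extract_comments(xml_text):
    """Return the text of every non-empty <d> element in the XML."""
    root = ET.fromstring(xml_text)
    return [d.text.strip() for d in root.iter("d") if d.text and d.text.strip()]


# Hand-written sample standing in for the downloaded XML file.
SAMPLE_XML = """<i>
  <d p="12.5,1,25,16777215">前方高能</d>
  <d p="30.0,1,25,16777215">哈哈哈</d>
  <d p="31.0,1,25,16777215"></d>
</i>"""

url = comment_url(36773399)
comments = extract_comments(SAMPLE_XML)
print(url)
print(comments)
```

Returning a list keeps the comments separate, so they can be written to the text file one per line rather than concatenated into a single string.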

6. Finally, submit all the crawled data along with the crawler and data analysis source code.

 

 
