A simple crawl: scraping Douban short comments for the movie "Joker" and generating a word cloud

Introduction

 Some time ago I watched Joaquin Phoenix's "Joker" and was curious what most of the audience had to say after seeing the film. After reading through the Douban short comments, I thought I would use Python to extract the words that appear most often in them and build a word cloud, to see which keywords the movie left with its audience.

Crawling the data

 At first I tried to crawl the data by simulating requests with the requests library, but found that after paging past 20 short comments, Douban requires a logged-in user to view more. So I decided to use selenium to automate the browser, crawl the comment pages, and store the comments in a plain txt file. Since I did not plan to do any further analysis, I did not bother storing them in a database.

Installing selenium and chromedriver

 selenium is a popular automated-testing framework. I won't go into how it works or how to install selenium together with the matching chromedriver version; interested readers can refer to:
https://blog.csdn.net/weixin_43241295/article/details/83784692

Analyzing the Douban login page and login flow

(screenshots: the Douban login page)

 Judging from the page, the flow is roughly: click "password login" in the navigation bar, enter the username and password, then click the login button. Douban's anti-crawler measures can sometimes show a picture captcha at this point, but I never hit one and was able to log in directly. So the next step is to have selenium simulate this entire login flow.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

def crawldouban():
    name = "your_username"
    passw = "your_password"

    # Start Chrome in headless mode
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    browser = webdriver.Chrome(executable_path="/usr/bin/chromedriver", options=options)

    # Open the login page
    browser.get("https://accounts.douban.com/passport/login")
    time.sleep(3)

    # Automate the login steps
    # (find_element_by_* is the selenium 3 API; selenium 4 uses
    #  browser.find_element(By.CLASS_NAME, ...))
    browser.find_element_by_class_name("account-tab-account").click()
    form = browser.find_element_by_class_name("account-tabcon-start")
    username = form.find_element_by_id("username")
    password = form.find_element_by_id("password")
    username.send_keys(name)
    password.send_keys(passw)
    browser.find_element_by_class_name("account-form-field-submit").click()
    time.sleep(3)

Fetching the user comments

 The next step is to fetch the comments on the page and append them to a text file. (I won't simulate searching for the movie and navigating to its comments page; instead I start directly from the comments page URL.) From there the script repeatedly clicks "Next", extracts the comments, and writes them out.


    browser.get("https://movie.douban.com/subject/27119724/comments?status=P")
    comments = browser.find_elements_by_class_name("short")
    WriteComment(comments)

    # Keep clicking "Next" and collecting comments until the link has no href
    while True:
        link = browser.find_element_by_class_name("next")
        path = link.get_attribute('href')
        if not path:
            break
        link.click()
        time.sleep(3)
        comments = browser.find_elements_by_class_name("short")
        WriteComment(comments)
    browser.quit()

# Append the comments to the specified text file
def WriteComment(comments):
    with open("comments.txt", "a+") as f:
        for comment in comments:
            f.write(comment.text + " \n")

 Code analysis: there is nothing special about the fetching code itself. It finds the elements whose class name is 'short', reads their text content, and writes it to the specified file. The main loop decides whether there is a next page by reading the href attribute of the "Next" link; when no href can be obtained, we have reached the last page.
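The termination logic can be isolated from selenium. Below is a minimal sketch of the same "follow Next until it has no link" loop, with the page source abstracted away; the `crawl_pages` helper and the in-memory `pages` structure are hypothetical, purely for illustration:

```python
def crawl_pages(first_page, get_next, extract):
    """Collect items from a chain of pages, stopping when no next page exists.

    Mirrors the selenium loop above: extract comments from the current page,
    then follow the "next" link until it yields nothing.
    """
    results = []
    page = first_page
    while page is not None:
        results.extend(extract(page))
        page = get_next(page)
    return results

# Tiny in-memory stand-in for the paged comments site.
pages = {
    1: {"comments": ["great"], "next": 2},
    2: {"comments": ["dark", "intense"], "next": None},
}

all_comments = crawl_pages(
    1,
    get_next=lambda n: pages[n]["next"],
    extract=lambda n: pages[n]["comments"],
)
print(all_comments)  # ['great', 'dark', 'intense']
```

Separating the traversal from the extraction like this also makes the loop easy to test without a browser.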

Segmenting the text and generating the word cloud

 A word about the approach: my data handling here is rough, without pandas or numpy. I take the crawled data, simply strip the line breaks and join everything into one string, segment the result with jieba, and read a local stopword file to get a stopword list. Then I generate the word cloud with the stopwords, a font file, a background mask image, and so on, and finally save the word-cloud image locally.
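The preprocessing chain (strip line breaks, segment, drop stopwords) can be sketched in plain Python. In this hypothetical sketch, whitespace splitting stands in for `jieba.lcut`, which does real Chinese word segmentation:

```python
def preprocess(lines, stopwords):
    # Join the crawled lines into one string. (The real script joins with ""
    # since jieba needs no separators; here we join with spaces because we
    # tokenise on whitespace.)
    text = " ".join(lines)
    # Stand-in for jieba.lcut: whitespace tokenisation instead of
    # Chinese word segmentation.
    tokens = text.split()
    # Drop stopwords, as WordCloud does with its stopwords parameter.
    return [t for t in tokens if t not in stopwords]

words = preprocess(["a dark movie", "a strong performance"], stopwords={"a"})
print(words)  # ['dark', 'movie', 'strong', 'performance']
```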

from wordcloud import WordCloud
from PIL import Image  # scipy.misc.imread was removed in SciPy 1.2; use PIL + numpy instead
import numpy as np
import jieba

# Read a text file and split its contents into lines
def text_read(file_path):
    filename = open(file_path, 'r', encoding='utf-8')
    texts = filename.read()
    texts_split = texts.split("\n")
    filename.close()

    return texts_split

def data_handle(picture_name):
    # Read the data crawled from the site
    comments = text_read("comments.txt")
    comments = "".join(comments)

    # Segment the text with jieba and read the stopword list
    lcut = jieba.lcut(comments)
    cut_text = "/".join(lcut)
    stopwords = text_read("chineseStopWords.txt")

    # Generate the word-cloud image
    bmask = np.array(Image.open("backgrounds.jpg"))
    wordcloud = WordCloud(font_path='/usr/share/fonts/chinese/simhei.ttf', mask=bmask, background_color='white', max_font_size=250, width=1300, height=800, stopwords=stopwords)
    wordcloud.generate(cut_text)
    wordcloud.to_file(picture_name)

if __name__ == "__main__":
    data_handle("joker6.jpg")

Here is a photo I took myself, used as the background:

(image: the background photo)

The final result:

(image: the generated word cloud)

Summary

 Writing the crawler, doing the analysis, and sorting out the tools and ideas took about two evenings. Overall it is all fairly easy to understand; I'm still learning about large-scale concurrent crawling and data analysis, and recording my learning process like this is quite interesting.


Origin: blog.51cto.com/mbb97/2444879