How to crawl Douban movie reviews?

Douban.com is a social networking site built around users' shared interest in books, movies, and music. Crawling the movie review data on Douban is very useful: review text is raw material for NLP (Natural Language Processing), and a review dataset can feed further processing such as Chinese word segmentation, named entity recognition, keyword extraction, syntactic parsing, text vectorization, sentiment analysis, and public opinion analysis. Let's take a look at how to get Douban movie reviews with Apocalypse IP~
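As a quick taste of the downstream processing mentioned above, here is a minimal sketch of Chinese word segmentation and keyword extraction using the third-party jieba library (assumed to be installed via pip install jieba); the sample sentence is made up purely for illustration:

import jieba
import jieba.analyse

text = '这部电影节奏很慢，但是画面和配乐都非常出色。'  # a made-up example review sentence
print(jieba.lcut(text))                                  # Chinese word segmentation
print(jieba.analyse.extract_tags(text, topK=5))          # keyword extraction (TF-IDF)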

from urllib import request
import time
import re
import os

# Create the output directory for the downloaded reviews.
os.mkdir(r'C:\Users\*\Desktop\PYhomework\c800')

search_counts = 800  # total number of review list entries to page through
base_url = 'https://movie.douban.com/subject/2353023/reviews'

headers = {***}  # fill in your own request headers here (e.g. User-Agent)
headers['Referer'] = 'https://movie.douban.com/subject/***/'

i = 0
lists = []

# Page through the review list (20 reviews per page) and collect the review ids
# from the data-cid attribute of each review block.
for count in range(0, search_counts, 20):
    url = base_url + '?start=' + str(count)
    req = request.Request(url, headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    pattern = re.compile(r'<div data-cid="(.*?)">')
    lists = pattern.findall(html) + lists

# Crawl the actual reviews: these extra headers (copied from a real browser
# session) make the requests look like normal page visits.
headers['Cookie'] = '***'
headers['Host'] = 'movie.douban.com'
headers['Sec-Fetch-Dest'] = 'document'
headers['Sec-Fetch-Mode'] = 'navigate'
headers['Sec-Fetch-Site'] = 'none'
headers['Sec-Fetch-User'] = '?1'
headers['Upgrade-Insecure-Requests'] = '1'

print('Crawl succeeded!')

# Fetch the full text of each review via the /full endpoint and save it to its own file.
for id in lists:
    i += 1
    url = 'https://movie.douban.com/j/review/' + id + '/full'
    req = request.Request(url, headers=headers)
    response = request.urlopen(req)
    comment = response.read().decode('utf-8')
    with open(r'C:\Users\*\Desktop\PYhomework\c800\comment%d.txt' % i, mode='w', encoding='utf-8') as c:
        c.write(comment)
        print('comment%d saved successfully!' % i)
    time.sleep(0)  # adjust the delay to taste; a small pause is kinder to the server

print("Fetching completed!")
