Douban.com is a social networking site built around users' interests in books, movies, and music. Crawling Douban's movie review data is valuable as raw material for NLP (Natural Language Processing): the resulting data set can feed further processing steps such as Chinese word segmentation, named entity recognition, keyword extraction, syntactic analysis, text vectorization, sentiment analysis, and public opinion analysis. Let's take a look at how to fetch Douban movie reviews with Apocalypse IP~
from urllib import request
import time
import re
import os

# Create the output directory for the scraped review files
os.mkdir(r'C:\Users\*\Desktop\PYhomework\c800')

search_counts = 800  # total number of reviews to collect
url = 'https://movie.douban.com/subject/2353023/reviews'
headers = {***}
headers['Referer'] = 'https://movie.douban.com/subject/***/'
i = 0
lists = []
for count in range(0, search_counts, 20):
    page_url = url + '?start=' + str(count)  # paginate: 20 reviews per listing page
    req = request.Request(page_url, headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    # Each review block carries its ID in a data-cid attribute
    pattern = re.compile('<div data-cid="(.*?)">')
    lists = pattern.findall(html) + lists
'''Crawl the actual reviews'''
headers['Cookie'] = '***'
headers['Host'] = 'movie.douban.com'
headers['Sec-Fetch-Dest'] = 'document'
headers['Sec-Fetch-Mode'] = 'navigate'
headers['Sec-Fetch-Site'] = 'none'
headers['Sec-Fetch-User'] = '?1'
headers['Upgrade-Insecure-Requests'] = '1'
print('Review IDs collected!')
for review_id in lists:
    i += 1
    url = 'https://movie.douban.com/j/review/' + review_id + '/full'
    req = request.Request(url, headers=headers)
    response = request.urlopen(req)
    comment = response.read().decode('utf-8')
    # Write each review to its own numbered file
    with open(r'C:\Users\*\Desktop\PYhomework\c800\comment%d.txt' % i, mode='w', encoding='utf-8') as c:
        c.write(comment)
    print('comment%d saved successfully!' % i)
    time.sleep(0)  # adjust as needed; a small positive delay is kinder to the server
print("Fetching completed!")
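The /full endpoint usually returns a JSON payload rather than plain text, so the saved files still contain markup. As a minimal sketch of the cleanup step before any of the NLP processing mentioned above, assuming the response is JSON with the review body in an "html" field (the field name is an assumption; check an actual saved response), you might strip it down to plain text with only the standard library:

```python
import json
import re

def extract_review_text(raw: str) -> str:
    """Pull the plain review text out of a saved response string.

    Assumes the response is JSON carrying the review markup in an
    'html' field -- verify against a real saved file before relying on it.
    """
    body = json.loads(raw).get('html', '')
    text = re.sub(r'<[^>]+>', '', body)       # drop HTML tags
    return re.sub(r'\s+', ' ', text).strip()  # collapse whitespace

# Hypothetical sample of what a saved comment file might contain
sample = '{"html": "<p>A quiet, moving film.</p>"}'
print(extract_review_text(sample))  # A quiet, moving film.
```

A crude tag-stripping regex like this is fine for a first pass; for production cleanup an HTML parser such as html.parser from the standard library would be more robust.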