Using Python to crawl the short reviews of the Douban movie "Hot"

1. Crawl target: Douban movie short reviews

Today I will share a crawler case. The goal is to crawl the short reviews (note: short reviews, not full movie reviews!) of any movie on Douban. Take the movie "Hot" as an example:

▲ Crawling target

Crawl the following seven key fields:

Page number, reviewer nickname, review star rating, review time, reviewer IP location, helpful ("useful") vote count, and review content.

2. Crawling results

Screenshot of the crawl results:

▲ Some result data

3. Explanation of crawler code

First, import the libraries you need:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import random
from time import sleep

Define a request header:

# Request headers
h1 = {
    'Cookie': 'replace with your own cookie',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'movie.douban.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15',
    'Accept-Language': 'zh-CN,zh-Hans;q=0.9',
    'Referer': 'https://movie.douban.com/subject/35267224/?from=showing',
    'Connection': 'keep-alive'
}

Define the request URL. The pagination rule: page 1 has start=0, page 2 has start=20, page 3 has start=40, which gives start = (page - 1) * 20:

# Request URL
url = 'https://movie.douban.com/subject/{}/comments?start={}&limit=20&status=P&sort=new_score'.format(v_movie_id, (page - 1) * 20)
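For context, here is a minimal sketch of the page loop that would surround this line (v_movie_id and max_page are placeholder names, not necessarily those in the full source; the full code handles page turning automatically):

v_movie_id = '35267224'  # the subject id from the movie's Douban URL
max_page = 10            # number of 20-comment pages to fetch
for page in range(1, max_page + 1):
    url = 'https://movie.douban.com/subject/{}/comments?start={}&limit=20&status=P&sort=new_score'.format(
        v_movie_id, (page - 1) * 20)
    # ... send the request and parse, as shown next ...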

Send requests using requests:

# Send the request
response = requests.get(url, headers=h1, verify=False)
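Two practical notes on this request. First, verify=False makes urllib3 emit an InsecureRequestWarning on every call, which can be silenced. Second, the random and sleep imports at the top suggest a randomized pause between pages; a typical pattern (the 1-3 second range is my assumption) looks like this:

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)  # silence the verify=False warning

sleep(random.uniform(1, 3))  # pause 1-3 seconds between page requests to be polite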

Use BeautifulSoup to parse page data:

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')
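The field-extraction loop further down iterates over a reviews collection that the snippets themselves do not define. Assuming Douban's markup, where each short review sits in a div with class comment-item, it could be built like this (the selector is an assumption, not taken from the original source):

# Assumed selector: each short review lives in a <div class="comment-item">
reviews = soup.find_all('div', class_='comment-item')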

Define some empty lists to store data:

user_name_list = []  # reviewer nicknames
star_list = []       # review star ratings
time_list = []       # review times
ip_list = []         # reviewer IP locations
vote_list = []       # helpful vote counts
content_list = []    # review contents

Take the "Comment Content" field as an example:

for review in reviews:
 # 评论内容
 content = review.find('span', {
    
    'class': 'short'}).text
 content = content.replace(',', ',').replace(' ', '').replace('\n', '').replace('\t', '').replace('\r', '')
 content_list.append(content)
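The other fields are collected the same way inside the loop. As a hedged sketch (the comment-info and votes class names are assumptions about Douban's current markup, not confirmed by the article):

for review in reviews:
    # Reviewer nickname: assumed to sit in the first <a> inside <span class="comment-info">
    info_tag = review.find('span', {'class': 'comment-info'})
    user_name_list.append(info_tag.find('a').text.strip() if info_tag else '')
    # Helpful vote count: assumed to sit in <span class="votes">
    vote_tag = review.find('span', {'class': 'votes'})
    vote_list.append(vote_tag.text.strip() if vote_tag else '0')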

Put the lists for all fields into a pandas DataFrame:

df = pd.DataFrame(
    {
        '页码': page,                  # page number
        '评论者昵称': user_name_list,  # reviewer nickname
        '评论星级': star_list,         # review star rating
        '评论时间': time_list,         # review time
        '评论者IP属地': ip_list,       # reviewer IP location
        '有用数': vote_list,           # helpful vote count
        '评论内容': content_list,      # review content
    }
)

Finally, save it to a CSV file:

# Save to CSV
df.to_csv(result_file, mode='a+', header=header, index=False, encoding='utf_8_sig')
print('File saved successfully:', result_file)
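Because the file is opened in append mode ('a+'), the header flag presumably ensures the column row is written only once. A minimal sketch of that guard, assuming result_file holds the CSV path (this is also where the os import comes in):

# Write the header row only on the first write, i.e. when the file does not exist yet (assumed logic)
header = not os.path.exists(result_file)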

That covers the core logic.

The full code also contains a star-rating conversion function, automatic page turning, text cleaning, and other features. For details, see the complete source code at the end of the article.
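For reference, here is one plausible way the star conversion could work. This is a minimal sketch assuming Douban marks ratings with a span whose class looks like allstar50 (e.g. class="allstar50 rating"); it is not the author's exact function:

import re

def convert_star(review):
    # Map an assumed 'allstarNN' rating class to a star count string like '5星'
    rating_tag = review.find('span', {'class': 'rating'})
    if rating_tag:
        match = re.search(r'allstar(\d+)', ' '.join(rating_tag.get('class', [])))
        if match:
            return '{}星'.format(int(match.group(1)) // 10)
    return ''  # reviews without a rating have no such span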

4. Obtaining the complete source code

For those who want to dig deeper, the complete Python source code and result data for this analysis can be obtained as follows.

Technology is best shared and discussed rather than developed behind closed doors: one person can go fast, but a group can go further.

This article was shared and recommended by a reader. Materials, data, and technical discussion are all available in the exchange group, which has more than 2,000 members. When requesting to join, the best note format is: source + area of interest, which makes it easier to find like-minded friends.

Method ①: add the WeChat account pythoner666, with the note: from CSDN + join the group
Method ②: search WeChat for the official account "Python learning and data mining" and reply "Douban Hot" in the background to get the code for this article


Source: blog.csdn.net/m0_59596937/article/details/132865594