Crawling Maoyan Movie Reviews

Foreword

A while ago the movie "Youth of You" was a big hit. Have you seen it? I have, but I won't judge how good it is; I'll leave that to the audience. So let's get down to it: today's task is crawling movie reviews from Maoyan.
I'll use "Youth of You" as the example throughout.

Approach

The review API endpoint is as follows (you can find it online, or capture it yourself with browser dev tools or a proxy):
http://m.maoyan.com/mmdb/comments/movie/1218029.json?_v_=yes&offset=0
A quick look at the parameters:
1218029: the Maoyan movie ID (this one is the ID for "Youth of You")
offset: the pagination offset, which appears to step by 15
So we can crawl page by page by increasing offset by 15 each time: build the URL with the new offset and send a request.
As for storing the data, it could go into a database or into a file; since the volume here is small, I save it straight to a file.
The request returns JSON, which we can parse with Python's json module and then save to a CSV file with pandas.
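The offset scheme described above can be sketched as a small helper. This is a minimal illustration, not the article's code: only the movie ID (1218029) and the 15-per-page step come from the article, and the function name is made up.

```python
# A minimal sketch of the pagination described above. Only the movie ID
# (1218029) and the 15-per-page offset step come from the article; the
# function name and structure are illustrative.
def build_comment_urls(film_id, pages, page_size=15):
    """Yield one comment-API URL per page, with offset stepping by page_size."""
    base = 'http://m.maoyan.com/mmdb/comments/movie/%s.json?_v_=yes&offset=%d'
    for page in range(pages):
        yield base % (film_id, page * page_size)

urls = list(build_comment_urls('1218029', pages=3))
# urls[0] ends with offset=0, urls[1] with offset=15, urls[2] with offset=30
```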

The complete code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time:    2020/1/22 16:47
# @Author:  Martin
# @File:    maoyan.py
# @Software:PyCharm
import os
import json
import requests
import pandas as pd
# Maoyan movie ID
film_id = '1218029'
# URL template for the comment API
raw_url = 'http://m.maoyan.com/mmdb/comments/movie/' + film_id + '.json?_v_=yes&offset=%d'
# Request headers that disguise the crawler as a normal browser
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Host': 'm.maoyan.com',
    'Proxy-Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
}
# List that accumulates all collected comments
result = []
for i in range(0, 1000, 15):
    url = raw_url % i
    r = requests.get(url, headers=headers)
    data = json.loads(r.text)
    result = result + data['cmts']
    print("offset: ", i)
# Save the data to a CSV file (create the output directory if needed)
os.makedirs('./result', exist_ok=True)
df = pd.DataFrame(result)
df.to_csv('./result/maoyan.csv', index=False, na_rep='NULL', encoding='utf_8_sig')

Summary

Drawback: with this URL-based approach there is an upper limit on how much data you can fetch; it seems to stop at a bit over 1,000 comments. As for a concrete workaround, I'll share one once I have a solid idea.
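One small improvement worth sketching, given that cap: stop the loop cleanly as soon as a page comes back without a 'cmts' list, instead of blindly requesting every offset. This is my own sketch, not the article's method; fetch_page here is a stand-in for the real requests.get + json.loads call.

```python
# A sketch (not the article's code) of stopping cleanly once the API stops
# returning comments: break as soon as a page has no 'cmts' list, instead of
# blindly requesting every offset up to the cap.
def collect_comments(fetch_page, max_offset=1000, step=15):
    """Accumulate comment dicts until a page comes back empty."""
    result = []
    for offset in range(0, max_offset, step):
        data = fetch_page(offset)
        cmts = data.get('cmts') or []
        if not cmts:  # empty page: the API has run dry, stop early
            break
        result.extend(cmts)
    return result

# Stand-in for the real requests.get + json.loads call: pretends the API
# runs dry after offset 30, to demonstrate the early stop.
def fake_fetch(offset):
    if offset < 45:
        return {'cmts': [{'id': offset + i} for i in range(15)]}
    return {}

comments = collect_comments(fake_fetch)  # 3 pages of 15 = 45 comments
```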
Pitfall: the main problem I ran into was encoding. At first the saved data always came out garbled; pay particular attention to setting encoding='utf_8_sig' when saving.
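For reference, the reason utf_8_sig fixes the garbled output is that it writes a UTF-8 byte-order mark (EF BB BF) at the start of the file, which tools like Excel use to detect the encoding. A quick illustrative check (the file name is just for demonstration):

```python
import os
import tempfile

import pandas as pd

# utf_8_sig prepends a UTF-8 byte-order mark (EF BB BF) to the file, which
# Excel uses to detect the encoding, so Chinese text is not garbled.
df = pd.DataFrame({'comment': ['好看', '不错']})
path = os.path.join(tempfile.gettempdir(), 'maoyan_demo.csv')
df.to_csv(path, index=False, encoding='utf_8_sig')

with open(path, 'rb') as f:
    head = f.read(3)  # the first three bytes are the BOM: b'\xef\xbb\xbf'
```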



Origin blog.csdn.net/Deep___Learning/article/details/104079588