记忆碎片之python爬虫APP数据爬取fiddler抓包及多线程爬取流程分析（四）

无敌免责声明：本案例用到的app仅仅做为学习使用，切勿使用爬虫程序恶意攻击该服务器。
有了前面三节内容的铺垫，相信对抓包和模拟器配置都有了一些了解，这里实现一个完整的案例，仅做为入门学习记录。

第一步：启动fiddler，并启用抓包

fiddler抓包APP配置
在这里插入图片描述

第二步：启动安卓模拟器，设置代理，并启动`APP`应用，明确抓取内容

在这里插入图片描述

在这里插入图片描述

接着向下翻页，滚动鼠标滚轮，之后回到顶部，点击第一个菜谱进入菜谱的详情页，回到fiddler里面，设置停止抓包，在fiddler左下角单机一下，变成空白，就是取消继续抓包

在这里插入图片描述

第三步：分析数据包

经过分析，发现数据都是已api.douguo.net为host的方式返回的，不同的数据，用了不同的URL地址
在这里插入图片描述
上图fiddler抓包分析之后，红色的请求就是爬虫程序所需要的，可以复制响应数据流到json.cn里面粘贴查看。
分析完这些之后，再分析每一个请求的细节内容

经过三个链接的请求头对比之后，发现三个请求的请求头都是一样的，不同的是每一个请求携带的data数据。

第四步：在编辑器中编写代码

# 记录
# 豆果美食app案例
"""
在夜神模拟器中安装豆果美食apk，安装包可以在本地用浏览器搜索下载
打开fiddler工具，并修改Options的中的内容，
    选项卡HTTPS from remote clients only
    选项卡 Connections 端口8889 勾选Allow remote computers to connect(运行所有移动设备链接)
启动夜神模拟器的代理设置
"""
"""
fiddler已经能接收到数据包了，现在我们清空，走一遍app，让fiddler抓包
app走的流程是：首页-菜谱分类-蔬菜-土豆-菜谱-学做多-向下滑动鼠标（翻页操作）
抓包结束，在fiddler使用F12或鼠标点击左下角capture traffic停止抓包
接着来分析抓到的包：
观察发现，api.douguo.net是接口地址，重点查看返回的json数据
使用工具栏的find，在对话框中输入api.douguo.net,然后点击Find Sessions,现在所有域名和api.douguo.net相关的都变为黄色
然后查看这些包的数据，由于编码的问题，把返回数据放到浏览器json.cn里面查看
http://api.douguo.net/personalized/home HTTP/1.1 首页中部分类
http://api.douguo.net/recipe/flatcatalogs HTTP/1.1 点击“菜谱分类”后的所有分类
http://api.douguo.net/recipe/v2/search/0/20 HTTP/1.1 点击“学做多”后的内容
http://api.douguo.net/recipe/v2/search/20/20 HTTP/1.1 翻页操作
http://api.douguo.net/recipe/detail/957058 HTTP/1.1 详情页
"""

主程序：

import json
import requests
from multiprocessing import Queue
from APP数据抓取.spider_douguo import mogo
from concurrent.futures import ThreadPoolExecutor

# 创建队列
queue_list = Queue()


# 请求函数
def handler_request(url, data):
    """
    :param url: 请求不同页面的链接
    :param data: 请求不同页面时所附带的请求数据
    :return: 获得的页面结果
    """
    header = {
        "client": "4",
        "version": "6961.2",
        "device": "OPPO R17",
        "sdk": "22,5.1.1",
        "channel": "baidu",
        "resolution": "1280*720",
        "display-resolution": "1280*720",
        "dpi": "1.5",
        # "android-id": "241c04e169bc5101",
        # "pseudo-id": "4e169bc5101241c0",
        "brand": "OPPO",
        "scale": "1.5",
        "timezone": "28800",
        "language": "zh",
        "cns": "0",
        "carrier": "CHINA+MOBILE",
        # "imsi": "460075101569188",
        "User-Agent": "Mozilla/5.0 (Linux; Android 5.1.1; OPPO R17 Build/NMF26X; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/74.0.3729.136 Mobile Safari/537.36",
        "uuid": "bb05a5b1-b4a8-4ae9-818a-8cf5e456a6ba",
        "battery-level": "0.86",
        "battery-state": "2",
        "terms-accepted": "1",
        "newbie": "1",
        # "mac": "24:1C:04:E1:69:BC",
        "imei": "866174518956913",  # 这个不能去掉 ，这个是手机的ID，在模拟器里可查看
        "reach": "10000",
        "act-code": "1586589478",
        "act-timestamp": "1586589478",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "Keep-Alive",
        # "Cookie": "duid=63925665",
        "Host": "api.douguo.net",
        # "Content-Length": "13",
    }
    # 在这里添加代理，当前代理链接已失效，只说明如何使用
    # proxy = {"http": "http://H211EATS905745KC:[email protected]:9030"}
    # response = requests.get(url=url, proxies=proxy)
    response = requests.post(url=url, headers=header, data=data)
    return response


# 分类菜谱
def handle_index():
    """
    :return: 获取所有分类的名字，重组请求列表页的请求数据，并把数据放入队列，等待代用
    """
    url = "http://api.douguo.net/recipe/flatcatalogs"
    data = {"client": 4,
            # "_session": 1586589481759866174518956913,
            # "v": 1503650468,
            "_vs": 2305,
            # "sign_ran": "342696dee05600e1981d6cf95dfe22e9",
            # "code": "22a8091f41d1c9bb"
            }
    response = handler_request(url=url, data=data)
    # print(response.text)
    index_response_dict = json.loads(response.text)
    for index_item in index_response_dict["result"]["cs"]:
        for index_item_sub in index_item["cs"]:
            for item in index_item_sub["cs"]:
                # print(item)
                # 详情页client=4&_session=1586589481759866174518956913&keyword=%E5%9C%9F%E8%B1%86&order=3&_vs=11104&type=0&auto_play_mode=2&sign_ran=e31f5e1a08d5ea07b9e79a5d88f9a9df&code=a71450a7a3827c7a
                # 解码后client=4&_session=1586589481759866174518956913&keyword=土豆&order=3&_vs=11104&type=0&auto_play_mode=2&sign_ran=e31f5e1a08d5ea07b9e79a5d88f9a9df&code=a71450a7a3827c7a
                # 请求菜谱列表时要用到的请求数据
                data_detail = {
                    "client": "4",
                    "_session": "1586589481759866174518956913",
                    "keyword": item["name"],
                    "order": "3",
                    "_vs": "11104",
                    "type": "0",
                    "auto_play_mode": "2",
                    # "sign_ran": "e31f5e1a08d5ea07b9e79a5d88f9a9df",
                    # "code": "a71450a7a3827c7a",
                }
                # print(data_detail)
                queue_list.put(data_detail)


# 菜谱列表及详情页
def handle_caipu_list(data):
    """
    :param data: 所有分类的名字
    :return: 根据类名获取到分类列表，并第二次请求获取详情页数据，最后把组合的字典数据写入MongoDB数据库
    """
    print("当前处理的食材：", data["keyword"])
    caipu_list_url = "http://api.douguo.net/recipe/v2/search/0/20"
    caipu_list_response = handler_request(url=caipu_list_url, data=data)
    # print(caipu_list_response.text)
    caipu_list_response_dict = json.loads(caipu_list_response.text)
    for item in caipu_list_response_dict["result"]["list"]:
        # print(item)
        caipu_info = {}
        caipu_info["shicai"] = data["keyword"]
        if item["type"] == 13:
            caipu_info["user_name"] = item["r"]["an"]
            caipu_info["shicai_id"] = item["r"]["id"]
            caipu_info["describe"] = str(item["r"]["cookstory"]).replace("\n", "").replace(" ", "")
            caipu_info["caipu_name"] = item["r"]["n"]
            caipu_info["zuoliao_list"] = item["r"]["major"]
            caipu_info["pingfen"] = item["r"]["rate"]
            caipu_info["people"] = item["r"]["recommendation_tag"]
            # print(caipu_info)
            detail_url = f"http://api.douguo.net/recipe/detail/{str(caipu_info['shicai_id'])}"
            detail_data = {
                "client": "4",
                "_session": "1586589481759866174518956913",
                "author_id": "0",
                "_vs": "11101",
                "is_new_user": "1",
                # "sign_ran": "f98f1a6b40400f36fb07ec13242d5033",
                # "code": "04765aa8dc22d71d"
            }
            detail_response = handler_request(url=detail_url, data=detail_data)
            # print(detail_response.text)
            detail_response_dict = json.loads(detail_response.text)
            caipu_info["tips"] = detail_response_dict["result"]["recipe"]["tips"]
            caipu_info["cook_step"] = detail_response_dict["result"]["recipe"]["cookstep"]
            # print(json.dumps(caipu_info, ensure_ascii=False))
            # Connect_mongo.insert_item(caipu_info)
            mogo.mongo_info.insert_item(caipu_info)
        else:
            continue


if __name__ == '__main__':
    handle_index()
    pool = ThreadPoolExecutor(max_workers=20)
    print(queue_list.qsize())  # 查看总共有多少数据
    while queue_list.qsize() > 0:

        pool.submit(handle_caipu_list, queue_list.get())

    # 没有使用多进程的代码
    # handle_caipu_list(queue_list.get())
    # for _ in range(queue_list.qsize()):
    #     handle_caipu_list(queue_list.get())

如果有代理时，测试代理，这里仅用来参考，并没有使用代理

import requests

url = "http://ip.hahado.cn/ip"  # 专门用来测试IP的一个链接

proxy = {"http": "http://H211EATS905745KC:[email protected]:9030"}
# proxy = {"ip": "180.126.44.136"}
response = requests.get(url=url, proxies=proxy)
# print(response.text)    {"ip":"180.126.44.136","locale":""}
print(response.text)

MongoDB插入数据的代码

# 在Linux中使用yum安装MongoDB的安装
# /etc/init.d/mongod status 查看MongoDB运行的状态
# netstat -an | grep 27017  查看端口进程
# mongo   exit退出
# liunx下载太慢，用win10

import pymongo
from pymongo.collection import Collection


# https://www.mongodb.com/download-center/community
class Connect_mongo(object):
    def __init__(self):
        self.client = pymongo.MongoClient(host="127.0.0.1", port=27017)
        self.db = self.client["douguo"]  # 自定义数据库名

    # 插入数据的方法
    def insert_item(self, item):
        #                         数据库名   自定义表名
        db_collection = Collection(self.db, "douguo_item")
        db_collection.insert_one(item)

mongo_info = Connect_mongo()
# mongo_info.insert_item({"aa": "bb"})

附：因为从fiddler中复制的data和请求头等数据都是原始的，然而在使用的时候要转换成字典格式

# 原始数据
client=4&_session=1586754944886866174518956913&author_id=0&_vs=11101&is_new_user=1&sign_ran=293eba6ee05e51218b100efbd1941db2&code=695e2bb1c2fb7d6f
# 处理之后的数据
"client":"4",
"_session":"1586589481759866174518956913",
"author_id":"0",
"_vs":"11101",
"is_new_user":"1",
"sign_ran":"f98f1a6b40400f36fb07ec13242d5033",
"code":"04765aa8dc22d71d",

# 下面二行是替换操作 用后面的字符替换前面的字符
# & \n 请求头不用替换换行
# = :
# 下面这一行是正则匹配替换
# (.*?):(.*)     "$1":"$2",

第五步：安装配置MongoDB数据库

步骤自行百度并操作安装配置，我用的是4.2版本的，这个是下载地址https://www.mongodb.com/download-center/community
安装的时候，由于我的硬盘不是固态，最后会卡在进度条，一直等着就行了，我等了一个小时左右
至于配置，现在几乎不需要怎么配置就可以使用了，就连系统服务都是已经配置好了
最后运行爬虫，数据库保存数据如下
在这里插入图片描述