Crawler technology - crawling publicly posted license information records

Note: this post is not a crawler tutorial; it mainly sorts out the technical points.

 

  1. Find the data interface.

  2. Crawling the current page directly only returns the page frame, so the data must be filled in dynamically by an ajax request.
  3.   That is, find the ajax interface address behind the current url and the id parameter it takes; a quick check is sketched below.
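
    A quick way to confirm that the listing is ajax-loaded is to fetch the page itself and see that the enterprise data is missing from the raw HTML. This is only a sketch; the listing-page address below is an assumption (the post only gives the ajax interface url):

    import requests

    headers = {'User-Agent': 'Mozilla/5.0'}
    page_url = 'http://125.35.6.84:81/xk/'  # assumed listing page address, for illustration only
    html = requests.get(url=page_url, headers=headers).text
    # The raw HTML is only a frame: the enterprise entries shown in the browser
    # do not appear in it, which is why we target the ajax interface instead.
    print(len(html))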

     

  4.   Send a data request; a data dictionary is returned:

    import requests

    if __name__ == '__main__':
        # get the enterprise IDs
        url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
        }
        # wrap the request parameters
        data = {
            'on': 'true',
            'page': '1',
            'pageSize': '15',
            'productName': '',
            'conditionType': '1',
            'applyname': '',
            'applysn': '',
        }
        json_ids = requests.post(url=url, headers=headers, data=data).json()
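
    Before parsing, a quick look at the shape of the returned dictionary helps (a sketch; the 'list' key is the one used in the next step):

    print(json_ids.keys())        # top-level keys of the response
    print(json_ids['list'][0])    # first enterprise record, contains the 'ID' field used below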

  5.   Parse the returned json data:

    id_list=[]
    json_ids = requests.post(url=url,headers=headers,data=data).json()
    for dic in json_ids['list']:
        id_list.append(dic['ID'])

     

    To verify the data, print(len(id_list)) returns 15, matching the pageSize set above, so the data obtained is correct.

     

     

  6.   With the enterprise ids in hand, handle the next ajax request for the detail data

    # detailed enterprise data
    post_url='http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    for id in id_list:
        data={
            'id':id
        }
        detail_json = requests.post(url=post_url, headers=headers, data=data).json()
    
        all_data_list.append(detail_json)

    Print the acquired information to confirm it; after that, only the storage step remains and we are done.
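
    One way to do that confirmation print is to pretty-print the first detail record (a sketch; it assumes at least one record was fetched):

    import json

    # dump the first enterprise detail record in a readable form
    print(json.dumps(all_data_list[0], ensure_ascii=False, indent=2))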

  7.  Simple persistent data storage

    # persist all_data_list to disk
    with open('./medData.json', 'w', encoding='utf-8') as fp:
        json.dump(all_data_list, fp=fp, ensure_ascii=False)
    print('爬取完成!')

     A new json file appears in the directory, which means the data has been stored.

     

     A quick check shows the data is complete; the crawl succeeded!
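
     To make the check concrete, the saved file can be read back (a sketch; json is the module already used for dumping):

    with open('./medData.json', 'r', encoding='utf-8') as fp:
        saved = json.load(fp)
    print(len(saved))   # should match the number of enterprise ids crawled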

     

  8.   Optimizing the crawl for paginated data

    # wrap the request parameters
    data={
        'on': 'true',
        'page': '1',
        'pageSize': '15',
        'productName':'',
        'conditionType': '1',
        'applyname':'',
        'applysn':'',
    }

     

    Since the request parameters copied from the home page only fetch a single page, they have to be modified when we want to crawl more data:

    for page in range(1,6):
        page=str(page)
        # wrap the request parameters
        data={
            'on': 'true',
            'page': page,
            'pageSize': '15',
            'productName':'',
            'conditionType': '1',
            'applyname':'',
            'applysn':'',
        }
        json_ids = requests.post(url=url,headers=headers,data=data).json()
        for dic in json_ids['list']:
            id_list.append(dic['ID'])

     

    The range here holds the page numbers; change it according to your needs, the only difference being the amount of data retrieved.
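
    If the paging logic is reused, it can also be wrapped in a small helper function. This is just an optional restructuring of the loop above, not something from the original code:

    def fetch_ids(page, url, headers):
        """Return the enterprise IDs listed on one page of the interface."""
        data = {
            'on': 'true',
            'page': str(page),
            'pageSize': '15',
            'productName': '',
            'conditionType': '1',
            'applyname': '',
            'applysn': '',
        }
        json_ids = requests.post(url=url, headers=headers, data=data).json()
        return [dic['ID'] for dic in json_ids['list']]

    id_list = []
    for page in range(1, 6):
        id_list += fetch_ids(page, url, headers)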

 

The complete source code is shown below:

 

# coding:utf-8
# author:Joseph

import requests
import json
if __name__=='__main__':
    # get the enterprise ids
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
    id_list = []  # stores the enterprise ids
    all_data_list = []  # stores all the detailed enterprise data

    for page in range(1,6):
        page=str(page)
        # wrap the request parameters
        data={
            'on': 'true',
            'page': page,
            'pageSize': '15',
            'productName':'',
            'conditionType': '1',
            'applyname':'',
            'applysn':'',
        }
        json_ids = requests.post(url=url,headers=headers,data=data).json()
        for dic in json_ids['list']:
            id_list.append(dic['ID'])

    # print(id_list)
    # detailed enterprise data
    post_url='http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    for id in id_list:
        data={
            'id':id
        }
        detail_json = requests.post(url=post_url, headers=headers, data=data).json()

        all_data_list.append(detail_json)
    # persist all_data_list to disk
    with open('./medData.json', 'w', encoding='utf-8') as fp:
        json.dump(all_data_list, fp=fp, ensure_ascii=False)
    print('爬取完成!')
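
The script fires one request per enterprise id in a tight loop. If the site starts rejecting requests, the detail loop can be hardened with basic error handling and a short delay; this is an optional sketch, not part of the original script:

import time

for id in id_list:
    data = {'id': id}
    try:
        detail_json = requests.post(url=post_url, headers=headers, data=data).json()
        all_data_list.append(detail_json)
    except Exception as e:
        # skip records whose detail request fails instead of aborting the whole crawl
        print('request failed for id', id, e)
    time.sleep(0.5)  # small delay to be polite to the server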

 

 

 

 

 

Origin www.cnblogs.com/91joe/p/12518365.html