Python Crawler Tour: Part Two

Preface:

In Part One we learned the basics of crawlers and the requests module; this time, let's follow along with a comprehensive case study.

0x00: Case Description

This time we will crawl the specific production license information of cosmetics companies.

0x01: Analysis

First, we must determine whether the company information that appears on this page is loaded dynamically or returned directly with the URL request.
You can check with the F12 developer tools, or write a small Python script (a sketch follows below) that fetches the page and searches the captured data for a company name, to verify how the information is loaded.
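A minimal sketch of that check (the homepage URL here is an assumption, and the company name is a placeholder you would copy from what the browser displays):

import requests

if __name__ == '__main__':
    # Assumed homepage URL of the license query site
    url = 'http://125.35.6.84:81/xk/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    # Fetch the raw HTML that comes back with the page URL itself
    page_text = requests.get(url=url, headers=headers).text
    # Placeholder: use a company name you can see rendered in the browser;
    # if it is missing from the raw HTML, the data is loaded dynamically
    print('REPLACE_WITH_A_COMPANY_NAME' in page_text)
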
The company name cannot be found in the raw response, which means the information and data on this page are loaded dynamically, i.e., via Ajax.
So next, let's capture the Ajax requests.
Sure enough, it really is loaded dynamically.
We can see the request parameters and other details. First, let's summarize the information found:

# The page information is loaded dynamically
Request URL: http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList
Request Method: POST
Content-Type: application/json;charset=UTF-8

Now that we've analyzed the home page, let's look at a specific company's licensing information; just click into one to see details like the following.
Here we also need to analyze whether the information is presented directly by the link or loaded dynamically. The analysis method is the same as for the home page; I looked directly at the Ajax requests and found one:
Looking at the parameters, there is only one: id.
Trying a few different companies, you'll find that this URL is the same for all of them; the only thing that changes is the id argument.
So far, let's also summarize the information found for this page:

# The detail information is also loaded dynamically
Request URL: http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById
Request Method: POST
Content-Type: application/json;charset=UTF-8
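
Before writing the full crawler, you can sanity-check this endpoint with a single request (a sketch; the id value is a placeholder to be copied from one of the captured requests):

import requests

if __name__ == '__main__':
    detail_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    # Placeholder: paste an id captured in the DevTools Network panel
    data = {'id': 'PASTE_AN_ID_FROM_DEVTOOLS'}
    detail = requests.post(url=detail_url, data=data, headers=headers).json()
    print(detail)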

With the analysis complete, let's write the crawling code.

0x02: Crawling

First, format the list page's JSON string to make it easier to view.
After formatting, we find the response is a dictionary whose list key holds a list, and that list in turn contains dictionaries; by traversing them we can take the ID value from each.
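
If you'd rather pretty-print the response yourself than use an online formatter, here is a quick sketch (reusing the request details captured above; the form fields match those used in the code below):

import json
import requests

if __name__ == '__main__':
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    data = {
        'on': 'true', 'page': '1', 'pageSize': '15', 'productName': '',
        'conditionType': '1', 'applyname': '', 'applysn': '',
    }
    # Pretty-print the JSON response so its structure is easy to read
    message = requests.post(url=url, data=data, headers=headers).json()
    print(json.dumps(message, ensure_ascii=False, indent=4))
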
Now let's collect this information with requests:

import requests

if __name__ == '__main__':
    # Home-page URL, used to get the company ID numbers
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
    # UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    # Request parameters
    data = {
        'on': 'true',
        'page': '1',
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': '',
    }
    # Empty list used to store the ID values
    id_message = []
    # Send the request
    message = requests.post(url=url, data=data, headers=headers).json()
    # The response is a dict; each item in its 'list' value contains an ID
    for item in message['list']:
        id_message.append(item['ID'])
    print(id_message)

After crawling, you'll find that this gets all the company IDs on this page.
With the IDs, we can then obtain each company's specific license information by its ID:

import requests
import json

if __name__ == '__main__':
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
    # UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    # Request parameters
    data = {
        'on': 'true',
        'page': '1',
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': '',
    }
    # Empty list used to store the ID values
    id_message = []
    # List used to store the companies' detailed information
    all_data = []
    # Send the request
    message = requests.post(url=url, data=data, headers=headers).json()
    # The response is a dict; each item in its 'list' value contains an ID
    for item in message['list']:
        id_message.append(item['ID'])
    # Get each company's detailed data by its ID number
    data_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    # Traverse the ID list
    for company_id in id_message:
        datas = {
            'id': company_id
        }
        detail_message = requests.post(url=data_url, data=datas, headers=headers).json()
        # print(detail_message)
        # Store in the list
        all_data.append(detail_message)
    # Save to a local file
    with open('message.txt', 'w', encoding='utf-8') as fp:
        json.dump(all_data, fp=fp, ensure_ascii=False, indent=4)
    print("Crawl finished!")

And that's a successful crawl. If you want to crawl more pages, simply make the 'page': '1' parameter dynamic, as the final code below does.

0x03: The final code

import requests
import json

if __name__ == '__main__':
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
    # UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    # Empty list used to store the ID values
    id_message = []
    # List used to store the companies' detailed information
    all_data = []
    # To query multiple pages of data, just add a loop
    for page in range(1, 3):
        # Convert to a string
        page = str(page)
        # Request parameters
        data = {
            'on': 'true',
            'page': page,
            'pageSize': '15',
            'productName': '',
            'conditionType': '1',
            'applyname': '',
            'applysn': '',
        }
        # Send the request
        message = requests.post(url=url, data=data, headers=headers).json()
        # The response is a dict; each item in its 'list' value contains an ID
        for item in message['list']:
            id_message.append(item['ID'])
    # Get each company's detailed data by its ID number
    data_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    # Traverse the ID list
    for company_id in id_message:
        datas = {
            'id': company_id
        }
        detail_message = requests.post(url=data_url, data=datas, headers=headers).json()
        # print(detail_message)
        # Store in the list
        all_data.append(detail_message)
    # Save to a local file
    with open('message.txt', 'w', encoding='utf-8') as fp:
        json.dump(all_data, fp=fp, ensure_ascii=False, indent=4)
    print("Crawl finished!")

The crawled information is saved to message.txt.

Summary:

Through this case study, we've practiced and mastered the analysis method. Next up: learning data parsing!

Keep moving forward on the road to becoming the best!!


Origin: blog.csdn.net/qq_43431158/article/details/104329070