Crawler practice project: cosmetics production license information management system service platform

1. Determine the URL

Using a packet-capture tool (the browser's developer tools), we find that when the page is refreshed, the data we want to crawl does not appear in the page's own HTML response. We can therefore conclude that the data is loaded dynamically via Ajax.

Note: the requests captured under the XHR tab are the Ajax requests. Their URLs cannot be read directly from the URL of the page that issues them; instead, each one's URL can be found in the request headers of the captured packet.

Since the data comes from Ajax, we need to extract the Ajax request URL from the packet-capture tool.
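As a quick way to confirm that the data really is absent from the static HTML, one can fetch the page source directly and search it for a company name that is visible in the rendered page. A minimal sketch, assuming the platform homepage is http://scxk.nmpa.gov.cn:81/xk/ (the name being checked is a placeholder):

import requests

# Sanity check: a company name visible in the browser should NOT appear
# in the static HTML, which confirms the data is loaded dynamically by Ajax.
homepage = 'http://scxk.nmpa.gov.cn:81/xk/'  # assumed platform homepage
html = requests.get(homepage, headers={'User-Agent': 'Mozilla/5.0'}).text
print('placeholder company name' in html)  # expected: False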

Analysis shows that the unique ID of every company can be obtained from the Ajax request issued by the homepage.

And from each company's Ajax packet, we can see that every detail-page URL shares the same fixed part:

1. http://scxk.nmpa.gov.cn:81/xk/itownet/portal/dzpz.jsp?id=9be8485451d44b3a8eb659ab6d3ae9c2 # company 1
2. http://scxk.nmpa.gov.cn:81/xk/itownet/portal/dzpz.jsp?id=1a7c3b68d8404db8b7048149367eeaf0 # company 2

To conclude: we need to obtain each company's unique ID and then combine it with the fixed part of the link to request that company's specific details. A minimal single-request sketch follows, and the full script after it.
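To illustrate the key point first (the ID is sent as a POST form parameter, not simply pasted onto the end of a URL), here is a minimal sketch that fetches a single company's detail record, using the sample ID from company 1 above and the same detail endpoint as the full script:

import requests

# Minimal sketch: fetch one company's detail record.
# The ID travels as a POST form parameter, not as part of the URL path.
detail_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
headers = {'User-Agent': 'Mozilla/5.0'}
data = {'id': '9be8485451d44b3a8eb659ab6d3ae9c2'}  # sample ID (company 1 above)
detail = requests.post(url=detail_url, headers=headers, data=data).json()
print(detail)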

import requests
import json

# Approach: page through the listing to collect every company's ID, then use
# each ID as a request parameter against the fixed detail endpoint.
# Common pitfall: the ID and the fixed link are not joined by simple URL
# concatenation; the ID is a request parameter that requests sends for us.
id_list = []  # collects each company's ID
request_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74'
}
url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
for page in range(1, 20):
    data = {
        'on': 'true',
        'page': str(page),
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': ''
    }
    response = requests.post(url=url, headers=request_header, data=data).json()
    for dic in response['list']:
        id_list.append(dic['ID'])
# ****************************************************************
# From here on the program splits into two parts: above, all the IDs are
# collected; below, each ID's detail data is requested and then saved.
all_data = []  # holds the final detail record of every company
post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
for company_id in id_list:
    data2 = {
        'id': company_id  # wrap the ID in a dict so it is sent as a POST parameter
    }
    result = requests.post(url=post_url, headers=request_header, data=data2).json()
    all_data.append(result)  # collect every response in all_data
    print(result)  # print the returned result directly

# Use a with block so the file is closed properly after writing.
with open('./huanzhuanpin.json', 'w', encoding='utf-8') as file:
    json.dump(all_data, fp=file, ensure_ascii=False)
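One fragile point in the script above is the hardcoded range(1, 20): if the site has more pages, they are silently skipped. Below is a hedged sketch of a loop that keeps paging until the last page reported by the server is reached; it assumes the listing response carries a pageCount field with the total number of pages (verify the field name against a real captured response before relying on it):

import requests

list_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
headers = {'User-Agent': 'Mozilla/5.0'}

def fetch_all_ids():
    # Page through the listing until the server-reported last page is reached.
    ids = []
    page, page_count = 1, 1
    while page <= page_count:
        data = {'on': 'true', 'page': str(page), 'pageSize': '15',
                'productName': '', 'conditionType': '1',
                'applyname': '', 'applysn': ''}
        resp = requests.post(url=list_url, headers=headers, data=data).json()
        # 'pageCount' is an assumed field name for the total page count.
        page_count = int(resp.get('pageCount', page_count))
        ids.extend(dic['ID'] for dic in resp['list'])
        page += 1
    return ids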
