2. Crawler review

1. Three categories of crawlers:

General crawler: crawls an entire page of data.

Focused crawler: crawls filtered data, targeting a specific portion of a page's content.

Incremental crawler: monitors a website for updated data and crawls only the newly added content.

 

2. What is UA detection, and how do you bypass it?

UA detection: the server obtains the UA (User-Agent) field from the headers of an incoming request and uses its value to determine the identity of the request's carrier (a real browser vs. a program).

 

To bypass it, the crawler disguises the requests it initiates as browser requests.

UA detection is an anti-crawling mechanism, and anti-crawling mechanisms belong to the portal (website).

Anti-anti-crawling strategies belong to the crawler program: cracking an anti-crawling mechanism is an anti-anti-crawling strategy.
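
As a minimal sketch of this anti-anti-crawling strategy (assuming the requests library; httpbin.org/user-agent is just an echo-test endpoint, not part of the original notes):

import requests

# A browser-style User-Agent string; without it, requests identifies itself
# as "python-requests/x.y.z", which UA detection can reject.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

# httpbin.org/user-agent echoes back the User-Agent the server received.
response = requests.get('http://httpbin.org/user-agent', headers=headers)
print(response.text)  # shows the disguised (browser) User-Agent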

 

3. Briefly describe the HTTPS encryption process.

Certificate-based key encryption: the server generates a public/private key pair and sends the public key to a third-party certificate authority (CA). The CA digitally signs the public key and returns it to the server as a certificate, which acts as an anti-counterfeit label. The server then sends the certificate (containing the public key) to the client. The client verifies the CA's digital signature on the certificate, uses the public key to encrypt the symmetric session key, and sends the resulting ciphertext to the server.
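
As a small side illustration (not from the original post), Python's standard ssl module can fetch the CA-signed certificate that a server presents during the HTTPS handshake; www.python.org is just an example host:

import ssl

# Fetch the PEM-encoded certificate the server presents during the TLS
# handshake: the CA-signed "anti-counterfeit label" described above.
pem_cert = ssl.get_server_certificate(('www.python.org', 443))
print(pem_cert[:200])  # the beginning of the PEM-encoded certificate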

 

4. What is dynamically loaded data? How do you crawl dynamically loaded data?

Ajax can load data dynamically. Sometimes a page is not "what you see is what you get": the visible data may have been fetched by an Ajax request rather than being present in the initial HTML.

Capture the Ajax request with a packet-capture tool, obtain the URL and parameters from the captured packet, then send the request yourself; it returns a JSON string.

Ajax generally returns a JSON string, though other response types are possible.
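
A minimal sketch of replaying a captured Ajax request (httpbin.org/get stands in for the real endpoint here; substitute the URL and parameters your capture tool reveals):

import requests

# Stand-in Ajax endpoint: httpbin.org/get echoes the request back as JSON.
ajax_url = 'http://httpbin.org/get'
params = {'page': 1, 'pageSize': 15}  # example parameters from a capture

response = requests.get(ajax_url, params=params)
data = response.json()  # Ajax endpoints typically return a JSON string
print(data)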

 

5. What are the common parameters of the requests module's get and post methods, and what do they do?

url (the target address), params (query-string parameters, mainly for get), data (form data, mainly for post), and headers (request headers such as User-Agent).
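
A short sketch showing each parameter in use (httpbin.org is used here purely as an echo service; it is not part of the original notes):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # request headers

# get: params is encoded into the URL as a query string.
r1 = requests.get('http://httpbin.org/get', params={'wd': 'crawler'}, headers=headers)

# post: data is sent as the form body.
r2 = requests.post('http://httpbin.org/post', data={'page': '1'}, headers=headers)

print(r1.url)             # shows the query string built from params
print(r2.json()['form'])  # shows the submitted form data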

 

6. Issue 1: the IP was blocked; worked around it by switching to my own mobile hotspot.

Issue 2: paging problems. Pages after page 50 raised errors; handled by wrapping the request in try...except to skip the exceptions, as sketched below.
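
A sketch of that try...except workaround (the list URL here is a hypothetical placeholder):

import requests

# Hypothetical list URL; in practice, pages after 50 raised errors.
list_url = 'http://example.com/list'

for page in range(1, 101):
    try:
        response = requests.get(list_url, params={'page': page}, timeout=10)
        response.raise_for_status()  # raise for 4xx/5xx responses
        print(response.text)
    except Exception as e:
        # Skip the problematic page instead of aborting the whole crawl.
        print(f'page {page} failed: {e}')
        continue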

 

 

7. Cosmetics production license information management system service platform

Global search:

Select any packet and press Ctrl + F to run a global search for the target data; a global search is the best way to locate the packet that corresponds to the data you are looking for.

Once the data packet is located, look at its request parameters; response refers to the response data.

The response is a JSON string; after parsing it, we can see each company's ID field.

 

Analysis of the captured packet:

# Search page: http://125.35.6.84:81/xk/
#Request URL: http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList
#Request Method: POST
# Content-Type: application/json;charset=UTF-8
#User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36

#Form Data    # parameters
#     on: true
#     page: 1
#     pageSize: 15
#     productName: 
#     conditionType: 1
#     applyname: 
#     applysn: 

 

 

Analysis:
    (1) The packet-capture tool shows that all of the company information on the home page is loaded dynamically.
    (2) Use the capture tool to obtain the Ajax packet (URL, request parameters) behind the dynamically loaded data.
    (3) Analyzing the response to the step-2 request reveals a special field, ID (each company has a unique ID value).
    (4) Manually clicking through to a company's detail page shows that the URL in the browser's address bar contains the company's ID plus a fixed domain, so a detail-page URL can be concatenated from the two.
    (5) However, the company details on the detail page are themselves loaded dynamically, so the detail-page URL obtained above is useless by itself.
    (6) With the capture tool's global search, locate the Ajax packet (URL, request parameters) behind the company details; its response data is the detailed company data we ultimately want to crawl.

Note: write out the approach first, then write the program.
    Build the program a little at a time rather than all at once:
    write a step, run it, then write the next.

 

 

import requests
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
# URL of the first (list-page) request
first_url='http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
ids=[]
# Crawl the first 10 pages of data
for page in range(1,11):
    data={
        "on": "true",
        "page": str(page),
        "pageSize": "15",
        "productName": "",
        "conditionType": "1",
        "applyname": "",
        "applysn": "",
    }
    #json_obj=requests.post(url=first_url,data=data,headers=headers).json()
    response=requests.post(url=first_url,data=data,headers=headers)  # response object
    # response.headers returns the response headers (a dict)
    if response.headers['Content-Type']=='application/json;charset=UTF-8':
        json_obj=response.json()
        for dic in json_obj['list']:
            ids.append(dic['ID'])
#print(ids)   # at this point we have collected all the company IDs
# URL of the detail-page Ajax request; each company's details are fetched by ID
detail_url='http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
for _id in ids:
    data={
        'id':_id
    }
    company_text=requests.post(detail_url,data=data,headers=headers).text  # JSON string with the company's detail data
    print(company_text)

The crawled data is the detail information printed by the script above.

 
