The Python doraemon crawler (requests module)

## 2. The requests module
 
- Concept: a module, built on top of the network-request machinery, used to simulate a browser sending requests.
- Coding workflow:
  - specify the URL
  - send the request
  - fetch the response data (the data to be crawled)
  - persistent storage
- Environment setup:
  - pip install requests
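
The four steps above can be sketched end to end. This is a minimal sketch, not the tutorial's own code: it stands up a throwaway local HTTP server so it runs without touching a real site; the server, its page content, and the `page.html` filename are all assumptions for the demo.

```python
import threading
import requests
from http.server import BaseHTTPRequestHandler, HTTPServer

# throwaway local server that stands in for the target website
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = '<html>hello</html>'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# 1. specify the URL
url = 'http://127.0.0.1:%d/' % server.server_address[1]
# 2. send the request
response = requests.get(url=url)
# 3. fetch the response data
page_text = response.text
# 4. persistent storage
with open('./page.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

server.shutdown()
```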

**requests:**

- get / post:
  - url
  - data / params: used to package the request parameters
  - headers: UA disguise
- Dynamically loaded data: data fetched by a separate, additional request
  - ajax
  - js
- How to identify whether a page contains dynamically loaded data?
  - local search
  - global search
- What is the first step before crawling an unfamiliar website?
  - Determine whether the data you want to crawl is dynamically loaded!!!
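
How `params` and `headers` are packaged into a request can be inspected offline with `requests.Request(...).prepare()`, which builds the request without sending it. A small sketch (the query value and the shortened UA string are placeholders for this demo):

```python
import requests

url = 'https://www.sogou.com/web'
params = {'query': 'python'}
headers = {'User-Agent': 'Mozilla/5.0'}

# prepare() builds the request without sending it, so we can see how
# params are appended to the url and how the UA travels in the headers
prepared = requests.Request('GET', url, params=params, headers=headers).prepare()
print(prepared.url)                    # params encoded into the query string
print(prepared.headers['User-Agent'])  # the disguised UA
```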

### 2.1 The Chinese garbled-text problem

```python
import requests

# solve the Chinese garbled-text problem
wd = input('Enter a keyword: ')
url = 'https://www.sogou.com/web'
# store the dynamic request parameters
params = {
    'query': wd
}
# params must be applied to the request:
# the params argument packages the request's url parameters
response = requests.get(url=url, params=params)

# manually set the encoding of the response data !!!
response.encoding = 'utf-8'

page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(wd, 'downloaded successfully!')
```
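
Why `response.encoding` matters can be shown offline: when the server declares no charset, `requests` falls back to ISO-8859-1, so UTF-8 Chinese bytes come out garbled until the encoding is set by hand. The sample string below is an arbitrary example:

```python
# what the server actually sent: UTF-8 encoded Chinese text
raw_bytes = '搜狗搜索'.encode('utf-8')

# requests' fallback guess when no charset is declared -> mojibake
garbled = raw_bytes.decode('iso-8859-1')
# after response.encoding = 'utf-8', the same bytes decode correctly
correct = raw_bytes.decode('utf-8')

print(garbled)
print(correct)
```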

### 2.2 The UA anti-crawling mechanism and UA disguise

```python
import requests

# problem: most sites verify the request; if it was not sent by a browser,
# the server may deny access

# solve the Chinese garbled-text problem & apply UA disguise
wd = input('Enter a keyword: ')
url = 'https://www.sogou.com/web'
# store the dynamic request parameters
params = {
    'query': wd
}

# header information for the request about to be sent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}

# params must be applied to the request:
# the params argument packages the request's url parameters
# the headers argument is used to achieve UA disguise
response = requests.get(url=url, params=params, headers=headers)

# manually set the encoding of the response data
response.encoding = 'utf-8'

page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(wd, 'downloaded successfully!')
```

### 2.3 Dynamically loaded data

How to determine whether a page contains dynamically loaded data?

- Do a local search in a packet-capture tool.
- If the page is confirmed to load data dynamically, how do you locate that data?
  - Do a global search in a packet-capture tool.
- Before crawling data from an unfamiliar site, you must first determine whether that data is dynamically loaded!!!
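
The "local search" check from the list above can be sketched in a few lines: fetch the raw page source with a plain GET, then search it for the data you saw in the rendered browser page; if it is absent, the data was loaded dynamically. The HTML snippet and company name below are made-up placeholders:

```python
# raw page source as returned by a plain GET (placeholder example)
raw_page_source = '<html><div id="app"></div><script src="app.js"></script></html>'
# the data visible in the rendered browser page
target_data = 'Acme Pharmaceutical Co.'

if target_data in raw_page_source:
    result = 'statically loaded: present in the page source'
else:
    result = 'dynamically loaded: look for an ajax/js request instead'
print(result)
```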



```
Example requirement:
    crawl enterprise detail information from http://125.35.6.84:81/xk/

Analysis:
    1. The data on the site's home page and on the enterprise detail pages is dynamically loaded.
    2. Where does an enterprise's detail data come from?
       It is requested through an ajax (post) request.
       The corresponding request url: http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById
       The request carries one parameter: id: xxdxxxx

Conclusions:
    1. Each company's detail data is requested through an ajax post request.
    2. The ajax request url is the same for every company, and all are post requests;
       only the value of the id parameter differs.
    3. Therefore, to obtain each company's detail data, we only need its id value.

How to get each company's id value?
    Idea: each company's id should be stored in a request or response related to the home page.
    Conclusion: every company's id is stored in the response data of one of the home page's ajax
    requests; we just need to extract/parse the ids from that response.
```

```python
import requests

# UA disguise, as in section 2.2
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}

# implementation
# request the home-page data that carries each company's id
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
data = {
    'on': 'true',
    'page': '1',
    'pageSize': '15',
    'productName': '',
    'conditionType': '1',
    'applyname': '',
    'applysn': '',
}

fp = open('./company_detail.txt', 'w', encoding='utf-8')

# .json() returns a dict that contains every company's id
data_dic = requests.post(url=url, data=data, headers=headers).json()
# parse out the ids
for dic in data_dic['list']:
    _id = dic['ID']
    # print(_id)
    # fetch (send the request for) the detail data corresponding to each id
    post_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    post_data = {
        'id': _id
    }
    # .json() returns a dict with one company's detail information
    detail_dic = requests.post(url=post_url, data=post_data, headers=headers).json()
    company_title = detail_dic['epsName']
    address = detail_dic['epsProductAddress']

    fp.write(company_title + ':' + address + '\n')
    print(company_title, 'crawled successfully!!!')
fp.close()
```

 


Origin www.cnblogs.com/doraemon548542/p/11964364.html