Crawler base (classification / requests module)

  • Reptiles Category:
    • Common
    • Focus
    • Incremental: Monitoring
  • requests
    • Role: Analog browser send request
    • get/post:url,data/params,headers
    • Anti-climbing mechanisms:
      • robots.txt
      • UA detection
    • Encoding process:
      • Specify the url
      • Initiate a request
      • Fetch response data
      • Persistent storage
    • get / post Return value: response object response
      • text: string response data
      • json (): returns the standard string json
      • content: the response data in binary form
      • encoding: the coded data in response to
# Simple collector web 
WD = INPUT ( ' Enter A Word: ' )
URL = ' https://www.sogou.com/web ' 
# will be set to a dynamic parameter request 
param = {
     ' Query ' : WD
}
# The UA camouflage 
headers = {
     ' the User-- Agent ' : ' the Mozilla / 5.0 (the iPhone; the CPU the iPhone the OS like the Mac the OS X-11_0) AppleWebKit / 604.1.38 (KHTML, like the Gecko) Version / 11.0 Mobile / 15A372 Safari / 604.1 ' 
} 
response
= requests.get (URL = URL, the params = param, headers = headers) # manually set the response code data (processing distortion problems Chinese) response.encoding = ' UTF-. 8 ' # text string is returned in the form of response data page_text = response.text fileName = wd+'.html' with open(fileName,'w',encoding='utf-8') as fp: fp.write(page_text) Print (fileName, ' download successful! ' )

User-Agent: user agent, a part of the http protocol, part of the header field belonging to the special string head

UA detection: portal server detects UA every request, if the UA is detected request crawlers, the request fails

# Crawling KFC location information 
URL = ' http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword ' 
City = INPUT ( ' Enter City A name: ' )
data = {
    "cname": "",
    "pid": "",
    "keyword": city,
    "pageIndex": "1",
    "pageSize": "10",
}
headers = {
    'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
}
response = requests.post(url=url,data=data,headers=headers)

# Json () returns a json object type 
page_text = response.json ()

fp = open('./kfc.txt','w',encoding='utf-8')
for dic in page_text['Table1']:
    address = dic['addressDetail']
    fp.write(address+'\n')
fp.close()

 

Guess you like

Origin www.cnblogs.com/wmh33/p/11040944.html