Python crawlers: a first look at crawlers

Getting started with crawlers

What is a crawler: a program that simulates a browser going online and then crawls data from the Internet

- Crawler classification:
    - General-purpose crawlers:
        - crawl a whole page of data from the Internet
    - Focused crawlers:
        - crawl a specific portion of a page's data
    - Incremental crawlers:
        - monitor a website for updated data, so that only the latest updates to the site are crawled
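The incremental idea above can be sketched without any network access: keep a fingerprint set of items already crawled and only process new ones on each run. The item strings and the `crawl_new` helper here are hypothetical stand-ins for real page records.

```python
import hashlib

seen = set()  # fingerprints of items crawled on earlier runs

def fingerprint(item: str) -> str:
    # hash the item content so the seen-set stays small
    return hashlib.md5(item.encode('utf-8')).hexdigest()

def crawl_new(items):
    """Return only the items not seen on previous runs (incremental crawl)."""
    fresh = []
    for item in items:
        fp = fingerprint(item)
        if fp not in seen:
            seen.add(fp)
            fresh.append(item)
    return fresh

# first run: everything is new; second run: only the newly added item
first = crawl_new(['article-1', 'article-2'])
second = crawl_new(['article-1', 'article-2', 'article-3'])
```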

 

Using the requests module

# Crawl the Sogou home page source data
import requests
# 1. Specify the URL
url = 'https://www.sogou.com/'
# 2. requests.get: the return value is a Response object
response = requests.get(url=url)
# 3. Fetch the response data
page_text = response.text  # returns the response data as a string
# 4. Persist it
with open('sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
# Implement a simple web collector
# the URL needs to carry dynamic parameters
url = 'https://www.sogou.com/web'
# make the parameters dynamic
wd = input('Enter a keyword: ')
params = {
    'query': wd
}
# the params argument of the get method carries the request parameters as a dictionary
response = requests.get(url=url, params=params)

page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
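Under the hood, requests URL-encodes the params dict into the query string of the final URL. A quick standard-library illustration of that encoding (no network needed):

```python
from urllib.parse import urlencode

params = {'query': 'python'}
query_string = urlencode(params)  # percent-encodes keys and values
full_url = 'https://www.sogou.com/web?' + query_string

# non-ASCII values are percent-encoded as UTF-8 bytes
encoded_cn = urlencode({'query': '爬虫'})
```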

Executing the code reveals two problems:

  • 1. The saved page is garbled
  • 2. The amount of data returned is not right
# Fix the garbled text

url = 'https://www.sogou.com/web'
# make the parameters dynamic
wd = input('Enter a keyword: ')
params = {
    'query': wd
}
# the params argument of the get method carries the request parameters as a dictionary
response = requests.get(url=url, params=params)
response.encoding = 'utf-8'  # set the encoding used to decode the response data
page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
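The garbling happens because the response bytes are UTF-8 but a different charset was guessed when decoding them. A pure-Python illustration of that mismatch, using ISO-8859-1 as the assumed wrong guess:

```python
# UTF-8 bytes for the text, as they arrive over the wire
raw = '搜狗'.encode('utf-8')

# decoding with the wrong charset yields mojibake
garbled = raw.decode('iso-8859-1')

# decoding with the right charset recovers the text,
# which is what response.encoding = 'utf-8' makes response.text do
fixed = raw.decode('utf-8')
```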
  • UA detection: the portal inspects the User-Agent of the request carrier to decide whether the request was initiated by a crawler
  • UA spoofing: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
# Defeat UA detection
url = 'https://www.sogou.com/web'
# make the parameters dynamic
wd = input('Enter a keyword: ')
params = {
    'query': wd
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
# pass the spoofed headers along with the params
response = requests.get(url=url, params=params, headers=headers)
response.encoding = 'utf-8'  # set the encoding used to decode the response data
page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)

Dynamic loading of page data

  • Some page data is not in the page's own response but is loaded through a separate data request
# Crawl the Douban movie ranking
url = 'https://movie.douban.com/j/chart/top_list'
start = input('From which movie do you want to start: ')
limit = input('How many movies of data do you want: ')
dic = {
    'type': '13',
    'interval_id': '100:90',
    'action': '',
    'start': start,
    'limit': limit,
}
response = requests.get(url=url, params=dic, headers=headers)
page_text = response.json()  # json() returns the deserialized object
for dic in page_text:
    print(dic['title'] + ':' + dic['score'])
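`response.json()` simply deserializes the JSON body into Python objects. The equivalent with the standard library, on a fabricated body shaped like the ranking response:

```python
import json

# a made-up response body in the same shape as the ranking endpoint returns
body = '[{"title": "Movie A", "score": "9.1"}, {"title": "Movie B", "score": "8.7"}]'

movies = json.loads(body)  # list of dicts, like response.json()
lines = [m['title'] + ':' + m['score'] for m in movies]
```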
# KFC store query http://www.kfc.com.cn/kfccda/storelist/index.aspx
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'

address = input('Enter a keyword to query: ')
for i in range(1, 5):
    dic = {
        'cname': '',
        'pid': '',
        'keyword': address,
        'pageIndex': str(i),
        'pageSize': '10',
    }
    response = requests.post(url=url, headers=headers, data=dic)
    page_text = response.json()

    for a in page_text['Table1']:
        print('Restaurant name:', a['storeName'] + ' restaurant', '---- Restaurant address:', a['addressDetail'])

 

Practice: crawl business detail information from the Drug Administration site http://125.35.6.84:81/xk/

# State Drug Administration

import requests
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
dic = {
    'on': 'true',
    'page': '1',
     'pageSize': '15',
    'productName':'',
    'conditionType': '1',
    'applyname': '',
    'applysn':'',
}
response = requests.post(url=url, headers=headers, data=dic)
page_text = response.json()
for page in page_text['list']:
    # print(page['EPS_NAME'],page['ID'])
    _url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    _dic = {
        'id': page['ID']
    }
    _response = requests.get(url=_url, headers=headers, params=_dic)
    _page_text = _response.json()
    print(_page_text['epsName'],'-----',_page_text['legalPerson'])
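The two-step pattern above (POST for a list of IDs, then one request per ID for the details) can be sketched with stubbed fetchers standing in for the two endpoints; the functions and the data here are fabricated for illustration:

```python
# fabricated stand-ins for the list endpoint and the detail endpoint
def fetch_list(page):
    return [{'ID': 'a1'}, {'ID': 'b2'}]

def fetch_detail(item_id):
    details = {
        'a1': {'epsName': 'Firm A', 'legalPerson': 'Alice'},
        'b2': {'epsName': 'Firm B', 'legalPerson': 'Bob'},
    }
    return details[item_id]

# step 1: fetch the index list; step 2: fetch each detail by ID
results = []
for entry in fetch_list(page=1):
    detail = fetch_detail(entry['ID'])
    results.append((detail['epsName'], detail['legalPerson']))
```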

 


Origin www.cnblogs.com/CatdeXin/p/11514906.html