Introduction to web crawlers
What is a crawler: a program that simulates browsing through a web browser, goes onto the Internet, and crawls and extracts data from it
- Crawler classification:
    - General crawler:
        - crawls entire pages of data from the Internet
    - Focused crawler:
        - fetches a specific, local portion of a page's data
    - Incremental crawler:
        - monitors a website for updates, so that only the latest updated data is crawled
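The incremental idea above can be sketched without any framework: remember a fingerprint of each page's content and only treat a page as new data when the fingerprint changes. The URLs and the in-memory `seen` dict below are hypothetical placeholders; a real incremental crawler would persist the fingerprints (file, database, Redis, ...).

```python
import hashlib

def fingerprint(page_text: str) -> str:
    """Return a stable hash of the page content."""
    return hashlib.md5(page_text.encode('utf-8')).hexdigest()

def is_updated(url: str, page_text: str, seen: dict) -> bool:
    """True if the page at `url` changed since the last crawl."""
    fp = fingerprint(page_text)
    if seen.get(url) == fp:
        return False          # content unchanged, skip it
    seen[url] = fp            # record the new fingerprint
    return True

# demo with a plain dict as the fingerprint store
seen = {}
print(is_updated('http://example.com', '<html>v1</html>', seen))  # True: first visit
print(is_updated('http://example.com', '<html>v1</html>', seen))  # False: unchanged
print(is_updated('http://example.com', '<html>v2</html>', seen))  # True: updated
```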
Using the requests module
```python
# Crawl the Sogou home page source
import requests

# 1. Specify the URL
url = 'https://www.sogou.com/'
# 2. requests.get() returns a Response object
response = requests.get(url=url)
# 3. Extract the response data (response.text returns it as a string)
page_text = response.text
# 4. Persist it
with open('sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
```
```python
# A simple web collector: the url needs to carry dynamic parameters
url = 'https://www.sogou.com/web'
# Make the parameters dynamic
wd = input('Enter a keyword: ')
params = {'query': wd}
# Pass the request parameters as a dictionary via the params argument of get()
response = requests.get(url=url, params=params)
page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
```
Running this code reveals two problems:
- 1. the saved page is garbled (mojibake)
- 2. the amount of data returned is not right
```python
# Fix the garbled output
url = 'https://www.sogou.com/web'
# Make the parameters dynamic
wd = input('Enter a keyword: ')
params = {'query': wd}
# Pass the request parameters as a dictionary via the params argument of get()
response = requests.get(url=url, params=params)
response.encoding = 'utf-8'  # override the encoding of the response data
page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
```
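The mojibake happens because requests guesses the response encoding from the headers; when the guess is wrong, the UTF-8 bytes get decoded with the wrong codec. A minimal offline sketch of what setting `response.encoding = 'utf-8'` actually fixes:

```python
# bytes exactly as a UTF-8 server would send them
raw = '搜狗搜索'.encode('utf-8')

# decoding with a wrongly guessed codec produces mojibake
garbled = raw.decode('iso-8859-1')

# decoding with the real codec recovers the text,
# which is what response.encoding = 'utf-8' makes response.text do
correct = raw.decode('utf-8')
print(correct)  # 搜狗搜索
```

When the right charset is unknown, `response.apparent_encoding` (requests' guess from the response body) can be assigned to `response.encoding` instead of hard-coding `'utf-8'`.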
- UA detection: the portal inspects the User-Agent of the request carrier to decide whether the request was launched by a crawler
- UA spoofing: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
```python
# Defeat UA detection
url = 'https://www.sogou.com/web'
wd = input('Enter a keyword: ')
params = {'query': wd}
# Disguise the request as coming from a regular browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
response.encoding = 'utf-8'  # override the encoding of the response data
page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
```
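When every request needs the same spoofed UA, a `requests.Session` can carry it once instead of passing `headers` to every call. A small sketch, using the same UA string as above:

```python
import requests

session = requests.Session()
# headers set on the session are sent with every request it makes
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
})
print(session.headers['User-Agent'])
# every session.get()/session.post() now carries the spoofed UA automatically
```

A session also keeps cookies between requests, which some portals require.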
Dynamically loaded page data
- Some data is not in the page HTML itself but is fetched through a separate (Ajax) request
```python
# Crawl the Douban movie ranking
url = 'https://movie.douban.com/j/chart/top_list'
start = input('Which movie do you want to start from: ')
limit = input('How many movies do you want to fetch: ')
dic = {
    'type': '13',
    'interval_id': '100:90',
    'action': '',
    'start': start,
    'limit': limit,
}
# reuse the headers dict defined above
response = requests.get(url=url, params=dic, headers=headers)
page_text = response.json()  # json() returns the deserialized object, here a list of dicts
for dic in page_text:
    print(dic['title'] + ':' + dic['score'])
```
```python
# KFC store query  http://www.kfc.com.cn/kfccda/storelist/index.aspx
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
address = input('Enter a keyword to query: ')
for i in range(1, 5):
    dic = {
        'cname': '',
        'pid': '',
        'keyword': address,
        'pageIndex': str(i),
        'pageSize': '10',
    }
    response = requests.post(url=url, headers=headers, data=dic)
    page_text = response.json()
    for a in page_text['Table1']:
        print('Restaurant name:', a['storeName'] + ' restaurant',
              '---- restaurant address:', a['addressDetail'])
```
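The KFC endpoint above expects a form-encoded POST body, which is why the request passes the dictionary as `data=` rather than `json=`. The difference can be seen offline by preparing requests without sending them (the example URL is a placeholder):

```python
import requests

payload = {'keyword': 'Beijing', 'pageIndex': '1'}

# data= sends the body as application/x-www-form-urlencoded
form_req = requests.Request('POST', 'http://example.com', data=payload).prepare()
print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded

# json= serializes the body as application/json instead
json_req = requests.Request('POST', 'http://example.com', json=payload).prepare()
print(json_req.headers['Content-Type'])  # application/json
```

Picking the wrong one is a common reason a POST endpoint returns an error or an empty result even though the parameters look correct.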
Exercise: crawl the business detail information from the State Drug Administration: http://125.35.6.84:81/xk/
```python
# State Drug Administration
import requests

url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
dic = {
    'on': 'true',
    'page': '1',
    'pageSize': '15',
    'productName': '',
    'conditionType': '1',
    'applyname': '',
    'applysn': '',
}
response = requests.post(url=url, headers=headers, data=dic)
page_text = response.json()
for page in page_text['list']:
    # print(page['EPS_NAME'], page['ID'])
    _url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    _dic = {'id': page['ID']}
    _response = requests.get(url=_url, headers=headers, params=_dic)
    _page_text = _response.json()
    print(_page_text['epsName'], '-----', _page_text['legalPerson'])
```