## 2. The requests module

- Concept: a module, built on Python's networking stack, used to simulate a browser sending requests.
- Coding workflow:
  - specify the URL
  - send the request
  - fetch the response data (the data to crawl)
  - persist the data
- Environment setup:
  - `pip install requests`
- **requests.get / requests.post:**
  - `url`
  - `data` / `params`: wrap the request parameters
  - `headers`: UA spoofing
- What kinds of data are loaded dynamically (i.e. fetched by a separate, additional request)?
  - ajax
  - js
- How to tell whether a page contains dynamically loaded data?
  - local search
  - global search
- The first thing to do before crawling an unfamiliar site?
  - Determine whether the data you want to crawl is loaded dynamically!!!

### 2.1 Fixing garbled Chinese output

```python
# Fix garbled Chinese text in the response
import requests

wd = input('Enter a keyword: ')
url = 'https://www.sogou.com/web'
# the dynamic request parameters
params = {
    'query': wd
}
# params is required for this request:
# it wraps the query-string parameters into the request URL
response = requests.get(url=url, params=params)
# manually override the response encoding!!!
response.encoding = 'utf-8'
page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(wd, 'downloaded successfully!')
### 2.2 The UA anti-crawl mechanism and UA spoofing

```python
# Problem: most sites verify incoming requests; if a request does not appear
# to come from a browser, the server may deny access.
# Fix garbled Chinese text & spoof the UA
import requests

wd = input('Enter a keyword: ')
url = 'https://www.sogou.com/web'
# the dynamic request parameters
params = {
    'query': wd
}
# the header info to attach to the outgoing request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
# params wraps the query-string parameters into the request URL
# headers is used to spoof the UA
response = requests.get(url=url, params=params, headers=headers)
# manually override the response encoding
response.encoding = 'utf-8'
page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(wd, 'downloaded successfully!')
```

### 2.3 Dynamically loaded data

- How to tell whether a page contains dynamically loaded data?
  - use a packet-capture tool to do a local search
- If the page does load data dynamically, how do you locate that data?
  - use a packet-capture tool to do a global search
- Before crawling an unfamiliar site, you must first determine whether the target data is loaded dynamically!!!

Example: we need to crawl enterprise detail information from http://125.35.6.84:81/xk/

Analysis:

1. Is the data on the home page and the detail pages loaded dynamically? Yes.
2. How is the detail data of a single company fetched? It is requested via an ajax (post) request.
   - request url: http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById
   - the request carries one parameter: id: xxdxxxx
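The "local search" step above can also be sketched in code: if a value you can see in the rendered page is missing from the raw HTML that the server returns, it was almost certainly loaded later by ajax/js. The helper name and sample HTML below are illustrative, not from the original notes:

```python
def is_dynamically_loaded(page_source: str, visible_text: str) -> bool:
    """Local search: return True if text visible in the rendered page is
    absent from the raw HTML, i.e. it was probably fetched by ajax/js."""
    return visible_text not in page_source

# Static page: the text is shipped directly in the HTML
static_html = '<html><body><p>Total revenue: 1024</p></body></html>'
# Dynamic page: the HTML only ships an empty container filled in by js
dynamic_html = ('<html><body><div id="revenue"></div>'
                '<script src="app.js"></script></body></html>')

print(is_dynamically_loaded(static_html, '1024'))   # False
print(is_dynamically_loaded(dynamic_html, '1024'))  # True

# Hypothetical live usage (requires requests and a reachable url/headers):
# page = requests.get(url, headers=headers).text
# is_dynamically_loaded(page, 'some text you saw in the browser')
```

This is exactly what a manual Ctrl+F in the packet-capture tool's response tab does; the function just makes the check repeatable.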
Conclusions:

1. The detail data of each company is requested via an ajax post request.
2. The ajax request url is the same for every company; all of them use post, and only the value of the `id` parameter differs.
3. We only need the id of each company to fetch that company's detail data.

Idea for getting each company's id: the id of each company should be stored in a request or response associated with the home page.

Conclusion: the ids are stored in the response data of one of the home page's ajax requests; we just need to extract/parse the company ids out of that response.

```python
# Implementation
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
# url of the home-page ajax request that carries the company ids
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
data = {
    'on': 'true',
    'page': '1',
    'pageSize': '15',
    'productName': '',
    'conditionType': '1',
    'applyname': '',
    'applysn': '',
}
fp = open('./company_detail.txt', 'w', encoding='utf-8')
# .json() returns a dict whose values contain each company's id
data_dic = requests.post(url=url, data=data, headers=headers).json()
# parse out the ids
for dic in data_dic['list']:
    _id = dic['ID']
    # print(_id)
    # capture (request) the detail data corresponding to each company's id
    post_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    post_data = {
        'id': _id
    }
    # the json return value is the detail info of one company
    detail_dic = requests.post(url=post_url, data=post_data, headers=headers).json()
    company_title = detail_dic['epsName']
    address = detail_dic['epsProductAddress']
    fp.write(company_title + ':' + address + '\n')
    print(company_title, 'crawled successfully!!!')
fp.close()
```
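The payload above only fetches page 1 of the company list. Since the home-page ajax request is parameterized by `page`, crawling the whole list is just a matter of varying that one field. A hedged sketch (the helper name is mine; the field names are copied from the request payload above):

```python
def page_payloads(n_pages: int, page_size: int = 15) -> list:
    """Build the form payload for each page of the getXkzsList ajax request."""
    return [
        {
            'on': 'true',
            'page': str(p),              # only this field changes per request
            'pageSize': str(page_size),
            'productName': '',
            'conditionType': '1',
            'applyname': '',
            'applysn': '',
        }
        for p in range(1, n_pages + 1)
    ]

payloads = page_payloads(3)
print(len(payloads))          # 3
print(payloads[-1]['page'])   # '3'

# Each payload would then replace `data` in the crawl above, e.g.:
# for data in page_payloads(5):
#     data_dic = requests.post(url=url, data=data, headers=headers).json()
#     ...
```

Keeping the payload construction in one function also makes it easy to tweak `pageSize` or add a polite `time.sleep()` between pages without touching the parsing loop.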