1. Definition of a crawler:
A program that sends requests to a site, obtains the resources, and then parses and extracts useful data from them.
2. Basic workflow of a crawler:
(1) Send a request:
Use an http library to send an http request to the target site, i.e. send a Request. The Request includes: request headers, request body, etc.
(2) Get the response content:
If the server responds normally, you get a Response. The Response may include: html, json, pictures, videos, etc.
(3) Parse the content:
Parsing html data: regular expressions, or third-party parsing libraries such as Beautifulsoup and pyquery. Parsing json data: the json module. Parsing binary data: write it to a file in binary ('wb') mode.
(4) Store the data:
In a database or in files.
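The four steps above can be sketched end to end. To stay self-contained, the example below parses a canned html string instead of a real response (in a real crawler the html would come from `requests.get(url).text`); the html content and the `result.txt` filename are made-up placeholders.

```python
import re
import json

# (1)-(2) In a real crawler: html = requests.get(url).text
# Here we use a canned response body so the sketch runs without a network.
html = '<html><head><title>Demo</title></head><body>{"price": 42}</body></html>'

# (3) Parse the html with a regular expression (non-greedy match)
title = re.findall(r'<title>(.*?)</title>', html)[0]

# (3) Parse the embedded json with the json module
data = json.loads(re.findall(r'<body>(.*?)</body>', html)[0])

# (4) Store the data, here a simple file write
with open('result.txt', 'w') as f:
    f.write(f'{title}: {data["price"]}')
```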
3. Cookie pool and proxy pool:
Cookie pool: some sites limit the access frequency of a single user, so a pool of cookies is kept to simulate visits from different users.
Proxy pool: a pool of ip addresses used to simulate different users visiting the site.
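A minimal sketch of rotating through such pools. The proxy addresses and cookie values below are made-up placeholders; a real pool would hold working proxies and cookies from logged-in sessions.

```python
import random

# Hypothetical pools; the entries are placeholders, not real proxies/cookies
proxy_pool = [
    {'http': 'http://10.0.0.1:8888'},
    {'http': 'http://10.0.0.2:8888'},
]
cookie_pool = [
    {'sessionid': 'aaa'},
    {'sessionid': 'bbb'},
]

# Pick a random identity for each request so the site sees different users
proxies = random.choice(proxy_pool)
cookies = random.choice(cookie_pool)
# requests.get(url, proxies=proxies, cookies=cookies)  # the real request would go here
```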
4. Forward proxy and reverse proxy:
Forward proxy: proxies for the client. For example, getting over the wall to access Google: you send a request to a proxy server, the proxy server makes the request to Google, and the requested data is returned to you.
Reverse proxy: proxies for the server. You send a request to one server (one ip address); it arrives at nginx (the reverse proxy server), which forwards it to a back-end server (there may be many of them, but to the client it looks like a single server).
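The forward-proxy case can be sketched with the `proxies` parameter of requests. The proxy address below is a made-up placeholder, and the actual network call is commented out.

```python
import requests

# Hypothetical forward proxy address; replace with a proxy you actually control
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888',
}

# The request would be sent to the proxy, which forwards it to Google:
# requests.get('https://www.google.com', proxies=proxies)
```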
5. Simulated login:
import requests

# Set the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
}
# Send the login POST request with the form parameters
res = requests.post('http://www.aa7a.cn/user.php', headers=headers, data={
    'username': '[email protected]',
    'password': '147258369..',
    'captcha': 'ubw3',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login'
})
# On a successful login, get the cookies from the response
cookie = res.cookies.get_dict()
# Verify the login: request the home page with the cookie and check
# that the page contains the logged-in username
res = requests.get('http://www.aa7a.cn/', headers=headers, cookies=cookie)
if '[email protected]' in res.text:
    print('Login succeeded!')
else:
    print('Login failed!')
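As an alternative sketch for the login flow above, `requests.Session` stores the cookies returned by the login response and re-sends them automatically, so they do not have to be passed by hand. The network calls are left commented out so the sketch is self-contained.

```python
import requests

# A Session keeps cookies returned by the server and attaches them to
# every later request made through the same session
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# session.post('http://www.aa7a.cn/user.php', data={...})  # login; cookies are saved
# session.get('http://www.aa7a.cn/')  # automatically sent with the saved login cookies
```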
6. Case: crawling Pear Video:
import re
import requests

# Steps to crawl Pear Video:
# 1. First get the target url whose content we want to crawl:
# https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=3&start=36&mrd=0.762001536578264&filterIds=1625871,1625877,1625918,1625694,1625682,1625684,1625687,1625674,1625664,1625661,1625648,1625645,1625634,1625614,1625604
# 2. Request the listing page we need to analyze
res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=3&start=36')
# 3. Use a regular expression to extract the video links, which look like:
# <a href="video_1625614" class="vervideo-lilink actplay">
reg_txt = '<a href="(.*?)" class="vervideo-lilink actplay">'
obj = re.findall(reg_txt, res.text)
# print(obj)
# 4. Join the path to get the page we need to visit, e.g. https://www.pearvideo.com/video_1625614
for url in obj:
    res_url = 'https://www.pearvideo.com/' + url
    # Get the video detail page
    res_v = requests.get(res_url)
    # 5. Match the final video address (findall returns a list)
    obj_v = re.findall('srcUrl="(.*?)"', res_v.text)
    # print(obj_v)
    obj1 = requests.get(obj_v[0])
    # Parse the video title from the url
    name = obj_v[0].rsplit('/', 1)[1]
    print(name)
    # Write to the file in a loop; iter_content() yields the binary stream
    with open(name, 'wb') as f:
        for line in obj1.iter_content():
            f.write(line)
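For large videos, the download step above can be sketched with `stream=True`, so the whole file is not held in memory, and a `chunk_size` for `iter_content()`. The function and its name are illustrative, and the network call is only made when the function is called.

```python
import requests

def download(url, filename):
    # stream=True defers downloading the body until iter_content() is consumed
    with requests.get(url, stream=True) as r:
        with open(filename, 'wb') as f:
            # 8192-byte chunks instead of the default byte-by-byte iteration
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

# The filename can be parsed from the video url as in the code above:
# 'https://video.pearvideo.com/mp4/a/b.mp4'.rsplit('/', 1)[1] gives 'b.mp4'
```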