One What is a crawler
# 1. What is the Internet?
The Internet is a collection of computers connected to one another by network devices (cables, routers, switches, firewalls, etc.) so that together they form a single net.

# 2. Why was the Internet built?
The core value of the Internet is sharing/transferring data: data is stored on computers, and connecting those computers into the Internet makes it convenient to share and transfer that data between them; otherwise you would have to carry a USB drive around to copy data from someone else's computer.

# 3. What is "surfing the Internet"? What does a crawler do?
What we call surfing the Internet is the process of a client computer sending a request to a target computer and downloading the target computer's data to the local machine.

# 3.1 A user accesses network data like this: the browser submits a request -> the page code is downloaded -> it is parsed/rendered into a page.

# 3.2 What a crawler does: simulate a browser to send a request -> download the page code -> extract only the useful data -> store it in a database or file.

# The difference between 3.1 and 3.2: our crawler extracts only the data in the page code that is useful to us.

# 4. Summing up crawlers
# 4.1 A metaphor for crawlers: if we compare the Internet to a big spider web, then the data on each computer is prey caught in the web, and the crawler is a little spider that crawls along the web grabbing the prey/data it wants.
# 4.2 Definition: a program that sends requests to websites, obtains resources, and then analyzes and extracts useful data from them.
# 4.3 The value of crawlers: the most valuable thing on the Internet is data, such as product information on Tmall, house listings on Lianjia, and securities investment information on Xueqiu; in every industry this data represents real money. You could say that whoever masters first-hand data in an industry becomes the master of that industry. If we compare all the data on the Internet to a treasure, then this crawler course teaches you how to mine that treasure efficiently; once you have mastered crawling, every Internet company effectively works for you, supplying you with valuable data for free.
Two The basic flow of a crawler
# 1. Send a request
Use an http library to send a request to the target site, i.e. send a Request.
The Request contains: the request line, request headers, request body, etc.

# 2. Get the response content
If the server responds normally, you get a Response.
The Response contains: html, json, pictures, videos, etc.

# 3. Parse the content
Parse html data: regular expressions, or third-party parsing libraries such as Beautifulsoup and pyquery.
Parse json data: the json module.
Handle binary data: write it to a file in 'wb' mode.

# 4. Save the data
In a database or in files.
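The four steps above can be sketched as one minimal skeleton. This is an illustrative sketch only, not code from this tutorial: the regex, the target url, and the `links.txt` file name are placeholder assumptions.

```python
import re
import requests

def fetch(url):
    # Steps 1-2: send the Request and return the Response body on success.
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

def parse(html):
    # Step 3: extract only the useful data (here: every href) with a regex.
    return re.findall(r'href="(.*?)"', html)

def save(links, path='links.txt'):
    # Step 4: store the extracted data in a file.
    with open(path, 'a') as f:
        for link in links:
            f.write(link + '\n')

# In a real run you would chain the steps:
#   save(parse(fetch('http://example.com')))
print(parse('<a href="http://a.com">home</a>'))
```

Everything that follows in this tutorial is a refinement of these four steps.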
Three Request and Response
# The HTTP protocol: http://www.cnblogs.com/linhaifeng/articles/8243379.html
# Request: the user sends their information through a browser (socket client) to the server (socket server).
# Response: the server receives the request, analyzes the request information sent by the user, and then returns the data (the returned data may contain links to other resources, such as pictures, js, css, etc.).
# PS: after receiving the Response, the browser parses its content and displays it to the user, while a crawler imitates the browser: it sends a request, receives the Response, and then extracts the useful data from it.
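At the socket level, the Request the browser sends is just formatted text. As a hedged sketch (the host name is only an example), this is roughly what travels over the wire for a GET:

```python
def build_get_request(host, path='/'):
    # An HTTP/1.1 GET request is plain text: a request line,
    # then headers, then a blank line that ends the header section.
    return (
        'GET %s HTTP/1.1\r\n'
        'Host: %s\r\n'
        'Connection: close\r\n'
        '\r\n'
    ) % (path, host)

print(build_get_request('www.cnblogs.com'))
```

Sending these bytes over `socket.create_connection((host, 80))` and reading the reply yields the Response (status line, headers, body); libraries like requests build and parse this text for you.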
Four Request
# 1. Request methods
Common request methods: GET, POST
Other request methods: HEAD, PUT, DELETE, OPTIONS

```
PS: you can demonstrate the difference between get and post in the browser (use a login form to demonstrate post).
Both post and get parameters are ultimately spliced into this form: k1=xxx&k2=yyy&k3=zzz
POST: the parameters travel in the request body; viewed in the browser's dev tools, they appear under Form Data.
GET: the parameters are appended directly to the url.
```

# 2. Request url
url stands for uniform resource locator; a web document, a picture, or a video can each be uniquely identified by a url.

```
url encoding
https://www.baidu.com/s?wd=图片
A non-ASCII query word such as 图片 ("picture") will be url-encoded (see the sample code below).
```

```
How a page loads:
when a web page is loaded, the document itself is usually loaded first; while parsing the document, whenever a link such as an image hyperlink is encountered, a further request is sent to download that resource.
```

# 3. Request headers
User-Agent: if the request carries no user-agent header, the server may treat you as an illegal user.
Host
Cookie: cookies are used to store login information.

```
A crawler should generally add request headers.
```

# 4. Request body
A get request has no request body.
A post request carries its data in the request body, in Form Data format.

```
PS:
1. Login forms and file uploads attach their information in the request body.
2. To actually see a post: enter a wrong username and password on a login page and then submit; normally a successful login jumps to a new page right away, so the post cannot be captured.
```
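The k1=xxx&k2=yyy splicing and the url-encoding of non-ASCII query words described above can be seen directly with the standard library (a standalone illustration):

```python
from urllib.parse import urlencode

# GET and POST parameters are both serialized into k1=xxx&k2=yyy form.
print(urlencode({'k1': 'xxx', 'k2': 'yyy', 'k3': 'zzz'}))  # k1=xxx&k2=yyy&k3=zzz

# Non-ASCII values are percent-encoded as UTF-8 bytes, which is what
# happens to the Chinese query word in https://www.baidu.com/s?wd=图片
print(urlencode({'wd': '图片'}))  # wd=%E5%9B%BE%E7%89%87
```

This is the same encoding that `requests.get(url, params=...)` applies for you in the sample code.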
from urllib.parse import urlencode
import requests

headers={
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Cookie':'3mBTbA3JDYXjjRX56GhfO_0R3jsJKRy66jK4JKjHKet6vP; ispeed_lsm=0; H_PS_PSSID=1421_24558_21120_17001_24880_22072; BD_UPN=123253; H_PS_645EC=44be6I1wqYYVvyugm2gc3PK9PoSa26pxhzOVbeQrn2rRadHvKoI%2BCbN5K%2Bg; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598',
    'Host':'www.baidu.com',
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
}

# Equivalent to the line below; internally params calls urlencode:
# response=requests.get('https://www.baidu.com/s?'+urlencode({'wd':'美女'}),headers=headers)
response=requests.get('https://www.baidu.com/s',params={'wd':'美女'},headers=headers)
print(response.text)
Five Response
# 1. Response status
200: success
301: redirect
404: file does not exist
403: permission denied
502: server error

# 2. Response headers
Set-Cookie: there may be more than one; it tells the browser to save the cookie.

# 3. Preview (the response body)
The most important part: it contains the content of the requested resource, such as the page's html, a picture's binary data, and so on.
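One common way to act on the status codes listed above is a small helper that classifies a response before parsing it. The category names here are my own illustration, not a requests API:

```python
def classify_status(code):
    # Map the status codes listed above to a rough action for a crawler.
    if 200 <= code < 300:
        return 'ok'            # 200: success, safe to parse
    if code in (301, 302):
        return 'redirect'      # 301: follow the Location header
    if code == 403:
        return 'forbidden'     # 403: likely missing headers/cookies
    if code == 404:
        return 'missing'       # 404: the resource does not exist
    if code >= 500:
        return 'server-error'  # 502 etc.: worth retrying later
    return 'other'

print(classify_status(200))  # ok
```

In practice requests follows 301/302 redirects automatically, so a crawler usually only needs to check `response.status_code == 200` before parsing, as the sample code in this tutorial does.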
Six Summary
# 1. The crawler flow in a nutshell:
crawl ---> parse ---> store

# 2. Tools a crawler needs:
Request libraries: requests, selenium
Parsing libraries: regular expressions, beautifulsoup, pyquery
Storage: files, MySQL, MongoDB, Redis

# 3. A widely used crawler framework: scrapy
Seven Crawling xiaohuar.com videos

import requests
import re
import time
import hashlib

def get_page(url):
    print('GET %s' % url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.content
    except Exception:
        pass

def parse_index(res):
    obj = re.compile('class="items.*?<a href="(.*?)"', re.S)
    detail_urls = obj.findall(res.decode('gbk'))
    for detail_url in detail_urls:
        if not detail_url.startswith('http'):
            detail_url = 'http://www.xiaohuar.com' + detail_url
        yield detail_url

def parse_detail(res):
    obj = re.compile('id="media".*?src="(.*?)"', re.S)
    res = obj.findall(res.decode('gbk'))
    if len(res) > 0:
        movie_url = res[0]
        return movie_url

def save(movie_url):
    response = requests.get(movie_url, stream=False)
    if response.status_code == 200:
        m = hashlib.md5()
        m.update(('%s%s.mp4' % (movie_url, time.time())).encode('utf-8'))
        filename = m.hexdigest()
        with open(r'./movies/%s.mp4' % filename, 'wb') as f:
            f.write(response.content)
            f.flush()

def main():
    index_url = 'http://www.xiaohuar.com/list-3-{0}.html'
    for i in range(5):
        print('*' * 50, i)
        # crawl the index page
        index_page = get_page(index_url.format(i))
        # parse the index page to get the detail-page urls of the videos
        detail_urls = parse_index(index_page)
        # loop over the detail pages
        for detail_url in detail_urls:
            # crawl the detail page
            detail_page = get_page(detail_url)
            # get the url of the video
            movie_url = parse_detail(detail_page)
            if movie_url:
                # save the video
                save(movie_url)

if __name__ == '__main__':
    main()

# Concurrent crawling
from concurrent.futures import ThreadPoolExecutor
import requests
import re
import time
import hashlib
from threading import current_thread

p = ThreadPoolExecutor(50)

def get_page(url):
    print('%s GET %s' % (current_thread().getName(), url))
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.content
    except Exception as e:
        print(e)

def parse_index(res):
    print('%s parse index' % current_thread().getName())
    res = res.result()
    obj = re.compile('class="items.*?<a href="(.*?)"', re.S)
    detail_urls = obj.findall(res.decode('gbk'))
    for detail_url in detail_urls:
        if not detail_url.startswith('http'):
            detail_url = 'http://www.xiaohuar.com' + detail_url
        p.submit(get_page, detail_url).add_done_callback(parse_detail)

def parse_detail(res):
    print('%s parse detail' % current_thread().getName())
    res = res.result()
    obj = re.compile('id="media".*?src="(.*?)"', re.S)
    res = obj.findall(res.decode('gbk'))
    if len(res) > 0:
        movie_url = res[0]
        print('MOVIE_URL: ', movie_url)
        with open('db.txt', 'a') as f:
            f.write('%s\n' % movie_url)
        # save(movie_url)
        p.submit(save, movie_url)
        print('download task for %s has been submitted' % movie_url)

def save(movie_url):
    print('%s SAVE: %s' % (current_thread().getName(), movie_url))
    try:
        response = requests.get(movie_url, stream=False)
        if response.status_code == 200:
            m = hashlib.md5()
            m.update(('%s%s.mp4' % (movie_url, time.time())).encode('utf-8'))
            filename = m.hexdigest()
            with open(r'./movies/%s.mp4' % filename, 'wb') as f:
                f.write(response.content)
                f.flush()
    except Exception as e:
        print(e)

def main():
    index_url = 'http://www.xiaohuar.com/list-3-{0}.html'
    for i in range(5):
        p.submit(get_page, index_url.format(i)).add_done_callback(parse_index)

if __name__ == '__main__':
    main()