# After-class summary
# "Crawlers" are where the uphill climb starts, and I don't fully understand them yet.

# Part 1: How a crawler works
# 1. What is the Internet?
#    A collection of network devices that links computers together;
#    the resulting platform is what we call the Internet.
# 2. Why was the Internet built?
#    To transfer and share data.
# 3. The whole process of going online
#    - Normal user:
#      open a browser -> send a request to the target site -> fetch the response data -> render it in the browser
#    - Crawler:
#      simulate a browser -> send a request to the target site -> fetch the response data -> extract the valuable data -> persist the data
# 4. What request does the browser send?
#    An HTTP request.
#    - Client:
#      the browser is a piece of software -> client IP and port
#    - Server:
#      https://www.jd.com/
#      www.jd.com (the JD.com domain name) -> DNS resolution -> IP and port of the JD.com server
#    The client (IP and port) sends a request to the server (IP and port);
#    once the connection is established, the client can fetch the corresponding data.
# 5. The whole crawler process
#    - Send a request
#    - Fetch the response data (whenever a request is sent to the server, the server returns response data)
#    - Parse and extract the data (needs a parsing library: re, BeautifulSoup4, XPath ...)
#    - Save the data locally (file handling, or a database such as MongoDB)

# Example 1: fetch the Baidu home page as text and save it to an HTML file
# import requests
# response = requests.get(url='http://www.baidu.com/')
# response.encoding = 'utf-8'
# print(response.text)
# with open('baidu.html', 'w', encoding='utf-8') as f:
#     f.write(response.text)

# Example 2: download a video as binary data (视频.mp4 means "video.mp4")
# import requests
# response = requests.get('https://video.pearvideo.com/head/20190625/cont-1570107-14056273.mp4')
# print(response.content)
# with open('视频.mp4', 'wb') as f:
#     f.write(response.content)

# Example 3: collect the detail-page URL of every video linked from the home page
import re

import requests

response = requests.get('https://www.pearvideo.com/')
print(response.text)

# Capture the ID that follows "video_" in each link; re.S lets "." match newlines as well
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
print(res_list)

for v_id in res_list:
    # The pattern drops the "video_" prefix, so put it back when building the URL
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    print(detail_url)
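
# A minimal offline sketch of the "parse and extract" step from the notes above.
# It assumes the home-page links look like <a href="video_<id>">, which is what
# the regular expression in the script expects; the live markup may differ.
# SAMPLE_HTML, extract_video_ids and build_detail_urls are hypothetical names
# made up for this illustration, and the sample markup is invented so the
# sketch runs without network access.

```python
import re

BASE_URL = 'https://www.pearvideo.com/'

# Invented stand-in for the home-page HTML (assumption, not real site output)
SAMPLE_HTML = '''
<a href="video_1570107" class="vervideo-lilink">
<a href="video_1570302"
   class="vervideo-lilink">
<a href="video_1570107" class="actplay">
<a href="about.html">
'''


def extract_video_ids(html):
    """Return the IDs from <a href="video_..."> links, in page order, without repeats."""
    ids = re.findall(r'<a href="video_(.*?)"', html, re.S)
    # dict.fromkeys keeps the first occurrence of each ID and preserves order
    return list(dict.fromkeys(ids))


def build_detail_urls(video_ids):
    """Turn each ID back into a detail-page URL by restoring the "video_" prefix."""
    return [BASE_URL + 'video_' + v_id for v_id in video_ids]


video_ids = extract_video_ids(SAMPLE_HTML)
detail_urls = build_detail_urls(video_ids)
for url in detail_urls:
    print(url)
```

# To try this against the live site, pass requests.get(BASE_URL).text to
# extract_video_ids instead of SAMPLE_HTML. Removing repeats is an addition of
# this sketch: the same video can be linked more than once on a page, and a
# crawler usually only needs to visit each detail page once.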