First, the basic crawler process:
1. Send a request (request libraries: requests, Selenium)
2. Fetch the response data returned by the server
3. Parse and extract the data (parsing libraries: re, BeautifulSoup, XPath)
4. Save the data (storage: e.g. MongoDB; see the sketch after this list)
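To make the four steps concrete, here is a minimal end-to-end sketch. The href pattern is a placeholder for illustration, and a plain text file stands in for a real store such as MongoDB:

import re
import requests

# 1. send a request
response = requests.get('https://www.pearvideo.com/')

# 2. fetch the response data returned by the server
html = response.text

# 3. parse and extract data (here: every href on the page)
links = re.findall('href="(.*?)"', html, re.S)

# 4. save the data (a text file stands in for MongoDB in this sketch)
with open('links.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(links))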
Second, crawling a single video from Pear Video
# crawl one Pear Video clip
import requests

url = 'https://video.pearvideo.com/mp4/adshort/20190613/cont-1565846-14013215_adpkg-ad_hd.mp4'
res = requests.get(url)

# write the crawled video file to disk
with open('pear_video.mp4', 'wb') as f:
    f.write(res.content)
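Note that res.content buffers the entire file in memory before writing. As a hedged variant (not from the original post), requests can also stream the download in chunks, which matters for large videos:

import requests

url = 'https://video.pearvideo.com/mp4/adshort/20190613/cont-1565846-14013215_adpkg-ad_hd.mp4'
# stream=True fetches the body lazily instead of loading it all at once
with requests.get(url, stream=True) as res:
    with open('pear_video.mp4', 'wb') as f:
        for chunk in res.iter_content(chunk_size=64 * 1024):
            if chunk:  # skip keep-alive chunks
                f.write(chunk)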
Third, the use of regular expressions
1. re.findall('matching pattern', 'text to parse', 'matching mode') is the basic call: it returns every non-overlapping match of the pattern in the text.
2. re.S: "dotall" mode; it lets . match newlines as well, so one pattern can match across the entire text.
3. . matches any single character at the current position (a newline only when re.S is set).
4. * repeats the preceding pattern zero or more times, i.e. "find all of them"; combined as .*? it matches as little as possible (non-greedy). A short demo follows below.
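A small demo of these rules (the sample text is made up for illustration):

import re

text = '<a href="video_1565846">\n<a href="video_1565847">'

# the pattern used later in this post; (.*?) captures just the id
print(re.findall('<a href="video_(.*?)"', text, re.S))   # ['1565846', '1565847']

# without re.S, . stops at a newline; with re.S it matches across lines
print(re.findall('A(.*)B', 'A line1\nline2 B'))          # []
print(re.findall('A(.*)B', 'A line1\nline2 B', re.S))    # [' line1\nline2 ']

# greedy vs. non-greedy: .* grabs as much as it can, .*? stops early
print(re.findall('href="(.*)"', '<a href="a"> <a href="b">'))    # ['a"> <a href="b']
print(re.findall('href="(.*?)"', '<a href="a"> <a href="b">'))   # ['a', 'b']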
Fourth, crawling all of the videos on Pear Video
import requests
import re
import uuid

# 1. send a request
def get_page(url):
    response = requests.get(url)
    return response

# 2. parse the data
def parse_index(text):
    res = re.findall('<a href="video_(.*?)"', text, re.S)

    detail_url_list = []
    for m_id in res:
        detail_url = 'https://www.pearvideo.com/video_' + m_id
        detail_url_list.append(detail_url)

    return detail_url_list

# parse the detail page to get the video url
def parse_detail(text):
    movie_url = re.findall('srcUrl="(.*?)"', text, re.S)[0]
    return movie_url

# 3. save the data
def save_movie(movie_url):
    response = requests.get(movie_url)
    with open(f'{uuid.uuid4()}.mp4', 'wb') as f:
        f.write(response.content)
        f.flush()


# main entry (in PyCharm, type 'main' and press Enter)
if __name__ == '__main__':
    # 1. send a request to the home page
    index_res = get_page(url='https://www.pearvideo.com/')

    # 2. parse the home page for the detail-page ids
    detail_url_list = parse_index(index_res.text)

    # 3. send a request to every detail-page url
    for detail_url in detail_url_list:
        detail_res = get_page(url=detail_url)

        # 4. parse the detail page for the video url
        movie_url = parse_detail(detail_res.text)
        print(movie_url)

        # 5. save the video
        save_movie(movie_url)
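One design note: every fetch above goes through get_page, so a single hung connection or error page stalls or corrupts the whole run. A hedged hardening of that helper (not in the original post) adds a timeout and a status check:

# same signature as the get_page above, with two defensive additions
def get_page(url):
    response = requests.get(url, timeout=10)  # give up instead of hanging forever
    response.raise_for_status()               # turn 4xx/5xx pages into exceptions
    return response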
Fifth, crawling all of the Pear Video videos with a thread pool
import requests
import re  # regular expression module
# uuid.uuid4() can generate a globally unique random string based on the timestamp
import uuid
# import the thread pool module
from concurrent.futures import ThreadPoolExecutor
# limit the pool to 50 threads
pool = ThreadPoolExecutor(50)

# the crawler trilogy

# 1. send a request
def get_page(url):
    print(f'start asynchronous task: {url}')
    response = requests.get(url)
    return response


# 2. parse the data
# parse the home page to get the ids of the video pages
def parse_index(res):
    response = res.result()
    # extract all the ids on the home page
    id_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
    # print(res)

    # loop over the id list
    for m_id in id_list:
        # splice together the detail-page url
        detail_url = 'https://www.pearvideo.com/video_' + m_id
        # print(detail_url)
        # submit the detail-page url to the get_page function
        pool.submit(get_page, detail_url).add_done_callback(parse_detail)


# parse the detail page to get the video url
def parse_detail(res):
    response = res.result()
    movie_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    # submit the video url to get_page asynchronously; the result is passed on to save_movie
    pool.submit(get_page, movie_url).add_done_callback(save_movie)


# 3. save the data
def save_movie(res):
    movie_res = res.result()

    # write the video to a local file
    with open(f'{uuid.uuid4()}.mp4', 'wb') as f:
        f.write(movie_res.content)
        print(f'video download finished: {movie_res.url}')
        f.flush()


if __name__ == '__main__':  # main + Enter
    # send an asynchronous request via get_page; the result is passed on to parse_index
    url = 'https://www.pearvideo.com/'
    pool.submit(get_page, url).add_done_callback(parse_index)
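The chaining here relies on add_done_callback passing the finished Future object to the callback, which is why every handler starts by calling res.result(). A tiny standalone demo of that pattern (the names are made up, not part of the crawler):

from concurrent.futures import ThreadPoolExecutor

demo_pool = ThreadPoolExecutor(2)

def square(n):
    return n * n

def report(future):
    # the callback receives the Future, not the bare value
    print('result:', future.result())

demo_pool.submit(square, 7).add_done_callback(report)  # prints: result: 49
demo_pool.shutdown(wait=True)  # block until the submitted work is finished

The crawler above never calls shutdown itself; it leans on the fact that ThreadPoolExecutor's worker threads are non-daemon, so the program does not exit until all submitted tasks, including the ones submitted from inside the callbacks, have finished.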