Python crawler practice: scraping "Pear Video"

First, the basic crawler workflow:
 1. Send a request (request libraries: requests, Selenium)
 2. Fetch the response data returned by the server
 3. Parse and extract the data (parsing libraries: re, BeautifulSoup, XPath)
 4. Save the data (storage libraries: e.g. MongoDB)
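The four steps above can be sketched end to end in a few lines. This is a minimal offline sketch: a hand-written HTML string stands in for the server response (the link markup and the `ids.txt` output file are assumptions for illustration), and a plain text file stands in for MongoDB.

```python
import re

# Step 1 would be requests.get(url); here a hand-written HTML string
# stands in for the server response so the sketch runs offline
# (assumption: links of the form below).
html = '<a href="video_123">clip one</a> <a href="video_456">clip two</a>'

# Steps 2-3: parse and extract data with the re module
ids = re.findall('<a href="video_(.*?)"', html, re.S)

# Step 4: save the data (here: a plain text file instead of MongoDB)
with open('ids.txt', 'w') as f:
    for m_id in ids:
        f.write(m_id + '\n')

print(ids)  # ['123', '456']
```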

Second, scraping a single video from "Pear Video"

# crawl one Pear Video clip
import requests

url = 'https://video.pearvideo.com/mp4/adshort/20190613/cont-1565846-14013215_adpkg-ad_hd.mp4'
res = requests.get(url)

# write the crawled video file to disk
with open('pear_video.mp4', 'wb') as f:
    f.write(res.content)

Third, using regular expressions

1. re.findall('pattern', text, flags) returns every match of the pattern in the text
2. re.S: "global" (DOTALL) mode, letting a pattern match across the entire text
3. '.' matches any single character (without re.S, any character except a newline)
4. '*' matches zero or more repetitions, so '.*?' grabs everything up to the next part of the pattern
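The effect of re.S and the non-greedy '(.*?)' group used throughout this post can be seen in a quick demo (the sample strings are made up for illustration):

```python
import re

text = 'first line\nsecond line'

# without re.S, '.' does not match the newline, so no match spans both lines
print(re.findall('first.*second', text))        # []
# with re.S (DOTALL), '.' matches '\n' too, so the pattern spans the text
print(re.findall('first.*second', text, re.S))  # ['first line\nsecond']
# '(.*?)' is the non-greedy group the crawler's patterns rely on
print(re.findall('"(.*?)"', '"a" and "b"'))     # ['a', 'b']
```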

Fourth, scraping every video on the "Pear Video" home page

import requests
import re
import uuid

# 1. send a request
def get_page(url):
    response = requests.get(url)
    return response

# 2. parse the data
def parse_index(text):
    res = re.findall('<a href="video_(.*?)"', text, re.S)

    detail_url_list = []
    for m_id in res:
        detail_url = 'https://www.pearvideo.com/video_' + m_id
        detail_url_list.append(detail_url)

    return detail_url_list

# parse the detail page to get the video url
def parse_detail(text):
    movie_url = re.findall('srcUrl="(.*?)"', text, re.S)[0]
    return movie_url

# 3. save the data
def save_movie(movie_url):
    response = requests.get(movie_url)
    with open(f'{uuid.uuid4()}.mp4', 'wb') as f:
        f.write(response.content)
        f.flush()


# entry point (IDE shortcut: type "main", then press Enter)
if __name__ == '__main__':
    # 1. send a request to the home page
    index_res = get_page(url='https://www.pearvideo.com/')
    # 2. parse the home page for the detail-page ids
    detail_url_list = parse_index(index_res.text)

    # 3. send a request to each detail-page url
    for detail_url in detail_url_list:
        detail_res = get_page(url=detail_url)

        # 4. parse the detail page for the video url
        movie_url = parse_detail(detail_res.text)
        print(movie_url)

        # 5. save the video
        save_movie(movie_url)
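The two regular expressions driving parse_index and parse_detail can be exercised without touching the network. The hand-written snippets below are assumptions standing in for the real home and detail pages:

```python
import re

# assumed stand-ins for the real pages (only the matched parts matter)
index_html = '<a href="video_1565846" class="vervideo-lilink"></a>'
detail_html = 'var contId="1565846";srcUrl="https://video.pearvideo.com/mp4/x.mp4",vdoUrl=srcUrl'

# same patterns as parse_index / parse_detail above
ids = re.findall('<a href="video_(.*?)"', index_html, re.S)
detail_urls = ['https://www.pearvideo.com/video_' + m_id for m_id in ids]
movie_url = re.findall('srcUrl="(.*?)"', detail_html, re.S)[0]

print(detail_urls)  # ['https://www.pearvideo.com/video_1565846']
print(movie_url)    # https://video.pearvideo.com/mp4/x.mp4
```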

Fifth, scraping every video on "Pear Video" with a thread pool

import requests
import re   # regular-expression module
# uuid.uuid4() generates a unique random string based on the current timestamp
import uuid
# import the thread-pool module
from concurrent.futures import ThreadPoolExecutor
# limit the pool to 50 threads
pool = ThreadPoolExecutor(50)

# the crawler trilogy

# 1. send a request
def get_page(url):
    print(f'start asynchronous task: {url}')
    response = requests.get(url)
    return response


# 2. parse the data
# parse the home page to get the video ids
def parse_index(res):

    response = res.result()
    # extract all ids on the home page
    id_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
    # print(res)

    # loop over the id list
    for m_id in id_list:
        # build the detail-page url
        detail_url = 'https://www.pearvideo.com/video_' + m_id
        # print(detail_url)
        # submit the detail-page url to the get_page function
        pool.submit(get_page, detail_url).add_done_callback(parse_detail)


# parse the detail page for the video url
def parse_detail(res):
    response = res.result()
    movie_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    # asynchronously submit the video url to get_page; its result goes to save_movie
    pool.submit(get_page, movie_url).add_done_callback(save_movie)


# 3. save the data
def save_movie(res):

    movie_res = res.result()

    # write the video to the local disk
    with open(f'{uuid.uuid4()}.mp4', 'wb') as f:
        f.write(movie_res.content)
        print(f'video download finished: {movie_res.url}')
        f.flush()


if __name__ == '__main__':  # IDE shortcut: main + Enter

    # send an asynchronous request via get_page; its result goes to parse_index
    url = 'https://www.pearvideo.com/'
    pool.submit(get_page, url).add_done_callback(parse_index)
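The pool.submit(...).add_done_callback(...) chain that drives the crawler can be demonstrated in isolation, with no network traffic. The squaring function below is a hypothetical stand-in for get_page:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(4)
results = []

def work(n):
    # stand-in for get_page: any callable the pool can schedule
    return n * n

def on_done(fut):
    # the callback receives a Future; .result() is work()'s return value
    results.append(fut.result())

for n in range(5):
    pool.submit(work, n).add_done_callback(on_done)

# block until every submitted task (and its callback) has run
pool.shutdown(wait=True)
print(sorted(results))  # [0, 1, 4, 9, 16]
```

Note that add_done_callback fires as soon as each Future completes, which is why the crawler can chain get_page → parse_detail → save_movie without ever waiting explicitly.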

 

Origin www.cnblogs.com/lweiser/p/11035236.html