Python crawler in practice: Bilibili video covers

This post is adapted mainly from Mr. Cui's street-snap image crawler and a few other blog posts. With it I learned how to crawl video covers from Bilibili (B站). My skills are limited, so the code fetches only one cover per run.

First, open the Bilibili home page.

 

 Then pick any video you like, click into it, and watch the requests listed in the Network panel of the browser's developer tools.

 

 

Eventually we find the entry we are looking for.

 

 In this entry, the data field contains a lot of information about the video. Opening the URL stored under pic directly in the browser gives us the cover image.

 

 So what we need to do now is use Python to request URL = 'https://api.bilibili.com/x/web-interface/view?aid=66698107&cid=115671196' and then extract pic from the result. The complete script is below.
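 As a side note, the query string does not have to be typed by hand. The sketch below assembles the same URL with the standard library. The helper name build_view_url is hypothetical and not part of the original script, and passing cid as an optional argument is an assumption.

```python
from urllib.parse import urlencode

# Endpoint taken from the article
API_VIEW = "https://api.bilibili.com/x/web-interface/view"


def build_view_url(aid, cid=None):
    """Build the view-API URL for a video (hypothetical helper)."""
    params = {"aid": aid}
    if cid is not None:
        params["cid"] = cid
    # urlencode turns the dict into "aid=...&cid=..." and escapes values
    return API_VIEW + "?" + urlencode(params)


print(build_view_url(66698107, 115671196))
```

 The script below passes only aid, which corresponds to calling build_view_url(av) with a single argument.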

import json
import os
import re
import requests
from urllib import request

av = input('Enter the AV number to query: ')
url = 'https://api.bilibili.com/x/web-interface/view?aid=%s' % (av,)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36',
    'Referer': 'https://www.bilibili.com/v/douga?spm_id_from=333.334.b_62696c695f646f756761.2',
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    # The commented-out headers caused errors; other blog posts showed
    # that the request works without them.
    # 'Accept-Encoding': 'gzip, deflate, br',
    # 'Accept-Language': 'zh-CN,zh;q=0.9',
    # 'Cache-Control': 'max-age=',
    # 'Connection': 'keep-alive'
}

response = requests.get(url, headers=headers)
# response.text is a str, so it has to be parsed into JSON data
content = json.loads(response.text)
statue_code = content.get('code')
if statue_code == 0:
    print(content.get('data').get('pic'))
    print(content.get('data').get('title'))
    img = content.get('data').get('pic')
    # Some Bilibili titles contain punctuation that is not allowed in file
    # names, so the regular expression strips English letters, digits and
    # some punctuation, leaving the Chinese characters as the file name.
    name = re.sub(r"[A-Za-z0-9!%\[\],./]", "", content.get('data').get('title'))
    request.urlretrieve(img, name + '.jpg')  # save the cover under the title
else:
    print('The AV number does not exist')

In the code, statue_code holds the code field that can be seen in the response. After some analysis, it turns out to indicate the status of the request: only when statue_code == 0 does the response carry data.
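The status check can be tried without touching the network. Below is a minimal sketch that runs the same logic on hand-written sample responses. The function name extract_cover is hypothetical, and the sample values (the example.com URL, the title, the error code -1) are made-up placeholders, not real API output.

```python
import json


def extract_cover(payload):
    """Return (pic, title) when code == 0, otherwise None (hypothetical helper)."""
    if payload.get("code") != 0:
        return None
    # "data" may be missing or null on failure, so fall back to an empty dict
    data = payload.get("data") or {}
    return data.get("pic"), data.get("title")


# Hand-written samples that only mimic the fields named in the article
ok = json.loads('{"code": 0, "data": {"pic": "http://example.com/cover.jpg", "title": "demo"}}')
bad = json.loads('{"code": -1}')

print(extract_cover(ok))
print(extract_cover(bad))
```

Returning None instead of chaining .get calls avoids an AttributeError when the data field is absent.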
As for the file name, name: some Bilibili video titles contain punctuation that cannot appear in a file name, so a regular expression strips English letters, digits and some punctuation, leaving mainly the Chinese characters of the title.
The final crawl result:

 

 In the folder:

 

 Feel free to discuss and learn together.



 


Origin www.cnblogs.com/KangZP/p/11468316.html