Python web scraping: crawling the movies now showing on Taopiaopiao

Today I learned a bit of Python web scraping and got quite a lot out of it, so I'm writing this post to help anyone else who wants to learn how crawlers work.

I'll use a simple example, scraping the movies currently showing on Taopiaopiao, to walk through the complete workflow of a crawler.

Without further ado, here's the code:

from bs4 import BeautifulSoup
import requests

# disguise the request as a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
    'Referer': 'https://www.taopiaopiao.com/showList.htm?spm=a1z21.3046609.header.4.1d69112aGq86y0&n_s=new'
}

# fetch the page source
def getPage(url):
    try:
        response = requests.get(url, headers=headers)   # send the disguised headers with the request
        if response.status_code == 200:    # HTTP status code 200 means the request succeeded
            return response.text
        else:
            return None
    except Exception:
        return None

def getInfo(html):
    soup = BeautifulSoup(html, 'lxml')    # create a BeautifulSoup object; 'lxml' is the parser
    items = soup.select('div .movie-card-wrap')    # in the browser console, find the parent tag that wraps each block of content you want; here it is a div, and .movie-card-wrap is its class name. You can also select by id -- see the soup.select docs
    i = 1
    for item in items:
        name = item.find(name='div', class_='movie-card-name').get_text().strip()    # find the tag holding the content you want by its class
        info = item.find(name='div', class_='movie-card-list').get_text().strip()
        print(str(i) + ' ' + 'Movie name: ' + name + '\n' + info + '\n')
        i = i + 1

url = 'https://www.taopiaopiao.com/showList.htm?spm=a1z21.3046609.header.4.1d69112aGq86y0&n_s=new'
html = getPage(url)
if html:    # getPage returns None on failure
    getInfo(html)

Now let me talk about what the code actually does. The comments cover most of it, but I'll walk through the process in detail.

First, disguising the request headers as a browser

This is easy to understand: if you don't disguise the request, the site you're scraping can tell the traffic comes from a crawler, and you can easily get blocked. So we write a headers dict that makes the request look like it came from a browser; if this is unfamiliar, look it up.
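To make the idea concrete, here is a minimal sketch of the same trick using only the standard library's urllib instead of requests (the header values are illustrative, not required to be exactly these) -- the point is simply that the disguised headers travel with the request:

```python
import urllib.request

# Illustrative browser-like headers; any realistic User-Agent works
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Referer': 'https://www.taopiaopiao.com/showList.htm',
}

# Build the request object with the headers attached (nothing is sent yet)
req = urllib.request.Request('https://www.taopiaopiao.com/showList.htm',
                             headers=headers)
print(req.get_header('User-agent'))
```

Calling `urllib.request.urlopen(req)` would then send the request with those headers, just as `requests.get(url, headers=headers)` does in the code above.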

Second, fetching the page source: getPage

This part of the code is straightforward: only a couple of lines do the real work, so I won't explain it in detail; just use it as is.
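If you want the fetch to be a bit more robust, a common variant (a sketch, not the post's exact code) adds a timeout so a hung connection can't stall the crawler, and catches only requests' own exceptions rather than everything:

```python
import requests

def get_page(url):
    """Fetch a page's source; return None on any failure."""
    headers = {'User-Agent': 'Mozilla/5.0'}   # illustrative disguise
    try:
        # timeout stops the call from hanging forever on a dead host
        response = requests.get(url, headers=headers, timeout=2)
        if response.status_code == 200:
            return response.text
        return None
    except requests.RequestException:   # connection errors, timeouts, etc.
        return None
```

For example, `get_page('http://127.0.0.1:9/')` (an unreachable address) simply returns None instead of raising.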

Third, extracting the information: getInfo

This is, I think, the hardest part of scraping, although it isn't really that hard, so let me explain it in detail with an example.

First of all, we need to know that what a crawler scrapes is the content of the page's source code; data that stays on the server side cannot be scraped. What does that mean? Open the browser and press F12:

You can see the page's source code, and what we scrape is the content between those tags.

[Screenshot: the DevTools view of the page source, with the movie-card elements circled in red]

Take, for example, the content I circled in red above: in this step we need to locate the content we want to scrape inside the source code. Once we understand that, we find the corresponding tags and call the appropriate methods.
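To see the tag-locating step without hitting the real site, here is a self-contained sketch run against a made-up HTML fragment that imitates the class names from the post (the movie data is invented, and html.parser is used so lxml isn't required):

```python
from bs4 import BeautifulSoup

# A toy page imitating Taopiaopiao's structure; the data is made up
html = '''
<div class="movie-card-wrap">
  <div class="movie-card-name">Movie A</div>
  <div class="movie-card-list">Director: X | Starring: Y</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')   # html.parser avoids the lxml dependency
for item in soup.select('div.movie-card-wrap'):
    # find each child tag by its class, then pull out its text
    name = item.find('div', class_='movie-card-name').get_text().strip()
    info = item.find('div', class_='movie-card-list').get_text().strip()
    print(name, '-', info)
```

The same pattern applies on the real page: select the wrapper elements first, then pick the child tags out of each one.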

The scraped information can then be saved to a database, dumped as JSON, or written to files for secondary processing, filtering out the useful data. Here, to make things easy to follow, I print directly to the console so we can see the results.
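As a sketch of that secondary processing, the per-movie fields could be collected into dicts and dumped to a JSON file with the standard library (the file name and movie data here are made up for illustration):

```python
import json

# Pretend these came from getInfo instead of being printed
movies = [
    {'name': 'Movie A', 'info': 'Director: X'},
    {'name': 'Movie B', 'info': 'Director: Y'},
]

# ensure_ascii=False keeps non-ASCII movie names readable in the file
with open('movies.json', 'w', encoding='utf-8') as f:
    json.dump(movies, f, ensure_ascii=False, indent=2)
```

The file can then be reloaded with `json.load` for filtering or further analysis.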

[Screenshot: the console output listing each movie's name and details]

That's it for this Python crawler. I'm still a beginner myself, so there may be plenty I haven't explained well; I hope you'll bear with me.

 


Origin www.cnblogs.com/cairsha/p/10981015.html