Today I spent some time learning a bit of Python web scraping, and the payoff was quite big, so I am writing this post to help anyone who also wants to learn crawlers.
Here I will walk through the complete workflow of a crawler with a simple example: scraping the movies currently showing on Taopiaopiao.
To start, no more talk — straight to the source code:
from bs4 import BeautifulSoup
import requests

# disguise the request as a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
    'Referer': 'https://www.taopiaopiao.com/showList.htm?spm=a1z21.3046609.header.4.1d69112aGq86y0&n_s=new'
}

# fetch the page source
def getPage(url):
    try:
        # pass headers=headers so the browser disguise actually takes effect
        response = requests.get(url, headers=headers)
        if response.status_code == 200:  # HTTP status code 200 means the request succeeded
            return response.text
        else:
            return None
    except Exception:
        return None

# extract the information we want from the page source
def getInfo(html):
    # build a BeautifulSoup object; bs4's default parser is html.parser, and lxml is another supported parser
    soup = BeautifulSoup(html, 'lxml')
    # In the site's dev-tools console, find the tag that wraps the content you need.
    # Be methodical: most scraped content is regular, so once you find what you want
    # to scrape, locate its parent tag -- here a div -- and then select by the class
    # name .movie-card-wrap. You can also select by id; look up soup.select yourself.
    items = soup.select('div .movie-card-wrap')
    i = 1
    for item in items:
        # find the tag (and its class) that holds the content you want to scrape
        name = item.find(name='div', class_='movie-card-name').get_text().strip()
        info = item.find(name='div', class_='movie-card-list').get_text().strip()
        print(str(i) + ' ' + 'Movie: ' + name + '\n' + info + '\n')
        i = i + 1

url = 'https://www.taopiaopiao.com/showList.htm?spm=a1z21.3046609.header.4.1d69112aGq86y0&n_s=new'
html = getPage(url)
getInfo(html)
Now let me explain what the code actually does. Most of it is commented, but I will go through the process in detail.
First, disguising the request as a browser: headers
This one is easy to understand: if you do not disguise your requests, the site can tell that you are scraping its data and may quickly block you. So we define a headers dict (note: it is a Python dict, not JSON) that makes our requests look like they come from a browser; if any field is unfamiliar, look it up yourself.
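To convince yourself that the custom headers really get attached to the outgoing request, you can build and prepare a request without sending it. This is just a small sketch for inspection; the URL here is only a placeholder:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
    'Referer': 'https://www.taopiaopiao.com/showList.htm',
}

# Prepare the request without sending it, so we can inspect
# exactly what would go out on the wire
req = requests.Request('GET', 'https://www.taopiaopiao.com/showList.htm', headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'])
```

This is handy for debugging: you can see the final header set (including anything requests adds on its own) before making a single network call.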
Second, fetching the page source: getPage
This part of the code is easy to follow — only a couple of lines do the real work — so I will not explain it in detail; you can reuse it directly.
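As a variant (my own sketch, not part of the original code), getPage can be made a bit more robust with a timeout and raise_for_status, and the HTTP client can be made injectable so the function is testable without touching the network:

```python
import requests

def get_page(url, client=requests, timeout=10):
    """Fetch a page's HTML; return None on any failure.

    `client` defaults to the requests module but can be any object
    with a compatible .get() method (useful for testing with a stub).
    """
    try:
        response = client.get(url, timeout=timeout)
        response.raise_for_status()  # raise for 4xx/5xx instead of silently continuing
        return response.text
    except Exception:
        return None
```

In a test you can pass a stub object whose get() returns canned HTML, so the parsing logic can be exercised offline.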
Third, extracting the information: getInfo
This is the part I think is hardest about scraping — though it is not really that hard — so I will explain it in detail with an example.
First of all, we need to understand that a crawler scrapes content out of the page's source code; data that stays on the server side cannot be scraped. What does that mean? Open the page in a browser and press F12.
You will see the page's source code in the developer tools, and what we scrape is the content sitting between those tags.
Take, for example, the content I circled in red in the screenshot above: this step is all about locating, in the source code, the content you want to scrape. Once you have found the corresponding tag, you simply call the matching method on it.
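To make the selection step concrete, here is a self-contained sketch that runs the same select/find calls against a tiny hand-written HTML snippet shaped like the Taopiaopiao cards. The snippet and the movie details in it are made up for illustration, and html.parser is used so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page, using the same class names the crawler targets
html = """
<div class="movie-list">
  <div class="movie-card-wrap">
    <div class="movie-card-name">Movie A</div>
    <div class="movie-card-list">Director: Someone / Genre: Drama</div>
  </div>
  <div class="movie-card-wrap">
    <div class="movie-card-name">Movie B</div>
    <div class="movie-card-list">Director: Someone Else / Genre: Comedy</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
items = soup.select('div .movie-card-wrap')  # every card under the outer div
names = [item.find('div', class_='movie-card-name').get_text().strip()
         for item in items]
print(names)  # -> ['Movie A', 'Movie B']
```

Once the selectors work on a snippet like this, they work the same way on the real page source returned by getPage.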
The scraped information can then be saved to a database, written out as JSON, or written to files for further processing — filtering out the useful data. Here, to keep things easy to follow, I print directly to the console so we can look at the result.
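For example, instead of printing, the loop in getInfo could collect (name, info) pairs and dump them to a JSON file. This is a sketch of one possible follow-up step; the field names and the movies.json filename are my own choices:

```python
import json

def save_movies(movies, path='movies.json'):
    """movies: a list of (name, info) tuples collected instead of printed."""
    data = [{'name': name, 'info': info} for name, info in movies]
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese movie titles readable in the file
        json.dump(data, f, ensure_ascii=False, indent=2)

save_movies([('Movie A', 'Director: Someone'),
             ('Movie B', 'Director: Someone Else')])
```

The resulting file can then be loaded with json.load for whatever secondary processing you need.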
That is Python web scraping in a nutshell. I am a novice too, so there may be plenty of places I did not explain well — I hope you will forgive me.