Beginner web crawler (1)

1. Introduction
My earlier Python crawlers were written mostly with the lower-level urllib and urllib2 modules. That approach is fairly primitive and laborious to code, particularly when extracting information, which has to be done with regular-expression matching (see the Qiushibaike crawler article I reposted earlier: http://blog.csdn.net/zhyh1435589631/article/details/51296734). Here we use requests + BeautifulSoup instead, with CSS selectors, to simplify the code.
 
2. Basic information

Before using these two modules, a brief introduction to each is in order:
The requests library is a well-packaged wrapper around HTTP functionality that covers the basic HTTP operations.
BeautifulSoup provides a convenient way to parse HTML and XML pages. It parses the HTML tags into tree nodes, so an HTML page can be treated as a tree structure.
Official requests documentation: http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
Official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
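To make the tree view concrete, here is a minimal sketch that parses a small HTML string and looks up the same elements two ways: with find_all and an attribute filter, and with a CSS selector via select. The HTML snippet and the variable names are made up for illustration; no network request is involved.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example only.
html = """
<html><body>
  <div class="content"><span>first joke</span></div>
  <div class="author">someone</div>
  <div class="content"><span>second joke</span></div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Lookup by tag name plus an attribute filter.
by_find_all = [div.get_text(strip=True)
               for div in soup.find_all('div', {'class': 'content'})]

# The same query written as a CSS selector.
by_select = [div.get_text(strip=True)
             for div in soup.select('div.content')]

print(by_select)
```

Both lookups return the same two div elements; which form to use is mostly a matter of taste, though selectors tend to stay shorter as queries get more nested.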

3. Implementing the code
The code is fairly simple, so it needs little explanation. The code below crawls two sites: Qiushibaike (http://www.qiushibaike.com) and the job listings page at job.ustc.edu.cn.
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup


def qiushibaike():
    # Fetch the front page and print the text of every <div class="content">.
    content = requests.get('http://www.qiushibaike.com').content
    soup = BeautifulSoup(content, 'html.parser')

    for div in soup.find_all('div', {'class': 'content'}):
        print(div.text.strip())


def ustcjob():
    # Send a browser-like User-Agent header with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    content = requests.get('http://job.ustc.edu.cn/list.php?MenuID=002', headers=headers).content
    soup = BeautifulSoup(content, 'html.parser')

    # Each <li> inside a <div class="Joplistone"> is one listing row.
    for job in soup.find_all('div', {'class': 'Joplistone'}):
        for item in job.find_all('li'):
            print("%-30s%-20s%-40s" % (item.a.text.strip(),
                                       item.span.text.strip(),
                                       item.span.next_sibling.text.strip()))


if __name__ == '__main__':
    # qiushibaike()
    ustcjob()
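Because ustcjob() fetches and parses in one step, it can only be tried against the live site. One possible variation, sketched below, moves the parsing into its own function so it can be exercised on a saved or hand-written HTML string. The function name parse_jobs, the SAMPLE_HTML snippet, and the assumption that each row holds one link followed by two span elements are all hypothetical, inferred from the selectors used above rather than taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on what the selectors in ustcjob() imply.
SAMPLE_HTML = """
<div class="Joplistone">
  <ul>
    <li><a href="#">Example Co. campus talk</a><span>2016-05-20</span><span>Room 101</span></li>
    <li><a href="#">Another Co. recruiting</a><span>2016-05-21</span><span>Room 202</span></li>
  </ul>
</div>
"""


def parse_jobs(html):
    """Return a list of (link_text, first_span, second_span) tuples."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.select('div.Joplistone li'):
        link = item.find('a')
        spans = item.find_all('span')
        if link is None or len(spans) < 2:
            continue  # skip rows that do not match the assumed layout
        rows.append((link.get_text(strip=True),
                     spans[0].get_text(strip=True),
                     spans[1].get_text(strip=True)))
    return rows


if __name__ == '__main__':
    for title, first, second in parse_jobs(SAMPLE_HTML):
        print("%-30s%-20s%-40s" % (title, first, second))
```

With this split, the response body from requests.get(...).content can be passed straight to parse_jobs, and rows that lack the expected elements are skipped instead of raising an AttributeError.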

4. Results

 

 

 

Origin www.cnblogs.com/qq991025/p/11831776.html