Learning Python web crawling breaks down into three major topics: crawling, parsing, and storage. In addition, the popular crawler framework Scrapy is explained in detail at the end.
First, here is a list of the articles I have written, covering the basic concepts and techniques needed to get started with web crawlers: Ningge's site - Web Crawler
When you type a URL into the browser and press Enter, what happens behind the scenes? For example, when you enter http://www.lining0806.com/ , you will see the home page of Ningge's site.
In simple terms, this process takes four steps:
- Look up the IP address corresponding to the domain name.
- Send a request to the server at that IP.
- The server responds to the request and returns the page content.
- The browser parses the page content.
What a web crawler does, simply put, is play the role of the browser: given a URL, it returns the data the user wants directly, without a person having to operate a browser step by step to retrieve it.
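The four steps above can be sketched offline (a Python 3 sketch; a throwaway local server stands in for the real website, and its page content is made up):

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# A tiny local server standing in for a real website (contents are made up).
class HomePage(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><title>home</title></html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), HomePage)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Step 1: look up the IP address for the domain name.
ip = socket.gethostbyname("localhost")

# Steps 2-3: send a request to the server at that IP; it returns the page.
conn = socket.create_connection((ip, port))
conn.sendall(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")
raw = b""
while True:
    chunk = conn.recv(4096)
    if not chunk:
        break
    raw += chunk
conn.close()

# Step 4: "parse" the content (here: just split the headers from the body).
headers, _, body = raw.partition(b"\r\n\r\n")
server.shutdown()
print(body)
```

A real crawler delegates steps 1-3 to a library such as urllib2 or requests; only step 4 changes from rendering to data extraction.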
Crawl
At this point, you need to be clear about what it is you want to fetch. Is it the HTML source, a JSON-formatted string, or something else?
1. The most basic fetch
Most fetches are GET requests: data is pulled directly from the remote server.
First of all, Python's built-in urllib and urllib2 modules can basically satisfy ordinary page fetching. Besides these, requests is a very useful third-party package, and there are similar ones such as httplib2.
Requests:
import requests
response = requests.get(url)
content = response.content
print "response headers:", response.headers
print "content:", content
Urllib2:
import urllib2
response = urllib2.urlopen(url)
content = response.read()
print "response headers:", response.headers
print "content:", content
Httplib2:
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers
print "content:", content
As for the query fields of the URL: the data of a GET request is usually appended after the request URL, with a ? separating the URL from the data, and multiple parameters joined by &.
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict or a JSON string
import requests
response = requests.get(url=url, params=data)
Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
full_url = url+'?'+data
response = urllib2.urlopen(full_url)
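For instance, building the query string by hand (in Python 3 the same helper lives in urllib.parse; the URL is a placeholder):

```python
from urllib.parse import urlencode

data = {'data1': 'XXXXX', 'data2': 'XXXXX'}
# '?' separates the URL from the data; '&' joins the parameters
full_url = 'http://example.com/search' + '?' + urlencode(data)
print(full_url)  # → http://example.com/search?data1=XXXXX&data2=XXXXX
```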
Related reference: A review of crawling the Netease News rankings
Reference project: The most basic crawler: crawling the Netease News rankings
2. Handling login
2.1 Logging in with a form
This is a POST request: the form data is first sent to the server, and the cookie it returns is then stored locally.
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict or a JSON string
import requests
response = requests.post(url=url, data=data)
Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
req = urllib2.Request(url=url, data=data)
response = urllib2.urlopen(req)
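In Python 3 the same pattern uses urllib.request (the URL and form fields below are made up); note that attaching a data payload is exactly what turns the request into a POST:

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {'user': 'alice', 'pwd': 'secret'}    # made-up form fields
payload = urlencode(form).encode('utf-8')    # POST bodies must be bytes
req = Request('http://example.com/login', data=payload)
print(req.get_method())  # → POST
```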
2.2 Logging in with cookies
When you log in with cookies, the server treats you as an already-logged-in user and returns content as if you were logged in. So when a captcha would otherwise be required, logging in with cookies can get around the captcha.
import requests
requests_session = requests.session()
response = requests_session.post(url=url_login, data=data)
If a captcha is involved, a plain response = requests_session.post(url=url_login, data=data) will not do; the approach should instead be as follows:
response_captcha = requests_session.get(url=url_login, cookies=cookies)
response1 = requests.get(url_login) # not logged in
response2 = requests_session.get(url_login) # logged in, because the session already holds the response cookie!
response3 = requests_session.get(url_results) # logged in, because the session already holds the response cookie!
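The same cookie mechanics can be shown self-contained with only the standard library (Python 3; the login endpoint and session cookie are invented, and a throwaway local server plays the website). An opener with a CookieJar behaves like requests_session above, while a plain urlopen stays logged out:

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class LoginSite(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/login':
            body = b'logged in'
            self.send_response(200)
            self.send_header('Set-Cookie', 'sid=abc123')   # invented session id
        else:
            ok = 'sid=abc123' in (self.headers.get('Cookie') or '')
            body = b'welcome' if ok else b'please log in'
            self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

server = HTTPServer(('127.0.0.1', 0), LoginSite)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

jar = http.cookiejar.CookieJar()
session = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
session.open(base + '/login')                        # server sets the cookie
page = session.open(base + '/results').read()        # cookie sent automatically
fresh = urllib.request.urlopen(base + '/results').read()  # no cookie jar
server.shutdown()
print(page, fresh)
```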
Related reference: Web crawler - logging in with a captcha
Reference project: Web crawler login with username, password and captcha: crawling the Zhihu website
3. Handling anti-crawler mechanisms
3.1 Using a proxy
Applies when: the site restricts access by IP address, and also when "frequent clicks" force you to enter a captcha to continue.
The best approach in these situations is to maintain a pool of proxy IPs. Plenty of free proxy IPs can be found online; their quality varies, but usable ones can be found by screening. For the "frequent clicks" case, we can also avoid getting banned by limiting how often the crawler visits the site.
proxies = {'http':'http://XX.XX.XX.XX:XXXX'}
Requests:
import requests
response = requests.get(url=url, proxies=proxies)
Urllib2:
import urllib2
proxy_support = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener) # install the opener; every later call to urlopen() will use it
response = urllib2.urlopen(url)
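The "pool of proxy IPs" idea can be sketched as a small class (a toy version: the addresses are placeholders, and a real pool would also periodically re-test proxies that were dropped):

```python
import random

class ProxyPool:
    """Keep a set of live proxies; rotate randomly and drop dead ones."""

    def __init__(self, proxies):
        self.alive = set(proxies)

    def get(self):
        # spread requests across many IPs to dodge per-IP limits
        return random.choice(sorted(self.alive))

    def report_dead(self, proxy):
        # screening: drop proxies that stopped responding
        self.alive.discard(proxy)

pool = ProxyPool(['http://10.0.0.1:8080', 'http://10.0.0.2:8080'])
bad = pool.get()
pool.report_dead(bad)
print(pool.get())  # only the surviving proxy can be returned now
```

Each fetch would then pass {'http': pool.get()} as the proxies argument, calling report_dead on a timeout or connection error.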
3.2 Adding delays
Applies when: the site limits request frequency.
With both requests and urllib2, you can use the time module's sleep() function:
import time
time.sleep(1)
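Instead of a fixed sleep(1) before every request, you can enforce a minimum gap between consecutive requests; a minimal sketch (Python 3):

```python
import time

class Throttle:
    """Sleep just enough so consecutive requests are at least `delay` apart."""

    def __init__(self, delay):
        self.delay = delay
        self.last = 0.0

    def wait(self):
        gap = self.delay - (time.monotonic() - self.last)
        if gap > 0:
            time.sleep(gap)
        self.last = time.monotonic()

throttle = Throttle(0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()          # would wrap each requests.get(...) call
elapsed = time.monotonic() - start
print(elapsed)               # roughly 0.4s: two enforced 0.2s gaps
```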
3.3 Disguising as a browser, and defeating anti-hotlinking
Some sites check whether you are really visiting from a browser or whether a machine is accessing them automatically. For that case, add a User-Agent header claiming you are a browser. Some sites also check the Referer header to verify the request is not hotlinking; for those, add a Referer as well.
headers = {'User-Agent':'XXXXX'} # disguise as a browser; for sites that reject crawlers
headers = {'Referer':'XXXXX'}
headers = {'User-Agent':'XXXXX', 'Referer':'XXXXX'}
Requests:
response = requests.get(url=url, headers=headers)
Urllib2:
import urllib, urllib2
req = urllib2.Request(url=url, headers=headers)
response = urllib2.urlopen(req)
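In Python 3 the same headers attach like this (the URL and User-Agent string are placeholders; note that Request normalizes header names, storing 'User-Agent' as 'User-agent'):

```python
from urllib.request import Request

headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://example.com/'}
req = Request('http://example.com/page', headers=headers)

# Request capitalizes only the first letter of stored header names
print(req.get_header('User-agent'))  # → Mozilla/5.0
print(req.get_header('Referer'))     # → http://example.com/
```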
4. Handling reconnection
Not much needs saying here.
def multi_session(session, *arg):
    retryTimes = 20
    while retryTimes > 0:
        try:
            return session.post(*arg)
        except:
            print '.',
            retryTimes -= 1
or
def multi_open(opener, *arg):
    retryTimes = 20
    while retryTimes > 0:
        try:
            return opener.open(*arg)
        except:
            print '.',
            retryTimes -= 1
With multi_session or multi_open we can keep hold of the session or opener used by the crawler, retrying across disconnects.
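A slightly more reusable variant of the same idea is a retry decorator with exponential backoff (a Python 3 sketch; the flaky() function below is invented and fakes a connection that fails twice before succeeding):

```python
import time

def retry(times=5, base_delay=0.01):
    """Re-run the wrapped call on exception, doubling the delay each attempt."""
    def wrap(fn):
        def inner(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise                      # out of retries
                    time.sleep(base_delay * 2 ** attempt)
        return inner
    return wrap

calls = []

@retry(times=4)
def flaky():
    calls.append(1)                 # stand-in for session.post(...)
    if len(calls) < 3:
        raise IOError('connection reset')
    return 'ok'

result = flaky()
print(result, len(calls))  # → ok 3
```

Backoff matters for crawlers: a server that dropped you because of load will drop an immediate retry too.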
5. Multiprocess crawling
Here is an experimental comparison of parallel crawling of the Wallstreetcn site: Python multiprocess crawling versus single-threaded and multi-threaded crawling in Java
Related reference: A comparison of multi-process and multi-threaded computation methods in Python and Java
6. Handling Ajax requests
For pages that "load more", Ajax is used to transfer a lot of the data.
It works like this: after the page source is loaded from the page's URL, a JavaScript program is executed in the browser. That program loads additional content and "fills" it into the page. This is why, if you crawl the page's own URL directly, you will not find the page's actual content.
Here, if you use Google Chrome to analyze the "load more" request (right-click → Inspect Element → Network → clear, click "load more", find the GET request that appears whose Type is text/html, click it to view its GET parameters or copy the Request URL), you can then process the requests in a loop.
- If a page number appears in the request, derive each page's URL from the one analyzed in the step above, and so on, fetching data from the Ajax address.
- For the returned JSON-formatted data (a str), match out the content with a regular expression. In the JSON data, text in '\uxxxx' form needs converting from unicode_escape encoding to u'\uxxxx' unicode encoding.
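For example (Python 3; the response body below is invented): once the JSON fragment has been regex-matched out of the response, json.loads takes care of the '\uxxxx' decoding for you:

```python
import json
import re

# A made-up Ajax response: JSON wrapped in a JS callback, with \uXXXX escapes.
raw = r'callback({"title": "\u65b0\u95fb", "page": 2})'

match = re.search(r'\{.*\}', raw)     # regex-match the JSON fragment
data = json.loads(match.group(0))     # \u65b0\u95fb decodes to real characters
print(data['title'], data['page'])    # → 新闻 2
```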
7. The automated testing tool Selenium
Selenium is an automated testing tool. It can drive a browser, including filling in characters, clicking the mouse, fetching elements, switching pages, and a whole series of other operations. In short, whatever a browser can do, Selenium can do too.
Here, given a list of cities, Selenium is used to crawl fare information from the dynamically rendered pages of the Qunar site.
Reference project: Web crawler with Selenium and a proxy login: crawling the Qunar website
8. Captcha recognition
For sites with captchas, we have three options:
- Use a proxy and rotate the IP.
- Log in with cookies.
- Recognize the captcha.
Using a proxy and logging in with cookies have been covered already; here we discuss captcha recognition.
We can use the open-source Tesseract-OCR system to download and recognize captcha images, passing the recognized characters to the crawler to simulate login. Alternatively, the captcha image can be uploaded to a human captcha-solving platform for recognition. If recognition fails, simply refresh the captcha and try again until it succeeds.
Reference project: Captcha recognition project, first edition: Captcha1
There are two crawling problems worth noting:
- How do you monitor a series of websites for updates, that is, how do you crawl incrementally?
- For huge amounts of data, how do you implement distributed crawling?
Parsing
After a crawl, the captured content has to be parsed: whatever you need is extracted from it.
Common parsing tools include regular expressions, BeautifulSoup, lxml, and so on.
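A small contrast of two of these approaches on the same snippet, using only the Python 3 standard library (a regex versus html.parser; BeautifulSoup and lxml offer friendlier APIs for the same job):

```python
import re
from html.parser import HTMLParser

html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'

# Regular expression: quick, but brittle against messy real-world HTML
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

# A real parser tolerates attribute order, quoting and nesting much better
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.append(dict(attrs).get('href'))

parser = LinkParser()
parser.feed(html)
print(links)         # → [('/a', 'First'), ('/b', 'Second')]
print(parser.hrefs)  # → ['/a', '/b']
```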
Storage
Once the content we need has been parsed out, the next step is to store it.
We can choose to save it to a text file, or to a MySQL or MongoDB database.
There are two storage problems worth noting:
- How do you de-duplicate similar web pages?
- In what form should the content be stored?
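Both questions can be sketched at once (sqlite3 standing in for MySQL here; the table and column names are invented for the example): a primary key on the URL gives cheap exact de-duplication, and the schema is one answer to "in what form":

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # sqlite3 as a stand-in for MySQL
conn.execute('CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)')

rows = [
    ('http://example.com/1', 'hello'),
    ('http://example.com/1', 'duplicate fetch of the same page'),
]
# INSERT OR IGNORE silently skips URLs that are already stored
conn.executemany('INSERT OR IGNORE INTO pages VALUES (?, ?)', rows)

count, = conn.execute('SELECT COUNT(*) FROM pages').fetchone()
print(count)  # → 1
```

Detecting *similar* (not identical) pages needs more than a key, e.g. content hashing or shingling; that is the open question the list above raises.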
Scrapy
Scrapy is an open-source Python crawler framework based on Twisted, and it is widely used in industry.
For related content you can refer to Building your own web crawler based on Scrapy, which also introduces a WeChat search crawling project whose code you can study as a reference.
Reference project: Recursively crawling WeChat search results with Scrapy or Requests
The Robots protocol
A good web crawler should first of all obey the Robots protocol. The Robots protocol (also called the crawler protocol or robot protocol), in full the Robots Exclusion Protocol, is the standard by which a website tells crawlers and search engines which of its pages may be crawled and which may not.
A text file named robots.txt is placed in the website's root directory (for example https://www.taobao.com/robots.txt ). In it, the site declares which pages web crawlers may access and which are off-limits, with the pages specified by regular-expression-like patterns. Before gathering from a site, a web crawler should first fetch this robots.txt file, parse its rules, and then collect the site's data according to those rules.
1. Robots protocol rules
User-agent: specifies which crawlers the rules apply to
Disallow: specifies URLs that must not be accessed
Allow: specifies URLs that may be accessed
Note: each directive's first letter must be capitalized, the colon must be a half-width colon followed by one space, and "/" stands for the entire website
2. Robots protocol examples
Block all robots:
User-agent: *
Disallow: /
Allow all robots:
User-agent: *
Disallow:
Block a specific robot:
User-agent: BadBot
Disallow: /
Allow a specific robot:
User-agent: GoodBot
Disallow:
Block access to a specific directory:
User-agent: *
Disallow: /images/
Allow access to a specific directory only:
User-agent: *
Allow: /images/
Disallow: /
Block access to specific files:
User-agent: *
Disallow: /*.html$
Allow access to specific files only:
User-agent: *
Allow: /*.html$
Disallow: /
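Rules like the examples above can be checked from Python with the standard library's urllib.robotparser (the rules and URLs here are made up; normally you would call set_url() and read() to fetch the site's real robots.txt):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the robots.txt rules as a list of lines
rp.parse("""
User-agent: *
Disallow: /images/
""".splitlines())

ok_page = rp.can_fetch('MyCrawler', 'http://example.com/page.html')
ok_image = rp.can_fetch('MyCrawler', 'http://example.com/images/a.png')
print(ok_page, ok_image)  # → True False
```

Calling can_fetch before every request is the simplest way for a crawler to stay on the right side of the protocol.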