Learning Python web crawling breaks down into three major topics: crawling, parsing, and storage. In addition, the popular crawler framework Scrapy is explained in detail at the end.
First, here is a list of the articles I have written, covering the basic concepts and techniques needed to get started with web crawlers: Ningge's site - Web Crawler
When you type a URL into the browser and press Enter, what happens behind the scenes? For example, when you enter http://www.lining0806.com/ , you will see the home page of Ningge's site.
In simple terms, this process takes four steps:
- Look up the IP address corresponding to the domain name.
- Send a request to the server at that IP.
- The server responds to the request and returns the page content.
- The browser parses the page content.
What a web crawler does, simply put, is play the role of the browser: given a URL, it returns the data the user wants directly, without a person having to operate a browser step by step to retrieve it.
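The four steps above can be sketched offline (a Python 3 sketch; a throwaway local server stands in for the real website, and its page content is made up):

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# A tiny local server standing in for a real website (contents are made up).
class HomePage(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><title>home</title></html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), HomePage)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Step 1: look up the IP address for the domain name.
ip = socket.gethostbyname("localhost")

# Steps 2-3: send a request to the server at that IP; it returns the page.
conn = socket.create_connection((ip, port))
conn.sendall(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")
raw = b""
while True:
    chunk = conn.recv(4096)
    if not chunk:
        break
    raw += chunk
conn.close()

# Step 4: "parse" the content (here: just split the headers from the body).
headers, _, body = raw.partition(b"\r\n\r\n")
server.shutdown()
print(body)
```

A real crawler delegates steps 1-3 to a library such as urllib2 or requests; only step 4 changes from rendering to data extraction.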
Crawl
At this point, you need to be clear about what it is you want to fetch. Is it the HTML source, a JSON-formatted string, or something else?
1. The most basic fetch
Most fetches are GET requests: data is pulled directly from the remote server.
First of all, Python's built-in urllib and urllib2 modules can basically satisfy ordinary page fetching. Besides these, requests is a very useful third-party package, and there are similar ones such as httplib2.
Requests:
import requests
response = requests.get(url)
content = response.content
print "response headers:", response.headers
print "content:", content
Urllib2:
import urllib2
response = urllib2.urlopen(url)
content = response.read()
print "response headers:", response.headers
print "content:", content
Httplib2:
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers
print "content:", content
As for the query fields of the URL: the data of a GET request is usually appended after the request URL, with a ? separating the URL from the data, and multiple parameters joined by &.
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict or a JSON string
import requests
response = requests.get(url=url, params=data)
Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
full_url = url+'?'+data
response = urllib2.urlopen(full_url)
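For instance, building the query string by hand (in Python 3 the same helper lives in urllib.parse; the URL is a placeholder):

```python
from urllib.parse import urlencode

data = {'data1': 'XXXXX', 'data2': 'XXXXX'}
# '?' separates the URL from the data; '&' joins the parameters
full_url = 'http://example.com/search' + '?' + urlencode(data)
print(full_url)  # → http://example.com/search?data1=XXXXX&data2=XXXXX
```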
Related reference: A review of crawling the Netease News rankings
Reference project: The most basic crawler: crawling the Netease News rankings
2. Handling login
2.1 Logging in with a form
This is a POST request: the form data is first sent to the server, and the cookie it returns is then stored locally.
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict or a JSON string
import requests
response = requests.post(url=url, data=data)
Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
req = urllib2.Request(url=url, data=data)
response = urllib2.urlopen(req)
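In Python 3 the same pattern uses urllib.request (the URL and form fields below are made up); note that attaching a data payload is exactly what turns the request into a POST:

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {'user': 'alice', 'pwd': 'secret'}    # made-up form fields
payload = urlencode(form).encode('utf-8')    # POST bodies must be bytes
req = Request('http://example.com/login', data=payload)
print(req.get_method())  # → POST
```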
2.2 Logging in with cookies
When you log in with cookies, the server treats you as an already-logged-in user and returns content as if you were logged in. So when a captcha would otherwise be required, logging in with cookies can get around the captcha.
import requests
requests_session = requests.session()
response = requests_session.post(url=url_login, data=data)
If a captcha is involved, a plain response = requests_session.post(url=url_login, data=data) will not do; the approach should instead be as follows:
response_captcha = requests_session.get(url=url_login, cookies=cookies)
response1 = requests.get(url_login) # not logged in
response2 = requests_session.get(url_login) # logged in, because the session already holds the response cookie!
response3 = requests_session.get(url_results) # logged in, because the session already holds the response cookie!
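The same cookie mechanics can be shown self-contained with only the standard library (Python 3; the login endpoint and session cookie are invented, and a throwaway local server plays the website). An opener with a CookieJar behaves like requests_session above, while a plain urlopen stays logged out:

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class LoginSite(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/login':
            body = b'logged in'
            self.send_response(200)
            self.send_header('Set-Cookie', 'sid=abc123')   # invented session id
        else:
            ok = 'sid=abc123' in (self.headers.get('Cookie') or '')
            body = b'welcome' if ok else b'please log in'
            self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

server = HTTPServer(('127.0.0.1', 0), LoginSite)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

jar = http.cookiejar.CookieJar()
session = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
session.open(base + '/login')                        # server sets the cookie
page = session.open(base + '/results').read()        # cookie sent automatically
fresh = urllib.request.urlopen(base + '/results').read()  # no cookie jar
server.shutdown()
print(page, fresh)
```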
Related reference: Web crawler - logging in with a captcha
Reference project: Web crawler login with username, password and captcha: crawling the Zhihu website
3. Handling anti-crawler mechanisms
3.1 Using a proxy
Applies when: the site restricts access by IP address, and also when "frequent clicks" force you to enter a captcha to continue.
The best approach in these situations is to maintain a pool of proxy IPs. Plenty of free proxy IPs can be found online; their quality varies, but usable ones can be found by screening. For the "frequent clicks" case, we can also avoid getting banned by limiting how often the crawler visits the site.
proxies = {'http':'http://XX.XX.XX.XX:XXXX'}
Requests:
import requests
response = requests.get(url=url, proxies=proxies)
Urllib2:
import urllib2
proxy_support = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener) # install the opener; every later call to urlopen() will use it
response = urllib2.urlopen(url)
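The "pool of proxy IPs" idea can be sketched as a small class (a toy version: the addresses are placeholders, and a real pool would also periodically re-test proxies that were dropped):

```python
import random

class ProxyPool:
    """Keep a set of live proxies; rotate randomly and drop dead ones."""

    def __init__(self, proxies):
        self.alive = set(proxies)

    def get(self):
        # spread requests across many IPs to dodge per-IP limits
        return random.choice(sorted(self.alive))

    def report_dead(self, proxy):
        # screening: drop proxies that stopped responding
        self.alive.discard(proxy)

pool = ProxyPool(['http://10.0.0.1:8080', 'http://10.0.0.2:8080'])
bad = pool.get()
pool.report_dead(bad)
print(pool.get())  # only the surviving proxy can be returned now
```

Each fetch would then pass {'http': pool.get()} as the proxies argument, calling report_dead on a timeout or connection error.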
3.2 Adding delays
Applies when: the site limits request frequency.
With both requests and urllib2, you can use the time module's sleep() function:
import time
time.sleep(1)
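Instead of a fixed sleep(1) before every request, you can enforce a minimum gap between consecutive requests; a minimal sketch (Python 3):

```python
import time

class Throttle:
    """Sleep just enough so consecutive requests are at least `delay` apart."""

    def __init__(self, delay):
        self.delay = delay
        self.last = 0.0

    def wait(self):
        gap = self.delay - (time.monotonic() - self.last)
        if gap > 0:
            time.sleep(gap)
        self.last = time.monotonic()

throttle = Throttle(0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()          # would wrap each requests.get(...) call
elapsed = time.monotonic() - start
print(elapsed)               # roughly 0.4s: two enforced 0.2s gaps
```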
3.3 Disguising as a browser, and defeating anti-hotlinking
Some sites check whether you are really visiting from a browser or whether a machine is accessing them automatically. For that case, add a User-Agent header claiming you are a browser. Some sites also check the Referer header to verify the request is not hotlinking; for those, add a Referer as well.
headers = {'User-Agent':'XXXXX'} # disguise as a browser; for sites that reject crawlers
headers = {'Referer':'XXXXX'}
headers = {'User-Agent':'XXXXX', 'Referer':'XXXXX'}
Requests:
response = requests.get(url=url, headers=headers)
Urllib2:
import urllib, urllib2
req = urllib2.Request(url=url, headers=headers)
response = urllib2.urlopen(req)
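In Python 3 the same headers attach like this (the URL and User-Agent string are placeholders; note that Request normalizes header names, storing 'User-Agent' as 'User-agent'):

```python
from urllib.request import Request

headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://example.com/'}
req = Request('http://example.com/page', headers=headers)

# Request capitalizes only the first letter of stored header names
print(req.get_header('User-agent'))  # → Mozilla/5.0
print(req.get_header('Referer'))     # → http://example.com/
```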
4. Handling reconnection
Not much needs saying here.
def multi_session(session, *arg):
    retryTimes = 20
    while retryTimes > 0:
        try:
            return session.post(*arg)
        except:
            print '.',
            retryTimes -= 1
or
def multi_open(opener, *arg):
    retryTimes = 20
    while retryTimes > 0:
        try:
            return opener.open(*arg)
        except:
            print '.',
            retryTimes -= 1
With multi_session or multi_open we can keep hold of the session or opener used by the crawler, retrying across disconnects.
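A slightly more reusable variant of the same idea is a retry decorator with exponential backoff (a Python 3 sketch; the flaky() function below is invented and fakes a connection that fails twice before succeeding):

```python
import time

def retry(times=5, base_delay=0.01):
    """Re-run the wrapped call on exception, doubling the delay each attempt."""
    def wrap(fn):
        def inner(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise                      # out of retries
                    time.sleep(base_delay * 2 ** attempt)
        return inner
    return wrap

calls = []

@retry(times=4)
def flaky():
    calls.append(1)                 # stand-in for session.post(...)
    if len(calls) < 3:
        raise IOError('connection reset')
    return 'ok'

result = flaky()
print(result, len(calls))  # → ok 3
```

Backoff matters for crawlers: a server that dropped you because of load will drop an immediate retry too.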
5. Multiprocess crawling
Here is an experimental comparison of parallel crawling of the Wallstreetcn site: Python multiprocess crawling versus single-threaded and multi-threaded crawling in Java
Related reference: A comparison of multi-process and multi-threaded computation methods in Python and Java
6. Handling Ajax requests
For pages that "load more", Ajax is used to transfer a lot of the data.
It works like this: after the page source is loaded from the page's URL, a JavaScript program is executed in the browser. That program loads additional content and "fills" it into the page. This is why, if you crawl the page's own URL directly, you will not find the page's actual content.
Here, if you use Google Chrome to analyze the "load more" request (right-click → Inspect Element → Network → clear, click "load more", find the GET request that appears whose Type is text/html, click it to view its GET parameters or copy the Request URL), you can then process the requests in a loop.
- If a page number appears in the request, derive each page's URL from the one analyzed in the step above, and so on, fetching data from the Ajax address.
- For the returned JSON-formatted data (a str), match out the content with a regular expression. In the JSON data, text in '\uxxxx' form needs converting from unicode_escape encoding to u'\uxxxx' unicode encoding.
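For example (Python 3; the response body below is invented): once the JSON fragment has been regex-matched out of the response, json.loads takes care of the '\uxxxx' decoding for you:

```python
import json
import re

# A made-up Ajax response: JSON wrapped in a JS callback, with \uXXXX escapes.
raw = r'callback({"title": "\u65b0\u95fb", "page": 2})'

match = re.search(r'\{.*\}', raw)     # regex-match the JSON fragment
data = json.loads(match.group(0))     # \u65b0\u95fb decodes to real characters
print(data['title'], data['page'])    # → 新闻 2
```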
7. The automated testing tool Selenium
Selenium is an automated testing tool. It can drive a browser, including filling in characters, clicking the mouse, fetching elements, switching pages, and a whole series of other operations. In short, whatever a browser can do, Selenium can do too.
Here, given a list of cities, Selenium is used to crawl fare information from the dynamically rendered pages of the Qunar site.
Reference project: Web crawler with Selenium and a proxy login: crawling the Qunar website
8. Captcha recognition
For sites with captchas, we have three options:
- Use a proxy and rotate the IP.
- Log in with cookies.
- Recognize the captcha.
Using a proxy and logging in with cookies have been covered already; here we discuss captcha recognition.
We can use the open-source Tesseract-OCR system to download and recognize captcha images, passing the recognized characters to the crawler to simulate login. Alternatively, the captcha image can be uploaded to a human captcha-solving platform for recognition. If recognition fails, simply refresh the captcha and try again until it succeeds.
Reference project: Captcha recognition project, first edition: Captcha1
There are two crawling problems worth noting:
- How do you monitor a series of websites for updates, that is, how do you crawl incrementally?
- For huge amounts of data, how do you implement distributed crawling?
Parsing
After a crawl, the captured content has to be parsed: whatever you need is extracted from it.
Common parsing tools include regular expressions, BeautifulSoup, lxml, and so on.
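A small contrast of two of these approaches on the same snippet, using only the Python 3 standard library (a regex versus html.parser; BeautifulSoup and lxml offer friendlier APIs for the same job):

```python
import re
from html.parser import HTMLParser

html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'

# Regular expression: quick, but brittle against messy real-world HTML
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

# A real parser tolerates attribute order, quoting and nesting much better
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.append(dict(attrs).get('href'))

parser = LinkParser()
parser.feed(html)
print(links)         # → [('/a', 'First'), ('/b', 'Second')]
print(parser.hrefs)  # → ['/a', '/b']
```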
Storage
Once the content we need has been parsed out, the next step is to store it.
We can choose to save it to a text file, or to a MySQL or MongoDB database.
There are two storage problems worth noting:
- How do you de-duplicate similar web pages?
- In what form should the content be stored?
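Both questions can be sketched at once (sqlite3 standing in for MySQL here; the table and column names are invented for the example): a primary key on the URL gives cheap exact de-duplication, and the schema is one answer to "in what form":

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # sqlite3 as a stand-in for MySQL
conn.execute('CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)')

rows = [
    ('http://example.com/1', 'hello'),
    ('http://example.com/1', 'duplicate fetch of the same page'),
]
# INSERT OR IGNORE silently skips URLs that are already stored
conn.executemany('INSERT OR IGNORE INTO pages VALUES (?, ?)', rows)

count, = conn.execute('SELECT COUNT(*) FROM pages').fetchone()
print(count)  # → 1
```

Detecting *similar* (not identical) pages needs more than a key, e.g. content hashing or shingling; that is the open question the list above raises.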
Scrapy
Scrapy is an open-source Python crawler framework based on Twisted, and it is widely used in industry.
For related content you can refer to Building your own web crawler based on Scrapy, which also introduces a WeChat search crawling project whose code you can study as a reference.
Reference project: Recursively crawling WeChat search results with Scrapy or Requests
The Robots protocol
A good web crawler should first of all obey the Robots protocol. The Robots protocol (also called the crawler protocol or robot protocol), in full the Robots Exclusion Protocol, is the standard by which a website tells crawlers and search engines which of its pages may be crawled and which may not.
A text file named robots.txt is placed in the website's root directory (for example https://www.taobao.com/robots.txt ). In it, the site declares which pages web crawlers may access and which are off-limits, with the pages specified by regular-expression-like patterns. Before gathering from a site, a web crawler should first fetch this robots.txt file, parse its rules, and then collect the site's data according to those rules.
1. Robots protocol rules
User-agent: specifies which crawlers the rules apply to
Disallow: specifies URLs that must not be accessed
Allow: specifies URLs that may be accessed
Note: each directive's first letter must be capitalized, the colon must be a half-width colon followed by one space, and "/" stands for the entire website
2. Robots protocol examples
Block all robots:
User-agent: *
Disallow: /
Allow all robots:
User-agent: *
Disallow:
Block a specific robot:
User-agent: BadBot
Disallow: /
Allow a specific robot:
User-agent: GoodBot
Disallow:
Block access to a specific directory:
User-agent: *
Disallow: /images/
Allow access to a specific directory only:
User-agent: *
Allow: /images/
Disallow: /
Block access to specific files:
User-agent: *
Disallow: /*.html$
Allow access to specific files only:
User-agent: *
Allow: /*.html$
Disallow: /
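Rules like the examples above can be checked from Python with the standard library's urllib.robotparser (the rules and URLs here are made up; normally you would call set_url() and read() to fetch the site's real robots.txt):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the robots.txt rules as a list of lines
rp.parse("""
User-agent: *
Disallow: /images/
""".splitlines())

ok_page = rp.can_fetch('MyCrawler', 'http://example.com/page.html')
ok_image = rp.can_fetch('MyCrawler', 'http://example.com/images/a.png')
print(ok_page, ok_image)  # → True False
```

Calling can_fetch before every request is the simplest way for a crawler to stay on the right side of the protocol.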