Write Your First Web Crawler

To scrape a website, we first need to download the pages that contain the data of interest, a process commonly referred to as crawling. There are many ways to crawl a website, and which approach is most appropriate depends on the structure of the target site. In this chapter, we will first discuss how to download pages safely, and then introduce the following three common approaches to crawling a website:

  • Crawling a sitemap;
  • Iterating through each page using database IDs;
  • Following web page links.

So far we have used the terms scraping and crawling interchangeably, so let us first define the similarities and differences between the two approaches.

1.5.1 Scraping Versus Crawling

Depending on the information you care about and the content and structure of the site, you may need to either scrape or crawl it. What is the difference between the two?

Web scraping usually targets specific websites, with the goal of obtaining specific information from them. A web scraper accesses these particular pages and needs to be modified if the site changes or if the location of the information within the site changes. For example, you might want to scrape your favorite local restaurant's website to check the daily specials; to do that, you would scrape the part of its site that is updated each day.

In contrast, web crawling is usually built in a generic way, targeting either a series of top-level domains or the entire web. Crawling can be used to gather more specific information, but the more common case is crawling the web, picking up small, general pieces of information from many different sites or pages and following links to other pages.

In addition to scraping and crawling, we will also cover web spiders in Chapter 8. Spiders can be used to crawl a specified list of sites, or for broader crawls across many sites or even the entire Internet.

In general, we will use whichever specific term best reflects our use case. As you develop web scrapers and crawlers, you may notice differences in the technologies, libraries, and packages you want to use. In those cases, understanding the different terms will help you select an appropriate package or technique based on the terminology it uses (for example, is it only for scraping? Does it also work for spiders?).

1.5.2 Downloading a Web Page

To scrape web pages, we first need to download them. The following sample script uses Python's urllib module to download a URL.

import urllib.request
def download(url):
    return urllib.request.urlopen(url).read()

When a URL is passed in, this function downloads the page and returns its HTML. However, there is a problem with this snippet: when downloading the page, we may encounter errors that are beyond our control; for example, the requested page may no longer exist. In these cases, urllib raises an exception and the script exits. To be safer, here is a more robust version that catches these exceptions.

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
    return html

Now, when a download or URL error occurs, the function catches the exception and returns None.

Throughout this book, we will assume you are writing code in files rather than at a prompt (as in the code above). When code starts with the Python prompt >>> or the IPython prompt In [1]:, you will need to either enter it into the main file you have been working with, or save the file and then import those functions and classes into your Python interpreter.

1. Retrying downloads

Errors encountered while downloading are often temporary, for example the 503 Service Unavailable error returned when the server is overloaded. For this type of error, we can retry the download after a short wait, since the server's problem may have been resolved by then. However, we do not want to retry all errors. If the server returns 404 Not Found, the page does not currently exist, and repeating the same request is unlikely to produce a different result.

The Internet Engineering Task Force (IETF) defines the full list of HTTP errors, from which we can see that 4xx errors occur when there is a problem with the request, while 5xx errors occur when there is a problem on the server side. So we only need to make sure the download function retries when a 5xx error occurs. Below is a new version of the function with retry support.

def download(url, num_retries=2):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

Now, when the download function encounters a 5xx error code, it retries the download by calling itself recursively. The function also gains a parameter for setting the number of retries, with a default value of two. We limit the number of attempts because the server error may not have been resolved yet. To test this function, try downloading http://httpstat.us/500, a URL that always returns a 500 error code.

>>> download('http://httpstat.us/500')
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error

As the output above shows, the download function behaved as expected: it first tried to download the page, and after receiving the 500 error it performed two retries before giving up.

2. Setting a user agent

By default, urllib downloads content with the user agent Python-urllib/3.x, where 3.x is the version of the Python environment in use. It is better to use an identifiable user agent, in case our web crawler runs into problems. Also, some websites block this default user agent, perhaps because they have previously experienced server overload caused by poorly written Python crawlers.

Therefore, to make our downloads more reliable, we need to control the user agent setting. The code below modifies the download function to set a default user agent of 'wswp' (short for Web Scraping with Python).

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

Now, if you try accessing meetup.com again, you will see valid HTML. Our download function can now be reused in later code: it catches exceptions, retries where possible, and sets the user agent.
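
For example, a quick call against the example site used throughout this chapter looks like this (the Downloading: line is printed by the function itself):

>>> html = download('http://example.python-scraping.com', user_agent='wswp')
Downloading: http://example.python-scraping.com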

1.5.3 Sitemap Crawler

For our first simple crawler, we will use the sitemap discovered in the example website's robots.txt file to download all of its pages. To parse the sitemap, we will use a simple regular expression to extract URLs from the <loc> tags.

We also need to update our code to handle encoding conversion, because our current download function simply returns bytes. In the next chapter, we will introduce a more robust parsing method known as CSS selectors. Below is the sample crawler code.

import re

def download(url, user_agent='wswp', num_retries=2, charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...

Now let's run the sitemap crawler to download all the country (or region) pages from the example website.

>>> crawl_sitemap('http://example.python-scraping.com/sitemap.xml')
Downloading: http://example.python-scraping.com/sitemap.xml
Downloading: http://example.python-scraping.com/view/Afghanistan-1
Downloading: http://example.python-scraping.com/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/view/Albania-3
... 

As shown in the download method above, we had to update the character encoding so that the website's response can be handled by the regular expression. Python's read method returns bytes, while regular expressions expect strings. Our code relies on the site maintainer including the appropriate character encoding in the response headers. If no character-encoding header is returned, we fall back to the default value of UTF-8 and hope for the best. Of course, if the encoding in the returned header is wrong, or if no encoding is set and the content is not UTF-8, an error will be thrown. There are more sophisticated ways to guess the encoding (see https://pypi.python.org/pypi/chardet), and they are fairly easy to implement.
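
As a brief sketch of that approach (assuming the chardet package is installed; the helper name here is ours, not part of the book's code):

import chardet  # pip install chardet

def detect_encoding(raw_bytes, fallback='utf-8'):
    """Guess the character encoding of raw bytes, falling back when detection fails."""
    guess = chardet.detect(raw_bytes)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    return guess['encoding'] or fallback

The detected encoding could then be passed to decode() in place of the charset fallback.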

So far, the sitemap crawler has worked as expected. However, as mentioned earlier, we cannot rely on a Sitemap file to provide links to every page. In the next section, we will introduce another simple crawler that no longer depends on the Sitemap file.

If you do not want to continue crawling at any point, you can press Ctrl + C or Cmd + C to exit the Python interpreter or stop the running program.

1.5.4 ID Iteration Crawler

In this section, we will take advantage of a weakness in the site's structure to reach all of its content more easily. Below are some example country (or region) URLs (as seen in the sitemap output above):

  • http://example.python-scraping.com/view/Afghanistan-1
  • http://example.python-scraping.com/view/Aland-Islands-2
  • http://example.python-scraping.com/view/Albania-3

As you can see, these URLs differ only in the final part of the path, which contains the country (or region) name (used as the page alias) and an ID. Including an alias in the URL is a very common practice that helps with search engine optimization. Usually, the web server ignores this string and uses only the ID to match a record in the database. Below, we remove the alias to test whether the link http://example.python-scraping.com/view/1 is still available on the example site. The result is shown in Figure 1.1.

Figure 1.1 (screenshot: the page at http://example.python-scraping.com/view/1 still loads without the alias)

As Figure 1.1 shows, the page still loads successfully, so this approach is useful. Now, we can ignore the page alias and download all of the country (or region) pages using only the database ID. The following code snippet uses this technique.

import itertools

def crawl_site(url):
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            break
        # success - can scrape the result

Now we can use this function by passing in the base URL.

>>> crawl_site('http://example.python-scraping.com/view/-')
Downloading: http://example.python-scraping.com/view/-1
Downloading: http://example.python-scraping.com/view/-2
Downloading: http://example.python-scraping.com/view/-3
Downloading: http://example.python-scraping.com/view/-4
[...]

In this code, we iterate over the IDs until a download error occurs, at which point we assume we have reached the last country (or region) page. However, this implementation has a flaw: some records may have been deleted, leaving gaps between the IDs in the database. In that case, the crawler would quit as soon as it hits one of these gaps. Below is an improved version of the code that only exits after several consecutive download errors.

def crawl_site(url, max_errors=5):
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # max errors reached, exit loop
                break
        else:
            num_errors = 0
            # success - can scrape the result

The crawler above now has to encounter five consecutive download errors before it stops iterating, which greatly reduces the risk of stopping prematurely when some records have been deleted or hidden.

ID iteration is a very convenient way to crawl a website, but like the sitemap approach, it cannot be relied on to always be available. For example, some sites check whether the page alias is present in the URL, and return a 404 Not Found error if it is not. Other sites use large, non-sequential IDs, or do not use numeric IDs at all, in which case iteration is hard to apply. For example, Amazon uses ISBNs as book IDs, and an ISBN contains at least ten digits. Iterating over ISBNs would require testing billions of possible combinations, so this is certainly not the most efficient way to scrape that site's content.

As you followed along, you may have noticed some TOO MANY REQUESTS download error messages. Don't worry about them for now; we will cover ways of handling this type of error in the "Advanced Features" part of Section 1.5.5.

1.5.5 Link Crawler

So far, we have exploited the structure of the example website to implement two simple crawlers that download all of the published country (or region) pages. These techniques should be used whenever they are available, because they minimize the number of pages that need to be downloaded. However, for other websites, we need to make the crawler behave more like a normal user and follow links to reach the content of interest.

By following every link, we could easily download the pages of the entire website. However, this approach may download many pages we do not need. For example, if we want to scrape user account details from an online forum, we only need to download the account pages, not the discussion threads. The link crawler used in this chapter will use a regular expression to decide which pages it should download. Below is the initial version of the code.

import re

def link_crawler(start_url, link_regex):
    """ Crawl from the given start URL following links matched by
link_regex
    """
    crawl_queue = [start_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    """ Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""",
re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)

To run this code, simply call the link_crawler function with two parameters: the URL of the website you want to crawl, and a regular expression matching the links you want to follow. For the example website, we want to crawl the country (or region) index pages and the country (or region) pages themselves.

Looking at the site's index page, the index links follow this format:

  • http://example.python-scraping.com/index/1

The country (or region) pages follow this format:

  • http://example.python-scraping.com/view/Afghanistan-1

Therefore, we can use the simple regular expression /(index|view)/ to match both types of page. What happens when the crawler is run with these parameters? You get the download error shown below.

>>> link_crawler('http://example.python-scraping.com', '/(index|view)/')
Downloading: http://example.python-scraping.com
Downloading: /index/1
Traceback (most recent call last):
    ...
ValueError: unknown url type: /index/1

Regular expressions are a great tool for extracting information from strings, so I recommend that every programmer learn how to read and write a few of them. Even so, they tend to be quite brittle and break easily. We will cover more advanced ways of extracting links and identifying pages later in this book.
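
As a quick illustration of how the link regex above behaves (an informal check, not part of the book's code; /about is just an arbitrary non-matching path):

>>> import re
>>> bool(re.match('/(index|view)/', '/index/1'))
True
>>> bool(re.match('/(index|view)/', '/about'))
False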

The problem in the error output above is that when /index/1 was downloaded, the link was only the path portion of the web page, without the protocol and server parts; in other words, it is a relative link. Relative links work fine when you browse, because the browser knows which page you are currently viewing and can take the steps needed to resolve them. urllib, however, lacks this context. To help urllib locate the page, we need to convert the link into an absolute link that contains all of the details needed to locate the page. As you might hope, Python's urllib has a module that does exactly this, called parse. Below is an improved version of link_crawler that uses the urljoin method to create absolute paths.

from urllib.parse import urljoin

def link_crawler(start_url, link_regex):
    """ Crawl from the given start URL following links matched by
link_regex
    """
    crawl_queue = [start_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                crawl_queue.append(abs_link)

When you run this code, you will see that although the downloaded pages match the pattern, the same locations keep being downloaded over and over. The cause of this behavior is that these locations link to each other. For example, Australia links to Antarctica, and Antarctica links back to Australia, so the crawler keeps queueing these URLs and never reaches the end of the queue. To avoid crawling the same links repeatedly, we need to keep track of which links have already been crawled. The following modified link_crawler function stores the URLs that have already been seen, to avoid downloading duplicates.

def link_crawler(start_url, link_regex):
    crawl_queue = [start_url]
    # keep track which URL's have seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                # check if we have already seen this link
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)

When you run this script, it crawls all of the locations and then stops as expected. We finally have a working link crawler!

Advanced Features

Now, let's add some features to the link crawler to make it more useful for crawling other websites.

1. Parsing robots.txt

First, we need to parse the robots.txt file so that we avoid downloading URLs that are blocked from crawling. Using the robotparser module from Python's urllib library makes this straightforward, as the following code shows.

>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.python-scraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.python-scraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True

The robotparser module first loads the robots.txt file, and the can_fetch() function then tells us whether a given user agent is allowed to access a web page. Here, when the user agent is set to 'BadCrawler', the robotparser module reports that the page cannot be fetched, just as we defined in the example website's robots.txt file.

To integrate robotparser into the link crawler, we first create a new function that returns a robotparser object.

def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp

We need to set robots_url reliably; we can do this by passing an extra keyword argument to the function. We can also provide a default value in case the user does not pass in the variable. Assuming the crawl starts at the root of the site, we can simply append robots.txt to the end of the URL. We also need to define user_agent.

def link_crawler(start_url, link_regex, robots_url=None,
user_agent='wswp'):
    ...
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)

Finally, we add the parser check to the crawl loop.

...
while crawl_queue:
    url = crawl_queue.pop()
    # check url passes robots.txt restrictions
    if rp.can_fetch(user_agent, url):
        html = download(url, user_agent=user_agent)
        ...
    else:
        print('Blocked by robots.txt:', url)

We can test our advanced link crawler and its use of robotparser by running it with the bad user agent string.

>>> link_crawler('http://example.python-scraping.com', '/(index|view)/',
user_agent='BadCrawler')
Blocked by robots.txt: http://example.python-scraping.com

2. Supporting proxies

Sometimes we need to access a website through a proxy. For example, Hulu is blocked in many countries outside the United States, as are some videos on YouTube. Supporting proxies with urllib is not as easy as you might hope. Later in this chapter we will introduce a more user-friendly Python HTTP module, requests, which can also handle proxies. For now, here is the code for supporting proxies with urllib.

proxy = 'http://myproxy.net:1234' # example string
proxy_support = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
# now requests via urllib.request will be handled via proxy

Below is a new version of the download function with this feature integrated.

def download(url, user_agent='wswp', num_retries=2, charset='utf-8',
proxy=None):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

Currently, by default (as of Python 3.5), the urllib module does not support https proxies. This may change in future versions of Python, so check the latest documentation. Alternatively, you can use the recipe recommended in the documentation (https://code.activestate.com/recipes/456195), or keep reading to learn how to use the requests library.
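
For comparison, here is a minimal sketch of how the same placeholder proxy would be used with the requests library introduced in Section 1.5.6:

import requests

# example placeholder proxy addresses, as in the urllib snippet above
proxies = {'http': 'http://myproxy.net:1234',
           'https': 'https://myproxy.net:1234'}
resp = requests.get('http://example.python-scraping.com', proxies=proxies)
html = resp.text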

3. Throttling downloads

If we crawl a website too quickly, we risk being banned or overloading the server. To reduce these risks, we can add a delay between two consecutive downloads to the same domain, thereby throttling the crawler. Below is the code for a class implementing this feature.

from urllib.parse import urlparse
import time

class Throttle:
    """Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()

The Throttle class records the last time each domain was accessed; if the time since that last access is shorter than the specified delay, it sleeps. We can throttle the crawler by calling throttle.wait() before every download.

throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
                proxy=proxy, charset=charset)
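
For instance, a standalone sketch of how the class could be wired up looks like this (a three-second delay; the URLs are the country pages from earlier, and download() is the function defined above):

throttle = Throttle(3)  # wait at least 3 seconds between hits to the same domain

for url in ['http://example.python-scraping.com/view/Afghanistan-1',
            'http://example.python-scraping.com/view/Aland-Islands-2']:
    throttle.wait(url)  # sleeps only if this domain was accessed too recently
    html = download(url, user_agent='wswp')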

4. Avoiding spider traps

At the moment, our crawler follows any link it has not seen before. However, some websites generate their content dynamically, so they can have an infinite number of pages. For example, if a website has an online calendar that provides links to the next month and the next year, then the next month's page will also contain a link to the month after that, and these pages can keep being requested for as long as the calendar widget allows (which could be a very long time). The site might offer the same functionality through simple pagination, essentially serving endless empty search result pages up to the maximum page number. This situation is called a spider trap.

A simple way to avoid getting stuck in a spider trap is to record how many links were followed to reach the current page, which we will call its depth. When the maximum depth is reached, the crawler no longer adds links from that page to the queue. To implement this, we modify the seen variable. It originally recorded only the links that had been visited; now it becomes a dictionary that also records the depth at which each link was found.

def link_crawler(..., max_depth=4):
    seen = {}
    ...
    if rp.can_fetch(user_agent, url):
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        ...
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen[abs_link] = depth + 1
                    crawl_queue.append(abs_link)

With this feature in place, we can be confident that the crawl will always complete eventually. To disable this feature, simply set max_depth to a negative number so that the current depth never equals it.
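
For example, a call like the following (illustrative, using the example site and regex from earlier) would crawl with depth checking effectively disabled:

>>> link_crawler('http://example.python-scraping.com', '/(index|view)/', max_depth=-1)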

5. The final version

The complete source code for this advanced link crawler can be downloaded from the Asynchronous Community (the publisher's website), in the file named advanced_link_crawler.py. To follow along with the book more easily, you can fork the code base and use it to compare against and test your own code.

To test the link crawler, we can set the user agent to BadCrawler, which, as described earlier in this chapter, is blocked by robots.txt. As the following run shows, the crawler is indeed blocked and finishes almost immediately after starting.

>>> start_url = 'http://example.python-scraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.python-scraping.com/

Now, let's use the default user agent and set the maximum depth to 1, so that only the links on the home page are downloaded.

>>> link_crawler(start_url, link_regex, max_depth=1)
Downloading: http://example.python-scraping.com//index
Downloading: http://example.python-scraping.com/index/1
Downloading: http://example.python-scraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.python-scraping.com/view/Antarctica-9
Downloading: http://example.python-scraping.com/view/Anguilla-8
Downloading: http://example.python-scraping.com/view/Angola-7
Downloading: http://example.python-scraping.com/view/Andorra-6
Downloading: http://example.python-scraping.com/view/American-Samoa-5
Downloading: http://example.python-scraping.com/view/Algeria-4
Downloading: http://example.python-scraping.com/view/Albania-3
Downloading: http://example.python-scraping.com/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/view/Afghanistan-1

As expected, the crawl stopped after downloading the first page of the country (or region) list.

1.5.6 Using the requests Library

Although we have implemented a fairly advanced crawler using only urllib, most crawlers written in Python today use the requests library to manage complex HTTP requests. The project began as a small library intended to wrap urllib's features in a "human-friendly" way, but it has since grown into a large project with hundreds of contributors. Some of its features include built-in handling of encodings, important updates for SSL and security, and simple handling of POST requests, JSON, cookies, and proxies.

Throughout most of this book, we will use the requests library because it is simple enough, easy to use, and is in practice the standard for most web scraping projects.

To install requests, simply use pip:

pip install requests

If you want to learn more about all of its features, read the documentation at http://python-requests.org, or browse the source code at https://github.com/kennethreitz/requests.

To compare the differences between the two libraries, I have also built an advanced link crawler that uses requests. You can find and view its code in the source file downloaded from the Asynchronous Community, named advanced_link_crawler_using_requests.py. The main download function shows the key differences. The requests version is shown below.

import requests

def download(url, user_agent='wswp', num_retries=2, proxies=None):
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html

One notable difference is that the status_code attribute is conveniently available on every response. In addition, we no longer need to test the character encoding, because the text property of the Response object handles that for us automatically. In the rare case of an unresolvable URL or a timeout, everything can be handled by RequestException, so a single except clause is enough. Proxy handling is also taken care of: we simply pass a dictionary of proxies (that is, {'http': 'http://myproxy.net:1234', 'https': 'https://myproxy.net:1234'}).
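
A call with proxies might then look like this (an illustrative sketch; the proxy addresses are the placeholders from the text above):

proxies = {'http': 'http://myproxy.net:1234', 'https': 'https://myproxy.net:1234'}
html = download('http://example.python-scraping.com', user_agent='wswp', proxies=proxies)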

We will keep comparing and using both libraries so that you become familiar with them and can choose based on your own needs and use cases. Whenever you are dealing with more complex websites, or need to handle important user-like behavior such as cookies or sessions, I strongly recommend using requests. We will discuss these topics further in Chapter 6.

This article is excerpted from Python Web Scraping, 2nd Edition.

Authors: Katharine Jarmul [Germany], Richard Lawson [Australia]

Translator: Li Bin

  • A popular web crawling title for Python 3;
  • Covers data crawling, collection, and analysis;
  • A newly upgraded, hands-on edition written for Python 3;
  • The previous edition sold nearly 40,000 copies a year; complete example source code and the source for the example websites are provided.

This book explains how to use the new features of Python 3.6 to get started with crawling data from the web. It covers how to extract data from static websites, how to use caching with databases and files to save time and manage server load, and how to use browser-based crawlers and concurrent crawlers to develop more sophisticated crawlers.

With the help of PyQt and Selenium, you can decide when and how to crawl data from JavaScript-dependent websites, and better understand how to submit forms on complex websites protected by CAPTCHAs. The book also explains how to use Python packages (such as mechanize) for automation, how to create crawlers based on the Scrapy library, and how to apply the crawling skills you have learned to real websites.

Finally, the book also covers using crawlers to test your website, remote scraping techniques, image processing, and other related topics.

In this book you will learn how to:

  • Extract data from web pages using simple Python programs;
  • Build concurrent crawlers that process pages in parallel;
  • Crawl a website by following its links;
  • Extract features from HTML;
  • Cache downloaded HTML for reuse;
  • Compare concurrency models to determine the fastest crawler;
  • Parse JavaScript-dependent websites;
  • Interact with forms and sessions.


Source: blog.csdn.net/epubit17/article/details/93307564