Performance comparison of three Python web scraping methods

 Below we introduce three methods for scraping data from web pages: first regular expressions, then the popular BeautifulSoup module, and finally the powerful lxml module.

1. Regular expressions

  If you are new to regular expressions, or need some hints, check out the Regular Expression HOWTO  for a full introduction.

  When we grab country area data using regular expressions, we first try to match the content of the element, like this:

>>> import re
>>> import urllib2
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = urllib2.urlopen(url).read()
>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
['<img src="/places/static/images/flags/gb.png" />', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', '<a href="/continent/EU">EU</a>', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\\d{2}[A-Z]{2})|([A-Z]\\d{3}[A-Z]{2})|([A-Z]{2}\\d{2}[A-Z]{2})|([A-Z]{2}\\d{3}[A-Z]{2})|([A-Z]\\d[A-Z]\\d[A-Z]{2})|([A-Z]{2}\\d[A-Z]\\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', '<div><a href="/iso/IE">IE </a></div>']

 

 

   From the above results, it can be seen that the <td class="w2p_fw"> tag is used for multiple country attributes. To isolate the area attribute, we can simply select the second matching element, like this:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

 

   While this approach works for now, it is likely to fail if the page changes. For example, the table could be updated so that the area data is no longer in the second matching row. If we only need to scrape the data once, we can ignore such possible future changes. However, if we want to fetch this data again in the future, we need a more robust solution that is insulated from layout changes as much as possible. To make the regular expression more robust, we can include its parent <tr> element as well. Since that element has an ID attribute, it should be unique.

>>> re.findall('<tr id="places_area__row"><td class="w2p_fl"><label for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

 

 

  This iterative version looks a bit better, but there are many other ways the web page could be updated that would still break this regex, such as changing double quotes to single quotes, adding extra spaces between the <td> tags, or changing the places_area__label id. Below is an improved version that attempts to support these possibilities.

>>> re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)
['244,820 square kilometres']

 

  While this regular expression is easier to adapt to future changes, it is harder to construct and less readable. In addition, there are still small layout changes that would break it, such as adding a title attribute to the <td> tag.
  As can be seen from this example, regular expressions provide a shortcut for scraping data, but they are too fragile and tend to break once the page is updated. Fortunately, there are better solutions, which are introduced below.
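  As a rough sketch (not part of the original example), the pattern could be loosened even further to tolerate extra attributes inside the <td> tag, though this only trades readability for a little more resilience:

>>> # hypothetical, more permissive pattern: allow arbitrary extra attributes and
>>> # whitespace inside the <td> tag, and let .*? cross line breaks via re.DOTALL
>>> re.findall('<tr id="places_area__row">.*?<td[^>]*class=["\']w2p_fw["\'][^>]*>(.*?)</td>', html, re.DOTALL)
['244,820 square kilometres']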

2. Beautiful Soup

  Beautiful Soup is a very popular Python module. It parses web pages and provides a convenient interface for locating their content. If you haven't installed the module yet, you can install the latest version with the following command (pip must be installed first):

pip install beautifulsoup4

 

  The first step in using Beautiful Soup is to parse the downloaded HTML into a soup document. Since most web pages do not contain well-formed HTML, Beautiful Soup needs to determine their actual structure. For example, in the listing of the simple web page below, the attribute value is missing its quotes and the <li> tags are left unclosed.

<ul class=country>
    <li>Area
    <li>Population
</ul>

 

 

  If the Population list item were parsed as a child of the Area list item, instead of the two list items sitting side by side, we would get the wrong result when scraping. Let's see how Beautiful Soup handles it.

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML
>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> fixed_html = soup.prettify()
>>> print fixed_html
<ul class="country">
 <li>
  Area
  <li>
   Population
  </li>
 </li>
</ul>

  As you can see from the execution results above, Beautiful Soup was able to fix the missing attribute quotes and close the tags (although with html.parser the second list item ends up nested inside the first). Now we can use the find() and find_all() methods to locate the elements we need.

>>> ul = soup.find('ul', attrs={'class':'country'})
>>> ul.find('li') # return just the first match
<li>Area<li>Population</li></li>
>>> ul.find_all('li') # return all matches
[<li>Area<li>Population</li></li>, <li>Population</li>]

 

Note: Depending on the Python version and the parser used, fault tolerance differs, so the processing results may not match the output above. For details, see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser . For all available methods and parameters, refer to the official Beautiful Soup documentation.
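  As a rough illustration (assuming the lxml and html5lib packages are also installed), the parser can be chosen explicitly when constructing the soup, and different parsers may repair the broken list differently:

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # the second argument selects the parser; results for malformed HTML can differ
>>> print BeautifulSoup(broken_html, 'html.parser').prettify()
>>> print BeautifulSoup(broken_html, 'lxml').prettify()      # requires lxml to be installed
>>> print BeautifulSoup(broken_html, 'html5lib').prettify()  # requires html5lib to be installed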

  Below is the complete code to extract sample country area data using this method.

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = urllib2.urlopen(url).read()
>>> soup = BeautifulSoup(html, 'html.parser')
>>> # locate the area row
>>> tr = soup.find(attrs={'id':'places_area__row'})
>>> # locate the area tag
>>> td = tr.find(attrs={'class':'w2p_fw'})
>>> area = td.text # extract the text from this tag
>>> print area
244,820 square kilometres

 

  This code, while more complex than the regular expression code, is easier to construct and understand. Moreover, we no longer need to worry about minor layout changes such as extra whitespace or additional tag attributes.

3. Lxml

  Lxml is a Python wrapper around libxml2, an XML parsing library written in C, which makes it parse faster than Beautiful Soup but also harder to install. The latest installation instructions can be found at http://lxml.de/installation.html .
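  On most platforms a pre-built package can usually be installed with pip (see the link above for platform-specific notes):

pip install lxml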

  Like Beautiful Soup, the first step in using the lxml module is to parse potentially invalid HTML into a unified format. Here is an example of parsing an incomplete HTML using this module:

>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML
>>> tree = lxml.html.fromstring(broken_html)
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print fixed_html
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>

 

 

  Likewise, lxml correctly parses missing quotes around attributes and closes tags, but the module does not add additional <html> and <body> tags.

  After parsing the input, the next step is selecting elements. Here lxml offers several different approaches, such as XPath selectors and a find() method similar to Beautiful Soup's. In this article, however, we will use CSS selectors, because they are more concise and can be reused later when parsing dynamic content. In addition, readers with experience using jQuery selectors will find them familiar.

  Here's an example code to extract area data using lxml's CSS selector:

>>> import urllib2
>>> import lxml.html
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = urllib2.urlopen(url).read()
>>> tree = lxml.html.fromstring(html)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print area
244,820 square kilometres

 

   The CSS selector line first finds the table row element whose ID is places_area__row, and then selects the table data child tag whose class is w2p_fw.
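   For reference, here is a rough sketch of the XPath and find() alternatives mentioned earlier, reusing the tree parsed above; the CSS selector is simply the most compact of the three:

>>> # equivalent XPath expression
>>> tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0].text_content()
'244,820 square kilometres'
>>> # equivalent ElementTree-style find() calls
>>> tree.find(".//tr[@id='places_area__row']").find("td[@class='w2p_fw']").text_content()
'244,820 square kilometres'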

   CSS selectors represent the patterns used to select elements. Here are some examples of commonly used selectors:

Select all tags: *
Select <a> tags: a
Select all elements with class="link": .link
Select <a> tags with class="link": a.link
Select the <a> tag with id="home": a#home
Select all <span> tags that are children of <a> tags: a > span
Select all <span> tags inside <a> tags: a span
Select <a> tags whose title attribute is "Home": a[title=Home]

   The W3C's CSS3 selectors specification is available at https://www.w3.org/TR/2011/REC-css3-selectors-20110929/

  Lxml implements most of the CSS3 selector specification; the unsupported features are listed at https://cssselect.readthedocs.io/en/latest/ .

Note: lxml's internal implementation actually converts CSS selectors to equivalent XPath selectors.
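As a rough illustration, the cssselect package (which lxml relies on for CSS selector support) exposes this translation directly; the exact XPath string below may vary slightly between versions:

>>> from cssselect import HTMLTranslator
>>> HTMLTranslator().css_to_xpath('tr#places_area__row > td.w2p_fw')
"descendant-or-self::tr[@id = 'places_area__row']/td[@class and contains(concat(' ', normalize-space(@class), ' '), ' w2p_fw ')]"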

4. Performance comparison

   In the following code, each scraper is executed 1000 times; each run checks that the scraping result is correct, and the total elapsed time is printed at the end.

# -*- coding: utf-8 -*-

import csv
import time
import urllib2
import re
import timeit
from bs4 import BeautifulSoup
import lxml.html

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format',
          'postal_code_regex', 'languages', 'neighbours')


def regex_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search('<tr id="places_{}__row">.*?<td class="w2p_fw">(.*?)</td>'.format(field), html).groups()[0]
    return results


def beautiful_soup_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_{}__row'.format(field)).find('td', class_='w2p_fw').text
    return results


def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()
    return results


def main():
    times = {}
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    NUM_ITERATIONS = 1000  # number of times to test each scraper
    for name, scraper in ('Regular expressions', regex_scraper), ('Beautiful Soup', beautiful_soup_scraper), ('Lxml', lxml_scraper):
        times[name] = []
        # record start time of scrape
        start = time.time()
        for i in range(NUM_ITERATIONS):
            if scraper == regex_scraper:
                # the regular expression module will cache results
                # so need to purge this cache for meaningful timings
                re.purge()
            result = scraper(html)
            # check scraped result is as expected
            assert(result['area'] == '244,820 square kilometres')
            times[name].append(time.time() - start)
        # record end time of scrape and output the total
        end = time.time()
        print '{}: {:.2f} seconds'.format(name, end - start)

    writer = csv.writer(open('times.csv', 'w'))
    header = sorted(times.keys())
    writer.writerow(header)
    for row in zip(*[times[scraper] for scraper in header]):
        writer.writerow(row)


if __name__ == '__main__':
    main()

 

 

   Note that we call re.purge() inside the timing loop for the regular expression scraper. By default, the re module caches compiled patterns, so to keep the comparison fair we need to clear this cache on every iteration.
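   As a tiny standalone illustration of that cache (not part of the benchmark itself):

>>> import re
>>> re.search('w2p_fw', '<td class="w2p_fw">244,820</td>')  # compiles 'w2p_fw' and stores it in re's internal pattern cache
<_sre.SRE_Match object at 0x...>
>>> re.purge()  # empties the cache, so the next search recompiles the pattern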

Here is the result of running the script on my computer:

[Figure: console output of the script, showing the total time taken by each scraping method]


   Due to differences in hardware, the results on other computers will vary; however, the relative differences between the methods should be comparable. As you can see from the results, Beautiful Soup is over 7 times slower than the other two methods when scraping our example web page. This result is expected, because lxml and the regular expression module are written in C, while Beautiful Soup is pure Python. An interesting fact is that lxml performs about as well as regular expressions, even though lxml has the additional overhead of parsing the input into its internal format before it can search for elements. When scraping many features from the same web page, this initial parsing overhead is amortized and lxml becomes even more competitive, which is why lxml is such a powerful module.

5. Summary

Advantages and disadvantages of three web scraping methods:

Scraping method        Performance   Ease of use   Installation difficulty
Regular expressions    Fast          Hard          Easy (built-in module)
Beautiful Soup         Slow          Easy          Easy (pure Python)
Lxml                   Fast          Easy          Relatively hard



   If your crawler's bottleneck is downloading pages rather than extracting data, then using a slower method such as Beautiful Soup is not a problem. Regular expressions are very useful for one-off extractions, since they avoid the overhead of parsing the entire web page; if you only need to scrape a small amount of data and want to avoid extra dependencies, regular expressions may be the better fit. However, lxml is usually the best choice for scraping data, because it is both fast and feature-rich, while regular expressions and Beautiful Soup are only useful in certain scenarios.

 

 

 

 

Reprinted from: https://blog.csdn.net/oscer2016/article/details/70209144
