To scrape a website, we first need to download the pages containing the data of interest, a process commonly known as crawling. There are many ways to crawl a website, and which approach is most appropriate depends on the structure of the target site. In this chapter we will first discuss how to download a web page safely, and then introduce the following three common approaches to crawling a site:
- crawling a sitemap;
- iterating over each page using database IDs;
- following web page links.
So far, we have used the terms scraping and crawling interchangeably, so let us first define the similarities and differences between the two approaches.
1.5.1 Scraping versus crawling
Depending on the information you care about and on the site's content and structure, you may need either to scrape or to crawl. So what is the difference?
Web scraping usually targets specific websites and retrieves specific information from them. A scraper is built to access those particular pages, and must be modified whenever the site changes or the location of the information moves. For example, you might want to check the daily specials at your favorite local restaurant; to do so, you would scrape the part of its website where that information is updated each day.
In contrast, web crawling is usually built in a generic way, targeting either a series of top-level domains or the entire web. Crawling can be used to gather more specific information, but more commonly it traverses the web, collecting small, general pieces of information from many different sites or pages and following links to other pages.
Besides scraping and crawling, we will introduce web crawlers in Chapter 8. Crawlers can be run against a specified list of sites, or used for broader crawls across many sites or even the entire Internet.
In general, we use these specific terms to reflect our use cases. As you develop your web scrapers and crawlers, you may notice the distinction in the technologies, libraries, and packages you want to use. In those cases, understanding the different terms will help you choose the appropriate package or technique based on the terminology used (for example, is it only for scraping, or does it also work for crawlers?).
1.5.2 Downloading a web page
To scrape a web page, we first need to download it. The following sample script uses Python's urllib module to download a URL.
import urllib.request

def download(url):
    return urllib.request.urlopen(url).read()
When passed a URL, this function downloads the page and returns its HTML. However, there is a problem with this snippet: when downloading the page, we may encounter errors beyond our control; for example, the requested page may no longer exist. In those cases, urllib raises an exception and the script exits. To be safer, here is a more robust version that catches these exceptions.
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
    return html
Now, when a download or URL error occurs, the function catches the exception and returns None.
Throughout this book, we will assume you are writing code in files rather than at a prompt (as in the code above). When you see code starting with a Python prompt (>>>) or an IPython prompt (In [1]:), you should either enter it into the main file you are working on, or save it to a file and import those functions and classes into the Python interpreter.
1. Retrying downloads
Errors encountered while downloading are often temporary, such as a 503 Service Unavailable error returned when the server is overloaded. For this class of error we can retry the download after a short wait, because the server's problem may have been resolved by then. We do not need to retry on every error, however. If the server returns 404 Not Found, the page does not currently exist, and repeating the same request is unlikely to produce a different result.
The Internet Engineering Task Force (IETF) defines the full list of HTTP errors, from which we learn that 4xx errors occur when there is a problem with the request, while 5xx errors occur when there is a problem on the server. So we only need our download function to retry on 5xx errors. Below is a new version of the function with retry support.
def download(url, num_retries=2):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html
Now, when the download function encounters a 5xx error code, it retries by calling itself recursively. The function also gains a parameter for the number of retries, with a default of two attempts. We limit the number of attempts because the server error may not have been resolved in the meantime. To test this function, try downloading http://httpstat.us/500; that URL always returns a 500 error code.
>>> download('http://httpstat.us/500')
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
As the output above shows, the download function behaved as expected: it first tried to download the page, and after receiving the 500 error it made two retries before giving up.
2. Setting a user agent
By default, urllib downloads content with the user agent Python-urllib/3.x, where 3.x is the version of Python you are running. It is better to use an identifiable user agent, so that our web crawler avoids some of the problems it might otherwise encounter. Also, perhaps because of past server overloads caused by poor-quality Python crawlers, some sites block this default user agent.
Therefore, to make our downloads more reliable, we need control over the user agent setting. The code below modifies the download function to set a default user agent of 'wswp' (short for Web Scraping with Python).
def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html
Now, if you try accessing meetup.com again, you will see valid HTML. This download function can be reused in later code: it catches exceptions, retries where possible, and sets the user agent.
1.5.3 Sitemap crawler
For our first simple crawler, we will use the sitemap declared in the example site's robots.txt file to download all the pages. To parse the sitemap, we will use a simple regular expression to extract the URLs from the <loc> tags. We also need to update the code to handle encoding conversion, because our current download function simply returns bytes. (In the next chapter we will introduce a more robust parsing approach, CSS selectors.) Here is the sample crawler code.
import re

def download(url, user_agent='wswp', num_retries=2, charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
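Before running it against the live site, the <loc> regular expression can be checked on a small in-memory sitemap (the XML below is a made-up sample, not fetched from the site):

```python
import re

# a minimal hand-written sitemap in the standard <urlset> format
sitemap = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<urlset>'
           '<url><loc>http://example.python-scraping.com/view/Afghanistan-1</loc></url>'
           '<url><loc>http://example.python-scraping.com/view/Aland-Islands-2</loc></url>'
           '</urlset>')
links = re.findall('<loc>(.*?)</loc>', sitemap)
# links contains both /view/ URLs, in document order
```

The non-greedy `(.*?)` group is what keeps each match confined to a single `<loc>…</loc>` pair.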
Now run this sitemap crawler to download all the country (or region) pages from the example site.
>>> crawl_sitemap('http://example.python-scraping.com/sitemap.xml')
Downloading: http://example.python-scraping.com/sitemap.xml
Downloading: http://example.python-scraping.com/view/Afghanistan-1
Downloading: http://example.python-scraping.com/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/view/Albania-3
...
As the download method above shows, we had to handle the character encoding in order to use regular expressions on the website's response. Python's read method returns bytes, whereas regular expressions expect a string. Our code relies on the site's maintainers to include the appropriate character encoding in the response headers. If no character encoding header is returned, we fall back to a default of UTF-8 and hope for the best. Of course, the decode will throw an error if the header's encoding is wrong, or if no encoding is set and the content is not UTF-8. There are more sophisticated ways to guess an encoding (see https://pypi.python.org/pypi/chardet), which are also quite easy to implement.
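Short of pulling in chardet, a minimal stdlib-only fallback chain (an assumption, not the book's code; the helper name is made up) can at least guarantee that decoding never crashes the crawl:

```python
def decode_html(raw, header_charset=None, default='utf-8'):
    """Try the charset the server declared, then the default,
    finally the default with replacement characters, so a bad
    or missing encoding header never raises."""
    for cs in filter(None, (header_charset, default)):
        try:
            return raw.decode(cs)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next one
    return raw.decode(default, errors='replace')
```

Replacement characters (U+FFFD) in the output then signal that the page's real encoding still needs investigating.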
So far, the sitemap crawler works as expected. But as noted earlier, we cannot rely on a Sitemap file to provide links to every page. In the next section we will introduce another simple crawler that no longer depends on the Sitemap file.
If at any point you do not want to continue crawling, you can press Ctrl + C (or Cmd + C) to exit the Python interpreter or the running program.
1.5.4 ID iteration crawler
In this section, we will exploit a weakness in the site's structure to access all of its content more easily. Here are the URLs of some example country (or region) pages.
- http://example.python-scraping.com/view/Afghanistan-1
- http://example.python-scraping.com/view/Australia-2
- http://example.python-scraping.com/view/Brazil-3
As you can see, these URLs differ only in the final part of the path, which contains the country (or region) name (acting as a page alias) and an ID. Including an alias in the URL is a very common practice that helps with search engine optimization. Usually, the web server ignores this string and uses only the ID to match a record in the database. Let us remove the alias and check whether the link http://example.python-scraping.com/view/1 still works on the example site. The test result is shown in Figure 1.1.
Figure 1.1
As Figure 1.1 shows, the page still loads successfully, so this approach is useful. Now we can ignore the page alias and download all the country (or region) pages using only the database ID. The following snippet uses this technique.
import itertools

def crawl_site(url):
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            break
        # success - can scrape the result
Now we can call this function with a base URL.
>>> crawl_site('http://example.python-scraping.com/view/-')
Downloading: http://example.python-scraping.com/view/-1
Downloading: http://example.python-scraping.com/view/-2
Downloading: http://example.python-scraping.com/view/-3
Downloading: http://example.python-scraping.com/view/-4
[...]
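The pg_url construction inside the loop can be checked in isolation; itertools.islice caps the otherwise endless count (the three-URL limit here is arbitrary, for demonstration only):

```python
import itertools

base = 'http://example.python-scraping.com/view/-'
# islice stops the infinite count() after three values
urls = ['{}{}'.format(base, page)
        for page in itertools.islice(itertools.count(1), 3)]
# urls ends in -1, -2 and -3, matching the output above
```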
In this code, we iterate over the IDs until a download error occurs, which we assume means we have reached the last country (or region) page. A flaw in this implementation, however, is that some records may have been deleted, leaving gaps between the database IDs. In that case, the crawler would exit as soon as it reached one of these gaps. Here is an improved version that only exits after several consecutive download errors.
def crawl_site(url, max_errors=5):
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # max errors reached, exit loop
                break
        else:
            num_errors = 0
            # success - can scrape the result
The crawler above needs to encounter five consecutive download errors before it stops iterating, which greatly reduces the risk of stopping prematurely when some records have been deleted or hidden.
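The stop condition can be exercised without touching the network by substituting a fake download. The sketch below (with hypothetical page IDs, not the example site's) mirrors the loop above:

```python
import itertools

def crawl_ids(existing_ids, max_errors=5):
    """Simulate the error-tolerant loop: `existing_ids` is the set of
    page IDs the fake site knows about; returns the IDs 'downloaded'."""
    fetched = []
    num_errors = 0
    for page in itertools.count(1):
        if page in existing_ids:   # stands in for a successful download()
            num_errors = 0
            fetched.append(page)
        else:                      # stands in for html is None
            num_errors += 1
            if num_errors == max_errors:
                break              # too many consecutive gaps: give up
    return fetched

# IDs 4, 6, 7 and 8 were "deleted", yet the crawl still reaches 9
print(crawl_ids({1, 2, 3, 5, 9}))  # → [1, 2, 3, 5, 9]
```

Resetting num_errors on every success is what lets the crawl survive short runs of missing records while still terminating.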
ID iteration is a very convenient way to crawl a site, but like the sitemap approach it cannot be guaranteed to always be available. For example, some websites check whether the alias appears in the URL and return a 404 Not Found error if it does not. Other sites use large, non-sequential IDs, or do not use numeric IDs at all, in which case iteration is of little use. For example, Amazon uses ISBNs as book IDs, and an ISBN contains at least ten digits. Iterating over ISBNs would require testing billions of possible combinations, so this is certainly not the most efficient way to scrape that site's content.
If you have been following along, you may have noticed some TOO MANY REQUESTS download errors. Do not worry about them for now; we will cover how to handle this type of error in the "Advanced features" part of Section 1.5.5.
1.5.5 Link crawler
So far we have exploited the structure of the example website with two simple crawlers to download all the published country (or region) pages. These techniques should be used whenever they are available, because they minimize the number of pages that need to be downloaded. For other sites, however, we need to make our crawler behave more like a typical user and follow links to reach the content of interest.
By following every link, we could easily download the entire site's pages. However, this approach may download many pages we do not need. For example, to scrape user account details from an online forum, we only need to download the account pages, not the discussion threads. The link crawler in this chapter therefore uses a regular expression to decide which pages it should download. Here is the initial version of the code.
import re

def link_crawler(start_url, link_regex):
    """ Crawl from the given start URL following links matched by
        link_regex
    """
    crawl_queue = [start_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    """ Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""",
                               re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
To run this code, simply call the link_crawler function with two parameters: the URL of the website to crawl and a regular expression matching the links you want to follow. For the example website, we want to crawl the country (or region) index pages and the country (or region) pages themselves.
Looking at the site, index page links follow this format:
- http://example.python-scraping.com/index/1
- http://example.python-scraping.com/index/2
Country (or region) pages follow this format:
- http://example.python-scraping.com/view/Afghanistan-1
- http://example.python-scraping.com/view/Aland-Islands-2
Therefore, we can use the simple regular expression /(index|view)/ to match both kinds of pages. What happens when the crawler runs with these input parameters? You will get the following download error.
>>> link_crawler('http://example.python-scraping.com', '/(index|view)/')
Downloading: http://example.python-scraping.com
Downloading: /index/1
Traceback (most recent call last):
...
ValueError: unknown url type: /index/1
Regular expressions are a great tool for extracting information from strings, and I recommend that every programmer "learn how to read and write a few regular expressions". Even so, they tend to be quite fragile and break easily. We will cover more advanced ways of extracting links and identifying pages later in this book.
As you can see, the problem is that /index/1 is only the path portion of the URL, missing the protocol and server: it is a relative link. Relative links work fine in a browser, because the browser knows which page you are currently viewing and can take the steps needed to resolve them. urllib, however, has no such context. For urllib to locate the page, we need to convert the link into an absolute link containing all the details needed to find it. As you might hope, Python's urllib has a module for exactly this, named parse. Here is an improved version of link_crawler that uses the urljoin method to create absolute paths.
from urllib.parse import urljoin

def link_crawler(start_url, link_regex):
    """ Crawl from the given start URL following links matched by
        link_regex
    """
    crawl_queue = [start_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                crawl_queue.append(abs_link)
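A quick interpreter check shows how urljoin resolves the relative links that broke the first version (using URLs from the example site):

```python
from urllib.parse import urljoin

# a server-relative link is resolved against the scheme and host
index = urljoin('http://example.python-scraping.com', '/index/1')
# resolving from a deeper page replaces the old path entirely
view = urljoin('http://example.python-scraping.com/view/Australia-2',
               '/view/Brazil-3')
# index → http://example.python-scraping.com/index/1
# view  → http://example.python-scraping.com/view/Brazil-3
```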
When you run this code, you will see that the downloaded pages do match, but the same locations keep being downloaded over and over. The cause is that links exist between these locations. For example, Australia links to Antarctica, and Antarctica links back to Australia, so the crawler keeps adding the same URLs to the queue and never reaches the end of it. To avoid crawling the same links repeatedly, we need to keep track of what has already been crawled. The modified link_crawler below stores the URLs it has already discovered, to avoid downloading duplicates.
def link_crawler(start_url, link_regex):
    crawl_queue = [start_url]
    # keep track which URL's have seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                # check if have already seen this link
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)
When you run this script, it crawls every location and then stops, as expected. We finally have a working link crawler!
Advanced Features
Now, let's add some functionality to the link crawler to make it more useful for crawling other sites.
1. Parsing robots.txt
First, we need to parse robots.txt to avoid downloading URLs that are off-limits to crawlers. Python's urllib library includes a robotparser module that makes this easy, as the following code shows.
>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.python-scraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.python-scraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
The robotparser module first loads the robots.txt file, and then the can_fetch() function tells us whether a given user agent is allowed to access a page. Here, when the user agent is set to 'BadCrawler', the robotparser module returns False, indicating that this agent may not fetch the page, just as defined in the example website's robots.txt file.
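robotparser can also be fed the file contents directly through its parse() method, which makes the behavior easy to check offline. The rules below are a stand-in modeled on the example site's file, not necessarily a verbatim copy:

```python
from urllib import robotparser

robots_txt = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""
rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch('BadCrawler', 'http://example.python-scraping.com/'))   # → False
print(rp.can_fetch('GoodCrawler', 'http://example.python-scraping.com/'))  # → True
```

Any agent not matching a named group falls through to the `*` group, which here only blocks /trap.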
To integrate robotparser into the link crawler, we first want a new function that returns the robotparser object.
def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp
We need a reliable way to set robots_url; we can do that by passing an extra keyword argument to the function. We can also give it a default value, in case the user does not pass in the variable. Assuming the crawl starts at the root of the site, we can simply append robots.txt to the end of the URL. We also need to define user_agent.
def link_crawler(start_url, link_regex, robots_url=None,
                 user_agent='wswp'):
    ...
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
Finally, we add the parser check to the crawl loop.
    ...
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            html = download(url, user_agent=user_agent)
            ...
        else:
            print('Blocked by robots.txt:', url)
We can test our advanced link crawler and its use of robotparser by calling it with the bad user agent string.
>>> link_crawler('http://example.python-scraping.com', '/(index|view)/',
user_agent='BadCrawler')
Blocked by robots.txt: http://example.python-scraping.com
2. Supporting proxies
Sometimes you need to access a website through a proxy. For example, Hulu is blocked in many countries outside the United States, as are some videos on YouTube. Supporting proxies with urllib is not as easy as you might imagine. Later in this chapter we will introduce a friendlier Python HTTP module, requests, which can also handle proxies. Here is how to support a proxy with urllib.
proxy = 'http://myproxy.net:1234' # example string
proxy_support = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
# now requests via urllib.request will be handled via proxy
Below is a new version of the download function with this feature integrated.
def download(url, user_agent='wswp', num_retries=2, charset='utf-8',
             proxy=None):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html
At the time of writing (Python 3.5), urllib's proxy support does not include https proxies. This may change in later versions of Python, so check the latest documentation. Alternatively, you can use the recipe recommended in the documentation (https://code.activestate.com/recipes/456195), or keep reading to learn how to use the requests library.
3. Throttling downloads
If we crawl a website too quickly, we risk being banned or overloading the server. To reduce these risks, we can add a delay between consecutive downloads, throttling the crawler's speed. Here is a class that implements this feature.
from urllib.parse import urlparse
import time

class Throttle:
    """Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
The Throttle class records the last time each domain was accessed, and sleeps if the time since that last access is shorter than the specified delay. We can throttle the crawler by calling wait before every download.
throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
proxy=proxy, charset=charset)
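The per-domain bookkeeping in Throttle hinges on urlparse extracting the network location from each URL, which is easy to confirm at the interpreter:

```python
from urllib.parse import urlparse

# two pages on the same host map to the same throttle key
a = urlparse('http://example.python-scraping.com/index/1').netloc
b = urlparse('http://example.python-scraping.com/view/Albania-3').netloc
print(a)  # → example.python-scraping.com
# a == b, so both URLs share a single last-accessed timestamp
```

This is why the delay applies per domain rather than per URL: crawling many pages of one site is slowed down, while downloads from different hosts are unaffected.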
4. Avoiding spider traps
At present, our crawler follows every link it has not visited before. However, some websites generate their content dynamically, so the number of pages can be unbounded. For example, if a website has an online calendar with links to the next month and the next year, then the next month's page will in turn link to the month after, and these requests could continue until some far-off maximum date (which may take a very long time to reach). The site may offer the same functionality through simple pagination, where requesting ever-higher page numbers keeps returning essentially empty search result pages, up to some maximum. This situation is known as a spider trap.
A simple way to avoid falling into a spider trap is to record the number of links followed to reach the current page, which we call the depth. When the maximum depth is reached, the crawler stops adding links from that page to the queue. To implement this, we modify the seen variable: originally it only recorded the links that had been visited; now we make it a dictionary that also records the depth at which each link was discovered.
def link_crawler(..., max_depth=4):
    seen = {}
    ...
    if rp.can_fetch(user_agent, url):
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        ...
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen[abs_link] = depth + 1
                    crawl_queue.append(abs_link)
With this feature in place, we can be confident the crawl will eventually finish. To disable it, simply set max_depth to a negative number, which the current depth will never equal.
5. The final version
The complete source code of this advanced link crawler can be downloaded from the publisher's community site, under the file name advanced_link_crawler.py. To follow along with the book, you can fork the code repository and use it to compare against and test your own code.
To test the link crawler, we can set the user agent to BadCrawler, which as described earlier in this chapter is blocked by robots.txt. As the following run shows, the crawler is indeed blocked and the code finishes almost immediately.
>>> start_url = 'http://example.python-scraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.python-scraping.com/
Now, let's use the default user agent and set the maximum depth to 1, so that only the links on the home page are downloaded.
>>> link_crawler(start_url, link_regex, max_depth=1)
Downloading: http://example.python-scraping.com//index
Downloading: http://example.python-scraping.com/index/1
Downloading: http://example.python-scraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.python-scraping.com/view/Antarctica-9
Downloading: http://example.python-scraping.com/view/Anguilla-8
Downloading: http://example.python-scraping.com/view/Angola-7
Downloading: http://example.python-scraping.com/view/Andorra-6
Downloading: http://example.python-scraping.com/view/American-Samoa-5
Downloading: http://example.python-scraping.com/view/Algeria-4
Downloading: http://example.python-scraping.com/view/Albania-3
Downloading: http://example.python-scraping.com/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/view/Afghanistan-1
As expected, the crawler stopped after downloading the first page of the country (or region) list.
1.5.6 Using the requests library
Although we have built a fairly advanced crawler using only urllib, most mainstream scrapers today use the requests library to manage complex HTTP requests. The project started as a small library wrapping urllib's features in a "usable by humans" way, but it has since grown into a large project with hundreds of contributors. Among its features are built-in encoding handling, important SSL and security updates, and simple handling of POST requests, JSON, cookies, and proxies.
For most of this book we will use the requests library, because it is simple and easy to use, and because it is the de facto standard for most web scraping projects.
To install requests, simply use pip.
pip install requests
For more information on all of its features, read the documentation at http://python-requests.org, or browse the source code at https://github.com/kennethreitz/requests.
To compare the two libraries, I also built an advanced link crawler that uses requests. You can find and view the code in the source file downloaded from the publisher's community site, named advanced_link_crawler_using_requests.py. The main download function demonstrates the key differences. The requests version is shown below.
import requests

def download(url, user_agent='wswp', num_retries=2, proxies=None):
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html
One notable difference is convenience: every response carries a status_code attribute, which is easy to use. We also no longer need to test character encodings, because the text attribute of the Response object handles that for us automatically. For the rare case of an unresolvable URL or a timeout, everything can be handled with RequestException, a simple catch-all. Proxy handling is also taken care of: we simply pass a dictionary of proxies (that is, {'http': 'http://myproxy.net:1234', 'https': 'https://myproxy.net:1234'}).
We will continue to compare and use both libraries, so that you become familiar with them and can choose based on your needs and use cases. Whenever you are handling more complex websites, or need to handle important "human" behaviors such as cookies or sessions, I strongly recommend using requests. We will discuss more of these topics in Chapter 6.
This article is excerpted from Web Scraping with Python, 2nd edition.
Authors: Katharine Jarmul, Richard Lawson
Translator: Li Bin