What are web crawlers used for? How do they crawl? A step-by-step guide to crawling the web (with Python)

01 Overview of web crawlers

Next, let's build a basic understanding of web crawlers from three aspects: their concept, their uses and value, and their structure.

1. Web crawlers and their applications

With the rapid development of the Internet, the World Wide Web has become the carrier of a huge amount of information, and how to effectively extract and use this information has become a major challenge. Web crawlers emerged to meet this need. **A web crawler (also known as a web spider or web robot) is a program or script that automatically crawls information from the World Wide Web according to certain rules.** Figure 3-1 shows the role of web crawlers in the Internet:

▲Figure 3-1 Web crawler

According to their system structure and implementation technology, web crawlers can be roughly divided into the following types: **general web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers.** A real-world crawler system is usually implemented by combining several of these techniques.

A search engine, such as the traditional general-purpose search engines Baidu, Yahoo and Google, is essentially a large and complex web crawler and belongs to the category of general web crawlers. However, general-purpose search engines have certain limitations:

  1. Users in different fields and with different backgrounds often have different search purposes and needs, and the results returned by general search engines include a large number of web pages that users do not care about.

  2. The goal of a general search engine is to maximize network coverage, so the contradiction between limited search-engine server resources and unlimited network data resources will only deepen.

  3. As data forms on the World Wide Web become richer and network technology continues to develop, data such as pictures, databases, audio, and video multimedia appear in large quantities. General search engines are often powerless to discover and obtain this kind of information-dense, semi-structured data.

  4. Most general search engines provide keyword-based retrieval and have difficulty supporting queries based on semantic information.

To solve the above problems, focused crawlers, which crawl only the relevant web resources, came into being.

**A focused crawler is a program that automatically downloads web pages. It selectively accesses web pages and related links on the World Wide Web according to established crawling goals in order to obtain the required information.** Unlike general crawlers, focused crawlers do not pursue broad coverage; their goal is to crawl web pages related to a specific topic and to prepare data resources for topic-oriented user queries.

After focused crawlers, let's talk about incremental web crawlers. An incremental web crawler incrementally updates its set of downloaded web pages and only crawls newly generated or changed pages, which ensures, to a certain extent, that the crawled pages are as fresh as possible.

Compared with crawlers that periodically re-crawl and refresh pages, an incremental crawler crawls only newly generated or updated pages when needed and does not re-download pages that have not changed. This effectively reduces the amount of data downloaded and keeps the crawled pages up to date, saving time and storage, but it also increases the complexity and implementation difficulty of the crawling algorithm.

For example, if you want to obtain recruitment information from Ganji.com, there is no need to re-crawl data you have already crawled; you only need to fetch newly updated postings. This is exactly the scenario an incremental crawler is designed for.
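As a rough sketch of this idea (assuming the target pages return a Last-Modified header and using the Requests library introduced later in this article; the URL and timestamp below are made up), an incremental crawler can send conditional requests and skip pages that come back as 304 Not Modified:

import requests  
# last_seen maps each URL to the Last-Modified value recorded on the previous crawl (hypothetical data)  
last_seen = {'http://www.example.com/jobs': 'Tue, 01 Jan 2019 00:00:00 GMT'}  
for url, stamp in last_seen.items():  
    # conditional GET: the server only sends the body if the page changed since `stamp`  
    r = requests.get(url, headers={'If-Modified-Since': stamp}, timeout=5)  
    if r.status_code == 304:  
        print url, 'unchanged, skipped'  
    else:  
        print url, 'changed, re-crawled'  
        last_seen[url] = r.headers.get('Last-Modified', stamp)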

Finally, let's talk about deep web crawlers. According to the way they exist, web pages can be divided into surface web pages and deep web pages. Surface web pages are pages that can be indexed by traditional search engines; they consist mainly of static pages reachable via hyperlinks. Deep web pages are those whose content mostly cannot be reached through static links; they are hidden behind search forms and can only be obtained after users submit certain keywords.

For example, pages that users can only access after logging in or registering. Imagine a scenario like this: to crawl data from Tieba or a forum, users must log in and have the appropriate permissions in order to obtain the complete data.

2. Web crawler structure

The following uses a general web crawler structure to illustrate the basic workflow of a web crawler, as shown in Figure 3-4.

▲Figure 3-4 Web crawler structure

The basic workflow of a web crawler is as follows:

  1. Start with a carefully selected set of seed URLs.

  2. Put these URLs into the queue of URLs to be crawled.

  3. Take a URL from the queue of URLs to be crawled, resolve its DNS to obtain the host's IP, download the web page corresponding to the URL, and store it in the downloaded page library. Then move the URL into the queue of crawled URLs.

  4. Analyze the downloaded page data to extract other URLs, compare them against the already-crawled URLs to remove duplicates, and put the remaining URLs into the queue of URLs to be crawled, thus entering the next cycle. A minimal code sketch of this loop is shown below.
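As a minimal sketch of this loop (using the Requests library introduced in the next section; the seed URL, the page limit, and the naive href regex are illustrative assumptions, not a production design):

import re  
import requests  
seed_urls = ['http://www.example.com/']      # hypothetical seed URLs  
to_crawl = list(seed_urls)                   # queue of URLs to be crawled  
crawled = set()                              # URLs that have already been crawled  
pages = {}                                   # downloaded web page library  
while to_crawl and len(crawled) < 50:        # small page limit for this sketch  
    url = to_crawl.pop(0)                    # take a URL from the queue to be crawled  
    if url in crawled:  
        continue  
    try:  
        r = requests.get(url, timeout=5)     # DNS resolution and download happen here  
    except requests.RequestException:  
        continue  
    pages[url] = r.content                   # store the downloaded page  
    crawled.add(url)                         # record the URL as crawled  
    # extract new URLs from the page and deduplicate against crawled ones  
    for link in re.findall(r'href="(http[^"]+)"', r.content):  
        if link not in crawled:  
            to_crawl.append(link)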

02 Python implementation of HTTP request

From the crawler structure above, we can see that reading URLs and downloading web pages are essential, key functions of every crawler, and both require dealing with HTTP requests. Next, we will look at three ways to implement HTTP requests in Python: urllib2/urllib, httplib/urllib, and Requests.

1. urllib2/urllib implementation

urllib2 and urllib are two built-in Python modules. To implement HTTP functionality, the approach here is to use urllib2 as the main module, with urllib as a supplement.

1.1 First implement a complete request and response model

urllib2 provides a basic function urlopen to obtain data by making a request to a specified URL. The simplest form is:

import urllib2  
response=urllib2.urlopen('http://www.zhihu.com')  
html=response.read()  
print html

In fact, the above request response to http://www.zhihu.com can be divided into two steps, one is the request, and the other is the response. The form is as follows:

import urllib2  
# request  
request=urllib2.Request('http://www.zhihu.com')  
# response  
response = urllib2.urlopen(request)  
html=response.read()  
print html
  

The above two forms are both GET requests. Next, we demonstrate a POST request, which is similar except that request data is added; this is where urllib comes in. An example is as follows:

import urllib  
import urllib2  
url = 'http://www.xxxxxx.com/login'  
postdata = {'username' : 'qiye',  
            'password' : 'qiye_pass'}  
# info needs to be encoded into a format urllib2 understands; urllib is used for this  
data = urllib.urlencode(postdata)  
req = urllib2.Request(url, data)  
response = urllib2.urlopen(req)  
html = response.read()

But sometimes you run into this: even though the data in the POST request is correct, the server denies your access. Why? The problem lies in the request headers. The server checks the request headers to determine whether the request comes from a browser; this is also a common anti-crawler technique.

1.2 Request headers processing

Rewrite the above example, add request header information, and set the User-Agent field and Referer field information in the request header.

import urllib  
import urllib2  
url = 'http://www.xxxxxx.com/login'  
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  
referer='http://www.xxxxxx.com/'  
postdata = {'username' : 'qiye',  
            'password' : 'qiye_pass'}  
# write user_agent and referer into the header information  
headers={'User-Agent':user_agent,'Referer':referer}  
data = urllib.urlencode(postdata)  
req = urllib2.Request(url, data,headers)  
response = urllib2.urlopen(req)  
html = response.read()

You can also write it this way, using add_header to add the request header information. The modified version is as follows:

import urllib  
import urllib2  
url = 'http://www.xxxxxx.com/login'  
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  
referer='http://www.xxxxxx.com/'  
postdata = {'username' : 'qiye',  
            'password' : 'qiye_pass'}  
data = urllib.urlencode(postdata)  
req = urllib2.Request(url)  
# write user_agent and referer into the header information  
req.add_header('User-Agent',user_agent)  
req.add_header('Referer',referer)  
req.add_data(data)  
response = urllib2.urlopen(req)  
html = response.read()

  

Pay special attention to some headers. The server will check these headers, for example:

  • User-Agent : Some servers or Proxy will use this value to determine whether the request is made by the browser.

  • Content-Type : When using a REST interface, the server checks this value to determine how the content in the HTTP body should be parsed. When calling a RESTful or SOAP service, an incorrect Content-Type setting will cause the server to deny service. Common values are: application/xml (used in XML RPC, such as RESTful/SOAP calls), application/json (used in JSON RPC calls), and application/x-www-form-urlencoded (used when the browser submits a web form). A small example of setting this header follows the list.

  • Referer : The server sometimes checks this for anti-hotlinking.
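For instance, a sketch of posting JSON to a hypothetical RESTful login interface with urllib2 (the URL is a placeholder, as above) might look like this:

import json  
import urllib2  
url = 'http://www.xxxxxx.com/api/login'   # hypothetical JSON interface  
payload = json.dumps({'username' : 'qiye', 'password' : 'qiye_pass'})  
# tell the server to parse the body as JSON  
headers = {'Content-Type': 'application/json'}  
req = urllib2.Request(url, payload, headers)  
response = urllib2.urlopen(req)  
print response.read()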

1.3 Cookie processing

urllib2 can also handle cookies automatically, using the CookieJar class to manage them. If you need to get the value of a particular cookie, you can do it like this:

import urllib2  
import cookielib  
cookie = cookielib.CookieJar()  
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))  
response = opener.open('http://www.zhihu.com')  
for item in cookie:  
    print item.name+':'+item.value

But sometimes we don't want urllib2 to handle cookies automatically and instead want to add the cookie content ourselves. This can be done by setting the Cookie field in the request headers:

import  urllib2  
opener = urllib2.build_opener()  
opener.addheaders.append( ( 'Cookie', 'email=' + "[email protected]" ) )  
req = urllib2.Request( "http://www.zhihu.com/" )  
response = opener.open(req)  
print response.headers  
retdata = response.read()

1.4 Timeout setting

In versions of Python earlier than 2.6, the urllib2 API does not expose a timeout setting; to set a timeout value, you can only change the global timeout of the socket module. Examples are as follows:

import urllib2  
import socket  
socket.setdefaulttimeout(10) # time out after 10 seconds  
urllib2.socket.setdefaulttimeout(10) # another way

In Python 2.6 and later, the urlopen function provides a timeout parameter. An example is as follows:

import urllib2  
request=urllib2.Request('http://www.zhihu.com')  
response = urllib2.urlopen(request,timeout=2)  
html=response.read()  
print html

  

1.5 Get HTTP response code

For a 200 OK response, you can get the HTTP status code via the getcode() method of the response object returned by urlopen. For other status codes, urlopen raises an exception, and you then need to check the code attribute of the exception object. An example is as follows:

import urllib2  
try:  
    response = urllib2.urlopen('http://www.google.com')  
    print response.getcode()  
except urllib2.HTTPError as e:  
    if hasattr(e, 'code'):  
        print 'Error code:',e.code

1.6 Redirect

By default, urllib2 automatically follows redirects for HTTP 3XX status codes. To detect whether a redirect has occurred, just check whether the URL of the response matches the URL of the request. An example is as follows:

import urllib2  
response = urllib2.urlopen('http://www.zhihu.cn')  
isRedirected = response.geturl() == 'http://www.zhihu.cn'

If you don’t want to redirect automatically, you can customize the HTTPRedirectHandler class. The example is as follows:

import urllib2  
class RedirectHandler(urllib2.HTTPRedirectHandler):  
    def http_error_301(self, req, fp, code, msg, headers):  
        pass  
    def http_error_302(self, req, fp, code, msg, headers):  
        result = urllib2.HTTPRedirectHandler.http_error_301(self, req, fp, code,   
        msg, headers)  
        result.status = code  
        result.newurl = result.geturl()  
        return result  
opener = urllib2.build_opener(RedirectHandler)  
opener.open('http://www.zhihu.cn')

  

1.7 Proxy settings

In crawler development, proxies are indispensable. By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy. However, we generally do not use this method; instead, we use ProxyHandler to set the proxy dynamically in the program. The sample code is as follows:

import urllib2  
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})  
opener = urllib2.build_opener(proxy)  
urllib2.install_opener(opener)  
response = urllib2.urlopen('http://www.zhihu.com/')  
print response.read()

One detail to note here is that urllib2.install_opener() sets the global opener for urllib2, so all subsequent HTTP access will go through this proxy. This is convenient, but it does not allow finer-grained control, for example when you want to use two different proxy settings in the same program, which is a very common scenario in crawlers. A better approach is not to change the global settings with install_opener, but to call the opener's open method directly instead of the global urlopen method. The modified version is as follows:

import urllib2  
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})  
opener = urllib2.build_opener(proxy)  
response = opener.open("http://www.zhihu.com/")  
print response.read()
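For example, a sketch of switching between two different proxies in the same program (both addresses are placeholders): each request explicitly goes through the opener it needs, and the global behavior of urlopen is left untouched.

import urllib2  
proxy_a = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})   # placeholder proxy A  
proxy_b = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})   # placeholder proxy B  
opener_a = urllib2.build_opener(proxy_a)  
opener_b = urllib2.build_opener(proxy_b)  
# each opener keeps its own proxy settings  
print opener_a.open('http://www.zhihu.com/').read()  
print opener_b.open('http://www.zhihu.com/').read()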

2. httplib/urllib implementation

The httplib module is a low-level module that lets you see every step of establishing an HTTP request, but it implements relatively few features and is rarely needed. It is basically not used in Python crawler development, so this is just for general knowledge. The commonly used objects and functions are introduced below:

  • Create an HTTPConnection object:

    class httplib.HTTPConnection(host[, port[, strict[, timeout[, source_address]]]])

  • Send a request:

    HTTPConnection.request(method, url[, body[, headers]])

  • Get the response:

    HTTPConnection.getresponse()

  • Read the response body:

    HTTPResponse.read([amt])

  • Get a specified header:

    HTTPResponse.getheader(name[, default])

  • Get a list of (header, value) tuples for the response headers:

    HTTPResponse.getheaders()

  • Get the underlying socket file descriptor:

    HTTPResponse.fileno()

  • Get the header contents:

    HTTPResponse.msg

  • Get the HTTP version of the response:

    HTTPResponse.version

  • Get the response status code:

    HTTPResponse.status

  • Get the reason phrase of the response:

    HTTPResponse.reason

Next, we will demonstrate the sending of GET requests and POST requests. The first is an example of a GET request, as shown below:

import httplib  
conn =None  
try:  
    conn = httplib.HTTPConnection("www.zhihu.com")  
    conn.request("GET", "/")  
    response = conn.getresponse()  
    print response.status, response.reason  
    print '-' * 40  
    headers = response.getheaders()  
    for h in headers:  
        print h  
    print '-' * 40  
    print response.msg  
except Exception,e:  
    print e  
finally:  
    if conn:  
        conn.close()

An example of a POST request is as follows:

import httplib, urllib  
conn = None  
try:  
    params = urllib.urlencode({'name': 'qiye', 'age': 22})  
    headers = {"Content-type": "application/x-www-form-urlencoded",  
               "Accept": "text/plain"}  
    conn = httplib.HTTPConnection("www.zhihu.com", 80, timeout=3)  
    conn.request("POST", "/login", params, headers)  
    response = conn.getresponse()  
    print response.getheaders() # get the header information  
    print response.status  
    print response.read()  
except Exception, e:  
    print e  
finally:  
    if conn:  
        conn.close()

3. More user-friendly Requests

Requests is the way of implementing HTTP requests in Python that I most strongly recommend, and it is also the most commonly used in Python crawler development. Requests makes HTTP requests very simple to implement, and its interface is more user-friendly.

The Requests library is a third-party module and requires additional installation. Requests is an open source library, the source code is located at:

GitHub: https://github.com/kennethreitz/requests

I hope everyone will support the author.

To use the Requests library, you need to install it first. There are generally two installation methods:

  • Use pip to install. The installation command is: pip install requests, but it may not be the latest version.

  • Go directly to GitHub to download the source code of Requests. The download link is:

    https://github.com/kennethreitz/requests/releases

    Unzip the source code archive, then enter the unzipped folder and run the setup.py file (python setup.py install).

How to verify whether the Requests module is installed successfully? Enter import requests in the Python shell. If no error is reported, the installation is successful. As shown in Figure 3-5.

▲Figure 3-5 Verify Requests installation
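A quick check in the interactive shell looks like this (the version number printed depends on your installation):

import requests  
print requests.__version__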

3.1 First, implement a complete request and response model

Taking the GET request as an example, the simplest form is as follows:

import requests  
r = requests.get('http://www.baidu.com')  
print r.content

As you can see, this uses less code than the urllib2 implementation. Next, let's demonstrate the POST request, which is also very short and more Pythonic in style. An example is as follows:

import requests  
postdata={'key':'value'}  
r = requests.post('http://www.xxxxxx.com/login',data=postdata)  
print r.content

Other request methods in HTTP can also be implemented using Requests. Examples are as follows:

r = requests.put('http://www.xxxxxx.com/put', data = {'key':'value'})  
r = requests.delete('http://www.xxxxxx.com/delete')  
r = requests.head('http://www.xxxxxx.com/get')  
r = requests.options('http://www.xxxxxx.com/get')

Next, let’s explain a slightly more complicated method. You must have seen URLs like this:

http://zzk.cnblogs.com/s/blogpost?Keywords=blog:qiyeboy&pageindex=1

That is, the URL is followed by a "?", with parameters after the "?". So how do you send such a GET request? Some people will say you can simply pass in the complete URL, but Requests also provides another way. An example is as follows:

import requests  
payload = {'Keywords': 'blog:qiyeboy','pageindex':1}  
r = requests.get('http://zzk.cnblogs.com/s/blogpost', params=payload)  
print r.url

By printing the results, we see that the final URL becomes:

http://zzk.cnblogs.com/s/blogpost?Keywords=blog:qiyeboy&pageindex=1

3.2 Response and encoding

Let’s start with the code. The example is as follows:

import requests  
r = requests.get('http://www.baidu.com')  
print 'content-->'+r.content  
print 'text-->'+r.text  
print 'encoding-->'+r.encoding  
r.encoding='utf-8'  
print 'new text-->'+r.text

  

Here, r.content returns the response body as bytes, r.text returns it as text, and r.encoding is the page encoding guessed from the HTTP headers.

In the output, the content after "text-->" appears garbled on the console, and the content after "encoding-->" is ISO-8859-1 (the actual encoding is UTF-8). Because Requests guessed the encoding incorrectly, the decoded text is garbled. Requests provides a solution: you can set the encoding yourself. After setting r.encoding='utf-8', the content after "new text-->" is no longer garbled.

However, this manual approach is a bit clumsy. Here is a simpler way: chardet, an excellent module for detecting the encoding of strings and files. It is installed as follows:

pip install chardet

After installation, use chardet.detect() to get a dictionary in which confidence is the detection confidence and encoding is the detected encoding. An example is as follows:

import requests  
import chardet  
r = requests.get('http://www.baidu.com')  
print chardet.detect(r.content)  
r.encoding = chardet.detect(r.content)['encoding']  
print r.text

Assigning the encoding detected by chardet directly to r.encoding implements correct decoding, and r.text then prints without garbled characters.

In addition to the above method of directly obtaining all responses, there is also a streaming mode. The example is as follows:

import requests  
r = requests.get('http://www.baidu.com',stream=True)  
print r.raw.read(10)

Setting the stream=True flag lets you read the response as a byte stream, and r.raw.read(10) reads the specified number of bytes.

3.3 Request headers processing

The processing of headers by Requests is very similar to urllib2. Just add the headers parameter in the get function of Requests. Examples are as follows:

import requests  
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  
headers={'User-Agent':user_agent}  
r = requests.get('http://www.baidu.com',headers=headers)  
print r.content

  

3.4 Response code and response headers processing

To obtain the response code, use the status_code field of the Response object; to obtain the response headers, use its headers field. Examples are as follows:

import requests  
r = requests.get('http://www.baidu.com')  
if r.status_code == requests.codes.ok:  
    print r.status_code  # response code  
    print r.headers  # response headers  
    print r.headers.get('content-type')  # recommended way to get a specific header field  
    print r.headers['content-type']  # not recommended  
else:  
    r.raise_for_status()

In the above program, r.headers contains all the response header information. You can get a particular field with the get function, or by dictionary-style indexing. The latter is not recommended: if the field does not exist, dictionary indexing throws an exception, while the get method returns None.

r.raise_for_status() is used to actively generate an exception. When the response code is 4XX or 5XX, the raise_for_status() function will throw an exception. When the response code is 200, the raise_for_status() function returns None.
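As a small sketch of how raise_for_status() behaves on an error code (the missing page URL below is made up for illustration):

import requests  
r = requests.get('http://www.baidu.com/page_that_does_not_exist')   # hypothetical 404  
try:  
    r.raise_for_status()   # raises requests.exceptions.HTTPError for 4XX/5XX codes  
except requests.exceptions.HTTPError as e:  
    print 'request failed:', e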

3.5 Cookie handling

If the response contains the value of Cookie, you can obtain the value of the Cookie field in the following way. The example is as follows:

import requests  
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  
headers={'User-Agent':user_agent}  
r = requests.get('http://www.baidu.com',headers=headers)  
# iterate over all cookie fields and their values  
for cookie in r.cookies.keys():  
    print cookie+':'+r.cookies.get(cookie)

If you want to customize the cookie value to send, you can use the following method, the example is as follows:

import requests  
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  
headers={'User-Agent':user_agent}  
cookies = dict(name='qiye',age='10')  
r = requests.get('http://www.baidu.com',headers=headers,cookies=cookies)  
print r.text

  

There is also a more advanced way to handle cookies automatically. Sometimes we don't care about the cookie values at all; we just want the program to carry the cookies automatically on every visit, like a browser does. Requests provides the concept of a session, which is particularly convenient for visiting pages consecutively and handling login redirects without worrying about the details. Usage is as follows:

import requests  
loginUrl = 'http://www.xxxxxxx.com/login'  
s = requests.Session()  
# first visit the login page as a guest; the server assigns a cookie  
r = s.get(loginUrl,allow_redirects=True)  
datas={'name':'qiye','passwd':'qiye'}  
# send a POST request to the login URL; once verified, guest permissions become member permissions  
r = s.post(loginUrl, data=datas,allow_redirects=True)  
print r.text

The above program reflects a problem actually encountered in real Python development: if you skip the first step of visiting the login page and send the POST request to the login URL directly, the system will treat you as an illegitimate user, because visiting the login page assigns a cookie, and that cookie must be carried when the POST request is sent. This way of using the Session object to handle cookies will come up very often later.

3.6 Redirection and historical information

To handle redirects, you just need to set the allow_redirects field, for example:

r=requests.get('http://www.baidu.com',allow_redirects=True)

Setting allow_redirects to True allows redirection; setting it to False disables it. If redirection is allowed, you can view the history through the r.history field, that is, all the request jumps made before the final successful access. Examples are as follows:

import requests  
r = requests.get('http://github.com')  
print r.url  
print r.status_code  
print r.history

The print result is as follows:

https://github.com/  
200  
(<Response [301]>,)

The effect shown by the above sample code is that when accessing the GitHub URL, all HTTP requests will be redirected to HTTPS.

3.7 Timeout setting

The timeout option is set through the parameter timeout. The example is as follows:

requests.get('http://github.com', timeout=2)

3.8 Proxy settings

If you need to use a proxy, you can configure an individual request through the proxies parameter of any request method:

import requests  
proxies = {  
    "http": "http://10.10.1.10:3128",  
    "https": "http://10.10.1.10:1080",  
}  
requests.get("http://example.org", proxies=proxies)

Proxies can also be configured through the environment variables HTTP_PROXY and HTTPS_PROXY, but this is not commonly used in crawler development. If your proxy requires HTTP Basic Auth, you can use the http://user:password@host/ syntax:

proxies = {  
    "http": "http://user:[email protected]:3128/",  
}

03 Summary

This article mainly explains the structure and application of web crawlers, as well as several methods of implementing HTTP requests in Python. I hope you will focus on absorbing and digesting the web crawler workflow and the way Requests implements HTTP requests in this article.

