[Xiao Mu learns Python] Web crawler urllib

1. Introduction

When writing a Python crawler, you need to simulate network requests. The two main options are the third-party requests library and the urllib library built into Python. requests is generally recommended; it is a higher-level wrapper (built on top of urllib3).

urllib, urllib2, and urllib3 can all access resource files on the Internet through the network.

  1. urllib: the network request library built into both Python 2 and Python 3. Python 3's urllib is essentially the merger of Python 2's urllib and urllib2.
  2. urllib2: exists only in Python 2. Its functionality is broadly similar to urllib's; it was mainly an enhancement of urllib.
  3. urllib3: usable from both Python 2 and Python 3, but it is not part of the standard library and must be installed with pip. urllib3 is a powerful, user-friendly Python HTTP client that provides thread-safe connection pooling, file uploads (multipart encoding), retries, and other key features missing from the standard library; much of the Python ecosystem already builds on it.

Supplement: in Python 2, urllib and urllib2 were generally used together, because each had features the other lacked.

2. Function introduction

2.1 urllib library and requests library

  • (1) urllib library
    With the urllib library, you first build a Request object (or pass a URL directly) and hand it to urllib.request.urlopen to perform the HTTP request.
    What is returned is an HTTPResponse object containing the page content. Use .read().decode() to read the body bytes and decode them into a str; after decoding, Chinese characters display correctly.
from urllib import request

headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'
}

url = "http://www.baidu.com"
req = request.Request(url, headers=headers)
response = request.urlopen(req)
print(response)

data = response.read().decode()
print(data)
  • (2) requests library
    The requests library calls requests.get, passing in the URL, query parameters, and headers. The return value is a Response object; printing it displays the response status code (e.g. <Response [200]>).
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; BLA-AL00 Build/HUAWEIBLA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36"
}

url = "http://www.baidu.com"
wd = {"wd": "python"}  # example query parameters
response = requests.get(url, params=wd, headers=headers)
text = response.text
content = response.content

print(text)
print(content)
  • (3) Comparison
    For crawling, the requests library is generally recommended because it is more convenient than urllib: requests builds and sends a GET or POST request in a single call (requests.get / requests.post), whereas with urllib.request you typically construct a Request object first and then pass it to urlopen.
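The one-call-versus-two-step difference can be sketched with the same POST done both ways. This is only an illustration against the public httpbin.org echo service (network access assumed; errors are caught so the comparison still reads offline):

```python
import json
import urllib.error
import urllib.parse
import urllib.request

import requests

url = "http://httpbin.org/post"
payload = {"hello": "world"}

# requests: build and send the request in a single call
try:
    r = requests.post(url, data=payload, timeout=5)
    print(r.json()["form"])
except requests.RequestException as e:
    print("requests failed:", e)

# urllib.request: encode the body, build a Request object, then open it
data = urllib.parse.urlencode(payload).encode("utf-8")
try:
    req = urllib.request.Request(url, data=data)
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.loads(resp.read().decode())["form"])
except urllib.error.URLError as e:
    print("urllib failed:", e)
```

Both branches send the identical form body; the difference is purely in how much boilerplate each library requires.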

2.2 Modules of urllib library

  • The urllib package contains the following modules:
urllib.request - opens and reads URLs.
urllib.error - contains the exceptions raised by urllib.request.
urllib.parse - parses URLs.
urllib.robotparser - parses robots.txt files.

https://docs.python.org/3/library/urllib.html

2.2.1 urllib.request

urllib.request defines functions and classes for opening URLs, handling things such as authentication, redirection, and browser cookies.
urllib.request can simulate a browser initiating a request.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

  • Use urlopen to open a URL:
from urllib.request import urlopen

myURL = urlopen("https://api.money.126.net/data/feed/1000002,1000001,1000881,0601398,money.api")
print(myURL.read())
#print(myURL.read(300)) # read a fixed number of bytes
#print(myURL.readline()) # read a single line
#lines = myURL.readlines() # read the whole body into a list of lines
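The timeout parameter in the signature above bounds the whole request in seconds. A sketch using an httpbin test endpoint (assumed reachable) that deliberately delays its answer, so the timeout fires first:

```python
import socket
import urllib.error
from urllib.request import urlopen

# http://httpbin.org/delay/10 waits 10 seconds before responding,
# so a 2-second timeout should expire before an answer arrives.
try:
    with urlopen("http://httpbin.org/delay/10", timeout=2) as resp:
        print(resp.status)
except (socket.timeout, urllib.error.URLError) as e:
    print("request timed out or failed:", e)
```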

2.2.2 urllib.error

The urllib.error module defines the exception classes raised by urllib.request. The base exception class is URLError.
urllib.error contains two exception classes: URLError and HTTPError (a subclass of URLError).

import urllib.request
import urllib.error

myURL1 = urllib.request.urlopen("https://www.baidu.com/")
print(myURL1.getcode())   # 200

try:
    myURL2 = urllib.request.urlopen("https://www.baidu.com/no.html")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(404)   # 404
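The example above only catches HTTPError. Since HTTPError is a subclass of URLError, a fuller handler catches both, most specific first. A sketch (the .invalid hostname is deliberately unresolvable, since .invalid is a reserved TLD):

```python
import urllib.error
import urllib.request

def fetch_status(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # the server answered, but with an error status
        return e.code          # e.g. 404
    except urllib.error.URLError as e:
        # the request could not be completed at all
        return e.reason        # e.g. a DNS or connection failure

# name resolution is expected to fail here, giving a URLError reason
print(fetch_status("http://nonexistent.invalid/"))
```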

2.2.3 urllib.parse

urllib.parse is used to parse URLs. The signature is as follows:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

from urllib.parse import urlparse

o = urlparse("https://api.money.126.net/data/feed/1000002,1000001,1000881,0601398,money.api")
print(o)
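The ParseResult printed above exposes the URL's components as named attributes, and urllib.parse can also build and parse query strings. An offline sketch with a made-up example URL:

```python
from urllib.parse import urlparse, urlencode, parse_qs

o = urlparse("https://www.example.com/search?q=urllib&page=2#top")
print(o.scheme)    # https
print(o.netloc)    # www.example.com
print(o.path)      # /search
print(o.query)     # q=urllib&page=2
print(o.fragment)  # top

# build a query string from a dict, then parse it back
qs = urlencode({"q": "urllib", "page": 2})
print(qs)            # q=urllib&page=2
print(parse_qs(qs))  # {'q': ['urllib'], 'page': ['2']}
```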

2.2.4 urllib.robotparser

urllib.robotparser is used to parse robots.txt files.
robots.txt (always lowercase) is a file stored in the website's root directory that implements the robots exclusion protocol; it tells search engines and other crawlers which parts of the site may be fetched.

import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")
rp.read()
rrate = rp.request_rate("*")

print(rrate.requests)
print(rrate.seconds)
print(rp.crawl_delay("*"))
print(rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco"))
print(rp.can_fetch("*", "http://www.musi-cal.com/"))

2.3 Getting Started Example

  • urllib initiates a GET request
from urllib import request

res = request.urlopen("http://httpbin.org/get")
print(res.read().decode())  # read() returns bytes, which must be decoded
  • urllib initiates a POST request
from urllib import request

res = request.urlopen("http://httpbin.org/post", data=b'hello=world')
print(res.read().decode())
  • urllib adds headers to requests
from urllib import request

url = "http://httpbin.org/get"
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}

req = request.Request(url=url, headers=headers)  # pass a Request object
res = request.urlopen(req)
print(res.read().decode())

3. Code examples

3.1 urllib obtains a web page (1)

Fetch the web page at the given URL and return its HTML.

# -*- coding: UTF-8 -*-
import urllib.request

def get_html(url):
    response = urllib.request.urlopen(url)
    buff = response.read()
    html = buff.decode("utf8")
    return html

if __name__ == '__main__':
    url = "http://www.baidu.com"
    html = get_html(url)
    print(html)

3.2 urllib obtains a web page (2), with headers


# -*- coding: UTF-8 -*-
import urllib.request

def get_html(url, headers):
    req = urllib.request.Request(url)
    for key in headers:
        req.add_header(key, headers[key])
    response = urllib.request.urlopen(req)
    buff = response.read()
    html = buff.decode("utf8")
    return html
    
if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
    }
    url = "http://www.baidu.com"
    html = get_html(url, headers)
    print(html)

3.3 urllib post request

  • An example of a Post request:
# -*- coding: UTF-8 -*-
import urllib.request
import urllib.parse

def get_response(url, data):
    data = bytes(urllib.parse.urlencode(data), encoding='utf8')
    response = urllib.request.urlopen(
        url, data
    )
    buff = response.read()
    result = buff.decode("utf8")
    return result

if __name__ == '__main__':
    data = {
        "key1": "value1",
        "key2": "value2"
    }
    url = "http://httpbin.org/post"
    html = get_response(url, data)
    print(html)
  • Another example of a Post request:
import urllib.request as rq
import urllib.parse as ps

url='https://www.python.org/search/'
dictionary = {
    'q': 'urllib'
}
 
data = ps.urlencode(dictionary)
data = data.encode('utf-8')
 
req = rq.Request(url,data)
res = rq.urlopen(req)
 
print(res.read())

4. urllib3 related examples

  • Install the urllib3 library:
# https://pypi.org/project/urllib3/
python -m pip install urllib3
  • urllib3 initiates a GET request
import urllib3

http = urllib3.PoolManager()  # connection pool for issuing requests
res = http.request('GET', 'http://httpbin.org/get')
print(res.data.decode())
  • urllib3 initiates a POST request
import urllib3

http = urllib3.PoolManager()  # connection pool for issuing requests
res = http.request('POST', 'http://httpbin.org/post', fields={'hello': 'world'})
print(res.data.decode())
  • urllib3 sets headers
headers = {'X-Something': 'value'}
res = http.request('POST', 'http://httpbin.org/post', headers=headers, fields={'hello': 'world'})
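Beyond headers and fields, urllib3 exposes the features mentioned in the introduction that urllib itself lacks, such as per-request timeouts and retry policies. A sketch against the same httpbin test URL (network access assumed):

```python
import urllib3

http = urllib3.PoolManager()  # thread-safe connection pool

# Timeout bounds the connect and read phases separately; Retry
# re-issues the request on transient failures with a backoff delay.
res = http.request(
    "GET",
    "http://httpbin.org/get",
    timeout=urllib3.Timeout(connect=2.0, read=5.0),
    retries=urllib3.Retry(total=3, backoff_factor=0.5),
)
print(res.status)
```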

5. Chrome debugging

Chrome is a free web browser developed by Google, which is very convenient for front-end development (especially debugging code).
In the Chrome browser, you can open the developer interface through the following shortcut keys:

  • (1) Press the shortcut key: F12
  • (2) Press the shortcut key: Ctrl+Shift+i
  • (3) Right-click the page and select "Inspect" to open the developer tools.

With the developer tools open, the most commonly used panels are:

Elements tab: view and edit the HTML and CSS elements of the current page.
Console tab: shows debug output from scripts; you can also run test snippets here.
Sources tab: view and debug the source files of the scripts loaded by the page.
Network tab: inspect the details of HTTP requests, such as request headers, response headers, and response bodies.


Conclusion

If you find this method or code even a little useful, give the author a like or buy them a coffee; ╮( ̄▽ ̄)╭
If the method or code isn't up to scratch //(ㄒoㄒ)//, leave a comment and the author will keep improving; o_O???
If you need custom development of related features, you can message the author privately; (✿◡‿◡)
Thanks to everyone for your support! ( ´ ▽´ )ノ ( ´ ▽´)!!!

Origin blog.csdn.net/hhy321/article/details/129807806