Python urllib usage

urllib is Python's built-in HTTP request library. It includes the following modules:
urllib.request - the request module
urllib.error - the exception handling module
urllib.parse - the URL parsing module
urllib.robotparser - the robots.txt parsing module

urlopen

An introduction to the parameters of urllib.request.urlopen:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Use of the url parameter

Write a simple example first:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

Three parameters are commonly used with urlopen:
urllib.request.urlopen(url, data, timeout)
response.read() gets the content of the web page; without calling read(), urlopen simply returns an http.client.HTTPResponse object rather than the page content.
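
To see the difference, here is a quick sketch (assuming the site is reachable) that prints the response object itself and then the decoded body:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response)                               # an http.client.HTTPResponse object
print(response.read().decode('utf-8')[:200])  # the first part of the page content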

Use of the data parameter

The example above fetches Baidu with a GET request. Here a POST request with urllib is demonstrated against http://httpbin.org/post (this site can be used to practice with urllib; it simulates various kinds of requests).

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
print(data)
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

urllib.parse is used here: the POST data is converted with bytes(urllib.parse.urlencode(...)) and passed as the data parameter of urllib.request.urlopen. This completes a POST request.
So if we pass the data parameter, it is a POST request; if there is no data parameter, it is a GET request.
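
To double-check that the form data really went out in the POST body, the JSON that httpbin.org echoes back can be parsed; a minimal sketch, assuming the service is reachable:

import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
# httpbin.org/post returns a JSON document that echoes the submitted form data
body = json.loads(response.read().decode('utf-8'))
print(body['form'])   # expected: {'word': 'hello'}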

Use of the timeout parameter

Under some network conditions, or when the server side behaves abnormally, a request may be slow or fail. In such cases we need to set a timeout for the request rather than letting the program wait for the result indefinitely. An example:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

After running, we can see that the result is returned normally. If we then set the timeout to 0.1, the running program raises a timeout error (a urllib.error.URLError whose reason is a socket.timeout).

So we need to catch the exception, and change the code to:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

response

Response type, status code, response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

You can see that the result is: <class 'http.client.HTTPResponse'>
We can get the status code and the header information through response.status, response.getheaders() and response.getheader("Server"), and read() gets the content of the response body, as in the sketch below.
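
A minimal sketch putting those calls together (assuming python.org is reachable):

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)               # status code, e.g. 200
print(response.getheaders())         # all response headers as a list of (name, value) tuples
print(response.getheader('Server'))  # a single header looked up by name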

Of course, the urlopen above can only handle simple requests, because it cannot add header information. When we write a crawler later, we will find that in many cases we need to carry header information to access the target site, and that is when urllib.request.Request is used.

request

Setting Headers
To keep crawler programs from overwhelming them, many websites require requests to carry certain header information; the most common one is the User-Agent header.

Write a simple example:

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Add header information to the Request to customize the headers sent when requesting the website:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'zhaofan'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

The second way to add request headers

from urllib import request, parse

url = 'http://httpbin.org/post'
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

The advantage of this approach is that you can define a request-header dictionary yourself and then add the entries in a loop, as in the sketch below.
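
A minimal sketch of that loop-based approach (the header values are just examples):

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
# add each header from the dictionary to the request
for key, value in headers.items():
    req.add_header(key, value)
response = request.urlopen(req)
print(response.read().decode('utf-8'))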

Advanced usage of various handlers

Proxy, ProxyHandler

You can set a proxy through urllib.request.ProxyHandler(). Websites detect how many visits come from a given IP within a period of time and, if there are too many, will block your access, so at that point you need to set a proxy in order to keep crawling data.

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())

Cookie, HTTPCookieProcessor

Login information is commonly stored in cookies, and sometimes a website must be crawled with cookie information attached. http.cookiejar is used here to obtain and store cookies.

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

Cookies can also be written to a file and saved. There are two ways: http.cookiejar.MozillaCookieJar and http.cookiejar.LWPCookieJar. You can use either one.

Specific code examples are as follows:
http.cookiejar.MozillaCookieJar() method

import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

http.cookiejar.LWPCookieJar() method

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Similarly, if you want to read the cookies back from the file, you can use the load method. Whichever class you used to save the cookies, use the same class to load them.

import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

exception handling

Many times when we access pages through a program, some pages may return errors, such as 404 or 500, and at that point we need to catch the exception. Let's write a simple example first.

from urllib import request,error

try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.URLError as e:
    print(e.reason)

The code above accesses a page that does not exist; by catching the exception, we can print the error message.

What we need to know here is that urllib defines two exception classes: URLError and HTTPError, and HTTPError is a subclass of URLError.

URLError has only one attribute, reason, so when catching it only the error message can be printed, as in the example above.

HTTPError has three attributes: code, reason and headers, so when catching it you can get the code, the reason and the headers. An example:

from urllib import request,error
try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)

else:
    print("reqeust successfully")

At the same time, e.reason can be inspected more closely, for example to check whether it is a socket.timeout. An example:

import socket

from urllib import error,request

try:
    response = request.urlopen("http://www.pythonsite.com/",timeout=0.001)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason,socket.timeout):
        print("time out")

URL parsing

urlparse
The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

A first example:

from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result)

The result is:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

Here urlparse splits the URL you pass in into its components.
We can also specify a default protocol type:
result = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
When split this way, the scheme part becomes the one you specified. Of course, if the URL already contains a protocol, the scheme you pass through the scheme argument will not take effect, as the sketch below shows.
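
A short sketch of both cases:

from urllib.parse import urlparse

# no scheme in the URL itself, so the scheme argument is used
print(urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https'))
# -> ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

# the URL already carries a scheme, so the scheme argument is ignored
print(urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https'))
# -> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')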

urlunparse

Its function is the opposite of urlparse: it assembles a URL from its components. An example:

from urllib.parse import urlunparse

data = ['http','www.baidu.com','index.html','user','a=123','commit']
print(urlunparse(data))

The result is as follows:
http://www.baidu.com/index.html;user?a=123#commit

urljoin

This function splices a base URL together with another URL. Examples:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://pythonsite.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://pythonsite.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

The result is:
http://www.baidu.com/FAQ.html
https://pythonsite.com/FAQ.html
https://pythonsite.com/FAQ.html
https://pythonsite.com/FAQ.html?question=2
https://pythonsite.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

From the splicing results we can see that the fields of the second URL take priority over those of the base URL.

urlencode

The urlencode method can convert a dictionary into URL query parameters, as in the following example:

from urllib.parse import urlencode

params = {
    "name":"zhaofan",
    "age":23,
}
base_url = "http://www.baidu.com?"

url = base_url+urlencode(params)
print(url)

The result is:
http://www.baidu.com?name=zhaofan&age=23

 
