[Crawler] urllib lets our Python pretend to be a browser

Python's built-in urllib library contains four modules:

  1. request, the module we use most; it is used to send requests, so we will focus on it.
  2. error, the exception-handling module; when a request made with the request module fails, this module provides the exception classes for handling the failure.
  3. parse, used to parse and build URL addresses, for example extracting the domain name or the path a URL points to.
  4. robotparser, used less often; it parses a website's robots.txt file.
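As a quick taste of the parse module, here is a minimal sketch (the URL is just an example) that splits a URL into its components:

```python
from urllib.parse import urlparse

# Split an example URL into scheme, domain, path and query string
parts = urlparse('http://www.baidu.com/s?wd=python')
print(parts.scheme)   # 'http'
print(parts.netloc)   # 'www.baidu.com'
print(parts.path)     # '/s'
print(parts.query)    # 'wd=python'
```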

Now that we understand urllib, let's use Python code to simulate a request!

urllib.request

First we import the request module of urllib:

import urllib.request

# Send a GET request and print the decoded response body
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

We use the urlopen method of the request module to send a GET request to Baidu, and the returned content is the same HTML the browser receives.
urlopen takes three main parameters:
urllib.request.urlopen(url, data=None, [timeout, ]*)
The first, url, is the link we want to request, for example Baidu above.

The second parameter, data, carries the parameters of a POST request. For example, when logging in, we can pack the username and password into data and send them along. The value of data must be of type bytes.

The third parameter, timeout, sets the request timeout: if the server has not returned data after that many seconds, we give up and raise an error. That is the main usage of request's urlopen.
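Putting timeout together with the error module mentioned earlier, a minimal sketch looks like this (the hostname is made up; .invalid is a reserved top-level domain, so the request always fails and we land in the except branch):

```python
import urllib.request
import urllib.error

try:
    # .invalid never resolves, so this request is guaranteed to fail
    response = urllib.request.urlopen('http://example.invalid/', timeout=3)
except urllib.error.URLError as e:
    # DNS failures and timeouts both surface as URLError
    error_reason = e.reason
    print('request failed:', error_reason)
```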

What if we want to trick the server into thinking the request comes from a browser or a mobile phone? Then we need to add request header information, the request headers we covered last time. This is where the Request class in the request module comes in. Request takes more parameters:
urllib.request.Request(url, data=None, headers={}, method=None)
Besides url and data, we can also define request header information. urlopen defaults to a GET request and switches to POST when data is passed in, while Request lets us set the request method explicitly, so we can use Request to encapsulate all of our request information.
You can impersonate a browser by setting the headers parameter:

headers = {
    # pretend to be a browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
}

Remember that the POST parameters must be URL-encoded and converted to bytes:

from urllib import request, parse

# Use a name other than the built-in `dict` for the form fields
form_data = {
    'return_url': 'https://zhihu.com/',
    'user_name': '[email protected]',
    'password': '123456789',
    '_post_type': 'ajax',
}
data = bytes(parse.urlencode(form_data), encoding='utf-8')

Then we can encapsulate the request:

req = request.Request(url, data=data, headers=headers, method='POST')

Finally send the request:

response = request.urlopen(req)
print(response.read().decode('utf-8'))
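Tying the pieces together, here is a minimal sketch that builds the POST request but, to keep the example self-contained, does not actually send it; the URL and form fields are placeholders:

```python
from urllib import request, parse

url = 'https://zhihu.com/login'  # placeholder endpoint
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
form = {'user_name': '[email protected]', 'password': '123456789'}
data = bytes(parse.urlencode(form), encoding='utf-8')

# Encapsulate the URL, body, headers and method in one Request object
req = request.Request(url, data=data, headers=headers, method='POST')
print(req.get_method())   # POST
print(req.full_url)       # https://zhihu.com/login
# response = request.urlopen(req)  # uncomment to actually send it
```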

Origin blog.csdn.net/weixin_42468475/article/details/132309946