urllib library
The urllib library is one of the most basic network request libraries in Python. It can simulate browser behavior: send a request to a specified server and save the data the server returns.
urlopen function
In Python 3's urllib library, all methods related to network requests are collected in the urllib.request module. Let's first look at the basic use of the urlopen function:
from urllib import request
resp = request.urlopen("http://baidu.com")
print(resp.read())
In fact, if you open Baidu in a browser and right-click to view the page source, you will find it is exactly the same as the data we just printed. In other words, the three lines of code above crawled the entire source of Baidu's homepage. The Python code for a basic URL request really is that simple.
The following explains the urlopen function in detail:
- url: the URL to request.
- data: the request body. If this value is set, the request becomes a POST request.
- Return value: an http.client.HTTPResponse object. This object behaves like a file handle and provides read(size), readline, readlines, and getcode methods.
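For example, here is a sketch of a POST request made via the data parameter; it assumes httpbin.org/post (a public request-echo service) is reachable:

```python
from urllib import request, parse

# data must be URL-encoded and converted to bytes;
# passing it turns the request into a POST.
data = parse.urlencode({'name': 'zhangsan', 'age': 10}).encode('utf-8')
resp = request.urlopen('http://httpbin.org/post', data=data)
print(resp.getcode())   # HTTP status code, e.g. 200 on success
print(resp.read(200))   # first 200 bytes of the response body
```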
urlretrieve function
This function conveniently saves a file from a webpage to the local disk. The following code downloads Baidu's homepage to a local file:
from urllib import request
request.urlretrieve("http://baidu.com", 'index.html')
urlencode function
When sending a request from a browser, if the URL contains Chinese or other special characters, the browser encodes them automatically. If you send the request from code, you must encode them manually; this is what the urlencode function is for. urlencode converts dictionary data into URL-encoded data.
The sample code is as follows:
from urllib import parse
data = {'name':'张三','age': 10}
q = parse.urlencode(data)  # q.encode('utf-8') would turn this Unicode string into bytes, printed with a b prefix
print(q)  # name=%E5%BC%A0%E4%B8%89&age=10
By contrast, parse.quote can only be used to encode a single string.
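A quick comparison: quote works on a plain string rather than a dict, and by default it leaves / unescaped (pass safe='' to encode it as well):

```python
from urllib import parse

print(parse.quote('hello world'))         # hello%20world
print(parse.quote('张三'))                 # %E5%BC%A0%E4%B8%89
# '/' is in the default safe set; safe='' forces it to be encoded too
print(parse.quote('/s?wd=a b', safe=''))  # %2Fs%3Fwd%3Da%20b
```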
parse_qs function
This function decodes URL-encoded parameters back into a dictionary:
from urllib import parse
qs = 'name=%E5%BC%A0%E4%B8%89&age=10'
print(parse.parse_qs(qs)) #{'name': ['张三'], 'age': ['10']}
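Note that parse_qs collects values into lists, because a key may appear more than once in a query string; parse_qsl returns the decoded (key, value) pairs in order instead:

```python
from urllib import parse

qs = 'name=%E5%BC%A0%E4%B8%89&age=10&age=20'
# parse_qs groups repeated keys into lists; parse_qsl keeps flat pairs
print(parse.parse_qs(qs))   # {'name': ['张三'], 'age': ['10', '20']}
print(parse.parse_qsl(qs))  # [('name', '张三'), ('age', '10'), ('age', '20')]
```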
urlparse and urlsplit
Sometimes you get a URL and want to split it into its components; you can use urlparse or urlsplit to do so. Sample code:
from urllib import request,parse
url = 'https://www.baidu.com/s?wd=github'
result = parse.urlsplit(url)
#parse.urlparse(url)
print('scheme',result.scheme) #scheme https
print('netloc',result.netloc) #netloc www.baidu.com
print('path',result.path) #path /s
print('query',result.query) #query wd=github
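The difference between the two is small: urlparse additionally splits out the rarely used params component (the part after a ; in the last path segment), while urlsplit leaves it in the path. For example:

```python
from urllib import parse

url = 'https://www.baidu.com/s;type=web?wd=github#top'
# urlparse separates the ;params component; urlsplit does not
print(parse.urlparse(url).params)    # type=web
print(parse.urlparse(url).path)      # /s
print(parse.urlsplit(url).path)      # /s;type=web
print(parse.urlparse(url).fragment)  # top
```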
The request.Request class
If you want to add request headers to a request, you must use the request.Request class. For example, to add a User-Agent, the sample code is as follows:
from urllib import request
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}
req = request.Request('http://baidu.com', headers=headers)
resp = request.urlopen(req)
print(resp.read())
ProxyHandler processor (proxy settings)
Many websites detect the number of visits from an IP over a period of time (through traffic statistics, system logs, etc.). If the traffic does not look like a normal user's, the site will block that IP.
So we can set up several proxy servers and switch proxies periodically: even if one IP is blocked, we can switch to another and keep crawling.
The proxy server is set in urllib through ProxyHandler. The following code shows how to use a custom opener to use a proxy:
from urllib import request
# set up the proxy by passing in a dict; 'proxyip' is a placeholder for a real proxy address
handler = request.ProxyHandler({'http': 'proxyip'})
opener = request.build_opener(handler)
# this URL returns the requester's origin IP, useful for verifying the proxy
req = request.Request("http://httpbin.org/ip")
resp = opener.open(req)
print(resp.read())
Some commonly used proxy providers are:
- Xici free proxies: http://www.xicidaili.com/
- Kuaidaili: http://ww.kuaidaili.com/
- Daili Cloud: http://www.daliyun.com/
What is a cookie:
On a website, HTTP requests are stateless. In other words, even after you connect to the server and log in successfully, the server cannot tell which user is making the second request. Cookies exist to solve this problem: after the first login, the server returns some data (the cookie) to the browser, which saves it locally. When the user sends the next request, the cookie data stored from the previous response is automatically carried along, so the server can use it to determine which user is making the current request. The amount of data a cookie can store is limited; different browsers have different limits, but generally no more than 4KB. Therefore cookies can only store small amounts of data.
Cookie format
Set-Cookie: NAME=VALUE; Expires/Max-age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE
Parameter meaning:
- NAME: The name of the cookie.
- VALUE: The value of the cookie.
- Expires: the expiration time of the cookie.
- Path: The path of the cookie.
- Domain: The domain name of the cookie.
- SECURE: whether the cookie is only sent over the https protocol.
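As an aside, the standard library's http.cookies.SimpleCookie can parse strings in this Set-Cookie format, which is a handy way to inspect the individual fields:

```python
from http.cookies import SimpleCookie

c = SimpleCookie()
# parse a Set-Cookie style string into name/value plus attributes
c.load('session=abc123; Path=/; Domain=example.com')
morsel = c['session']
print(morsel.value)      # abc123
print(morsel['path'])    # /
print(morsel['domain'])  # example.com
```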
Using the http.cookiejar library and HTTPCookieProcessor to simulate login:
A cookie is a small piece of text stored by the user's browser so that the website can identify the user and track the session. Cookies can keep login information until the user's next session with the server.
Let's take Renren.com as an example. On Renren.com, you must log in before you can visit a person's homepage. Logging in, plainly speaking, means having the cookie information: if we want to access the page from code, we must carry the correct cookie. There are two solutions. The first is to log in with a browser, then copy the cookie information and put it into the headers:
from urllib import request
login_url = 'http://www...'
headers = {
'User-Agent':'.......'
}
req = request.Request(url = login_url, headers = headers)
resp = request.urlopen(req)
with open('index.html', 'w', encoding='utf-8') as fp:
fp.write(resp.read().decode('utf-8'))
But copying cookies from the browser every time you visit a page that requires them is troublesome. In Python, cookies are generally handled through the http.cookiejar module together with the HTTPCookieProcessor handler class of the urllib module. The http.cookiejar module mainly provides objects for storing cookies, while HTTPCookieProcessor processes those cookie objects and constructs a handler object from them.
The http.cookiejar module
The main classes of this module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar. Their functions are as follows:
- CookieJar: an object that manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The entire cookie is stored in memory, and the cookies are lost once the CookieJar instance is garbage-collected.
- FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar; used to create FileCookieJar instances, retrieve cookie information, and store cookies in a file. filename is the name of the file where cookies are stored. When delayload is True, file access is deferred: the file is read, or data is written to it, only when needed.
- MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
- LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the libwww-perl standard's Set-Cookie3 file format.
Use http.cookiejar and request.HTTPCookieProcessor to log in to Renren. The relevant sample code is as follows:
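A minimal sketch of this second approach: a CookieJar collects the session cookie returned by the login POST, and every later request through the same opener carries it automatically. Note that the login URL and form field names below are placeholders, not Renren's actual ones; inspect the real login form to fill them in:

```python
from urllib import request, parse
from http.cookiejar import CookieJar

# the opener routes every request through HTTPCookieProcessor,
# which reads and writes cookies in the shared CookieJar
cookiejar = CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookiejar))

# placeholder credentials and field names for illustration only
login_data = parse.urlencode({'email': 'user@example.com',
                              'password': 'secret'}).encode('utf-8')
# the POST response sets the session cookie, which lands in cookiejar ...
opener.open('http://www.renren.com/login', data=login_data)
# ... so this second request is sent with the cookie attached
resp = opener.open('http://www.renren.com/profile')
print(resp.read().decode('utf-8'))
```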
Save cookies locally
You can use cookiejar's save method; you need to specify a file name.
from urllib import request
from http.cookiejar import MozillaCookieJar
cookiejar = MozillaCookieJar('cookie.txt')
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
# this httpbin endpoint sets a cookie, so there is something to save
resp = opener.open('http://httpbin.org/cookies/set?name=value')
# ignore_discard=True also saves session cookies that would otherwise be discarded
cookiejar.save(ignore_discard=True)
for cookie in cookiejar:
    print(cookie)