First acquaintance with the Python crawler urllib library

urllib library

The urllib library is one of the most basic network request libraries in Python. It can simulate the behavior of a browser, send a request to a specified server, and save the data the server returns.

urlopen function

In Python 3's urllib library, all methods related to network requests are collected under the urllib.request module. Let's first look at the basic use of the urlopen function:

from urllib import request
resp = request.urlopen("http://baidu.com")
print(resp.read())

In fact, if you open Baidu in a browser and right-click to view the page source, you will find that it is exactly the same as the data we just printed. In other words, the three lines of code above have crawled the entire source code of Baidu's homepage. The Python code for a basic URL request really is that simple.
The following explains the urlopen function in detail:

  1. url: the URL to request.
  2. data: the request data. If this value is set, the request becomes a POST request.
  3. Return value: an http.client.HTTPResponse object. This object works like a file handle and provides read(size), readline, readlines, and getcode methods (see the sketch below).
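
A minimal sketch exercising these parameters, using the public echo service http://httpbin.org as the test server:

from urllib import request, parse

# Passing data (as bytes) turns the request into a POST
data = parse.urlencode({'name': 'zhangsan'}).encode('utf-8')
resp = request.urlopen('http://httpbin.org/post', data=data)
print(resp.getcode())    # HTTP status code, e.g. 200
print(resp.read(300))    # read at most 300 bytes of the response body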

urlretrieve function

This function conveniently saves a file from a webpage to the local disk. The following code downloads Baidu's homepage to a local file:

from urllib import request
request.urlretrieve("http://baidu.com", 'index.html')

urlencode function

When a browser sends a request, if the URL contains Chinese or other special characters, the browser encodes them automatically. If you send the request from code, you must encode them manually. This is what the urlencode function is for: urlencode converts dictionary data into URL-encoded data.
The sample code is as follows:

from urllib import parse
data = {'name':'张三','age': 10}
q = parse.urlencode(data)    # q.encode('utf-8') would turn this str into bytes, printed with a b prefix
print(q)    # name=%E5%BC%A0%E4%B8%89&age=10

parse.quote, by contrast, can only encode a single string.
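
For example:

from urllib import parse

print(parse.quote('张三'))           # %E5%BC%A0%E4%B8%89
print(parse.quote('hello world'))   # hello%20world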

parse_qs function

This function decodes URL-encoded query parameters back into a dictionary:

from urllib import parse
qs = 'name=%E5%BC%A0%E4%B8%89&age=10'
print(parse.parse_qs(qs)) #{'name': ['张三'], 'age': ['10']}
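
A related helper, parse.parse_qsl, returns a list of (key, value) tuples instead of a dictionary:

from urllib import parse

qs = 'name=%E5%BC%A0%E4%B8%89&age=10'
print(parse.parse_qsl(qs))  # [('name', '张三'), ('age', '10')]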

urlparse and urlsplit

Sometimes you get a URL and want to split it into its components; you can use urlparse or urlsplit to do so. Sample code:

from urllib import parse

url = 'https://www.baidu.com/s?wd=github'
result = parse.urlsplit(url)
# parse.urlparse(url) can be used the same way
print('scheme', result.scheme)   # scheme https
print('netloc', result.netloc)   # netloc www.baidu.com
print('path', result.path)       # path /s
print('query', result.query)     # query wd=github
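
The practical difference between the two: urlparse additionally extracts the rarely used params component (the part after a ; in the last path segment), while urlsplit leaves it in the path. A small illustration:

from urllib import parse

url = 'https://www.baidu.com/s;type=video?wd=github'
print(parse.urlparse(url).params)  # type=video
print(parse.urlsplit(url).path)    # /s;type=video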

The request.Request class

If you want to add request headers to a request, you need to use the request.Request class. For example, to add a User-Agent, the sample code is as follows:

from urllib import request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}
req = request.Request('http://baidu.com', headers=headers)
resp = request.urlopen(req)
print(resp.read())

ProxyHandler processor (proxy settings)

Many websites detect the number of visits from an IP within a certain period (through traffic statistics, system logs, etc.). If the visit pattern does not look like a normal person's, the site will block that IP.
So we can set up some proxy servers and switch to a different proxy periodically; even if one IP is banned, we can change to another IP and continue crawling.
In urllib, a proxy is set through ProxyHandler. The following code shows how to use a custom opener with a proxy:

from urllib import request
# Set the proxy by passing in a dict that maps scheme to proxy address
handler = request.ProxyHandler({'http': 'proxy_ip:port'})
opener = request.build_opener(handler)
# This URL echoes back the origin IP of the request, handy for verifying the proxy
req = request.Request("http://httpbin.org/ip")
resp = opener.open(req)
print(resp.read())

Commonly used proxy providers are:

  • Xici free proxies: http://www.xicidaili.com/
  • Kuaidaili: http://www.kuaidaili.com/
  • Dailiyun: http://www.daliyun.com/

What is a cookie:

On a website, HTTP requests are stateless. That is, even after connecting to the server and logging in successfully, a second request to the server still cannot tell which user is making it. Cookies appeared to solve this problem: after the first login, the server returns some data (the cookie) to the browser, and the browser saves it locally. When the user sends a second request, the cookie data stored from the previous request is automatically carried to the server, and the server
can use the data carried by the browser to determine which user is currently making the request. The amount of data a cookie can store is limited; different browsers have different limits, but generally no more than 4KB, so cookies can only hold small amounts of data.

Cookie format

Set-Cookie: NAME=VALUE; Expires/Max-age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE

Parameter meaning:

  • NAME: The name of the cookie.
  • VALUE: The value of the cookie.
  • Expires: the expiration time of the cookie.
  • Path: The path of the cookie.
  • Domain: The domain name of the cookie.
  • SECURE: whether the cookie is sent only over the HTTPS protocol.
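
For example, a concrete response header in this format (values are illustrative):

Set-Cookie: sessionid=abc123; Expires=Wed, 21 Oct 2026 07:28:00 GMT; Path=/; Domain=example.com; Secure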

Use the http.cookiejar library and HTTPCookieProcessor to simulate login:

A cookie is a text file stored in the user's browser to identify the user and track sessions. It can keep the login information until the user's next session with the server.
Take Renren.com as an example: on Renren, you must log in before you can visit a person's homepage. Logging in, plainly speaking, means having the cookie information. So if we want to access the page from code, we must carry the correct cookie. There are two solutions. The first is to access the page with a browser, then copy the cookie information and put it in the headers:

from urllib import request
login_url = 'http://www...'
headers = {
    'User-Agent': '.......',
    'Cookie': '.......'    # cookie string copied from the browser's request headers
}
req = request.Request(url=login_url, headers=headers)
resp = request.urlopen(req)
with open('index.html', 'w', encoding='utf-8') as fp:
    fp.write(resp.read().decode('utf-8'))

But copying the cookie from the browser every time you visit a page that requires cookies is troublesome. In Python, cookies are generally handled with the http.cookiejar module together with the HTTPCookieProcessor handler class from the urllib.request module. The main job of http.cookiejar is to provide objects for storing cookies; HTTPCookieProcessor processes these cookie objects and builds a handler object from them.

The http.cookiejar module

The main classes of this module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar. Their functions are as follows:

  1. CookieJar: an object that manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The entire cookie is stored in memory, and the cookies are lost once the CookieJar instance is garbage-collected.
  2. FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar, used to create FileCookieJar instances, retrieve cookie information, and store cookies in a file. filename is the name of the file where cookies are stored. When delayload is True, file access is deferred, i.e. the file is only read, or data stored in it, when needed.
  3. MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
  4. LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, creates a FileCookieJar instance compatible with the libwww-perl standard Set-Cookie3 file format.

Use http.cookiejar and request.HTTPCookieProcessor to log in to Renren. The relevant sample code is as follows:

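The original sample code here was published as an image and did not survive; the following is a minimal sketch of what such a cookie-based login looks like. The login URL and form-field names (email, password) are assumptions and must be taken from the site's actual login form:

from urllib import request, parse
from http.cookiejar import CookieJar

# The CookieJar keeps cookies in memory; the handler attaches them to every request
cookiejar = CookieJar()
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)

headers = {
    'User-Agent': 'Mozilla/5.0 ...'
}

# Hypothetical login endpoint and form fields
login_url = 'http://www.renren.com/PLogin.do'
data = parse.urlencode({'email': 'xxx', 'password': 'xxx'}).encode('utf-8')
req = request.Request(login_url, data=data, headers=headers)
opener.open(req)

# The opener now carries the session cookie automatically
resp = opener.open('http://www.renren.com/some-profile-page')
print(resp.read().decode('utf-8'))
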
Save cookies locally

You can use the save method of the cookiejar; the file name is specified when the jar is created:

from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie.txt')    # file the cookies will be written to
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)

# This httpbin endpoint sets a cookie, so the jar has something to store
resp = opener.open('http://httpbin.org/cookies/set?name=value')
cookiejar.save(ignore_discard=True)    # ignore_discard=True also saves session cookies
for cookie in cookiejar:
    print(cookie)
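
To reuse the cookies in a later run, load them back from the same file; a minimal sketch:

from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie.txt')
cookiejar.load(ignore_discard=True)    # also load the session cookies saved above
for cookie in cookiejar:
    print(cookie)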