Python Crawler: Using the urllib Library

urllib library

The urllib library is Python's basic network request library. It can simulate the behavior of a browser, send a request to a specified server, and save the data the server returns.

urlopen function:

In Python 3, all of urllib's network-request methods are collected under the urllib.request module. Let's first look at the basic usage of the urlopen function:

from urllib import request
resp = request.urlopen('http://www.baidu.com')
print(resp.read())
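
Note that read() returns bytes. To get a string, decode it with the page's encoding (Baidu's home page is UTF-8), e.g. replace the last line with:

print(resp.read().decode('utf-8'))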

In fact, if you visit Baidu in a browser and right-click to view the page source, you will find it is exactly the same as the data we just printed. In other words, the three lines of code above have already crawled down all the HTML of Baidu's home page. A basic URL request in Python really is that simple.
The urlopen function in detail:

  1. url: the URL to request.
  2. data: the request body; if this value is set, the request becomes a POST request.
  3. Return value: an http.client.HTTPResponse object, which is a file-like object with methods such as read(size), readline, readlines, and getcode.
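
A quick sketch exercising these methods (any reachable URL works here):

from urllib import request

resp = request.urlopen('http://www.baidu.com')
print(resp.getcode())    # HTTP status code, e.g. 200
print(resp.readline())   # the first line of the response body
print(resp.read(10))     # the next 10 bytes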

urlretrieve function:

This function makes it easy to save a web page to a local file. The following code downloads Baidu's home page to baidu.html:

from urllib import request
request.urlretrieve('http://www.baidu.com/','baidu.html')
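
urlretrieve works for any file type, not just HTML. For example, downloading an image (the URL below is a hypothetical placeholder):

from urllib import request
request.urlretrieve('http://www.example.com/logo.png', 'logo.png')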

urlencode function:

When a browser sends a request, if the URL contains Chinese or other special characters, the browser encodes them for us automatically. When sending the request from code, however, the encoding must be done manually, and that is what the urlencode function is for. urlencode converts dictionary data into URL-encoded data. Sample code is as follows:

from urllib import parse
data = {'name': '爬虫基础', 'greet': 'hello world', 'age': 100}
qs = parse.urlencode(data)
print(qs)
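
In practice, urlencode is usually combined with urlopen to send encoded parameters. A minimal sketch that builds a Baidu search URL from a Chinese keyword (wd is Baidu's search-keyword parameter):

from urllib import request, parse

params = {'wd': '爬虫基础'}
url = 'http://www.baidu.com/s?' + parse.urlencode(params)
resp = request.urlopen(url)
print(resp.read())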

parse_qs function:

It decodes URL-encoded parameters back into a dictionary. Sample code is as follows:

from urllib import parse
qs = "name=%E7%88%AC%E8%99%AB%E5%9F%BA%E7%A1%80&greet=hello+world&age=100"
print(parse.parse_qs(qs))
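
A related helper, parse.parse_qsl, returns a list of (key, value) tuples instead of a dictionary of lists:

print(parse.parse_qsl(qs))
# [('name', '爬虫基础'), ('greet', 'hello world'), ('age', '100')]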

urlparse and urlsplit:

Sometimes you get a URL and want to split it into its components. You can use urlparse or urlsplit to do the splitting. Sample code is as follows:

from urllib import request,parse

url = 'http://www.baidu.com/s?username=zhiliao'

result = parse.urlsplit(url)
# result = parse.urlparse(url)

print('scheme:', result.scheme)
print('netloc:', result.netloc)
print('path:', result.path)
print('query:', result.query)

urlparse and urlsplit are basically identical. The only difference is that urlparse has an extra params attribute, which urlsplit does not have. For example, given the URL url = 'http://www.baidu.com/s;hello?wd=python&username=abc#1', urlparse can extract hello as params, but urlsplit cannot. The params component is rarely used in practice.
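
A minimal sketch of the difference, using the URL above:

from urllib import parse

url = 'http://www.baidu.com/s;hello?wd=python&username=abc#1'
print(parse.urlparse(url).params)  # 'hello'
print(parse.urlsplit(url).path)    # '/s;hello' (SplitResult has no params attribute)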

The request.Request class:

If you want to add request headers, you must use the request.Request class. For example, to add a User-Agent header, sample code is as follows:

from urllib import request

headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}
req = request.Request("http://www.baidu.com/",headers=headers) resp = request.urlopen(req) print(resp.read()) 

Neihan Shequ crawler practice exercise:

  1. URL: http://neihanshequ.com/bar/1/
  2. Requirement: crawling one page of data is enough (a starting sketch follows below).
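
A minimal starting sketch for the exercise, assuming the site is still reachable and serves UTF-8 HTML (it may have since gone offline):

from urllib import request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}
req = request.Request('http://neihanshequ.com/bar/1/', headers=headers)
resp = request.urlopen(req)
with open('neihan.html', 'w', encoding='utf-8') as fp:
    fp.write(resp.read().decode('utf-8'))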

ProxyHandler processor (setting a proxy):

Many websites detect the number of visits from a given IP within a certain period of time (through traffic statistics, system logs, etc.). If the number of visits looks abnormal for a real user, the site will ban that IP.
So we can set up several proxy servers and switch proxies from time to time; even if one IP is banned, we can switch to another and keep crawling.
urllib sets a proxy server through ProxyHandler. The following code shows how to use a custom opener with a proxy (the proxy IP in the example has likely expired; substitute a live one):

from urllib import request

# Without a proxy:
# resp = request.urlopen('http://httpbin.org/get')
# print(resp.read().decode("utf-8"))

# With a proxy:
handler = request.ProxyHandler({"http": "218.66.161.88:31769"})
opener = request.build_opener(handler)
req = request.Request("http://httpbin.org/ip")
resp = opener.open(req)
print(resp.read())

There are many commonly used proxy providers, both free and paid.

What is a cookie:

HTTP requests to a website are stateless. In other words, even after the first request has connected to the server and logged in successfully, the server still cannot tell which user is making the second request. Cookies exist to solve this problem: after the first login, the server returns some data (the cookie) to the browser, which stores it locally. When the user sends a second request, the browser automatically attaches the cookie data stored from the last request, and from that data the server can tell which user the current request belongs to. The amount of data a cookie can store is limited; different browsers have different limits, but it is generally no more than 4KB. Therefore, cookies can only be used to store small amounts of data.

cookie format:

Set-Cookie: NAME=VALUE;Expires/Max-age=DATE;Path=PATH;Domain=DOMAIN_NAME;SECURE

Parameter meanings:

  • NAME: the name of the cookie.
  • VALUE: the value of the cookie.
  • Expires: the expiration time of the cookie.
  • Path: the path the cookie applies to.
  • Domain: the domain the cookie applies to.
  • SECURE: whether the cookie only takes effect under the HTTPS protocol.
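
A concrete (made-up) example of such a header:

Set-Cookie: sessionid=abc123; Expires=Wed, 21 Oct 2026 07:28:00 GMT; Path=/; Domain=.example.com; Secure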

Using http.cookiejar and HTTPCookieProcessor to simulate login:

A cookie is a text file stored in the user's browser that the server uses to identify the user and track the session; it can keep the login information for the user's next session with the server.
Here we take Renren (人人网) as an example. On Renren, you must be logged in to visit someone's home page, which means we need the login cookie information. If we want to access such a page from code, we must carry valid cookie information. There are two solutions. The first is to log in with a browser, copy the cookie information, and put it in the headers. Sample code is as follows:

from urllib import request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
    'Cookie': 'anonymid=jacdwz2x-8bjldx; depovince=GW; _r01_=1; _ga=GA1.2.1455063316.1511436360; _gid=GA1.2.862627163.1511436360; wp=1; JSESSIONID=abczwY8ecd4xz8RJcyP-v; jebecookies=d4497791-9d41-4269-9e2b-3858d4989785|||||; ick_login=884e75d4-f361-4cff-94bb-81fe6c42b220; _de=EA5778F44555C091303554EBBEB4676C696BF75400CE19CC; p=61a3c7d0d4b2d1e991095353f83fa2141; first_login_flag=1; [email protected]; ln_hurl=http://hdn.xnimg.cn/photos/hdn121/20170428/1700/main_nhiB_aebd0000854a1986.jpg; t=3dd84a3117737e819dd2c32f1cdb91d01; societyguester=3dd84a3117737e819dd2c32f1cdb91d01; id=443362311; xnsid=169efdc0; loginfrom=syshome; ch_id=10016; jebe_key=9c062f5a-4335-4a91-bf7a-970f8b86a64e%7Ca022c303305d1b2ab6b5089643e4b5de%7C1511449232839%7C1; wp_fold=0'
}
url = 'http://www.renren.com/880151247/profile'
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
with open('renren.html', 'w', encoding='utf-8') as fp:
    fp.write(resp.read().decode('utf-8'))

But copying the cookie from the browser every time a page needs one is too much trouble. In Python, cookies are usually handled with the http.cookiejar module together with the HTTPCookieProcessor handler class from the urllib.request module. The main role of http.cookiejar is to provide objects for storing cookies; the main role of HTTPCookieProcessor is to process those cookie objects and build a handler object.
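
A minimal sketch of the pattern (httpbin.org's /cookies/set endpoint is used here only to show the jar filling up):

from urllib import request
from http.cookiejar import CookieJar

cookiejar = CookieJar()                           # in-memory cookie storage
handler = request.HTTPCookieProcessor(cookiejar)  # wrap the jar as a handler
opener = request.build_opener(handler)            # opener that tracks cookies
opener.open('http://httpbin.org/cookies/set?foo=bar')
print(cookiejar)                                  # the jar now holds foo=bar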

The http.cookiejar module:

The main classes of this module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar. Their roles are as follows:

  1. CookieJar: manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The entire cookie is stored in memory, so the cookies are lost once the CookieJar instance is garbage collected.
  2. FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar; creates a FileCookieJar instance and retrieves the cookie information stored in a cookie file. filename is the name of the file in which cookies are stored. delayload=True enables deferred file access, i.e. the file is only read from or written to when needed.
  3. MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
  4. LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the libwww-perl standard Set-Cookie3 file format.

Logging in to Renren:

Use http.cookiejar and request.HTTPCookieProcessor to log in to Renren. Related sample code is as follows:

from urllib import request,parse
from http.cookiejar import CookieJar

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

def get_opener():
    # Build an opener whose cookie jar captures cookies from the login response.
    cookiejar = CookieJar()
    handler = request.HTTPCookieProcessor(cookiejar)
    opener = request.build_opener(handler)
    return opener

def login_renren(opener):
    data = {"email": "[email protected]", "password": "pythonspider"}
    data = parse.urlencode(data).encode('utf-8')
    login_url = "http://www.renren.com/PLogin.do"
    req = request.Request(login_url, headers=headers, data=data)
    opener.open(req)

def visit_profile(opener):
    # The opener still carries the login cookies, so this page is accessible.
    url = 'http://www.renren.com/880151247/profile'
    req = request.Request(url, headers=headers)
    resp = opener.open(req)
    with open('renren.html', 'w', encoding='utf-8') as fp:
        fp.write(resp.read().decode("utf-8"))

if __name__ == '__main__':
    opener = get_opener()
    login_renren(opener)
    visit_profile(opener)

Saving cookies to a local file:

To save cookies to a local file, use the cookiejar's save method, and specify the file name (here it is passed to the MozillaCookieJar constructor):

from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar("cookie.txt")
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}
req = request.Request('http://httpbin.org/cookies', headers=headers)
resp = opener.open(req)
print(resp.read())
cookiejar.save(ignore_discard=True, ignore_expires=True)

Loading cookies from a local file:

To load cookies from a local file, use the cookiejar's load method, again specifying the file name:

from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar("cookie.txt")
cookiejar.load(ignore_expires=True, ignore_discard=True)
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}
req = request.Request('http://httpbin.org/cookies', headers=headers)
resp = opener.open(req)
print(resp.read())
