Using urllib to write Python 3 crawlers

urllib

urllib is Python 3's package for working with URLs.

Its main modules are:

1, urllib.request: mainly used to open and read URLs

What I usually use it for:

Opening a URL: urllib.request.urlopen(url)

Disguising the request as coming from a browser, with urllib.request.build_opener([handler, ...])

import urllib.request
# The browser to disguise as (I use Chrome here)
headers = ('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36')
url='http://hotels.ctrip.com/'
opener = urllib.request.build_opener()
# Add the browser's User-Agent to the HTTP headers
opener.addheaders = [headers]
# Read the URL
data = opener.open(url).read()
# Decode the returned HTML as UTF-8
data = data.decode('utf-8')
print(data)
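
The User-Agent can also be attached to a single request instead of to an opener, using urllib.request.Request. This is only a minimal sketch of that alternative, not the method used above: the helper name make_browser_request is hypothetical, and nothing is sent over the network until the request is passed to urllib.request.urlopen.

```python
import urllib.request

# Same Chrome User-Agent string as in the example above
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36')


def make_browser_request(url):
    """Hypothetical helper: build a Request that carries a browser User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


req = make_browser_request('http://hotels.ctrip.com/')
# Request stores header names capitalized, hence 'User-agent'
print(req.get_header('User-agent'))
print(req.full_url)
# To actually fetch the page: urllib.request.urlopen(req).read()
```

This form is handy when different requests need different headers, while the opener approach applies the same headers to everything it opens.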

2, urllib.parse: mainly used to parse URLs

Main methods:

urllib.parse.urlparse(urlstring)

Function: parses the URL into six components and returns them as a named tuple. (It works almost exactly like urlsplit(), except that urlsplit() does not split params out of the path and so returns five components.)

import urllib.parse
o = urllib.parse.urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
print(o)
print(o.path)
print(o.scheme)
print(o.port)
print(o.geturl())

Output:

ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
/%7Eguido/Python.html
http
80
http://www.cwi.nl:80/%7Eguido/Python.html
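
The difference between urlparse() and urlsplit() only shows up when the URL carries a ';params' part, which the URL above does not. The following sketch uses a made-up example.com URL (an assumption, not from the original) to compare the two.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a ';params' part on the last path segment
url = 'http://example.com/a/b;type=x?q=1#top'

parsed = urlparse(url)   # six components, params split out of the path
split = urlsplit(url)    # five components, params left inside the path

print(len(parsed), parsed.path, parsed.params)
print(len(split), split.path)
```

For most crawler URLs, which have no params part, either function gives the same scheme, netloc, path, query and fragment.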

Building a new URL: urllib.parse.urljoin(base, url)

Parameters: base: the base URL

   url: the other URL (usually a relative one) to resolve against base

from urllib.parse import urljoin
a=urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
print(a)

Result: http://www.cwi.nl/%7Eguido/FAQ.html

This function should make writing crawlers much more convenient. Until now I had been doing it the clumsy way, concatenating strings directly.
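
To illustrate why urljoin() beats string concatenation in a crawler, here is a small sketch that resolves several kinds of links against the page they were found on. The href values and the example.com address are made up for illustration.

```python
from urllib.parse import urljoin

base = 'http://www.cwi.nl/%7Eguido/Python.html'

# Hypothetical href values as they might appear in a page's links
hrefs = ['FAQ.html', '/index.html', '../about.html', 'http://example.com/x']

# Resolve each link against the page it was found on
absolute = [urljoin(base, href) for href in hrefs]
for link in absolute:
    print(link)
```

Relative paths, root-relative paths, '..' segments and already-absolute URLs each need different handling, which is easy to get wrong when concatenating by hand.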

3, urllib.error: exception handling

Use try-except to catch exceptions.

There are two main exception types: URLError and HTTPError.

Because HTTPError is a subclass of URLError, the except clause for URLError must be written after the one for HTTPError; otherwise the URLError clause would catch HTTPError as well. In other words: find the son and you know the father, but find the father and you may not know the son.

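import urllib.request
import urllib.error

# Any URL to fetch; the one from the first example is reused here
url = 'http://hotels.ctrip.com/'
try:
    data = urllib.request.urlopen(url)
    print(data.read().decode('utf-8'))
except urllib.error.HTTPError as e:
    # The server answered, but with an error status code
    print(e.code)
except urllib.error.URLError as e:
    # No valid response was received (DNS failure, timeout, ...)
    print(e.reason)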

Result: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

If an HTTPError is caught, its status code is printed and the URLError clause is skipped. If the exception is not an HTTPError, the URLError clause catches it and prints the reason for the error.
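
The ordering rule can be checked without any network access. In the sketch below the two exceptions are constructed by hand purely for illustration, and describe is a hypothetical helper name, not part of urllib.

```python
import io
import urllib.error


def describe(exc):
    """Hypothetical helper: raise exc and report which except clause caught it."""
    try:
        raise exc
    except urllib.error.HTTPError as e:   # subclass first
        return 'HTTPError: %d' % e.code
    except urllib.error.URLError as e:    # parent class second
        return 'URLError: %s' % e.reason


# Exceptions built by hand for illustration; no request is sent
http_err = urllib.error.HTTPError('http://example.com/', 404, 'Not Found',
                                  None, io.BytesIO())
url_err = urllib.error.URLError('timed out')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(describe(http_err))
print(describe(url_err))
```

If the two except clauses were swapped, the URLError clause would also catch http_err, and the status code would never be reported.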


Origin blog.csdn.net/qq_40925239/article/details/92630254