1. Common libraries
1. requests: used for making HTTP requests.
requests.get("url")
2. Selenium: browser automation.
3. lxml: web page parsing.
4. beautifulsoup: web page parsing.
5. pyquery: a web page parsing library said to be easier to use than BeautifulSoup; its syntax is very similar to jQuery (see the short sketch after this list).
6. pymysql: operates MySQL databases.
7. pymongo: operates the MongoDB database.
8. redis: operates Redis, a non-relational database.
9. jupyter: an online notebook.
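To illustrate the jQuery-like syntax mentioned for pyquery, here is a minimal sketch (assuming pyquery is installed; the HTML snippet is made up for the example):
from pyquery import PyQuery as pq

# a made-up HTML snippet for illustration
doc = pq('<div><p class="title">hello</p></div>')
print(doc('p.title').text())  # jQuery-style CSS selector; prints: hello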
2. What is urllib?
Python's built-in HTTP request library, with four modules:
urllib.request: request module, simulates a browser sending requests
urllib.error: exception handling module
urllib.parse: URL parsing module; a tool module for operations such as splitting and merging URLs
urllib.robotparser: robots.txt parsing module
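A quick tour of the four submodules (a sketch; the URLs are just examples):
import urllib.request, urllib.error, urllib.parse, urllib.robotparser

# parse module: split a URL into its parts
print(urllib.parse.urlparse('http://www.baidu.com/index.html?id=5'))
# request module: fetch a page; error module: handle failures
try:
    response = urllib.request.urlopen('http://www.baidu.com')
except urllib.error.URLError as e:
    print(e.reason)
# robotparser module: parse a site's robots.txt
rp = urllib.robotparser.RobotFileParser('http://www.baidu.com/robots.txt')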
The difference between Python 2 and Python 3:
Python 2
import urllib2
response = urllib2.urlopen('http://www.baidu.com')
Python 3
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
Usage:
urlopen sends a request to the server.
urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
Examples:
Example 1:
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
Example 2:
import urllib.request
import urllib.parse
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
Note: if data is supplied, the request is sent via POST; otherwise it is sent via GET.
Example 3:
Timeout test
import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())
----- the block above completes normally; with a very short timeout, an exception is raised:
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
The output is: TIME OUT
Response
Response type:
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))
Output: <class 'http.client.HTTPResponse'>
Status code and response headers:
import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(response.status)  # returns 200 on success
print(response.getheaders())  # returns the response headers
print(response.getheader('Server'))  # returns a single header value
3. Request objects can carry headers
import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
Example:
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
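Headers can also be added after constructing the Request, via add_header (the same request as above, restated):
from urllib import request, parse

url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))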
4. Proxies
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
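As an optional variation (a sketch; the proxy address 127.0.0.1:9743 is the same placeholder used above), the opener can be installed globally so that plain urlopen calls also go through the proxy:
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:9743'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # every subsequent urlopen uses this opener
response = urllib.request.urlopen('http://httpbin.org/get')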
5. Cookies
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
The first way to save cookies:
import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
The second way to save cookies:
import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
Reading cookies:
import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
6. Exception handling
Example 1:
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)  # catches the URL exception
Example 2:
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')  # catch the HTTPError subclass first
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
7. URL parsing
urlparse  # splits a URL into its components
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Example 1:
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
Result: <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
Example 2:
from urllib.parse import urlparse  # the URL carries no http scheme

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
Note: the scheme argument only serves as a default when the URL itself has no scheme.
Example 3:
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
Note: here the URL already has a scheme, so the result keeps 'http' and the scheme argument is ignored.
Example 4:
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)
Note: with allow_fragments=False the fragment is not split out; it is merged into the query.
Example 5:
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
Note: with no query present, the fragment is merged into the path instead.
8. URL construction (splicing)
urlunparse
Example:
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
Output: http://www.baidu.com/index.html;user?a=6#comment
urljoin
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
Note: fields in the second URL override those in the first.
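A few more urljoin cases to illustrate that override rule (a sketch; the URLs are just examples, with the expected outputs as comments):
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
# http://www.baidu.com/FAQ.html
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
# https://cuiqingcai.com/FAQ.html -- a complete second URL wins outright
print(urljoin('http://www.baidu.com/about.html', '?category=2'))
# http://www.baidu.com/about.html?category=2 -- only the query is replaced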
urlencode
from urllib.parse import urlencode

params = {
    'name': 'gemey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
Output: http://www.baidu.com?name=gemey&age=22
urllib
urllib is the standard library that ships with Python; it can be used directly without installation.
It provides the following functions:
- making web page requests
- getting responses
- proxy and cookie settings
- exception handling
- URL parsing
The functions a crawler needs can basically all be found in urllib. Learning this standard library gives a deeper understanding of the more convenient requests library introduced later.
The urllib library
urlopen syntax:
urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
# url: the address to visit
# data: extra data, such as form data
Usage:
# request: GET
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# request: POST
# HTTP test site: http://httpbin.org/
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

# timeout setting
import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
Response
# response type
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))

# status code, response headers
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
Request
Declare a Request object, which can include headers and other information, then open it with urlopen.
# simple example
import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

# add headers
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Host': 'httpbin.org'
}
# build the POST form
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# or add the header afterwards
from urllib import request, parse
url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Handler: handles more complex pages
See the official documentation for details.
Proxy:
import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())
Cookies: used by the client to record user identity and maintain login state
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)

# save cookies as a text file
import http.cookiejar, urllib.request
filename = "cookie.txt"
# several save formats are available
## type 1
cookie = http.cookiejar.MozillaCookieJar(filename)
## type 2
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

# load the file with the matching method
import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
Exception handling
Catch exceptions to keep the program running stably.
# visit a page that does not exist
from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

# catch the subclass error first
from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

# check the cause
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
URL parsing
Mainly a tool module that provides URL handling for crawlers.
urlparse: splits a URL
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
# scheme: protocol type
# allow_fragments: whether to split out the '#' fragment part
For example:
from urllib.parse import urlparse
result = urlparse("https://edu.hellobi.com/course/157/play/lesson/2580")
result
## ParseResult(scheme='https', netloc='edu.hellobi.com', path='/course/157/play/lesson/2580', params='', query='', fragment='')
urlunparse: concatenates a URL; the reverse operation of urlparse
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
urljoin: concatenates two URLs
urlencode: converts a dict into GET request parameters (a query string)
from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
Finally, there is urllib.robotparser, which parses a site's robots.txt to determine which parts of the website are allowed to be crawled.
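A minimal sketch of urllib.robotparser (the target site is just an example):
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()  # download and parse robots.txt
# can_fetch(useragent, url): whether this user agent may crawl the URL
print(rp.can_fetch('*', 'http://www.baidu.com/'))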
Author: hoptop
Link: https://www.jianshu.com/p/cfbdacbeac6e
Source: Jianshu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.