Python Crawler: Common Libraries

1. Common libraries

1. requests: used to make HTTP requests.

requests.get("url")

2. selenium: browser automation.

3. lxml: fast HTML/XML parsing.

4. beautifulsoup: HTML parsing library.

5. pyquery: a web page parsing library said to be easier to use than BeautifulSoup; its syntax is very similar to jQuery (see the short sketch after this list).

6. pymysql: operates MySQL databases.

7. pymongo: operates MongoDB databases.

8. redis: client for the Redis non-relational (key-value) database.

9. jupyter: interactive notebook.
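A minimal fetch-and-parse sketch combining two of the libraries above (requests and pyquery); it assumes both packages are installed, and the target URL is just an example:

# Fetch a page with requests, then parse it with pyquery's jQuery-like selectors.
import requests
from pyquery import PyQuery as pq

resp = requests.get('http://httpbin.org/html')   # example page; any HTML page works
doc = pq(resp.text)                              # build a pyquery document from the HTML text
print(doc('h1').text())                          # select the <h1> element, jQuery style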

2. What is Urllib

Python's built-in HTTP request library.

urllib.request: the request module; simulates sending a browser request

urllib.error: the exception handling module

urllib.parse: the URL parsing module; a tool module for splitting, merging, etc.

urllib.robotparser: the robots.txt parsing module

 

The difference between Python 2 and Python 3

Python2

import urllib2

response = urllib2.urlopen('http://www.baidu.com')

 

Python3

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

usage:

urlopen sends a request to the server.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

 example:

Example 1:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

print(response.read().decode('utf-8'))

 

  Example 2:

  import urllib.request

  import urllib.parse

  data=bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')

  response = urllib.request.urlopen('http://httpbin.org/post', data=data)

  print(response.read())

  Note: when data is supplied, the request is sent as a POST; otherwise it is sent as a GET.

 

  Example 3:

  Timeout test

  import urllib.request

  response =urllib.request.urlopen('http://httpbin.org/get',timeout=1)

  print(response.read())

  ----- The request above completes normally. The following deliberately triggers a timeout:

  import socket

  import urllib.request

  import urllib.error

  try:

    response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)

  except urllib.error.URLError as e:

    if isinstance(e.reason,socket.timeout):

      print('TIME OUT')

  This prints: TIME OUT

 

 response

 response type

import urllib.request

response=urllib.request.urlopen('https://www.python.org')

print(type(response))

 Output: <class 'http.client.HTTPResponse'>

 

     

   Status code, response header

   import urllib.request

   response = urllib.request.urlopen('http://www.python.org')

   print(response.status)  # 200 when the request succeeds

   print(response.getheaders())  # all response headers, as a list of (name, value) tuples

   print(response.getheader('Server'))  # a single response header
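   As a side note, since getheaders() returns a list of (name, value) tuples, it can be converted to a dict when that is more convenient — a minimal sketch:

   import urllib.request

   response = urllib.request.urlopen('http://www.python.org')
   headers = dict(response.getheaders())   # list of (name, value) tuples -> dict
   print(headers.get('Server'))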

 

3. Request objects can add headers

   import urllib.request

  request=urllib.request.Request('https://python.org')

  response=urllib.request.urlopen(request)

  print(response.read().decode('utf-8'))

 

 

  example:

   from urllib import request,parse

  url='http://httpbin.org/post'

  headers={

    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
    'Host': 'httpbin.org'

  }

  dict={

    'name':'Germey'

  }

 

  data=bytes(parse.urlencode(dict),encoding='utf8')

  req= request.Request(url=url,data=data,headers=headers,method='POST')

  response = request.urlopen(req)

  print(response.read().decode('utf-8'))

 

 

4. Proxies

   import urllib.request

  proxy_handler =urllib.request.ProxyHandler({

    'http':'http://127.0.0.1:9743',

    'https':'http://127.0.0.1:9743',

  })

  opener =urllib.request.build_opener(proxy_handler)

   response= opener.open('http://httpbin.org/get')

  print(response.read())
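  Several handlers can also be combined in one opener, and install_opener() makes that opener the default for urlopen(); a minimal sketch (the proxy address is only an example):

  import http.cookiejar
  import urllib.request

  cookie = http.cookiejar.CookieJar()
  proxy_handler = urllib.request.ProxyHandler({
      'http': 'http://127.0.0.1:9743',    # example local proxy
      'https': 'http://127.0.0.1:9743',
  })
  opener = urllib.request.build_opener(proxy_handler, urllib.request.HTTPCookieProcessor(cookie))
  urllib.request.install_opener(opener)   # later urlopen() calls now use the proxy and cookie jar
  response = urllib.request.urlopen('http://httpbin.org/get')
  print(response.read().decode('utf-8'))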

 

 

5. Cookies

   import http.cookiejar,urllib.request

  cookie = http.cookiejar.CookieJar()

  handler=urllib.request.HTTPCookieProcessor(cookie)

  opener = urllib.request.build_opener(handler)

  response = opener.open('http://www.baidu.com')

  for item in cookie:

    print(item.name+"="+item.value)

 

  The first way to save cookies

  import http.cookiejar,urllib.request

  filename = 'cookie.txt'  

  cookie =http.cookiejar.MozillaCookieJar(filename)

  handler= urllib.request.HTTPCookieProcessor(cookie)

  opener=urllib.request.build_opener(handler)

  response= opener.open('http://www.baidu.com')

  cookie.save(ignore_discard=True,ignore_expires=True)

 

  The second way to save cookies

  import http.cookiejar,urllib.request

  filename = 'cookie.txt'

  cookie =http.cookiejar.LWPCookieJar(filename)

  handler=urllib.request.HTTPCookieProcessor(cookie)

  opener=urllib.request.build_opener(handler)

  response=opener.open('http://www.baidu.com')

  cookie.save(ignore_discard=True,ignore_expires=True)

  read cookies

  import http.cookiejar,urllib.request

  cookie=http.cookiejar.LWPCookieJar()

  cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)

  handler=urllib.request.HTTPCookieProcessor(cookie)

  opener=urllib.request.build_opener(handler)

  response=opener.open('http://www.baidu.com')

  print(response.read().decode('utf-8'))

 

 

 6. Exception handling

  Example 1:

   from urllib import request, error

   try:

    response =request.urlopen('http://cuiqingcai.com/index.htm') 

  except error.URLError as e:

    print(e.reason)  # URL exception capture

 

  Example 2:

  from urllib import request, error

   try:

    response =request.urlopen('http://cuiqingcai.com/index.htm') 

  except error.HTTPError as e:

    print(e.reason, e.code, e.headers, sep='\n')  # HTTP error details

  except error.URLError as e:

    print(e.reason)  

  else:

    print('Request Successfully')

 

 

7. URL parsing

   urlparse  # splits a URL into its components

  urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)

  

  example:

  from urllib.parse import urlparse  # splits a URL

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

  print(type(result),result)

   result:

  <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 

   Example 2:

  from urllib.parse import urlparse  # the URL below has no scheme (no http://)

  result = urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')

     print(result)

  ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

 

   Example 3:

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',scheme='https')

   print(result)

   ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')  # scheme= only applies when the URL itself has no scheme

 

   Example 4:

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)

   print(result)

   ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')  # with allow_fragments=False the fragment is merged into the query

 

   Example 5:

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)

   print(result)

   ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')  # with no query or params, the fragment is merged into the path

 

 

 8. Splicing

  urlunparse

   example:

  from urllib.parse import urlunparse

  data=['http','www.baidu.com','index.html','user','a=6','comment']

  print(urlunparse(data))

   http://www.baidu.com/index.html;user?a=6#comment

 

   urljoin

   from urllib.parse import urljoin

  print(urljoin('http://www.baidu.com','FAQ.html'))

  Output: http://www.baidu.com/FAQ.html

  Fields in the second URL override those in the first (the later part covers the earlier part), as the sketch below shows.
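  A quick sketch of that override rule (the URLs are just examples):

  from urllib.parse import urljoin

  print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
  # https://cuiqingcai.com/FAQ.html -- the second URL carries its own scheme and host, so it wins

  print(urljoin('http://www.baidu.com/about.html', '?category=2#comment'))
  # http://www.baidu.com/about.html?category=2#comment -- missing parts are filled in from the first URL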

 

  urlencode

  from urllib.parse import urlencode

  params={

    'name':'gemey',

    'age':22

  }

  base_url='http://www.baidu.com?'

  url = base_url+urlencode(params)

  print(url)

  http://www.baidu.com?name=gemey&age=22

 

 

Summary of the urllib library (from the article linked at the end):

urllib is the standard library that ships with Python; it can be used directly without installation.
It provides the following functions:

  • Web page requests
  • Getting responses
  • Proxy and cookie settings
  • Exception handling
  • URL parsing

The functions a crawler needs can basically all be found in urllib. Learning this standard library gives a deeper understanding of the more convenient requests library used later.
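To make that comparison concrete, here is a rough side-by-side sketch (assuming requests is installed) of the same POST done with urllib and with requests:

# the urllib way
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

# the requests way: encoding and the POST method are handled for you
import requests

r = requests.post('http://httpbin.org/post', data={'word': 'hello'})
print(r.text)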

urllib library

urlopen syntax

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)  # url: the address to request  # data: extra data, e.g. POST form data

usage

# request:GET
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# request: POST
# HTTP test service: http://httpbin.org/
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word':'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

# timeout setting
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

response

# response type
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))

# status code, response headers
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Request

Declare a Request object, which can carry headers and other information, and then open it with urlopen.

# a simple example
import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

# add headers
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Host': 'httpbin.org'
}
# build the POST form
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# or add the header afterwards
from urllib import request, parse

url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Handler: handles more complex pages

Official description
Proxy

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())

Cookie : used by the client to record user identity and maintain login information

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)

# save cookies as a text file
import http.cookiejar, urllib.request

filename = "cookie.txt"
# several save formats are available
## format 1
cookie = http.cookiejar.MozillaCookieJar(filename)
## format 2
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

# read them back with the matching CookieJar class
import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")

exception handling

Catch exceptions to ensure stable operation of the program

# 访问不存在的页面
from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

# catch the more specific subclass error first
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

# check the underlying cause
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

URL parsing

Mainly a tool module that provides URL handling for crawlers.

urlparse : split URLs

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
# scheme: the protocol type
# allow_fragments: whether to parse the '#' fragment part

for example

from urllib.parse import urlparse
result = urlparse("https://edu.hellobi.com/course/157/play/lesson/2580")
result
##ParseResult(scheme='https', netloc='edu.hellobi.com', path='/course/157/play/lesson/2580', params='', query='', fragment='')

urlunparse : concatenates URL components back into a URL; the reverse operation of urlparse

from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))

urljoin : joins two URLs; fields present in the second URL override those in the first

 
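A short example (the same call as in the first half of these notes):

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
# http://www.baidu.com/FAQ.html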

urlencode : converts a dictionary into a GET query string

from urllib.parse import urlencode
params = {
    'name':'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
# http://www.baidu.com?name=germey&age=22

Finally, there is urllib.robotparser, which parses a site's robots.txt to determine which parts of the website are allowed to be crawled.
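A minimal robotparser sketch (the robots.txt URL is just an example):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.python.org/robots.txt')   # example robots.txt location
rp.read()                                          # download and parse the rules
print(rp.can_fetch('*', 'https://www.python.org/about/'))   # may a generic crawler fetch this path?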



Author: hoptop
Link: https://www.jianshu.com/p/cfbdacbeac6e
Source: Jianshu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
