Python Crawler: Common Libraries

1. Common libraries

1. requests: used to make HTTP requests.

requests.get("url")

2. selenium: browser automation.

3. lxml: fast HTML/XML parsing.

4. beautifulsoup: HTML parsing library.

5. pyquery: a web page parsing library said to be easier to use than BeautifulSoup; its syntax is very similar to jQuery (see the short sketch after this list).

6. pymysql: operates MySQL databases.

7. pymongo: operates MongoDB databases.

8. redis: client for the Redis non-relational (key-value) database.

9. jupyter: interactive notebook.
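A minimal fetch-and-parse sketch combining two of the libraries above (requests and pyquery); it assumes both packages are installed, and the target URL is just an example:

# Fetch a page with requests, then parse it with pyquery's jQuery-like selectors.
import requests
from pyquery import PyQuery as pq

resp = requests.get('http://httpbin.org/html')   # example page; any HTML page works
doc = pq(resp.text)                              # build a pyquery document from the HTML text
print(doc('h1').text())                          # select the <h1> element, jQuery style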

2. What is Urllib

Python's built-in HTTP request library.

urllib.request: the request module; simulates sending a browser request

urllib.error: the exception handling module

urllib.parse: the URL parsing module; a tool module for splitting, merging, etc.

urllib.robotparser: the robots.txt parsing module

 

The difference between Python 2 and Python 3

Python2

import urllib2

response = urllib2.urlopen('http://www.baidu.com')

 

Python3

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

usage:

urlopen sends a request to the server.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

 example:

Example 1:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

print(response.read().decode('utf-8'))

 

  Example 2:

  import urllib.request

  import urllib.parse

  data=bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')

  response = urllib.request.urlopen('http://httpbin.org/post', data=data)

  print(response.read())

  Note: when data is supplied, the request is sent as a POST; otherwise it is sent as a GET.

 

  Example 3:

  Timeout test

  import urllib.request

  response =urllib.request.urlopen('http://httpbin.org/get',timeout=1)

  print(response.read())

  ----- The request above completes normally. The following deliberately triggers a timeout:

  import socket

  import urllib.request

  import urllib.error

  try:

    response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)

  except urllib.error.URLError as e:

    if isinstance(e.reason,socket.timeout):

      print('TIME OUT')

  This prints: TIME OUT

 

 response

 response type

import urllib.request

response=urllib.request.urlopen('https://www.python.org')

print(type(response))

 Output: <class 'http.client.HTTPResponse'>

 

     

   Status code, response header

   import urllib.request

   response = urllib.request.urlopen('http://www.python.org')

   print(response.status)  # 200 when the request succeeds

   print(response.getheaders())  # all response headers, as a list of (name, value) tuples

   print(response.getheader('Server'))  # a single response header
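   As a side note, since getheaders() returns a list of (name, value) tuples, it can be converted to a dict when that is more convenient — a minimal sketch:

   import urllib.request

   response = urllib.request.urlopen('http://www.python.org')
   headers = dict(response.getheaders())   # list of (name, value) tuples -> dict
   print(headers.get('Server'))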

 

3. Request objects can add headers

   import urllib.request

  request=urllib.request.Request('https://python.org')

  response=urllib.request.urlopen(request)

  print(response.read().decode('utf-8'))

 

 

  example:

   from urllib import request,parse

  url='http://httpbin.org/post'

  headers={

    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
    'Host': 'httpbin.org'

  }

  dict={

    'name':'Germey'

  }

 

  data=bytes(parse.urlencode(dict),encoding='utf8')

  req= request.Request(url=url,data=data,headers=headers,method='POST')

  response = request.urlopen(req)

  print(response.read().decode('utf-8'))

 

 

4. Proxies

   import urllib.request

  proxy_handler =urllib.request.ProxyHandler({

    'http':'http://127.0.0.1:9743',

    'https':'http://127.0.0.1:9743',

  })

  opener =urllib.request.build_opener(proxy_handler)

   response= opener.open('http://httpbin.org/get')

  print(response.read())
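  Several handlers can also be combined in one opener, and install_opener() makes that opener the default for urlopen(); a minimal sketch (the proxy address is only an example):

  import http.cookiejar
  import urllib.request

  cookie = http.cookiejar.CookieJar()
  proxy_handler = urllib.request.ProxyHandler({
      'http': 'http://127.0.0.1:9743',    # example local proxy
      'https': 'http://127.0.0.1:9743',
  })
  opener = urllib.request.build_opener(proxy_handler, urllib.request.HTTPCookieProcessor(cookie))
  urllib.request.install_opener(opener)   # later urlopen() calls now use the proxy and cookie jar
  response = urllib.request.urlopen('http://httpbin.org/get')
  print(response.read().decode('utf-8'))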

 

 

5. Cookies

   import http.cookiejar,urllib.request

  cookie = http.cookiejar.CookieJar()

  handler=urllib.request.HTTPCookieProcessor(cookie)

  opener = urllib.request.build_opener(handler)

  response = opener.open('http://www.baidu.com')

  for item in cookie:

    print(item.name+"="+item.value)

 

  The first way to save cookies

  import http.cookiejar,urllib.request

  filename = 'cookie.txt'  

  cookie =http.cookiejar.MozillaCookieJar(filename)

  handler= urllib.request.HTTPCookieProcessor(cookie)

  opener=urllib.request.build_opener(handler)

  response= opener.open('http://www.baidu.com')

  cookie.save(ignore_discard=True,ignore_expires=True)

 

  The second way to save cookies

  import http.cookiejar,urllib.request

  filename = 'cookie.txt'

  cookie =http.cookiejar.LWPCookieJar(filename)

  handler=urllib.request.HTTPCookieProcessor(cookie)

  opener=urllib.request.build_opener(handler)

  response=opener.open('http://www.baidu.com')

  cookie.save(ignore_discard=True,ignore_expires=True)

  read cookies

  import http.cookiejar,urllib.request

  cookie=http.cookiejar.LWPCookieJar()

  cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)

  handler=urllib.request.HTTPCookieProcessor(cookie)

  opener=urllib.request.build_opener(handler)

  response=opener.open('http://www.baidu.com')

  print(response.read().decode('utf-8'))

 

 

 6. Exception handling

  Example 1:

   from urllib import request, error

   try:

    response =request.urlopen('http://cuiqingcai.com/index.htm') 

  except error.URLError as e:

    print(e.reason)  # URL exception capture

 

  Example 2:

  from urllib import request, error

   try:

    response =request.urlopen('http://cuiqingcai.com/index.htm') 

  except error.HTTPError as e:

    print(e.reason, e.code, e.headers, sep='\n')  # HTTP error details

  except error.URLError as e:

    print(e.reason)  

  else:

    print('Request Successfully')

 

 

7. URL parsing

   urlparse  # splits a URL into its components

  urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)

  

  example:

  from urllib.parse import urlparse  # splits a URL

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

  print(type(result),result)

   result:

  <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 

   Example 2:

  from urllib.parse import urlparse  # the URL below has no scheme (no http://)

  result = urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')

     print(result)

  ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

 

   Example 3:

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',scheme='https')

   print(result)

   ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')  # scheme= only applies when the URL itself has no scheme

 

   Example 4:

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)

   print(result)

   ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')  # with allow_fragments=False the fragment is merged into the query

 

   Example 5:

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)

   print(result)

   ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')  # with no query or params, the fragment is merged into the path

 

 

 8. Splicing

  urlunparse

   example:

  from urllib.parse import urlunparse

  data=['http','www.baidu.com','index.html','user','a=6','comment']

  print(urlunparse(data))

   http://www.baidu.com/index.html;user?a=6#comment

 

   urljoin

   from urllib.parse import urljoin

  print(urljoin('http://www.baidu.com','FAQ.html'))

  Output: http://www.baidu.com/FAQ.html

  Fields in the second URL override those in the first (the later part covers the earlier part), as the sketch below shows.
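  A quick sketch of that override rule (the URLs are just examples):

  from urllib.parse import urljoin

  print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
  # https://cuiqingcai.com/FAQ.html -- the second URL carries its own scheme and host, so it wins

  print(urljoin('http://www.baidu.com/about.html', '?category=2#comment'))
  # http://www.baidu.com/about.html?category=2#comment -- missing parts are filled in from the first URL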

 

  urlencode

  from urllib.parse import urlencode

  params={

    'name':'gemey',

    'age':22

  }

  base_url='http://www.baidu.com?'

  url = base_url+urlencode(params)

  print(url)

  http://www.baidu.com?name=gemey&age=22

 

 

Summary of the urllib library (from the article linked at the end):

urllib is the standard library that ships with Python; it can be used directly without installation.
It provides the following functions:

  • Web page requests
  • Getting responses
  • Proxy and cookie settings
  • Exception handling
  • URL parsing

The functions a crawler needs can basically all be found in urllib. Learning this standard library gives a deeper understanding of the more convenient requests library used later.
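To make that comparison concrete, here is a rough side-by-side sketch (assuming requests is installed) of the same POST done with urllib and with requests:

# the urllib way
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

# the requests way: encoding and the POST method are handled for you
import requests

r = requests.post('http://httpbin.org/post', data={'word': 'hello'})
print(r.text)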

urllib library

urlopen syntax

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)  # url: the address to request  # data: extra data, e.g. POST form data

usage

# request:GET
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# request: POST
# HTTP test service: http://httpbin.org/
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word':'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

# timeout setting
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

response

# response type
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))

# status code, response headers
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Request

Declare a Request object, which can carry headers and other information, and then open it with urlopen.

# a simple example
import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

# add headers
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Host': 'httpbin.org'
}
# build the POST form
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# or add the header afterwards
from urllib import request, parse

url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Handler: handles more complex pages

Official description
Proxy

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())

Cookie : used by the client to record user identity and maintain login information

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)

# save cookies as a text file
import http.cookiejar, urllib.request

filename = "cookie.txt"
# several save formats are available
## format 1
cookie = http.cookiejar.MozillaCookieJar(filename)
## format 2
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

# read them back with the matching CookieJar class
import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")

exception handling

Catch exceptions to ensure stable operation of the program

# 访问不存在的页面
from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

# catch the more specific subclass error first
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

# check the underlying cause
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

URL parsing

Mainly a tool module that provides URL handling for crawlers.

urlparse : split URLs

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
# scheme: the protocol type
# allow_fragments: whether to parse the '#' fragment part

for example

from urllib.parse import urlparse
result = urlparse("https://edu.hellobi.com/course/157/play/lesson/2580")
result
##ParseResult(scheme='https', netloc='edu.hellobi.com', path='/course/157/play/lesson/2580', params='', query='', fragment='')

urlunparse : concatenates URL components back into a URL; the reverse operation of urlparse

from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))

urljoin : joins two URLs; fields present in the second URL override those in the first

 
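A short example (the same call as in the first half of these notes):

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
# http://www.baidu.com/FAQ.html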

urlencode : converts a dictionary into a GET query string

from urllib.parse import urlencode
params = {
    'name':'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
# http://www.baidu.com?name=germey&age=22

Finally, there is urllib.robotparser, which parses a site's robots.txt to determine which parts of the website are allowed to be crawled.
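A minimal robotparser sketch (the robots.txt URL is just an example):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.python.org/robots.txt')   # example robots.txt location
rp.read()                                          # download and parse the rules
print(rp.can_fetch('*', 'https://www.python.org/about/'))   # may a generic crawler fetch this path?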



Author: hoptop
Link: https://www.jianshu.com/p/cfbdacbeac6e
Source: Jianshu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
