Python Crawler Essential Techniques (Part 1)

Essential crawler technologies

Python is the language of choice for crawler enthusiasts. The urllib and requests libraries are widely used in real projects, not only for crawling but also for calling API interfaces. If you have related needs, you can add my QQ (610039018); I have also posted crawler-related videos on Bilibili (https://www.bilibili.com/video/av93731419).

1. urllib overview

The core network request library in the standard library: urllib

  • urllib.request module
    • urlopen(url | request: Request, data=None): data must be of type bytes

    • urlretrieve(url, filename): download the resource at url to the given file

    • build_opener(*handlers): construct a browser-like opener object

      • opener.open(url | request, data=None): initiate the request
    • The Request class: configure a request

      from urllib.parse import urlencode
      from urllib.request import Request

      url = 'https://www.baidu.com/s'
      data = {
         'wd': '千锋'
      }
      urlencode(data)  # encodes to 'wd=%E5%8D%83%E9%94%8B'
      request = Request(url, data=urlencode(data).encode())
      
    • HTTPHandler: handler for the HTTP protocol

    • ProxyHandler(proxies={'http': 'http://proxy_ip:port'}): handler for proxies

    • HTTPCookieProcessor(CookieJar())

      • the http.cookiejar.CookieJar class
  • urllib.parse module
    • quote(txt): URL-encode a string containing Chinese (non-ASCII) characters
    • urlencode(query: dict): encode a dict into URL parameters of the form key=value&key=value, i.e. the application/x-www-form-urlencoded encoding.
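The two parse helpers above can be tried without any network access; here is a minimal sketch using only the standard library:

```python
from urllib.parse import quote, urlencode

# quote() percent-encodes non-ASCII characters (UTF-8 by default)
print(quote('千锋'))  # %E5%8D%83%E9%94%8B

# urlencode() turns a dict into application/x-www-form-urlencoded text
params = urlencode({'wd': '千锋', 'pn': 0})
print(params)  # wd=%E5%8D%83%E9%94%8B&pn=0
```

The encoded string is exactly what goes after the `?` in a GET URL, or into the request body of a form POST.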

2. The requests library [key point]

The requests library is a network request library built on top of urllib3; it wraps the lower-level urllib functionality in a far more convenient API.

2.1 Environment setup

pip install requests -i https://mirrors.aliyun.com/pypi/simple

2.2 Core functions

  • requests.request(): the basic method underlying all the request methods;
    its parameters are described below
    • method: str, the request method: GET, POST, PUT, DELETE
    • url: str, the requested resource endpoint (API), i.e. the URI (Uniform Resource Identifier) in the RESTful specification
    • params: dict, query parameters for GET requests (Query String params);
    • data: dict, form parameters for POST / PUT / DELETE requests (Form Data)
    • json: dict, JSON data uploaded in the request body; the Content-Type request header defaults to application/json
    • files: dict, of the form {'name': file-like-object | tuple}; if a tuple, there are three cases:
      • ('filename', file-like-object)
      • ('filename', file-like-object, content_type)
      • ('filename', file-like-object, content_type, custom-headers)
        used to upload files, generally with a POST request; the Content-Type request header defaults to multipart/form-data.
    • headers/cookies: dict
    • proxies: dict, set a proxy
    • auth: tuple, username and password for authorization, of the form ('username', 'pwd')
  • requests.get(): initiate a GET request to query data
    available parameters:
    • url
    • params
    • json
    • headers/cookies/auth
  • requests.post(): initiate a POST request to upload or add data
    available parameters:
    • url
    • data/files
    • json
    • headers/cookies/auth
  • requests.put(): initiate a PUT request to modify or update data
  • requests.patch(): for updating data; it can run into HTTP idempotency problems when repeated, so it is not recommended
  • requests.delete(): initiate a DELETE request to delete data
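A quick way to see how url, params and headers are assembled, without sending anything over the network, is to build a request and prepare it. This is a sketch; the URL and User-Agent string are placeholders:

```python
import requests

# Build a GET request object and prepare it; nothing is sent yet
req = requests.Request(
    'GET',
    'https://httpbin.org/get',                 # placeholder URL
    params={'wd': 'python'},
    headers={'User-Agent': 'my-crawler/0.1'},  # made-up UA string
)
prepared = req.prepare()
print(prepared.url)                    # https://httpbin.org/get?wd=python
print(prepared.headers['User-Agent'])  # my-crawler/0.1
```

In everyday code you would simply call requests.get(url, params=..., headers=...), which does this preparation and sends the request in one step.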

2.3 requests.Response

The request methods above all return a Response object; its common attributes and methods are:

  • status_code: the response status code
  • url: the requested URL
  • headers: dict, the response headers; unlike the urllib response object's getheaders(), it does not contain cookies.
  • cookies: an iterable object whose elements are Cookie objects (name, value, path)
  • text: the response body as text
  • content: the response body as bytes
  • encoding: the character set of the response data, e.g. utf-8, gbk, gb2312
  • json(): if the response data type is application/json, deserializes the response data into a Python list or dict object.
    • Aside: serialization and deserialization in JavaScript
      • JSON.stringify(obj): serialize
      • JSON.parse(text): deserialize
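The Python counterparts of JSON.stringify()/JSON.parse() are json.dumps()/json.loads(); Response.json() does essentially the same deserialization on the response body. A small round-trip sketch:

```python
import json

obj = {'name': '千锋', 'tags': ['python', 'crawler']}

text = json.dumps(obj, ensure_ascii=False)  # serialize, like JSON.stringify(obj)
restored = json.loads(text)                 # deserialize, like JSON.parse(text)
print(restored == obj)  # True
```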

3. Parsing data with XPath

XPath is a way of parsing xml/html data based on the tree structure of elements (Node > Element). Once an element is selected, further elements are selected by path; for example, /html/head/title gets the <title> tag.

3.1 Absolute paths

Start from the root tag and query downward level by level along the tree structure,
e.g. /html/body/table/tbody/tr.

3.2 Relative paths

A relative path searches relative to some current element; it is written as follows:

  • Relative to the entire document
    //img
    finds all <img> tag elements in the document
  • Relative to the current node
    //table
  • To search within the current node's own subtree, the path is written
    .//img

3.3 Data Extraction

  • Extract text
    //title/text()
  • Extract attribute
    //img/@href
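These two extraction forms can be exercised on a small in-memory page; a sketch assuming the lxml package and a made-up HTML fragment:

```python
from lxml import etree

html = ('<html><head><title>千锋教育</title></head>'
        '<body><img src="/logo.png"/></body></html>')
root = etree.HTML(html)

titles = root.xpath('//title/text()')  # text() extracts the tag's text
srcs = root.xpath('//img/@src')        # @name extracts an attribute value
print(titles)  # ['千锋教育']
print(srcs)    # ['/logo.png']
```

Note that when a path ends in text() or @attr, xpath() returns a list of strings rather than Element objects.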

3.4 Position conditions

  • To get the page's data type and character set, get the first <meta> tag
    //meta[1]/@content
  • Get the last <meta> tag
    //meta[last()]/@content
  • Get the second-to-last <meta> tag
    //meta[last()-1]/@content
  • Get the first two <meta> tags
    //meta[position()<3]/@content

3.5 Attribute conditions

  • Find <img> tags whose class is circle-img
    //img[@class="circle-img"]
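The positional and attribute predicates can likewise be checked against an in-memory fragment; a sketch assuming the lxml package and made-up markup:

```python
from lxml import etree

html = '''
<html><head>
  <meta charset="utf-8"/>
  <meta name="keywords" content="python,crawler"/>
  <meta name="description" content="xpath demo"/>
</head><body>
  <img class="circle-img" src="/a.png"/>
  <img src="/b.png"/>
</body></html>
'''
root = etree.HTML(html)

print(root.xpath('//meta[1]/@charset'))               # first <meta>: ['utf-8']
print(root.xpath('//meta[last()]/@content'))          # last <meta>: ['xpath demo']
print(root.xpath('//img[@class="circle-img"]/@src'))  # ['/a.png']
```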

3.6 Application in Python

First install the package with pip install lxml; the following snippet shows the core usage:

from lxml import etree

root = etree.HTML(page_html)        # page_html: the HTML text of the page
elements = root.xpath(xpath_query)  # returns a list: [<Element>, <Element>, ...]
# Each item is an Element object; its common methods and attributes:
element = elements[0]
value = element.get('attr-name')    # read an attribute value by name
text = element.text                 # the text inside the tag, e.g. '千锋教育' in <span>千锋教育</span>
children = element.xpath(relative_xpath)  # search its child elements with a relative xpath

4. Test yourself

  • Write out the request handler classes in the urllib library (give full import paths where possible)
  • Write out the return types of json.loads() and pickle.loads()
  • Write out the parameters and purpose of pymysql's cursor.execute() method

You can post your answers in the comment area; I will reply to them all in a later post.


Origin blog.csdn.net/ahhqdyh/article/details/104788387