Crawlers: the requests module, data classification, and cookie and session

One, the requests module (key topic)

(1) Introduction to the requests module

    urllib and requests are the most commonly used modules for sending HTTP requests.

    Installation: pip install requests

(2) The get method of the requests module

        1, response = requests.get(

               url=the URL to request,

               headers=a dictionary of request headers,

               params=a dictionary of request parameters,

           )

        2, A get request usually carries parameters. Pack the parameters into a dictionary and pass it as params.

           Case: Sina News, requested with parameters passed in two ways, including through params
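
As a minimal sketch of how the params dictionary ends up in the URL, the example below builds a get request without sending it. The address https://example.com/news and the parameter names channel and page are made up for illustration; they are not the real Sina News ones.

```python
import requests

# Hypothetical address and parameters, used only for illustration.
url = "https://example.com/news"
params = {"channel": "tech", "page": 2}

# Request(...).prepare() encodes the params into the URL but does not
# send anything, so the final URL can be inspected offline.
prepared = requests.Request("GET", url, params=params).prepare()
final_url = prepared.url
print(final_url)

# To actually send the request (needs network access):
# response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, params=params)
```

Passing a dictionary keeps the query string out of the code and lets requests take care of URL-encoding the values.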

        3, The response object

           (1) Response body:

               a, response body as a string: response.text

               b, response body as bytes: response.content

           (2) Fixing garbled text in the response body:

               a, encode and decode are the methods that convert between bytes (binary) and strings.

               Carriers of data in a program:

                  Data is stored in variables or constants.

                  Content that a person can read must be in string format.

                  Data inside the computer is essentially binary, that is, bytes.

                str.encode('encoding') ---> bytes

                bytes.decode('encoding') ---> str

               Garbled text is caused by mismatched encodings.

               response.content.decode('the page's correct encoding') ---> the correct page content as a string
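
The conversion above can be sketched with the standard library alone. The sample string is made up; the point is only that decoding with an encoding different from the one used to encode produces garbled text.

```python
text = "你好"

raw = text.encode("utf-8")                      # str -> bytes
garbled = raw.decode("gbk", errors="replace")   # wrong encoding: garbled text
restored = raw.decode("utf-8")                  # matching encoding: original text

print(type(raw).__name__, garbled != text, restored == text)
```

With a real page, raw would be response.content and the encoding passed to decode would be the one the page declares.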

              

               b, response.text can return a string because, under the hood, the requests module decodes the binary content into a string using the encoding stored in the response.encoding attribute. When the text is garbled, response.encoding holds the wrong encoding, and you only need to set the correct one.

               The value of response.encoding is guessed automatically by the requests module from the response headers.

               response.encoding = 'the page's correct encoding'

               response.text ---> the correct page content

               So if response.text is garbled, set response.encoding to the correct encoding and read response.text again to get the correct page content.
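
A small offline sketch of this fix follows. To avoid a network call it fills a requests.Response object by hand through the private attribute _content, which is a shortcut for this demonstration only and not how a response is normally obtained. The helper name text_with_encoding and the sample text are made up.

```python
import requests


def text_with_encoding(response, encoding):
    """Set response.encoding, then return response.text decoded with it."""
    response.encoding = encoding
    return response.text


# Offline stand-in for a page whose bytes are GBK-encoded.
# Assigning to _content is a demo shortcut, not normal usage.
response = requests.Response()
response._content = "中文页面".encode("gbk")
response.encoding = "ISO-8859-1"   # pretend the wrong encoding was guessed

garbled = response.text
fixed = text_with_encoding(response, "gbk")
print(garbled != fixed, fixed)
```

When the right encoding is not known in advance, response.apparent_encoding offers a guess based on the content itself.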

              

               Built-in functions & built-in modules: the ones that ship with Python

               to you()

               str()

           (3) Status code: response.status_code

           (4) Response headers: response.headers

        4, How is paging implemented?

           The URL of each page is basically determined by one parameter of the get request. So you only need to find the pattern that this field follows across pages, then change it in the params dictionary and send the get request for each page in a loop.

           Case: crawling paginated posts from Baidu Tieba
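
The paging idea can be sketched as a function that produces one params dictionary per page. The address, the field names kw and pn, and the step of 50 per page are assumptions for illustration and should be checked against the real site before use.

```python
from urllib.parse import urlencode


def page_params(keyword, pages, page_size=50):
    """Build one params dictionary per page; only the offset field changes."""
    return [{"kw": keyword, "pn": page * page_size} for page in range(pages)]


base_url = "https://tieba.baidu.com/f"   # assumed listing address
all_params = page_params("python", pages=3)

# urlencode is used here only to show the resulting URLs offline.
# In a crawler each dictionary would go to requests.get(base_url, params=p).
urls = [f"{base_url}?{urlencode(p)}" for p in all_params]
for u in urls:
    print(u)
```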

(3) The post method of the requests module

        response = requests.post(

           url=the URL to request,

           headers=a dictionary of request headers,

           data=a dictionary of the request data,

        ) ---> the response object.
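
To see what the data dictionary turns into, the sketch below prepares a post request without sending it. The login address and the form fields are hypothetical.

```python
import requests

# Hypothetical address and form fields, used only for illustration.
url = "https://example.com/login"
data = {"username": "alice", "password": "secret"}

# prepare() builds the request body and headers without sending anything.
prepared = requests.Request("POST", url, data=data).prepare()
body = prepared.body
content_type = prepared.headers["Content-Type"]
print(body, content_type)

# To actually send it (needs network access):
# response = requests.post(url, headers={"User-Agent": "Mozilla/5.0"}, data=data)
```

Unlike params, the data dictionary is form-encoded into the request body rather than appended to the URL.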

      

 

              

Three, classification of data

    1, Classification

    (1) Structured data: data that can be described by a relational database.

        Features: organized in rows; each row describes one entity, and every row has the same attributes.

        Example: a table stored in a relational database

        Processing method: SQL (Structured Query Language), the language for operating on data in a relational database.

    (2) Semi-structured data: data that carries its own description.

        Features: it contains tags or other markers that separate semantic elements and give records and fields a hierarchy. It is also called a self-describing structure.

        Examples: html, xml, json.

        Processing methods: regular expressions, xpath (xml, html)

    (3) Unstructured data:

        Features: no fixed data structure.

        Examples: documents, pictures, video, audio and so on, which are stored whole in binary format.

            For instance, downloading video or audio.

        Processing:

           response = requests.get(url='video address')

           Save response.content directly, and take care that the file name has the right extension.
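
Saving the bytes can be sketched as below. The helper name save_binary and the file name clip.mp4 are made up, and a short byte string stands in for response.content so that the example runs without a network connection.

```python
import tempfile
from pathlib import Path


def save_binary(content, path):
    """Write raw bytes to path and return the number of bytes written."""
    return Path(path).write_bytes(content)


# Stand-in for response.content from requests.get(url='video address').
fake_content = b"\x00\x00\x00\x18ftypmp42"

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "clip.mp4"   # the extension should match the data
    written = save_binary(fake_content, target)
    read_back = target.read_bytes()

print(written)
```

The file is written in binary form, so no encoding is involved at any point.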

    2, json (JavaScript Object Notation) data

        json is a data interchange format.

        How does json exchange data?

        json is a way of representing JavaScript objects and arrays in string form, so json is essentially a string.

        js object: var obj = {name: 'zhangsan', age: '10'} ---> in python this corresponds to a dictionary

        js array: var arr = ['a', 'b', 'c', 'd'] ---> in python this corresponds to a list

    3, Processing json data (key point)

        (1) Processing with the json module.

        json_str: json data

        json.loads(json_str) ---> a python list or dictionary

        json.dumps(a python list or dictionary) ---> json_str

        (2) In the requests module, the response object has a json method that parses the json string in the response body and returns the result directly.

           response.json() ---> a python list or dictionary
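
A short round trip with the json module illustrates the two directions. The sample data is made up.

```python
import json

json_str = '{"name": "zhangsan", "tags": ["a", "b"], "age": 10}'

data = json.loads(json_str)    # json string -> python dictionary
data["age"] += 1               # work with it as ordinary python data

# ensure_ascii=False keeps non-ASCII characters readable in the output.
back = json.dumps(data, ensure_ascii=False)   # python dictionary -> json string
print(type(data).__name__, back)
```

With a real response, response.json() gives the same result as json.loads(response.text).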

Four, cookie and session

    1, What are cookie and session?

        A cookie is data that a website stores on the user's local terminal in order to identify the user and track the session.

        A session originally means a series of actions and messages with a beginning and an end. On the web, a session is mainly used to store, on the server, the information that a particular user's session needs.

    2, Why cookie and session exist:

        http is a stateless protocol. Some operations need information to be kept between requests, and cookie and session were created for that.

    3, How a cookie works:

        The cookie is generated by the server. On the browser's first request, the server sends it to the client, which saves it.

        On later visits, the browser puts the cookie information in the Cookie field of the request header, so the server can tell who is visiting.
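
The round trip described above can be sketched with the standard library. The header value is made up; a real server would send it in a Set-Cookie response header.

```python
from http.cookies import SimpleCookie

# What a server might send in its first response (made-up value).
set_cookie_header = "sessionid=abc123; Path=/; Max-Age=1800"

# The client parses the header and saves the cookie.
jar = SimpleCookie()
jar.load(set_cookie_header)

# On the next request the client sends the name=value pairs back
# in the Cookie field of the request header.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print("Cookie:", cookie_header)
```

In the requests module, a requests.Session object does this bookkeeping automatically between requests.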

       

        But cookies have defects:

           1, Unsafe: saved locally, so easy to tamper with.

           2, Limited size: a single cookie is at most about 4 KB.

          

        Cookies meet the need to 'keep state' to a certain extent, but we would like a technique that overcomes these defects. That technique is the session.

    4, How a session works:

        The session is stored on the server, which resolves the security issue.

        The question is: the sessions live on the server, so when a client sends a request, how does the server know whether session_a or session_b belongs to that request?

        To solve this, the cookie serves as the bridge. The cookie has a sessionid field that indicates which session on the server the request corresponds to.

        If cookies are disabled, a session normally cannot be used. As an exception, url rewriting can still make it work.

           url rewriting: the sessionid is spliced into the url.

          

        session life cycle: it starts when the server creates the session and ends when its validity period runs out (sites usually set about 30 minutes), at which point the session is deleted.
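
The server side of this can be sketched as a toy in-memory store. The class name SessionStore and its methods are hypothetical and not taken from any web framework; the current time is passed in explicitly so that the expiry logic is easy to follow.

```python
import secrets


class SessionStore:
    """Toy in-memory session store keyed by session id."""

    def __init__(self, lifetime_seconds=1800):
        self.lifetime = lifetime_seconds
        self._sessions = {}   # session id -> (expiry time, data)

    def create(self, data, now):
        """Create a session and return the id that would go into the cookie."""
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = (now + self.lifetime, data)
        return session_id

    def get(self, session_id, now):
        """Return the session data, or None if unknown or expired."""
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        expires_at, data = entry
        if now >= expires_at:
            del self._sessions[session_id]   # life cycle over: delete it
            return None
        return data


store = SessionStore(lifetime_seconds=1800)
sid = store.create({"user": "zhangsan"}, now=0)
print(store.get(sid, now=60))
```

Only the id travels to the client; the data stays on the server and disappears when the lifetime runs out, whatever the browser does.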

    5, Common mistake: after opening a page and then closing the browser, does the page's session become invalid?
        No. Whether a session is deleted is decided by the server according to the session life cycle, not by closing the browser. The session is deleted when its validity period ends.
    6, cookie fields
        (1) Name: the name of the cookie. Once created, the name cannot be changed.
        (2) Value: the value of the cookie. If the value contains Unicode characters, it must be character-encoded. If the value is binary data, it must be BASE64-encoded.
        (3) Domain: the domain that can access the cookie. For example, if set to .zhihu.com, all domain names ending in zhihu.com can access the cookie.
        (4) Max Age: the time until the cookie expires, in seconds. It is often used together with Expires, and the expiry time can be calculated from it. If Max Age is positive, the cookie expires after Max Age seconds. If it is negative, the cookie becomes invalid when the browser is closed, and the browser does not save it in any form.
        (5) Path: the path under which the cookie is used. If set to /path/, only pages whose path is under /path/ can access the cookie. If set to /, all pages under the domain can access the cookie.
        (6) Size: the size of the cookie.
        (8) HTTP: the httponly attribute of the cookie. If this attribute is true, the cookie is carried only in the HTTP headers and cannot be read through document.cookie.
        (9) Secure: whether the cookie is sent only over a secure protocol. Secure protocols such as HTTPS and SSL encrypt the data before it is transmitted over the network. The default is false.
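
Several of these fields can be set with the standard library's SimpleCookie, as in the sketch below. The cookie name token, its value, and the domain .example.com are made up.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["token"] = "abc123"                   # Name and Value
cookie["token"]["domain"] = ".example.com"   # Domain
cookie["token"]["path"] = "/"                # Path
cookie["token"]["max-age"] = 3600            # Max Age, in seconds
cookie["token"]["httponly"] = True           # HTTP (httponly)
cookie["token"]["secure"] = True             # Secure

# output() renders the header line a server would send.
header_line = cookie.output()
print(header_line)
```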

    7, Session cookies and persistent cookies
    Session cookie: Max Age is negative, so the cookie becomes invalid when the browser is closed. The cookie is stored in memory.
    Persistent cookie: Max Age is positive, so the cookie expires after Max Age seconds. The cookie is stored on the hard disk.

    Persistence: saving data from memory to the hard disk, in practice to a file or a database.
    Memory is used because it is fast: when an application starts, it is allocated some memory space to run in.
    Once the power is cut, memory is cleared.
    
    Serialization: persisting an object to the hard disk.
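
As one possible illustration of serialization, the sketch below persists a dictionary with the standard pickle module and loads it back. The cookie values and the file name cookies.pkl are made up; json would work just as well for simple data like this.

```python
import pickle
import tempfile
from pathlib import Path

cookies = {"sessionid": "abc123", "theme": "dark"}   # made-up data

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "cookies.pkl"
    path.write_bytes(pickle.dumps(cookies))      # serialize: object -> bytes on disk
    restored = pickle.loads(path.read_bytes())   # deserialize: bytes -> new object

print(restored)
```

pickle files should only be loaded from sources you trust, because loading can execute arbitrary code.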

 

 

Origin blog.csdn.net/fxd_dong/article/details/104354154