Day 3 Summary

Today's contents:
    I. Crawler principles
    II. The Requests library
I. Crawler principles
    1. What is the Internet?
        The Internet is a collection of network devices that links individual computers to one another; the resulting network is called the Internet.


    2. Why was the Internet built?
        The Internet was built to transfer and share data.

    3. What is data?
        For example:
        product information on Taobao and Jingdong (JD.com) ...
        securities and investment information on East Money and Xueqiu ...
        housing listings on Lianjia and Ziroom ...
        ticket information on 12306 ...

    4. The full process of going online:
        - ordinary users:
            open a browser -> send a request to the target site -> fetch the response data -> render it in the browser

        - crawlers:
            simulate a browser -> send a request to the target site -> fetch the response data -> extract the valuable data -> persist the data

    5. What kind of request does the browser send?
        An HTTP request.

        - Client:
            the browser is a piece of software -> client IP and port

        - Server:
            https://www.jd.com/
            www.jd.com (Jingdong's domain name) -> DNS resolution -> IP and port of the Jingdong server

        client IP and port ------> server IP and port: once the request is sent, a connection is established and the corresponding data can be fetched.
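
The server side of this exchange can be sketched with the standard library alone: split the URL into a host name and a port, then ask DNS for the matching IP addresses. The helper names `host_and_port` and `resolve` are hypothetical, introduced only for illustration, and `resolve` needs network access to return anything.

```python
import socket
from urllib.parse import urlsplit

# Default ports for the two schemes a browser normally uses.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Return (hostname, port), falling back to the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def resolve(hostname, port):
    """Ask DNS for the server's IP addresses (requires network access)."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling `host_and_port("https://www.jd.com/")` gives the domain name and port 443; passing both to `resolve` would then return whatever addresses DNS reports at that moment.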

    6. The full crawler workflow
        - send a request (needs a request library: Requests, Selenium)
        - fetch the response data (the server returns response data once it has received the request)
        - parse and extract the data (needs a parsing library: re, BeautifulSoup4, XPath ...)
        - save the data locally (file handling, or a database such as MongoDB)
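
The four steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a complete crawler: the function names `fetch`, `parse`, `save`, and `crawl` are hypothetical, the link-matching pattern is only an example of what "extract valuable data" might mean, and a JSON file stands in for the database.

```python
import json
import re


def fetch(url):
    """Steps 1 and 2: send a request and return the response text."""
    import requests  # third-party (pip3 install requests); imported here so parse/save work without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def parse(html):
    """Step 3: extract (href, link text) pairs with the re module."""
    return re.findall(r'<a href="(.*?)".*?>(.*?)</a>', html, re.S)


def save(records, path):
    """Step 4: persist the extracted data to a local JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def crawl(url, path):
    """Run the whole pipeline for one page."""
    save(parse(fetch(url)), path)
```

Keeping each step in its own function makes it easy to swap one part later, for example replacing `re` with BeautifulSoup4 or the JSON file with MongoDB.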

II. The Requests library


    1. Installation and use
        - open cmd
        - enter: pip3 install requests

    2. Crawling videos

    3. Packet capture analysis
        Open the browser's developer tools (Inspect) ----> select the Network tab,
        then find the visited page, named like xxx.html (its response text)

        1) Request URL (the address of the site being accessed)
        2) Request method:
            GET:
                sends a request directly to obtain data
                https://www.cnblogs.com/kermitjam/articles/9692597.html

            POST:
                must carry user information when sending the request to the target address
                https://www.cnblogs.com/login

        3) Response status code:
            2xx: success
            3xx: redirection
            4xx: client error (for example, 404: resource not found)
            5xx: server error

        4) Request headers:
            User-Agent: the user agent (shows that the request was sent from a computer and a browser)
            Cookie: the real user's login information (proves that you are a user of the target site)
            Referer: the URL of the previously visited page (proves that you navigated here from the target site)

        5) Request body:
            Only POST requests carry a request body.
            Form Data
                {
                    'user': 'tank',
                    'pwd': '123'
                }
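
The pieces captured above (method, status code, headers, form body) map onto the Requests API roughly as follows. This is a minimal sketch: the `User-Agent` string, the helper names `status_class`, `get_page`, and `post_login`, and the form field names are assumptions for illustration, and the real login form of a target site may expect different fields.

```python
# Assumed header values for illustration; copy real ones from the Network tab.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.cnblogs.com/",
}


def status_class(code):
    """Map a status code to its class by looking at the hundreds digit."""
    classes = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return classes.get(code // 100, "other")


def get_page(url):
    """GET: fetch a page directly, sending browser-like headers."""
    import requests  # third-party; imported here so status_class works without it

    response = requests.get(url, headers=HEADERS, timeout=10)
    return response.status_code, response.text


def post_login(url, user, pwd):
    """POST: carry user information in the request body as form data."""
    import requests

    form = {"user": user, "pwd": pwd}  # hypothetical field names
    response = requests.post(url, data=form, headers=HEADERS, timeout=10)
    return response.status_code, response.text
```

Passing the form as `data=` makes Requests send it form-encoded, which is what the browser's Form Data panel shows; the status code returned by either helper can be fed to `status_class` to decide whether the response is worth parsing.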

IV. Crawling Douban movies
    .: match starting from the current position
    *: find all
    ?: stop looking once the first match is found

    .*?: non-greedy match
    .*: greedy match

    (.*?): extract the data inside the parentheses
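
The difference between the greedy and non-greedy forms is easiest to see on a tiny string. The sketch below uses Python's `re` module with made-up sample text; the `re.S` flag is an addition not mentioned in the notes, included because `.` does not match a newline without it, which matters for multi-line HTML.

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the end and backtracks to the LAST </b>.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the FIRST </b> it can, so each tag is matched separately.
lazy = re.findall(r"<b>(.*?)</b>", html)

# Without re.S the dot stops at a newline; with re.S it crosses lines.
multiline = "<p>a\nb</p>"
without_flag = re.findall(r"<p>(.*?)</p>", multiline)
with_flag = re.findall(r"<p>(.*?)</p>", multiline, re.S)

print(greedy)        # ['one</b><b>two']
print(lazy)          # ['one', 'two']
print(without_flag)  # []
print(with_flag)     # ['a\nb']
```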

    movie rank, movie URL, movie title, director / cast / genre, rating, number of ratings, one-line synopsis
    <div class="item">.*?<em class="">(.*?)</em>
    .*?<a href="(.*?)">.*?<span class="title">(.*?)</span>
    .*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>
    .*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>

<div class="item">
<div class="pic">
    <em class="">226</em>
    <a href="https://movie.douban.com/subject/1300374/">
        <img width="100" alt="绿里奇迹" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p767586451.webp" class="">
    </a>
</div>
<div class="info">
    <div class="hd">
        <a href="https://movie.douban.com/subject/1300374/" class="">
            <span class="title">绿里奇迹</span>
                    <span class="title">&nbsp;/&nbsp;The Green Mile</span>
            <span class="other">&nbsp;/&nbsp;The Green Mile (Taiwan) / Green Mile</span>
        </a>

        <span class="playable">[playable]</span>
    </div>
    <div class="bd">
        <p class="">
            导演: Frank Darabont&nbsp;&nbsp;&nbsp;主演: Tom Hanks / David M...<br>
            1999&nbsp;/&nbsp;United States&nbsp;/&nbsp;Crime Drama Fantasy Mystery
        </p>

        <div class="star">
            <span class="rating45-t"></span>
            <span class="rating_num" property="v:average">8.7</span>
            <span property="v:best" content="10.0"></span>
            <span>141370人评价</span>
        </div>

        <p class="quote">
            <span class="inq">The angel temporarily leaves.</span>
        </p>
    </div>
</div>
</div>
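
Applying the pattern to one item can be sketched as below. The `SAMPLE` string is a trimmed copy of the item above, the names `PATTERN` and `extract_movies` are hypothetical, and the title capture `<span class="title">(.*?)</span>` is an assumption based on the sample markup. `re.S` is passed so that `.` also matches the newlines between tags.

```python
import re

# Trimmed copy of one <div class="item"> block, for illustration only.
SAMPLE = '''
<div class="item">
<div class="pic">
    <em class="">226</em>
    <a href="https://movie.douban.com/subject/1300374/">
    </a>
</div>
<div class="info">
    <div class="hd">
        <a href="https://movie.douban.com/subject/1300374/" class="">
            <span class="title">绿里奇迹</span>
        </a>
    </div>
    <div class="bd">
        <p class="">
            导演: Frank Darabont<br>
            1999&nbsp;/&nbsp;United States
        </p>
        <div class="star">
            <span class="rating_num" property="v:average">8.7</span>
            <span>141370人评价</span>
        </div>
        <p class="quote">
            <span class="inq">The angel temporarily leaves.</span>
        </p>
    </div>
</div>
</div>
'''

# Seven capture groups: rank, URL, title, director/cast/genre text,
# rating, number of ratings, one-line synopsis.
PATTERN = re.compile(
    r'<div class="item">.*?<em class="">(.*?)</em>'
    r'.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>'
    r'.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>'
    r'.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>',
    re.S,
)


def extract_movies(html):
    """Return one 7-tuple per movie item found in the page source."""
    return PATTERN.findall(html)
```

The fourth group captures everything between `导演:` and `</p>`, including whitespace and `<br>`, so it usually needs a cleanup step (for example `str.strip()` and replacing `&nbsp;`) before it is saved.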

Origin www.cnblogs.com/Raywarriors/p/11094173.html