Python Crawler, Day 01

One, the basic principles of crawlers

    1. What is a crawler?
        A crawler is a program that crawls (fetches) data.

    2. What is the Internet?
        A pile of network devices wiring individual computers
        together into one network is what we call the Internet.

    3. Why was the Internet built?
        To share and transfer data.

    4. What is data?
        For example:
            Product information on e-commerce platforms (Taobao, JD, Amazon)
            Housing information on rental platforms (Lianjia, Ziroom)
            Investment information (Eastmoney, Xueqiu)
            ...
            12306 train-ticket information (ticket grabbing)

    5. What happens when you go online?
        An ordinary user:
            opens a browser
            ---> enters a URL
            ---> the browser sends a request to the target host
            ---> the host returns the response data
            ---> the browser renders the data

        A crawler:
            simulates a browser
            ---> sends a request to the target host
            ---> receives the response data
            ---> parses and extracts the valuable data
            ---> saves the data (written to local files,
            or persisted to a database)

    6. The full crawler workflow (a minimal sketch follows this list)
        1. Send a request (libraries: requests / Selenium)
        2. Get the response data
        3. Parse the data (parsing library: BeautifulSoup4)
        4. Save the data (storage: local files / MongoDB)
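A minimal sketch of the four steps, assuming beautifulsoup4 is installed
(pip3 install beautifulsoup4) and using https://example.com as a stand-in target:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')        # 1. send a request
html = response.text                                  # 2. get the response data
soup = BeautifulSoup(html, 'html.parser')             # 3. parse the data
title = soup.find('title').text
with open('result.txt', 'w', encoding='utf-8') as f:  # 4. save the data
    f.write(title)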

    Summary: we can liken the data on the Internet to treasure;
        a crawler is, in essence, digging for that treasure.

Two, the requests library

    1. Installation and use
        pip3 install requests

    2. Analyzing the request flow (simulating a browser)
        - Baidu:
            1. Request URL
                https://www.baidu.com/

            2. Request method
                GET
                POST

            3. Response status code

            4. Request headers
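Each of the four items above can be inspected on a requests response object;
a small sketch, using Baidu as the target:

import requests

response = requests.get('https://www.baidu.com/')
print(response.url)                # 1. request URL
print(response.request.method)     # 2. request method: GET
print(response.status_code)        # 3. response status code, e.g. 200
print(response.request.headers)    # 4. request headers that requests sent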

Our crawler uses the requests library:

Installation:

pip3 install requests

Import it before each use:

import requests
# Crawler trilogy
# 1. Send a request
def get_page(url):
    response = requests.get(url)
    return response

# 2. Parse the data
import re
def parse_index(html):
    # re.findall matches every occurrence
    # re.findall('regex pattern', 'text to search', flags)
    # re.S: let '.' match newlines so the whole text is searched
    detail_urls = re.findall('<div class="items"><a class="imglink" href="(.*?)"', html, re.S)
    return detail_urls

# parse a detail page
def parse_detail(html):
    movie_url = re.findall('<source src="(.*?)">', html, re.S)
    if movie_url:
        return movie_url[0]  # return only the detail page's video url

# 3. Save the data
import uuid
# uuid.uuid4() generates a unique random string
def save_video(content):
    with open(f'{uuid.uuid4()}.mp4', 'wb') as f:
        f.write(content)
        print('video downloaded')

# test
if __name__ == '__main__':
    for line in range(6):
        url = f'http://www.xiaohuar.com/list-3-{line}.html'

        # send the request
        response = get_page(url)
        # print(response)

        # # response status code
        # print(response.status_code)

        # # response text
        # print(response.text)

        # parse the index page
        detail_urls = parse_index(response.text)

        # loop through the detail-page urls
        for detail_url in detail_urls:
            print(detail_url)

            # send a request to each detail page
            detail_response = get_page(detail_url)
            # print(detail_response.text)

            # parse the detail page for the video url
            movie_url = parse_detail(detail_response.text)

            # if a video url was found, print it
            if movie_url:
                print(movie_url)

                # request the video url to get the binary video stream
                movie_response = get_page(movie_url)

                # hand the binary stream to save_video to write it locally
                save_video(movie_response.content)
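Step 4 of the workflow also listed MongoDB as a storage option. A sketch of that
alternative, assuming a local MongoDB instance and the pymongo driver
(pip3 install pymongo); the 'movies' database and 'videos' collection names are
made up for illustration:

import pymongo

# connect to a local MongoDB instance (assumes mongod is running
# on the default port 27017)
client = pymongo.MongoClient('localhost', 27017)
collection = client['movies']['videos']

# persist the urls instead of (or alongside) the raw video file
def save_video_record(detail_url, movie_url):
    collection.insert_one({'detail_url': detail_url, 'movie_url': movie_url})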

Three, POST request: automatically signing in to GitHub

"""
Request URL: https://github.com/login

Request headers:
    cookies
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36

Request body:
Form Data:
    commit: Sign in
    utf8: ✓
    authenticity_token: mOyooOxE7c/H7krQKEli+cwwy0+okuPOZqGEBSyu/kRfLHT2mNGu9RQcDnb1ovua1zQe3LOyYXxrWxFL+2aAcg==
    login: gr6g5r
    password: huhuhhge
    webauthn-support: supported
"""
import requests
import re
login_url = 'https://github.com/login'
login_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
login_res = requests.get(url=login_url, headers=login_header)


# parse out the authenticity_token string
authenticity_token = re.findall(
    '<input type="hidden" name="authenticity_token" value="(.*?)"',
    login_res.text,
    re.S
)[0]
print(authenticity_token)

# get the cookies returned with the login page
login_cookies = login_res.cookies.get_dict()

# request URL
session_url = 'https://github.com/session'

# request headers
session_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

# request body (form data)
form_data = {
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": authenticity_token,
    "login": "*******",
    "password": "*******",
    "webauthn-support": "supported"
}

session_res = requests.post(
    url=session_url,
    headers=session_headers,
    cookies=login_cookies,
    data=form_data
)

with open('github.html', 'w', encoding='utf-8') as f:
    f.write(session_res.text)
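A variant worth knowing: a requests.Session() object stores and resends cookies
automatically between requests, so the login cookies don't have to be passed
around by hand. A minimal sketch of the same flow, reusing login_url,
session_url, login_header, session_headers, and form_data from above:

session = requests.Session()

# GET the login page; the session keeps any cookies it sets
login_res = session.get(login_url, headers=login_header)
form_data['authenticity_token'] = re.findall(
    '<input type="hidden" name="authenticity_token" value="(.*?)"',
    login_res.text,
    re.S
)[0]

# POST the form; the stored cookies are sent automatically
session_res = session.post(session_url, headers=session_headers, data=form_data)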
