0626. Crawler principles

Today's contents:
I. Crawler principles
II. The Requests library

I. Crawler principles
1. What is the Internet?
The Internet is a collection of computers connected to one another by network devices, forming one big network.
2. Why was the Internet created?
To transfer and share data between computers.
3. What is "data"?
For example:
product information on Taobao and Jingdong (JD.com) ...
securities and investment information on East Money and Xueqiu ...
housing listings on Lianjia and Ziroom ...
train ticket information on 12306 ...
4. The full flow of a request on the Internet:
- An ordinary user:
open a browser -> send a request to the target site -> fetch the response data -> the browser renders the page
- A crawler:
simulate a browser -> send a request to the target site -> fetch the response data -> extract the valuable data -> persist the data
5. What happens when the browser sends a request?
(an HTTP protocol request)

- Client:
the browser is just a piece of software -> client IP and port

- Server:
https://www.jd.com/
www.jd.com (Jingdong's domain name) -> DNS resolution -> the IP and port of Jingdong's server
The client's IP and port send a request to the server's IP and port; once a connection is established, the client can obtain the corresponding data.
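The domain-to-IP-and-port step above can be sketched with Python's standard library. A minimal sketch: only the URL splitting is shown (the actual DNS lookup and connection are left out), and the default-port table is the usual HTTP/HTTPS convention:

```python
from urllib.parse import urlsplit

# Split the Jingdong URL from the example into the parts the client needs
# before it can connect: the hostname (handed to DNS to resolve an IP)
# and the port (implied by the scheme when not written explicitly).
parts = urlsplit('https://www.jd.com/')
host = parts.hostname
port = parts.port or {'http': 80, 'https': 443}[parts.scheme]
print(host, port)  # www.jd.com 443
```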

6. The full flow of a crawler
- Send the request (requires a request library: Requests, Selenium)
- Fetch the response data (as long as the request reaches the server, the server returns response data)
- Parse and extract the data (requires a parsing library: re, BeautifulSoup4, XPath ...)
- Save the data locally (plain files, or a database such as MongoDB)
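The four steps above can be sketched as one tiny pipeline. This is a minimal sketch, not the real crawlers below: the fetch step is stubbed with a made-up HTML string (titles in <h2> tags) so it runs without a network connection.

```python
import re

def fetch(url):
    # send-request step: in a real crawler this would be requests.get(url).text;
    # stubbed here with a canned page so the sketch is self-contained
    return '<h2>first post</h2><h2>second post</h2>'

def parse(html):
    # parse-and-extract step: pull out the valuable data with a regex
    return re.findall('<h2>(.*?)</h2>', html, re.S)

def save(items, path='titles.txt'):
    # persist step: write one title per line to a local file
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(items))

titles = parse(fetch('https://example.com/'))
save(titles)
print(titles)  # ['first post', 'second post']
```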

(The following uses crawling a single Baidu page as an example of basic requests usage.)
# import the requests library
import requests

# send a request to the Baidu homepage and get a response object
response = requests.get(url='https://www.baidu.com/')

# set the character encoding to utf-8
response.encoding = 'utf-8'

# print the response text
print(response.text)

# write the response text to a local file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

  


II. The Requests library
1. Installation and usage
- open cmd
- enter: pip3 install requests

2. Crawling videos

3. Packet capture analysis
Open the browser's developer tools (right-click -> Inspect) ----> select the Network tab,
then find the xxx.html page that was visited (its response text).

1) Request URL (the address of the site being visited)
2) Request method:
    GET:
        sends a request and fetches data directly
        https://www.cnblogs.com/kermitjam/articles/9692597.html
    POST:
        must carry user information in the request to the target address
        https://www.cnblogs.com/login
3) Response status codes:
    2xx: success
    3xx: redirection
    4xx: client error (e.g. 404, resource not found)
    5xx: server error
4) Request headers:
    User-Agent: user agent (proves the request is sent by a real computer device and browser)
    Cookies: the real user's login information (proves to the target site that you are a real user)
    Referer: the URL of the previous page (proves you navigated from within the target site)
5) Request body:
    only POST requests carry a request body.
    Form Data
    {
        'user': 'tank',
        'pwd': '123'
    }
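The headers from 4) and the form body from 5) map directly onto a Requests call. A minimal sketch: nothing is actually sent, because Request.prepare() only builds the request; the login URL and the user/pwd values are the examples from above, and the User-Agent string is a made-up placeholder.

```python
import requests

# Build (but do not send) the POST described above: headers from 4),
# form-data body from 5). prepare() shows what would go on the wire.
req = requests.Request(
    method='POST',
    url='https://www.cnblogs.com/login',
    headers={
        'User-Agent': 'Mozilla/5.0',            # prove we look like a browser
        'Referer': 'https://www.cnblogs.com/',  # prove where we came from
    },
    data={'user': 'tank', 'pwd': '123'},        # the request body (Form Data)
)
prepared = req.prepare()
print(prepared.method, prepared.url)  # POST https://www.cnblogs.com/login
print(prepared.body)                  # user=tank&pwd=123
```

To actually send it you would pass `prepared` to a `requests.Session().send(...)`, or simply call `requests.post(url, headers=..., data=...)`.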

(in crawling "pear video" video on a single example)
'' '' '' 
'' ' 
Video options: 
    1. Video pear 
' '' 
# Import Requests 
# 
# # to the source address of the video transmission request 
# requests.get Response = ( 
# 'https://video.pearvideo.com/ MP4 / adshort / 20,190,625 / 1570302-14057031_adpkg-ad_hd.mp4-CONT ') 
# 
# # print binary stream, such as pictures, video, data 
# Print (response.content) 
# 
# # locally stored video 
# with open (' video .mp4 ',' wb ') AS f: 
# f.write (response.content) 

' '' 
1, first sending a request to the Home video pear 
    https://www.pearvideo.com/ 
    
    resolve all of the video obtain id: 
        video_1570302 
        
        re.findall () 
        

2, to obtain the video details page url: 
    ! thrill man robbed on the subway slip, go on foot 
    https://www.pearvideo.com/video_1570302 
    Secret Karez 
    https://www.pearvideo.com/video_1570107 
'' ' 
import requests
import re  # regular expressions, used to parse text data

# 1. first send a request to the Pear Video homepage
response = requests.get('https://www.pearvideo.com/')
# print(response.text)

# parse and match all the video ids with re
# argument 1: the regex pattern
# argument 2: the text to parse
# argument 3: the matching mode
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
# print(res_list)

# build each video detail page url
for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    # print(detail_url)

    # send a request to each detail page to get the video source url
    response = requests.get(url=detail_url)
    # print(response.text)

    # parse and extract the video url from the detail page
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    print(video_url)

    # video name
    video_name = re.findall(
        '<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]
    print(video_name)

    # send a request to the video url to get the binary video stream
    v_response = requests.get(video_url)

    # save the binary stream locally
    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
        print(video_name, 'video crawled successfully')

  

III. Crawling Douban movies

When crawling a site with richer pages, such as the rating pages of Douban's movie Top 250, we can skip over everything we don't care about with a non-greedy match,
and wrap each piece of important information in (.*?) to extract it. The related symbols:

.    : match any single character at the current position
*    : match the preceding character any number of times
?    : match non-greedily, i.e. stop at the first occurrence instead of looking further
.*?  : non-greedy match
.*   : greedy match
(.*?): extract the data inside the parentheses
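A quick demonstration of the greedy/non-greedy difference (the <span> sample string is made up for illustration):

```python
import re

html = '<span>first</span><span>second</span>'

# greedy: .* runs to the LAST </span>, swallowing the tags in between
print(re.findall('<span>(.*)</span>', html))
# ['first</span><span>second']

# non-greedy: .*? stops at the FIRST </span>, so each value comes out clean
print(re.findall('<span>(.*?)</span>', html))
# ['first', 'second']
```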

Fields to extract: movie rank, movie url, movie title, director/cast/genre, movie rating, number of reviewers, movie synopsis
    
(The following is the code for crawling the Douban Top 250 rankings:)
'''
https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=

1. send the request
2. parse the data
3. save the data
'''
import requests
import re

# the crawler trilogy
# 1. send the request
def get_page(base_url):
    response = requests.get(base_url)
    return response

# 2. parse the text
def parse_index(text):
    res = re.findall(
        '<div class="item">.*?<em class="">(.*?)</em>'
        '.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>'
        '.*?导演:(.*?)</p>'
        '.*?<span class="rating_num".*?>(.*?)</span>'
        '.*?<span>(.*?)人评价</span>'
        '.*?<span class="inq">(.*?)</span>',
        text, re.S)
    # print(res)
    return res

# 3. save the data
def save_data(data):
    with open('douban.txt', 'a', encoding='utf-8') as f:
        f.write(data)

# type "main" + Enter (PyCharm shortcut)
if __name__ == '__main__':
    # num = 10
    # base_url = 'https://movie.douban.com/top250?start={}&filter='.format(num)

    num = 0
    for line in range(10):
        base_url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(base_url)

        # 1. send the request (call the function)
        response = get_page(base_url)

        # 2. parse the text
        movie_list = parse_index(response.text)

        # 3. save the data
        # format the data first
        for movie in movie_list:
            # print(movie)

            # unpacking assignment
            # movie rank, movie url, movie title, director/cast/genre, rating, number of reviewers, synopsis
            v_top, v_url, v_name, v_daoyan, v_point, v_num, v_desc = movie
            # v_top = movie[0]
            # v_url = movie[1]
            movie_content = f'''
            movie rank: {v_top}
            movie url: {v_url}
            movie title: {v_name}
            director and cast: {v_daoyan}
            movie rating: {v_point}
            number of reviewers: {v_num}
            movie synopsis: {v_desc}
            \n
            '''

            print(movie_content)

            # save the data
            save_data(movie_content)

  

Origin www.cnblogs.com/chestnuttt/p/11106412.html