Day 03: Crawlers

Today's contents:
1. Crawler principles
2. The Requests library

1. Crawler principles
1. What is the Internet?
A collection of network devices that link individual computers together into one network — that is what we call the Internet.

2. Why was the Internet established?
To transfer data and to share data.

3. What is data?
For example:
product information on Taobao and Jingdong (JD) ...
securities and investment information on East Money and Xueqiu ...
housing information on Lianjia and Ziroom ...
train ticket information on 12306 ...

4. The full process of using the Internet:
- Ordinary user:
open a browser -> send a request to the target site -> fetch the response data -> render it in the browser

- Crawler:
simulate a browser -> send a request to the target site -> fetch the response data -> extract the valuable data -> persist the data locally


5. What is the request the browser sends?
An HTTP protocol request.

- Client:
the browser is a piece of software -> client IP and port

- Server:
https://www.jd.com/
www.jd.com (Jingdong's domain name) -> DNS resolution -> Jingdong's server IP and port

The client's IP and port send a request to the server's IP and port; once the connection is established, the client can obtain the corresponding data.
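Before any of that can happen, the browser has to take the URL apart into the pieces above: the hostname (what DNS resolves) and the port (the scheme's default if none is written). A quick illustration with the standard library — this only shows the parts involved, it does not perform the DNS lookup itself:

```python
from urllib.parse import urlsplit

# Split a URL into the parts needed before connecting
parts = urlsplit('https://www.jd.com/')
print(parts.scheme)    # https
print(parts.hostname)  # www.jd.com  -> DNS resolves this to the server's IP

# No explicit port in the URL, so the scheme's default applies
port = parts.port or (443 if parts.scheme == 'https' else 80)
print(port)            # 443
```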


6. The full crawler workflow
- Send a request (requires a request library: the Requests library, or the Selenium library)
- Fetch the response data (as long as the request reaches the server, the server returns response data)
- Parse and extract the data (requires a parsing library: re, BeautifulSoup4, XPath ...)
- Save the data locally (file handling, databases, MongoDB)
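The four steps above can be sketched end to end. Here steps 1-2 are simulated with a hardcoded page (a hypothetical snippet of my own) so the sketch runs offline; in a real crawl that string would come from `requests.get(url).text`:

```python
import re

# Steps 1-2 simulated: in a real crawl this html is the response text
html = '<a href="video_1570302">clip one</a> <a href="video_1570107">clip two</a>'

# Step 3: parse and extract the valuable data (the video ids)
ids = re.findall(r'href="video_(.*?)"', html)

# Step 4: persist the data locally
with open('ids.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(ids))

print(ids)  # ['1570302', '1570107']
```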


2. The Requests library

1. Installation and use
- open cmd
- enter: pip3 install requests

2. Crawling videos


3. Packet-capture analysis
Open the browser's developer tools (Inspect) ----> select the Network tab
Find the visited page ending in xxx.html (response text)

1) Request url (the address of the site being visited)
2) Request method:
GET:
sends the request and receives the data directly
https://www.cnblogs.com/kermitjam/articles/9692597.html

POST:
needs to carry user information when sending the request to the target address
https://www.cnblogs.com/login

3) Response status codes:
2xx: success
3xx: redirection
4xx: client error (e.g. 404, resource not found)
5xx: server error
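The code classes follow directly from the first digit, so a tiny helper (my own naming, not part of any library) captures the whole table:

```python
# Map a status code to its class by its first digit, mirroring the table above
def status_class(code: int) -> str:
    classes = {2: 'success', 3: 'redirection', 4: 'client error', 5: 'server error'}
    return classes.get(code // 100, 'other')

print(status_class(200))  # success
print(status_class(404))  # client error
print(status_class(503))  # server error
```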

4) Request headers:
User-Agent: user agent (proves that the request was sent by a real computer device and browser)
Cookies: real user login information (proves to the target site that you are a real user)
Referer: the url of the previous page visited (proves that you jumped here from a page on the target site)

5) Request body:
Only POST requests carry a request body.
Form Data
{
    'user': 'tank',
    'pwd': '123'
}
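With the Requests library this form would be sent as `requests.post('https://www.cnblogs.com/login', data={'user': 'tank', 'pwd': '123'})`. The snippet below only shows, using the standard library, what that urlencoded request body actually looks like on the wire:

```python
from urllib.parse import urlencode

# The form data from above; a POST carries it urlencoded in the request body
form = {'user': 'tank', 'pwd': '123'}
body = urlencode(form)
print(body)  # user=tank&pwd=123
```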


4. Crawling the Douban Top 250

Regex notes:
.     matches any character (starting from the current position)
*     matches zero or more times (find all)
?     matches zero or one time (find the first, then stop looking)

.*?   non-greedy match
.*    greedy match

(.*?) extracts the data matched inside the parentheses

Movie rank, movie url, movie name, director/starring/genre, rating, number of reviews, synopsis:
<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>
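The greedy/non-greedy difference is easy to see on two title spans:

```python
import re

html = '<span class="title">Green Mile</span><span class="title">The Green Mile</span>'

# Greedy .* runs to the LAST </span>; non-greedy .*? stops at the first one
greedy = re.findall(r'<span class="title">(.*)</span>', html)
lazy = re.findall(r'<span class="title">(.*?)</span>', html)
print(greedy)  # ['Green Mile</span><span class="title">The Green Mile']
print(lazy)    # ['Green Mile', 'The Green Mile']
```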



<div class="item">
    <div class="pic">
        <em class="">226</em>
        <a href="https://movie.douban.com/subject/1300374/">
            <img width="100" alt="The Green Mile" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p767586451.webp" class="">
        </a>
    </div>
    <div class="info">
        <div class="hd">
            <a href="https://movie.douban.com/subject/1300374/" class="">
                <span class="title">Green Mile</span>
                <span class="title">&nbsp;/&nbsp;The Green Mile</span>
                <span class="other">&nbsp;/&nbsp;The Green Mile (Taiwan) / Green Mile</span>
            </a>
            <span class="playable">[playable]</span>
        </div>
        <div class="bd">
            <p class="">
                导演: Frank Darabont&nbsp;&nbsp;&nbsp;主演: Tom Hanks / David M<br>...
                1999&nbsp;/&nbsp;USA&nbsp;/&nbsp;Fantasy Crime Mystery Drama
            </p>
            <div class="star">
                <span class="rating45-t"></span>
                <span class="rating_num" property="v:average">8.7</span>
                <span property="v:best" content="10.0"></span>
                <span>141370人评价</span>
            </div>
            <p class="quote">
                <span class="inq">The angel took a temporary leave.</span>
            </p>
        </div>
    </div>
</div>
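Running the pattern against a condensed version of that sample block (whitespace trimmed, same tags) shows what each capture group pulls out:

```python
import re

# A condensed copy of the sample item block above
html = ('<div class="item"><em class="">226</em>'
        '<a href="https://movie.douban.com/subject/1300374/">'
        '<span class="title">Green Mile</span>'
        '导演: Frank Darabont</p>'
        '<span class="rating_num" property="v:average">8.7</span>'
        '<span>141370人评价</span>'
        '<span class="inq">The angel took a temporary leave.</span>')

# The document's Top 250 pattern: seven capture groups
pattern = ('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">'
           '.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>'
           '.*?<span class="rating_num".*?>(.*?)</span>'
           '.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>')

res = re.findall(pattern, html, re.S)
print(res)
```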
3. Crawling Pear Video:

'''
Video options:
    1. Pear Video
'''
import requests

# Send a request to the video's source address
response = requests.get(
    'https://video.pearvideo.com/mp4/adshort/20190625/cont-1570302-14057031_adpkg-ad_hd.mp4')

# Print the binary stream (data such as pictures, videos, etc.)
print(response.content)

# Save the video locally
with open('video.mp4', 'wb') as f:
    f.write(response.content)

'''
1. First, send a request to the Pear Video homepage
    https://www.pearvideo.com/

    Parse it and get all the video ids:
        video_1570302

        re.findall()


2. Get each video detail page url:
    Thrilling! Man slips while being robbed on the subway, escapes on foot!
    https://www.pearvideo.com/video_1570302
    The secret of the karez
    https://www.pearvideo.com/video_1570107
'''
import requests
import re  # regular expressions, used to parse text data

# 1. First, send a request to the Pear Video homepage
response = requests.get('https://www.pearvideo.com/')
# print(response.text)

# Match all video ids with a regular expression
# parameter 1: the regex pattern
# parameter 2: the text to parse
# parameter 3: the matching mode
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
# print(res_list)

# Build each video detail page url
for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    # print(detail_url)

    # Send a request to each detail page to get the video source url
    response = requests.get(url=detail_url)
    # print(response.text)

    # Parse and extract the video source url from the detail page
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    print(video_url)

    # Video title
    video_name = re.findall(
        '<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]

    print(video_name)

    # Send a request to the video url to get the binary video stream
    v_response = requests.get(video_url)

    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
        print(video_name, 'video crawled successfully')

4. Crawling the Douban Top 250

'''
https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=

1. Send requests
2. Parse the data
3. Save the data
'''
import requests
import re

# The crawler trilogy
# 1. Send a request
def get_page(base_url):
    response = requests.get(base_url)
    return response

# 2. Parse the text
def parse_index(text):
    res = re.findall('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>', text, re.S)
    # print(res)
    return res

# 3. Save the data
def save_data(data):
    with open('douban.txt', 'a', encoding='utf-8') as f:
        f.write(data)

# main + Enter key
if __name__ == '__main__':
    # num = 10
    # base_url = 'https://movie.douban.com/top250?start={}&filter='.format(num)

    num = 0
    for line in range(10):
        base_url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(base_url)

        # 1. Send a request by calling the function
        response = get_page(base_url)

        # 2. Parse the text
        movie_list = parse_index(response.text)

        # 3. Save the data
        # format the data
        for movie in movie_list:
            # print(movie)

            # unpack the match into variables:
            # movie rank, movie url, movie name, director/starring/genre, rating, number of reviews, synopsis
            v_top, v_url, v_name, v_daoyan, v_point, v_num, v_desc = movie
            # v_top = movie[0]
            # v_url = movie[1]
            movie_content = f'''
            Movie rank: {v_top}
            Movie url: {v_url}
            Movie name: {v_name}
            Director and cast: {v_daoyan}
            Rating: {v_point}
            Number of reviews: {v_num}
            Synopsis: {v_desc}
            \n
            '''

            print(movie_content)

            # save the data
            save_data(movie_content)

Summary: the teacher taught us how to crawl; today I roughly mastered crawling videos and crawling movie information.

Origin www.cnblogs.com/Max-xyh-1228/p/11100140.html