Learn day3 ---- Web crawlers

Today's contents:
One: Crawler principles
Two: The Requests library

 


One: Crawler principles
1. What is the Internet?
The Internet is a collection of computers wired together by network devices into one big net.

2. Why was the Internet built?
The purpose of the Internet is to transfer and share data.

3. What is the data?
For example:
product information on Taobao and Jingdong (JD) ...
securities and investment information on East Money, Xueqiu ...
housing information on Lianjia ...
train ticket information on 12306 ...

4. The full Internet request flow:
- Ordinary user:
open the browser -> send a request to the target site -> fetch the response data -> the browser renders the page

- Crawler:
simulate a browser -> send a request to the target site -> fetch the response data -> extract the valuable data -> persist the data


5. What does the browser send in a request?
An HTTP protocol request.

- Client:
the browser is a piece of software -> client IP and port


- Server:
https://www.jd.com/
www.jd.com (Jingdong's domain name) -> DNS resolution -> Jingdong server IP and port

The client's IP and port send the request to the server's IP and port; once a connection is established, the client can obtain the corresponding data.
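As a small illustration of the url -> domain -> port step above, Python's standard library can split a url into the pieces that DNS and TCP work with. This is a sketch using the JD url from above; the default-port table for http/https is an assumption added here, not part of `urlsplit` itself.

```python
from urllib.parse import urlsplit

# Split the URL into its parts; the hostname is what DNS resolves to an IP
parts = urlsplit('https://www.jd.com/')
print(parts.scheme)    # https
print(parts.hostname)  # www.jd.com  -> DNS resolves this to the server IP

# No explicit port in the URL, so the scheme's default applies
port = parts.port or {'http': 80, 'https': 443}[parts.scheme]
print(port)            # 443
```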


6. The full crawler flow
- send the request (requires a request library: Requests, Selenium)
- fetch the response data (as long as the request reaches the server, the server returns response data)
- parse and extract the data (requires a parsing library: re, BeautifulSoup4, XPath ...)
- save the data locally (plain files, or a database such as MongoDB)
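The four steps above can be sketched as three tiny functions. To keep the sketch runnable without network access, the request step is stubbed with a canned page; the tag, file name, and sample text are all illustrative assumptions.

```python
import re

# A canned response body stands in for a live request,
# so this sketch runs without network access.
SAMPLE_HTML = '<html><h1 class="title">Hello Crawler</h1></html>'

def fetch(url):
    # In a real crawler this would be: requests.get(url).text
    return SAMPLE_HTML

def parse(text):
    # Extract the title with a regex (re is one of the parsing libraries above)
    return re.findall('<h1 class="title">(.*?)</h1>', text, re.S)

def save(items, path='result.txt'):
    # Persist the extracted data to a local file
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(items))

items = parse(fetch('https://example.com/'))
print(items)  # ['Hello Crawler']
save(items)
```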


Two: The Requests library

1. Installation and use
- open cmd
- type: pip3 install requests

2. Crawling a video


3. Packet capture analysis
Open the browser's developer mode (Inspect) ----> select Network
Find the visited page's suffix xxx.html (response text)

1) Request url (the address of the site being accessed)
2) Request method:
GET:
send a request and fetch the data directly
https://www.cnblogs.com/kermitjam/articles/9692597.html

POST:
needs to carry user information when sending the request to the target address
https://www.cnblogs.com/login
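A GET request carries its data in the url itself. With the Requests library you can build a request without sending it and inspect the final url; the search path and query parameter below are illustrative, not a real cnblogs endpoint.

```python
import requests

# Build (but do not send) a GET request to see the url requests would use
req = requests.Request('GET', 'https://www.cnblogs.com/search',
                       params={'q': 'crawler'})
prepared = req.prepare()
print(prepared.method)  # GET
print(prepared.url)     # https://www.cnblogs.com/search?q=crawler
```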

3) Response status codes:
2xx: success
3xx: redirection
4xx: resource not found (client error)
5xx: server error
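Python's standard library already knows the reason phrase for every status code, which is handy when checking what a crawler got back:

```python
from http import HTTPStatus

# Map numeric codes to their standard reason phrases
print(HTTPStatus(200).phrase)  # OK
print(HTTPStatus(301).phrase)  # Moved Permanently
print(HTTPStatus(404).phrase)  # Not Found
print(HTTPStatus(500).phrase)  # Internal Server Error

# The leading digit gives the class of the response (2xx, 3xx, 4xx, 5xx)
for code in (200, 301, 404, 500):
    print(code // 100)
```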

4) Request headers:
User-Agent: user agent (proves the request was sent by a real computer device and browser)
Cookies: real login information (proves you are a user of the target site)
Referer: the url of the previous page (proves you jumped here from within the target site)
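These headers are just key-value pairs attached to the request. A sketch of setting them with Requests, again preparing rather than sending; the header values are illustrative examples, not real browser strings or credentials.

```python
import requests

# Attach typical request headers without actually sending anything
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://movie.douban.com/',
}
req = requests.Request('GET', 'https://movie.douban.com/top250',
                       headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'])  # Mozilla/5.0
print(prepared.headers['Referer'])     # https://movie.douban.com/
```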

5) Request body:
only POST requests carry a request body.
Form Data
{
'User': 'Tank',
'pwd': '123'
}
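Requests turns such a dictionary into a form-encoded body automatically. Preparing (without sending) the POST from the example above shows exactly what would go over the wire; the login url and fields mirror the example and are illustrative only.

```python
import requests

# Prepare (but do not send) a POST with a form body to inspect it
req = requests.Request('POST', 'https://www.cnblogs.com/login',
                       data={'User': 'Tank', 'pwd': '123'})
prepared = req.prepare()
print(prepared.body)                     # User=Tank&pwd=123
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
```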


Four: Crawling the Douban Top 250

Regex symbols:
.: match any single character, starting from the current position
*: match the previous token any number of times (find all)
?: stop as soon as the first match is found

.*?: non-greedy match
.*: greedy match

(.*?): capture group; extracts the data inside the brackets
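The difference between greedy and non-greedy matters a lot when scraping HTML, since tags repeat. A tiny comparison (the sample text is made up for illustration):

```python
import re

text = '<b>first</b><b>second</b>'

# Greedy .* runs on to the LAST possible closing tag
print(re.findall('<b>(.*)</b>', text))   # ['first</b><b>second']

# Non-greedy .*? stops at the FIRST possible closing tag
print(re.findall('<b>(.*?)</b>', text))  # ['first', 'second']
```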

Fields to extract: movie rank, movie url, movie name, director/starring/genre, movie rating, number of reviews, movie synopsis
<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>

 

<div class="item">
<div class="pic">
<em class="">226</em>
<a href="https://movie.douban.com/subject/1300374/">
<img width="100" alt="绿里奇迹" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p767586451.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1300374/" class="">
<span class="title">绿里奇迹</span>
<span class="title"> / The Green Mile</span>
<span class="other"> / The Green Mile (Taiwan) / green mile</span>
</a>


<span class="playable">[play]</span>
</div>
<div class="bd">
<p class="">
导演: Frank Darabont&nbsp;&nbsp;&nbsp;主演: Tom Hanks / David Morse ...<br>
1999&nbsp;/&nbsp;USA&nbsp;/&nbsp;Crime Drama Fantasy Mystery
</p>


<div class="star">
<span class="rating45-t"></span>
<span class="rating_num" property="v:average">8.7</span>
<span property="v:best" content="10.0"></span>
<span>141370人评价</span>
</div>

<p class="quote">
<span class="inq">An angel on temporary leave.</span>
</p>
</div>
</div>
</div>
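The extraction pattern used in the script below can be tried against a condensed copy of this snippet; each capture group then lines up with one field. The synopsis text here is an illustrative stand-in for the page's Chinese quote.

```python
import re

# A condensed version of the page snippet, enough for the pattern to match
html = '''
<div class="item">
<em class="">226</em>
<a href="https://movie.douban.com/subject/1300374/">
<span class="title">绿里奇迹</span>
<p class="">导演: Frank Darabont&nbsp;主演: Tom Hanks</p>
<span class="rating_num" property="v:average">8.7</span>
<span>141370人评价</span>
<span class="inq">An angel on temporary leave.</span>
</div>
'''

pattern = ('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">'
           '.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>'
           '.*?<span class="rating_num".*?>(.*?)</span>'
           '.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>')

for top, url, name, director, point, num, desc in re.findall(pattern, html, re.S):
    print(top, name, point, num)  # 226 绿里奇迹 8.7 141370
```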

 

Crawling the Douban Top 250 chart

'''
https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=

1. Send the request
2. Parse the data
3. Save the data
'''
import requests
import re

# Crawler trilogy
# 1. Send the request
def get_page(base_url):
    response = requests.get(base_url)
    return response

# 2. Parse the text
def parse_index(text):

    res = re.findall('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>', text, re.S)
    # print(res)
    return res

# 3. Save the data
def save_data(data):
    with open('douban.txt', 'a', encoding='utf-8') as f:
        f.write(data)

# main + Enter key (PyCharm shortcut for this block)
if __name__ == '__main__':
    # num = 10
    # base_url = 'https://movie.douban.com/top250?start={}&filter='.format(num)

    num = 0
    for line in range(10):
        base_url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(base_url)

        # 1. Send the request by calling the function
        response = get_page(base_url)

        # 2. Parse the text
        movie_list = parse_index(response.text)

        # 3. Save the data
        # Format the data
        for movie in movie_list:
            # print(movie)

            # Unpack the tuple
            # movie rank, movie url, movie name, director/starring/genre, rating, number of reviews, synopsis
            v_top, v_url, v_name, v_daoyan, v_point, v_num, v_desc = movie
            # v_top = movie[0]
            # v_url = movie[1]
            movie_content = f'''
            Movie rank: {v_top}
            Movie url: {v_url}
            Movie name: {v_name}
            Director and stars: {v_daoyan}
            Movie rating: {v_point}
            Number of reviews: {v_num}
            Movie synopsis: {v_desc}
            \n
            '''

            print(movie_content)

            # Save the data
            save_data(movie_content)

Crawling a video

'''
Video options:
    1. Pear Video
'''
# import requests
#
# # Send a request to the video's source address
# response = requests.get(
#     'https://video.pearvideo.com/mp4/adshort/20190625/cont-1570302-14057031_adpkg-ad_hd.mp4')
#
# # Print the binary stream (data such as pictures and videos)
# print(response.content)
#
# # Save the video locally
# with open('video.mp4', 'wb') as f:
#     f.write(response.content)

'''
1. First send a request to the Pear Video homepage
    https://www.pearvideo.com/

    Parse out the ids of all the videos:
        video_1570302

        re.findall()


2. Build each video detail page url:
    Thrilling! Man robbed on the subway; the thief flees on foot
    https://www.pearvideo.com/video_1570302
    The secret of the karez
    https://www.pearvideo.com/video_1570107
'''
import requests
import re  # regular expressions, used to parse text data

# 1. First send a request to the Pear Video homepage
response = requests.get('https://www.pearvideo.com/')
# print(response.text)

# Use a regex to match the video ids on the page
# Parameter 1: regex matching rule
# Parameter 2: text to parse
# Parameter 3: match mode
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
# print(res_list)

# Stitching each video detail page url 
for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    # print(detail_url)

    # Send a request to each video detail page to get the video source url
    response = requests.get(url=detail_url)
    # print(response.text)

    # Parse and extract the video url from the detail page
    # video url
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    print(video_url)

    # Video name
    video_name = re.findall(
        '<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]

    print(video_name)

    # Send a request to the video url to get the binary video stream
    v_response = requests.get(video_url)

    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
        print(video_name, 'video crawling completed')

The basic use of requests

import requests  # import the requests library


# Send a request to the Baidu homepage and get a response object
response = requests.get(url='https://www.baidu.com/')

# Set the character encoding to utf-8
response.encoding = 'utf-8'

# Print the response text
print(response.text)

# Write the response text to a local file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

 


Origin www.cnblogs.com/xl-123456/p/11094621.html