day3 June 26 study notes python reptile a reptile principle two Requests library request

Today contents: 
a reptile principle
two Requests library request

 


A reptile principle
1. What is the Internet?
It refers to a stack of a network device, to the computer station to the Internet together with a called Internet.

2. The purpose of the establishment of the Internet?
The purpose is to establish the Internet transfer and data sharing data.

3. What is the data?
For example ... Taobao, Jingdong product information such as
number of securities investment information East Fortune, snowball network ...
the chain of home, such as availability of information freely ....
12306 ticket information ...

4. The whole process of the Internet:
- ordinary users:
open browser -> sending a request to a target site -> the fetch response data -> renderer in the browser

- crawlers:
analog browser -> sending a request to a target site -> the fetch response data - -> extract valuable data -> persisted to the data


5. What is the browser sends a request?
request http protocol.

- Client:
the browser is a software -> Client IP and port


- server
https://www.jd.com/

www.jd.com (Jingdong domain name) -> DNS parsing -> IP and port services side of Jingdong

client ip and port ------> IP and port to send the request to the server can establish a link to obtain the corresponding data.


6. The crawler whole process
- the transmission request (request requires libraries: Requests database request, requesting the Selenium library)
- fetch response data (as long as the transmission request to the server, the request returns response data)
- parses and extracts data (requires parsing library : Re, BeautifulSoup4, Xpath ...)
- save to a local (file processing, database, MongoDB repository)


two requests requests library

import
Requests library request
import requests # import request requests Library


# Baidu home page to send a request to obtain the response object
response = requests.get(url='https://www.baidu.com/')

# Set the character encoding to utf-8
response.encoding = 'utf-8'

# Print the response text
print(response.text)

# The text is written in response to local
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

  



1. Installation and Use
- open cmd
- Input: Requests the install PIP3



2. Video crawling

 

''''''
'''
Video Options:
    1. Pears video
'''
# import requests
#
## to the source address of the video transmission request
# response = requests.get(
#     'https://video.pearvideo.com/mp4/adshort/20190625/cont-1570302-14057031_adpkg-ad_hd.mp4')
#
# # Print binary stream, such as pictures, video and other data
# print(response.content)
#
# # Save the video to your local
# with open('视频.mp4', 'wb') as f:
#     f.write(response.content)

'''
1, first send a request to the pear Video Home
    https://www.pearvideo.com/
    
    Id get resolved for all videos:
        video_1570302
        
        re.findall()
        

2, access to video details page url:
    Thrilling! Man robbed on the subway slip, go on foot
    https://www.pearvideo.com/video_1570302
    Secret Karez
    https://www.pearvideo.com/video_1570107
'''
import requests
import re # regularization, for parsing text data
# 1, first send a request to the pear Video Home
response = requests.get('https://www.pearvideo.com/')
# print(response.text)

# Re regular matches to get all video id
Parameter # 1: Regular matching rules
Parameter # 2: parse text
Parameter # 3: Match mode
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
# print(res_list)

# Stitching each video detail page url
for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    # print(detail_url)

    # Sending a request to obtain the video source url for each video detail page
    response = requests.get(url=detail_url)
    # print(response.text)

    # Parse and extract details page video url
    # Video url
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    print(video_url)

    # Name video
    video_name = re.findall(
        '<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]

    print(video_name)

    # Binary stream to acquire a video transmission request video url
    v_response = requests.get(video_url)

    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
        print (video_name, 'video crawling Complete')

  


3. packet capture analysis
Open developer mode browser (check) ----> select the network
to find pages visited suffix xxx.html (response text)

1) the request url (website address access)
2) request method:
GET :
directly sending a request to obtain data
https://www.cnblogs.com/kermitjam/articles/9692597.html

the POST:
the need to carry the user information transmission request to the target address
https://www.cnblogs.com/login

. 3) the response status code :
2xx: success
3xx: redirection
4xx: Can not find resource
5xx: server error

4) request header information:
the user-agent: user agent (proved to be a request sent by computer equipment and browser)
Cookies: real login user information ( prove that users of your target site)
Referer: on a visit to the url (to prove that you are jumping from target sites on the web)

5) Request body:
POST request will have the request body.
The Data Form
{
'the User': 'Tank',
'pwd': '123'
}


four crawling IMDb
: starting from the current position
*: Find all
:? Not looking to find the first

* ?: non-greedy match
. *: greed matches

(. *?): extract data in brackets

''''''
'''
https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=

1. The transmission request
2. Parse the data
3. Save data
'''
import requests
import re

# Reptile three-part song
# 1 sends a request
def get_page(base_url):
    response = requests.get(base_url)
    return response

# 2. parse text
def parse_index(text):

    res = re.findall('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>', text, re.S)
    # print(res)
    return res

# 3. Save data
def save_data(data):
    with open('douban.txt', 'a', encoding='utf-8') as f:
        f.write(data)

# Main + Enter key
if __name__ == '__main__':
    # A = 10
    # base_url = 'https://movie.douban.com/top250?start={}&filter='.format(num)

    a = 0
    for line in range(10):
        base_url = f'https://movie.douban.com/top250?start={num}&filter='
        a = + 25
        print(base_url)

        # 1 sends a request, the calling function
        response = get_page(base_url)

        # 2. parse text
        movie_list = parse_index(response.text)

        # 3. Save data
        # Formatted data
        for movie in movie_list:
            # print(movie)

            # Decompression assignment
            Ranked # movie, movies url, film name, director - starring - the type of movie scores, number of reviews, film synopsis
            v_top, v_url, v_name, v_daoyan, v_point, v_num, v_desc = movie
            # v_top = movie[0]
            # v_url = movie[1]
            moive_content = f'''
            Movie Ranking: {v_top}
            Film url: {v_url}
            Movie Name: {v_name}
            Director Starring: {v_daoyan}
            Movie rating: {v_point}
            Number of Evaluation: {v_num}
            Movie Synopsis: {v_desc}
            \n
            '''

            print(moive_content)

            # save data
            save_data(moive_content)

  

Guess you like

Origin www.cnblogs.com/zhylouqi/p/11094500.html