Learning Python crawlers: Day 3

One: Crawler principles

1 What is the Internet?

The Internet is a bunch of network devices that link individual computers together, station to station, into one network.

2 The purpose of the Internet

The Internet was built so that computers can pass data to one another and share it.

3 What is data?

Product information on e-commerce sites such as Taobao and Jingdong (JD)

Securities and investment information on sites such as East Fortune (Eastmoney) and Snowball (Xueqiu)

Housing listings on sites such as Lianjia

Ticketing information on 12306

 

4 The full process of browsing the web

- Ordinary user:

Open a browser ===> send a request to a target site ===> fetch the response data ===> the browser renders the result

- Crawler:

Simulate a browser ===> send a request to a target site ===> fetch the response data ===> extract the valuable data ===> persist the data

5 What request does the browser send?

An HTTP protocol request.

 

- Client

The browser is a piece of software ===> the client's IP and port

 

- Server

https://www.jd.com/

www.jd.com (the Jingdong domain) ===> DNS resolution ===> the Jingdong server's IP and port

https is http + ssl, so https://www.jd.com/ is really http+ssl://www.jd.com/

 

Once the client's IP and port can reach the server's IP and port, a link is established, requests can be sent, and the corresponding data comes back.
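The DNS step above can be sketched with the standard library: the system resolver turns a hostname into an IP address. ('localhost' is used below only so the sketch works without network access; a real crawler would resolve a domain such as www.jd.com the same way.)

```python
import socket

# minimal sketch of the DNS step: hostname ===> IP address
def resolve(hostname):
    # gethostbyname asks the system resolver for an IPv4 address
    return socket.gethostbyname(hostname)

# 'localhost' resolves locally, without touching the network
print(resolve('localhost'))  # 127.0.0.1
```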

 

6 The full process of a crawler

  1 Send a request (request libraries: the Requests library, the Selenium library)

  2 Fetch the response data (as long as the request reaches the server, the server returns response data)

  3 Parse and extract the data (parsing libraries: re, BeautifulSoup4, XPath, ...)

  4 Save the data locally (file handling, databases, the MongoDB store)

 

Two: The requests library

 

Crawling a single page

import requests  # import the requests library

# send a request to the Baidu site and get back the response object
response = requests.get(url='https://www.baidu.com')
# set the character encoding to utf-8
response.encoding = 'utf-8'
# print the response text
print(response.text)
# write the response text to a local file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
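The response.encoding line matters because response.text is just response.content (raw bytes) decoded with that encoding. A sketch with plain bytes, no network needed (the sample byte string below is made up):

```python
# response.text decodes response.content using response.encoding;
# a wrong encoding garbles non-ASCII text (sample bytes are made up)
raw = '百度一下'.encode('utf-8')   # stands in for response.content
print(raw.decode('utf-8'))        # decoded correctly: 百度一下
print(raw.decode('latin-1'))      # wrong encoding produces mojibake
```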

 

1 Installation

       - Open cmd

  - Enter: pip3 install requests
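Once pip3 finishes, a quick sanity check (assuming the install succeeded): importing the library and printing its version string.

```python
import requests

# if this prints a version string such as '2.x.x', the install worked
print(requests.__version__)
```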

2 Usage

① First, send a request to the Pear Video home page:

https://www.pearvideo.com/

Parse out all the video ids (e.g. video_1570302) with re.findall()

② Get each video's detail-page url

       Title: Thrilling! A man robbed on the subway slips away

       https://www.pearvideo.com/video_1570302
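Step ① can be sketched offline against a made-up fragment of home-page HTML (the anchor tags below are invented; the live page's markup may differ):

```python
import re

# made-up HTML in the shape of the Pear Video home page
html = '''
<a href="video_1570302" class="vervideo-lilink">...</a>
<a href="video_1570107" class="vervideo-lilink">...</a>
'''

# re.S lets '.' match newlines, so the pattern works across lines
ids = re.findall('<a href="video_(.*?)"', html, re.S)
print(ids)  # ['1570302', '1570107']
```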

 

 

Crawling a single video

import requests

video_url = 'https://www.pearvideo.com/video_1570302'
response = requests.get(url=video_url)
print(response.text)

# find the video's source address (right-click the page, Inspect, locate the video element)
response = requests.get('https://video.pearvideo.com/mp4/adshort/20190625/cont-1570302-14057031_adpkg-ad_hd.mp4')
# response.content is the binary stream (for pictures, video, etc.)
print(response.content)
# save the video locally
with open('video.mp4', 'wb') as f:
    f.write(response.content)

 

 

Crawling all the videos on the Pear Video home page

import requests
import re  # regular expressions, used to parse text data

# send a request to the Pear Video home page
response = requests.get('https://www.pearvideo.com/')
# print(response.text)

# re: match the video ids
# argument 1: the regex pattern
# argument 2: the text to parse
# argument 3: the match mode
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
print(res_list)

# build each video's detail-page url
for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    print(detail_url)

    # send a request for each detail page
    response = requests.get(url=detail_url)
    # print(response.text)

    # parse and extract the video url from the detail page
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    print(video_url)

    # video name
    video_name = re.findall('<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]
    print(video_name)

    # fetch the video's binary stream and save it locally
    v_response = requests.get(video_url)
    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
        print(video_name, 'video crawled successfully')
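response.content holds the whole video in memory at once. For large files, a streamed download is gentler; a sketch (not from the original notes, but iter_content and stream=True are part of the requests API):

```python
import requests

def download(url, filename):
    # stream=True defers reading the body until we iterate over it
    response = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        # write the file in 8 KB chunks instead of one huge bytes object
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
```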

 

Crawling the first ten pages of the Douban Top 250 (movie rank, director, rating, etc.)

import requests
import re

# the three crawler steps
# 1. send the request
def get_page(base_url):
    response = requests.get(base_url)
    return response

# 2. parse the text
def parse_index(text):
    res = re.findall(
        '<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">'
        '.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>'
        '.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>'
        '.*?<span class="inq">(.*?)</span>',
        text, re.S)
    return res

# 3. save the data
def save_data(data):
    with open('double.txt', 'a', encoding='utf-8') as f:
        f.write(data)

if __name__ == '__main__':
    num = 0
    for line in range(10):
        base_url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(base_url)
        # send the request by calling the function
        response = get_page(base_url)
        # parse the text
        movie_list = parse_index(response.text)
        # save the data
        for movie in movie_list:
            # print(movie)
            # unpack: rank, movie url, name, director, rating, vote count, quote
            v_top, v_url, v_name, v_daoyan, v_point, v_num, v_desc = movie
            # v_top = movie[0]
            # v_url = movie[1]
            movie_content = f'''
            Rank: {v_top}
            URL: {v_url}
            Name: {v_name}
            Director: {v_daoyan}
            Rating: {v_point}
            Votes: {v_num}
            Quote: {v_desc}
            '''
            print(movie_content)
            save_data(movie_content)
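The parse_index pattern can be checked offline against a made-up fragment shaped like one Douban Top 250 list item (all the values below are invented samples, not real scraped data):

```python
import re

# the same pattern parse_index uses, split across lines for readability
pattern = ('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">'
           '.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>'
           '.*?<span class="rating_num".*?>(.*?)</span>'
           '.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>')

# a made-up list item in the shape of the Douban top250 markup
html = ('<div class="item"><em class="">1</em>'
        '<a href="https://movie.douban.com/subject/1292052/">'
        '<span class="title">肖申克的救赎</span></a>'
        '<p>导演: 弗兰克·德拉邦特</p>'
        '<span class="rating_num" property="v:average">9.7</span>'
        '<span>2000000人评价</span>'
        '<span class="inq">希望让人自由。</span></div>')

movies = re.findall(pattern, html, re.S)
print(movies[0])  # a 7-tuple: rank, url, name, director, rating, votes, quote
```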

 


Source: www.cnblogs.com/101720A/p/11093895.html