One Crawler principles
1 What is the Internet
The Internet is the collection of network devices (cables, routers, switches, and so on) that links computers around the world into one network.
2 Why the Internet was built
Its purpose is to connect computers so that they can transfer and share data.
3 What is data
Product information on Taobao and JD.com
Securities and investment information on East Money and Xueqiu
Housing information on Lianjia and Ziroom
Ticketing information on 12306
4 The full process of going online
- Ordinary user:
open a browser ===> send a request to the target site ===> fetch the response data ===> the browser renders the page
- Crawler:
simulate a browser ===> send a request to the target site ===> fetch the response data ===> extract the valuable data ===> persist the data locally
5 What request does the browser send
An http protocol request.
- Client
The browser is just a piece of software ===> client IP and port
- Server
www.jd.com (the JD.com domain name) ===> DNS resolution ===> JD server IP and port
https (http + ssl), e.g. https://www.jd.com/
Once a link is established between the client's IP and port and the server's IP and port, requests can be sent and the corresponding data obtained.
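As a small illustration of the client/server addressing above, Python's standard library can split a url into the pieces a request needs (scheme, host, port); the JD.com url is the one from the text, and the default-port logic is a sketch, not part of the original notes.

```python
from urllib.parse import urlsplit

# split the JD.com url into its parts
parts = urlsplit('https://www.jd.com/')
print(parts.scheme)    # https
print(parts.hostname)  # www.jd.com

# https defaults to port 443 when the url does not name a port
port = parts.port or (443 if parts.scheme == 'https' else 80)
print(port)            # 443
```

DNS resolution then turns the hostname into the server IP, exactly as in the diagram above.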
6 The full crawler process
1 Send a request (request libraries: requests, selenium)
2 Fetch the response data (as long as the request reaches the server, the server returns a response)
3 Parse and extract the data (parsing libraries: re, BeautifulSoup4, XPath, ...)
4 Save the data locally (files, or databases such as MongoDB)
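The four steps above can be sketched end to end without touching the network; the HTML snippet and the file name below are made up purely for illustration.

```python
import re

# 1. send a request -- here a canned snippet stands in
#    for the text returned by requests.get(url).text
html = '<html><h1 class="title">hello crawler</h1></html>'

# 2. the response data is just this text
# 3. parse and extract the valuable data with re
title = re.findall('<h1 class="title">(.*?)</h1>', html, re.S)[0]
print(title)  # hello crawler

# 4. persist it locally
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write(title)
```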
Two The requests request library
Crawling a single page

import requests  # import the requests library

# send a request to the Baidu site and get a response object
response = requests.get(url='https://www.baidu.com')
# set the character encoding to utf-8
response.encoding = 'utf-8'
# print the response text
print(response.text)
# write the response text to a local file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
1 Installation
- Open cmd
- Type: pip3 install requests
2 Use
① First send a request to the Pear Video home page
https://www.pearvideo.com/
and parse out all of the video ids, for example: video_1570302
using re.findall()
② Build each video detail page url from the id
Title: Thrilling! A man is robbed on the subway
https://www.pearvideo.com/video_1570302
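A quick look at what that re.findall() call does in step ①, run against a fabricated fragment of home-page markup (the real attribute layout on pearvideo.com may differ):

```python
import re

# made-up sample of home-page markup containing two video links
html = '''
<a href="video_1570302" class="vervideo-lilink">...</a>
<a href="video_1570107" class="vervideo-lilink">...</a>
'''

# capture only the numeric id that follows "video_"
ids = re.findall('href="video_(.*?)"', html)
print(ids)  # ['1570302', '1570107']
```

Each id can then be appended to https://www.pearvideo.com/video_ to form a detail page url, as in step ②.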
Crawling a single video

import requests

video_url = 'https://www.pearvideo.com/video_1570302'
response = requests.get(url=video_url)
print(response.text)

# send a request to the video source address
# (right-click a blank area of the video, inspect, and find the video src)
response = requests.get('https://video.pearvideo.com/mp4/adshort/20190625/cont-1570302-14057031_adpkg-ad_hd.mp4')
# response.content is the binary stream, used for data such as images and video
print(response.content)
# save the video locally
with open('video.mp4', 'wb') as f:
    f.write(response.content)
Crawling every video on the Pear Video home page

import requests
import re  # regular expressions, used to parse text data

# send a request to the Pear Video home page
response = requests.get('https://www.pearvideo.com/')
# print(response.text)

# use a regular expression to match the video ids
# parameter 1: the regex rule
# parameter 2: the text to parse
# parameter 3: the match mode
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
print(res_list)

# build each video detail page url
for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    print(detail_url)

    # send a request to each detail page
    response = requests.get(url=detail_url)
    # print(response.text)

    # parse and extract the video url from the detail page
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    print(video_url)

    # parse and extract the video name
    video_name = re.findall('<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]
    print(video_name)

    # download and save the video
    v_response = requests.get(video_url)
    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
    print(video_name, 'video downloaded successfully')
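The re.S (DOTALL) mode passed as the third parameter in the script above matters because the fields in a real page span multiple lines; without it, '.' does not match a line break. A minimal made-up example:

```python
import re

text = 'srcUrl="https://example.com/a.mp4"\n<h1 class="video-tt">demo\ntitle</h1>'

# without re.S, .*? cannot cross the newline inside the title, so nothing matches
print(re.findall('<h1 class="video-tt">(.*?)</h1>', text))        # []
# with re.S, '.' also matches newlines, so the whole title is captured
print(re.findall('<h1 class="video-tt">(.*?)</h1>', text, re.S))  # ['demo\ntitle']
```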
Crawling the first ten pages of the Douban Top 250 (movie ranking, url, name, director, rating, number of ratings, synopsis)

import requests
import re

# the three crawler steps

# 1. send the request
def get_page(base_url):
    response = requests.get(base_url)
    return response

# 2. parse the text
def parse_index(text):
    res = re.findall(
        '<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?'
        '<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?'
        '<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?'
        '<span class="inq">(.*?)</span>',
        text, re.S)
    return res

# 3. save the data
def save_data(data):
    with open('double.txt', 'a', encoding='utf-8') as f:
        f.write(data)

# main + Enter
if __name__ == '__main__':
    num = 0
    for line in range(10):
        # each page shows 25 movies, so start increases by 25 per page
        base_url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(base_url)

        # 1. send the request
        response = get_page(base_url)

        # 2. parse the text
        movie_list = parse_index(response.text)

        # 3. save the data
        for movie in movie_list:
            # unpack: ranking, url, name, director, rating,
            # number of ratings, synopsis
            v_top, v_url, v_name, v_daoyan, v_point, v_num, v_desc = movie
            movie_content = f'''
movie ranking: {v_top}
movie url: {v_url}
movie name: {v_name}
movie director: {v_daoyan}
movie rating: {v_point}
number of ratings: {v_num}
movie synopsis: {v_desc}
'''
            print(movie_content)
            save_data(movie_content)
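The paging loop in the script advances the start query parameter by 25 per page; a small hypothetical helper (not in the original notes) showing the ten urls it generates:

```python
def top250_urls(pages=10, page_size=25):
    """Build the Douban Top 250 page urls, 25 movies per page."""
    return [f'https://movie.douban.com/top250?start={n * page_size}&filter='
            for n in range(pages)]

urls = top250_urls()
print(urls[0])    # https://movie.douban.com/top250?start=0&filter=
print(urls[9])    # https://movie.douban.com/top250?start=225&filter=
print(len(urls))  # 10
```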