One. Crawler principles
1. What is the Internet?
The Internet is a collection of network devices that link individual computers to one another; the resulting network of connected computers is called the Internet.
2. Why was the Internet created?
It was created to transfer and share data.
3. What is data?
For example:
product information on Taobao and JD.com (Jingdong);
securities and investment information on East Money and Xueqiu (Snowball);
housing listings on Lianjia, Ziroom and similar sites;
ticket information on 12306.
4. The whole process of going online:
- ordinary users:
open a browser -> send a request to the target site -> receive the response data -> render it in the browser
- crawlers:
simulate a browser -> send a request to the target site -> receive the response data -> extract the valuable data -> persist the data
5. What does the browser send?
An HTTP request.
- client:
the browser is a piece of software -> client IP and port
- server:
https://www.jd.com/
www.jd.com (the JD.com domain name) -> DNS resolution -> IP and port of the JD.com server
client IP and port ------> the request is sent to the server's IP and port; once a connection is established, the corresponding data can be fetched.
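The step from a URL to "server IP and port" can be sketched with the standard library. This is a minimal illustration, not part of the original notes: `server_endpoint` and `resolve` are hypothetical helper names, and the port table assumes the usual defaults (80 for http, 443 for https).

```python
import socket
from urllib.parse import urlsplit

# Default ports for the two schemes a browser normally uses.
DEFAULT_PORTS = {"http": 80, "https": 443}


def server_endpoint(url):
    """Split a URL into the host name and port the client will connect to."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def resolve(host, port):
    """Ask the system resolver (DNS) for the IP addresses behind a host name.

    Needs network access, so it is defined here but not called below.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


host, port = server_endpoint("https://www.jd.com/")
print(host, port)  # www.jd.com 443
```

Calling `resolve(host, port)` on a machine with network access would return the addresses the client actually connects to; the result depends on the resolver, so no output is shown for it.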
6. The whole process of a crawler
- send a request (needs a request library: the Requests library or the Selenium library)
- receive the response data (the server returns it once the request has been sent)
- parse and extract data (needs a parsing library: re, BeautifulSoup4, XPath ...)
- save locally (file handling, a database, MongoDB storage)
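The four steps above can be sketched as three small functions. This is only an outline under stated assumptions: the function names `fetch`, `parse` and `save`, the sample HTML, and the output file name are all made up for illustration, and `fetch` is defined but not called so that the example runs without network access.

```python
import json
import re
import tempfile
from pathlib import Path


def fetch(url):
    """Steps 1 and 2: send the request and return the response text."""
    import requests  # third-party; imported lazily so steps 3 and 4 run without it
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse(html):
    """Step 3: extract (href, link text) pairs with a regular expression."""
    return re.findall(r'<a href="(.*?)">(.*?)</a>', html, re.S)


def save(rows, path):
    """Step 4: persist the extracted rows to a local JSON file."""
    records = [{"url": url, "title": title} for url, title in rows]
    Path(path).write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return len(records)


# Hypothetical page content standing in for a real response.
SAMPLE_HTML = (
    '<ul><li><a href="/a.html">First</a></li>'
    '<li><a href="/b.html">Second</a></li></ul>'
)

rows = parse(SAMPLE_HTML)
out_file = Path(tempfile.gettempdir()) / "crawler_demo.json"
saved = save(rows, out_file)
print(rows)  # [('/a.html', 'First'), ('/b.html', 'Second')]
```

In a real crawler the sample string would be replaced by `fetch(some_url)`, and `save` could just as well write to a database.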
Two. The requests library
1. Installation and use
- open cmd
- enter: pip3 install requests
2. Crawling videos
3. Packet capture analysis
Open the browser's developer tools (Inspect) ----> select the Network tab
Find the request for the visited page, whose name ends in xxx.html (its response is the page text)
1) Request URL (the address of the site being visited)
2) Request method:
GET:
sends a request directly to obtain data
https://www.cnblogs.com/kermitjam/articles/9692597.html
POST:
carries user information in the request sent to the target address
https://www.cnblogs.com/login
3) Response status code:
2xx: success
3xx: redirection
4xx: client error (for example, 404 resource not found)
5xx: server error
4) Request headers:
User-Agent: the user agent (shows that the request was sent by a real device and browser)
Cookie: the login information of a real user (proves that you are a user of the target site)
Referer: the URL of the previous page visited (proves that you arrived from a page on the target site)
5) Request body:
POST requests carry a request body.
Form Data
{
'user': 'tank',
'pwd': '123'
}
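The pieces found in the packet capture (headers, form data, status code) map onto plain Python values. The sketch below is illustrative only: every header value and form field is a placeholder, `status_class` and `login` are hypothetical helpers, and `login` is not called because it needs network access and real credentials.

```python
from urllib.parse import urlencode

# Headers a browser would normally send; every value here is a placeholder.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.cnblogs.com/",
    "Cookie": "sessionid=placeholder",
}

# Form data carried in the body of a POST request.
form_data = {"user": "tank", "pwd": "123"}

# A form body travels as URL-encoded key=value pairs.
encoded_body = urlencode(form_data)
print(encoded_body)  # user=tank&pwd=123


def status_class(code):
    """Map an HTTP status code to its general class."""
    classes = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return classes.get(code // 100, "other")


def login(url):
    """Send the POST with requests; not called here because it needs the network."""
    import requests  # third-party, imported lazily
    response = requests.post(url, headers=headers, data=form_data, timeout=10)
    return response.status_code, status_class(response.status_code)
```

Passing a dict as `data=` lets requests do the URL encoding itself; `urlencode` is used above only to show what the body looks like on the wire.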
Four. Crawling Douban movies
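.: matches any single character, starting from the current position
*: repeats the preceding pattern as many times as possible (find all)
?: after a quantifier, stops at the first match instead of continuing
.*?: non-greedy match
.*: greedy match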
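The difference between the greedy and non-greedy forms is easiest to see on a small string. The sample text below is made up for illustration; the calls are standard `re` functions.

```python
import re

text = '<a href="video_1">one</a><a href="video_2">two</a>'

# Greedy: .* runs to the LAST closing quote it can still match.
greedy = re.findall(r'href="(.*)"', text)

# Non-greedy: .*? stops at the FIRST closing quote.
lazy = re.findall(r'href="(.*?)"', text)

print(greedy)  # ['video_1">one</a><a href="video_2']
print(lazy)    # ['video_1', 'video_2']

# By default "." does not match a newline; the re.S flag makes it do so,
# which is why page-scraping patterns are usually compiled with re.S.
multi_line = "a\nb"
without_flag = re.findall(r"a.b", multi_line)
with_flag = re.findall(r"a.b", multi_line, re.S)
```

Here `without_flag` is empty and `with_flag` contains the single match that spans the line break.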
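Crawling videos
import re

import requests

BASE_URL = 'https://www.pearvideo.com/'

# Fetch the home page and collect the id of every video linked from it.
response = requests.get(BASE_URL)
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)

for v_id in res_list:
    # Detail pages live at /video_<id>, so restore the prefix the regex dropped.
    detail_url = BASE_URL + 'video_' + v_id
    response = requests.get(url=detail_url)

    # Pull the video file URL and the title out of the detail page.
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    video_name = re.findall('<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]
    print(video_name)

    # Download the video and write the raw bytes to <title>.mp4.
    v_response = requests.get(video_url)
    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
    print(video_name, 'video downloaded')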
(.*?): captures the data matched inside the parentheses
Fields to extract: movie rank, movie URL, movie name, director / cast / genre, rating, number of ratings, one-line synopsis
<div class="item">.*?<em class="">(.*?)</em>
.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>
.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>
.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>
The literals 导演: ("director:") and 人评价 ("people rated") are the Chinese labels that appear in the page source, so they stay untranslated in the pattern.
<div class="item">
<div class="pic">
<em class="">226</em>
<a href="https://movie.douban.com/subject/1300374/">
<img width="100" alt="The Green Mile" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p767586451.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1300374/" class="">
<span class="title">Green Mile</span>
<span class="title">&nbsp;/&nbsp;The Green Mile</span>
<span class="other">&nbsp;/&nbsp;The Green Mile (Taiwan) / Green Mile</span>
</a>
<span class="playable">[play]</span>
</div>
<div class="bd">
<p class="">
导演: Frank Darabont&nbsp;&nbsp;&nbsp;主演: Tom Hanks / David Morse David M...<br>
1999&nbsp;/&nbsp;USA&nbsp;/&nbsp;Crime Drama Fantasy Mystery
</p>
<div class="star">
<span class="rating45-t"></span>
<span class="rating_num" property="v:average">8.7</span>
<span property="v:best" content="10.0"></span>
<span>141370人评价</span>
</div>
<p class="quote">
<span class="inq">天使暂时离开。</span>
</p>
</div>
</div>
</div>
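Applied to one list item, the extraction pattern can be tried out as below. This is a sketch under assumptions: the name `parse_index` follows the call in the main program, but its body is a reconstruction, not the author's code, and the sample item is a trimmed copy of the markup above.

```python
import re

# One capture group per field: rank, URL, name, director line,
# rating, number of ratings, one-line synopsis.
ITEM_PATTERN = re.compile(
    r'<div class="item">.*?<em class="">(.*?)</em>'
    r'.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>'
    r'.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>'
    r'.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>',
    re.S,  # let "." cross line breaks, since each item spans many lines
)

# Trimmed sample item (an assumption for illustration, not a live response).
SAMPLE_ITEM = '''
<div class="item">
<div class="pic">
<em class="">226</em>
<a href="https://movie.douban.com/subject/1300374/">
<img width="100" alt="The Green Mile" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1300374/" class="">
<span class="title">The Green Mile</span>
</a>
</div>
<div class="bd">
<p class="">
导演: Frank Darabont
</p>
<div class="star">
<span class="rating_num" property="v:average">8.7</span>
<span>141370人评价</span>
</div>
<p class="quote">
<span class="inq">天使暂时离开。</span>
</p>
</div>
</div>
</div>
'''


def parse_index(html):
    """Return one tuple of stripped fields per movie item found in html."""
    return [
        tuple(field.strip() for field in match)
        for match in ITEM_PATTERN.findall(html)
    ]


movies = parse_index(SAMPLE_ITEM)
print(movies[0])
```

Each tuple has seven fields in the order of the capture groups, which is the order the main program unpacks them in.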
Movie ranking
# get_page(), parse_index() and save_data() are helper functions that are not shown in this section.
if __name__ == '__main__':
    num = 0
    # The Top 250 list has 10 pages of 25 movies: start = 0, 25, ..., 225.
    for line in range(10):
        base_url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(base_url)

        # Fetch the list page and extract one tuple per movie.
        response = get_page(base_url)
        movie_list = parse_index(response.text)

        for movie in movie_list:
            v_top, v_url, v_name, v_director, v_point, v_num, v_desc = movie
            movie_content = f'''
            Movie ranking: {v_top}
            Movie URL: {v_url}
            Movie name: {v_name}
            Director and cast: {v_director}
            Movie rating: {v_point}
            Number of ratings: {v_num}
            Synopsis: {v_desc}
            \n
            '''
            print(movie_content)
            save_data(movie_content)
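The main program relies on paging through the list 25 movies at a time and on two helpers for fetching and saving. A minimal sketch of those parts follows; `page_urls` is a hypothetical name, the bodies of `get_page` and `save_data` are assumptions rather than the author's versions, and the default file name `douban.txt` is made up.

```python
import tempfile
from pathlib import Path

BASE_URL = "https://movie.douban.com/top250?start={start}&filter="


def page_urls(pages=10, page_size=25):
    """Build the URL of each list page: start=0, 25, 50, ..."""
    return [BASE_URL.format(start=page * page_size) for page in range(pages)]


def get_page(url):
    """Hypothetical helper: fetch one page with a browser-like User-Agent."""
    import requests  # third-party, imported lazily; this call needs the network
    headers = {"User-Agent": "Mozilla/5.0"}
    return requests.get(url, headers=headers, timeout=10)


def save_data(text, path="douban.txt"):
    """Hypothetical helper: append one formatted record to a text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)


urls = page_urls()
print(urls[0])   # https://movie.douban.com/top250?start=0&filter=
print(urls[-1])  # https://movie.douban.com/top250?start=225&filter=
```

Opening the file in append mode means records from all ten pages accumulate in one file; delete the file before a second run to avoid duplicates.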