Summer Training: Python 3

Today's contents:
I. How a crawler works
II. The requests library




I. How a crawler works
1. What is the Internet?
The Internet is a collection of network devices that link individual computers to one another; the machines connected this way are together called the Internet.

2. Why was the Internet built?
The Internet was built to transfer and share data.

3. What counts as data?
For example:
product information on Taobao and JD.com ...
securities and investment information on East Money and Xueqiu ...
housing listings on Lianjia and Ziroom ...
ticket information on 12306 ...

4. The full process of going online:
- Ordinary user:
open a browser -> send a request to the target site -> receive the response data -> render it in the browser

- Crawler:
simulate a browser -> send a request to the target site -> receive the response data -> extract the valuable data -> persist the data


5. What kind of request does the browser send?
An HTTP request.

- Client:
the browser is a piece of software -> client IP and port


- Server:
https://www.jd.com/
www.jd.com (JD.com's domain name) -> DNS resolution -> IP and port of JD.com's server

client IP and port ------> server IP and port: sending the request establishes a connection to the server, through which the corresponding data can be retrieved.
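
As a rough illustration of the path from a URL to the server's IP and port, the sketch below uses only the standard library. The helper names `host_and_port` and `resolve` and the default-port table are assumptions made for this example; they do not come from the original notes.

```python
import socket
from urllib.parse import urlsplit

# Well-known default ports, used when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Split a URL into the (hostname, port) pair a client connects to."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def resolve(url):
    """Ask DNS for the IP addresses behind the URL's hostname (needs network access)."""
    host, port = host_and_port(url)
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry ends with a sockaddr tuple whose first item is the IP address.
    return sorted({info[4][0] for info in infos})
```

Calling `host_and_port("https://www.jd.com/")` gives `("www.jd.com", 443)`. The result of `resolve` depends on what DNS reports at the time of the call, so no fixed output is shown here.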


6. The full crawler process
- Send the request (needs a request library: the Requests library or the Selenium library)
- Receive the response data (once a request reaches the server, the server returns response data)
- Parse and extract the data (needs a parsing library: re, BeautifulSoup4, XPath ...)
- Save it locally (file handling, a database, MongoDB)
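
The four steps above can be sketched as four small functions. This is a minimal outline under stated assumptions, not the course's own code: the function names, the anchor-tag regex, and the JSON output format are all chosen for illustration. The `requests` import sits inside `send_request` so that the parsing and saving helpers can be tried without the package installed.

```python
import json
import re


def send_request(url):
    """Steps 1 and 2: send the request and return the response text."""
    import requests  # third-party; install with: pip3 install requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def parse(html):
    """Step 3: extract (link, text) pairs from anchor tags with a regex."""
    return re.findall(r'<a href="(.*?)".*?>(.*?)</a>', html, re.S)


def save(items, path):
    """Step 4: persist the extracted data to a local JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)


def crawl(url, path):
    """Run the four steps in order and return what was extracted."""
    html = send_request(url)
    items = parse(html)
    save(items, path)
    return items
```

A regex is used for parsing only because it needs no extra installation; BeautifulSoup4 or XPath would fill the same step.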


II. The requests library

1. Installation and use
- Open cmd
- Enter: pip3 install requests
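
One way to confirm the installation worked is to ask Python whether the package can be found. The helper name `is_installed` is an assumption for this sketch; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(package):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package) is not None


print("requests installed:", is_installed("requests"))
```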

2. Crawling a video


3. Packet capture analysis
Open the browser's developer tools (Inspect) ----> select the Network tab
Find the request for the page whose name ends in xxx.html (its response holds the page text)

1) Request URL (the address of the site being visited)
2) Request method:
GET:
sends the request directly to fetch data
https://www.cnblogs.com/kermitjam/articles/9692597.html

POST:
sends the request to the target address along with user information
https://www.cnblogs.com/login

3) Response status code:
2xx: success
3xx: redirection
4xx: client error (for example, the resource cannot be found)
5xx: server error
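
The grouping by first digit can be expressed as a small lookup. The function name `describe_status` is hypothetical, introduced only for this sketch.

```python
def describe_status(code):
    """Map an HTTP status code to the class named by its first digit."""
    classes = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit of a 3-digit code.
    return classes.get(code // 100, "other")


for code in (200, 302, 404, 500):
    print(code, describe_status(code))
```

With the requests library, the number to pass in is available as `response.status_code`.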

4) Request headers:
User-Agent: the user agent (shows that the request was sent from a real device and browser)
Cookie: the logged-in user's information (shows the target site that you are one of its users)
Referer: the URL of the previously visited page (shows that you arrived from a page on the target site)
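
A sketch of how these three headers might be attached to a request follows. The helper names `build_headers` and `fetch` are assumptions, and the User-Agent string is only a placeholder; in practice it would be copied from the browser's developer tools. The `requests` import is kept inside `fetch` so the header helper works on its own.

```python
# Placeholder browser identity; copy the real value from the Network tab.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_headers(referer, cookie=None):
    """Assemble the request headers discussed above."""
    headers = {
        "User-Agent": EXAMPLE_USER_AGENT,
        "Referer": referer,
    }
    if cookie:
        # Only send a Cookie header when there is login information to send.
        headers["Cookie"] = cookie
    return headers


def fetch(url, referer, cookie=None):
    """Send a GET request carrying the headers (needs network access)."""
    import requests  # third-party; install with: pip3 install requests

    response = requests.get(url, headers=build_headers(referer, cookie), timeout=10)
    return response.status_code, response.text
```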

5) Request body:
A POST request carries a request body, shown in the developer tools as Form Data:
{
'User': 'Tank',
'pwd': '123'
}
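
To make the form data concrete, the sketch below shows how such a dictionary is encoded into a request body and how it could be posted with requests. The `login` helper is hypothetical, and whether any particular site accepts these field names is not something the notes establish.

```python
from urllib.parse import urlencode

# The form data from the notes above.
form_data = {"User": "Tank", "pwd": "123"}

# A form-encoded body is the key=value pairs joined with "&".
encoded_body = urlencode(form_data)
print(encoded_body)  # User=Tank&pwd=123


def login(url, data):
    """Send the form data in the body of a POST request (needs network access)."""
    import requests  # third-party; install with: pip3 install requests

    # Passing a dict as data= makes requests form-encode it, as urlencode does above.
    return requests.post(url, data=data, timeout=10)
```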


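IV. Crawling Douban movies
.: matches any single character, starting from the current position
*: repeats the preceding element as many times as possible (find all)
?: placed after a quantifier, stops looking once the first match is found

.*?: non-greedy match
.*: greedy match

(.*?): extracts the data matched inside the parentheses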
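
The difference between the greedy and non-greedy forms is easiest to see on a short string. The sample strings below are made up for this sketch; only the standard `re` module is used.

```python
import re

html = '<span class="title">A</span><span class="title">B</span>'

# Greedy: .* runs to the LAST </span>, so both titles end up in one match.
greedy = re.findall(r'<span class="title">(.*)</span>', html)
print(greedy)  # ['A</span><span class="title">B']

# Non-greedy: .*? stops at the FIRST </span>, giving one match per title.
lazy = re.findall(r'<span class="title">(.*?)</span>', html)
print(lazy)  # ['A', 'B']

# By default "." does not match a newline; the re.S flag lets it do so,
# which matters when the HTML spans several lines.
text = "<p>line one\nline two</p>"
without_flag = re.findall(r"<p>(.*?)</p>", text)
with_flag = re.findall(r"<p>(.*?)</p>", text, re.S)
print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
```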

Fields to extract: movie rank, movie URL, movie title, director / starring / genre, rating, number of ratings, one-line synopsis
<div class="item">.*?<em class="">(.*?)</em>
.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>
.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>
.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>



<div class="item">
<div class="pic">
<em class="">226</em>
<a href="https://movie.douban.com/subject/1300374/">
<img width="100" alt="绿里奇迹" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p767586451.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1300374/" class="">
<span class="title">绿里奇迹</span>
<span class="title"> / The Green Mile</span>
<span class="other"> / 绿色奇迹(台) / 绿色英里</span>
</a>


<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: Frank Darabont   主演: Tom Hanks / David Morse ...<br>
1999&nbsp;/&nbsp;USA&nbsp;/&nbsp;Crime Drama Fantasy Mystery
</p>
<div class="star">




<span class="rating45-t"></span>
<span class="rating_num" property="v:average">8.7</span>
<span property="v:best" content="10.0"></span>
<span>141370人评价</span>
</div>

<p class="quote">
<span class="inq">天使暂时离开。</span>
</p>
</div>
</div>
</div>
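
As a check on the extraction idea, the sketch below applies a seven-group non-greedy pattern to a trimmed copy of the sample item (the `<img>` tag and a few spans are omitted). The names `MOVIE_PATTERN`, `FIELDS`, and `extract_movies` are assumptions made for this example. The `re.S` flag is needed because the markup spans several lines.

```python
import re

# Trimmed copy of the sample list item shown above.
SAMPLE_HTML = """
<div class="item">
<div class="pic">
<em class="">226</em>
<a href="https://movie.douban.com/subject/1300374/">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1300374/" class="">
<span class="title">绿里奇迹</span>
<span class="title"> / The Green Mile</span>
</a>
</div>
<div class="bd">
<p class="">
导演: Frank Darabont   主演: Tom Hanks / David Morse ...<br>
1999&nbsp;/&nbsp;USA&nbsp;/&nbsp;Crime Drama Fantasy Mystery
</p>
<div class="star">
<span class="rating_num" property="v:average">8.7</span>
<span property="v:best" content="10.0"></span>
<span>141370人评价</span>
</div>
<p class="quote">
<span class="inq">天使暂时离开。</span>
</p>
</div>
</div>
</div>
"""

# One capture group per field, in the order listed in FIELDS.
MOVIE_PATTERN = re.compile(
    r'<div class="item">.*?<em class="">(.*?)</em>'
    r'.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>'
    r'.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>'
    r'.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>',
    re.S,
)

FIELDS = ("rank", "url", "title", "credits", "rating", "votes", "quote")


def extract_movies(html):
    """Return one dict per list item, with surrounding whitespace stripped."""
    return [
        {name: value.strip() for name, value in zip(FIELDS, match)}
        for match in MOVIE_PATTERN.findall(html)
    ]


for movie in extract_movies(SAMPLE_HTML):
    print(movie["rank"], movie["title"], movie["rating"], movie["votes"])
# prints: 226 绿里奇迹 8.7 141370
```

One limitation of this approach: if an item has no `<span class="inq">`, the non-greedy `.*?` keeps scanning into the next item, so fields from two movies can be mixed together. An HTML parser such as BeautifulSoup4 avoids that at the cost of an extra dependency.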

Origin www.cnblogs.com/marcelo1212hala/p/11100797.html