day03 beginner crawlers

One, the principle of crawlers
1, what is the Internet?
    A bunch of network devices connecting computers together; this network of interconnected computers is called the Internet.
2, what was the Internet built for?
    The purpose is to share and transfer data.
3, what is data?
    Product information on Taobao, JD.com...
    Securities and investment information on East Money, Xueqiu...
    Housing listings on Lianjia, Ziroom...
    Ticketing information on 12306...
4, the whole process of using the Internet:
    Ordinary user: open a browser --> send a request to a target site --> fetch the response data --> the browser renders the data
    Crawler: simulate a browser --> send a request to a target site --> fetch the response data --> extract the useful data --> persist the data to storage
5, what does the browser send?
    An HTTP protocol request.
    Client:
        The browser is a piece of software; the client has an IP and a port.
    Server:
        http://www.jd.com/
        www.jd.com (JD.com's domain name) --> DNS resolution --> JD.com server's IP and port
        https = http + ssl: https://www.jd.com
    With the client's IP and port and the server's IP and port, the client can establish a connection, send requests, and fetch response data.
6, the whole crawling process (see the minimal sketch after this list):
    Send a request (needs a request library: the requests library, or the selenium library)
    Fetch the response data (as long as you send a request to the server, the server returns response data)
    Parse and extract the data (needs a parsing library: re, BeautifulSoup4, XPath ...)
    Save it locally (file handling, databases, MongoDB)
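
A minimal sketch of those four steps in one place (the target URL here is just an illustrative placeholder, not from the original notes):

import re        # parsing library
import requests  # request library

# 1, send a request (placeholder URL for illustration)
response = requests.get('https://example.com/')

# 2, fetch the response data
html = response.text

# 3, parse and extract data (here: just the page title)
titles = re.findall('<title>(.*?)</title>', html, re.S)

# 4, persist the data locally
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(str(titles))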
Two, the requests library
1, installation and use
    Open cmd
    Input: pip3 install requests
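
A quick way to confirm the install worked (a minimal check, not part of the original notes):

import requests

# if the import succeeds, the library is installed
print(requests.__version__)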
2, grabbing videos
3, packet capture analysis
    Open the browser's developer mode ---> select the Network tab
    Find the page you visited, with the suffix xxx.html (the response text)
    1) request url (the address of the website you visited)

2) request method
    GET:
        Sends a request to fetch a resource directly (see the sketch after this block)
        https://www.cnblogs.com/kermitjam/articles/9692597.html
    POST:
        Needs to carry user information when sending a request to the target address
        https://www.cnblogs.com/login
        {
            'user': 'tank',
            'pwd': '123'
        }
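
For instance, a plain GET request for the article URL above (a minimal sketch using the requests library):

import requests

# GET: fetch a resource directly, no request body needed
response = requests.get('https://www.cnblogs.com/kermitjam/articles/9692597.html')
print(response.status_code)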

3) response status code
2xx: Success
3xx: Redirection
4xx: resource not found
5xx: server error
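
Checking the status code in requests (a minimal sketch; the Baidu URL also appears later in these notes):

import requests

response = requests.get('https://www.baidu.com/')
# 2xx means success; 4xx/5xx mean client/server errors
print(response.status_code)  # e.g. 200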

4) request headers
    user-agent: user agent (proves the request was sent by a real computer device and browser)
    cookies: the real user's login information (proves you are a real user of the target site)
    referer: the url of the previous visit (proves you jumped to the target from within the site)
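
These headers can be passed to requests as a dict; the values below are placeholders standing in for what you would copy from the browser's Network tab, not values from the original notes:

import requests

headers = {
    'user-agent': 'Mozilla/5.0',          # pretend to be a real browser
    'referer': 'https://www.baidu.com/',  # the page we claim to come from
    'cookie': 'session=placeholder',      # placeholder login info
}
response = requests.get('https://www.baidu.com/', headers=headers)
print(response.status_code)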

5) request body
    Only POST requests carry a request body
    Form Data
    {
        'user': 'tank',
        'pwd': '123'
    }
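
Sending that form data with requests.post (a sketch; the login URL is the one from the POST example above and may not actually accept these fields):

import requests

# form data carried in the POST request body
form_data = {
    'user': 'tank',
    'pwd': '123',
}
response = requests.post('https://www.cnblogs.com/login', data=form_data)
print(response.status_code)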

Four, crawling Douban movies

.    : match any single character, starting from the current position

*    : find all (zero or more of the preceding character)

?    : stop after finding the first match

.*   : greedy match
.*?  : non-greedy match

(.*?): extract the data inside the parentheses
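
A tiny demo of greedy vs non-greedy matching (illustrative, not from the original notes):

import re

text = '<b>one</b><b>two</b>'
print(re.findall('<b>(.*)</b>', text))   # greedy: ['one</b><b>two']
print(re.findall('<b>(.*?)</b>', text))  # non-greedy: ['one', 'two']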

Movie ranking, movie url, movie name, director - starring - type, rating, number of reviewers, synopsis:

<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>

(导演 = director, 人评价 = people rated; the literals must match Douban's Chinese page source.)



requests:

import requests  # import the requests library

# send a request to the Baidu homepage and get a response object
response = requests.get(url='https://www.baidu.com/')
# set the character encoding to utf-8
response.encoding = 'utf-8'
# print the response text
print(response.text)
# write the response text to a local file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)


crawling short videos:
'''
Pear Video
'''

# import requests
# video_url = 'https://www.pearvideo.com/video_1570107'
# response = requests.get(url=video_url)
# print(response.text)

# send a request to the video source address
# response = requests.get('https://video.pearvideo.com/mp4/adshort/20190625/cont-1570107-14054821_adpkg-ad_hd.mp4')
# print the binary stream, e.g. for pictures, videos, etc.
# print(response.content)
# save the video locally
# with open('video 1.mp4', 'wb') as f:
#     f.write(response.content)

'''
1, first send a request to the Pear Video homepage
    https://www.pearvideo.com/
    parse and fetch all the video ids
        video_1570302
        re.findall()
2, splice together the video detail page urls:
    "Thrilling! Man robbed on the subway escalator, chased down on foot"
        https://www.
    "The puzzle of the qanats"
        https://www.pearvideo.com/video_1570107
'''

import requests
import re  # regular expressions, used to parse text data

# 1, first send a request to the Pear Video homepage
response = requests.get('https://www.pearvideo.com/')
print(response.text)

# use an re regex to match the video ids
# parameter 1: the regex matching rule
# parameter 2: the text to parse
# parameter 3: the matching mode
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
print(res_list)

# splice together each video detail page url
for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    # print(detail_url)

    # send a request to each video detail page to get the video source url
    response = requests.get(url=detail_url)
    # print(response.text)

    # parse and extract the video source url from the detail page
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    print(video_url)

    # video name
    video_name = re.findall('<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]
    print(video_name)

    # send a request to the video url and fetch the binary video stream
    v_response = requests.get(video_url)

    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
        print(video_name, 'video download complete')




crawling Douban:
'''
1, send requests
2, parse the data
3, save it locally
'''

import requests
import re

# the crawler trilogy
# 1, send a request
def get_page(base_url):
    response = requests.get(base_url)
    return response

# 2, parse the text
def parse_index(text):
    # 导演 = director, 人评价 = people rated: the literals match Douban's Chinese page source
    res = re.findall(
        '<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>',
        text,
        re.S)
    # print(res)
    return res

# 3, save the data
def save_data(data):
    with open('douban.txt', 'a', encoding='utf-8') as f:
        f.write(data)

# type main + Enter
if __name__ == '__main__':
    # num = 10
    # base_url = f'https://movie.douban.com/top250?start={num}&filter='
    num = 0
    for line in range(10):
        base_url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(base_url)

        # 1, send a request, calling the function
        response = get_page(base_url)
        # 2, parse the text
        movie_list = parse_index(response.text)
        # 3, save the data
        # data formatting
        for movie in movie_list:
            # print(movie)

            # unpack and assign the fields
            # movie ranking, movie url, movie name, director - starring - type, rating, number of reviewers, synopsis
            v_top, v_url, v_name, v_daoyan, v_point, v_num, v_desc = movie
            # v_top = movie[0]
            # v_url = movie[1]
            movie_content = f'''
ranking: {v_top}
...
'''
            # fill in the remaining fields (v_url, v_name, v_daoyan, v_point, v_num, v_desc) in the same way
            print(movie_content)
            # save the data
            save_data(movie_content)
 
