03 Python Crawlers

Today's content:

I. Crawler principles

II. The Requests request library

I. Crawler principles

1. What is the Internet?

In fact, the Internet is just a pile of network devices (network cables, routers, switches, firewalls, and so on) connecting computers together into something like a spider's web.
2. The purpose of the Internet

The core value of the Internet lies in data. Data is stored on individual computers, and the Internet links those computers together so that they can conveniently transfer and share data with each other; otherwise you could only copy data from other people's computers onto a USB stick or portable hard drive.

3. What is data?

Product information on JD and Taobao ...

Securities and investment information on Eastmoney and Xueqiu ...

Housing information on Lianjia and Ziroom ...

Train ticket information on 12306 ...
4. The whole process of the Internet:

1) How an ordinary user accesses data:
the browser submits a request ---> downloads the page code ---> parses/renders it into a page.

2) How a crawler accesses data:
simulates a browser to send a request ---> downloads the page code ---> extracts only the useful data ---> stores it in a database or file.

The difference between an ordinary user and a crawler:

An ordinary user opens a browser to visit a web page, and the browser receives all of the data.

A crawler retrieves only the valuable data from the page code.

5. What does the browser send in a request?

The HTTP request protocol.

- Client:

The browser is a piece of software -> client IP and port

- Server:

www.jd.com (domain name) -> DNS resolution -> server IP and port

https = http + ssl, e.g. https://jd.com

The client IP and port send a request to the server IP and port to establish a connection and obtain the corresponding data.
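The "DNS resolution" step above can be seen directly from Python's standard library. This is a minimal sketch; it resolves `localhost` so it works offline, and the real-domain lookup is left commented because it needs network access:

```python
import socket

# the "DNS resolve" step: map a hostname to the server IP the client
# will connect to; 'localhost' is used here so no network is required
ip = socket.gethostbyname('localhost')
print(ip)  # 127.0.0.1

# a real run would resolve the domain from the example (needs network):
# print(socket.gethostbyname('www.jd.com'))
```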

6. The whole crawling process

- Send a request (requires a request library: the Requests library or the Selenium library)

- Fetch the response data (as long as the request reaches the server, the server returns response data)

- Parse and extract the data (requires a parsing library: re, BeautifulSoup4, XPath ...)

- Save the data locally (files, databases, a MongoDB repository)
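The four steps above can be sketched end to end. This is only an illustration: the fetch step is stubbed out with a fake page (a real run would call `requests.get`), and the `video_` regex is the pear-video pattern used later in these notes:

```python
import re

def fetch(url):
    # steps 1-2: send the request and get the response data
    # (stubbed here; in practice: return requests.get(url).text)
    return '<html><a href="video_123">demo</a></html>'

def parse(text):
    # step 3: extract only the useful data with a parsing library (re)
    return re.findall('<a href="video_(.*?)"', text)

def save(items):
    # step 4: save locally (a file here; could also be a database)
    with open('ids.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(items))

ids = parse(fetch('https://www.pearvideo.com/'))
print(ids)  # ['123']
save(ids)
```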

II. The Requests request library

1. Installation and use

- Open cmd

- Enter: pip3 install requests

 

'''
1. Response status

    200: success

    301: redirect

    404: file does not exist

    403: permission denied

    502: server error

2. Response headers

    Set-Cookie: there may be more than one; it tells the browser to save the cookie.

3. Page source: preview

    The most important part; it contains the content of the requested resource,
    such as the page HTML, images, and binary data.
'''
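As a small illustration of the status codes listed above, here is a hypothetical helper (not part of the requests library) that maps a code to its meaning:

```python
# map the status codes from the notes above to their meanings
# (a hypothetical helper for illustration only)
def describe_status(code):
    if code == 404:
        return 'file does not exist'
    if code == 403:
        return 'permission denied'
    if 200 <= code < 300:
        return 'success'
    if 300 <= code < 400:
        return 'redirect'
    if 500 <= code < 600:
        return 'server error'
    return 'other'

print(describe_status(200))  # success
print(describe_status(502))  # server error
```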

2. Crawling a video

 

3. Packet capture analysis

 

Open the browser's developer mode (Inspect) ----> select the Network tab and find the visited page's suffix xxx.html (response text)

 

1) Request URL (the address of the site being visited)

2) Request method:

    GET:
    directly sends a request to fetch data

    https://www.cnblogs.com/kermitjam/articles/9692597.html

    POST:
    needs to carry user information when sending the request to the target address

    https://www.cnblogs.com/login

3) Response status code:

    2xx: success

    3xx: redirection

    4xx: resource not found

    5xx: server error

4) Request headers:

    User-Agent: user agent (proves the request was sent by a real computer device and browser)

    Cookies: real user login information (proves you are a user of the target site)

    Referer: the url of the previous visit (proves you jumped to this page from the target site)

5) Request body:

    Only POST requests carry a request body.

    For example, POST to https://www.cnblogs.com/login with:

    Form Data
    {
        'user': 'tank',
        'pwd': '123'
    }
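The request line, headers, and body described above can be inspected with requests without actually sending anything, by building a prepared request. This is only a sketch: the login URL and form fields are the example's, not a real cnblogs API.

```python
import requests

# build (but do not send) the POST request from the example above
req = requests.Request(
    'POST',
    'https://www.cnblogs.com/login',
    data={'user': 'tank', 'pwd': '123'},        # the request body (Form Data)
    headers={
        'User-Agent': 'Mozilla/5.0',            # user agent
        'Referer': 'https://www.cnblogs.com/',  # url of the previous visit
    },
)
prepared = req.prepare()
print(prepared.method)  # POST
print(prepared.body)    # user=tank&pwd=123
```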

# import requests  # import the requests library

# # send a request to the Baidu homepage and get the response object
# response = requests.get(url='http://www.baidu.com/')

# # set the character encoding to utf-8
# response.encoding = 'utf-8'

# # print the response text
# print(response.text)

# # write the response text to a local file
# with open('baidu.html', 'w', encoding='utf-8') as f:
#     f.write(response.text)

# # video example
# # 1. pear video
# import requests
#
# response = requests.get('https://video.pearvideo.com/mp4/adshort/20190625/cont-1569974-14053625_adpkg-ad_hd.mp4')
# print(response.content)

# # save the video locally
# with open('video.mp4', 'wb') as f:
#     f.write(response.content)
import requests
import re  # the re regular-expression module, used for parsing text

# 1. first send a request to the pear video homepage
response = requests.get('https://www.pearvideo.com/')
# print(response.text)

# use a regular expression to match the video detail-page IDs
# argument 1: the regex rule to match
# argument 2: the text to parse
# argument 3: the matching mode
res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
# print(res_list)

# splice together each video's detail-page url
for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    # print(detail_url)

    # send a request to each detail page to get the video data
    response = requests.get(url=detail_url)
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    print(video_url)

    video_name = re.findall('<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]
    print(video_name)

    v_response = requests.get(video_url)
    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
        print(video_name, 'video crawled successfully')

 

IV. Crawling Douban movies

 

.: matches any character, starting from the current position

*: matches any number of times (finds all)

?: matches the first occurrence only, then stops looking

.*?: non-greedy match    .*: greedy match

(.*?): extracts the data inside the parentheses
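The greedy vs. non-greedy difference is easy to see on a small sample (the span tags below mirror the Douban title markup matched later):

```python
import re

text = '<span class="title">A</span><span class="title">B</span>'

# greedy: .* runs as far as possible, swallowing the middle tags
print(re.findall('<span class="title">(.*)</span>', text))
# ['A</span><span class="title">B']

# non-greedy: .*? stops at the first closing tag
print(re.findall('<span class="title">(.*?)</span>', text))
# ['A', 'B']
```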

 

Fields to extract: movie rank, movie url, movie name, director - starring - type, rating, number of reviews, movie synopsis.

<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?<span class="rating_num.*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>

# 1. send the request
# 2. parse the data
# 3. save the data
# https://movie.douban.com/top250?start=50&filter=
# https://movie.douban.com/top250?start=125&filter=
# https://movie.douban.com/top250?start=150&filter=

import requests
import re

def get_page(base_url):
    response = requests.get(base_url)
    return response

def parse_index(text):
    res = re.findall(
        '<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">'
        '.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>'
        '.*?<span class="rating_num.*?>(.*?)</span>.*?<span>(.*?)人评价</span>'
        '.*?<span class="inq">(.*?)</span>',
        text, re.S)
    return res

def save_data(data):
    with open('douban.txt', 'a', encoding='utf-8') as f:
        f.write(data)
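A driver tying the three functions together might look like the sketch below. The pagination follows the `?start=` pattern in the commented URLs above (25 movies per page, 10 pages); note that in practice Douban may also require a User-Agent header, and the full network run is left commented:

```python
def page_urls(pages=10):
    # the Top 250 spans 10 pages of 25 movies, offset via ?start=
    return ['https://movie.douban.com/top250?start=%d&filter=' % (n * 25)
            for n in range(pages)]

for url in page_urls(2):
    print(url)

# a full run (needs network, and likely a User-Agent header) would be:
# for url in page_urls():
#     for movie in parse_index(get_page(url).text):
#         save_data(str(movie) + '\n')
```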

 

 

 


Origin www.cnblogs.com/wzqzqw/p/11094433.html