Day03 Python

       Today is my third day of practice. Compared with the previous two days the learning curve was much steeper, but the material was very interesting: I learned how to crawl URLs, videos, and pictures. It still feels difficult, and I have not fully absorbed it yet; there are quite a few parts I don't understand.

The following are my study notes:
 
Crawler principles:
What is the Internet?
The Internet is a bunch of network devices (network cables, routers, switches, firewalls, etc.) that link computers together into one big network.
What was the Internet built for?
The purpose of the Internet is data transfer and data sharing.
What counts as data?
For example:
     product information on Taobao and Jingdong (JD)
     securities and investment information on East Money and Xueqiu
     housing listings on Lianjia and Ziroom
     train ticket information on 12306
 
The whole process of going online:
  - ordinary user:
     open the browser -> send a request to the target site -> fetch the response data -> render it in the browser
  - crawler:
     simulate the browser -> send a request to the target site -> fetch the response data -> extract the valuable data -> persist the data
 
What kind of request does the browser send?
An HTTP protocol request.
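
A minimal sketch of what such a request looks like before it goes out: prepare a GET request with the Requests library (used later in these notes) and inspect its method, URL, and default headers. This is my own illustration, not part of the course notes.

import requests

# Build and prepare a GET request without actually sending it.
session = requests.Session()
prepared = session.prepare_request(requests.Request('GET', 'https://www.baidu.com/'))
print(prepared.method, prepared.url)  # GET https://www.baidu.com/
print(prepared.headers)               # default headers, including User-Agent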
 
The client:
The browser is a piece of software -> the client's IP and port
 
The server:
   https://www.jd.com/
   www.jd.com (Jingdong's domain name) -> DNS resolution -> Jingdong's server IP and port
The client's IP and port send a request to the server's IP and port; once the connection is established, the client can fetch the response data.
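
A minimal sketch of the DNS step, using Python's standard library (my own illustration, not part of the course notes):

import socket

# Resolve the Jingdong domain name to a server IP address.
ip = socket.gethostbyname('www.jd.com')
print('www.jd.com ->', ip)  # an IP from Jingdong's servers/CDN
print('HTTPS port:', 443)   # https:// implies port 443 by convention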
 
The whole crawler process (a minimal sketch follows this list):
     - send a request (needs a request library: the Requests library, the Selenium library)
     - fetch the response data (as long as the request reaches the server, the server will return response data)
     - parse and extract the data (needs a parsing library: re, BeautifulSoup4, XPath ...)
     - save the data locally (file handling, or a database such as MongoDB)
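
A minimal sketch of the four steps, assuming the cnblogs home page as the target; the regex is illustrative only:

import re
import requests

response = requests.get('https://www.cnblogs.com/')    # 1. send a request
html = response.text                                   # 2. fetch the response data
titles = re.findall(r'<a[^>]*>(.*?)</a>', html)        # 3. parse and extract (crude regex)
with open('titles.txt', 'w', encoding='utf-8') as f:   # 4. save locally
    f.write('\n'.join(titles))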
 
 
 
2. The Requests request library
Installation and usage:
open cmd
enter: pip3 install requests
 
Crawling videos
Candidate sites:
    Pear Video
    Xiaohua network videos
 
    First, send a request to the Pear Video home page:
     https://www.pearvideo.com/

     Parse the page and obtain every video id, e.g.:
        video_1570302

        using re.findall() (see the sketch below)
    Then fetch each video's detail page, e.g.:
     "Huawei Mate20X wins the first network-access license for 5G equipment"
     https://www.pearvideo.com/video_1570259
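
A hedged sketch of this id-extraction step; the regex depends on the page's current HTML, so treat it as an assumption:

import re
import requests

# Fetch the home page and pull out the numeric video ids.
index_page = requests.get('https://www.pearvideo.com/').text
video_ids = re.findall(r'href="video_(\d+)"', index_page)  # assumed link format
for vid in video_ids:
    print(f'https://www.pearvideo.com/video_{vid}')  # e.g. .../video_1570259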
 
Packet capture:
     open the browser's developer mode (inspect) ----> select Network
     find the visited page ending in xxx.html (the response file)
 
1) request url (the address of the site being visited)
2) request method:
    GET:
        directly send a request to fetch data
        https://www.cnblogs.com/kermitjam/p/10863916.html
    POST:
        needs to carry user information when sending a request to the target address
        https://www.cnblogs.com/login
        {
            'user': 'tank',
            'pwd': '123'
        }
3) response status codes:
    2xx: success
    3xx: redirection
    4xx: resource not found
    5xx: server error
4) request headers:
    User-Agent: user agent (proves the request was sent from a real computer device and browser)
    Cookies: a real user's login information (proves you are a user of the target site)
    Referer: the url of the previous visit (proves you jumped here from a page on the target site)
5) request body:
    only POST requests have a request body.
    Form Data
    {
        'user': 'tank',
        'pwd': '123'
    }
   (a combined sketch follows this list)
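
A hedged sketch combining the pieces above: a GET carrying browser-like request headers, and a POST carrying a form body. The header values and the login URL/fields are illustrative assumptions, not real credentials or a real endpoint:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0',            # look like a real browser
    'Referer': 'https://www.cnblogs.com/',  # the page we supposedly came from
}
response = requests.get('https://www.cnblogs.com/', headers=headers)
print(response.status_code)                 # 2xx on success

login_response = requests.post(
    'https://www.cnblogs.com/login',        # hypothetical login endpoint
    data={'user': 'tank', 'pwd': '123'},    # form body, as in item 5 above
)
print(login_response.status_code)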
Example 1: fetch Baidu's home page and save it locally.

import requests

# Send a GET request to Baidu's home page.
response = requests.get(url='https://www.baidu.com/')
response.encoding = 'utf-8'  # set the encoding so Chinese text decodes correctly

print(response.text)  # the response body as text (the page's HTML)

# Persist the page to a local .html file.
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)


Example 2: download a video file from Pear Video.

import requests

# Send a GET request for the mp4 file's direct URL.
response = requests.get(
    'https://video.pearvideo.com/mp4/adshort/20190625/cont-1570353-14057692_adpkg-ad_hd.mp4'
)

print(response.content)  # the raw response bytes

# Binary content must be written in 'wb' mode.
with open('video.mp4', 'wb') as f:
    f.write(response.content)
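
A hedged variant for large videos: stream the response and write it in chunks, so the whole file never has to sit in memory. stream=True and iter_content() are standard Requests features; the chunk size is an arbitrary choice.

import requests

response = requests.get(
    'https://video.pearvideo.com/mp4/adshort/20190625/cont-1570353-14057692_adpkg-ad_hd.mp4',
    stream=True,  # don't download the body until we iterate over it
)

with open('video.mp4', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024 * 64):  # 64 KB chunks
        f.write(chunk)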

 
