One, the basic principles of crawlers
1. What is a crawler?
A crawler is a program that crawls (collects) data from the web.
2. What is the Internet?
A group of computers connected to one another by network devices is called the Internet.
3. Why was the Internet built?
To share and transfer data.
4. What is the data?
For example:
product listings on e-commerce platforms (Taobao, JD.com, Amazon)
housing and rental listings on platforms such as Lianjia and Ziroom
stock and fund information (East Money, Xueqiu)
...
train ticket information on 12306 (ticket grabbing)
5. What happens when we access the Internet?
An ordinary user:
open a browser
---> enter a URL
---> send a request to the target host
---> the target host returns response data
---> the browser renders the data
A crawler:
simulate a browser
---> send a request to the target host
---> the target host returns response data
---> parse and extract the valuable data
---> save the data (write to a local file,
or persist it to a database)
6. The whole crawling process
1. Send a request (request libraries: requests / Selenium)
2. Get the response data
3. Parse the data (parsing library: BeautifulSoup4)
4. Save the data (storage: local files / MongoDB)
Summary: we can liken the data on the Internet to buried treasure;
a crawler is, in fact, a treasure digger.
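The four steps above can be sketched as a short script. The URL, regex pattern, and file name here are placeholders chosen for illustration, and parsing is done with the standard-library `re` module (BeautifulSoup4 would fill the same role at step 3):

```python
import re
import requests

def get_page(url):
    # 1. send the request / 2. get the response data
    return requests.get(url).text

def parse_links(html):
    # 3. parse the data: extract every href value with a regex
    return re.findall('href="(.*?)"', html, re.S)

def save_lines(lines, path='links.txt'):
    # 4. save the data to a local file
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines))
```

Chaining them, `save_lines(parse_links(get_page('https://www.baidu.com/')))` wires all four steps together.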
Two, the requests library
1. Installation and use
pip3 install requests
2. Analyzing the request process (simulating a browser)
- Baidu:
1. Request URL
https://www.baidu.com/
2. Request method
GET
POST
3. Response status code
4. Request headers
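The pieces listed above can be inspected from Python without even sending the request, by building a prepared request. A minimal sketch (the User-Agent value is a placeholder):

```python
import requests

# Build (but do not send) a GET request to Baidu, mirroring the analysis above
req = requests.Request('GET', 'https://www.baidu.com/',
                       headers={'User-Agent': 'Mozilla/5.0'}).prepare()

print(req.method)                 # the request method: GET
print(req.url)                    # the request URL
print(req.headers['User-Agent'])  # a request header we set ourselves
```

The response status code only exists once the request is actually sent, e.g. `requests.get('https://www.baidu.com/').status_code`.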
For crawling we use the requests library:
installation:
pip3 install requests
Then import it at the top of each script:
import requests
import requests
import re
import uuid

# crawler trilogy
# 1. send a request
def get_page(url):
    response = requests.get(url)
    return response

# 2. parse the data
def parse_index(html):
    # re.findall('regex pattern', 'text to match', 'matching mode')
    # re.S: let '.' match any character, including newlines
    detail_urls = re.findall(
        '<div class="items"><a class="imglink" href="(.*?)"', html, re.S)
    return detail_urls

# parse a detail page
def parse_detail(html):
    movie_url = re.findall('<source src="(.*?)">', html, re.S)
    if movie_url:
        # return only this detail page's video URL
        return movie_url[0]

# 3. save the data
def save_video(content):
    # uuid.uuid4() generates a unique string to use as the file name
    with open(f'{uuid.uuid4()}.mp4', 'wb') as f:
        f.write(content)
        print('video downloaded')

# test
if __name__ == '__main__':
    for line in range(6):
        url = f'http://www.xiaohuar.com/list-3-{line}.html'
        # send the request
        response = get_page(url)
        # print(response)              # the response object
        # print(response.status_code)  # the response status code
        # print(response.text)         # the response text
        # parse the index page
        detail_urls = parse_index(response.text)
        # loop over the detail-page urls
        for detail_url in detail_urls:
            print(detail_url)
            # send a request to each detail page
            detail_response = get_page(detail_url)
            # parse the detail page for its video url
            movie_url = parse_detail(detail_response.text)
            # if a video url exists, print it
            if movie_url:
                print(movie_url)
                # request the video url to get the binary video stream
                movie_response = get_page(movie_url)
                # save the binary stream locally
                save_video(movie_response.content)
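The `re.findall` calls above do the heavy lifting, so here is a standalone illustration of how the non-greedy `(.*?)` capture group and the `re.S` flag behave (the HTML snippet is made up):

```python
import re

html = '''<div class="items">
<a class="imglink" href="/detail-1.html">first</a></div>
<div class="items">
<a class="imglink" href="/detail-2.html">second</a></div>'''

# Without re.S, '.' stops at newlines and nothing would match here;
# with it, the non-greedy '.*?' crosses the line break inside each div.
urls = re.findall('<div class="items">.*?<a class="imglink" href="(.*?)"',
                  html, re.S)
print(urls)  # ['/detail-1.html', '/detail-2.html']
```

`re.findall` returns only the text of the capture group, one entry per match, which is why the crawler gets back a plain list of URLs.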
Three, POST request: automatically signing in to GitHub
"""
Request URL: https://github.com/login
Request headers:
cookies
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36
Request body:
Form Data:
commit: Sign in
utf8: ✓
authenticity_token: mOyooOxE7c/H7krQKEli+cwwy0+okuPOZqGEBSyu/kRfLHT2mNGu9RQcDnb1ovua1zQe3LOyYXxrWxFL+2aAcg==
login: gr6g5r
password: huhuhhge
webauthn-support: supported
"""
import requests
import re

# 1. GET the login page to obtain the token and cookies
login_url = 'https://github.com/login'
login_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
login_res = requests.get(url=login_url, headers=login_header)

# parse out the authenticity_token string
authenticity_token = re.findall(
    '<input type="hidden" name="authenticity_token" value="(.*?)"',
    login_res.text,
    re.S
)[0]
print(authenticity_token)

# grab the cookies set by the login page
login_cookies = login_res.cookies.get_dict()

# 2. POST the login form to the session URL
session_url = 'https://github.com/session'
session_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
form_data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '*******',
    'password': '*******',
    'webauthn-support': 'supported'
}
session_res = requests.post(
    url=session_url,
    headers=session_headers,
    cookies=login_cookies,
    data=form_data
)
with open('github.html', 'w', encoding='utf-8') as f:
    f.write(session_res.text)
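Carrying `login_cookies` into the POST by hand works, but `requests.Session` does the cookie bookkeeping automatically across requests. A minimal sketch, with the GitHub calls shown only as comments so nothing is sent over the network here:

```python
import requests

session = requests.Session()
# Headers set on the session are sent with every request it makes
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# In the login flow above this would become:
#   session.get('https://github.com/login')                      # cookies land in the jar
#   session.post('https://github.com/session', data=form_data)   # jar is sent back
# Cookies received by one request are replayed on the next automatically:
session.cookies.set('seen_login_page', 'yes')
print(session.cookies.get('seen_login_page'))  # yes
```

With a session there is no need for `login_res.cookies.get_dict()` or the explicit `cookies=` argument at all.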