Learning Link http://stu.ityxb.com/openCourses/detail/238
What are reptiles:
Web crawler is to simulate a network browser sends a request to accept a request response information automatically crawl the Internet according to certain rules
Reptile uses:
Data Acquisition (Baidu news, headlines today), 12306 grab votes, the network automatically vote,
Debugging tools:
Fn + F12
Browser request process:
URL rules
、
http request
http request is an important part of
Request URL, the request mode (post, GET), the request header, the request body
http response format
http an important part of the response
Response status code: 404, 500, 200 (successful)
Responsive head,
Response body (html content)
Ruquests module
Python is a module that can simulate the browser sends the request acquisition response
Learning materials:
http://cn.python-requests.org/zh_CN/latest/
installation
pip install requests
Website crawling steps:
Step one: analysis
Request url, mode request, the first request, request parameters
Step II: Analog browser sends a request acquisition response
'' ' URL https://www.baidu.com/baidu?wd=%E7%9F%B3%E5%AE%B6%E5%BA%84%E5%AD%A6%E9%99%A2 request method get request header User-Agent: Mozilla / 5.0 ( Windows NT 10.0; Win64; x64; rv: 74.0) Gecko / 20100101 Firefox / 74.0 request parameter wd =% E7% 9F% B3 % E5% AE% B6% E5% BA? the AD E5% 84%%%% 99% E9% A6 A2 '' ' # 1. import module import requests # 2. analog transmission request acquirer response = requests.get ( URL = " https://www.baidu.com / baidu / S " , headers = { " the User-- Agent " : "Mozilla / 5.0 (the Windows NT 10.0; Win64; x64-; RV: 74.0) the Gecko / Firefox 20,100,101 / 74.0 " , } ) # 3. The results of the content of the response processing with Open ( ' fetch response content .html ' , ' W ' , = encoding ' UTF8 ' ) AS F: f.write (response.text)
Implement a custom request parameter