- Preparations
Install PyCharm and the requests library.
- requests crawler (template)
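import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for a 4xx or 5xx status
        r.encoding = r.apparent_encoding  # use the encoding inferred from the content
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))  # print the content of the page at url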
When starting to write a crawler, always check the response status first. If the server returns 404, for example, the URL or the request can be corrected right away.
- The attributes of the Response object
r.status_code: The return status of the HTTP request; 200 indicates success, and 404 indicates failure (resource not found).
r.text: The string form of the HTTP response content, that is, the page content corresponding to the URL.
r.encoding: The response content encoding guessed from the HTTP header. For text content with no charset in the header, it defaults to ISO-8859-1.
r.apparent_encoding: The response content encoding inferred from the content itself (the fallback encoding).
r.content: The binary form of the HTTP response content.
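The attributes above can be explored without touching the network. The following is a minimal sketch, assuming a Response object built by hand: `describe_response` is a hypothetical helper, and setting `_content` directly is a testing shortcut rather than a public API. It shows why the template assigns `r.apparent_encoding` to `r.encoding`: the same bytes decode into garbled or readable text depending on the encoding in use.

```python
import requests


def describe_response(r):
    """Collect the main attributes of a Response object into a dict."""
    return {
        "status_code": r.status_code,              # e.g. 200 on success
        "encoding": r.encoding,                    # encoding currently used by r.text
        "apparent_encoding": r.apparent_encoding,  # encoding inferred from the bytes
        "content_length": len(r.content),          # size of the raw bytes
        "text_preview": r.text[:60],               # decoded string form
    }


# Build a Response by hand so the sketch runs offline.
# (Assigning _content is a testing shortcut, not a public API.)
fake = requests.Response()
fake.status_code = 200
fake._content = "<html><body>你好，世界</body></html>".encode("utf-8")

# Simulate a header without a charset: requests falls back to ISO-8859-1.
fake.encoding = "ISO-8859-1"
garbled = fake.text       # UTF-8 bytes decoded as Latin-1

# In real code this would be: r.encoding = r.apparent_encoding
fake.encoding = "utf-8"
readable = fake.text      # the same bytes, decoded correctly
```

Printing `garbled` and `readable` side by side makes the difference visible; `describe_response(fake)` gives a quick summary of one response.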
- Understanding the exceptions of the requests library
requests.ConnectionError: Network connection error, such as a failed DNS query or a refused connection.
requests.HTTPError: HTTP error exception.
requests.URLRequired: URL missing exception.
requests.TooManyRedirects: exceeding the maximum number of redirects, a redirect exception is generated.
requests.ConnectTimeout: Timed out while connecting to the remote server.
requests.Timeout: The request to the URL timed out, resulting in a timeout exception.
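A sketch of how these exceptions might be told apart in practice. The function name `fetch` and its return convention are assumptions made for illustration. One detail worth noting: `requests.ConnectTimeout` is a subclass of both `requests.ConnectionError` and `requests.Timeout`, so it has to be caught before either of them.

```python
import requests


def fetch(url, timeout=10):
    """Return (text, error); exactly one of the two is None."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # turn 4xx/5xx into HTTPError
        r.encoding = r.apparent_encoding
        return r.text, None
    except requests.ConnectTimeout:
        # Must come first: it is both a ConnectionError and a Timeout.
        return None, "timed out while connecting"
    except requests.Timeout:
        return None, "request timed out"
    except requests.ConnectionError:
        return None, "network connection error"
    except requests.HTTPError as e:
        return None, f"HTTP error: {e}"
    except requests.TooManyRedirects:
        return None, "too many redirects"
    except requests.RequestException as e:
        # Base class of all requests exceptions (URLRequired, invalid URLs, ...).
        return None, f"request failed: {e}"


# A URL without a scheme is rejected before any network access happens.
text, error = fetch("not-a-url")
```

Catching `requests.RequestException` last keeps the function from leaking any library exception while still reporting the specific cases separately.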
- The 7 main methods of the requests library
requests.request(): Constructs a request; the base method that supports all of the methods below.
requests.get(): The main method for obtaining HTML pages, corresponding to HTTP GET.
requests.head(): A method to obtain the header information of an HTML page, corresponding to HTTP HEAD.
requests.post(): A method to submit a POST request to an HTML page, corresponding to HTTP POST.
requests.put(): A method to submit a PUT request to an HTML page, corresponding to HTTP PUT.
requests.patch(): Submits a partial modification request to an HTML page, corresponding to HTTP PATCH.
requests.delete(): Submits a delete request to an HTML page, corresponding to HTTP DELETE.
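To see what each method puts on the wire without sending anything, a request can be prepared and inspected. This is a rough sketch: `preview` is a hypothetical helper and the URL is a placeholder. `requests.Request(...).prepare()` builds the same kind of prepared request that `requests.get()`, `requests.post()` and the others construct internally.

```python
import requests

URL = "http://example.com/resource"  # placeholder URL, never contacted


def preview(method, url, **kwargs):
    """Prepare (but do not send) a request and summarise it."""
    p = requests.Request(method, url, **kwargs).prepare()
    return p.method, p.url, p.body


previews = {
    # GET and HEAD carry no body; query parameters go into the URL.
    "GET": preview("GET", URL, params={"q": "python"}),
    "HEAD": preview("HEAD", URL),
    # POST, PUT and PATCH submit data in the request body.
    "POST": preview("POST", URL, data={"name": "alice"}),
    "PUT": preview("PUT", URL, data={"name": "alice"}),
    "PATCH": preview("PATCH", URL, data={"name": "bob"}),
    # DELETE only names the resource to remove.
    "DELETE": preview("DELETE", URL),
}
```

Sending for real is then a matter of calling the matching shortcut, for example `requests.head(URL, timeout=10)` to read only `r.headers`.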
- HTTP protocol
HTTP stands for Hypertext Transfer Protocol.
HTTP is a stateless application layer protocol based on the "request and response" mode. The HTTP protocol uses a URL as an identifier for locating network resources. The format of the URL is as follows:
http://host[:port][path]
host: legal Internet host domain name or IP address
port: port number, default port is 80
path: path to request resources
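The three parts of the URL can be pulled apart with the standard library. The sketch below uses `urllib.parse.urlsplit`; the helper name `split_http_url` and the port 8080 in the example are assumptions for illustration, and the https default of 443 is added alongside the http default of 80 mentioned above.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def split_http_url(url):
    """Split a URL into the host, port and path parts described above."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        # Fall back to the scheme's default port when none is given.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        # An empty path means the root of the site.
        "path": parts.path or "/",
    }


explicit = split_http_url("http://www.baidu.com:8080/s")
default = split_http_url("http://www.baidu.com")
```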
- HTTP protocol operation on resources
GET: request to obtain the resource at the URL location.
HEAD: Request to obtain the response headers for the resource at the URL location, that is, to obtain the resource's header information.
POST: Request to append new data to the resource at the URL location.
PUT: Request to store a resource to the URL location, overwriting the resource at the original URL location.
PATCH: Request to partially update the resource of the URL location, that is, change part of the content of the resource there.
DELETE: Request to delete the resource stored in the URL location.
- requests.request(method, url, **kwargs)
method: request method, one of 7 kinds such as GET / PUT / POST
url: URL of the page to be obtained
**kwargs: parameters that control access, 13 in total
method: request method
r=requests.request('GET',url,**kwargs)
r=requests.request('HEAD',url,**kwargs)
r=requests.request('POST',url,**kwargs)
r=requests.request('PUT',url,**kwargs)
r=requests.request('PATCH',url,**kwargs)
r=requests.request('DELETE',url,**kwargs)
r=requests.request('OPTIONS',url,**kwargs)
**kwargs: parameters that control access; all of them are optional
params: dictionary or byte sequence, added to the url as a parameter.
data: Dictionary, byte sequence or file object, as the content of the Request.
json: Data in JSON format, as the content of the Request.
headers: dictionary, custom HTTP headers.
cookies: dictionary or CookieJar, cookies in Request.
auth: tuple, supports HTTP authentication.
files: Dictionary type, transfer files.
timeout: The timeout, in seconds.
proxies: dictionary, sets the proxy servers used for access; login authentication can be included.
allow_redirects: True / False, the default is True; redirect switch.
stream: True / False, the default is False; when False the response body is downloaded immediately, and when True the download is deferred until the content is accessed.
verify: True / False, the default is True; SSL certificate verification switch.
cert: Local SSL certificate path.
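A sketch of how some of these parameters shape a request, again using prepared requests so nothing is sent. The URL, the header value and the helper names `prepared` and `fetch_page` are placeholders chosen for illustration. Only `params`, `data`, `json`, `headers`, `cookies`, `auth` and `files` affect the prepared request; `timeout`, `proxies`, `allow_redirects`, `stream`, `verify` and `cert` take effect when the request is actually sent.

```python
import json

import requests

URL = "http://example.com/api"  # placeholder URL, never contacted


def prepared(method, **kwargs):
    """Build a prepared request without sending it."""
    return requests.Request(method, URL, **kwargs).prepare()


# params: appended to the URL as a query string
with_params = prepared("GET", params={"key1": "value1", "key2": "value2"})

# data: form-encoded into the request body
with_data = prepared("POST", data={"key1": "value1"})

# json: serialised to JSON, with a matching Content-Type header
with_json = prepared("POST", json={"key1": "value1"})

# headers: custom HTTP headers, e.g. to identify the crawler
with_headers = prepared("GET", headers={"User-Agent": "my-crawler/0.1"})


def fetch_page(url):
    """Send-time parameters are passed straight to requests.request()."""
    return requests.request(
        "GET",
        url,
        timeout=10,            # seconds
        allow_redirects=True,  # follow redirects (the default)
        stream=False,          # download the body immediately (the default)
        verify=True,           # verify the SSL certificate (the default)
    )
```

Inspecting `with_params.url`, `with_data.body` or `with_json.headers` shows the effect of each parameter before any traffic is generated.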