1. Fetching web content: the requests library
"We need to understand the HTTP protocol."
> The seven main methods of the requests library
| Method | Description |
| --- | --- |
| requests.request() | Constructs a request; the base method that underlies all the methods below |
| requests.get() | The main method for fetching an HTML page; corresponds to HTTP GET |
| requests.head() | Fetches the header information of an HTML page; corresponds to HTTP HEAD |
| requests.post() | Submits a POST request to an HTML page; corresponds to HTTP POST |
| requests.put() | Submits a PUT request to an HTML page; corresponds to HTTP PUT |
| requests.patch() | Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH |
| requests.delete() | Submits a delete request to an HTML page; corresponds to HTTP DELETE |
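Each convenience function forwards to requests.request() with the HTTP method filled in, so requests.get(url) behaves like requests.request("GET", url). As a rough sketch of that relationship, the snippet below builds one request per method without sending anything, so it needs no network access. The helper name build_request and the wd query parameter are assumptions made for illustration only.

```python
import requests

# The HTTP methods behind the six convenience functions in the table.
HTTP_METHODS = ["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"]


def build_request(method, url, **kwargs):
    """Prepare (but do not send) a request, so nothing goes over the network."""
    return requests.Request(method, url, **kwargs).prepare()


if __name__ == "__main__":
    for method in HTTP_METHODS:
        prepared = build_request(method, "http://www.baidu.com")
        print(prepared.method, prepared.url)

    # Query parameters and form data are encoded during preparation.
    # "wd" and "key" are illustrative names, not taken from the text above.
    print(build_request("GET", "http://www.baidu.com", params={"wd": "python"}).url)
    print(build_request("POST", "http://www.baidu.com", data={"key": "value"}).body)
```

To actually send a request, call requests.request(method, url, **kwargs) or the matching convenience function with the same arguments.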
> Exceptions in the requests library
| Exception | Description |
| --- | --- |
| requests.ConnectionError | Network connection error, such as a failed DNS lookup or a refused connection |
| requests.HTTPError | HTTP error |
| requests.URLRequired | A required URL is missing |
| requests.TooManyRedirects | The maximum number of redirects was exceeded |
| requests.ConnectTimeout | Timed out while connecting to the remote server |
| requests.Timeout | The request to the URL timed out |
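These exceptions form a hierarchy: all of them derive from requests.RequestException, and requests.ConnectTimeout is a subclass of both requests.ConnectionError and requests.Timeout. Checks therefore have to run from the most specific class to the most general one, and the same ordering applies to except clauses. The following is a minimal sketch of that ordering; the function name describe_failure and its labels are assumptions made for illustration.

```python
import requests


def describe_failure(exc):
    """Map an exception to a short label, testing the most specific class first."""
    if isinstance(exc, requests.ConnectTimeout):
        return "connect timeout"
    if isinstance(exc, requests.Timeout):
        return "timeout"
    if isinstance(exc, requests.ConnectionError):
        return "connection error"
    if isinstance(exc, requests.HTTPError):
        return "http error"
    if isinstance(exc, requests.TooManyRedirects):
        return "too many redirects"
    if isinstance(exc, requests.URLRequired):
        return "url required"
    if isinstance(exc, requests.RequestException):
        return "other requests error"
    return "not a requests error"


if __name__ == "__main__":
    # Exceptions are constructed directly here, so no network access is needed.
    for exc in (requests.ConnectTimeout(), requests.HTTPError(), ValueError()):
        print(type(exc).__name__, "->", describe_failure(exc))
```

If the ConnectionError check came first, a connect timeout would be reported as a plain connection error, which is why the order matters.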
> The get() method of requests
The get() method returns a Response object, which represents the server's response to the request and has its own properties and methods.
The main properties of the Response object are as follows:
| Attribute | Description |
| --- | --- |
| status_code | HTTP status code of the response; 200 indicates success, while a code such as 404 indicates failure |
| text | The HTTP response content as a string, i.e. the content of the page at the URL |
| encoding | The encoding of the HTTP response, taken from the response header |
| content | The HTTP response content in binary form |
| headers | The response headers returned by the server, as a dictionary |
| url | The URL of the request |
| apparent_encoding | The encoding inferred by analyzing the response content (an alternative to encoding) |
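To see these attributes on a real Response without depending on an external site, one option is to serve a small page from a throwaway local server and fetch it with requests. This is only a sketch under stated assumptions: the handler class HelloHandler, the helper fetch_from_local_server and the page body are all hypothetical, and the server deliberately omits a charset from its Content-Type header.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

PAGE = "<html><body>你好, requests</body></html>"


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one small UTF-8 page."""

    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        # No charset is declared on purpose.
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_from_local_server():
    """Start a local server on a free port, fetch its page, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy settings for localhost
            url = f"http://127.0.0.1:{server.server_address[1]}/"
            return session.get(url, timeout=5)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    r = fetch_from_local_server()
    print(r.status_code)
    print(r.url)
    print(r.headers["Content-Type"])
    print(r.encoding)
    print(r.apparent_encoding)
    print(r.content[:20])
```

Because the header carries no charset, r.encoding falls back to ISO-8859-1 even though the bytes in r.content are UTF-8.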
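The difference between encoding and apparent_encoding:
encoding: taken from the charset field of the response header; if the header does not declare a charset, the encoding is assumed to be ISO-8859-1.
apparent_encoding: the encoding inferred by analyzing the content of the page.
apparent_encoding is usually the more accurate of the two.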
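Since r.text is produced by decoding r.content with r.encoding, a wrong encoding yields garbled text even though the bytes themselves are intact. The effect can be reproduced with the standard library alone by decoding the same bytes two ways. The sample page content below is made up for illustration.

```python
# Bytes as a server might send them: a UTF-8 encoded page (sample content).
page_bytes = "<title>百度一下</title>".encode("utf-8")

# What the text looks like if ISO-8859-1 is assumed (no charset in the header).
garbled = page_bytes.decode("iso-8859-1")

# What the text looks like once the actual encoding is used.
readable = page_bytes.decode("utf-8")

if __name__ == "__main__":
    print(garbled)
    print(readable)
    # No information is lost: re-encoding the garbled text recovers the bytes.
    print(garbled.encode("iso-8859-1") == page_bytes)
```

This is the reason for assigning r.apparent_encoding to r.encoding before reading r.text.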
> A general code framework for crawling web pages
import requests


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # use the encoding inferred from the content
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts and HTTP errors alike
        return 'An exception occurred'


url = "http://www.baidu.com"
print(getHTMLText(url))