HTTP protocol:
HTTP stands for Hypertext Transfer Protocol. A URL is the Internet path used to access a resource over HTTP; each URL corresponds to one data resource.
HTTP operations on resources:
The Requests library provides all the basic HTTP request methods. Official documentation: http://www.python-requests.org/en/master
The six main methods of the Requests library:
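As a rough sketch of how these methods map onto HTTP operations, the snippet below prepares one request per method without sending anything. The method list (get, head, post, put, patch, delete) and the URL are assumptions made for illustration, not taken from the original table.

```python
import requests

# Hypothetical URL, used only for illustration; nothing is sent over the network.
URL = "http://example.com/resource"

# Assumed list of the six methods; each is also a module-level function,
# for example requests.get(url) or requests.post(url, data=...).
METHODS = ["get", "head", "post", "put", "patch", "delete"]


def build_prepared_requests(url):
    """Prepare (but do not send) one request per method name."""
    prepared = {}
    for name in METHODS:
        # Request(...).prepare() builds the exact request that would be sent.
        prepared[name] = requests.Request(name.upper(), url).prepare()
    return prepared


if __name__ == "__main__":
    for name, req in build_prepared_requests(URL).items():
        print(name, req.method, req.url)
```

In everyday use the shortcut functions such as `requests.get(url)` do the preparing and sending in one call.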
Requests library exceptions:
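The sketch below lists a few exception classes from `requests.exceptions` and checks that they share the base class `requests.exceptions.RequestException`. Which exceptions to list is an assumption, since the original table is not available here.

```python
import requests

# Assumed selection of commonly used exceptions from requests.exceptions.
COMMON_EXCEPTIONS = [
    requests.exceptions.ConnectionError,   # network problem, e.g. DNS failure or refused connection
    requests.exceptions.HTTPError,         # raised by raise_for_status() for error status codes
    requests.exceptions.URLRequired,       # a valid URL is required to make the request
    requests.exceptions.TooManyRedirects,  # the redirect limit was exceeded
    requests.exceptions.ConnectTimeout,    # timed out while connecting to the server
    requests.exceptions.Timeout,           # the request timed out
]


def all_share_base(exceptions):
    """Check that every class derives from RequestException."""
    base = requests.exceptions.RequestException
    return all(issubclass(exc, base) for exc in exceptions)


if __name__ == "__main__":
    print(all_share_base(COMMON_EXCEPTIONS))
```

Because of this shared base class, a single `except requests.exceptions.RequestException:` clause is enough to handle any of them.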
The Requests library has two important objects: Request and Response. The Request object supports multiple request methods; the Response object contains all the information returned by the server, and also carries the information about the Request that produced it.
Properties of the Response object:
Here, r.encoding is the encoding inferred from the HTTP headers: if the header contains no charset, the encoding is assumed to be ISO-8859-1.
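A minimal sketch of this fallback, using the helper `requests.utils.get_encoding_from_headers`, which Requests applies to the response headers when it sets the encoding. The header dictionaries are made up for illustration.

```python
from requests.utils import get_encoding_from_headers

# Hypothetical response headers, invented for illustration.
with_charset = {"content-type": "text/html; charset=utf-8"}
without_charset = {"content-type": "text/html"}


def guess_encoding(headers):
    """Return the encoding Requests infers from the headers alone."""
    return get_encoding_from_headers(headers)


if __name__ == "__main__":
    print(guess_encoding(with_charset))     # charset is declared in the header
    print(guess_encoding(without_charset))  # no charset, so the text default applies
```

Since the header-based guess can be wrong for pages that omit the charset, r.apparent_encoding, which is estimated from the content itself, is often assigned to r.encoding before reading r.text.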
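r.raise_for_status() checks r.status_code: it does nothing when the request succeeded (for example, status 200) and raises an HTTPError when the status code indicates an error (4xx or 5xx).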
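The behaviour can be tried offline with a Response built by hand. This is only a sketch: in real code the Response comes from a call such as requests.get(url), and the helper name `check_status` is made up for illustration.

```python
import requests


def check_status(status_code):
    """Return 'ok', or the name of the exception raised for this status code."""
    # A hand-built Response, used only for offline illustration.
    r = requests.Response()
    r.status_code = status_code
    try:
        r.raise_for_status()
        return "ok"
    except requests.HTTPError as err:
        return type(err).__name__


if __name__ == "__main__":
    for code in (200, 404, 500):
        print(code, check_status(code))
```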
Comparison of HTTP methods and Requests library methods:
Generic code framework for crawling a web page:
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    # If the status code signals an error, an HTTPError exception is raised
    r.encoding = r.apparent_encoding
    return r.text
except requests.RequestException:
    return 'abnormal'
For example, fetching the PMCAFF home page:
import requests

def getHtmlText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for an error status code
        r.encoding = r.apparent_encoding  # use the encoding guessed from the content
        return r.text
    except requests.RequestException:
        return 'An exception occurred'

if __name__ == '__main__':
    url = 'https://www.pmcaff.com/'
    print(getHtmlText(url))
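Generic code framework for crawling web pages. Environment: Windows, Python 3.6
Reference: the China University MOOC (中国大学MOOC) course "Python网络爬虫与信息提取" (Python Web Crawling and Information Extraction)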