HTTP protocol
HTTP: Hypertext Transfer Protocol
HTTP is a stateless, application-layer protocol based on a request/response model.
HTTP uses URLs to identify and locate network resources.
URL format: http://host[:port][path]
A URL is the Internet path for accessing a resource via HTTP; each URL corresponds to one data resource.
host: a legal Internet host domain name or IP address
port: the port number; defaults to 80 if omitted
path: the path of the requested resource
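As a sketch of the host/port/path breakdown above, Python's standard `urllib.parse` can split a URL into these components (the example URL is made up):

```python
from urllib.parse import urlsplit

# Split an example URL into the components described above.
parts = urlsplit("http://www.example.com:8080/articles/index.html")

print(parts.hostname)  # host part of the URL
print(parts.port)      # explicit port (None if omitted, meaning the default 80)
print(parts.path)      # path of the requested resource
```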
Operations HTTP defines on resources: for crawling, the main ones are GET and HEAD.
Network access carries risk, so exception handling is very important:
# Check whether a page can be fetched
import requests

def getHTMLTEXT(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises an exception if the status code is not 200
        r.encoding = r.apparent_encoding
        return r.text  # the page content as text
    except Exception as e:
        return e

if __name__ == '__main__':
    url = "http://baidu.com"
    print(getHTMLTEXT(url))
Web crawling
Web crawler workflow:
Get the page: send a request to the site, and the server returns the data for the entire page. This is similar to typing a URL into a browser and pressing Enter, after which you see the whole content of the site.
Parse the page (extract the required information): pull the desired data out of the full page data; it is like viewing the whole page in a browser, except you programmatically extract only the information you need.
Store the data: usually in CSV files or a database.
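A minimal sketch of the three steps, with the "get" step replaced by a hard-coded HTML string so the example runs without network access (the page content and file name are made up):

```python
import csv
import re

# Step 1: get the page (faked here with a hard-coded HTML string;
# in a real crawler this would come from requests.get(url).text).
html = "<html><body><h2>First post</h2><h2>Second post</h2></body></html>"

# Step 2: parse the page - extract every <h2> title with a regular expression.
titles = re.findall(r"<h2>(.*?)</h2>", html)

# Step 3: store the data as CSV.
with open("titles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for t in titles:
        writer.writerow([t])

print(titles)  # ['First post', 'Second post']
```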
Web crawler technologies:
- Getting pages:
- Basic techniques: requests, urllib, and selenium (which can drive a real browser, e.g. Chrome)
- Advanced techniques: multi-process and multi-threaded crawling, crawling pages that require login, and getting around bans (e.g. by rotating IPs through proxy servers)
- Parsing pages:
- Basic techniques: re (regular expressions), Beautiful Soup, and lxml
- Advanced techniques: solving garbled Chinese text (encoding) problems
- Storing data:
- Basic techniques: txt and csv files
- Advanced techniques: MySQL and MongoDB databases
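The garbled-text problem mentioned above usually comes from decoding bytes with the wrong encoding. A minimal sketch: UTF-8 bytes decoded as ISO-8859-1 (requests' default guess when no charset header is present) turn into mojibake, and can be repaired by round-tripping back to bytes:

```python
# UTF-8 bytes for the Chinese word "中文" ("Chinese").
raw = "中文".encode("utf-8")

# Decoding with the wrong encoding produces garbled text (mojibake).
garbled = raw.decode("ISO-8859-1")
print(garbled)  # unreadable characters

# Fix: re-encode back to the original bytes, then decode as UTF-8.
fixed = garbled.encode("ISO-8859-1").decode("utf-8")
print(fixed)  # 中文
```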
The basic skills: targeted crawling of web data and parsing of web pages.
Treat the website as an API.
requests
Automatically fetches HTML pages and automatically submits network requests.
An excellent library for crawling.
| Method | Description |
| --- | --- |
| requests.request() | Constructs a request; the base method that supports all the methods below |
| requests.get() | The main method for fetching an HTML page; corresponds to HTTP GET |
| requests.head() | Fetches the header information of an HTML page; corresponds to HTTP HEAD |
| requests.post() | Submits a POST request to an HTML page; corresponds to HTTP POST |
| requests.put() | Submits a PUT request to an HTML page; corresponds to HTTP PUT |
| requests.patch() | Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH |
| requests.delete() | Submits a delete request to an HTML page; corresponds to HTTP DELETE |

- requests.get(url, params=None, **kwargs)
- url: the URL of the target page
- params: extra parameters for the URL, as a dict or byte stream; optional
- **kwargs: 12 optional parameters that control access
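As a sketch of how params is appended to the URL, a Request can be built and prepared without sending anything over the network (the URL and parameter values are made up):

```python
import requests

# Build and prepare a GET request without actually sending it.
req = requests.Request("GET", "http://httpbin.org/get",
                       params={"q": "python", "page": "1"})
prepared = req.prepare()

# params are encoded into the query string of the final URL.
print(prepared.url)  # http://httpbin.org/get?q=python&page=1
```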
Two important objects in the Requests library: Request and Response.
r = requests.get(url)
r is a Response object; it contains everything the crawler got back, while the corresponding Request object represents the request that was sent.
| Response attribute | Description |
| --- | --- |
| r.status_code | Return status of the HTTP request; 200 means success, 404 means failure (generally, anything other than 200 is a failure) |
| r.text | The HTTP response body as a string, i.e. the page content for the URL |
| r.encoding | The response encoding guessed from the HTTP header |
| r.apparent_encoding | The encoding inferred from the content itself (a fallback encoding) |
| r.content | The HTTP response body in binary form (mainly for images, etc.) |
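A sketch of these attributes on a Response object filled in by hand, so the example runs without network access (requests normally populates these fields itself; `_content` is an internal attribute used here only for illustration):

```python
import requests

# Build a Response by hand instead of fetching one over the network.
r = requests.models.Response()
r.status_code = 200
r.encoding = "utf-8"
r._content = "<html><body>你好,世界</body></html>".encode("utf-8")

print(r.status_code)  # 200
print(r.content)      # raw bytes of the body
print(r.text)         # the body decoded using r.encoding
```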
r.status_code: the status code; 200 means the fetch succeeded, anything else is a failure.
r.headers: the header information of the page returned by the GET request.
If charset is not present in r.headers, the encoding is assumed to be ISO-8859-1 (this guess is what r.encoding reports).
If the text comes out garbled, use r.apparent_encoding to get a fallback encoding and assign it with r.encoding = '<the detected encoding>'; you can also simply try r.encoding = 'utf-8'.
Requests exceptions

| Method | Description |
| --- | --- |
| r.raise_for_status() | If the status code is not 200, raises requests.HTTPError |
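A sketch of raise_for_status() on a hand-built Response (again filling in the status code manually so the example needs no network):

```python
import requests

# A hand-built Response carrying a failing status code.
r = requests.models.Response()
r.status_code = 404

try:
    r.raise_for_status()  # non-2xx status -> raises requests.HTTPError
except requests.HTTPError as e:
    print("caught:", e)
```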