The requests library

HTTP protocol

HTTP: Hypertext Transfer Protocol

HTTP is a stateless, application-layer protocol based on a request/response model.

HTTP uses URLs to identify and locate network resources.

URL format: http://host[:port][/path]

A URL is the path through which a resource on the Internet is accessed via HTTP; each URL corresponds to one resource.

  1. host: a legal Internet host domain name or IP address

  2. port: the port number (default: 80)

  3. path: the path of the requested resource
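The host, port and path components above can be pulled apart with the standard library; a minimal sketch using `urllib.parse` on a hypothetical URL:

```python
from urllib.parse import urlparse

# A hypothetical URL used only for illustration
parts = urlparse("http://example.com:8080/path/to/page")

print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /path/to/page

# When no port is given, HTTP's default port 80 is implied
default = urlparse("http://example.com/index.html")
print(default.port)    # None -> port 80 is implied
```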

  4. HTTP defines a set of methods for operating on resources


The most commonly used methods are GET and HEAD.

Web requests can fail, so exception handling is very important:

# Check whether a page can be fetched
import requests


def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()   # raises an exception if the status code is not 200
        r.encoding = r.apparent_encoding
        return r.text  # return the page body as text
    except Exception as e:
        return e


if __name__ == '__main__':
    url = "http://baidu.com"
    print(get_html_text(url))

Web crawlers

The web-crawling process:

  1. Get the page: send a request to the site and receive the data of the entire page in response, similar to typing a URL into a browser, pressing Enter, and seeing the whole site.

  2. Parse the page (extract the required information): extract the desired data from the full page data, similar to viewing the whole page in a browser but keeping only the information you need.

  3. Store the data: usually in a csv file, or possibly in a database.
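The three steps above can be sketched offline; here a small hard-coded HTML string stands in for the response body (in practice step 1 would use requests.get), and an in-memory buffer stands in for the csv file:

```python
import csv
import io
import re

# Step 1 (simulated): in a real crawler this string would come from
# requests.get(url).text
html = '<ul><li class="title">First post</li><li class="title">Second post</li></ul>'

# Step 2: parse the page - extract only the information we need
titles = re.findall(r'<li class="title">(.*?)</li>', html)

# Step 3: store the data as csv rows
buffer = io.StringIO()
writer = csv.writer(buffer)
for title in titles:
    writer.writerow([title])

print(titles)
print(buffer.getvalue())
```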

Web-crawling techniques:

  1. Getting pages:
    1. Basic techniques: requests, urllib and selenium (which can simulate a real browser, e.g. Chrome)
    2. Advanced techniques: multi-process and multi-threaded crawling, crawling behind logins, getting around IP bans and crawling through proxy servers
  2. Parsing pages:
    1. Basic techniques: re regular expressions, Beautiful Soup and lxml
    2. Advanced techniques: solving the garbled Chinese character (encoding) problem
  3. Storing data:
    1. Basic techniques: storing into txt and csv files
    2. Advanced techniques: storing into a MySQL or MongoDB database
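The garbled-character problem mentioned under parsing can be reproduced offline: decoding UTF-8 bytes as ISO-8859-1 (the fallback encoding when no charset header is present) produces mojibake, while decoding with the correct encoding does not:

```python
# UTF-8 bytes for the Chinese word "中文" ("Chinese")
raw = "中文".encode("utf-8")

wrong = raw.decode("ISO-8859-1")  # wrong guess: produces garbled text
right = raw.decode("utf-8")       # the correct encoding

print(wrong)   # mojibake
print(right)   # 中文
```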

Goal: basic ability to crawl targeted network data and parse web pages.

Treat the website as an API.

requests

  1. Automatically fetches HTML pages and submits network requests

  2. The best-known crawling library for Python

  3. Method Explanation
    requests.request() constructs a request; the foundation on which the methods below are built
    requests.get() the main method for fetching HTML pages, corresponding to HTTP GET
    requests.head() fetches the header information of an HTML page, corresponding to HTTP HEAD
    requests.post() submits a POST request to an HTML page, corresponding to HTTP POST
    requests.put() submits a PUT request to an HTML page, corresponding to HTTP PUT
    requests.patch() submits a partial-modification request, corresponding to HTTP PATCH
    requests.delete() submits a delete request, corresponding to HTTP DELETE
    1. requests.get(url, params=None, **kwargs)
      1. url: the URL of the page to fetch
      2. params: extra parameters to append to the URL, as a dict or byte stream; optional
      3. **kwargs: 12 optional parameters controlling access
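How params gets appended to the URL can be seen without sending anything over the network, by preparing a request and inspecting its final URL (example.com is just a placeholder):

```python
import requests

# Build the request but do not send it; .prepare() exposes the final URL
req = requests.Request("GET", "http://example.com/search",
                       params={"q": "python", "page": "2"})
prepared = req.prepare()

print(prepared.url)  # the params dict is urlencoded into the query string
```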
  4. Two important objects in the Requests library

    1. r = requests.get(url)

    2. Response (contains all the content the crawler gets back)

    3. Response attribute Explanation
      r.status_code the HTTP status code of the response; 200 means success, 404 means failure (generally, anything other than 200 is a failure)
      r.text the response body as a string, i.e. the page content of the URL
      r.encoding the response encoding guessed from the HTTP headers
      r.apparent_encoding the response encoding inferred from the content itself (a fallback)
      r.content the response body in binary form (mainly used for images, etc.)
    4. Request

    5. r.status_code: the status code; 200 means the fetch succeeded, anything else means it failed

    6. r.headers: the header information of the page returned by the GET request

    7. If there is no charset in r.headers, the encoding is assumed to be ISO-8859-1 (this is the value r.encoding reports)

    8. If the text comes out garbled, use r.apparent_encoding to get a fallback encoding and assign it with r.encoding = '<detected encoding>', or simply try r.encoding = 'utf-8'
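The interplay between r.encoding and r.text can be demonstrated offline by constructing a Response by hand; setting _content directly is an internal shortcut used here only for illustration, not something a real crawler would do:

```python
import requests

r = requests.models.Response()
r.status_code = 200
r._content = "中文页面".encode("utf-8")  # pretend this body came over the wire

r.encoding = "ISO-8859-1"   # the fallback when no charset header is present
garbled = r.text            # decoded with the wrong encoding: mojibake

r.encoding = "utf-8"        # switch to the correct encoding...
fixed = r.text              # ...and the text decodes properly

print(garbled)
print(fixed)
```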

  5. Requests exceptions

  6. Exception Explanation
    r.raise_for_status() raises requests.HTTPError if the status code is not 200
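raise_for_status can be seen in action without any network traffic by faking a 404 response (again constructing a Response by hand, an internal shortcut used only for illustration):

```python
import requests

r = requests.models.Response()
r.status_code = 404  # pretend the server answered "Not Found"

try:
    r.raise_for_status()
    outcome = "ok"
except requests.exceptions.HTTPError:
    outcome = "HTTPError raised"

print(outcome)
```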


Origin www.cnblogs.com/SkyOceanchen/p/12168174.html