Python Web Crawler Notes

I. Getting Started with Requests

(a) Installing the Requests library

  1. To install the Requests library, open cmd with "Run as administrator" and execute pip install requests
  2. pip is the modern, general-purpose Python package management tool. It provides functions for finding, downloading, installing, and uninstalling Python packages.
  3. Quick test of the Requests installation: a status_code of 200 indicates a successful request
    >>> import requests
    >>> r = requests.get("http://www.baidu.com")
    >>> print(r.status_code)
    200

 Pay attention to the difference between forward slash and backslash: web page and URL paths generally use the forward slash /, while local (Windows) paths generally use the backslash \.

(b) The 7 main methods of the Requests library

requests.request(): constructs a request; the foundational method that supports all the methods below
requests.get(): the main method for obtaining an HTML page, corresponding to HTTP GET
requests.head(): gets the header information of an HTML page, corresponding to HTTP HEAD
requests.post(): submits a POST request to an HTML page, corresponding to HTTP POST
requests.put(): submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch(): submits a local (partial) modification request to an HTML page, corresponding to HTTP PATCH
requests.delete(): submits a request to delete an HTML page, corresponding to HTTP DELETE
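A minimal sketch contrasting two of these methods; the httpbin.org URL is an assumption used only for demonstration:

    import requests

    # head() retrieves only the response headers, so it is cheap;
    # get() retrieves the full page body as well
    r = requests.head("http://httpbin.org/get")
    print(r.headers)        # response headers; r.text is empty for HEAD
    r = requests.get("http://httpbin.org/get")
    print(r.text[:200])     # first 200 characters of the body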

 

requests.get() constructs a Request object that asks the server for the target resource; the server returns a Response object containing the resource.

requests.get(url, params=None, **kwargs)
∙ url: link to the page to be fetched
∙ params: extra parameters appended to the url, in dictionary or byte-stream format; optional
∙ **kwargs: 12 parameters controlling access
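A minimal sketch of the params argument; httpbin.org simply echoes the query string back, and the URL is an assumption:

    import requests

    # params appends key=value pairs to the URL as a query string
    kv = {'key1': 'value1', 'key2': 'value2'}
    r = requests.get("http://httpbin.org/get", params=kv)
    print(r.url)   # http://httpbin.org/get?key1=value1&key2=value2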

(c) Properties of the Response object

r.status_code: the HTTP status of the request; 200 indicates a successful connection, 404 indicates failure
r.text: the HTTP response content as a string, i.e. the page content of the url
r.encoding: the content encoding guessed from the HTTP response headers
r.apparent_encoding: the content encoding analyzed from the response content itself (an alternative to r.encoding)
r.content: the HTTP response content in binary form
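A minimal sketch inspecting these properties; the URL is an assumption:

    import requests

    r = requests.get("http://www.baidu.com")
    print(r.status_code)        # 200 on success
    print(r.encoding)           # guessed from headers, often 'ISO-8859-1'
    print(r.apparent_encoding)  # analyzed from content, e.g. 'utf-8'
    r.encoding = r.apparent_encoding
    print(r.text[:300])         # correctly decoded page text
    print(type(r.content))      # <class 'bytes'>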

(d) Requests exceptions

requests.ConnectionError: network connection errors, such as a failed DNS lookup or a refused connection
requests.HTTPError: HTTP error exception
requests.URLRequired: missing-URL exception
requests.TooManyRedirects: the maximum number of redirects was exceeded; raised on excessive redirection
requests.ConnectTimeout: timed out while connecting to the remote server
requests.Timeout: the URL request timed out; raised on a request timeout

The following framework can be used:

   import requests

   def getHTMLText(url):
       try:
           r = requests.get(url, timeout=30)
           r.raise_for_status()              # raise HTTPError if the status is not 200
           r.encoding = r.apparent_encoding  # switch to the more accurate encoding
           return r.text
       except:
           return "an exception occurred"

(e) Specific problems in actual operation

  1. Some sites place restrictions on web crawlers: (1) source review: the site inspects the User-Agent field of the visiting HTTP request header and limits access accordingly; (2) announcement: the Robots protocol, which crawlers are asked to comply with (it can be ignored for small-scale access, but not when crawling for commercial profit or when downloading large amounts of data that occupies server resources); the announcement is the robots.txt file at the root of the site
  2. r.encoding may be incorrect because it only analyzes the headers to derive the encoding type, while r.apparent_encoding analyzes the page content itself and is comparatively more accurate; you can therefore use the statement r.encoding = r.apparent_encoding to change the coding scheme, so that r.text returns correctly decoded page content
  3. In practice the page content may be very large, and printing r.text in IDLE may cause it to freeze; use r.text[:1000] to get the first 1000 characters, or r.text[-1000:] to get the last 1000 characters
  4. When you encounter a site with source review, disguise the crawler as a browser by passing a custom header directly in the fetch statement: r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'}) (most browsers identify themselves with a Mozilla/5.0 prefix); you can check what was actually sent via r.request.headers, as shown in the sketch after this list
  5. Reading pictures: use get() to fetch the image (URL of the form http://XXX/XXX.jpg), open the local storage path in binary write mode with f = open(path, 'wb'), and save the image with f.write(r.content); see the sketch after this list
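A minimal sketch combining items 4 and 5; the image URL and file name are assumptions:

    import requests

    url = "http://example.com/demo.jpg"          # hypothetical image URL
    kv = {'user-agent': 'Mozilla/5.0'}           # disguise as a browser for source review
    r = requests.get(url, headers=kv, timeout=30)
    r.raise_for_status()
    print(r.request.headers['user-agent'])       # verify the header actually sent
    with open("demo.jpg", 'wb') as f:            # binary write mode for image bytes
        f.write(r.content)                       # r.content is the binary response body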

II. Getting Started with the Beautiful Soup Library

(a) Installing the BeautifulSoup library

Execute pip install beautifulsoup4 on the command line to download and install it.

(b) Using the bs4 library

  1. Import with from bs4 import BeautifulSoup
  2. soup = BeautifulSoup(r.text, "html.parser") parses r.text with bs4's HTML parser into a tree of tags
  3. The basic elements of the BeautifulSoup class: Tag (a labelled block of the tree), Name (the tag's name, tag.name), Attributes (the tag's attributes as a dictionary, tag.attrs), NavigableString (the non-attribute string inside a tag, tag.string), and Comment (a special kind of NavigableString for comments); see the sketch below
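A minimal sketch of these elements; the URL is an assumption:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://www.baidu.com")
    r.encoding = r.apparent_encoding
    soup = BeautifulSoup(r.text, "html.parser")
    tag = soup.a               # Tag: the first <a> tag in the tree
    print(tag.name)            # Name: 'a'
    print(tag.attrs)           # Attributes: a dict such as {'href': ...}
    print(tag.string)          # NavigableString: the text inside the tag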


III. Getting Started with Regular Expressions

(a) Using regular expressions

  1. Compile: converts a string that conforms to regular-expression syntax into a regular-expression feature (a compiled pattern object). Example:
    import re
    regex = 'p(Y|YT|YTH|YTHO)?N'   # the regular-expression string
    p = re.compile(regex)          # compile the string into a pattern object
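The compiled pattern can then be used for matching; a minimal sketch (the sample string is an assumption):

    m = p.search('pYTHON')   # search using the compiled pattern p from above
    if m:
        print(m.group(0))    # 'pYTHON'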

  2. Commonly used operators in regular expressions include . (any character), [] (character set), * (zero or more repetitions), + (one or more), ? (zero or one), {m,n} (repetition range), ^ (start of string), $ (end of string), | (or), \d (digit), and \w (word character); a few are shown below
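A minimal sketch exercising a few of these operators; the sample strings are assumptions:

    import re

    print(re.findall(r'\d+', 'abc123def456'))        # ['123', '456']   \d and +
    print(re.findall(r'[A-Z]', 'Hello World'))       # ['H', 'W']       character set []
    print(re.match(r'^py.*n$', 'python').group(0))   # 'python'         ^ . * $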

(b) Classic regular expressions
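A minimal sketch of one classic pattern that also appears below: r'[1-9]\d{5}' matches a 6-digit Chinese postal code with no leading zero (the sample string is an assumption):

    import re

    m = re.search(r'[1-9]\d{5}', 'BIT 100081')   # classic pattern: CN postal code
    print(m.group(0))                            # '100081'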

(c) Introduction to the re library

For example: r'[1-9]\d{5}'. Without the r prefix, every backslash has to be doubled (escaped), i.e. '[1-9]\\d{5}'


  1. Concept: the re library is a Python standard library, mainly used for string matching
  2. Raw string type: a native string type in which escape characters are not themselves escaped; when using the re library, write patterns as raw strings, denoted r'text'
  3. The main functions of the re library: re.search(), re.match(), re.findall(), re.split(), re.finditer(), and re.sub(); see the sketch below
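A minimal sketch of these functions, reusing the postal-code pattern from above; the sample strings are assumptions:

    import re

    s = 'BIT100081 TSU100084'
    print(re.search(r'[1-9]\d{5}', s).group(0))        # '100081'  first match anywhere
    print(re.match(r'[1-9]\d{5}', '100081').group(0))  # '100081'  match at string start only
    print(re.findall(r'[1-9]\d{5}', s))                # ['100081', '100084']  all matches
    print(re.split(r'[1-9]\d{5}', s))                  # ['BIT', ' TSU', '']  split on matches
    print(re.sub(r'[1-9]\d{5}', ':zip', s))            # 'BIT:zip TSU:zip'  replace matches
    for m in re.finditer(r'[1-9]\d{5}', s):            # iterate over Match objects
        print(m.group(0))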

 
