Python crawler 1-----urllib module

1. Import the request submodule of urllib

from urllib import request

2. Related functions:

(1) urlopen function: read a web page

  • webpage = request.urlopen(url, timeout=1)   [Open the web page; the timeout parameter means the request gives up after 1 second, so an unresponsive page can be skipped instead of hanging]
  • data = webpage.read()   [Read the page content]

  [The content returned by webpage.read() is a bytes object; printing it shows b'...']

  • data = data.decode('utf-8')   [Decode the bytes into a string]

  [Since the raw content is a bytes object, convert it to a string with .decode(). With no argument it uses the default encoding (UTF-8); an explicit encoding can also be passed, e.g. decode("gb2312").]

  • pat = '<div class="name">(.*?)</div>'

    res = re.compile(pat).findall(str(data))   [remember str(data)]

  [re.search() / re.findall() cannot be applied to the raw bytes directly; convert the data to a string first. res is the list of matched content. A minimal end-to-end sketch follows below.]
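  Putting the urlopen steps together, a minimal sketch (the URL and the <div class="name"> pattern are only placeholders for whatever page and markup you are actually scraping):

  import re
  from urllib import request

  url = "http://example.com"                 # placeholder URL
  webpage = request.urlopen(url, timeout=1)  # give up after 1 second
  data = webpage.read()                      # bytes object, prints as b'...'
  data = data.decode("utf-8")                # decode the bytes into a str

  # the regex only works on str, so decode (or str(data)) first
  pat = '<div class="name">(.*?)</div>'
  res = re.compile(pat).findall(data)
  print(res)                                 # list of captured strings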

(2) urlretrieve function: download a web page and save it as a local file

  • urllib.request.urlretrieve(url, filename="<local path>/1.html")

(3) urlcleanup() function: urlretrieve leaves some temporary cache files behind; this function clears them (see the sketch after this item).

  • urllib.request.urlcleanup()
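  A short sketch combining (2) and (3): download the page to a local file, then clear the cache. The URL and the file name 1.html are just example values.

  from urllib import request

  url = "http://example.com"                   # placeholder URL
  request.urlretrieve(url, filename="1.html")  # save the page as a local file
  request.urlcleanup()                         # clear the cache left by urlretrieve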

(4) info() function: returns the response headers (meta-information) of the web page.

(5) getcode(): returns the HTTP status code; 200 means the page was fetched normally.

(6) geturl(): returns the URL of the page actually fetched (useful after redirects). The three helpers in (4)-(6) all live on the response object returned by urlopen, as sketched below.
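  A quick sketch of inspecting a response object (the URL is a placeholder):

  from urllib import request

  webpage = request.urlopen("http://example.com", timeout=1)
  print(webpage.info())     # response headers (server, content type, etc.)
  print(webpage.getcode())  # 200 means the request succeeded
  print(webpage.geturl())   # the URL actually fetched, after any redirects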

(7) POST and GET requests (see the urllib.request.Request class for building requests with custom data and headers). A sketch of a POST request follows.
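  A minimal sketch of a POST request built with urllib.request.Request. The endpoint http://httpbin.org/post and the form fields are only assumed test values; swap in the real target and payload.

  from urllib import request, parse

  # example form payload, URL-encoded and converted to bytes
  form = parse.urlencode({"key": "value"}).encode("utf-8")

  req = request.Request(
      url="http://httpbin.org/post",          # assumed echo/test endpoint
      data=form,                              # presence of data makes this a POST
      headers={"User-Agent": "Mozilla/5.0"},  # pretend to be a browser
  )
  resp = request.urlopen(req, timeout=5)
  print(resp.read().decode("utf-8"))

  Omitting the data argument (or passing data=None) sends an ordinary GET request instead.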

 
