2. Learning the urllib package (crawling source code from anti-crawler websites)

Python 3 merges all the functionality of Python 2's urllib and urllib2 into a single urllib package, so everything the two old modules offered can be used by importing urllib in a Python 3 environment.

Packages used in this post: urllib (urllib.request.urlopen, urllib.parse) and pickle.

First, a small piece of background knowledge you need to know: HTTP status codes.

To summarize briefly: 100–102 are informational, 200–207 indicate success, 300–307 are redirections, 400–418, 421–426 and 449–451 are client errors, and 500–510 (plus 600) are server errors.
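As a rough illustration, a small helper function like the following could bucket a status code into these categories (a sketch for orientation only):

def status_category(code: int) -> str:
    """Map an HTTP status code to the rough categories listed above."""
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code <= 600:
        return "server error"
    return "unknown"

print(status_category(200))  # success
print(status_category(418))  # client error (the code Douban returns to crawlers later in this post)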

Crawling the Baidu homepage source code with a simple GET request

1. First import the request module under urllib
import urllib.request
2. Use the GET method to request the page and receive the response
response = urllib.request.urlopen("http://www.baidu.com")
print(response)

Printing shows that response is an http.client.HTTPResponse object: urlopen wraps the retrieved page source in an HTTPResponse object and returns it.

3. Get the source text information and decode it into utf-8
print(response.read().decode('utf-8'))

If you only call print(response.read()), the raw source bytes are printed directly: special characters such as '\n' appear literally as characters, so the indentation and tree structure of the page source are lost.
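To see the difference, a quick sketch with a fresh request (the body can only be read once):

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
raw = response.read()         # bytes: print(raw) shows b'...' with '\n' as a literal two-character escape
text = raw.decode('utf-8')    # str: print(text) renders real newlines and indentation
print(type(raw), type(text))  # <class 'bytes'> <class 'str'>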

Baidu homepage source code crawled on 2021.01.28: https://wwx.lanzoui.com/iTrhMl7h3tc

Testing tips

In general, you can test your own crawler against the http://httpbin.org website. A POST request form can carry a user name, password, cookies and so on, which lets you simulate a user login and deal with anti-crawler mechanisms; the site above can be used to test such POST requests.
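For example, a minimal GET against httpbin simply echoes back what it received, which makes it easy to see what your crawler is actually sending; a quick sketch:

import urllib.request

response = urllib.request.urlopen("http://httpbin.org/get")
print(response.read().decode("utf-8"))  # JSON echoing the request's args, headers and origin IP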


Encapsulating form data for a POST request

1. Import the packages
import urllib.parse  # the parser encodes the data according to fixed rules
import pickle  # used to save the html file
2. Define the form data
key_values = {"count":"[email protected]", "pwd":"ynagzy0203"}
3. Encode the form data into network byte form
data = bytes(urllib.parse.urlencode(key_values), encoding="utf-8")

urllib.parse.urlencode encodes the dictionary of form data into a query string, and bytes() then converts that string into bytes using the character encoding given as its second argument.
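For example, assuming the key_values dictionary defined above, the two steps look like this:

query = urllib.parse.urlencode(key_values)  # a query string like "count=...&pwd=..."
print(query)
data = bytes(query, encoding="utf-8")       # the byte form that urlopen expects as POST data
print(data)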

4. The core step: urlopen

response = urllib.request.urlopen("https://mail.qq.com/", data=data)

As long as urlopen is given form data, the request is sent as a POST. The call above can submit an account and password to the server to simulate a browser login and gain more access rights, so that more network resources can be crawled.

5. Save the crawled source code to a file
pickle.dump(response.read(),open('./QQmail.html','wb'))
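Because the body is saved with pickle rather than written out directly, it has to be loaded back the same way; a quick sketch of reading the saved file (file name taken from the code above):

import pickle

with open('./QQmail.html', 'rb') as f:
    html_bytes = pickle.load(f)  # the bytes object that response.read() returned
print(html_bytes.decode('utf-8'))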

Handling timeouts when crawling a web page

Demonstrating a timeout
response = urllib.request.urlopen("https://mail.qq.com/", timeout=0.01)
print(response.read().decode("utf-8"))

The timeout parameter of urlopen specifies how long to wait for a response; 0.01 seconds is far too short for any real response, so it is used here to deliberately trigger a timeout and see what happens.

The solution: use try/except to catch and handle the exception
try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)  # the timeout parameter sets how long to wait for a response
    print(response.read().decode("utf-8"))
except urllib.error.URLError as e:  # handle the timeout
    print("time out!")
Testing tips

When writing a crawler you will run into many unknown errors, so you must be good at debugging with a variety of techniques. Besides using the httpbin website to debug your own crawler, you can also use the RESTer browser plug-in.

Of course, you can also print out various states and variables from the code itself for debugging purposes.

  • Get the status code
    response = urllib.request.urlopen("https://mail.qq.com/")
    print(response.status)
    # Douban has an anti-crawler mechanism, so this request comes back with status code 418
    try:
        response = urllib.request.urlopen("https://www.douban.com/")
        print(response.status)
    except urllib.error.HTTPError as e:
        print(e.code)  # 418 is the code the server returns once it detects that you are a crawler

The request to the QQ mailbox homepage returns status code 200.

Douban has an anti-crawler mechanism, so the request comes back with status code 418.

  • Get the response headers

    response = urllib.request.urlopen("http://www.baidu.com")
    print(response.getheaders())  # the response headers the server sent back
    print(response.getheader("Server"))  # get a single header

Summary: The return value of urlopen is an HTTPResponse 2 object, which encapsulates status and response headers
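Besides the calls shown above, a couple of other HTTPResponse members come in handy when debugging; a brief sketch:

response = urllib.request.urlopen("http://www.baidu.com")
print(response.status, response.reason)     # e.g. 200 OK
print(response.geturl())                    # the final URL after any redirects
print(response.getheader("Content-Type"))   # a single response header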


Simulating a browser to crawl the source code of anti-crawler websites

Simulating a browser visit to the website

First, test against the httpbin website

1. The URL of the page to crawl
url = "http://httpbin.org/post"  # test against httpbin first
2. Encapsulate the form data
data = bytes(urllib.parse.urlencode({"count":"[email protected]", "pwd":"yzy0203"}), encoding="utf-8")
3. Encapsulate the request headers
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}  # build the request headers yourself
4. Encapsulate the URL, form data, request headers and method into a Request object
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")  # the Request class wraps the url, data and request headers
5. Send the request with urlopen
response = urllib.request.urlopen(req)  # send the request, passing the Request object as the argument
6. Print the crawled page source
print(response.read().decode("utf-8"))

Submitting the POST form directly on the httpbin website shows essentially the same content as the output of the code above.
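httpbin.org/post echoes the request back as JSON, so you can also confirm programmatically what the server received; a sketch (the User-Agent string is shortened here):

import json
import urllib.parse, urllib.request

data = bytes(urllib.parse.urlencode({"count":"[email protected]", "pwd":"yzy0203"}), encoding="utf-8")
headers = {"User-Agent": "Mozilla/5.0 ..."}
req = urllib.request.Request(url="http://httpbin.org/post", data=data, headers=headers, method="POST")
echo = json.loads(urllib.request.urlopen(req).read().decode("utf-8"))
print(echo["form"])                   # the form fields httpbin received
print(echo["headers"]["User-Agent"])  # the User-Agent header httpbin received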

Actually crawling the Douban homepage, which has an anti-crawler mechanism
url = "https://www.douban.com"
req = urllib.request.urlopen(url=url)
print(req.status)

Crawling the Douban homepage directly fails with status code 418.

Manually build the request headers (mainly the User-Agent, which you can copy from the browser's developer tools) to simulate a browser visiting Douban, so that the data can be crawled with a GET request.

import urllib.request
import pickle  # save the html file

url = "https://www.douban.com"
# Customize the User-Agent in the request headers:
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
req = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(req)  # pass the Request object as the argument
html = response.read()  # read the body once; the stream cannot be read twice
pickle.dump(html, open('./douban.html', 'wb'))  # the .pkl suffix is optional
print(html.decode("utf-8"))

Now it can be crawled successfully:
Result of crawling the Douban homepage source code on 2021.02.02


Task-driven development: crawling the source code of the 10 pages of the Douban Top 250 movie list, which has an anti-crawler mechanism

import urllib.request, urllib.error  # specify the URL and fetch the page data


def main():
    # the pages to crawl
    baseurl = "https://movie.douban.com/top250?start="
    # crawl the source code of all 10 pages
    getData(baseurl)


def getData(baseurl):
    datalist = []
    for i in range(0, 10):  # 25 movies per page, 10 pages in total
        url = baseurl + str(i*25)
        html = askURL(url)  # the fetched page source
        print(html)
    # parsing step (omitted here)

    return datalist


def askURL(url):
    # header info: the User-Agent disguises the crawler as an ordinary browser
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/87.0.4280.88 Safari/537.36"}
    req = urllib.request.Request(url, headers=head)
    html = ""  # the fetched page source
    try:
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # has attribute
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


if __name__ == '__main__':
    main()
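As a possible extension, getData could also write each fetched page to disk so the later parsing step can be developed without re-crawling; a sketch, where the file name is just illustrative:

def getData(baseurl):
    datalist = []
    for i in range(0, 10):  # 25 movies per page, 10 pages in total
        url = baseurl + str(i * 25)
        html = askURL(url)
        # save each page under an example file name for offline parsing
        with open('./top250_page{}.html'.format(i), 'w', encoding='utf-8') as f:
            f.write(html)
        datalist.append(html)
    return datalist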


  2. http.client.HTTPResponse ↩︎


Origin blog.csdn.net/qq_43808700/article/details/113549044