Crawler basics: fetching web page content with Python

Under Python 3.x, we can fetch web page content in two ways:

Target address: the National Geographic China website

url = 'http://www.ngchina.com.cn/travel/'

urllib library

1. Import the library

from urllib import request

2. Get web content

with request.urlopen(url) as file:
    data = file.read()
    print(data)

Running this raises an error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

This happens mainly because the website blocks crawlers. We can add header information to the request so that it is disguised as a browser, using the User-Agent field.

So we add a 'User-Agent' entry to the request headers:

headers = {'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36'}

# Create the request
req = request.Request(url=url, headers=headers)
with request.urlopen(req) as response:
    # Read the content of the response and decode it
    data1 = response.read().decode('utf-8')  # utf-8 is the default
    print(data1)

Regarding the User-Agent, we can use Google Chrome's developer tools to capture and inspect the request information.
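As a quick sanity check (a minimal sketch; the agent string below is just a placeholder), you can build the Request object without sending it and inspect the headers urllib will attach. Note that urllib normalizes header names to capitalized form:

```python
from urllib import request

url = 'http://www.ngchina.com.cn/travel/'
headers = {'User-Agent': 'Mozilla/5.0 (placeholder agent string)'}

# Build the request without sending it yet.
req = request.Request(url=url, headers=headers)

# urllib stores header names in capitalized form, e.g. 'User-agent'.
print(req.headers)
print(req.get_header('User-agent'))  # the value we set above
```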


requests library

1. Import the library

import requests

2. Get web content

with requests.get(url=url, headers=headers) as response:
    # Read the content of the response and decode it
    data2 = response.content.decode()
    print(data2)

Additional notes:

The requests response object also exposes more information, including the cookies, headers, status code, and final URL. For more details, refer to the requests documentation.

response.cookies
response.headers
response.status_code
response.url
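A minimal, self-contained sketch of these attributes, using a throwaway local HTTP server in place of the real site so it runs offline; the cookie name `session` is purely illustrative:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# A tiny local server standing in for the real website.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123')  # illustrative cookie
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write(b'<html>ok</html>')

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

response = requests.get('http://127.0.0.1:%d/' % server.server_port)
print(response.status_code)              # 200
print(response.url)
print(response.headers['Content-Type'])  # lookup is case-insensitive
print(response.cookies.get('session'))   # 'abc123'

server.shutdown()
```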


Under Python 2.x, you can refer to this article:

python open webpage to get webpage content method summary
