Under Python 3.x, we can fetch web page content in two ways:
Target address: the National Geographic China website
url = 'http://www.ngchina.com.cn/travel/'
urllib library
1. Import the library
from urllib import request
2. Get web content
with request.urlopen(url) as file:
    data = file.read()
    print(data)
Running this code raises an error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
This is mainly because the website blocks crawlers. We can add header information to the request, setting a User-Agent so the request is disguised as coming from a browser.
So we add a 'User-Agent' field to the request headers:
headers = {'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36'}
# Create the request
req = request.Request(url=url, headers=headers)
with request.urlopen(req) as response:
    # Read the response content and decode it
    data1 = response.read().decode('utf-8')  # utf-8 is the default
    print(data1)
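As a quick check, the Request object stores the headers before any network call is made, so we can verify the disguise locally. This is a minimal sketch; the shortened User-Agent string here is just a placeholder for the full browser string shown above.

```python
from urllib import request

url = 'http://www.ngchina.com.cn/travel/'
# Placeholder User-Agent; use a full browser string in practice
headers = {'User-Agent': 'Mozilla/5.0'}

req = request.Request(url=url, headers=headers)

# urllib normalizes header names to 'Xxxx-xxxx' capitalization,
# so the key is stored as 'User-agent'
print(req.get_header('User-agent'))  # Mozilla/5.0
print(req.full_url)                  # http://www.ngchina.com.cn/travel/
```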
To find a User-Agent string, we can capture and inspect requests with Google Chrome's developer tools (Network tab).
requests library
1. Import the library
import requests
2. Get web content
with requests.get(url=url, headers=headers) as response:
    # Read the response content and decode it
    data2 = response.content.decode()
    print(data2)
Supplement:
The requests response object also exposes more information, including the cookies, headers, status code, and URL. For more details, refer to the requests documentation.
response.cookies
response.headers
response.status_code
response.url
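To illustrate these attributes without making a live request, we can construct a bare Response object by hand. This is a sketch for demonstration only; in real code, requests.get(...) returns one that is already fully populated.

```python
import requests

# Build an empty Response and fill in a few fields manually,
# just to show the attributes listed above
response = requests.models.Response()
response.status_code = 200
response.url = 'http://www.ngchina.com.cn/travel/'
response.headers['Content-Type'] = 'text/html; charset=utf-8'

print(response.status_code)              # 200
print(response.url)
print(response.headers['Content-Type'])
print(response.ok)                       # True for status codes < 400
```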
Under Python 2.x, you can refer to this article.