Crawler basics: requests and the basic use of .content

First, import the module:

import requests

Next, set the request headers (using a Zhihu user page as an example):

request_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': '',
    'Host': 'www.zhihu.com',
    'Referer': 'https://www.zhihu.com/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}  # this is a Python dict

Then you can send the request:

html = requests.get(url, headers=request_headers)  # send a GET request to the URL; the response contains the page source
# html is a variable holding the returned response
print(html.content)
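The snippet above assumes the request succeeds. In practice it is worth checking the response status before using the body. A minimal sketch, assuming the requests library is installed; the helper name `fetch` and the URL are illustrative, not from the original post:

```python
import requests

def fetch(url, headers, timeout=10):
    """Send a GET request and return the raw body bytes, or None on failure."""
    try:
        resp = requests.get(url, headers=headers, timeout=timeout)
        resp.raise_for_status()  # raise an exception for 4xx/5xx responses
        return resp.content
    except requests.RequestException:
        # covers connection errors, timeouts, and bad status codes
        return None

# '.invalid' is a reserved TLD that never resolves, so this returns None
body = fetch('http://example.invalid/', {})
print(body)
```

Wrapping the call this way keeps the crawler from crashing on a single failed page, which matters once you loop over many URLs.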

About .content and .text

.content returns the response body as a raw byte string (bytes), one level above raw binary; you can decode it yourself with whatever encoding you choose.

.text (a property, not a method) returns the body decoded with the encoding requests guesses from the response. This is convenient and usually intuitive, but a wrong guess produces garbled text (mojibake).

The practice commonly recommended in answers online is to decode the raw bytes explicitly:

.content.decode('utf-8')
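The difference between the two, and why the explicit decode helps, can be shown without a network call. The bytes below stand in for a response's .content; this is a sketch, not output from the Zhihu page:

```python
# Raw bytes as a server would send them: UTF-8 encoded Chinese text.
raw = '知乎'.encode('utf-8')  # stands in for resp.content

# What .text effectively does when requests guesses the wrong encoding:
garbled = raw.decode('iso-8859-1')  # produces mojibake, not the original text

# The recommended explicit decode:
correct = raw.decode('utf-8')

print(garbled)
print(correct)  # 知乎
```

If you know the page's real encoding (often declared in the HTML meta tag or the Content-Type header), decoding .content yourself is the reliable path; setting resp.encoding before reading .text achieves the same effect.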


Origin blog.csdn.net/weixin_47249161/article/details/113876299