Garbled Chinese text when crawling HTML with Python

Environment:

Python 3.6

Crawling URL: https://www.dygod.net/html/tv/hytv/

Crawling Code:

import requests
url = 'https://www.dygod.net/html/tv/hytv/'
req = requests.get(url)
print(req.text)

Crawling results:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<META http-equiv=Content-Type content="text/html; charset=gb2312">
<title>µçÊÓ¾ç / »ªÓïµçÊÓ¾ç_µçÓ°ÌìÌÃ-ѸÀ×µçÓ°ÏÂÔØ</title>
<meta name="keywords" content="ѸÀ×µçÓ°£¬Ñ¸À×ÏÂÔØ£¬Ãâ·ÑµçÓ°">
<meta name=description content="Ãâ·ÑѸÀ×µçÓ°ÏÂÔØ,ѸÀ×ÏÂÔØ£¬×îºÃµÄѸÀ×ÏÂÔØÕ¾£¬ÊÇÓ°ÃÔµÄÊ×Ñ¡">
<link href="/css/dygod.css" rel="stylesheet" type="text/css" />

As shown above, the title text is garbled. I suspected an encoding problem but did not know how to solve it, so I searched online.

Reference website:

https://www.cnblogs.com/bw13/p/6549248.html

The cause turned out to be that the response header specifies only the content type, not the encoding (nowadays the encoding is usually declared directly in the HTML page itself). You can confirm this by inspecting the response headers of the original page.

In the Content-Type header, no encoding is given. A properly configured response includes a charset parameter there, for example charset=utf-8.

So the default encoding is used.

"HTTP Definitive Guide" in Chapter 16 of the International mentioned, if the HTTP response Content-Type field is not specified charset, the default page is the 'ISO-8859-1' encoding.

This works fine for English pages, but for Chinese pages the text comes out garbled!
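To see why the output looks the way it does, the mismatch can be reproduced without any network access. This is only a small sketch; the sample string is an assumption chosen for illustration.

```python
original = "电视剧"  # sample Chinese text ("TV series"), assumed for illustration

# Bytes as a server using GB2312 would send them (two bytes per character)
raw = original.encode("gb2312")

# Decoding with the wrong codec never raises an error, because every byte
# is valid ISO-8859-1; it simply yields one Latin character per byte.
garbled = raw.decode("iso-8859-1")
print(garbled)

# Decoding the same bytes with the right codec restores the text
restored = raw.decode("gb2312")
print(restored)
```

No data is lost in the wrong decoding, which is why simply choosing the right encoding before reading the text is enough to fix the output.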

print(req.apparent_encoding)

Result: GB2312
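The detected value agrees with the charset that the page declares in its own META tag (visible in the crawl results above). As a rough sketch, that declaration can also be read straight from the raw bytes. The function name sniff_meta_charset and the scan_limit parameter are hypothetical, and the regular expression is a simplification, not a full HTML parser.

```python
import re

# Matches charset=... in raw HTML bytes, with or without quotes around the value
META_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE)


def sniff_meta_charset(html_bytes, scan_limit=2048):
    """Return the charset declared near the top of the raw HTML, or None.

    Hypothetical helper; only the first scan_limit bytes are searched.
    """
    match = META_CHARSET_RE.search(html_bytes[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


head = b'<head><META http-equiv=Content-Type content="text/html; charset=gb2312"></head>'
print(sniff_meta_charset(head))
```

With requests, the raw bytes to pass in would be req.content rather than req.text.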

So we only need to add:

req.encoding = req.apparent_encoding

That is all it takes!

Full code:

import requests
url = 'https://www.dygod.net/html/tv/hytv/'
req = requests.get(url)
req.encoding = req.apparent_encoding
print(req.text)

The Chinese text in the output is no longer garbled.
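Since apparent_encoding is a guess made from the response body, one possible refinement is to apply it only when the header did not name a charset. The helper below is a hypothetical sketch, not part of requests. It assumes that requests reports ISO-8859-1 (or None) for encoding when the Content-Type carries no charset, so a server that genuinely declares ISO-8859-1 would be treated the same way.

```python
def use_detected_encoding_if_unspecified(response):
    """Switch to the detected encoding only when no real charset was declared.

    Hypothetical helper. It works on any object that has the `encoding`
    and `apparent_encoding` attributes, such as a requests response.
    """
    declared = (response.encoding or "").lower()
    # An empty value or the ISO-8859-1 fallback suggests the header had no charset
    if declared in ("", "iso-8859-1"):
        response.encoding = response.apparent_encoding
    return response


# Intended use (needs the requests package and network access):
#   import requests
#   req = use_detected_encoding_if_unspecified(requests.get(url))
#   print(req.text)
```

Pages whose headers already declare an encoding such as utf-8 are left untouched, so a wrong guess cannot override a correct declaration.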

Origin www.cnblogs.com/bingchuan-study/p/11487164.html