Environment:
Python 3.6
Crawling URL: https://www.dygod.net/html/tv/hytv/
Crawling Code:
import requests
url = 'https://www.dygod.net/html/tv/hytv/'
req = requests.get(url)
print(req.text)
Crawling results:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<META http-equiv=Content-Type content="text/html; charset=gb2312">
<title>µçÊÓ¾ç / »ªÓïµçÊÓ¾ç_µçÓ°ÌìÌÃ-ѸÀ×µçÓ°ÏÂÔØ</title>
<meta name="keywords" content="ѸÀ×µçÓ°£¬Ñ¸À×ÏÂÔØ£¬Ãâ·ÑµçÓ°">
<meta name=description content="Ãâ·ÑѸÀ×µçÓ°ÏÂÔØ,ѸÀ×ÏÂÔØ£¬×îºÃµÄѸÀ×ÏÂÔØÕ¾£¬ÊÇÓ°ÃÔµÄÊ×Ñ¡">
<link href="/css/dygod.css" rel="stylesheet" type="text/css" />
As shown above, the title text is garbled. I suspected an encoding problem but did not know how to fix it, so I searched online.
Reference website:
https://www.cnblogs.com/bw13/p/6549248.html
The cause turned out to be this: the response header specifies only the content type and not the encoding (these days the encoding is usually declared directly in the HTML page itself). You can confirm this by inspecting the response headers of the original page.
The Content-Type header carries no encoding, whereas a normal response would include a charset there.
So requests falls back to its default encoding.
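To make the "type without encoding" case concrete, here is a small sketch that parses a Content-Type value with the standard library only. The helper name charset_from_content_type and the two sample header strings are assumptions for illustration; they are not copied from the real response.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; requests does its own parsing.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None
    # when the header has no charset parameter.
    return msg.get_content_charset()


# Header that names only the type, as on the page crawled above (assumed value).
without_charset = charset_from_content_type("text/html")

# Header that also names the encoding (assumed value).
with_charset = charset_from_content_type("text/html; charset=GB2312")

print(without_charset, with_charset)
```

When the helper returns None, the client has nothing to go on from the header and must pick a default or guess from the body.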
Chapter 16 (Internationalization) of "HTTP: The Definitive Guide" notes that if the Content-Type field of an HTTP response does not specify a charset, the page is assumed to be encoded as 'ISO-8859-1'.
That is fine for English pages, but Chinese pages come out garbled!
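The effect can be reproduced offline. The sketch below assumes a short sample string (电视剧, "TV series") standing in for the page title; it encodes the string as GB2312, then decodes the bytes the wrong way and the right way.

```python
# Sample text standing in for the page title (an assumption for illustration).
original = "电视剧"

# What the server actually sends: GB2312-encoded bytes.
raw_bytes = original.encode("gb2312")

# Decoding with the ISO-8859-1 fallback never raises an error,
# because every byte maps to some character, but the text is garbled.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the real encoding recovers the text.
recovered = raw_bytes.decode("gb2312")

# ISO-8859-1 is a one-byte-per-character mapping, so the damage is
# reversible: re-encode to get the original bytes, then decode correctly.
repaired = garbled.encode("iso-8859-1").decode("gb2312")

print(garbled)
print(recovered)
print(repaired)
```

Each Chinese character takes two bytes in GB2312, so the three-character sample turns into six unrelated Latin-1 characters, the same kind of garbling seen in the title above.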
print(req.apparent_encoding)
Result: GB2312
So we only need to add:
req.encoding = req.apparent_encoding
That's all it takes!
Code:
import requests
url = 'https://www.dygod.net/html/tv/hytv/'
req = requests.get(url)
req.encoding = req.apparent_encoding
print(req.text)
The Chinese text in the results is no longer garbled.
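Since apparent_encoding is a guess based on the bytes of the body, an alternative is to read the encoding the page declares for itself in its meta tag. The following is only a sketch under that assumption: charset_from_meta is a hypothetical helper, not part of requests, and the regular expression is a simple approximation rather than a full HTML parser.

```python
import re

# Match charset=... as it appears in <meta> tags, quoted or unquoted.
_META_CHARSET = re.compile(
    rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def charset_from_meta(raw_html, default="iso-8859-1"):
    """Return the charset declared near the top of an HTML document.

    Hypothetical helper: only the first 2048 bytes are scanned, since
    meta declarations normally sit at the start of <head>.
    """
    match = _META_CHARSET.search(raw_html[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# The meta line from the crawl results above, as bytes.
sample = b'<META http-equiv=Content-Type content="text/html; charset=gb2312">'
declared = charset_from_meta(sample)
print(declared)
```

With a real response this could be used as req.encoding = charset_from_meta(req.content) before reading req.text, in place of the apparent_encoding line.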