[python] Identifying character encodings with chardet

For characters that humans can read, a computer stores them in binary form according to a defined correspondence. This correspondence is the character encoding table: it specifies which binary code represents which character. Many such encoding tables exist, so to turn a binary file with an unknown encoding into readable text, we first need to determine which character encoding it uses. See the article Character Sets and Character Encodings for further background on character encodings.
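As a quick illustration of this correspondence, Python's built-in ord and encode expose the code point and the stored bytes of a character (a minimal sketch, standard library only):

# A character maps to a code point, which is stored as bytes
ch = 'A'
print(ord(ch))                     # 65, the code point
print(ch.encode('utf-8'))          # b'A', the stored byte
print(bin(ch.encode('utf-8')[0]))  # 0b1000001, the binary form actually stored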

In practice, the character encoding of a file is usually guessed from the characteristic bytes of each encoding. Many characters are shared across encodings, but each encoding may store the same character as a different byte sequence, and detection exploits exactly this difference (see the short sketch after the install command below). In Python, the chardet library provides automatic character encoding detection and supports most common encodings; see its official repository: chardet. Install it as follows:

pip install chardet
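As an illustration of those byte-level differences, the same Chinese character encodes to different byte sequences under UTF-8 and GBK (a minimal sketch using only Python's built-in encode):

# The same character, different encodings, different byte sequences
print('你'.encode('utf-8'))   # b'\xe4\xbd\xa0' (three bytes)
print('你'.encode('gbk'))     # b'\xc4\xe3' (two bytes)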

1 Usage

Basic usage

chardet provides the detect function for automatic character encoding detection. detect takes one argument, a bytes (non-Unicode) string, and returns a dictionary containing the detected character encoding, a confidence value in the range 0 to 1, and the detected language.

# Import libraries
import urllib.request
import chardet

# Read a website; the result shows it uses ascii encoding
rawdata = urllib.request.urlopen('http://baidu.com/').read()
chardet.detect(rawdata)
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

# Read a website; the result shows it uses utf-8 encoding
rawdata = urllib.request.urlopen('http://en.people.cn/').read()
chardet.detect(rawdata)
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

# Create utf-8 byte data
data = bytes('hello, world', encoding='utf-8')
print(data)
chardet.detect(data)
b'hello, world'
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

# bytes objects can be decoded directly with Python's decode method
data.decode('ascii')
'hello, world'

# Create utf-8 byte data containing non-ASCII characters;
# note that each Chinese character occupies several bytes in utf-8
data = bytes('hello, world!你好世界!', encoding='utf-8')
print(data)
chardet.detect(data)
b'hello, world!\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x81'
{'encoding': 'utf-8', 'confidence': 0.9690625, 'language': ''}

# bytes objects can be decoded directly with Python's decode method
data.decode('utf-8')
'hello, world!你好世界!'

# With too little data, recognition may fail
data = bytes('你好世界', encoding='GBK')
chardet.detect(data)
{'encoding': None, 'confidence': 0.0, 'language': None}

# Richer character data improves the recognition rate
data = bytes('你好世界,你好', encoding='GBK')
chardet.detect(data)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
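In practice the detected encoding is usually fed straight into decode. The sketch below shows this common pattern, assuming a placeholder file name unknown.txt, with a utf-8 fallback in case detection returns None:

import chardet

# Read raw bytes from a file (the name 'unknown.txt' is a placeholder)
with open('unknown.txt', 'rb') as f:
    raw = f.read()

result = chardet.detect(raw)
# detect may return {'encoding': None, ...} on very short input,
# so fall back to utf-8 when no encoding was recognized
encoding = result['encoding'] or 'utf-8'
text = raw.decode(encoding, errors='replace')
print(result, text[:50])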

Recognizing large amounts of text

If you are dealing with a large amount of text, you can use UniversalDetector to speed up recognition. The following code first creates a UniversalDetector object, then feeds it the text block by block with its feed method. Once the detector reaches a minimum confidence threshold, it sets detector.done to True; after calling close, detector.result holds the detected character encoding.

import urllib.request
from chardet.universaldetector import UniversalDetector

usock = urllib.request.urlopen('http://baidu.com/')
detector = UniversalDetector()
# Feed the response line by line and stop once the detector is confident
for line in usock.readlines():
    detector.feed(line)
    if detector.done: break
detector.close()
usock.close()
print(detector.result)
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
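The same incremental pattern works for local files; reading line by line lets the loop exit early on large inputs. A minimal sketch, assuming a placeholder file name big_file.txt:

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
# 'big_file.txt' is a placeholder for any large binary file
with open('big_file.txt', 'rb') as f:
    for line in f:
        detector.feed(line)
        # Stop early once the detector is confident enough
        if detector.done:
            break
detector.close()
print(detector.result)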

UniversalDetector can also be reused to speed up recognition across multiple files or strings; call its reset method before each new document.


from chardet.universaldetector import UniversalDetector

texta = bytes('hello, world', encoding='utf-8')
textb = bytes('你好世界,你好', encoding='GBK')

detector = UniversalDetector()
for data in [texta, textb]:
    # Reset the detector before each new document
    detector.reset()
    detector.feed(data)
    # Each document is fed in full, so close and read the result directly
    detector.close()
    print(detector.result)
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
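The same reset pattern scales to whole directories. A sketch along the lines of the chardet documentation, assuming a hypothetical set of *.txt files in the current directory:

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
# '*.txt' is an example pattern; reuse one detector across all files
for filename in glob.glob('*.txt'):
    detector.reset()
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(filename, detector.result)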

Using UnicodeDammit

UnicodeDammit is Beautiful Soup's built-in tool for guessing character encodings. It integrates the chardet module, which lets us quickly obtain both the character encoding and the decoded text.

from bs4 import UnicodeDammit

data = bytes('你好世界,你好', encoding='GBK')
dammit = UnicodeDammit(data)
# Decoded result
print(dammit.unicode_markup)
# Print the detected encoding
print(dammit.original_encoding)
# Or call chardet directly
print(dammit.detector.chardet_encoding)
你好世界,你好
gb2312
GB2312
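
If you already have a good guess about the encoding, UnicodeDammit also accepts a list of candidate encodings to try before falling back to detection. A small sketch with the same GBK data (the candidate list here is just an example):

from bs4 import UnicodeDammit

data = bytes('你好世界,你好', encoding='GBK')
# Try the listed encodings first; the list is an example choice
dammit = UnicodeDammit(data, ['gb2312', 'utf-8'])
print(dammit.unicode_markup)     # 你好世界,你好
print(dammit.original_encoding)  # gb2312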
