Some solutions to the garbled-text problem

In the past few days I have learned more about converting between character encodings, so here are a few more notes.

In Python 3, every str is UNICODE text, and that is the only kind of text the interpreter works with internally. Python reads source files as UTF-8 by default, but open() without an encoding argument follows the system locale.
In China, the default encoding of the Office suite is GBK, and the default encoding of Windows is GBK as well.

Among the Chinese encodings, GB2312 is the smallest and GBK is a superset of it. BIG5 is a separate encoding for Traditional Chinese, not a larger version of GBK.
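To see how these character sets relate, a small sketch can try encoding a few characters with each codec. The helper name can_encode is made up for illustration; gb2312, gbk and big5 are codec names from Python's standard library.

```python
def can_encode(ch: str, encoding: str) -> bool:
    """Return True if ch can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+4F60 is a common character, U+9F8D is a Traditional form, U+1F600 is an emoji.
for ch in ["你", "龍", "😀"]:
    coverage = {enc: can_encode(ch, enc) for enc in ("gb2312", "gbk", "big5", "utf-8")}
    print(f"U+{ord(ch):04X}", coverage)
```

The Traditional character fails in GB2312 but works in GBK, which is the superset relationship in practice.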

When text is read with the wrong encoding, Chinese characters will almost always raise an error or come out garbled, while plain English (ASCII) text is usually unaffected, because these encodings all store ASCII the same way.


To save on-screen text to a file in a particular encoding, you need to encode it. To read content in a particular encoding from a file, you need to decode it.

To put it another way, encode is like packing and decode is like unpacking (a loose analogy: nothing is actually compressed).
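A minimal sketch of the round trip, using only the built-in str.encode and bytes.decode methods. The sample string is arbitrary.

```python
text = "你好, world"

# encode: str (Unicode text) -> bytes in a specific encoding
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# The same text has a different byte representation in each encoding.
print(len(text), len(utf8_bytes), len(gbk_bytes))

# decode: bytes -> str, using the encoding the bytes were written in
assert utf8_bytes.decode("utf-8") == text
assert gbk_bytes.decode("gbk") == text

# Decoding with the wrong codec does not always raise an error;
# it can also silently produce garbled text.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled == text)
```

The last step is the usual source of garbled text: the bytes are fine, but the codec used to read them is not the one used to write them.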

Common encodings include UTF-8, GB2312 for Chinese, its extension GBK, and BIG5. What is displayed on screen is UNICODE text.
UTF-8 supports Chinese and emoji (such as smiling faces).

Emoji raise an error when UTF-8 text is converted to GBK, because GBK has no emoji.
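Before converting, it can help to find out which characters will cause trouble. The sketch below is one way to do that; the function name unencodable_chars and the sample string are made up for illustration.

```python
def unencodable_chars(text: str, encoding: str = "gbk") -> list[str]:
    """List the distinct characters in text that the encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


sample = "你好😀, see you 👋"
bad = unencodable_chars(sample)
# Print code points rather than the characters themselves,
# so the output is safe on any console.
print([f"U+{ord(ch):04X}" for ch in bad])
```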

The utf8 charset in the MySQL database is not regular UTF-8: MySQL's utf8 stores at most three bytes per character, while regular UTF-8 needs up to four. The MySQL charset that corresponds to regular UTF-8 is utf8mb4.
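The three-byte limit is easy to check from Python. This is only a sketch: needs_utf8mb4 is a hypothetical helper name, and it tests the text itself, not anything on the MySQL side.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character is above U+FFFF, i.e. takes four bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


# A Chinese character fits in three bytes; an emoji needs four.
print(len("中".encode("utf-8")), len("😀".encode("utf-8")))
print(needs_utf8mb4("你好"), needs_utf8mb4("你好😀"))
```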

When Python encounters bytes that are not valid in the encoding it was told to use, it reports an error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence

The message means that a byte sequence could not be decoded with the given codec.
This kind of error exits the program immediately unless you catch it: it is an error (an exception), not a warning.
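Because it is an ordinary exception, it can be caught and inspected. The sketch below provokes the error on purpose by decoding GBK bytes as UTF-8; the attributes encoding, start, reason and object are standard fields of UnicodeDecodeError.

```python
gbk_bytes = "你好".encode("gbk")  # the bytes a GBK file would contain

try:
    gbk_bytes.decode("utf-8")  # wrong codec on purpose
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    # The exception records the codec, the offending position and the reason.
    print(exc.encoding, exc.start, exc.reason)
    bad_byte = exc.object[exc.start]
    print(hex(bad_byte))
```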

If only a few characters are affected, you can write:

 file = open(root_log, 'w', encoding='gb2312', errors='replace')

The characters that cannot be encoded are then written as ? ? ? ? ? ? ?

This is text I converted from UTF-8 to GBK with errors='replace':

你好:? ? ? ? 姓名:赵林 ? ? ? ?emos账号:zhaolin? ? ? ? 青岛市移动公司 ? ?网络部 ? ?网优中心? ? ? ? ? 登录这个系统:http://10.24.170:7090/drp/? ? ? ? 登录维修物料账套? ? ? ? 我应该是新增账户,之前没有账号。? ? ? ? 删除青岛网优中心weiyi账号。[email protected]?使用O
Tip: clearly, UTF-8 can represent more characters than GBK...

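If you use errors='ignore' instead, whatever cannot be converted is skipped rather than replaced. In my experience all sorts of garbled characters can still show up, such as plum-blossom symbols, stars and so on.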
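For comparison, here is a sketch of what the standard error handlers do when encoding a string that contains one emoji. Besides replace and ignore, Python also ships backslashreplace and xmlcharrefreplace, which keep the information instead of throwing it away.

```python
text = "你好😀"

replaced = text.encode("gbk", errors="replace")            # emoji becomes ?
ignored = text.encode("gbk", errors="ignore")              # emoji is dropped
escaped = text.encode("gbk", errors="backslashreplace")    # emoji becomes \U0001f600
xml_ref = text.encode("gbk", errors="xmlcharrefreplace")   # emoji becomes &#128512;

for label, data in [
    ("replace", replaced),
    ("ignore", ignored),
    ("backslashreplace", escaped),
    ("xmlcharrefreplace", xml_ref),
]:
    # Printing the bytes repr keeps the output ASCII-only.
    print(label, data)
```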

If you need to handle text in different encodings, I usually use try...except:

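try:
    with open('XXXXX', 'r', encoding='utf-8') as f:
        text = f.read()  # decoding happens on read, not on open
except UnicodeDecodeError:
    with open('XXXXX', 'r', encoding='gbk', errors='replace') as f:
        text = f.read()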

If that is not enough, add a few more branches... Normally, most files are either GBK or UTF-8.
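The same idea can be generalized to a list of candidate encodings. This is a minimal sketch; the function name read_text_with_fallback and the demo file are made up. UTF-8 goes first because it is strict and rejects most non-UTF-8 data, whereas GBK accepts a much wider range of byte sequences.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "gbk")):
    """Try each encoding in order; return (text, encoding_used)."""
    with open(path, "rb") as f:
        data = f.read()
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement markers instead of crashing.
    return data.decode(encodings[-1], errors="replace"), encodings[-1]


# Demo with a temporary GBK file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="gbk") as f:
        f.write("你好")
    text, used = read_text_with_fallback(path)
    print(used)
```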

Python itself defaults to UTF-8 for source code, while Chinese Windows defaults to GBK (or GB2312). So it is best to write the encoding out by hand rather than rely on the default.

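In other words: if you write f = open(XXXX) in a Python program, the encoding is not necessarily UTF-8; it follows the system locale, so write encoding="utf-8" explicitly.
You can also put an encoding declaration comment at the top of the file (it applies to the source file itself, not to the files it opens):

# -*- mode: python; coding: utf-8 -*-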
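To check what a given machine would use when no encoding is passed, the standard locale module can be queried. The sketch below also shows that an explicit encoding gives the same bytes everywhere; the file name is arbitrary.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")
    # Being explicit makes the result the same on every machine.
    with open(path, "w", encoding="utf-8") as f:
        f.write("你好")
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()
    with open(path, "rb") as f:
        raw = f.read()

print(raw)
```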

By default, a str in Python is decoded UNICODE text, and its encoded form is of type bytes.

Mail generally uses one of two encodings: UTF-8 or GB2312. Which one depends on the mood of the mailbox... the mail itself specifies the charset to decode it with.
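With the standard email package, the declared charset can be read from the message and used for decoding. The message below is hand-built for the example; real mail would come from a mail server or a saved file.

```python
import base64
from email import message_from_bytes

# A hand-built sample message with a GB2312 body, base64-encoded.
body = base64.b64encode("你好".encode("gb2312"))
raw = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="gb2312"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + body + b"\r\n"
)

msg = message_from_bytes(raw)
charset = msg.get_content_charset() or "utf-8"  # fall back if the header is missing
payload = msg.get_payload(decode=True)          # undo the base64 transfer encoding
text = payload.decode(charset, errors="replace")
print(charset, len(text))
```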

When importing data into a database, you must identify the file's encoding first, otherwise an error will be reported.
Navicat can detect the file encoding automatically, and so can UES. If you don't want to install anything, the default TXT reader in the system (the one below) is fine.

[Image: the system's default TXT reader]

DBeaver cannot detect the encoding automatically, so you have to set it manually.
That said, I still think DBeaver is easy to use...
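One way around the manual step is to re-encode the file to UTF-8 before importing it. This is a sketch under the assumption that the source encoding is already known; convert_to_utf8 and the file names are made up.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file to UTF-8; src_encoding must be known beforehand."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo: a small GBK file with Windows line endings
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "data_gbk.csv")
    dst_file = os.path.join(tmp, "data_utf8.csv")
    with open(src_file, "wb") as f:
        f.write("姓名,城市\r\n赵林,青岛\r\n".encode("gbk"))
    convert_to_utf8(src_file, dst_file, "gbk")
    with open(dst_file, "rb") as f:
        converted = f.read()

print(len(converted))
```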

Python has a package called chardet that detects encodings, but people complain that it is a bit slow...
It also produces garbled results on text/html (it takes it for a different encoding). I found some code online that is supposed to solve this (not tried):

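import re
from html import unescape

def html_to_plain_text(html):
    # Drop the whole <head> block, including its contents
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    # Mark each opening <a ...> tag with a placeholder
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    # Strip all remaining tags
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    # Collapse runs of blank lines into a single newline
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    # Turn HTML entities such as &amp; back into characters
    return unescape(text)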

The effect is supposed to be pretty good, but I don't really understand how it works...
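Coming back to chardet itself, a minimal sketch of how detection could be called. chardet is a third-party package, so the import is guarded; guess_encoding and sample_size are made-up names, and limiting detection to a sample is just one possible way to ease the speed complaint.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(data: bytes, sample_size: int = 4096):
    """Return chardet's best guess for the encoding, or None if unavailable."""
    if chardet is None:
        return None
    # Detecting on a sample rather than the whole file keeps it faster.
    result = chardet.detect(data[:sample_size])
    return result["encoding"]


sample = "编码检测只是一个猜测,短文本尤其容易猜错。".encode("gbk") * 20
guess = guess_encoding(sample)
print(guess)
```

The result is a guess, not a guarantee, so it is safer to still decode inside a try...except.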

That is about all for now.

Origin blog.csdn.net/weixin_45642669/article/details/112916263