History of the most popular and thoroughly get to know the characters garbled nature of the problem

1 Introduction

IM applications such as social development, the garbage problem is also common, such as:

1) IM chat messages sent Emoji Why look after the back-end MySQL database will be messy;

2) file name after large file send chat messages with the Chinese, the other party to see the name of the text is garbled;

3) When Http rest interface calls, read the back-end to end APP pass over the parameters are Chinese garbage problem;

... ...

Well, for this seemingly insignificant garbled, but not twelve words can clear the problem, it is necessary to understand the root causes of character sets and coding theory, know why they know these basic literacy is obviously a good number of farmers Therefore, there was this article, hoping to help you.

2, text outline

Character set and encoding is undoubtedly even novice IT headaches large variety of God. When faced with complex character sets, and a variety of Martian garbled, locate the problem often becomes very difficult.

In this article it will be a brief introduction to science and coded character sets from the principle of respect, but also introduce some common methods of garbage fault location to facilitate the reader to locate more relaxed after related issues.

This article is an introduction by bloggers post their own understanding digested and converted into easy to understand plain statements later, will endeavor to provide clear and simple words to explain the character set from the source to the concept of character encoding, as well as some common diagnosis in the event of garbled skills, hoping to help you to "garbage" to have a better understanding of the problem.

3. What is the character set

Before introducing the character set, we first understand why we need character set.

We see on the computer screen is an entity of the text, and stored in a computer storage medium is actually a binary bit stream. Then the conversion rules between the two would require a uniform standard, otherwise our U disk into the boss's computer, the document garbled; junior partner over QQ upload files open in our local and garbled .

So in order to achieve the conversion of standards, character set standards emerged.

Simply put: the character set was subject to a binary digital storage methods (coding) and a string of binary values corresponding to a text which represents text (decoded) conversion relationship. 

So why are there so many character set standards?

The actual problem is very easy to answer. Ask yourself why our plug can not be used to get the British out? Why monitor while DVI, VGA, HDMI, DP interfaces so much of it? Many norms and standards in the initial formulation and not realize this would be after the global universal norms, or in the interests of the organization wanted to distinguish itself from the nature of existing standards. Thus, it produces so much but not compatible with the same standard effects. 

Said so much we look at a practical example, the following is the "Cock" is the word in hex and binary coded result of a variety of encoding, how there is a very Cock feeling?

 

4. What is the character encoding

Charset name just a set of rules, which corresponds to real life, the character set is the term for a language. For example: English, Chinese, Japanese.

For a properly coded character sets for transcoding a character requires three key elements:

1) Table font (character repertoire): it is the equivalent of a character may be displayed or all of the readable database, determines the font table to show the range of the entire character set for all characters of the representation;

2) the coded character set (coded character set): i.e. encoded with a value represented by a character code point in the font position;

3) character codes (character encoding form): The conversion between coded character set and the actual value is stored.

Generally code point values are directly encoded as the value stored directly. For example "A" in the 65th row of the table in ASCII, and the encoded value A  01000001  i.e. decimal binary conversion results 65.

See here, many readers will have and I had the same question: character coded character set table and appears to be essential, since the character table that each character has its own serial number, the serial number as a direct store content just fine. Why bother to pass character encoding to convert numbers into another storage format it?

Actually, the reason is easy to understand: unity of purpose font table is to be able to cover all of the characters in the world, but the actual use of the process will discover the true use of the characters on the relative proportion of the whole font table is very low. Chinese regions such as program hardly needs Japanese characters, and some English-speaking countries and even simple ASCII character table will be able to meet their basic needs. And if each character fonts are used to store the number in the table, then each character will need three bytes (here in Unicode font, for example), so the English original ASCII code for countries in the region with only one character obviously an additional cost (storage volume tripled). Some direct count, the same hard disk, you can store the ASCII 1500 article, and with No. 3-byte Unicode memory can store 500. So there have been such a UTF-8 variable length coding. In UTF-8 encoding, originally only one byte ASCII characters, is still only one byte. And like the Chinese and Japanese characters so complex you need 2-3 bytes to store.

Explain in detail about the character encoding of knowledge, see: " character encoding that something: a quick understanding of ASCII, Unicode, GBK-8 and UTF ."

5, the relationship between the Unicode UTF-8 and

After reading the two concepts explained above, then explain the relationship between Unicode and UTF-8 is relatively simple.

Unicode is mentioned hereinabove coded character set, and UTF-8 character encoding is, i.e., a rule Unicode font Realization.

With the development of the Internet, the same font set requirements for more and more urgent, Unicode standard will naturally appear. It covers almost symbolic and literal language of each country that may arise, and their number. See: Unicode Wikipedia introduction .

Unicode numbers from  0000  until the beginning 10FFFF  divided into 17 Plane, Plane each have 65,536 characters. The UTF-8 only implemented the first Plane, visible UTF-8, although a degree of acceptance of today's most widely used character set encoding, but it does not cover the entire Unicode font, which also resulted in some scenarios it for difficult to handle special characters (hereinafter, it will be referred to).

6, UTF-8 encoding Introduction

For a better understanding of the practical application of the back, here we simply introduce UTF-8 encoding implementation. I.e., the relationship between physical storage and conversion Unicode UTF-8 of the sequence number. 

UTF-8 encoding is variable length coding, the smallest coding unit (code unit) is one byte. 1-3 a front bit part bytes of description, the back part of the actual number:

  • 1) If the first byte is 0, representing the current character is a single-byte characters, one byte of space. All parts (7 bit) representative of the serial number in Unicode after 0;
  • 2) If a byte begins with 110, then the representative of the current character is double-byte characters occupy two bytes of space. Section (6 bit) in Unicode number representative of the outer portion 10 after all of the 110 (5 bit) byte plus an addition. And beginning with the second byte 10;
  • 3) If a byte to the beginning of 1110, the representatives of the current character is three-byte characters, occupies three bytes of space. Portion (12 bit) representative of the serial number in the Unicode two-byte outer 10. With all the sections (5 bit) after addition of 110 plus. And the second, the third byte to the beginning of the 10;
  • 4) If a byte begins with 10, then the representative of the current byte to multi-byte characters in the second byte. All parts (6 bit) before and after a portion 10 together in the composition of the Unicode number.

Specific features of each byte can in the table below, where "x" stands part number, all the individual bytes x spliced ​​together to form the number of the Unicode character. As shown below.

 
 

We were to look at three bytes from a three-byte UTF-8 encoding Examples: 

 

A careful reader can easily come to the following rules from the above simple introduction:

1)3个字节的UTF-8十六进制编码一定是以E开头的;

2)2个字节的UTF-8十六进制编码一定是以C或D开头的;

3)1个字节的UTF-8十六进制编码一定是以比8小的数字开头的。

7、为什么会出现乱码

乱码也就是英文常说的mojibake(由日语的文字化け音译)。

简单的说乱码的出现是因为:编码和解码时用了不同或者不兼容的字符集。

对应到真实生活中:就好比是一个英国人为了表示祝福在纸上写了bless(编码过程)。而一个法国人拿到了这张纸,由于在法语中bless表示受伤的意思,所以认为他想表达的是受伤(解码过程)。这个就是一个现实生活中的乱码情况。

在计算机科学中一样:一个用UTF-8编码后的字符,用GBK去解码。由于两个字符集的字库表不一样,同一个汉字在两个字符表的位置也不同,最终就会出现乱码。 

 

我们来看一个例子,假设我们用UTF-8编码存储“很屌”两个字,会有如下转换:

 

于是我们得到了E5BE88E5B18C这么一串数值,而显示时我们用GBK解码进行展示,通过查表我们获得以下信息: 

解码后我们就得到了“寰堝睂”这么一个错误的结果,更要命的是连字符个数都变了。

8、如何识别乱码的本来想要表达的文字

要从乱码字符中反解出原来的正确文字需要对各个字符集编码规则有较为深刻的掌握。但是原理很简单,这里用以MySQL数据库中的数据操纵中最常见的UTF-8被错误用GBK展示时的乱码为例,来说明具体反解和识别过程。

8.1 第1步:编码

假设我们在页面上看到“寰堝睂”这样的乱码,而又得知我们的浏览器当前使用GBK编码。那么第一步我们就能先通过GBK把乱码编码成二进制表达式。

当然查表编码效率很低,我们也可以用以下SQL语句直接通过MySQL客户端来做编码工作:

mysql [localhost] {msandbox} > selecthex(convert('寰堝睂'using gbk));

+-------------------------------------+

| hex(convert('寰堝睂'using gbk))    |

+-------------------------------------+

| E5BE88E5B18C                        |

+-------------------------------------+

1 row inset(0.01 sec)

8.2 第2步:识别

现在我们得到了解码后的二进制字符串E5BE88E5B18C。然后我们将它按字节拆开。

 

然后套用之前UTF-8编码介绍章节中总结出的规律,就不难发现这6个字节的数据符合UTF-8编码规则。如果整个数据流都符合这个规则的话,我们就能大胆假设乱码之前的编码字符集是UTF-8。

8.3 第3步:解码

然后我们就能拿着 E5BE88E5B18C 用UTF-8解码,查看乱码前的文字了。

当然我们可以不查表直接通过SQL获得结果:

mysql [localhost] {msandbox} ((none)) > selectconvert(0xE5BE88E5B18C using utf8);

+------------------------------------+

| convert(0xE5BE88E5B18C using utf8) |

+------------------------------------+

| 很屌                               |

+------------------------------------+

1 row inset(0.00 sec)

9、常见的IM乱码问题处理之MySQL中的Emoji字符

所谓Emoji就是一种在Unicode位于 \u1F601-\u1F64F 区段的字符。这个显然超过了目前常用的UTF-8字符集的编码范围 \u0000-\uFFFF。Emoji表情随着IOS的普及和微信的支持越来越常见。

下面就是几个常见的Emoji(IM聊天软件中经常会被用到):

那么Emoji字符表情会对我们平时的开发运维带来什么影响呢?

最常见的问题就在于将他存入MySQL数据库的时候。一般来说MySQL数据库的默认字符集都会配置成UTF-8(三字节),而utf8mb4在5.5以后才被支持,也很少会有DBA主动将系统默认字符集改成utf8mb4。

那么问题就来了,当我们把一个需要4字节UTF-8编码才能表示的字符存入数据库的时候就会报错:ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column 。 

如果认真阅读了上面的解释,那么这个报错也就不难看懂了:我们试图将一串Bytes插入到一列中,而这串Bytes的第一个字节是 \xF0 意味着这是一个四字节的UTF-8编码。但是当MySQL表和列字符集配置为UTF-8的时候是无法存储这样的字符的,所以报了错。 

那么遇到这种情况我们如何解决呢?

有两种方式:

  • 1)升级MySQL到5.6或更高版本,并且将表字符集切换至utf8mb4;
  • 2)在把内容存入到数据库之前做一次过滤,将Emoji字符替换成一段特殊的文字编码,然后再存入数据库中。之后从数据库获取或者前端展示时再将这段特殊文字编码转换成Emoji显示。

第二种方法我们假设用 -*-1F601-*- 来替代4字节的Emoji,那么具体实现python代码可以参见Stackoverflow上的回答

10、参考文献

[1] 如何配置Python默认字符集

[2] 字符编码那点事:快速理解ASCII、Unicode、GBK和UTF-8

[3] Unicode中文编码表

[4] Emoji Unicode Table

[5] Every Developer Should Know About The Encoding

Guess you like

Origin www.cnblogs.com/imteck4713/p/12056465.html