Garbled text keeps showing up: how is it produced, and how do you fix it?

Foreword

Garbled Chinese text is a commonplace problem in day-to-day development. So how is garbled text produced, and how do you fix it? This article works through the basic concepts and some examples. I hope you get something out of it.

A simple example of garbled text

package whx;

import java.io.UnsupportedEncodingException;

public class TestEncodeAndDecode {
    public static void main(String[] args) throws UnsupportedEncodingException {

        String str = "测试中文乱码";
        byte[] b = str.getBytes("GBK");
        System.out.println(new String (b,"UTF-8"));
    }
}

The string is encoded with GBK but decoded with UTF-8, so the printed result is garbled.

Related basic concepts

To understand the root cause of garbled text, we first need a clear picture of the related concepts: bits, bytes, characters, and character sets.

Bit (bit)

A bit is the smallest unit of computer data storage; it holds either 0 or 1. For example, the binary number 10010010 is 8 bits long.

byte

A byte is the unit used to measure computer storage capacity. It is a string of binary digits handled as one unit, and is a small unit of information.

1 B = 8 bit (1 byte equals 8 bits)
1 KB = 1024 B = 1024 bytes
1 MB = 1024 KB 
1 GB = 1024 MB
1 TB = 1024 GB

character

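Characters are the letters, digits, and symbols used in computers; a character is the smallest unit of data access in a data structure. For example, a, A, B, b, 大, +, *, and % are each one character.

In ASCII, one English letter takes 1 byte to store.
In GB 2312 or GBK, one Chinese character takes 2 bytes to store.
In UTF-8, one English letter takes 1 byte to store, and one Chinese character takes 3 to 4 bytes.
In UTF-16, one English letter or one common Chinese character takes 2 bytes to store; characters outside the Basic Multilingual Plane take 4 bytes.
In UTF-32, any character in the world takes 4 bytes to store.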
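The byte counts above can be checked directly. The following is a minimal sketch, not part of the original article: the class name ByteLengths and the helper byteLength are made up for illustration, and it assumes the JDK in use ships the GBK and UTF-32BE charsets. Unicode escapes are used so the result does not depend on the source file's own encoding, and the BE variants are used so no byte-order mark is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteLengths {
    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String letter = "a";
        String han = "\u4E2D"; // the Chinese character 中
        Charset gbk = Charset.forName("GBK");        // assumed to be available in this JDK
        Charset utf32 = Charset.forName("UTF-32BE"); // BE variant: no byte-order mark

        System.out.println(byteLength(letter, StandardCharsets.US_ASCII)); // 1
        System.out.println(byteLength(han, gbk));                          // 2
        System.out.println(byteLength(letter, StandardCharsets.UTF_8));    // 1
        System.out.println(byteLength(han, StandardCharsets.UTF_8));       // 3
        System.out.println(byteLength(han, StandardCharsets.UTF_16BE));    // 2
        System.out.println(byteLength(han, utf32));                        // 4
    }
}
```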

character set

A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common character sets:

ASCII character set
GB2312 character set
Unicode character set

encoding and decoding

Computers only recognize binary 0 and 1, while humans each have their own language. For the two sides to exchange information, there has to be a conversion from text to 0s and 1s, and from 0s and 1s back to text.

Encoding: converting text characters into the 0/1 machine code the computer recognizes.

Decoding: parsing the binary numbers stored in the computer back into text characters.
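To make the two directions concrete, here is a small sketch (the class and method names are made up for illustration, and UTF-8 is just one possible choice of encoding). It encodes a single Chinese character into bytes, prints those bytes as 0s and 1s, and decodes them back.

```java
import java.nio.charset.StandardCharsets;

public class EncodeDecodeRoundTrip {
    // Encoding: text -> bytes (UTF-8 chosen here as an example).
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding: bytes -> text, using the same encoding.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render each byte as 8 binary digits, separated by spaces.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            String bits = Integer.toBinaryString(b & 0xFF);
            for (int i = bits.length(); i < 8; i++) sb.append('0'); // left-pad to 8 digits
            sb.append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4E2D"; // the Chinese character 中
        byte[] encoded = encode(text);
        System.out.println(toBits(encoded));               // 11100100 10111000 10101101
        System.out.println(decode(encoded).equals(text));  // true
    }
}
```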

Common character sets and encodings

Common character sets include ASCII, GBK, and Unicode.

ASCII character set

ASCII character set: includes the printable characters of Western text (letters, digits, and symbols) plus control characters such as Enter and Backspace.

ASCII encoding: a character encoding developed in the United States that converts English characters into binary. It defines codes for 128 characters.

GBXXXX character sets

The GBXXXX family includes GB2312, GBK, and GB18030. They are intended for Chinese character processing and for information exchange between systems such as Chinese-language communication systems.

GB2312

  • Full name: "Chinese Coded Character Set for Information Interchange". It supports more than six thousand Chinese characters.
  • The national Simplified Chinese character set, compatible with ASCII. Mainland China and Singapore both adopted it.
  • Each Chinese character or symbol is represented by two bytes.
  • The high byte ranges from A1 to F7 and the low byte from A1 to FE. Adding 0xA0 to the area number and to the position number of a character gives its high byte and low byte respectively.
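The 0xA0 rule can be tried out on a well-known case: 啊 sits at area 16, position 01 in GB2312, so its bytes are B0 A1. The sketch below goes the other way and recovers the area and position from the bytes. The class and method names are hypothetical, and it assumes the JDK in use provides a charset named GB2312.

```java
import java.nio.charset.Charset;

public class Gb2312AreaCode {
    // Returns {area, position} for a character that GB2312 encodes in two bytes.
    static int[] areaAndPosition(String ch) {
        byte[] bytes = ch.getBytes(Charset.forName("GB2312")); // assumed to be available
        if (bytes.length != 2) {
            throw new IllegalArgumentException("not a two-byte GB2312 character");
        }
        int area = (bytes[0] & 0xFF) - 0xA0;     // high byte minus 0xA0
        int position = (bytes[1] & 0xFF) - 0xA0; // low byte minus 0xA0
        return new int[] { area, position };
    }

    public static void main(String[] args) {
        int[] code = areaAndPosition("\u554A"); // the Chinese character 啊
        System.out.println(code[0] + ", " + code[1]); // 16, 1
    }
}
```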

GBK

  • Full name: "Chinese Internal Code Extension Specification". It extends GB2312, adds support for traditional Chinese characters, and supports more than twenty thousand Chinese characters.
  • Each Chinese character or symbol is represented by two bytes; ASCII characters remain one byte.
  • The high byte ranges from 81 to FE and the low byte from 40 to FE (excluding 7F).

GB18030

  • GB 18030, full name "Information Technology: Chinese Coded Character Set", is compatible with GBK and GB2312 and can support 27,484 Chinese characters.
  • It uses a variable-length multi-byte encoding; each character is 1, 2, or 4 bytes long.
  • One-byte codes range from 00 to 7F. In two-byte codes the high byte ranges from 81 to FE and the low byte from 40 to 7E and 80 to FE. In four-byte codes the first and third bytes range from 81 to FE, and the second and fourth bytes from 30 to 39.

Unicode character set

Unicode is a character encoding scheme, developed by international organizations, that can accommodate all the characters and symbols of the world. The Unicode character set has several encodings: UTF-8, UTF-16, and UTF-32.

UTF-8

  • A variable-length character encoding for Unicode.
  • It can represent any character in the Unicode standard, and its single-byte codes are still compatible with ASCII, so software that originally handled ASCII can keep working with no changes or only small ones.
  • UTF-8 uses 1 to 4 bytes per character: ASCII characters take 1 byte, Latin, Greek and similar scripts take 2 bytes, commonly used CJK characters take 3 bytes, and a small number of other characters take 4 bytes.
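The 1-to-4-byte rule follows from how UTF-8 packs the bits of a code point. Below is a hand-written sketch of that packing, compared against the JDK's own encoder. The class and method names are made up for illustration, and the sketch assumes a valid code point (it does not reject surrogates or values above 0x10FFFF).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByHand {
    // Hand-rolled UTF-8 encoding of one code point (assumes a valid, non-surrogate code point).
    static byte[] encodeUtf8(int cp) {
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 110xxxxx 10xxxxxx
            return new byte[] { (byte) (0xC0 | (cp >> 6)), (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {  // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        } else {                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) (0xF0 | (cp >> 18)),
                                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    // Reference result from the JDK's built-in UTF-8 encoder.
    static byte[] jdkUtf8(int cp) {
        return new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x4E2D, 0x1F600 }; // A, é, 中, an emoji
        for (int cp : samples) {
            byte[] mine = encodeUtf8(cp);
            System.out.println(Integer.toHexString(cp) + " -> " + mine.length
                    + " byte(s), matches JDK: " + Arrays.equals(mine, jdkUtf8(cp)));
        }
    }
}
```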

UTF-16

  • Maps the abstract code points of the Unicode character set to sequences of 16-bit integers (code units) for data storage or transfer.
  • Compared with UTF-8, the benefit of UTF-16 is that most characters are stored in a fixed length of 2 bytes, but UTF-16 is not compatible with ASCII.
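Java strings are themselves sequences of UTF-16 code units, which makes the "most characters, but not all" point easy to see. The sketch below uses hypothetical names and shows a common Chinese character taking one code unit, an emoji taking two (a surrogate pair), and the letter A taking two bytes rather than one.

```java
import java.nio.charset.StandardCharsets;

public class Utf16CodeUnits {
    // Number of 16-bit code units UTF-16 needs for one code point.
    static int codeUnits(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Number of bytes in UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600));

        System.out.println(codeUnits(0x4E2D));   // 1: the character 中 fits in one code unit
        System.out.println(codeUnits(0x1F600));  // 2: needs a surrogate pair
        System.out.println(emoji.length());      // 2: String.length() counts code units
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1: but it is one character
        System.out.println(utf16Bytes("A"));     // 2: 00 41, so not byte-compatible with ASCII
    }
}
```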

UTF-32

  • A Unicode encoding that uses exactly 32 bits for every code point; the other Unicode encodings are variable-length.
  • Fixed 4-byte codes are faster to process, but they waste space and slow down transmission.

Seeing the true face of encoding and decoding through an example

For those of us who write code for a living, the first thing we ever met was "hello world". The computer only knows 0 and 1, so how does it manage to show hello world?

From the previous section we already know about character sets and encodings. We can use ASCII to translate "hello world" into the 0s and 1s the computer understands. Interested readers can look up an ASCII table.

"hello world" is stored in the computer as binary 0s and 1s. To display it, the binary code is first decoded into the corresponding characters, which are then rendered on the screen, and that is the hello world we see.
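As a small illustration of that path, the sketch below turns "hello world" into its ASCII code numbers and back again. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HelloAscii {
    // Encode text as ASCII and return the numeric code of each character.
    static int[] toCodes(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        int[] codes = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            codes[i] = bytes[i] & 0xFF;
        }
        return codes;
    }

    // Decode ASCII code numbers back into text.
    static String fromCodes(int[] codes) {
        byte[] bytes = new byte[codes.length];
        for (int i = 0; i < codes.length; i++) {
            bytes[i] = (byte) codes[i];
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        int[] codes = toCodes("hello world");
        System.out.println(Arrays.toString(codes));
        // [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
        System.out.println(fromCodes(codes)); // hello world
    }
}
```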

How is garbled text produced?

Garbled text has two main causes: first, the encoding process and the decoding process use different encodings; second, the character set in use lacks the character, or its font library does.

Encoding and decoding use different encodings

For example, text encoded with UTF-8 and decoded with GBK comes out garbled. In UTF-8 a Chinese character is encoded in three bytes, while GBK reads each Chinese character as two bytes, so the byte boundaries no longer line up and the text is distorted.
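The mismatch can be reproduced in a few lines. This is a sketch with made-up names, assuming the JDK provides the GBK charset. It deliberately does not print an expected garbled string, since the exact output depends on how the decoder handles byte pairs it cannot map.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchedDecode {
    // Encode with one charset, decode with another (hypothetical helper).
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // "测试中文乱码" written with Unicode escapes so the source file encoding does not matter.
        String text = "\u6D4B\u8BD5\u4E2D\u6587\u4E71\u7801";
        Charset gbk = Charset.forName("GBK"); // assumed to be available in this JDK

        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // 18: 3 bytes per character
        System.out.println(text.getBytes(gbk).length);                    // 12: 2 bytes per character

        // Same encoding on both sides: the text survives.
        System.out.println(transcode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text)); // true
        // Different encodings: the bytes are regrouped and the text is garbled.
        System.out.println(transcode(text, StandardCharsets.UTF_8, gbk).equals(text)); // false
    }
}
```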

Using a character set whose font library lacks the character

We know that GB2312 does not support traditional Chinese, so encoding traditional characters with a character set whose font library lacks them produces garbled text.
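The character-set half of this can be checked in code (fonts are a separate, display-side matter that this sketch does not cover). The example below uses hypothetical names and assumes the JDK provides the GB2312 and GBK charsets. It asks each charset whether it can encode the traditional character 龍 and its simplified form 龙.

```java
import java.nio.charset.Charset;

public class MissingCharacter {
    // True if the named charset has a code for this character.
    static boolean canEncode(String charsetName, char ch) {
        return Charset.forName(charsetName).newEncoder().canEncode(ch);
    }

    // Encode then decode with the same charset; unmappable characters are
    // replaced during encoding, so they do not survive the round trip.
    static String roundTrip(String charsetName, String text) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        char traditional = '\u9F8D'; // 龍
        char simplified = '\u9F99';  // 龙

        System.out.println(canEncode("GB2312", simplified));  // true
        System.out.println(canEncode("GB2312", traditional)); // false
        System.out.println(canEncode("GBK", traditional));    // true

        // Same charset on both sides, yet the character is still lost.
        System.out.println(roundTrip("GB2312", "\u9F8D").equals("\u9F8D")); // false
    }
}
```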

How to fix garbled text

Use a character set encoding that supports the characters you need to display, and use the same encoding scheme for encoding and decoding. That solves the problem.

Below are some classic garbled-text scenarios and their solutions.

Garbled text in IntelliJ IDEA

Is Chinese text garbled in an IDE project? Go to File -> Settings -> Editor -> File Encodings and set the encodings there to UTF-8.

Is Chinese text garbled in the IDE console? Try this: open the IDE installation directory, find the VM options file,

and add -Dfile.encoding=UTF-8 at the end of the text.
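One way to see which encoding a JVM actually ended up with is to ask it. This is only a sketch for checking the effect of the option; the class name is made up, and the values printed depend on the machine and the JDK version.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Report the file.encoding system property and the JVM's default charset.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Run once without and once with -Dfile.encoding=UTF-8 to compare.
        System.out.println(describe());
    }
}
```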

Garbled text in the database

View the database encoding:

show variables like 'character_set%';


Set the encoding at session and global scope:

-- session scope
set character_set_server=utf8;
set character_set_database=utf8;
-- global scope
set global character_set_database=utf8;
set global character_set_server=utf8;

Encodings set at session or global scope may revert after MySQL restarts, so you can try another way. In MySQL's my.ini configuration file (Windows environment), modify or add the following:

[mysql]
default-character-set=utf8
[mysqld]
character-set-server=utf8
[client]
default-character-set=utf8

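Garbled text from the coding side

Chinese text comes out garbled in your own code? Trace it to the places where encoding and decoding happen, and make them use the same encoding.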
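In Java, one way to do that is to name the charset explicitly on both sides instead of relying on the platform default. The following is a minimal sketch with made-up names; UTF-8 is assumed as the agreed encoding, and a temporary file stands in for whatever actually carries the bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetIo {
    // Write text to a temp file as UTF-8, then read it back as UTF-8.
    static String writeThenRead(String text) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));           // encode explicitly
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8); // decode the same way
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u6D4B\u8BD5\u4E2D\u6587\u4E71\u7801"; // "测试中文乱码"
        System.out.println(writeThenRead(text).equals(text)); // true
    }
}
```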


Origin www.cnblogs.com/jay-huaxiao/p/12148622.html