Character sets, encoding methods, and garbled text in Java programs

Table of contents

1. Character encoding

2. Three major character sets and encoding methods

2.1, ASCII character set and encoding method

2.2, GBK character set and encoding method

2.3, Unicode character set and encoding method

 3. Program garbled problem


1. Character encoding

A digital computer's memory can only store bits, so any information you want to process on a computer must be stored as bits. To represent text as numbers, we need a system that assigns each letter a unique code. Digits and punctuation marks are also text, so they need codes of their own.

Every letter, digit, and symbol has to be encoded. A system that does this is called a coded character set, and each individual code in it is called a character code.

2. Three major character sets and encoding methods

2.1, ASCII character set and encoding method

ASCII is a widely used character encoding standard. Its full name is American Standard Code for Information Interchange, abbreviated as ASCII and pronounced like ASS-key. It was first published in 1963 and took its widely adopted form in the 1967 revision, and it has been one of the most important standards in the computer industry ever since.

The ASCII character set only maps English letters, digits, punctuation marks, and special (control) characters to numbers; it does not include Chinese characters. The number assigned to a character is called its code point. The ASCII encoding method converts that code point from decimal to binary and stores the result in one byte, so the first (highest) bit of every ASCII-encoded byte is 0.

Note: There is a small detail here. After all the characters were mapped, the largest code point was 127, whose binary form is 1111111, which is only 7 bits. However, the smallest addressable unit of storage in a computer is a byte (8 bits), so codes shorter than eight bits are padded with leading zeros, and the code stored for 127 is 01111111.
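To make the zero-padding concrete, here is a minimal Java sketch. The class name AsciiBits and the helper toBinary8 are made up for this illustration; only getBytes and StandardCharsets.US_ASCII come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class AsciiBits {
    // Hypothetical helper: formats one byte as 8 binary digits,
    // padding with leading zeros the way the note above describes.
    static String toBinary8(byte b) {
        String bits = Integer.toBinaryString(b & 0xFF);
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        // Each ASCII character is encoded as exactly one byte.
        byte[] encoded = "aZ9".getBytes(StandardCharsets.US_ASCII);
        for (byte b : encoded) {
            // character -> code point (decimal) -> stored byte (binary)
            System.out.println((char) b + " -> " + b + " -> " + toBinary8(b));
        }
        // The largest ASCII code point, 127, still starts with a 0 bit.
        System.out.println("127 -> " + toBinary8((byte) 127));
    }
}
```

Every line printed should start its binary column with 0, which is the padded bit.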

2.2, GBK character set and encoding method

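What should be noted here is that in GBK encoding a Chinese character is stored in two bytes, while letters and digits are still stored in one byte each. In other words, the GBK character set is compatible with the ASCII character set. A decoder can tell the two cases apart because the first byte of every two-byte character has its highest bit set to 1, whereas an ASCII byte always starts with 0.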
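The byte counts can be checked with a short Java sketch. It assumes the running JDK ships the GBK charset (standard full JDK builds do, but it is not one of the charsets every Java platform must support). The class name GbkBytes and the helper byteCount are invented for the example, and the Chinese character is written as the escape \u6211 so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;

class GbkBytes {
    // Assumption: the GBK charset is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Hypothetical helper: number of bytes a string occupies in GBK.
    static int byteCount(String s) {
        return s.getBytes(GBK).length;
    }

    public static void main(String[] args) {
        String text = "a\u6211b"; // a, one Chinese character, b
        for (byte b : text.getBytes(GBK)) {
            // ASCII bytes have the high bit clear; both bytes of the
            // Chinese character have it set.
            boolean highBit = (b & 0x80) != 0;
            System.out.printf("%02X highBit=%b%n", b & 0xFF, highBit);
        }
        System.out.println("total bytes: " + byteCount(text));
    }
}
```

The total should be 4 bytes: one for each letter and two for the Chinese character.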

2.3, Unicode character set and encoding method

Unicode, also called Universal Code, is kept in step with the ISO/IEC 10646 standard, whose formal name is "Universal Multiple-Octet Coded Character Set", abbreviated as UCS. UCS-2, a 2-byte encoding, was the original form, and UCS-4 was defined in case 2 bytes turned out not to be enough, which they eventually did.

Compared with ASCII's 7-bit encoding, Unicode was originally designed as a 16-bit encoding in which each character takes 2 bytes. That range, 0000h ~ FFFFh, can represent 65,536 different characters and is now called the Basic Multilingual Plane. The Unicode code space has since been extended to 0h ~ 10FFFFh (1,114,112 code points), so characters outside the first 65,536 no longer fit in 2 bytes. The aim is that all human languages around the world, especially those that often appear in computer communication, can use the same encoding system, and that this system stays highly scalable.
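Java lets us look at Unicode code points directly, which gives a quick way to see where the 16-bit range ends. The sketch below is only an illustration: the class name CodePoints and the helper describe are made up, and U+1F600 is just one convenient example of a code point above FFFFh.

```java
class CodePoints {
    // Hypothetical helper: reports the first code point of a string and how
    // many 16-bit Java chars are needed to hold it.
    static String describe(String s) {
        int codePoint = s.codePointAt(0);
        return String.format("U+%04X chars=%d", codePoint, Character.charCount(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe("a"));      // inside the ASCII range
        System.out.println(describe("\u6211")); // a Chinese character, still within FFFFh
        // A code point above FFFFh does not fit in one 16-bit char.
        String beyond16Bits = new String(Character.toChars(0x1F600));
        System.out.println(describe(beyond16Bits));
    }
}
```

The first two characters each fit in a single 16-bit char, while the last one needs two.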

The encoding methods based on the Unicode character set include UTF-8, UTF-16, and UTF-32. Here is a brief introduction to UTF-32 and UTF-8.

UTF-32: This is the most direct encoding method, but it is rarely used in practice because it stores every Unicode code point in four bytes. This wastes storage space, and the larger data volume also slows down reading and transmission. For example, the letter a (code point 97, binary 01100001) is stored as 00000000 00000000 00000000 01100001.
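Here is a small Java sketch of the four-byte layout. It assumes the JDK provides the UTF-32BE charset (big-endian, no byte order mark), which common JDK builds do even though it is not in StandardCharsets. The class name Utf32Bytes and its helpers are invented for the example.

```java
import java.nio.charset.Charset;

class Utf32Bytes {
    // Assumption: the UTF-32BE charset is available in this JDK.
    static final Charset UTF_32BE = Charset.forName("UTF-32BE");

    // Hypothetical helper: encodes a string as big-endian UTF-32.
    static byte[] encode(String s) {
        return s.getBytes(UTF_32BE);
    }

    // Hypothetical helper: renders bytes as space-separated hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three zero bytes of padding in front of the single useful byte.
        System.out.println(toHex(encode("a")));
        // Every character costs four bytes, whatever its code point.
        System.out.println(encode("a\u6211b").length + " bytes");
    }
}
```

Three of the four bytes for the letter a are zero, which is the waste the paragraph above describes.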


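UTF-8: This is a variable-length encoding method that stores each Unicode code point in one to four bytes. ASCII characters take one byte, identical to their ASCII encoding, and most common Chinese characters take three bytes, so text made up mostly of English letters stays compact.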
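UTF-8 uses a different number of bytes depending on the character, which is easy to verify in Java. This is a minimal sketch; the class name Utf8Lengths and the helper byteCount are made up, and the sample characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Hypothetical helper: number of bytes a string occupies in UTF-8.
    static int byteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("a      : " + byteCount("a"));        // ASCII letter
        System.out.println("\\u6211 : " + byteCount("\u6211")); // Chinese character
        String beyond16Bits = new String(Character.toChars(0x1F600));
        System.out.println("U+1F600: " + byteCount(beyond16Bits));
        System.out.println("a?b    : " + byteCount("a\u6211b")); // mixed string
    }
}
```

Compared with the fixed four bytes of UTF-32, the mixed string a, Chinese character, b takes 1 + 3 + 1 = 5 bytes here.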

 3. Program garbled problem

The problem of garbled characters occurs because the encoding method and decoding method are different.

Java provides methods for the String class to decode and encode strings.

 Here we encode and decode the string "a I b", but at the end, when we decoded bytes1, we found that Chinese garbled characters appeared. This is because the encoding using GBK uses utf- 8 decoding, the platform provides decoding by utf-8 by default, so garbled characters appear. Then I only need to change the code to String s2 = new String(bytes1, "GBK"); and there will be no more garbled characters.
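The original code listing is not reproduced here, so the following is only a reconstruction sketch of the scenario under stated assumptions. The names bytes1 and s2 come from the text above; s1, garbled, the class GarbledDemo and its helpers are made up. UTF-8 is passed explicitly instead of relying on the platform default so that the result is the same on every machine, and the JDK is assumed to provide the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GarbledDemo {
    // Assumption: the GBK charset is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Decodes with UTF-8, standing in for a platform whose default is UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes with GBK, the same charset that was used for encoding.
    static String decodeGbk(byte[] bytes) {
        return new String(bytes, GBK);
    }

    public static void main(String[] args) {
        String s1 = "a\u6211b";           // a, one Chinese character, b
        byte[] bytes1 = s1.getBytes(GBK); // encode with GBK

        String garbled = decodeUtf8(bytes1); // mismatched decoding
        String s2 = decodeGbk(bytes1);       // matching decoding

        System.out.println("UTF-8 decode equals original: " + garbled.equals(s1));
        System.out.println("GBK decode equals original:   " + s2.equals(s1));
    }
}
```

The mismatch only damages the Chinese character; a and b survive because GBK and UTF-8 both store ASCII characters as the same single byte.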

That is the end of this blog's introduction to character sets and encoding methods. If this blog was helpful to you, please give it a like to support it. Thank you! See you in the next blog!

Origin blog.csdn.net/weixin_64972949/article/details/131608361