Principles and Practice of Unicode Character Encoding

Table of Contents

1. The development of character set encoding

1.1 The difference and relationship between character sets and character encodings

1.2 The development of character set encoding

1.3 Summary

2. Character set related commands

2.1 What is a BOM

2.2 Character set related commands

3. Programming implementation of character set conversion

3.1 Compiling and installing iconv with VS2015 on Windows

3.2 Compiling and converting in VS2015


1. The development of character set encoding

1.1 The difference and relationship between character sets and character encodings

  • Character set: a collection of characters. For example, GB2312 is the Chinese national standard simplified Chinese character set. It contains 6763 simplified Chinese characters as well as general symbols, serial numbers, digits, Latin letters, Japanese kana, Greek letters, Cyrillic letters, Hanyu Pinyin symbols, and Bopomofo letters, for a total of 7445 graphic characters.
  • Character encoding: a rule that encodes (maps) the characters of a character set onto objects in some specified set (for example, bit patterns, natural numbers, or electrical pulses), so that text can be stored in computers and transmitted over communication networks.

The relationship between the two: a character set is the collection of letters and symbols of a writing system, while a character encoding is the rule that maps each character to a specific byte or byte sequence.

Usually a specific character set adopts a specific encoding, i.e., one character set corresponds to one character encoding (for example, ASCII, ISO-8859-1, GB2312 and GBK each name both a character set and its corresponding encoding). Unicode is the exception: it follows the modern model, in which one character set can be serialized by several different encodings.

1.2 The development of character set encoding

(1) Single byte

ASCII (American Standard Code for Information Interchange) defines 128 characters, expressed in 7-bit binary (0x00-0x7F); EASCII (Extended ASCII) defines 256 characters, expressed in 8-bit binary (0x00-0xFF). When computers reached Europe, the International Organization for Standardization extended ASCII into the ISO-8859 standard, which is similar to EASCII and compatible with ASCII, differing only in the upper 128 code points. Because of Europe's complex language environment, many sub-standards were created for the languages of each region: ISO-8859-1, ISO-8859-2, ISO-8859-3, ..., ISO-8859-16.

(2) Double byte

When computers reached Asia, 256 code points were not enough, so the code space was extended from one byte to two bytes: a 16-bit number with 65536 code points. Many encodings appeared in different countries and regions, such as GB2312 in mainland China, BIG5 in Hong Kong and Taiwan, and Shift JIS in Japan.

Note that 65536 code points is the ideal case, because a double-byte encoding can be variable length: within the same encoding, some characters are represented by a single byte and others by two bytes. The advantage is compatibility with ASCII on the one hand and saving storage on the other, at the cost of losing some code points.
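To make the variable-length idea concrete, here is a minimal sketch of how a GBK-style decoder could walk a byte stream: a lead byte below 0x80 is a one-byte ASCII character, and anything at or above 0x80 starts a two-byte character. This is a simplification; real encodings also restrict which lead and trail byte values are legal.

#include <cstdio>
#include <string>
#include <vector>

// Split a GBK-style variable-length byte stream into characters.
// Simplified rule: a lead byte < 0x80 is a 1-byte ASCII character,
// a lead byte >= 0x80 starts a 2-byte character. Real GBK further
// restricts which lead/trail byte values are legal.
std::vector<std::string> splitDoubleByte(const std::string& bytes) {
    std::vector<std::string> chars;
    for (size_t i = 0; i < bytes.size(); ) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        size_t len = (lead < 0x80) ? 1 : 2;   // ASCII or double-byte
        if (i + len > bytes.size()) break;    // truncated trail byte
        chars.push_back(bytes.substr(i, len));
        i += len;
    }
    return chars;
}

int main() {
    // "A" followed by one double-byte character (0xBA 0xBA is used purely as an example).
    std::string data = "\x41\xBA\xBA";
    for (const std::string& c : splitDoubleByte(data))
        std::printf("character of %zu byte(s)\n", c.size());
    return 0;
}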

GBK is an extension of GB2312 (GBK can represent both simplified and traditional characters). On the face of it both are double-byte encodings with the same number of code points, so there should be no room left to extend, but in practice the extension lives in the reserved space. In the GBK code space, the GBK/1 and GBK/2 areas correspond to GB2312, GBK/3, GBK/4 and GBK/5 are the GBK extension areas, part of the space is set aside for user-defined characters, and the remaining code points are lost to the variable-length design. The full name of GBK is the "Chinese Internal Code Extension Specification"; it supports all the CJK ideographs of the international standard ISO/IEC 10646-1 and the national standard GB13000-1.

In the GBK character set, every Chinese character and full-width symbol occupies 2 bytes, while letters and half-width symbols occupy 1 byte. There is no separately named encoding scheme; the character set and the encoding are both simply called GBK. It is commonly used in China when the text contains many Chinese characters.

When the Internet swept the world, geographical boundaries were broken. As computers in different countries and regions exchanged data, garbled text appeared: the same binary data is parsed into different characters under different encodings.

The Unicode character set is an international standard character set. It assigns a unique code point to every character of every language in the world, to meet the requirements of cross-language, cross-platform text exchange. It has several encoding forms, namely UTF-8, UTF-16, and UTF-32.

Example: the Unicode code points of the two characters of "汉字" ("Chinese characters") are 0x6C49 and 0x5B57, and the encoded bytes are:

UTF-8 encoding: E6B189 E5AD97

UTF-16BE encoding: 6C49 5B57

UTF-32BE encoding: 00006C49 00005B57
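These values can be reproduced with standard C++11 string literals; a minimal sketch (the \u escapes carry the code points given above, so it does not depend on the encoding of the source file, and it prints code units, so byte order does not matter here):

#include <cstdio>
#include <string>

// Print the code units of "汉字" (U+6C49 U+5B57) in three Unicode encoding forms.
int main() {
    std::string    utf8  = u8"\u6C49\u5B57";   // UTF-8 bytes
    std::u16string utf16 =  u"\u6C49\u5B57";   // UTF-16 code units
    std::u32string utf32 =  U"\u6C49\u5B57";   // UTF-32 code units

    std::printf("UTF-8 :");
    for (unsigned char b : utf8)   std::printf(" %02X", b);            // E6 B1 89 E5 AD 97
    std::printf("\nUTF-16:");
    for (char16_t u : utf16)       std::printf(" %04X", (unsigned)u);  // 6C49 5B57
    std::printf("\nUTF-32:");
    for (char32_t u : utf32)       std::printf(" %08X", (unsigned)u);  // 00006C49 00005B57
    std::printf("\n");
    return 0;
}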

(3) Multi-byte

There are three encoding forms for the Unicode character set:

  • UTF-8: a variable-length scheme that stores a character in 1 to 4 bytes (early definitions allowed up to 6);
  • UTF-32: a fixed-length scheme that always uses 4 bytes, regardless of the size of the code point;
  • UTF-16: in between, using 2 or 4 bytes per character, so its length is partly fixed and partly variable.

UTF is the abbreviation of Unicode Transformation Format, and the number that follows indicates the minimum number of bits used to store a character (i.e., the size of one code unit).

The GB18030 character set encodes characters in one, two, or four bytes, and is compatible with the GBK and GB2312 character sets.

(4) UTF-8

UTF-8 is a variable-length character encoding that encodes a code point in 1 to 4 bytes, depending on the number of significant bits in the code point value.

Note: UTF-8 is not a character set of its own, but an encoding scheme for the Unicode character set.

Unicode code point (hexadecimal)    UTF-8 byte stream (binary)
000000 - 00007F                     0xxxxxxx
000080 - 0007FF                     110xxxxx 10xxxxxx
000800 - 00FFFF                     1110xxxx 10xxxxxx 10xxxxxx
010000 - 10FFFF                     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, the Unicode code point of the Chinese character 严 ("strict") is 4E25, which in binary is 100 1110 0010 0101, 15 significant bits. According to the table above, its UTF-8 encoding occupies 3 bytes, so the first byte begins with three 1 bits followed by a 0, and each of the following two bytes begins with 10, giving the template 1110xxxx 10xxxxxx 10xxxxxx. Filling the code point bits (padded to 16 bits with a leading zero) into the x positions gives 11100100 10111000 10100101, 24 bits occupying 3 bytes.

000800 - 00FFFF (4E25)

0100 111000 100101

11100100 10111000 10100101  ->  E4 B8 A5
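The same filling procedure can be written as a small function. Below is a minimal sketch covering the four rows of the table above; it performs no validation (surrogate code points, values above 10FFFF, etc.):

#include <cstdint>
#include <cstdio>
#include <string>

// Encode one Unicode code point as UTF-8 bytes, following the table above.
std::string encodeUtf8(uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {               // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                 // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    for (unsigned char b : encodeUtf8(0x4E25))   // prints E4 B8 A5
        std::printf("%02X ", b);
    std::printf("\n");
    return 0;
}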

(5) UTF-16

UTF-16 is a bit more peculiar: it uses either 2 or 4 bytes per character.

For characters whose Unicode code points lie in the range 0 to FFFF, UTF-16 stores the code point directly in two bytes, with no conversion, which is very similar to UTF-32.

For characters whose code points lie in the range 10000 to 10FFFF, UTF-16 uses four bytes. Specifically, 10000 is first subtracted from the code point, and the remaining 20 bits are split into two halves: the high 10 bits are stored in a two-byte value between D800 and DBFF (the high surrogate), and the low 10 bits in a two-byte value between DC00 and DFFF (the low surrogate).

Unicode code point range (hex)    Code point bits (binary)       UTF-16 encoding                         Bytes
0000 0000 - 0000 FFFF             xxxxxxxx xxxxxxxx              xxxxxxxx xxxxxxxx                       2
0001 0000 - 0010 FFFF             yyyy yyyy yyxx xxxx xxxx       110110yy yyyyyyyy 110111xx xxxxxxxx     4

(In the second row, the bits shown are those of the code point minus 10000, i.e., 20 bits in total.)
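Expressed in code, the surrogate-pair arithmetic looks like the following minimal sketch, which produces UTF-16 code units (serializing them as big- or little-endian bytes is a separate step, discussed next):

#include <cstdint>
#include <cstdio>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// BMP code points are stored directly; larger ones become a surrogate pair.
std::vector<uint16_t> encodeUtf16(uint32_t cp) {
    std::vector<uint16_t> units;
    if (cp <= 0xFFFF) {                       // store the number directly
        units.push_back(static_cast<uint16_t>(cp));
    } else {                                  // 10000..10FFFF: surrogate pair
        uint32_t v = cp - 0x10000;            // 20 significant bits remain
        units.push_back(static_cast<uint16_t>(0xD800 | (v >> 10)));    // high surrogate
        units.push_back(static_cast<uint16_t>(0xDC00 | (v & 0x3FF)));  // low surrogate
    }
    return units;
}

int main() {
    // U+6C49 stays the single unit 6C49; U+1F600 (an emoji) becomes the pair D83D DE00.
    const uint32_t cps[] = { 0x6C49, 0x1F600 };
    for (uint32_t cp : cps) {
        std::printf("U+%X ->", (unsigned)cp);
        for (uint16_t u : encodeUtf16(cp)) std::printf(" %04X", (unsigned)u);
        std::printf("\n");
    }
    return 0;
}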

UTF-16BE has the suffix BE, meaning big-endian: the high-order byte is placed at the lower address. UTF-16LE has the suffix LE, meaning little-endian: the high-order byte is placed at the higher address. Plain UTF-16 has no suffix, so it is not known whether the data is big- or little-endian; the first two bytes therefore indicate the byte order: FE FF means big-endian and FF FE means little-endian.

(6) UTF-32

UTF-32 is a fixed-length encoding that always occupies 4 bytes per character, which is enough to hold any Unicode code point, so the code point is stored directly without any conversion. It wastes space but is simple and fast to process.

1.3 Summary

How does a program tell whether a file is UTF-8 or UTF-16 when opening it? One way is a marker: the first few bytes of the file act as a signature.

  • EF BB BF means UTF-8
  • FE FF means UTF-16BE
  • FF FE means UTF-16LE
  • 00 00 FE FF means UTF-32BE
  • FF FE 00 00 means UTF-32LE

Only UTF-8 is compatible with ASCII; UTF-16 and UTF-32 are not, because they have no single-byte code units.
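A minimal sketch of this kind of BOM sniffing, checking only the signatures listed above (a file without a BOM simply cannot be identified this way):

#include <cstdio>
#include <cstring>
#include <string>

// Identify an encoding from the leading BOM bytes, if any.
// Note: UTF-32LE must be tested before UTF-16LE, because
// FF FE 00 00 also begins with FF FE.
std::string sniffBom(const unsigned char* buf, size_t len) {
    if (len >= 4 && std::memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (len >= 4 && std::memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (len >= 3 && std::memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 2 && std::memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    if (len >= 2 && std::memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    return "unknown (no BOM)";
}

int main(int argc, char* argv[]) {
    if (argc < 2) { std::printf("usage: %s <file>\n", argv[0]); return 1; }
    std::FILE* fp = std::fopen(argv[1], "rb");
    if (!fp) { std::perror("fopen"); return 1; }
    unsigned char buf[4] = { 0 };
    size_t n = std::fread(buf, 1, sizeof(buf), fp);
    std::fclose(fp);
    std::printf("%s: %s\n", argv[1], sniffBom(buf, n).c_str());
    return 0;
}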

View the complete Unicode character set and various encoding methods: https://unicode-table.com/cn/

Unicode and UTF encoding conversion: https://www.qqxiuzi.cn/bianma/Unicode-UTF.php

2. Character set related commands

2.1 What is a BOM

BOM (Byte Order Mark) is a marker of byte order, i.e., whether big-endian (BE) or little-endian (LE) is used.

When a UTF-encoded file carries a BOM, it is stored as a fixed header at the very beginning of the file:

UTF encoding    Byte Order Mark
UTF-8           EF BB BF
UTF-16LE        FF FE
UTF-16BE        FE FF
UTF-32LE        FF FE 00 00
UTF-32BE        00 00 FE FF

UTF-8 has no BOM by default. Its code unit is a single byte, so for one-byte sequences there is nothing for the two ends to argue about. For the two-, three- and four-byte sequences one could, in principle, still define a byte order; taking three bytes as an example, a little-endian layout would be "low, middle, high" and a big-endian layout "high, middle, low". In reality, however, UTF-8 specifies a single order, big-endian (most significant part first), so a BOM is never needed to determine endianness.

2.2 Character set related commands

How can you check the encoding of a file?

Example:

file -i chatset.cpp

How can the encoding of a file be converted?

The iconv command converts a file from one encoding to another; for example, it can convert UTF-8 to GB18030 and vice versa. The iconv development library on Linux provides C functions such as iconv_open, iconv, and iconv_close, which make it easy to convert character encodings in C/C++ programs.

Syntax: iconv -f encoding [-t encoding] [inputfile]...

Options:

-f encoding: convert characters from the encoding given by encoding.

-t encoding: convert characters to the encoding given by encoding.

-l: list all known character set encodings.

-o file: specify the output file.

-c: omit invalid characters from the output.

-s: suppress warning messages, but not error messages.

--verbose: print progress information.

The encodings accepted by -f and -t are those listed by the -l option.

(1) Formats supported by iconv

Here are some of them: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF8, UTF16, UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE, GB2312, GBK, ISO-8859-1.

(2) Converting between formats supported by iconv

iconv -f UTF-8  -t UTF-8 utf8.txt -o UTF-8.txt 
iconv -f UTF-8  -t UTF8  utf8.txt -o UTF8.txt

iconv -f UTF-8  -t UTF-16   	UTF-8.txt -o UTF-16.txt
iconv -f UTF-8  -t UTF-16BE 	UTF-8.txt -o UTF-16BE.txt
iconv -f UTF-8  -t UTF-16LE 	UTF-8.txt -o UTF-16LE.txt

iconv -f UTF-8  -t UTF16   	UTF-8.txt -o UTF16.txt
iconv -f UTF-8  -t UTF16BE 	UTF-8.txt -o UTF16BE.txt
iconv -f UTF-8  -t UTF16LE 	UTF-8.txt -o UTF16LE.txt

iconv -f UTF-8  -t UTF-32   UTF-8.txt -o UTF-32.txt
iconv -f UTF-8  -t UTF-32BE UTF-8.txt -o UTF-32BE.txt
iconv -f UTF-8  -t UTF-32LE UTF-8.txt -o UTF-32LE.txt

iconv -f UTF-8  -t GB2312   UTF-8.txt -o GB2312.txt
iconv -f UTF-8  -t GBK 		UTF-8.txt -o GBK.txt
iconv -f UTF-8  -t ISO-8859-1 UTF-8.txt -o ISO-8859-1.txt

3. Programming implementation of character set conversion

With the advent of the Internet era, text exchange over the network keeps growing, for example when browsing foreign websites, and character encoding conversion becomes particularly important. The problem is that many characters simply do not exist in a given encoding. To resolve this confusion, Unicode was created: it is a super character set that contains the characters of all these encodings, which is why the default encoding of some newer text formats, such as XML, is Unicode.

But many older computers still use local, traditional character encodings, and some programs, such as mail clients and browsers, must be able to convert between these different user encodings. Other programs have built-in Unicode support for smooth internationalization, but they still need to convert between Unicode and the traditional encodings. GNU libiconv is an encoding conversion library designed for both kinds of applications.

3.1 Compiling and installing iconv with VS2015 on Windows

You can directly use the libiconv library I have already compiled; the link is:

Link: https://pan.baidu.com/s/1FAqkN9ggxSpLlRhvPtMIig  Extraction code: 6jpt

Reference link: compile iconv under Windows

3.2 Compiling and converting in VS2015

  • Get conversion handle

Function: iconv_t iconv_open(const char* tocode, const char* fromcode);

where tocode is the target encoding and fromcode is the source encoding.

In addition, a modifier can be appended to the target encoding: //TRANSLIT replaces characters that cannot be converted with similar-looking ones, and //IGNORE skips characters that cannot be converted. For example: char *encTo = "UNICODE//TRANSLIT";, where UNICODE is the encoding name and TRANSLIT is the conversion setting.

Example: iconv_t cd = iconv_open("UTF-8", "UTF-16");

  • Make the conversion

Function: size_t iconv(iconv_t cd, const char** inbuf, size_t* inbytesleft, char** outbuf, size_t* outbytesleft); (note: on some platforms, e.g. glibc, inbuf is declared as char** rather than const char**)

where:

cd: the handle returned by iconv_open()

inbuf: the string to be converted

inbytesleft: on entry, the number of input bytes; on return, the number of bytes not yet converted

outbuf: the buffer that receives the converted string

outbytesleft: on entry, the space available in outbuf; on return, the space remaining after conversion

Return value: (size_t)-1 indicates an error, with errno set to one of: E2BIG: there is not enough space in outbuf; EILSEQ: an invalid multibyte sequence was encountered; EINVAL: an incomplete multibyte sequence was encountered.

Example: size_t ret = iconv(cd, &srcstart, &srclen, &tempoutbuf, &outlen);

  • Close handle

Function: int iconv_close(iconv_t cd);

Example: iconv_close(cd);

Here is a basic example; the code is as follows:

#include <iostream>
#include <string>
#include <cstring>
#include <cstdio>
#include <iconv.h>

using namespace std;

int main() {
    /* Destination encoding. Optional modifiers:
     *   //TRANSLIT : replace characters that cannot be converted with similar ones
     *   //IGNORE   : skip characters that cannot be converted */
    //const char *encTo = "UNICODE//TRANSLIT";
    //const char *encTo = "UNICODE//IGNORE";
    const char *encTo = "UTF-16";
    /* Source encoding */
    const char *encFrom = "UTF-8";

    /* Get a conversion handle
     * @param encTo   target encoding
     * @param encFrom source encoding */
    iconv_t cd = iconv_open(encTo, encFrom);
    if (cd == (iconv_t)-1)
    {
        perror("iconv_open");
        return 1;
    }

    /* String to be converted (assumes the narrow string literal is stored as UTF-8 bytes) */
    const char inbuf[1024] = "学习unicode编码";
    size_t srclen = strlen(inbuf);
    cout << "srclen= " << srclen << endl;

    /* Buffer for the converted string */
    size_t outlen = 1024;
    char outbuf[1024];
    size_t utf16_len = outlen;
    memset(outbuf, 0, outlen);

    /* iconv() modifies the pointers it is given, so keep copies of the originals */
    const char *srcstart = inbuf;
    char *tempoutbuf = outbuf;

    printf("utf8:\n");
    for (size_t i = 0; i < srclen; i++)
    {
        printf("%02x ", (unsigned char)inbuf[i]);
    }
    printf("\n");

    /* Perform the conversion
     * @param cd         handle returned by iconv_open()
     * @param srcstart   string to be converted
     * @param srclen     in: input bytes; out: bytes not yet converted
     * @param tempoutbuf buffer receiving the converted string
     * @param outlen     in: space in the buffer; out: space left after conversion */
    printf("1 srcstart=%p, tempoutbuf=%p, srclen=%zu, outlen=%zu\n",
        (void*)srcstart, (void*)tempoutbuf, srclen, outlen);
    size_t ret = iconv(cd, &srcstart, &srclen, &tempoutbuf, &outlen);
    printf("2 srcstart=%p, tempoutbuf=%p, srclen=%zu, outlen=%zu\n",
        (void*)srcstart, (void*)tempoutbuf, srclen, outlen);

    if (ret == (size_t)-1)
    {
        perror("iconv");
    }
    utf16_len = utf16_len - outlen;   /* number of bytes actually written */
    printf("inbuf=%s, srclen=%zu, outlen=%zu, ret=%zu\n",
        inbuf, srclen, utf16_len, ret);

    printf("utf16:\n");
    for (size_t i = 0; i < utf16_len; i++)
    {
        printf("%02x ", (unsigned char)outbuf[i]);
    }
    printf("\n");
    /* Close the handle */
    iconv_close(cd);

    return 0;
}

The program prints the UTF-8 bytes of the input string followed by the UTF-16 bytes of the converted output.
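The example above converts everything in a single call into a large buffer. When the output buffer might be too small, a common pattern is to call iconv in a loop and keep going whenever it fails with E2BIG. Here is a minimal sketch under the same const char** prototype shown earlier (on glibc the input parameter is char**, so the pointer would need to be non-const or cast):

#include <cerrno>
#include <stdexcept>
#include <string>
#include <iconv.h>

// Convert 'input' from 'fromEnc' to 'toEnc', retrying whenever the
// fixed-size chunk buffer fills up (iconv reports this as E2BIG).
std::string convert(const std::string& input,
                    const char* fromEnc, const char* toEnc) {
    iconv_t cd = iconv_open(toEnc, fromEnc);
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string output;
    const char* srcPtr = input.data();
    size_t srcLeft = input.size();
    char chunk[64];                               // deliberately small output chunk

    while (srcLeft > 0) {
        char* dstPtr = chunk;
        size_t dstLeft = sizeof(chunk);
        size_t rc = iconv(cd, &srcPtr, &srcLeft, &dstPtr, &dstLeft);
        output.append(chunk, sizeof(chunk) - dstLeft);   // keep whatever was produced
        if (rc == (size_t)-1 && errno != E2BIG) {        // E2BIG just means "loop again"
            iconv_close(cd);
            throw std::runtime_error("iconv failed");
        }
    }
    iconv_close(cd);
    return output;
}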

Reference link:

Development documentation: http://www.gnu.org/software/libiconv/documentation/libiconv-1.13/

 

Origin blog.csdn.net/wxplol/article/details/104921569