Unicode and multi-byte character sets

Today I wrote a program that prints a file path. Built with the Unicode character set, it printed only the first character of the path. I asked my colleagues, switched the project to the multi-byte character set, and the problem was solved.

So I looked up the difference between the two online. I did not follow all of it, but I at least understand how character sets evolved.

Reprinted from:

1. https://blog.csdn.net/SarahZhang0104/article/details/51346999

2. https://blog.csdn.net/stephen1315/article/details/7476236

 

 

The program compiled and ran without errors, but produced the following output:

[Screenshot: program output containing strange characters]

As you can see, strange characters appear in the list. I opened Project Properties > Configuration Properties > General > Character Set.

[Screenshot: the Character Set project property]

The project was using the Unicode character set, so I switched the option to the multi-byte character set. The program then produced the following:

[Screenshot: program output after the change]

The program now works correctly. The problem was that the default character set was Unicode.

## Unicode and multi-byte character sets:

A computer does not usually store characters as images. Each character is represented by a numeric code, and which code represents which character depends on the character set (charset) in use.

In the beginning there was only one character set: ANSI's ASCII. It uses 7 bits per character and defines 128 characters, including letters, digits, punctuation marks, and other common characters. It was later extended to 8 bits per character, giving 256 characters; the extension mainly added special symbols, such as box-drawing characters, on top of the original 7-bit set.

As more languages came into use, ASCII could no longer meet the needs of information exchange. To represent their own characters, individual countries defined their own character sets on top of ASCII. These ANSI-derived character sets are customarily called ANSI character sets; their proper name is MBCS (Multi-Byte Character Set). They are all compatible with 7-bit ASCII: byte values below 128 keep their ASCII meaning, while a byte with a value of 128 or above is a lead byte, and the one (or even two) bytes that follow it combine with the lead byte to form the actual code. There are many such character sets; the familiar GB-2312 is one of them.

For example, in GB-2312 the word "连通" ("connected") is encoded as C1 AC CD A8, where C1 and CD are lead bytes. The first 128 codes (0 to 127) are reserved for standard ASCII; for example, "0" is encoded as 30H (hexadecimal 30). When software reads the text and sees 30H, it knows the value is below 128 and is therefore the ASCII character "0". When it sees C1, it knows the value is above 128 and that another byte follows, so C1 AC together form one code, which in GB-2312 is "连".
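
The lead-byte rule described above can be sketched in portable C. This is only an illustration under a simplifying assumption, stated in the comment: every byte of 0x80 or above is treated as a lead byte followed by exactly one trail byte, which fits the GB-2312 example bytes C1 AC CD A8 but not every MBCS. The function name `mbcs_char_count` is made up for this sketch.

```c
#include <stddef.h>

/* Count characters in a NUL-terminated GB-2312-style byte string.
   Assumption: every byte >= 0x80 is a lead byte that is followed by
   exactly one trail byte; bytes below 0x80 are plain ASCII. */
size_t mbcs_char_count(const unsigned char *s)
{
    size_t count = 0;
    while (*s != 0) {
        if (*s >= 0x80 && s[1] != 0) {
            s += 2;            /* lead byte + trail byte = one character */
        } else {
            s += 1;            /* ASCII byte, or a dangling lead byte */
        }
        count++;
    }
    return count;
}
```

Given the five bytes C1 AC CD A8 30, the function counts three characters, whereas a byte-oriented length function such as strlen would report five.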

Because every language defined its own character set, there ended up being far too many of them, and having to switch character sets constantly during international exchange was very inconvenient. The Unicode character set was proposed in response. It originally used a fixed 16 bits (two bytes, one word) per character, for a total of 65,536 characters, which covers almost all commonly used characters of the world's languages and makes information exchange easier. This standard 16-bit encoding of Unicode is called UTF-16. Later, so that two-byte Unicode text could pass correctly through existing single-byte systems, UTF-8 appeared, which encodes Unicode in a manner similar to MBCS. Note that UTF-8 is an encoding; the character set it encodes is Unicode. Unicode has several encoding forms, whereas ASCII has only one, and most MBCS character sets (including GB-2312) also have only one.

Finally, here are the data types that Windows defines for Unicode:

  1. WCHAR: a Unicode character
  2. PWSTR: a pointer to a Unicode string
  3. PCWSTR: a pointer to a constant Unicode string
  4. The corresponding ANSI data types are CHAR, LPSTR, and LPCSTR.
  5. The generic ANSI/Unicode data types are TCHAR, PTSTR, and LPCTSTR.

For details, see the original site: http://blog.csdn.net/stephen1315/article/details/7476236

 

 

 

 

 

 

 

## The relationship between Unicode and multi-byte character sets
Unicode's initial goal was to map more than 65,000 characters to 16-bit codes. That turned out not to be enough: it could not cover every script in history, and it did not solve the transmission problem (an implementation headache), especially for network-based applications, where existing software would have needed a lot of work to handle 16-bit data. Unicode therefore defined three encodings that keep the same basic character assignments: UTF-8, UTF-16, and UTF-32. As the name suggests, UTF-8 is a sequence of 8-bit units in which one or several bytes represent a character. Its greatest benefit is that UTF-8 keeps ASCII as a subset; for example, "A" is 0x41 in both ASCII and UTF-8. UTF-16 and UTF-32 are the 16-bit and 32-bit encodings of Unicode. Given Unicode's original design, "Unicode" is commonly used to mean UTF-16.

For example, the word "连通" has the Unicode code points U+8FDE U+901A. Encoded as UTF-16 (little endian) it is: DE 8F 1A 90
and encoded as UTF-8 it is: E8 BF 9E E9 80 9A

Finally, when software opens a text file, the first thing it has to do is determine which character set and encoding the text was saved with. There are three ways for software to do this.

The most standard way is to examine the first few bytes of the text:

| Leading bytes | Charset/encoding |
|---|---|
| EF BB BF | UTF-8 |
| FE FF | UTF-16/UCS-2, big endian |
| FF FE | UTF-16/UCS-2, little endian |
| FF FE 00 00 | UTF-32/UCS-4, little endian |
| 00 00 FE FF | UTF-32/UCS-4, big endian |

For example, with the mark inserted, "连通" in UTF-16 (little endian) and in UTF-8 becomes:

FF FE DE 8F 1A 90
EF BB BF E8 BF 9E E9 80 9A

MBCS text, however, has no such mark at the beginning, and unfortunately some early or poorly designed software does not insert the mark when saving Unicode text either. Software therefore cannot rely on this method alone. In that case it can take a relatively safe second approach: pop up a dialog box and ask the user. For example, if you drag the "连通" file into MS Word, Word pops up such a dialog box.
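
Checking the leading bytes can be sketched as a small C function. This is a minimal illustration, not how any particular editor does it; the name `detect_bom` and the returned labels are chosen for this sketch. It treats FE FF as the big-endian mark and FF FE as the little-endian mark, and tests the 4-byte UTF-32 marks first because FF FE 00 00 also starts with FF FE.

```c
#include <stddef.h>

/* Return a label for the byte-order mark at the start of buf,
   or NULL if there is none. */
const char *detect_bom(const unsigned char *buf, size_t len)
{
    /* Test the 4-byte UTF-32 marks before the 2-byte UTF-16 marks,
       because FF FE 00 00 also begins with the UTF-16 LE mark FF FE. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00)
        return "UTF-32LE";
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF)
        return "UTF-32BE";
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    return NULL;   /* no mark: MBCS text, or Unicode text saved without one */
}
```

A NULL result is exactly the situation the paragraph above describes: the software has to fall back to asking the user or guessing.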

If the software does not want to bother the user, or asking the user is not convenient, the only remaining option is to guess: the software infers the likely character set from characteristics of the whole text, which may well be wrong. Opening the "连通" file in Notepad is a case in point.

We can demonstrate this. Type "连通" in Notepad and choose "Save As"; the last drop-down box shows "ANSI". Save the file. When you reopen it, "连通" appears as garbled text. Now click "File" -> "Save As" again: the last drop-down box shows "UTF-8", which means Notepad believes the text it currently has open is UTF-8. But we just saved it with the ANSI character set. In other words, Notepad guessed the character set of the "连通" file and decided it looked more like UTF-8. This happens because the GB-2312 bytes of "连通" happen to look like UTF-8; it is a coincidence, and not every word behaves this way. You can use Notepad's Open dialog instead and select ANSI in the last drop-down box when opening the file; it then displays correctly. Conversely, if you save the file as UTF-8 in the first place, it opens directly without any problem.
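
To see why the GB-2312 bytes C1 AC CD A8 can be mistaken for UTF-8, here is a rough structural check in C. It is an assumed simplification of what a guessing program might do, not Notepad's actual algorithm: it only verifies that each lead byte is followed by the right number of 10xxxxxx continuation bytes and does not reject overlong forms. The name `looks_like_utf8` is invented for this sketch.

```c
#include <stddef.h>

/* Return 1 if every byte of buf fits the UTF-8 lead/continuation
   pattern, 0 otherwise. Structural check only (an assumption of this
   sketch); overlong encodings are not rejected. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        size_t extra;                              /* continuation bytes expected */
        unsigned char b = buf[i];
        if (b < 0x80)                extra = 0;    /* 0xxxxxxx */
        else if ((b & 0xE0) == 0xC0) extra = 1;    /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0) extra = 2;    /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0) extra = 3;    /* 11110xxx */
        else return 0;               /* stray continuation or invalid lead byte */
        if (extra > len - i - 1) return 0;         /* sequence is cut short */
        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;   /* not 10xxxxxx */
        }
        i += extra + 1;
    }
    return 1;
}
```

C1 AC and CD A8 both have the shape 110xxxxx 10xxxxxx, so the check passes for them. The GB-2312 bytes D6 D0 fail it, because D0 is not a continuation byte, which is why most words are not affected.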

If you drag the "连通" file into MS Word, Word also thinks it is UTF-8 but is not sure, so it pops up a dialog box asking the user. Select "Simplified Chinese (GB2312)" and the file opens correctly. Notepad handles this case more simply, which is consistent with its positioning as a lightweight program.

A reminder: some fonts on Windows 2000 cannot display all Unicode characters. If some characters are missing from a file, simply switch to another font.

## Big endian and little endian

Big endian and little endian are the two ways in which CPUs handle multi-byte numbers. For example, the Unicode code of the character "汉" is 6C49. When it is written to a file, should 6C be written first, or 49? Writing 6C first is big endian; writing 49 first is little endian.
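
The two byte orders can be made concrete with a short C sketch that stores one 16-bit code unit each way. The helper names `put_u16_be` and `put_u16_le` are made up for illustration; shifting and masking is used so the result does not depend on the byte order of the machine running the code.

```c
#include <stdint.h>

/* Store a 16-bit code unit with the most significant byte first. */
void put_u16_be(uint16_t unit, unsigned char out[2])
{
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}

/* Store a 16-bit code unit with the least significant byte first. */
void put_u16_le(uint16_t unit, unsigned char out[2])
{
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}
```

For the code 6C49, `put_u16_be` produces the bytes 6C 49 and `put_u16_le` produces 49 6C.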

The word "endian" comes from "Gulliver's Travels". The civil war in Lilliput broke out over whether to crack an egg at the big end (Big-Endian) or the little end (Little-Endian). Six rebellions resulted; one emperor lost his life and another lost his throne.

"Endian" is generally translated as "byte order", and big endian and little endian are also called "big tail" and "little tail".

"Unicode big endian" is the byte order of Unicode text files created on big-endian processors (such as those in Apple Macintosh computers), which is the reverse of the order used in files created on Intel processors. The most significant byte has the lowest address, so the big end of each unit is stored first. Choose the Unicode big endian format so that users of such computers can read your document.

## ANSI, Unicode, wide characters, narrow characters, and multi-byte character sets

Unicode: the wide character set
1. How do you get the number of characters in a string that contains both single-byte and double-byte characters?
Call _mbslen, one of the functions in the Microsoft Visual C++ runtime library for manipulating multi-byte strings (strings that contain both single-byte and double-byte characters). Calling strlen does not tell you how many characters the string has; it only tells you how many bytes come before the terminating 0.

2. How do you operate on DBCS (double-byte character set) strings?

| Function | Description |
|---|---|
| PTSTR CharNext(LPCTSTR); | Returns the address of the next character in the string |
| PTSTR CharPrev(LPCTSTR, LPCTSTR); | Returns the address of the previous character in the string |
| BOOL IsDBCSLeadByte(BYTE); | Returns a nonzero value if the byte is the first byte of a DBCS character |

3. Why use Unicode?
(1) Data can be exchanged easily between different languages.
(2) You can distribute a single binary .exe or DLL file that supports all languages.
(3) It improves the efficiency of your application.

Windows 2000 was developed from scratch using Unicode. If you call any Windows function and pass it an ANSI string, the system first converts the string to Unicode and then passes the Unicode string to the operating system. If you expect the function to return an ANSI string, the system first converts the Unicode string to ANSI and then returns the result to your application. These conversions cost time and memory. By developing your application in Unicode from the start, you make it run more efficiently.
Windows CE itself uses Unicode and does not support the ANSI Windows functions at all.
Windows 98 supports only ANSI, so applications for it can only be developed for ANSI.
When Microsoft ported COM from 16-bit Windows to Win32, the company decided that all COM interface methods that take strings would accept only Unicode strings.

4. How do you write Unicode source code?
Microsoft designed the Windows API for Unicode in a way that minimizes the impact on your code. In fact, a single source file can be written so that it compiles either with or without Unicode. You only need to define two macros (UNICODE and _UNICODE) and recompile the source file.

The _UNICODE macro is for the C runtime header files, and the UNICODE macro is for the Windows header files. When compiling a source module, you generally define both macros together.

5. What Unicode data types does Windows define?

| Data type | Description |
|---|---|
| WCHAR | A Unicode character |
| PWSTR | A pointer to a Unicode string |
| PCWSTR | A pointer to a constant Unicode string |

The corresponding ANSI data types are CHAR, LPSTR, and LPCSTR.
The generic ANSI/Unicode data types are TCHAR, PTSTR, and LPCTSTR.

6. How do you operate on Unicode strings?

| Character set | Function naming | Example |
|---|---|---|
| ANSI | Begins with str | strcpy |
| Unicode | Begins with wcs | wcscpy |
| MBCS | Begins with _mbs | _mbscpy |
| ANSI/Unicode | Begins with _tcs (C run-time library) | _tcscpy |
| ANSI/Unicode | Begins with lstr (Windows function) | lstrcpy |

All new and non-obsolete functions in Windows 2000 have both an ANSI and a Unicode version. The name of the ANSI version ends in A; the name of the Unicode version ends in W. Windows defines them as follows:

    #ifdef UNICODE
    #define CreateWindowEx CreateWindowExW
    #else
    #define CreateWindowEx CreateWindowExA
    #endif // !UNICODE

7. How do you write Unicode string constants?

| Character set | Example |
|---|---|
| ANSI | "String" |
| Unicode | L"String" |
| ANSI/Unicode | TEXT("String") or _TEXT("String") |

The same macros work for character constants, for example: if (szError[0] == _TEXT('J')) { }
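
The generic-text idea behind TCHAR and TEXT can be imitated in portable C. This is only a sketch of the mechanism, not the contents of the Windows or C runtime headers: `MY_UNICODE`, `my_tchar`, `MY_TEXT`, `my_tcslen`, and `greeting_length` are all hypothetical names invented here to stand in for UNICODE, TCHAR, TEXT, and _tcslen.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for UNICODE, TCHAR, TEXT and _tcslen.
   Defining MY_UNICODE before this point selects the wide versions. */
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_TEXT(s)  L##s        /* paste an L prefix onto the literal */
#define my_tcslen   wcslen
#else
typedef char my_tchar;
#define MY_TEXT(s)  s
#define my_tcslen   strlen
#endif

/* Number of characters (not bytes) in a generic-text literal. */
size_t greeting_length(void)
{
    const my_tchar *greeting = MY_TEXT("String");
    return my_tcslen(greeting);
}
```

The same source compiles in either mode and `greeting_length` returns 6 in both, because the character type, the literal prefix, and the length function all switch together.
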
8. Why should you prefer the operating system's string functions?
Doing so slightly improves your application's performance. The operating system's string functions are used heavily by large applications such as the shell process Explorer.exe, so by the time your application runs they are probably already loaded into RAM.

Examples: StrCat, StrChr, StrCmp, StrCpy, and so on.
9. How do you write an application that works with both ANSI and Unicode?

(1) Think of text strings as arrays of characters, not as arrays of chars or arrays of bytes.
(2) Use generic data types (such as TCHAR and PTSTR) for text characters and strings.
(3) Use explicit data types (such as BYTE and PBYTE) for bytes, byte pointers, and data buffers.
(4) Use the TEXT macro for literal characters and strings.
(5) Perform global replacements (for example, replace PSTR with PTSTR).
(6) Fix string arithmetic problems. For example, functions usually expect a buffer size in characters, not bytes. This means you should pass (sizeof(szBuffer) / sizeof(TCHAR)) rather than sizeof(szBuffer). Likewise, if you need to allocate memory for a string and you know the number of characters in it, remember that memory is allocated in bytes: call malloc(nCharacters * sizeof(TCHAR)), not malloc(nCharacters).
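
Point (6) can be illustrated with standard C wide characters. This sketch uses wchar_t in place of TCHAR so that it compiles outside Windows; the macro `CHAR_COUNT` and the function `wide_dup` are names made up for the example.

```c
#include <stdlib.h>
#include <wchar.h>

/* Number of elements in a fixed-size array, independent of element width. */
#define CHAR_COUNT(buf) (sizeof(buf) / sizeof((buf)[0]))

/* Duplicate a wide string; the caller frees the result. */
wchar_t *wide_dup(const wchar_t *src)
{
    size_t n_chars = wcslen(src) + 1;                    /* include the terminator */
    wchar_t *copy = malloc(n_chars * sizeof(wchar_t));   /* bytes = characters * width */
    if (copy != NULL) {
        wmemcpy(copy, src, n_chars);                     /* count is in characters */
    }
    return copy;
}
```

For a buffer declared as `wchar_t buffer[32]`, `CHAR_COUNT(buffer)` is 32 while `sizeof(buffer)` is 32 times the width of wchar_t, which is the difference the tip warns about.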

10. How do you compare strings with options?
Call CompareString.

| Flag | Meaning |
|---|---|
| NORM_IGNORECASE | Ignore letter case |
| NORM_IGNOREKANATYPE | Do not distinguish between hiragana and katakana characters |
| NORM_IGNORENONSPACE | Ignore nonspacing characters |
| NORM_IGNORESYMBOLS | Ignore symbols |
| NORM_IGNOREWIDTH | Do not distinguish between the single-byte and double-byte forms of the same character |
| SORT_STRINGSORT | Treat punctuation the same as symbols |

11. How do you tell whether a text file is ANSI or Unicode?
If the first two bytes of the text file are 0xFF and 0xFE, it is Unicode; otherwise it is ANSI.

12. How do you tell whether a string is ANSI or Unicode?
Use IsTextUnicode. IsTextUnicode uses a series of statistical and deterministic methods to guess the contents of a buffer. Because this is not an exact science, IsTextUnicode may return incorrect results.

13. How do you convert strings between Unicode and ANSI?
The Windows function MultiByteToWideChar converts a multi-byte string to a wide string; WideCharToMultiByte converts a wide string to the equivalent multi-byte string.

## UCS, Unicode, and UTF-8

This article briefly describes UCS, Unicode, and UTF-8, and uses C to convert between UTF-8 and UCS-2.
1. What are UCS and ISO 10646?
The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards, and it guarantees round-trip compatibility with them: converting text to UCS and back loses no information. UCS characters U+0000 to U+007F are identical to US-ASCII.

2. What is Unicode?
Historically there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode project, organized by a consortium of (initially mostly US) vendors of multilingual software. Fortunately, around 1991 the participants of both projects realized that the world does not need two different unified character sets. They merged their work and cooperated on a single code table. Both projects still exist and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of Unicode and ISO 10646 compatible and to coordinate any future extensions closely.

3. What is UTF-8?
UCS and Unicode only assign an integer to each character; they do not specify how it is stored or transmitted, so several encodings exist. The two that store each character in two bytes and in four bytes are called UCS-2 and UCS-4. To convert an ASCII file to UCS-2, insert a 0x00 byte before every byte; to convert it to UCS-4, insert three 0x00 bytes before every byte.
A great deal of information on the Internet exists as ASCII, and storing it with two bytes per character wastes a lot of space. Using UCS-2 or UCS-4 can also cause serious problems on Unix and Linux. Hence UTF-8 (defined in ISO 10646-1).
UTF-8 stands for Unicode Transformation Format-8. It is an octet (8-bit) lossless encoding of Unicode characters.
Unicode (UCS) code points correspond to UTF-8 byte sequences as follows:

| Code point range | UTF-8 byte sequence |
|---|---|
| U-00000000 - U-0000007F | 0xxxxxxx (reuses the ASCII codes; most common) |
| U-00000080 - U-000007FF | 110xxxxx 10xxxxxx (second most common) |
| U-00000800 - U-0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| U-00010000 - U-001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
| U-00200000 - U-03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
| U-04000000 - U-7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (rarely used) |

In a multi-byte sequence, the number of leading '1' bits in the first byte equals the total number of bytes in the sequence.
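
The leading-ones rule can be written directly as a small C function. This is an illustrative sketch following the six-row scheme shown above (sequences of up to 6 bytes); the name `utf8_sequence_length` is made up here.

```c
/* Length in bytes of the UTF-8 sequence that starts with `first`,
   or 0 if `first` is a continuation byte (10xxxxxx) or not a valid
   lead byte in the 1-to-6-byte scheme. */
int utf8_sequence_length(unsigned char first)
{
    int ones = 0;
    while (ones < 8 && (first & (0x80 >> ones)) != 0) {
        ones++;                       /* count the leading 1 bits */
    }
    if (ones == 0) return 1;          /* 0xxxxxxx: single-byte ASCII */
    if (ones == 1 || ones > 6) return 0;
    return ones;                      /* 2 to 6 bytes */
}
```

For example, the lead byte E8 (1110 1000) has three leading ones, so the sequence E8 BF 9E is three bytes long.
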
The following table shows the correspondence between UCS-2 and UTF-8; after it, the conversion between the two is implemented in C.

| UCS-2 bits | Code range | UTF-8 1st byte | 2nd byte | 3rd byte |
|---|---|---|---|---|
| 000000000aaaaaaa | 0000 - 007F | 0aaaaaaa | | |
| 00000bbbbbaaaaaa | 0080 - 07FF | 110bbbbb | 10aaaaaa | |
| ccccbbbbbbaaaaaa | 0800 - FFFF | 1110cccc | 10bbbbbb | 10aaaaaa |

Note (in reply to alphajay's question): each a, b, or c stands for one bit, 0 or 1, because UCS-2 represents every character with 16 bits (two bytes).
Only single-character conversion is implemented here; converting a string works on the same principle.
1. Converting a UTF-8 character to a UCS-2 character.
The function returns 1 if the conversion succeeds. If the bytes are not a recognized UTF-8 character, it returns 0 and stores a black-box placeholder (0x22E0) in ucs2_code.

    typedef unsigned short UINT16;
    typedef unsigned char UINT8;
    typedef unsigned char BOOL;
    #define TRUE  ((BOOL)1)
    #define FALSE ((BOOL)0)

    BOOL UTF8toUCS2Code(const UINT8 *utf8_code, UINT16 *ucs2_code) {
        UINT16 temp1, temp2;
        BOOL is_recognized = FALSE;
        const UINT8 *in = utf8_code;

        if (!utf8_code || !ucs2_code) {
            return is_recognized;
        }
        if (0x00 == (*in & 0x80)) {
            /* 1-byte UTF-8 character. */
            *ucs2_code = (UINT16)*in;
            is_recognized = TRUE;
        }
        else if (0xc0 == (*in & 0xe0) &&
                 0x80 == (*(in + 1) & 0xc0)) {
            /* 2-byte UTF-8 character. */
            temp1 = (UINT16)(*in & 0x1f);
            temp1 <<= 6;
            temp1 |= (UINT16)(*(in + 1) & 0x3f);
            *ucs2_code = temp1;
            is_recognized = TRUE;
        }
        else if (0xe0 == (*in & 0xf0) &&
                 0x80 == (*(in + 1) & 0xc0) &&
                 0x80 == (*(in + 2) & 0xc0)) {
            /* 3-byte UTF-8 character. */
            temp1 = (UINT16)(*in & 0x0f);
            temp1 <<= 12;
            temp2 = (UINT16)(*(in + 1) & 0x3f);
            temp2 <<= 6;
            temp1 = temp1 | temp2 | (UINT16)(*(in + 2) & 0x3f);
            *ucs2_code = temp1;
            is_recognized = TRUE;
        }
        else {
            /* Unrecognized byte. */
            *ucs2_code = 0x22e0;
            is_recognized = FALSE;
        }
        return is_recognized;
    }
2. Converting a UCS-2 character to a UTF-8 character.
The function returns the length of the resulting UTF-8 sequence (1 to 3 bytes). If the destination pointer is null, it returns 0.

    UINT8 UCS2toUTF8Code(UINT16 ucs2_code, UINT8 *utf8_code) {
        int length = 0;
        UINT8 *out = utf8_code;

        if (!utf8_code) {
            return length;
        }
        if (0x0080 > ucs2_code) {
            /* 1-byte UTF-8 character. */
            *out = (UINT8)ucs2_code;
            length++;
        }
        else if (0x0800 > ucs2_code) {
            /* 2-byte UTF-8 character. */
            *out = ((UINT8)(ucs2_code >> 6)) | 0xc0;
            *(out + 1) = ((UINT8)(ucs2_code & 0x003f)) | 0x80;
            length += 2;
        }
        else {
            /* 3-byte UTF-8 character. */
            *out = ((UINT8)(ucs2_code >> 12)) | 0xe0;
            *(out + 1) = ((UINT8)((ucs2_code & 0x0fc0) >> 6)) | 0x80;
            *(out + 2) = ((UINT8)(ucs2_code & 0x003f)) | 0x80;
            length += 3;
        }
        return length;
    }

Converting a whole string works the same way, one character at a time.
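
As a rough illustration of the string case, the sketch below decodes a NUL-terminated UTF-8 string into an array of UCS-2 code units. It is self-contained rather than built on the functions above, and its name `utf8_to_ucs2_string`, its parameters, and its choice to stop at the first unrecognized sequence are assumptions of this example, not part of the original article.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into UCS-2 code units.
   Writes at most `cap` units to `out` and returns the number written.
   Stops early at the first sequence that is malformed or that needs
   more than three bytes (that is, lies outside the UCS-2 range). */
size_t utf8_to_ucs2_string(const unsigned char *in, uint16_t *out, size_t cap)
{
    size_t n = 0;
    while (*in != 0 && n < cap) {
        if ((in[0] & 0x80) == 0x00) {                       /* 1 byte */
            out[n++] = in[0];
            in += 1;
        } else if ((in[0] & 0xE0) == 0xC0 &&
                   (in[1] & 0xC0) == 0x80) {                /* 2 bytes */
            out[n++] = (uint16_t)(((in[0] & 0x1F) << 6) | (in[1] & 0x3F));
            in += 2;
        } else if ((in[0] & 0xF0) == 0xE0 &&
                   (in[1] & 0xC0) == 0x80 &&
                   (in[2] & 0xC0) == 0x80) {                /* 3 bytes */
            out[n++] = (uint16_t)(((in[0] & 0x0F) << 12) |
                                  ((in[1] & 0x3F) << 6) |
                                  (in[2] & 0x3F));
            in += 3;
        } else {
            break;                                          /* unrecognized */
        }
    }
    return n;
}
```

Because each continuation byte is tested before the next one is read, the terminating NUL stops the checks and the loop never reads past the end of the string. The UTF-8 bytes E8 BF 9E E9 80 9A decode to the two code units 8FDE and 901A.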

 

[Overview]
A computer represents every character as a number. A character encoding maps the characters of a character set to number sequences that a computer can process. Different countries and regions use different languages, and encoding the symbols of a local language yields a local coded character set. For example, Western European countries use ISO8859-1 as their local encoding; mainland China, Singapore, and other regions use GB2312 or GBK; Hong Kong and Taiwan use BIG5; South Korea and Japan use euc-kr and Shift_JIS respectively. An operating system supports many local coded character sets, and its default encoding matches the language version of the operating system that is installed. A local character set encodes only the symbols used locally and does not include those used elsewhere. Even when two local character sets contain the same character, its code differs between them. For example, "中" is 0xD6D0 in GB2312 and GBK but 0xA4A4 in BIG5.
The global trend toward information exchange and integration calls for unifying the local character sets. In April 1984 ISO set up a working group to define a single encoding for the characters and symbols of all countries, which became Unicode. Unicode passed DIS (Draft International Standard) in June 1992, and version 2.0 was released in 1996. Unicode includes 6,811 symbols, 20,902 Chinese characters, 11,172 Korean Hangul syllables, and so on. Although Unicode achieves a unified global encoding, it has clear shortcomings in the number of characters and in encoding efficiency. UTF-8 and UTF-16 are transformation or extension encodings of Unicode; UTF is the abbreviation of Unicode Transformation Format.

[Details]
About the ASCII encoding
ASCII (American Standard Code for Information Interchange) is an encoding for English characters. It encodes each character in one byte whose most significant bit is 0, so the ASCII character set has 128 codes. Since English has only 26 letters plus some other symbols, fewer than 128 in total, the ASCII code space is sufficient. For example, "a" is encoded as 0x61 and "b" as 0x62. Note that "ASCII" sometimes refers to the local encoding. For example, text editors such as UltraEdit have an "ASCII to Unicode" function in which ASCII means the local encoding; if the local encoding is GBK, the function converts from GBK to Unicode.

About the ISO8859-1 encoding
ISO8859-1 is the character encoding commonly used in Western European countries. It uses one byte per character, with a range of 0x00-0xFF: 0x00-0x1F are control characters, 0x20-0x7F are graphic characters (letters, digits, and symbols), and 0xA0-0xFF are the additional part. ASCII uses only the low seven bits of a byte, so its range is only 0-127. That is enough for English characters and some symbols, but not for the letters of other Western European languages, so ASCII cannot serve as the common encoding in Western Europe. To address this, ISO extended ASCII into ISO8859-1, which uses all eight bits of a byte for a range of 0-255 and can hold the letters and symbols of all Western European languages.

About the Chinese character encodings GB2312, GBK, and BIG5
GB2312 is the national information-interchange code of the People's Republic of China. Its full name is "Chinese coded character set for information interchange - basic set". It was issued by the national standards administration and took effect on May 1, 1981; it is used in mainland China, Singapore, and other places. GB2312 contains simplified Chinese characters, symbols, letters, Japanese kana, and more, 7,445 characters in total, of which 6,763 are Chinese characters. The GB2312 code table is divided into 94 rows (0xA1-0xFE), which correspond to the first byte; each row has 94 cells (0xA1-0xFE), which correspond to the second byte. The two byte values of the national-standard code are the row number and the cell number each plus 32 (0x20); the row and cell numbers together are also known as the row-cell code. The resulting 7-bit code range, 0x2121-0x777E, overlaps with ASCII, so in practice the highest bit of each of the two bytes is set to 1 to tell the two apart.

GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20,902 Chinese characters in the code range 0x8140-0xFEFE, excluding the code positions whose high byte is 0x80, and all of its characters map one-to-one to Unicode 2.0.

GB18030-2000 (GBK2K) builds on GBK by adding the characters of Tibetan, Mongolian, and other minority scripts, fundamentally solving the shortage of code positions and glyphs. GBK2K requires, first of all, that every glyph can be mapped to the Unicode 3.0 standard; when this was written, no operating system supported GBK2K yet.

BIG5, also called Big-5, is the Chinese character encoding used in Hong Kong and Taiwan. TW-BIG5 divides all of its characters into two groups, frequently used characters and less frequently used characters; within each group characters are sorted by stroke count, and characters with the same stroke count are sorted by radical. Each TW-BIG5 character consists of two bytes.
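
The row/cell arithmetic for GB2312 can be checked with a few lines of C. This is a sketch of the calculation described above: adding 0x20 to the row and cell numbers gives the 7-bit code, and setting the high bit (adding 0x80) gives the bytes stored in files. The function name `gb2312_from_row_cell` is invented for the example.

```c
/* Convert a GB2312 row and cell number, each in 1..94, to the two
   bytes used in files. Returns 1 on success, 0 if a number is out
   of range. */
int gb2312_from_row_cell(int row, int cell, unsigned char out[2])
{
    if (row < 1 || row > 94 || cell < 1 || cell > 94) {
        return 0;
    }
    /* +0x20 gives the 7-bit national-standard code (0x21..0x7E);
       +0x80 then sets the high bit so the byte cannot be taken for ASCII. */
    out[0] = (unsigned char)(row + 0x20 + 0x80);
    out[1] = (unsigned char)(cell + 0x20 + 0x80);
    return 1;
}
```

Working backwards from the GB2312 code 0xD6D0 quoted in the overview, 0xD6 - 0xA0 = 54 and 0xD0 - 0xA0 = 48, so row 54, cell 48 should map to the bytes D6 D0.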

 


Origin www.cnblogs.com/MCSFX/p/12657325.html