Unicode (Unified Code, Universal Code, Single Code)

Unicode (also called Unified Code, Universal Code, or Single Code) is a character encoding used on computers. It assigns a unified, unique binary code to every character of every language, in order to meet the needs of cross-language, cross-platform text conversion and processing. Development began in 1990 and it was officially announced in 1994. As computers have become more capable, Unicode has gained wide popularity in the decade or more since its launch.

Encoding and implementation of Unicode

Broadly speaking, the Unicode system can be divided into two levels: encoding and implementation.

1. Encoding

The Unicode encoding corresponds to the Universal Character Set (UCS) of ISO 10646. The Unicode version in practical use at the time of writing corresponds to UCS-2, which uses a 16-bit code space, so each character occupies two bytes.

Unicode is a 16-bit character encoding standard, whereas ASCII is a 7-bit code that suits only English, and ISO Latin-1 is an 8-bit encoding used in Western European countries. The benefit of Unicode is that a single character set can cover every written language in use in the world today. A 16-bit code gives 2^16 = 65,536 code values, of which nearly 39,000 have been defined, and Chinese characters alone account for about 21,000.

The 16-bit Unicode characters described above make up the Basic Multilingual Plane (BMP). The latest Unicode version (not yet widely used in practice) defines 16 supplementary planes. Together with the BMP they need at least 21 bits of code space, slightly less than 3 bytes. In practice, however, supplementary-plane characters still occupy 4 bytes of code space, consistent with UCS-4. Future versions will extend to ISO 10646-1 implementation level 3, which covers all the characters of UCS-4. UCS-4 is a larger, still sparsely filled 31-bit character set. With a leading bit that is always 0, it occupies 32 bits in total, i.e. 4 bytes. In theory it can represent up to 2^31 characters, enough to cover the symbols of every language.
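
The bit counts in this paragraph can be checked with a little arithmetic. The following is only a minimal Python sketch. The names PLANE_SIZE, PLANE_COUNT, HIGHEST_CODE_POINT and plane_of are made up for this illustration and do not come from any standard.

```python
# Each plane holds 2**16 code values; the BMP plus 16 supplementary planes gives 17.
PLANE_SIZE = 2 ** 16
PLANE_COUNT = 1 + 16

# Largest code value once all 17 planes are counted.
HIGHEST_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1


def plane_of(code_point: int) -> int:
    """Return the plane number (0 = BMP) that a code value falls in."""
    return code_point >> 16


print(PLANE_SIZE)                           # 65536
print(hex(HIGHEST_CODE_POINT))              # 0x10ffff
print(HIGHEST_CODE_POINT.bit_length())      # 21
print(plane_of(0x4E59), plane_of(0x20000))  # 0 2
```

So 21 bits are enough for the BMP and the 16 supplementary planes, even though a 4-byte unit is used in practice.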

A Unicode character in the BMP is written as U+hhhh, where each h is a hexadecimal digit. This is identical to its UCS-2 code. In the corresponding 4-byte UCS-4 code, the last two bytes are the same and all bits of the first two bytes are 0.
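
As a rough illustration of the relationship between the U+hhhh notation, the two-byte UCS-2 form and the four-byte UCS-4 form, here is a small Python sketch. The helper names ucs2_bytes and ucs4_bytes are hypothetical, and big-endian byte order is assumed so that the bytes read the same way as the hexadecimal digits.

```python
def ucs2_bytes(ch: str) -> bytes:
    """Two-byte big-endian code of a BMP character."""
    if ord(ch) > 0xFFFF:
        raise ValueError("not a BMP character")
    return ord(ch).to_bytes(2, "big")


def ucs4_bytes(ch: str) -> bytes:
    """Four-byte big-endian code of any character."""
    return ord(ch).to_bytes(4, "big")


ch = "\u4e59"
print(f"U+{ord(ch):04X}")    # U+4E59
print(ucs2_bytes(ch).hex())  # 4e59
print(ucs4_bytes(ch).hex())  # 00004e59
```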

2. Implementation

The implementation of Unicode is distinct from its encoding. The code assigned to each Unicode character is fixed. In actual transmission, however, platforms are not all designed the same way and space needs to be saved, so Unicode codes are implemented in different ways. These implementations are called Unicode Transformation Formats (UTF).

For example, if UTF-16 is used in a form that matches the Unicode code directly (BMP characters only), each character occupies two bytes, and the Macintosh (Mac) and the PC disagree about byte order. The same byte stream can then be interpreted as different content. Take the character with hexadecimal code 4E59, which splits into the two bytes 4E and 59. Windows on a PC writes the low-order byte first (59 4E), while a big-endian Mac treats the first byte as the high-order byte, reads the code as 594E, and finds the character "奎". Read in the order Windows intended, the code is U+4E59, the character "乙". In other words, a "乙" saved as UTF-16 under Windows is displayed as "奎" when opened under Mac OS. Because leaving the byte order of UTF-16 undefined causes this kind of confusion, UTF-16 implementations introduced big-endian (UTF-16 BE) and little-endian (UTF-16 LE) variants, together with the optional BOM (Byte Order Mark). On PCs, Windows and Linux systems currently use UTF-16 LE by default for UTF-16. (See UTF-16 for details.)
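
The byte-order problem can be reproduced in a few lines. This is a minimal sketch using Python's built-in codecs, not a description of how any particular operating system reads files. The variable names are illustrative.

```python
import codecs

ch = "\u4e59"                      # the character U+4E59
le_bytes = ch.encode("utf-16-le")  # low-order byte first: 59 4E
be_bytes = ch.encode("utf-16-be")  # high-order byte first: 4E 59

# Reading little-endian bytes as if they were big-endian gives U+594E instead.
misread = le_bytes.decode("utf-16-be")

# With a BOM in front, the generic "utf-16" codec works out the byte order itself
# and drops the BOM from the decoded text.
with_bom = codecs.BOM_UTF16_LE + le_bytes
recovered = with_bom.decode("utf-16")

print(le_bytes.hex(), be_bytes.hex())  # 594e 4e59
print(f"U+{ord(misread):04X}")         # U+594E
print(recovered == ch)                 # True
```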

Other Unicode implementations include UTF-7, Punycode, CESU-8, SCSU, UTF-32 and so on. Some of them are used only in certain countries and regions, and some are planned for future use. The common implementations today are UTF-16 little-endian (with BOM), UTF-16 big-endian (with BOM), and UTF-8. In Notepad, which ships with the Windows XP operating system, the "Save As" dialog box offers four encodings. Apart from ANSI, which is not a Unicode encoding, the other three, "Unicode", "Unicode big endian" and "UTF-8", correspond to these three implementations.
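
A program can tell these three implementations apart by looking at the BOM at the start of the data. The sketch below is one simple way to do that in Python. The function name sniff_bom is hypothetical, and it assumes the data really begins with a BOM. It also ignores UTF-32, whose little-endian BOM starts with the same two bytes as the UTF-16 LE one.

```python
import codecs
from typing import Optional

# (BOM bytes, Python codec name) for the three common implementations.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name guessed from the BOM, or None if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_BE + "\u4e59".encode("utf-16-be")
print(sniff_bom(sample))    # utf-16-be
print(sniff_bom(b"plain"))  # None
```

Data without a BOM returns None, in which case the encoding has to be known from somewhere else.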

Work on the supplementary planes currently concentrates on the CJK Unified Ideographs in the second and third planes, so coordination with the various encodings of Simplified Chinese, Traditional Chinese, Japanese, Korean and Vietnamese characters, including GBK, GB18030 and Big5, is a key task. Since Unicode ultimately aims to cover all characters, these encodings can in a sense be regarded as de facto implementations of Unicode that existed before it appeared, just like ASCII and its extension Latin-1. For the latter two, the first byte of a character's 16-bit Unicode code is all zeros and the second byte is exactly the original code. For the East Asian encodings, however, the correspondence with Unicode is much more complex.
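
The difference between the two kinds of mapping can be seen with Python's built-in "latin-1" and "gbk" codecs. This is only an illustrative sketch, and the helper names as_16bit_code and latin1_matches_unicode are invented for it.

```python
def as_16bit_code(ch: str) -> bytes:
    """16-bit Unicode code of a BMP character, high-order byte first."""
    return ord(ch).to_bytes(2, "big")


def latin1_matches_unicode(byte_value: int) -> bool:
    """True if the Latin-1 byte maps to the 16-bit code 00 followed by the same byte."""
    ch = bytes([byte_value]).decode("latin-1")
    return as_16bit_code(ch) == bytes([0x00, byte_value])


# Every Latin-1 byte (ASCII included) keeps its value as the second byte.
print(all(latin1_matches_unicode(b) for b in range(256)))  # True

# A GBK code, by contrast, has no such simple relation to the Unicode code.
ch = "\u4e59"
print(ch.encode("gbk").hex(), as_16bit_code(ch).hex())
```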

Non-Unicode environment

In a non-Unicode environment, different countries and regions use different character sets, so not all characters may display correctly. Microsoft uses code pages, that is, conversion tables, as a transitional and partial solution to this problem: the table specified converts a non-Unicode character code into the corresponding Unicode code that the system uses internally. A code page can be selected in the language and region settings as the default encoding for non-Unicode text, for example 936 for Simplified Chinese GBK or 950 for Traditional Chinese Big5 (both refer to the PC). In that case, software and documents written in European languages other than English may appear badly garbled. Setting the code page to the matching language then causes problems with Chinese text instead, so this situation cannot be avoided. Fundamentally, a completely unified encoding is the solution, but that cannot yet be achieved.

Code page technology is now widely used on many platforms. The code page for UTF-7 is 65000, and the code page for UTF-8 is 65001.
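
The effect of choosing the wrong code page can be sketched with Python's "cp936" codec. The use of "cp1252" (a Western European code page) as the mismatched choice is an assumption made for this example and is not mentioned above.

```python
text = "\u4e2d\u6587"       # two Chinese characters
raw = text.encode("cp936")  # the bytes as stored under code page 936 (GBK)

# Decoding with the matching code page recovers the Unicode text.
correct = raw.decode("cp936")

# Decoding the same bytes with a different code page produces garbled text.
garbled = raw.decode("cp1252", errors="replace")

print(len(raw))         # 4
print(correct == text)  # True
print(garbled == text)  # False
```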

XML and Unicode

XML and HTML adopt UTF-8 as the standard character encoding. In theory, a browser that supports the XML standard can display web pages in the script of any region, as long as the computer has suitable fonts installed. A specific character can be displayed with the form &#nnn;, where nnn is the decimal Unicode code of the character. To use a hexadecimal code instead, put the character x in front of the code. However, some older browsers may not recognize hexadecimal codes.
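
As a small sketch of the two reference forms, the following Python code builds them from a character and decodes them again with html.unescape from the standard library. The helper names decimal_ref and hex_ref are hypothetical.

```python
import html


def decimal_ref(ch: str) -> str:
    """Numeric character reference using the decimal Unicode code."""
    return f"&#{ord(ch)};"


def hex_ref(ch: str) -> str:
    """Numeric character reference using the hexadecimal code, prefixed with x."""
    return f"&#x{ord(ch):X};"


ch = "\u4e59"
print(decimal_ref(ch))  # &#20057;
print(hex_ref(ch))      # &#x4E59;
print(html.unescape(decimal_ref(ch)) == html.unescape(hex_ref(ch)) == ch)  # True
```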

However, partly because of how Unicode versions have developed, many browsers can display only a small subset of the full UCS-2 character set of the Unicode version currently in use.

Entering Unicode

Besides input methods, the operating system provides several ways to enter Unicode. For example, Windows systems from Windows 2000 onward provide a clickable character table. In Microsoft Word, hold down the Alt key, type 0 followed by the Unicode code of a character in decimal, then release the Alt key to get the character. For example, Alt+033865 gives the Unicode character "葉" (leaf). In addition, pressing the Alt+X key combination in Microsoft Word converts the character before the cursor to its four-digit hexadecimal Unicode code, and back again.
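
The conversions behind these two shortcuts are simple number conversions. The sketch below only mirrors the arithmetic in Python and does not model how Word itself handles the keys. The function names are made up for the example.

```python
def from_alt_code(digits: str) -> str:
    """Character for a decimal code typed after Alt, such as "033865"."""
    return chr(int(digits, 10))


def char_to_hex(ch: str) -> str:
    """Four-digit hexadecimal Unicode code of a character."""
    return f"{ord(ch):04X}"


def hex_to_char(code: str) -> str:
    """Character for a hexadecimal Unicode code."""
    return chr(int(code, 16))


ch = from_alt_code("033865")
print(char_to_hex(ch))                         # 8449
print(hex_to_char(char_to_hex(ch)) == ch)      # True
```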

Unicode has now reached version 5.0. Many specialists in computing, linguistics and other fields work on Unicode. The Unicode standard is now not only a coding standard but also a huge database recording the written languages of the world, and it contributes to the recovery and protection of humanity's cultural heritage.

For Chinese, the 16-bit Unicode encoding already contains all the characters in GB18030 (27,484 characters), and the Unicode standard is now preparing to put all the characters of the Kangxi Dictionary into the 32-bit Unicode encoding.

UTF-8

UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and the first byte of its encoding remains compatible with ASCII, so software that handles ASCII text can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for e-mail, web pages and other applications that store or transmit text.

UTF-8 uses one to four bytes to encode each character:

The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).

Latin letters with diacritics, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana need two bytes (Unicode range U+0080 to U+07FF).

Other characters in the Basic Multilingual Plane (BMP), which contains most commonly used characters, are encoded in three bytes.

The rarely used characters in the supplementary planes of Unicode are encoded in four bytes.
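
The four cases can be checked with Python's built-in UTF-8 codec. This is just a quick sketch. The sample characters and the helper name utf8_length are chosen for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


samples = {
    "U+0041": "\u0041",       # US-ASCII letter
    "U+00E9": "\u00e9",       # Latin letter with a diacritic
    "U+4E59": "\u4e59",       # BMP character outside U+0000 to U+07FF
    "U+20000": "\U00020000",  # supplementary-plane character
}

for label, ch in samples.items():
    print(label, utf8_length(ch))
# U+0041 1
# U+00E9 2
# U+4E59 3
# U+20000 4
```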

For the fourth kind of character above, UTF-8's four-byte encoding may seem too costly. However, UTF-8 can represent all commonly used characters in three bytes or fewer, and the alternative, UTF-16, also needs four bytes for the fourth kind of character. Whether UTF-8 or UTF-16 is more efficient therefore depends on the range of characters being used. With a traditional compression scheme such as DEFLATE, however, the differences between these encodings become negligible. Because traditional compression algorithms do poorly on short texts, the Standard Compression Scheme for Unicode (SCSU) is worth considering in that case.
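
To compare the two encodings on a given text, something like the following Python sketch can be used. The function name encoded_sizes is hypothetical, and zlib (which uses DEFLATE) stands in for "a traditional compression scheme". The compressed sizes depend on the input, so no particular result is claimed for them here.

```python
import zlib


def encoded_sizes(text: str) -> dict:
    """Map each encoding name to (raw size, zlib-compressed size) in bytes."""
    sizes = {}
    for name in ("utf-8", "utf-16-le"):
        raw = text.encode(name)
        sizes[name] = (len(raw), len(zlib.compress(raw)))
    return sizes


ascii_text = "a" * 100     # 1 byte each in UTF-8, 2 in UTF-16
cjk_text = "\u4e59" * 100  # 3 bytes each in UTF-8, 2 in UTF-16
rare_text = "\U00020000"   # 4 bytes in both

for sample in (ascii_text, cjk_text, rare_text):
    print(encoded_sizes(sample))
```

ASCII-heavy text is smaller in UTF-8, while text made mostly of three-byte BMP characters is smaller in UTF-16.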

The bits of a Unicode character are divided into several parts and assigned to the low-order bits of the bytes in the UTF-8 sequence. Characters below U+0080 are encoded as a single byte holding their code, and these codes correspond exactly to the 7-bit ASCII characters. In other cases, up to four bytes may be needed to represent one character. The most significant bit of every byte in a multi-byte sequence is set to 1, to prevent confusion with 7-bit ASCII characters and to keep standard byte-oriented string functions working smoothly.
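
The bit distribution described above can be written out by hand. The following Python function is a minimal sketch of it, not a replacement for the built-in codec. The name utf8_encode is made up, and for brevity the sketch does not reject the surrogate range U+D800 to U+DFFF as a full encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code value as UTF-8 by distributing its bits over 1 to 4 bytes."""
    if code_point < 0x80:
        # 7 bits fit in a single byte, identical to ASCII.
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code value out of range")


print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x4E59).hex())  # e4b999
# The result agrees with Python's own codec.
print(utf8_encode(0x4E59) == "\u4e59".encode("utf-8"))  # True
```

Every byte of a multi-byte sequence is 0x80 or above, so none of them can be mistaken for an ASCII character.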

Origin www.cnblogs.com/youpeng/p/10991271.html