Wide characters and Unicode (length of Chinese strings in C)

In C, we use char to define a character. A char occupies one byte and, as used for ASCII, can represent only 128 characters. Computers originated in the United States, and char can represent every English character, so this causes no problems in English-speaking countries.

But the world has many other languages, such as Chinese and Japanese, with thousands of characters that need multiple bytes each; these are called wide characters (Wide Character). Unicode is a wide-character encoding that modern computers have adopted as the default. Windows 2000 and later operating systems, including Windows 2000, XP, Vista, Win7, Win8, Win10, Windows Phone, and Windows Server (collectively referred to as Windows NT), support Unicode from the ground up and handle it more efficiently than char strings.

For more information, please see: ASCII encoding and Unicode encoding

Wide characters in C

In C, wide characters are defined with the wchar_t type from the wchar.h header, for example:

wchar_t ch = 'A';

On Windows, wchar_t is defined as typedef unsigned short wchar_t; it is an unsigned integer type that occupies two bytes.

To define a wide string, prefix the literal with L, for example:

wchar_t *str = L"C language Chinese net";

The L prefix is required, and there must be no space between it and the string; only then does the compiler know that each character occupies two bytes.

Wide character example:

#include <stdio.h>
#include <wchar.h>

int main(void) {
    char ch = 'A';
    wchar_t wch = L'A';
    /* "C语言中文网": one ASCII letter followed by five Chinese characters */
    char str[] = "C语言中文网";
    wchar_t wstr[] = L"C语言中文网";
    /* sizeof yields size_t, so cast to int to match %d */
    printf("ch=%d, wch=%d, str=%d, wstr=%d\n", (int)sizeof(ch), (int)sizeof(wch), (int)sizeof(str), (int)sizeof(wstr));
    return 0;
}

Output:
ch=1, wch=2, str=12, wstr=14

wstr is two bytes larger than str for two reasons: in the wide string the character 'C' occupies two bytes instead of one, and the terminating '\0' also occupies two bytes instead of one.

Length of a wide string

The length of an ASCII string is computed with the strlen function; the length of a wide string is computed with the wcslen function:

#include <stdio.h>
#include <wchar.h>
#include <string.h>

int main(void) {
    /* "C语言中文网": one ASCII letter followed by five Chinese characters */
    char str[] = "C语言中文网";
    wchar_t wstr[] = L"C语言中文网";
    /* both functions return size_t, so cast to int to match %d */
    printf("strlen(str)=%d, wcslen(wstr)=%d\n", (int)strlen(str), (int)wcslen(wstr));
    return 0;
}

Output:
strlen(str)=11, wcslen(wstr)=6

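The strlen result is clearly not the number of characters: strlen counts every byte as one character, so a character stored in two bytes is counted twice, whereas wcslen counts each two-byte unit as one character.

Note: with the Microsoft compiler, wcslen is declared in both string.h and wchar.h; standard C only guarantees it in wchar.h.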

Maintaining a single version of the source code

Operating systems before Windows NT, including Windows 98, did not support wide characters well, so ASCII encoding was used in most cases. Since the launch of Windows NT, Unicode has been supported from the ground up, so most programs on Windows NT use Unicode.

If a program had to run on every version of Windows, you would need to maintain two versions of the source code: an ASCII version and a Unicode version. ASCII and Unicode characters are defined and used differently, so making one version of the source compatible with both would be difficult and laborious, a nightmare for programmers.

Fortunately, Windows has done us a favor and already handled the compatibility problem. How does it do that?

Take strings as an example: ASCII strings are defined with char, while Unicode strings are defined with wchar_t and need the L prefix. The windows.h header (or one of the headers it includes) handles this as follows:

#ifdef UNICODE
typedef wchar_t TCHAR;
#define TEXT(quote) L##quote
#else
typedef char TCHAR;
#define TEXT(quote) quote
#endif

In our source code we can then write:

TCHAR str[] = TEXT("C Language Chinese network");

In a Unicode build, that is, when the UNICODE macro is defined, the statement above is equivalent to:

wchar_t str[] = L"C Language Chinese network";

In an ASCII build, where the UNICODE macro is not defined, it is equivalent to:

char str[] = "C Language Chinese network";

This kind of handling appears throughout Windows. Although modern operating systems support Unicode and compatibility with ASCII is no longer a concern, we still have to pay for this historical baggage.

Summary: for a variety of reasons, we prefer to use the data types, macros, and structures defined by Windows. Programs written this way are more compatible and do not have to deal with the ASCII versus Unicode question. The cost is that you must become familiar with the data types, macros, and structures that Windows defines.

Origin www.cnblogs.com/qiumingcheng/p/11334777.html