Linux character set and encoding

Linux character set settings

1. Query
(1) Check the character set currently used by the server: # locale
(2) List the character sets supported by the server: # locale -a

2. There are two ways to change the Linux system character set:
(1) Set the variables directly, using either of the following two commands:

[root ~]# LANG="xxx" or export LANG="xxx"

[root ~]# LC_ALL="xxx" or export LC_ALL="xxx"

(Note: xxx is the character set to switch to.)
To list the available character sets, run locale -a. Commonly used ones are zh_CN.GB2312, zh_CN.GB18030, zh_CN.UTF-8, en_US.UTF-8, etc.
This is why running "LANG=" after logging in often makes garbled characters go away: it clears the character set setting. You can also clear it with the following command:

[root ~]# unset LANG

However, the changes above only take effect in the current shell; the setting is lost as soon as a new shell is started.
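The scope of such a change can be seen with a small sketch. It is only an illustration: a subshell stands in for the shell in which LANG is changed by hand, and the value en_US.UTF-8 is an arbitrary example (some shells print a warning if that locale is not installed).

```shell
#!/bin/sh
# Value of LANG in the parent shell before the experiment ("unset" if absent).
before="${LANG-unset}"

# $( ... ) starts a subshell, which plays the role of the shell in which
# LANG is changed by hand.
inner=$(
    export LANG="en_US.UTF-8"     # exported, so child processes inherit it
    sh -c 'printf "%s" "$LANG"'   # a program started from that shell sees it
)

# Back in the parent shell nothing has changed.
after="${LANG-unset}"

echo "inside the modified shell: $inner"
echo "parent shell before/after: $before / $after"
```

The value set inside the subshell is visible to programs started from it, but the parent shell keeps whatever it had before.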

(2) Edit the configuration file: the setting is controlled by the /etc/sysconfig/i18n file

[root ~]# vim /etc/sysconfig/i18n

LANG="zh_CN.GB18030"    # the system language
SUPPORTED="zh_CN.UTF-8:zh_CN.GB18030:zh_CN:zh:en_US.UTF-8:en_US:en"
SYSFONT="lat0-sun16"
After saving the file and exiting the editor, run the following command for the change to take effect:

[test]$ source /etc/sysconfig/i18n

You can also set the environment variables of the Linux system in /etc/profile (global) or ~/.bashrc (individual user).
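As a sketch, lines like the following could be appended to ~/.bashrc (or to /etc/profile for all users). The value zh_CN.UTF-8 is only an example; it should be one of the names printed by locale -a on your system.

```shell
# Default locale for every new shell that reads this file.
export LANG="zh_CN.UTF-8"
# Leave LC_ALL unset so that individual LC_* variables can still override LANG.
unset LC_ALL
```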

3. For more detailed instructions, please refer to: http://blog.chinaunix.net/uid-8489474-id-2031042.html

Information on encoding and character sets under Linux (Locale detailed explanation) [The following content is reproduced]

Locale is a very important concept in internationalization and localization. I personally think that, for Chinese users, internationalization or localization usually involves three things: reading Chinese, writing Chinese, and compatibility and communication with Chinese Windows systems. From practical experience, the locale setting has little to do with reading Chinese, but it is closely tied to writing Chinese and to the way Windows partitions are mounted. Just as a pure English Windows can browse Chinese, Japanese or Italian web pages, you can browse Chinese pages without setting a locale. So why do we need to set the locale, and when is it used?

(1) Why do we need to set locale?

As mentioned before, setting the locale has no direct bearing on whether you can browse Chinese web pages. Even if you set the locale to a standard English locale such as en_US.ISO-8859-1, you can still browse them. As long as your system has the corresponding character set (which is not strictly required) and a suitable font (such as simsun), the browser can render the page in Chinese for you to read. The process is as follows: after the network delivers the page to your machine, the browser determines which character set the page is encoded in, looks in the font library for a suitable font for that character set, and the text rendering tool then draws the corresponding glyphs on the screen.
In what follows, I will occasionally compare a character set to a codebook, because I personally find it makes some things easier to understand. If you are not used to it, copy the full text into any text editor and replace "codebook" with "character set".
So why do web pages sometimes display garbled characters or boxes? Personally, I think garbled characters appear because the wrong character set is selected (or the corresponding character set is missing). For example, if a page is encoded in UTF-8 but you insist on viewing it as GB2312, the system looks up glyphs according to GB2312, and what appears on the screen is of course a pile of garbled characters. It is like translating a telegram with the wrong codebook: the content is bound to be nonsense. As for pages where some Chinese characters display correctly but many others show up as boxes: being able to display Chinese characters at all means the browser has determined the page's encoding correctly and found the corresponding glyphs in the font library. However, not every font contains all the characters of a character set, so the display is sometimes incomplete. Finding a more complete font that covers more characters solves this.
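The "wrong codebook" effect can be reproduced in a terminal. The sketch below assumes that iconv is installed and that the script is saved as UTF-8. It uses ISO-8859-1 rather than GB2312 as the wrong codebook, because every byte is valid in ISO-8859-1 and the conversion therefore cannot fail; the principle is the same.

```shell
#!/bin/sh
# Two Chinese characters, stored in this script as UTF-8 (3 bytes each).
text='中文'

# Right codebook: the terminal interprets the bytes as UTF-8.
printf '%s\n' "$text"

# Wrong codebook: the same bytes are declared to be ISO-8859-1, so every
# single byte is treated as a separate character and the result is garbled.
garbled=$(printf '%s' "$text" | iconv -f ISO-8859-1 -t UTF-8)
printf '%s\n' "$garbled"

# The raw bytes never changed; only their interpretation did.
printf '%s' "$text" | od -An -tx1
```

The second line of output is unreadable even though no data was lost, which is exactly what happens when a browser guesses the wrong encoding.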
Since Chinese web pages can be browsed anyway, why set the locale? Have you ever wondered how it is that the Chinese section of the official X forum is encoded in UTF-8 (even though everyone has always strongly recommended GB2312), Sina uses GB2312, and the official website of Y is actually encoded in ISO-8859-15, yet all of them can be browsed without setting any of these locales? It is as if you owned every codebook: whatever character set a website uses, you can translate it with one of the codebooks in your hand. The problem is that, although you can browse Chinese pages, English still runs through the whole operating system. You can read English and you can read Chinese, but the fundamental problem remains: you cannot write Chinese.
When you decide to write something, the first thing to settle is which language to write in. For a computer, that means which character set to use: you must tell your Linux system which codebook to use for what you are about to write. You already know why the GB2312 character set is needed to browse Sina: Sina's pages are written in GB2312. To be able to input Chinese on Linux, you need to set the system locale (strictly speaking, the locale category LC_CTYPE) to a Chinese one, such as zh_CN.GB2312, zh_CN.GB18030 or zh_CN.UTF-8. Many people do not understand these odd-looking expressions and wonder what they specify. This is discussed in detail later; for now, it is enough to know that this is how a locale is written.
(2) What is locale?

The word locale translates into Chinese as region or territory, but its actual meaning is much broader. A locale is the language environment a program runs in, defined by the language of the computer user, the country or region where the user lives, and local cultural conventions. This environment is divided into several categories according to the cultural conventions involved, usually: the characters used and their classification (LC_CTYPE), number formatting (LC_NUMERIC), comparison and sorting rules (LC_COLLATE), time display format (LC_TIME), currency (LC_MONETARY), messages, which mainly cover prompts, errors, status information, titles, labels, buttons and menus (LC_MESSAGES), the way names are written (LC_NAME), the way addresses are written (LC_ADDRESS), the way telephone numbers are written (LC_TELEPHONE), units of measurement (LC_MEASUREMENT), default paper size (LC_PAPER), and a description of the locale itself (LC_IDENTIFICATION).
A locale therefore captures the language habits, cultural traditions and everyday conventions of people in a particular region, and the locale of a region is defined in terms of these categories. The locale definition files are stored in the /usr/share/i18n/locales directory; en_US, zh_CN and de_DE@euro are examples. They are plain text files that can be opened in any text editor. Apart from the few comments, though, most of the content may be hard to read because characters are written as Unicode code point indexes.
A brief note on de_DE@euro: the part after @ is the modifier, which is why there are two German locales:

/usr/share/i18n/locales/de_DE@euro
/usr/share/i18n/locales/de_DE

Open the two definitions and you will see the difference. According to the original author, de_DE@euro uses European sorting, comparison and indentation conventions while de_DE uses the German standard ones; in glibc the @euro variants were introduced mainly to select the euro as the currency in LC_MONETARY.
So far we have covered the first half of zh_CN.GB18030. What is the second half? As most Linux users know, it is the character set used by the system.

(3) What is a character set?

A character set is the way characters, especially non-English characters, are encoded in the system; it is commonly known as the internal code. All character sets are stored in /usr/share/i18n/charmaps, and every character set is indexed by Unicode numbers. Unicode assigns a single number to every known symbol, while a character set is the way those symbols are encoded, that is, how different characters are represented in network transmission and inside the computer. Unicode is a static concept; a character set is a dynamic one, the concrete form in which each character is stored or transmitted. For example, the Unicode number U+59D0 stands for the character 姐 ("elder sister"), but whether that character is represented by two, three or four bytes depends on the character set. UTF-8 is currently the most popular encoding: it uses one byte for ASCII characters such as the basic Latin letters, two bytes for many other alphabetic characters and symbols, three bytes for most other commonly used characters, including common Chinese characters, and four bytes for rarer characters. The GB2312 character set uses two bytes for every Chinese character (in its usual EUC-CN form, ASCII characters still take one byte).
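The byte counts can be checked with iconv. This is a minimal sketch: bytes_in is a hypothetical helper name, the script is assumed to be saved as UTF-8, and the charset names are assumed to be known to the local iconv (they are in glibc).

```shell
#!/bin/sh
# bytes_in CHARSET TEXT: number of bytes TEXT occupies when encoded in CHARSET.
# TEXT is assumed to arrive as UTF-8, because this script is saved as UTF-8.
bytes_in() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | wc -c | tr -d ' '
}

# Compare an ASCII letter with the character U+59D0 in several encodings.
# The BE variants are used so that no byte order mark is added to the count.
for cs in UTF-8 GB2312 GB18030 UTF-16BE UTF-32BE; do
    echo "$cs: A=$(bytes_in "$cs" A) 姐=$(bytes_in "$cs" 姐)"
done
```

The character 姐 takes 3 bytes in UTF-8, 2 bytes in GB2312 and 4 bytes in UTF-32BE, while the letter A takes 1 byte in UTF-8 and GB2312 and 4 bytes in UTF-32BE.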

One thing needs mentioning: Unicode itself only assigns numbers to characters, and the most direct way to store those numbers is to use four bytes for every character, which is what UTF-32 does. This is an important concept when it comes to mounting Windows partitions. Storing text this way wastes a lot of space, because much of the time the computer world only uses the 26 Latin letters, which fit in a single byte. That is why UTF-8, UTF-16 and others exist. Otherwise we would have one encoding for the whole world, which would save a lot of trouble.


(4) What exactly does zh_CN.GB2312 mean?

A locale is the language environment a program runs in, and it includes the language, the territory and the character set (codeset). A locale name is written in the format language[_region[.character set]], so a locale is always associated with a character set. Here are a few examples:
a. I speak Chinese, live in the People's Republic of China, and use the national standard GB 2312 character set. zh_CN.GB2312 = Chinese_People's Republic of China.GB 2312 character set

b. I speak Chinese, live in the People's Republic of China, and use the national standard GB 18030 character set. zh_CN.GB18030 = Chinese_People's Republic of China.GB 18030 character set

c. I speak Chinese, live in Taiwan, and use the Big5 character set. zh_TW.BIG5 = Chinese_Taiwan.Big5 character set

d. I speak English, live in Great Britain, and use the ISO-8859-1 character set. en_GB.ISO-8859-1 = English_Great Britain.ISO-8859-1 character set

e. I speak German, live in Germany, use the UTF-8 character set, and follow European conventions. de_DE.UTF-8@euro = German_Germany.UTF-8 character set@modified according to European conventions

Note that the modifier comes last: it is not de_DE@euro.UTF-8. The complete locale expression is therefore language[_region][.character set][@modifier]
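The format can be taken apart with plain shell parameter expansion. The sketch below is illustrative only: parse_locale is a hypothetical helper, not part of any locale tool, and it does not check whether the resulting locale actually exists.

```shell
#!/bin/sh
# parse_locale NAME: split language[_region][.character set][@modifier]
# into its four parts. Missing parts are printed as empty strings.
parse_locale() {
    name=$1
    modifier=""; codeset=""; territory=""
    # Peel the optional parts off from the right: @modifier, .codeset, _region.
    case $name in *@*) modifier=${name#*@}; name=${name%%@*} ;; esac
    case $name in *.*) codeset=${name#*.};  name=${name%%.*} ;; esac
    case $name in *_*) territory=${name#*_}; name=${name%%_*} ;; esac
    language=$name
    echo "language=$language territory=$territory codeset=$codeset modifier=$modifier"
}

parse_locale de_DE.UTF-8@euro   # language=de territory=DE codeset=UTF-8 modifier=euro
parse_locale zh_CN.GB18030      # language=zh territory=CN codeset=GB18030 modifier=
```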

Generated locales are placed in the /usr/lib/locale/ directory, and each locale corresponds to one folder. That is, once a locale such as de_DE.UTF-8@euro has been created, a directory for it is generated under /usr/lib/locale/, containing the specific content of that locale.
(5) How to customize locale

Generating locales on Gentoo is very easy. First add userlocales to USE, then edit the locales.build file. This file tells glibc which locale files to generate. Many people do not understand what each entry means, but with the explanation above it should now be clear.

File: /etc/locales.build

en_US/ISO-8859-1
en_US.UTF-8/UTF-8
zh_CN/GB18030
zh_CN.GBK/GBK
zh_CN.GB2312/GB2312
zh_CN.UTF-8/UTF-8

The above is the locales.build file. Its entries mean the following:

a. en_US/ISO-8859-1: Generate a locale named en_US that uses the ISO-8859-1 character set, and make it the default for the English_US locale class. It is in fact no different from en_US.ISO-8859-1/ISO-8859-1.

b. en_US.UTF-8/UTF-8: Generate a locale named en_US.UTF-8 that uses the UTF-8 character set.

c. zh_CN/GB18030: Generate a locale named zh_CN that uses the GB18030 character set, and make it the default for the Chinese_China locale class. It is in fact no different from zh_CN.GB18030/GB18030.

d. zh_CN.GBK/GBK: Generate a locale named zh_CN.GBK that uses the GBK character set.

e. zh_CN.GB2312/GB2312: Generate a locale named zh_CN.GB2312 that uses the GB2312 character set.

f. zh_CN.UTF-8/UTF-8: Generate a locale named zh_CN.UTF-8 that uses the UTF-8 character set.

Regarding default locales: a default locale can be abbreviated to en_US or zh_CN. This is only for brevity and has no special meaning.

When locales are defined this way, Gentoo hides the tool that actually generates them: localedef. After glibc has been compiled you can use localedef to add more locales, which also helps you understand locales better. For details, see the localedef man page.

$ localedef -f <character set> -i <locale definition file> <name of the generated locale>

For example:

$ localedef -f UTF-8 -i zh_CN zh_CN.UTF-8

This has the same result as the entry zh_CN.UTF-8/UTF-8 in locales.build.
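To experiment with localedef without touching /usr/lib/locale/, the compiled locale can be written to a private directory. This is only a sketch under some assumptions: it relies on glibc's localedef treating an output name that contains a slash as a directory path, on the LOCPATH environment variable being honoured, and on the zh_CN definition file being installed. The variable outdir is a throwaway location chosen for the example.

```shell
#!/bin/sh
# Compile zh_CN.UTF-8 into a private directory, so no root access is needed.
outdir=$(mktemp -d)

if command -v localedef >/dev/null 2>&1 && [ -f /usr/share/i18n/locales/zh_CN ]; then
    # An output name containing a slash is treated as a directory path.
    # localedef may exit with status 1 when it only printed warnings.
    localedef -f UTF-8 -i zh_CN "$outdir/zh_CN.UTF-8" || true

    # One compiled file per category (LC_CTYPE, LC_COLLATE, ...) is expected.
    ls "$outdir/zh_CN.UTF-8"

    # LOCPATH tells glibc where to look for compiled locales.
    LOCPATH="$outdir" LC_ALL=zh_CN.UTF-8 locale charmap
else
    echo "localedef or the zh_CN definition file is not available here"
fi
```

If everything is in place, the last command should report the character set of the freshly generated locale.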

(6) Detailed explanation of locale 

Several locales have now been generated, but for them to take effect the Linux system must be told which locale(s) to use. This requires some understanding of how locales work internally. As mentioned before, a locale is divided into 12 categories according to the cultural conventions involved:
1. Characters and their classification (LC_CTYPE)
2. Number formatting (LC_NUMERIC)
3. Comparison and sorting rules (LC_COLLATE)
4. Time display format (LC_TIME)
5. Currency (LC_MONETARY)
6. Messages, mainly prompts, errors, status information, titles, labels, buttons and menus (LC_MESSAGES)
7. The way names are written (LC_NAME)
8. The way addresses are written (LC_ADDRESS)
9. The way telephone numbers are written (LC_TELEPHONE)
10. Units of measurement (LC_MEASUREMENT)
11. Default paper size (LC_PAPER)
12. A description of the locale itself (LC_IDENTIFICATION)
Among these, the one most closely related to Chinese input is LC_CTYPE. LC_CTYPE specifies which characters are valid in the system and how they are classified: which are uppercase letters, which are lowercase letters, how case conversion works, which are punctuation marks, which are printable, and other character attributes. The most important item in the zh_CN locale definition is the class of Chinese characters (class "hanzi"), which is of course also described in Unicode. It makes Chinese characters legal, valid characters in the Linux system regardless of the character set they are encoded in.
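LC_CTYPE
% This is a copy of the "i18n" LC_CTYPE with the following modifications:
% - Additional classes: hanzi

copy "i18n"

class "hanzi"; /
% ..;/
..;/
;;;;;;;;/
;;;;;;;;/
;;;;
END LC_CTYPE

(The Unicode code point entries of the class are missing from this excerpt; only the separators remain.)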

The en_US locale definition does not define Chinese characters, so Chinese characters are not valid characters there. To input Chinese you must therefore use a locale that supports Chinese, that is, zh_XX, such as zh_CN, zh_TW or zh_HK.
Another very important point is that the categories are independent of each other: LC_CTYPE, LC_COLLATE, LC_MESSAGES and the rest can each be set to a different value as the user requires. This is useful, even necessary, for many users. For example, I need an English environment in which I can input Chinese, so I can set LC_CTYPE to zh_CN.GB18030 and every other category to en_US.UTF-8.

(7) Set locale  

Setting the locale means setting the 12 locale categories, that is, the 12 LC_* variables. Besides these 12 variables, there are two more for convenience: LC_ALL and LANG.
They have a priority order: LC_ALL > LC_* > LANG. LC_ALL is the highest-level, overriding setting, while LANG is the default.
a. If you set LC_ALL=zh_CN.UTF-8, then whatever values LC_* and LANG have, they are forced to follow LC_ALL and become zh_CN.UTF-8.
b. If you set LANG=zh_CN.UTF-8 and the other LC_*=en_US.UTF-8, and do not set LC_ALL, then the effective locale setting is LC_*=en_US.UTF-8.
c. If you set LANG=zh_CN.UTF-8 and set neither LC_* nor LC_ALL, the system gives every LC_* the default value, which is the value of LANG, zh_CN.UTF-8.
d. If you set LANG=zh_CN.UTF-8 and LC_CTYPE=en_US.UTF-8, and set neither the other LC_* nor LC_ALL, then the effective setting is LC_CTYPE=en_US.UTF-8, while LC_COLLATE, LC_MESSAGES and the rest take the default value, which is the value of LANG; that is, LC_COLLATE=LC_MESSAGES=...=LC_PAPER=LANG=zh_CN.UTF-8.
Therefore, the locale is set like this:

a. If you need a pure Chinese system, set LC_ALL=zh_CN.XXXX or LANG=zh_CN.XXXX. You can of course set both, but as mentioned above, the value of LC_ALL overrides every other locale setting, so there is no point in setting the others as well.

b. If you only want an environment in which you can input Chinese while menus, titles, system messages and so on stay in English, you only need to set LC_CTYPE=zh_CN.XXXX and LANG=en_US.XXXX. The result is LC_CTYPE=zh_CN.XXXX and LC_COLLATE=LC_MESSAGES=...=LC_PAPER=LANG=en_US.XXXX.

c. If you feel like it, you can set the 12 LC_* variables one by one to the values you need and build a truly odd system: LC_CTYPE=zh_CN.GBK (Chinese characters in the GBK character set); LC_NUMERIC=en_GB.ISO-8859-1 (British number formatting); LC_MEASUREMENT=de_DE.ISO-8859-15@euro (German units of measurement with the ISO-8859-15 character set); Roman address conventions, American paper settings, and so on.

d. If you do nothing, that is, if none of LC_ALL, LANG and LC_* is given a value, the system uses POSIX as the locale, which is the C locale.
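The priority rule LC_ALL > LC_* > LANG can be written down as a small function. This is a sketch of the lookup order only: resolve_locale is a hypothetical helper that takes the three values as arguments, and it neither reads nor changes the real environment.

```shell
#!/bin/sh
# resolve_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Prints the value that ends up applying to one category.
# An empty string is treated the same as an unset variable.
resolve_locale() {
    if [ -n "$1" ]; then
        echo "$1"        # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        echo "$2"        # then the category's own LC_* variable
    elif [ -n "$3" ]; then
        echo "$3"        # then LANG as the default
    else
        echo "POSIX"     # nothing set at all
    fi
}

# English environment with Chinese input: only LC_CTYPE and LANG are set.
echo "LC_CTYPE:    $(resolve_locale "" "zh_CN.GB18030" "en_US.UTF-8")"
echo "LC_MESSAGES: $(resolve_locale "" "" "en_US.UTF-8")"
# LC_ALL wins over both the category variable and LANG.
echo "LC_CTYPE:    $(resolve_locale "zh_CN.UTF-8" "en_US.UTF-8" "en_US.UTF-8")"
# Nothing set: the POSIX (C) locale applies.
echo "LC_TIME:     $(resolve_locale "" "" "")"
```

On a real system, running locale with no arguments prints the effective value of every category after this lookup has been applied.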
————————————————

[root@localhost ~]# vi /etc/sysconfig/i18n
English:
LANG="en_US.UTF-8"
Chinese:
LANG="zh_CN.UTF-8"
To take effect immediately:
source /etc/sysconfig/i18n

Is the meaning of the above completion clear?


Origin blog.csdn.net/budapest/article/details/130533016