[Solution] Garbled Chinese characters in ArcGIS vector data (SHP) attribute tables

Recently, many readers have asked about garbled Chinese characters in vector data. Indeed, this problem has troubled us for a long time, and you have probably already consulted various methods and suggestions. Try those first; I will not repeat the solutions you can easily search for. If you still cannot solve it, leave a message and I will help you.

Next, I will briefly explain the root cause of this problem so that everyone can understand it.
First: Character Encoding
  Computers store information in binary. Each "character" is represented by a specific code (1-4 bytes, 1 byte = 8 binary bits), that is, a number; this mapping rule is called a "character encoding". Given the binary "number", the computer looks up the corresponding "character" in the "character set" that matches the "numbering rule (character code)", and displays it with a matching font (a character set corresponds to several "font libraries"). When data is copied from elsewhere or downloaded from the network, a wrong system environment or encoding rule results in garbled Chinese characters.

ASCII
  The United States invented the first "character encoding", named ASCII (American Standard Code for Information Interchange). It contains 128 characters (0-127); each character is represented by 8 binary bits, with the first bit fixed at 0 and the remaining 7 bits identifying a character. For example, 'A' is 01000001 in binary, 65 in decimal, and 0x41 in hexadecimal. This is why we often say an English letter occupies 1 byte (8 bits = 1 Byte).
  The Americans thought one byte (which can represent 256 codes) was more than enough for all the letters, digits, and common special symbols in the English-speaking world (in fact, ASCII only uses the first 128 codes). Later, European countries objected: ASCII could not represent their letters, so the space after 127 was used for those characters, and the 128-255 range is called the extended character set. Why up to 255? Because 255 is the largest decimal number that 8 binary digits can represent.
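
To make this concrete, here is a small illustration (Python is used throughout this post purely for demonstration) confirming that 'A' is code 65 / 0x41 and occupies exactly 1 byte under ASCII:

```python
# The letter 'A' and its ASCII code in three notations.
code = ord("A")
print(code)                      # 65        (decimal)
print(hex(code))                 # 0x41      (hexadecimal)
print(format(code, "08b"))       # 01000001  (8 bits, leading bit is 0)
print(len("A".encode("ascii")))  # 1         -> one English letter = 1 byte
```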

Is this enough? Far from it!

GB2312
  Afterwards, China also started using computers, and so did Japan, South Korea, and many other countries. That raised a big question: how do you represent all these scripts?

The character encoding scheme GB2312 was born this way, as an extension of ASCII. In this scheme, codes less than or equal to 127 keep their original ASCII meaning, while two bytes greater than 127 together represent one Chinese character: the high byte runs from 0xA1 to 0xF7 and the low byte from 0xA1 to 0xFE, which is enough to encode about 7,000 simplified Chinese characters. Mathematical symbols, Greek letters, and Japanese kana were coded in as well. Even the digits, punctuation, and letters already present in ASCII were re-encoded as two-byte codes; these are the "full-width" symbols we often hear about, while the original characters below 128 are called "half-width" symbols.
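
A quick sketch of these rules, using Python's built-in gb2312 codec: a Chinese character encodes to two bytes, both above 0x7F, and a full-width comma really does take two bytes while its half-width ASCII twin takes one.

```python
# GB2312: a Chinese character is 2 bytes, both in the >127 range.
hanzi = "中".encode("gb2312")
print(hanzi.hex(), len(hanzi))       # d6d0 2

# Half-width vs full-width comma.
print(len(",".encode("gb2312")))     # 1 -> half-width (original ASCII code)
print(len("，".encode("gb2312")))    # 2 -> full-width (re-encoded, 2 bytes)
```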

Of course, many countries such as Japan and South Korea have also developed their own double-byte character encoding schemes.

GBK
  Later, even this set of Chinese characters proved insufficient! What about the rare and traditional characters that still could not be represented? So the encoding rule was relaxed: whenever the high byte (the first byte) is greater than 127, the pair is treated as a 2-byte Chinese character, and the low byte (the second byte) may now also fall in the 0-127 range. After this expansion the standard became GBK. GBK contains more than 20,000 Chinese characters and symbols, and because Windows adopted it early, it is very widely used. Later still, as ethnic minorities in China also began using computers, GBK was extended to GB18030 to cover minority scripts.
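
The "high byte greater than 127" rule is easy to sketch: walk the byte stream, and whenever a byte exceeds 127, consume two bytes as one Chinese character. A minimal illustration:

```python
# Minimal sketch of the GBK rule described above: a byte > 127 starts
# a 2-byte Chinese character; anything else is a 1-byte ASCII character.
def split_gbk(data: bytes):
    chars, i = [], 0
    while i < len(data):
        if data[i] > 127:                      # high byte -> 2-byte character
            chars.append(data[i:i + 2].decode("gbk"))
            i += 2
        else:                                  # ASCII range -> 1 byte
            chars.append(chr(data[i]))
            i += 1
    return chars

print(split_gbk("GIS数据".encode("gbk")))      # ['G', 'I', 'S', '数', '据']
```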

Chinese GBK, Japanese Shift_JIS, Korean EUC-KR... these encodings all follow the so-called ANSI convention, and there is a flaw: each uses double bytes to represent its own characters, but the mappings are mutually incompatible. What happens when Chinese text is opened on a Korean system? Garbled characters, you guessed it.
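
You can reproduce this mismatch directly: take bytes written under GBK and decode them with the Korean rule.

```python
# Mojibake demo: GBK bytes read under a Korean code page.
raw = "属性表".encode("gbk")                   # written on a Chinese system
print(raw.decode("euc-kr", errors="replace"))  # wrong rule -> garbage
print(raw.decode("gbk"))                       # right rule -> 属性表
```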

UTF-8

The situation above was too chaotic, so ISO (the International Organization for Standardization) decided to formulate a unified standard covering all characters in the world, spanning the character set, encoding schemes, and so on, called the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as the Unicode standard. (Note: like the ANSI convention mentioned above, neither name refers to one specific character encoding rule.) The "universal code" had arrived to clean up the mess.

As an encoding form of the Unicode character set, UTF-8 uses variable-length encoding, representing a character with 1-4 bytes; characters in different ranges get encodings of different lengths. Thus the half-width characters are represented in UTF-8 by 1 byte (8 binary bits), while a Chinese character takes 3 bytes.
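
The variable length is easy to verify:

```python
# UTF-8 is variable-length: ASCII stays at 1 byte, Chinese takes 3.
for ch in ("A", "中"):
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(), len(encoded), "byte(s)")
# A 41 1 byte(s)
# 中 e4b8ad 3 byte(s)
```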

Second: The Trouble Caused by Encoding
 
As mentioned earlier, Windows uses the ANSI convention by default, which on a Chinese-language system means GBK. ArcGIS before 10.2 also used this "default" encoding for dbf files, so a Chinese character takes 2 bytes and a field name (at most 11 bytes) can hold up to 5 Chinese characters.

Since 10.2.1, Esri has followed the trend and switched the dbf encoding to UTF-8. Now a field name can hold only 3 Chinese characters, because each one takes 3 bytes and 11 mod 3 = 2.
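
The arithmetic can be checked against the dbf's 11-byte field-name limit; "属性字段名称" below is just a made-up 6-character field name.

```python
# How many Chinese characters fit in an 11-byte dbf field name?
LIMIT = 11                        # dBase field-name limit, in bytes
name = "属性字段名称"             # hypothetical 6-character field name

for codec in ("gbk", "utf-8"):
    fit = 0
    while fit < len(name) and len(name[:fit + 1].encode(codec)) <= LIMIT:
        fit += 1
    print(codec, "->", fit)       # gbk -> 5, utf-8 -> 3
```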

As a result, exporting a GBK shapefile to a new UTF-8 one can truncate its field names.

Likewise, if you copy a shapefile from elsewhere or download it from the network and the system environment or encoding rule is not chosen correctly, the text will be garbled.

Third: dbf's Fault
  When ArcGIS Desktop creates a shapefile, the header of its dbf file (the dBase header) generally records the encoding used, via the LDID (Language Driver ID), which tells applications which encoding to read it with. Among a shapefile's sidecar files there is usually also a *.cpg file of the same name that stores encoding information; open it in Notepad and you will see something like UTF-8 or GBK. Both identify the encoding of the dbf, and ArcGIS gives the LDID priority over the .cpg file.
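
If you want to see both declarations for yourself, here is a sketch; "data.dbf" and "data.cpg" are placeholder names, and it relies on the LDID sitting at byte offset 29 of the dBase header.

```python
# Peek at the two encoding declarations of a shapefile (placeholder names).
with open("data.dbf", "rb") as f:
    header = f.read(32)                  # the fixed dBase header
print("LDID:", hex(header[29]))          # e.g. 0x4d is commonly used for GBK/CP936

try:
    with open("data.cpg", encoding="ascii") as f:
        print("CPG :", f.read().strip())  # e.g. 'UTF-8' or 'GBK'
except FileNotFoundError:
    print("no .cpg sidecar file")
```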

Modifying the default code page changes the encoding ArcGIS uses for the dbf when it creates a new shapefile. Note the key point: this only affects shapefiles created afterwards. It does not change the encoding of an existing dbf file and will not fix its garbled characters!
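
The commonly cited way to change this default is a registry value. The sketch below assumes an ArcGIS Desktop 10.2 installation and the dbfDefault value name; verify the exact key path for your version before using it.

```python
# SKETCH ONLY: set ArcGIS Desktop's default dbf code page via the registry.
# ASSUMPTION: the key path (version-specific) and the 'dbfDefault' value
# name follow the commonly documented workaround -- check them for your install.
import winreg

KEY = r"Software\ESRI\Desktop10.2\Common\CodePage"   # adjust to your version
with winreg.CreateKey(winreg.HKEY_CURRENT_USER, KEY) as key:
    winreg.SetValueEx(key, "dbfDefault", 0, winreg.REG_SZ, "OEM")  # or "UTF-8"
```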

ArcMap reads the dbf attribute table as garbled text

Because shapefile is an open data format, third-party tools (or other scenarios) may ignore the Language Driver ID declaration, which results in garbled characters. In that case, try adding a *.cpg file with the same name.
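
Writing the sidecar is a one-liner; "roads" is a placeholder name, and the declared encoding must match what the dbf actually uses.

```python
# Declare the dbf's real encoding by adding a same-named .cpg file.
with open("roads.cpg", "w", encoding="ascii") as f:
    f.write("GBK")    # or "UTF-8" -- whichever the dbf was actually written in
```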
  

Origin: blog.csdn.net/weixin_42153420/article/details/123824880