Understanding encodings (1)

Background

For reasons beyond our control, we had to integrate with a license plate recognition vendor.

The vendor has no test program, so the only way to test is to capture and send requests from a real license plate recognition machine.

One particularly painful scenario: the other party does not return data in the agreed encoding.

This is a typical garbled-text problem.
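To make the failure mode concrete, here is a minimal Python sketch. The plate string and the GBK/UTF-8 pairing are assumptions chosen for illustration; the vendor's actual encoding is not stated above.

```python
# Hypothetical payload: a plate number that contains a Chinese character.
plate = "京A12345"

# Assumption: the sender encodes as GBK while the receiver expects UTF-8.
raw = plate.encode("gbk")

# Decoding with the wrong codec produces garbled text. errors="replace"
# substitutes U+FFFD for bytes that are not valid UTF-8.
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the codec that was actually used recovers the text.
recovered = raw.decode("gbk")

print(repr(garbled))
print(recovered)
```

The bytes on the wire are the same in both cases; only the codec chosen by the receiver differs.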

Introduction to encodings

ASCII

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet, mainly used to display modern English and other Western European languages. It is the most common single-byte encoding system today and is equivalent to the international standard ISO/IEC 646. [1]

Note that ASCII is the abbreviation of American Standard Code for Information Interchange; it is not "ASC II" (ASC followed by the Roman numeral 2). This is a common misunderstanding.
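A small Python sketch of what "single-byte" means in practice. The sample characters are arbitrary choices for illustration.

```python
# ASCII assigns each character a 7-bit value, stored in one byte.
encoded = "A".encode("ascii")
print(len(encoded), hex(encoded[0]))

# Characters outside the 7-bit range cannot be encoded as ASCII.
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```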

ISO8859-1

ISO-8859-1 is a single-byte encoding that is backward compatible with ASCII. Its encoding range is 0x00-0xFF: 0x00-0x7F is identical to ASCII, 0x80-0x9F holds control characters, and 0xA0-0xFF holds text symbols.
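The ranges above can be checked with a short Python sketch. It relies on Python's built-in "iso-8859-1" codec; the variable names are only for illustration.

```python
# Every one of the 256 byte values is valid in ISO-8859-1, and in Python's
# codec byte value N decodes to Unicode code point N.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")
one_to_one = [ord(c) for c in text] == list(range(256))

# The lower half (0x00-0x7F) decodes exactly as ASCII does.
lower = bytes(range(0x80))
same_as_ascii = lower.decode("iso-8859-1") == lower.decode("ascii")

# 0xE9 lies in the 0xA0-0xFF text range.
e_acute = bytes([0xE9]).decode("iso-8859-1")

print(one_to_one, same_as_ascii, e_acute)
```

Because every byte is valid, decoding arbitrary data as ISO-8859-1 never raises an error, which is one reason mis-decoded text can go unnoticed.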

This character set supports a number of European languages, including Albanian, Basque, Breton, Catalan, Danish, Dutch, Faroese, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxembourgish, Norwegian, Portuguese, Rhaeto-Romance, Scottish Gaelic, Spanish and Swedish.

English has no accented letters, but English text is still commonly labeled as ISO/IEC 8859-1. In addition, some languages outside Europe, such as Afrikaans, Swahili, Indonesian, Malay and Tagalog (Filipino), can also be written in ISO/IEC 8859-1.

French and Finnish were also originally written using ISO/IEC 8859-1. However, it was replaced by ISO/IEC 8859-15 in 1998 because it lacks the three letters œ, Œ, and Ÿ used in French and the letters Š, š, Ž, and ž used in Finnish. (ISO 8859-15 also added the euro sign.)
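The difference between the two standards can be seen with Python's built-in codecs. This is a minimal sketch; the byte 0xA4 and the letter œ are picked as examples.

```python
# Byte 0xA4 is the generic currency sign in ISO-8859-1
# but the euro sign in ISO-8859-15.
in_latin1 = b"\xa4".decode("iso-8859-1")
in_latin9 = b"\xa4".decode("iso-8859-15")

# "œ" has no ISO-8859-1 encoding, but ISO-8859-15 gives it a single byte.
try:
    "œ".encode("iso-8859-1")
    oe_in_latin1 = True
except UnicodeEncodeError:
    oe_in_latin1 = False
oe_latin9 = "œ".encode("iso-8859-15")

print(in_latin1, in_latin9, oe_in_latin1, len(oe_latin9))
```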

ANSI

As the Internet spread to East Asia, where CJK and similar scripts dominate, single-byte encodings could no longer meet the need.

To let computers support more languages, 2 bytes in the range 0x80~0xFFFF are usually used to represent 1 character. For example, on a Chinese operating system the character '中' is stored as the two bytes [0xD6, 0xD0].
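The two bytes quoted above can be reproduced with Python's "gbk" codec (GBK being the ANSI code page on Simplified Chinese Windows). A minimal sketch:

```python
# Encode the character and inspect the resulting bytes.
encoded = "中".encode("gbk")
print([hex(b) for b in encoded])

# Going the other way: the two bytes decode back to the same character.
decoded = bytes([0xD6, 0xD0]).decode("gbk")
print(decoded)
```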

Different countries and regions formulated their own standards, which produced encodings such as GB2312, GBK, GB18030, Big5, and Shift_JIS. These extended encodings, which use multiple bytes to represent one character, are collectively called ANSI encodings. On Simplified Chinese Windows, the ANSI encoding is GBK; on Traditional Chinese Windows, it is Big5; on Japanese Windows, it is Shift_JIS.

Different ANSI encodings are incompatible with each other. When information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded text.
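The incompatibility is easy to observe: the same two bytes mean different things under each ANSI encoding. A minimal Python sketch, using the bytes for '中' from earlier and errors="replace" so that an unassigned byte sequence does not raise:

```python
raw = bytes([0xD6, 0xD0])  # "中" when interpreted as GBK

# Decode the same bytes under three different ANSI encodings.
results = {
    codec: raw.decode(codec, errors="replace")
    for codec in ("gbk", "big5", "shift_jis")
}

for codec, text in results.items():
    print(codec, repr(text))
```

Only the GBK interpretation yields the intended character; the other two produce unrelated text.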

ANSI encoding uses one byte to represent English characters, and two or four bytes to represent Chinese.
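The byte counts can be checked with Python's "gb18030" codec. Note that the four-byte form belongs to GB18030; GBK itself uses at most two bytes. The sample characters are chosen for illustration.

```python
samples = {
    "A": "A",                 # ASCII letter
    "中": "中",               # common Chinese character
    "U+20000": "\U00020000",  # rare character from CJK Extension B
}

# Number of bytes each sample occupies when encoded as GB18030.
lengths = {name: len(ch.encode("gb18030")) for name, ch in samples.items()}
print(lengths)
```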

GB2312

The Coded Character Set of Chinese Characters for Information Interchange (Basic Set) is a national standard issued by China's State Bureau of Standards in 1980 and implemented on May 1, 1981. The standard number is GB 2312-1980.

The GB2312 code is suitable for information exchange between Chinese character processing, Chinese character communication and other systems, and is commonly used in mainland China; Singapore and other places also use this code. Almost all Chinese systems and internationalized software in mainland China support GB 2312.

The basic set includes 6763 Chinese characters and 682 non-Chinese graphic characters. The entire character set is divided into 94 areas, and each area has 94 positions. Each position holds exactly one character, so the area number and the position number together can be used to encode a character; this is called the location code.

Convert the area number and position number to hexadecimal and add 2020H to get the national standard code. Add 8080H to the national standard code to get the commonly used computer internal code.

In 1995, the Chinese Internal Code Extension Specification (GBK) was promulgated. GBK is compatible with the internal code corresponding to the national standard GB 2312-1980, and at the repertoire level it supports all Chinese, Japanese, and Korean (CJK) ideographs of ISO/IEC 10646-1 and GB 13000-1, a total of 20902 characters.
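The location code arithmetic (add 2020H, then 8080H) can be sketched in Python. The function name is hypothetical, and the example assumes '中' sits at area 54, position 48.

```python
def location_to_internal(area: int, position: int) -> bytes:
    """Convert a GB2312 location code (area, position) to internal-code bytes."""
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError("area and position must each be in 1..94")
    # Location code -> national standard code: add 0x20 to each byte (2020H).
    national = (area + 0x20, position + 0x20)
    # National standard code -> internal code: add 0x80 to each byte (8080H).
    return bytes(b + 0x80 for b in national)


# Assumed example: "中" at area 54, position 48.
internal = location_to_internal(54, 48)
print(internal.hex(), internal.decode("gb2312"))
```

Worked by hand: 54 and 48 are 36H and 30H; adding 2020H gives 5650H, and adding 8080H gives D6D0H, matching the bytes shown earlier for '中'.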

GBK

The full name of GBK is "Chinese Internal Code Extension Specification" (GBK comes from the first letters of the pinyin for "national standard" (Guobiao) and "extension" (Kuozhan); English name: Chinese Internal Code Specification). It was drawn up by the National Information Technology Standardization Technical Committee of the People's Republic of China on December 1, 1995. On December 15, 1995, the Standardization Department of the State Bureau of Technical Supervision and the Department of Science and Technology and Quality Supervision of the Ministry of Electronics Industry jointly issued it as Technical Supervision Standards Letter [1995] No. 229, designating it a technical specification guidance document. This version of the GBK specification is version 1.0.

Other knowledge

GB2312: national Simplified Chinese character set, compatible with ASCII.
Big5: unified encoding for Traditional Chinese characters.
GBK: an extension of GB2312 that supports both simplified and traditional characters; compatible with GB2312.
GB18030: extends GBK with rare characters and Japanese, Korean, and other characters; compatible with GBK.
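The compatibility chain in the list above can be checked with Python's built-in codecs. A minimal sketch; '中' and '張' are sample characters chosen for illustration.

```python
# A simplified character encodes to the same bytes in all three standards.
simplified = "中"
same_bytes = (
    simplified.encode("gb2312")
    == simplified.encode("gbk")
    == simplified.encode("gb18030")
)

# A traditional character is missing from GB2312 but present in GBK,
# and GB18030 keeps the GBK bytes unchanged.
traditional = "張"
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
gbk_matches_gb18030 = traditional.encode("gbk") == traditional.encode("gb18030")

# ASCII text passes through GB2312 unchanged.
ascii_compatible = "abc".encode("gb2312") == b"abc"

print(same_bytes, in_gb2312, gbk_matches_gb18030, ascii_compatible)
```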

Unicode

Unicode (also called Unified Code, Universal Code, or Single Code) is an industry standard in the field of computer science that covers character sets, encoding schemes, and more. Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a unified and unique binary code to every character in every language, to meet the requirements of cross-language and cross-platform text conversion and processing. R&D began in 1990 and it was officially announced in 1994.
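In Python, the unique code assigned to a character is available through ord and chr. The sketch below also shows that one code point can be stored as different bytes depending on the encoding scheme; the sample character is an arbitrary choice.

```python
# The Unicode code point of a character is independent of any byte encoding.
code_point = ord("中")
print(f"U+{code_point:04X}")

# chr() maps the code point back to the character.
same_char = chr(0x4E2D)

# The same code point is stored as different bytes under different encodings.
as_utf8 = "中".encode("utf-8")
as_gbk = "中".encode("gbk")
print(as_utf8.hex(), as_gbk.hex())
```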

UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It was created in 1992 by Ken Thompson and has since been standardized as RFC 3629. The original design encoded characters in 1 to 6 bytes; RFC 3629 restricts UTF-8 to 1 to 4 bytes per character. It can be used on web pages to display Simplified Chinese, Traditional Chinese and other languages (such as English, Japanese, Korean) on the same page.
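The variable length is easy to see in Python. A minimal sketch; the four sample characters are chosen so that each needs a different number of bytes.

```python
# One sample per UTF-8 length: ASCII, Latin accent, CJK, emoji.
samples = ["A", "é", "中", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)

# The three bytes used for "中".
zhong_utf8 = "中".encode("utf-8")
print(zhong_utf8.hex())
```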

UTF-16

UTF-16 is an implementation of the third layer of Unicode's five-layer character encoding model, the Character Encoding Form (also known as the "storage format"). It maps the abstract code points of the Unicode character set to sequences of 16-bit integers (i.e., code units) for data storage or transmission. A Unicode code point requires one or two 16-bit code units, so this is a variable-length representation.
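The one-or-two code unit rule can be observed with Python's "utf-16-be" codec (big-endian, no byte order mark). The helper name code_units is hypothetical and exists only for this sketch.

```python
def code_units(data: bytes) -> list[int]:
    """Split big-endian UTF-16 bytes into 16-bit code units."""
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A character in the Basic Multilingual Plane needs one code unit.
bmp_units = code_units("中".encode("utf-16-be"))

# A character beyond U+FFFF needs two code units (a surrogate pair).
astral_units = code_units("\U0001F600".encode("utf-16-be"))

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
```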

References

Baidu Encyclopedia: ASCII: http://baike.baidu.com/view/15482.htm

Baidu Encyclopedia: GB2312: http://baike.baidu.com/view/443268.htm?fromtitle=GB2312&fromid=483170&type=syn

Baidu Encyclopedia: GB18030: http://baike.baidu.com/view/889058.htm

Baidu Encyclopedia: GBK: http://baike.baidu.com/view/931619.htm?fromtitle=GBK&fromid=481954&type=search

Baidu Encyclopedia: Unicode: http://baike.baidu.com/view/40801.htm

Baidu Encyclopedia: UTF-8: http://baike.baidu.com/view/25412.htm
