About Python: encoding = 'ISO-8859-1' and 'utf-8'

Unicode, UTF-8, ISO-8859-1, and the garbled-text problem

The following description uses the two-character word "中文" ("Chinese") as its running example. From the code tables, its GB2312 encoding is "d6d0 cec4", its Unicode encoding is "4e2d 6587", and its UTF-8 encoding is "e4b8ad e69687". Note that these two characters have no ISO-8859-1 encoding, but ISO-8859-1 can still be used to "carry" them.

2. Encoding basics

The earliest of these encodings is ISO-8859-1, which is similar to ASCII. To make it possible to represent a wide variety of languages, a number of standard encodings gradually emerged. The important ones are described below.

2.1. iso8859-1

ISO-8859-1 is a single-byte encoding. It can represent at most the values 0-255 and is used for English and related Western European languages. For example, the letter "a" is encoded as 0x61 = 97.

Obviously, the range of characters ISO-8859-1 can represent is very narrow, and it cannot represent Chinese characters. However, because it is a single-byte encoding, which matches the computer's most basic unit of storage, ISO-8859-1 is still often used to carry data in other encodings. It is also the default encoding in many protocols. For example, the word "中文" ("Chinese") has no ISO-8859-1 encoding. Take its GB2312 encoding, which is the two characters "d6d0 cec4": when it is handled as ISO-8859-1, it is split into four bytes, "d6 d0 ce c4" (in fact, storage is also handled byte by byte). In UTF-8 it is six bytes, "e4 b8 ad e6 96 87". Obviously, bytes carried this way still have to be interpreted on the basis of another encoding.
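The "carrier" behaviour can be checked with a small Java sketch. It assumes only the standard java.nio.charset API; the class and method names (Latin1Carrier, carry, roundTripHex) are made up for illustration. The four GB2312 bytes from the text are decoded as ISO-8859-1, giving four meaningless characters, and then encoded back to the same four bytes.

```java
import java.nio.charset.StandardCharsets;

class Latin1Carrier {
    // Format bytes as space-separated lowercase hex, e.g. "d6 d0 ce c4".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode raw bytes as ISO-8859-1: each byte becomes exactly one char.
    static String carry(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Carry the bytes through an ISO-8859-1 string and back, then dump them.
    static String roundTripHex(byte[] raw) {
        return hex(carry(raw).getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // GB2312 bytes of the example word, taken from the text above.
        byte[] gb2312 = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        System.out.println(carry(gb2312).length()); // 4 chars, one per byte
        System.out.println(roundTripHex(gb2312));   // d6 d0 ce c4
    }
}
```

The carried string is not readable text. It only preserves the bytes until they are decoded again with the encoding they were really written in.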

2.2. GB2312/GBK

This is the Chinese national standard encoding, designed to represent Chinese characters. Chinese characters take two bytes, while English letters are encoded exactly as in ISO-8859-1 (that is, it is compatible with the ASCII range that ISO-8859-1 shares). GBK can represent both traditional and simplified characters, whereas GB2312 covers only simplified characters. GBK is backward compatible with GB2312.
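A short Java sketch of these properties follows. It assumes the runtime ships the "GBK" and "GB2312" charsets (they belong to the JDK's extended charset set, so a stripped-down runtime may lack them). The class and method names are illustrative, and U+9AD4 is simply one traditional character chosen as a sample.

```java
import java.nio.charset.Charset;

class GbkDemo {
    // Number of bytes the text occupies in the named charset.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    // Whether the named charset has an encoding for the given character.
    static boolean canEncode(char c, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(byteLength("a", "GBK"));         // 1: letters stay single-byte
        System.out.println(byteLength("\u4e2d", "GBK"));    // 2: U+4E2D takes two bytes
        // U+9AD4 is a traditional character (sample chosen for this sketch).
        System.out.println(canEncode('\u9ad4', "GBK"));     // true
        System.out.println(canEncode('\u9ad4', "GB2312"));  // false
    }
}
```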

2.3. unicode

This is the most unified encoding and can represent the characters of all languages. It is a fixed-length double-byte encoding (a four-byte form also exists), and this applies to English letters as well. So it is not compatible with ISO-8859-1, nor with any other encoding. However, compared with ISO-8859-1, the Unicode encoding of a Latin character simply adds a 0 byte in front. For example, the letter "a" is "00 61".

Incidentally, fixed-length encodings are convenient for computers to process (note that GB2312/GBK are not fixed-length), and Unicode can represent all characters, so much software, Java for example, uses Unicode for its internal processing.
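The "extra 0 byte" can be seen in Java, on the assumption that the two-byte "Unicode" form described here corresponds to Java's UTF-16BE charset (big-endian, no byte-order mark). The class and method names below are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class TwoByteUnicode {
    // Format bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Two-byte big-endian form of the text.
    static String utf16Hex(String text) {
        return hex(text.getBytes(StandardCharsets.UTF_16BE));
    }

    // Single-byte ISO-8859-1 form of the text.
    static String latin1Hex(String text) {
        return hex(text.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(latin1Hex("a"));             // 61
        System.out.println(utf16Hex("a"));              // 00 61
        System.out.println(utf16Hex("\u4e2d\u6587"));   // 4e 2d 65 87
    }
}
```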

2.4. UTF

Unicode is not compatible with ISO-8859-1 and tends to take more space, because even English letters need two bytes. This makes Unicode inconvenient for transmission and storage, and UTF encodings were created as a result. UTF-8 is compatible with ASCII (the 0x00-0x7F range of ISO-8859-1) and can also represent the characters of all languages. However, it is a variable-length encoding: each character takes from 1 to 6 bytes in the original design (1 to 4 under the current standard, RFC 3629). In addition, UTF-8 has a simple built-in validity check. In general, English letters are represented by one byte and Chinese characters by three bytes.

Note that although UTF-8 was designed to use less space, that is only relative to two-byte Unicode. If the text is known to be Chinese, GB2312/GBK is undoubtedly the most economical. On the other hand, even though UTF-8 uses 3 bytes per Chinese character, a Chinese web page still takes less space in UTF-8 than in two-byte Unicode, because a web page contains a lot of English characters.
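The variable length is easy to observe in Java. The sketch below uses only the standard UTF_8 and US_ASCII charsets; the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Length {
    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the UTF-8 bytes are identical to the US-ASCII bytes.
    static boolean sameAsAscii(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8),
                             text.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));             // 1
        System.out.println(utf8Length("\u4e2d"));        // 3
        System.out.println(utf8Length("\u4e2d\u6587"));  // 6
        System.out.println(sameAsAscii("abc"));          // true
    }
}
```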
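The space trade-off can be illustrated with a rough Java sketch. The one-line "page" is an invented sample, two-byte Unicode is assumed to mean Java's UTF-16BE charset, and the GBK charset is assumed to be available in the runtime.

```java
import java.nio.charset.Charset;

class EncodingSize {
    // Number of bytes the text occupies in the named charset.
    static int size(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587";        // pure Chinese text
        String page = "<p>" + word + "</p>"; // Chinese text wrapped in markup

        System.out.println(size(word, "UTF-8"));     // 6
        System.out.println(size(word, "UTF-16BE"));  // 4
        System.out.println(size(word, "GBK"));       // 4

        System.out.println(size(page, "UTF-8"));     // 13
        System.out.println(size(page, "UTF-16BE"));  // 18
        System.out.println(size(page, "GBK"));       // 11
    }
}
```

For the bare word, UTF-8 is the largest. Once the seven ASCII markup characters are added, UTF-8 becomes smaller than the two-byte form, and GBK remains the smallest.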

3. How Java handles characters

A Java application touches character set encodings in many places. Some of them need the correct settings, and others need some additional processing.

3.1. getBytes(charset)

This is a standard method of the Java String class. It takes the characters the string represents and returns them as bytes, encoded according to charset. Note that in memory a Java string is always stored as Unicode. For example, the word "中文" ("Chinese") is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "gbk", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "utf8", the result is "e4 b8 ad e6 96 87". If charset is "iso8859-1", the characters cannot be encoded and the result is "3f 3f" (two question marks).
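The three results can be reproduced with the following sketch. It assumes the GBK charset is available in the runtime; the class name and the hex helper are illustrative and not part of the String API.

```java
import java.nio.charset.Charset;

class GetBytesDemo {
    // Format bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode the text with the named charset and dump the bytes.
    static String encodeHex(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587"; // stored in memory as 4e2d 6587
        System.out.println(encodeHex(word, "GBK"));         // d6 d0 ce c4
        System.out.println(encodeHex(word, "UTF-8"));       // e4 b8 ad e6 96 87
        System.out.println(encodeHex(word, "ISO-8859-1"));  // 3f 3f
    }
}
```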

3.2. new String(charset)

This is another standard Java string operation, and it does the opposite of getBytes: it interprets a byte array according to the charset encoding and converts the result to Unicode for storage. Taking the getBytes example above, the "gbk" and "utf8" bytes both decode correctly to "4e2d 6587", but the iso8859-1 bytes end up as "003f 003f" (two question marks).

Because utf8 can represent (encode) every character, new String(str.getBytes("utf8"), "utf8") is equal to str. In other words, the conversion is fully reversible.
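A minimal sketch of the round trip, using an illustrative helper named roundTrip that is not part of any library:

```java
import java.nio.charset.Charset;

class RoundTrip {
    // Encode the text with the named charset, then decode the bytes again.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587";
        // UTF-8 can encode every character, so the round trip is lossless.
        System.out.println(roundTrip(word, "UTF-8").equals(word)); // true
        // ISO-8859-1 cannot encode these characters; they come back as "??".
        System.out.println(roundTrip(word, "ISO-8859-1"));         // ??
    }
}
```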

3.3. setCharacterEncoding()

This function sets the encoding of an HTTP request or response.

For a request, it specifies the encoding of the submitted content. Once it is set, getParameter() returns correctly decoded strings directly. If it is not set, ISO-8859-1 is used by default and further processing is required. See "form input" below.

Note that getParameter() must not be called before setCharacterEncoding(). The Java documentation says: "This method must be called prior to reading request parameters or reading input using getReader()." Moreover, the setting takes effect only for the POST method and has no effect for GET. The likely reason is that on the first getParameter() call, Java parses all of the submitted content using the encoding in effect at that moment, and later getParameter() calls do not parse it again, so a later setCharacterEncoding() has no effect. For a form submitted with GET, the content is part of the URL and has already been parsed with the default encoding from the start, so setCharacterEncoding() naturally has no effect.

For a response, it specifies the encoding of the output content. The setting is also passed to the browser, telling it which encoding the output uses.
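One common form of that "further processing" is sketched below. It does not use the servlet API: decodeAsLatin1 is a hypothetical stand-in for a container that decoded the request bytes as ISO-8859-1, and the sketch assumes the browser actually sent UTF-8. Because ISO-8859-1 preserves every byte, the garbled string can be turned back into bytes and decoded again with the right charset.

```java
import java.nio.charset.StandardCharsets;

class ParameterRepair {
    // Hypothetical stand-in for a container that decodes request bytes
    // as ISO-8859-1 because no request encoding was set.
    static String decodeAsLatin1(byte[] requestBytes) {
        return new String(requestBytes, StandardCharsets.ISO_8859_1);
    }

    // Recover the original bytes and decode them as UTF-8
    // (assumption: the client really sent UTF-8).
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587";
        byte[] sent = word.getBytes(StandardCharsets.UTF_8);

        String garbled = decodeAsLatin1(sent);
        System.out.println(garbled.length());              // 6: one char per byte
        System.out.println(repair(garbled).equals(word));  // true
    }
}
```

The repair only works if the wrong decoding step was ISO-8859-1. A step that had already replaced characters with "?" would have lost the original bytes.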

Origin www.cnblogs.com/huangchenggener/p/10983866.html