Summary of common character encodings

1. Differences between encodings:
ISO-8859-1: usually called Latin-1, a single-byte encoding. It can represent at most the character range 0-255 and is used for English and other Western European languages. For example, the code for the letter 'a' is 0x61 = 97.

UTF-8: UTF-8 is compatible with ASCII (and therefore with the single-byte English range of ISO-8859-1). It is not a fixed-length encoding but a variable-length one; a character takes from 1 to 6 bytes (in practice at most 4):
1 byte: digits and English letters;
2 bytes: characters in the range U+0080 to U+07FF (Greek, Cyrillic and so on; no Chinese characters fall in this range);
3 bytes: most Chinese characters, roughly the set also covered by GBK, more than 21,000 of them;
4 bytes: the characters of the CJK (Chinese-Japanese-Korean) extended character sets, more than 50,000 Unicode Chinese characters.

Unicode: this is the most unified encoding and can represent the characters of every language. It is a fixed-length two-byte (or four-byte) encoding, including for English letters, so it is not compatible with ISO-8859-1 or with any other encoding. Compared with ISO-8859-1, however, Unicode simply puts a 0 byte in front: the letter 'a', for example, becomes "00 61". Strictly speaking, Unicode is only an encoding specification; the encodings actually implemented are UTF-8, UCS-2 and UTF-16, and these can be converted into one another according to the specification. The original Unicode encoding was fixed at 16 bits, i.e. 2 bytes per character, which allows 65,536 characters in total. That is clearly not enough for all the characters of every language, so the Unicode 4.0 specification defines a set of supplementary characters, each represented by two 16-bit code units (a surrogate pair), which allows at most 1,048,576 supplementary characters. Unicode 4.0 defines only 45,960 of them.

It should be noted that fixed-length encodings are convenient for the computer to process (note that GB2312/GBK are not fixed-length encodings), and Unicode can represent all characters, so Unicode is used internally by much software, for example Java.
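As a minimal sketch of this internal Unicode (UTF-16) storage, assuming a standard JVM (the sample code points are chosen only for illustration): a BMP character fits in one Java char, while a supplementary character occupies a surrogate pair of two code units, as described above.

public class UnicodeDemo {
    public static void main(String[] args) {
        // "中" (U+4E2D) lies in the BMP and fits in a single UTF-16 code unit (one Java char).
        String bmp = "\u4e2d";
        System.out.println(bmp.length());                          // 1
        System.out.println(bmp.codePointCount(0, bmp.length()));   // 1

        // U+20000 is a supplementary character and needs a surrogate pair: two 16-bit code units.
        String supplementary = new String(Character.toChars(0x20000));
        System.out.println(supplementary.length());                                    // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length()));   // 1
    }
}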

GBK/GB2312: GBK is compatible with GB2312, and both are compatible with ASCII (the single-byte English range of ISO-8859-1). Chinese characters are encoded with two bytes, while English letters and digits occupy a single byte, just as in ISO-8859-1.
GBK can represent both simplified and traditional Chinese characters, while GB2312 can only represent simplified characters; GBK is a superset of GB2312.

2. Java's handling of characters
In Java applications, character set encoding comes into play in many places; some of them need to be configured correctly, and some require explicit processing in code.

2.1 getBytes(charset)
       This is a standard Java string method. It encodes the characters of the string according to charset and returns the result as bytes. Note that in memory Java always stores strings in Unicode (UTF-16). For example, "中文" ("Chinese") is normally (i.e. when nothing has gone wrong) stored as "4e2d 6587". If charset is "gbk", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned; if charset is "utf8", the result is "e4 b8 ad e6 96 87"; if it is "iso8859-1", the characters cannot be encoded and "3f 3f" (two question marks) is returned.
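A minimal sketch of the behaviour described above; the hex-printing helper is not part of the article and is added only to show the bytes.

public class GetBytesDemo {
    // Helper for illustration only: print a byte array as hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        String s = "中文"; // stored internally as UTF-16: 4e2d 6587
        System.out.println(toHex(s.getBytes("gbk")));        // d6 d0 ce c4
        System.out.println(toHex(s.getBytes("utf8")));       // e4 b8 ad e6 96 87
        System.out.println(toHex(s.getBytes("iso8859-1")));  // 3f 3f (cannot be encoded, so '?')
    }
}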

2.2 new String(charset)
        This is another standard Java string function. It does the opposite of the previous one: it interprets the byte array according to the charset encoding and converts the result to the internal Unicode representation. Referring to the getBytes example above, both "gbk" and "utf8" reproduce the correct result "4e2d 6587", but "iso8859-1" ends up as "003f 003f" (two question marks). Because UTF-8 can represent/encode every character, new String(str.getBytes("utf8"), "utf8") always equals str; the round trip is fully reversible.
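A small sketch of this round-trip property, under the same assumptions as the previous example:

public class NewStringDemo {
    public static void main(String[] args) throws Exception {
        String s = "中文";

        // utf8 can encode every character, so the round trip is lossless.
        System.out.println(new String(s.getBytes("utf8"), "utf8").equals(s));   // true

        // gbk can also encode these characters, so this round trip works as well.
        System.out.println(new String(s.getBytes("gbk"), "gbk").equals(s));     // true

        // iso8859-1 cannot encode Chinese characters: they become '?' and are lost.
        System.out.println(new String(s.getBytes("iso8859-1"), "iso8859-1"));   // ??
    }
}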

2.3 setCharacterEncoding()
This function sets the encoding of the HTTP request or response.

       For the request, it specifies the encoding of the submitted content. Once it is set, the correct string can be obtained directly from getParameter(). If it is not set, ISO-8859-1 is assumed by default and further processing is required.

       See "Form Input" below . It's worth noting that you cannot execute any getParameter() before executing setCharacterEncoding(). The java doc states: This method must be called prior to reading request parameters or reading input using getReader(). Moreover, this specification is only valid for the POST method, not for the GET method. To analyze the reason, when the POST method executes the first getParameter(), java will analyze all the submitted content according to the encoding, and the subsequent getParameter() will no longer analyze, so setCharacterEncoding() is invalid. For the GET method to submit the form, the submitted content is in the URL, and all the submitted content has been analyzed according to the encoding at the beginning, and setCharacterEncoding() is naturally invalid.

       Note: ISO-8859-1 is the character set Java assumes by default for network transmission, while GB2312 is the standard simplified-Chinese character set. For operations that involve network transmission, such as submitting a form, content received as ISO-8859-1 must be converted to GB2312 (or another Chinese-capable encoding) before it is displayed; otherwise the browser interprets the ISO-8859-1 bytes as GB2312, and because the two are incompatible the result is garbled text.
  
  
Rule:
Bytes produced by UTF-8 encoding can be decoded with GBK or ISO-8859-1 and then encoded back to recover the original bytes.

Bytes produced by GBK encoding can only be decoded with ISO-8859-1 and then encoded back; decoding them with UTF-8 loses information, as the example below shows.



import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        String code = "中国"; // "China"
        // encode
        byte[] utf = code.getBytes("utf-8");
        byte[] gbk = code.getBytes("gbk");
        System.out.println("utf-8 encoding: " + Arrays.toString(utf)); // [-28, -72, -83, -27, -101, -67], 6 bytes
        System.out.println("gbk encoding: " + Arrays.toString(gbk));   // [-42, -48, -71, -6], 4 bytes
        // decode
        String code1 = new String(utf, "utf-8"); // 中国
        String code2 = new String(utf, "gbk");   // garbled (涓?浗): gbk reads 2 bytes per character, so the 6 utf-8 bytes become 3 characters
        String code3 = new String(gbk, "utf-8"); // ?й? : the 4 gbk bytes are not valid utf-8, so some of them are replaced by '?'
        System.out.println("--------------------");
        System.out.println("utf-8 decoding: " + code1);
        System.out.println("gbk decoding: " + code2);
        System.out.println("gbk bytes decoded with utf-8: " + code3);
        System.out.println("---------------------");
        System.out.println("encode back with utf-8:");
        code3 = new String(code3.getBytes("utf-8"), "gbk"); // 锟叫癸拷 : after decoding gbk bytes with utf-8, the original can no longer be recovered
        System.out.println(code3);
    }
}




When reading form values in a JSP page, garbled text can appear. There are two solutions (see the sketch below):
1. Call request.setCharacterEncoding() to set the character encoding before calling getParameter().

2. Re-decode the value with new String(str.getBytes("iso8859-1"), "UTF-8") after it has been read.
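A minimal sketch of the two fixes as they might appear in a JSP scriptlet; "request" is the JSP implicit HttpServletRequest object and the parameter name "name" is illustrative. Use one fix or the other, not both.

// Fix 1: set the encoding before the first getParameter() call (effective for POST bodies).
request.setCharacterEncoding("UTF-8");
String name = request.getParameter("name");

// Fix 2 (alternative): re-decode a value that was already read with the default iso8859-1.
String raw = request.getParameter("name");
String fixed = new String(raw.getBytes("iso8859-1"), "UTF-8");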


Note:
Although UTF-8 is said to save space, that is only relative to the Unicode (UTF-16) encoding. If the text is known to be Chinese, GB2312/GBK is undoubtedly the most economical choice. On the other hand, it is worth pointing out that even though UTF-8 uses 3 bytes per Chinese character, for a Chinese web page UTF-8 is still more compact than Unicode (UTF-16), because the page also contains many English characters.
