Resolving Chinese Character Encoding Problems (Garbled Text) in JavaWeb

First, the causes of garbled text in JavaWeb programming
Computers only understand 0 and 1, so characters must be encoded before they can be transmitted over a network. Because several things can go wrong during encoding, transmission and decoding, garbled text appears frequently and is a major stumbling block for beginners. This article tries to explain the problem with the simplest possible examples.

1. Why does garbled text occur?
Think of sending a telegram: if the sender encodes the message with one codebook and the receiver decodes it with another, the message cannot be decoded correctly. Data sent over a computer network works the same way: if the sending end and the receiving end use different encodings, the text comes out garbled.
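The telegram analogy can be sketched in a few lines of Java. This is only an illustration: the class name MismatchDemo and the method transmit are hypothetical, and UTF-16BE and UTF-8 were picked arbitrarily as two encodings that disagree.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchDemo {
    // Encodes text with the sender's charset and decodes the bytes with the receiver's charset.
    static String transmit(String text, Charset senderCharset, Charset receiverCharset) {
        byte[] wire = text.getBytes(senderCharset);  // the bytes that travel over the network
        return new String(wire, receiverCharset);    // what the receiver reconstructs
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587"; // two Chinese characters, written as escapes
        // Same charset on both ends: the text survives.
        System.out.println(text.equals(
                transmit(text, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16BE))); // true
        // Different charsets: the receiver sees garbled text.
        System.out.println(text.equals(
                transmit(text, StandardCharsets.UTF_16BE, StandardCharsets.UTF_8)));    // false
    }
}
```

The bytes on the wire are identical in both calls; only the receiver's choice of charset differs.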

2. Getting to know the common encodings:
ASCII:

ASCII (American Standard Code for Information Interchange) is a character encoding system based on the Latin alphabet, used mainly to represent modern English and other Western European languages. It is the most widely used single-byte encoding system and is equivalent to the international standard ISO/IEC 646.

ISO-8859-1:

ASCII is an American standard: its space and 94 printable characters are enough for English. Other languages written in the Latin alphabet (mainly European languages) need additional accented letters and symbols, which have to be stored in the range outside ASCII and the control characters.

To solve this problem, the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) jointly developed a series of 8-bit character sets, ISO 8859 (in full, ISO/IEC 8859), which currently defines 15 character sets. ISO-8859-1 (Latin-1) is the part that covers Western European languages. Besides languages written in the Latin alphabet, the series also covers the Cyrillic alphabet used in Eastern Europe, Greek, Thai, modern Arabic, Hebrew and others.
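One property of ISO-8859-1 matters later in this article: in Java, every one of the 256 byte values maps to exactly one character and back, so decoding with ISO-8859-1 never loses bytes. A minimal sketch (the class name Latin1Demo and its method are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Demo {
    // Returns true when every possible byte value survives decode + re-encode.
    static boolean allBytesRoundTrip() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // An accented Latin letter fits in a single byte (0xE9 for e-acute).
        byte[] eAcute = "\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(eAcute.length);                          // 1
        System.out.println(Integer.toHexString(eAcute[0] & 0xFF));  // e9
        System.out.println(allBytesRoundTrip());                    // true
    }
}
```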

GBK/GB2312:

To support displaying and editing Chinese, in 1980 China's national standards administration issued the GB 2312 standard, whose full title is "Chinese Coded Character Set for Information Interchange" (standard number GB 2312-1980). GB 2312 is intended for information exchange between Chinese character processing and Chinese character communication systems. It is used throughout mainland China, and Singapore uses it as well. Almost all Chinese-language systems and internationalized software in mainland China support GB 2312. The basic set contains 6,763 Chinese characters and 682 non-Chinese graphic characters. GB 2312 basically met the needs of processing Chinese on computers: the characters it contains cover 99.75% of usage frequency in mainland China. However, it cannot handle the rare characters that appear in personal names, classical Chinese and similar areas, which led to the later GBK and GB 18030 character sets.

In 1995 China's national standards administration published the "Chinese Internal Code Extension Specification" (GBK). GBK is fully compatible with the internal code of GB 2312-1980 and also supports all the Chinese, Japanese and Korean (CJK) ideographs in ISO/IEC 10646-1 and GB 13000-1, 20,902 characters in total.

The national standard GB18030-2005, "Information technology - Chinese coded character set", is China's most important Chinese character encoding standard after GB2312-1980 and GB13000.1-1993, and one of the basic standards that computer systems in China must follow. GB18030 has two versions: GB18030-2000 and GB18030-2005. GB18030-2000 replaces GBK; its main feature is that it adds the CJK Unified Ideographs Extension A characters on top of GBK. GB18030-2005 in turn adds the CJK Unified Ideographs Extension B characters on top of GB18030-2000. GB18030-2005 is a mandatory standard developed independently in China: a large coded character set centered on Chinese characters that also covers the scripts of several ethnic minorities (such as Tibetan, Mongolian, Dai, Yi, Korean and Uighur) and contains more than 70,000 Chinese characters. Chinese versions of Windows use GBK/GB2312 by default.

Unicode:

Many traditional encodings share a common problem: they let a computer handle a bilingual environment (usually Latin letters plus the local language), but not a multilingual one (several languages mixed in the same text). The character sets defined by ISO 8859, for example, are widely used in different countries, but the sets used in different countries are often incompatible with each other. Unicode is a character encoding scheme created to overcome these limitations of traditional encodings. With Unicode, computers can convert and process text across languages and platforms.

UTF-8:

Early Unicode implementations stored every character in 2 bytes. This proved inefficient for characters that ASCII can already represent: the text takes twice the space of ASCII, and the high byte is always 0 and carries no information. To solve this problem, intermediate formats for the character set appeared, known as UTF (Unicode Transformation Format).

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, created by Ken Thompson in 1992 and now standardized as RFC 3629. UTF-8 encodes each Unicode character in 1 to 4 bytes. A single UTF-8 web page can display Simplified Chinese, Traditional Chinese and other languages (such as English, Japanese and Korean) together.
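The variable length of UTF-8 is easy to observe from Java. The sketch below counts the bytes UTF-8 needs for four sample characters; the class name Utf8LengthDemo and the choice of samples are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1: ASCII letter
        System.out.println(utf8Length("\u00E9"));       // 2: accented Latin letter (e-acute)
        System.out.println(utf8Length("\u4E2D"));       // 3: a common Chinese character
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: U+1F600, outside the 16-bit range
    }
}
```

ASCII text therefore costs no more in UTF-8 than in ASCII itself, which is the inefficiency the paragraph above describes being removed.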

UTF-16:

UTF-16 is one implementation of the third layer of Unicode's five-layer character encoding model, the character encoding form (also called the "storage format"). It maps the abstract code points of the Unicode character set to sequences of 16-bit integers (code units) for storage or transmission. A Unicode code point needs one or two 16-bit code units, so UTF-16 is a variable-length representation. The Java language uses UTF-16 as its default in-memory storage format for characters.
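Because Java strings are sequences of UTF-16 code units, the "one or two code units per code point" rule shows up directly in the String API. A small sketch (the class name Utf16Demo and its helper names are hypothetical):

```java
class Utf16Demo {
    // Number of 16-bit code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters as a reader would count them).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String han = "\u4E2D";         // fits in one code unit
        String face = "\uD83D\uDE00";  // U+1F600, stored as a surrogate pair
        System.out.println(codeUnits(han) + " " + codePoints(han));   // 1 1
        System.out.println(codeUnits(face) + " " + codePoints(face)); // 2 1
    }
}
```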

Second, handling the various kinds of garbled text
Garbled text with different causes needs different handling. The common cases are discussed below:

1. Handling garbled text in the response
A. Writing the response with the byte output stream:
String data = "传智播客"; // any Chinese text

response.getOutputStream().write(data.getBytes());

In testing, this does not produce garbled text. The no-argument getBytes() method of String converts the string to a byte array using the platform default encoding, and Chinese Windows uses GBK by default. On the client, the browser also parses the page with the operating system's default encoding unless told otherwise, which is again GBK on Chinese Windows. The sending end and the receiving end use the same encoding, so nothing is garbled. (Since JDK 18 the JVM's default charset is UTF-8 by default on every platform, so this result depends on the Java version.)

UTF-8 is the encoding in common use internationally, so Web development normally uses UTF-8. But if the byte array obtained from getBytes("UTF-8") is sent to the client, the text is garbled. getBytes("UTF-8") converts the string to bytes using UTF-8, while the browser still decodes with the platform default encoding, so the two ends no longer agree.
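The behaviour described above can be reproduced without a servlet container. In this sketch the method browserDecode is a hypothetical stand-in for the browser, and the code assumes the JDK in use provides the GBK charset (standard full JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ResponseBytesDemo {
    // Stand-in for a browser that decodes the response body with the given charset.
    static String browserDecode(byte[] body, Charset browserCharset) {
        return new String(body, browserCharset);
    }

    public static void main(String[] args) {
        String data = "\u4E2D\u6587"; // Chinese text, written as escapes
        // getBytes() with no argument uses the platform default charset.
        System.out.println(Arrays.equals(
                data.getBytes(), data.getBytes(Charset.defaultCharset()))); // true

        byte[] utf8 = data.getBytes(StandardCharsets.UTF_8);
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        System.out.println(data.equals(browserDecode(utf8, gbk)));                    // false
        System.out.println(data.equals(browserDecode(utf8, StandardCharsets.UTF_8))); // true
    }
}
```

The server-side bytes are correct UTF-8 in both cases; the text only becomes readable when the decoder is told to use UTF-8 as well.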

At this point the browser displays garbled text. The problem can be solved in any of the following ways:

Method one: manually change the browser's encoding

Right-click the page, choose Encoding from the pop-up menu, then select UTF-8.
Method two: set the encoding with an HTML meta tag

Write a meta tag such as <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> to the page ahead of the content. The tag tells the browser which encoding to use when it parses the page. After adding the tag, the text is displayed normally.
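A minimal sketch of method two, assuming the page is assembled as one string before it is written. The class name MetaTagDemo and the method buildPage are hypothetical; in a servlet, the returned bytes would be passed to response.getOutputStream().write(...).

```java
import java.nio.charset.StandardCharsets;

class MetaTagDemo {
    // Meta tag that declares the page encoding to the browser.
    static final String META =
            "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\">";

    // Builds the response body: the meta tag first, then the text, all encoded as UTF-8.
    static byte[] buildPage(String text) {
        return (META + text).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] body = buildPage("\u4E2D\u6587");
        // Decoding with the declared charset gives back the tag and the original text.
        System.out.println(new String(body, StandardCharsets.UTF_8));
    }
}
```

The declared charset and the charset passed to getBytes must be the same, otherwise the tag itself misleads the browser.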
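Method three: use the setHeader() method of the Response object to set the Content-Type header

In addition, the HTTP protocol specifies that once the MIME type of the message body is given, the browser must parse the body according to that type. So you can set the Content-Type response header to text/html with the parameter charset=UTF-8, for example response.setHeader("Content-Type", "text/html;charset=UTF-8"), to make the browser parse the page as UTF-8.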
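Method four: use the setContentType() method of the Response object to set the MIME type and encoding

Because method three is used so often, the JCP provided a simpler method when drafting the Servlet specification. It works on the same principle as method three but is easier to call, for example response.setContentType("text/html;charset=UTF-8"). In practice this is the method we normally use to specify the MIME type.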
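Methods three and four both work by putting a charset parameter into the Content-Type header. To show what the receiving side does with that value, here is a simplified sketch of how a client might pick the charset out of the header. It is not how any particular browser is implemented; the class name ContentTypeDemo and the method charsetOf are hypothetical, and quoted parameter values are not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeDemo {
    // Extracts the charset parameter from a Content-Type value, or returns the fallback.
    // Charset.forName throws an unchecked exception for names the JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        String key = "charset=";
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                return Charset.forName(p.substring(key.length()).trim());
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] body = "\u4E2D\u6587".getBytes(StandardCharsets.UTF_8);
        Charset declared = charsetOf("text/html;charset=UTF-8", StandardCharsets.ISO_8859_1);
        System.out.println(declared);                                         // UTF-8
        System.out.println("\u4E2D\u6587".equals(new String(body, declared))); // true
    }
}
```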
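B. Writing the response with the character output stream:
The getWriter() method of the Response object returns a PrintWriter, and what you write is held in a buffer. When the response ends, Tomcat converts the response body to binary data using its default encoding, ISO-8859-1, and sends it to the client. ISO-8859-1 cannot represent Chinese characters, and the browser decodes with the local platform default encoding, so the result is garbled.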
Solution: call response.setCharacterEncoding("UTF-8") to tell Tomcat to encode the response body with UTF-8 instead of ISO-8859-1, and call response.setContentType("text/html;charset=UTF-8") to tell the browser to decode the data as UTF-8. Both calls must be made before getWriter() is called. After this change the page is displayed correctly.
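The effect of the response character encoding on a PrintWriter can be imitated with plain JDK classes. In this sketch the method render is a hypothetical stand-in for the container: it pushes text through a PrintWriter that encodes with the given charset and returns the bytes that would go to the client.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    // Writes text through a PrintWriter that encodes with the given charset.
    static byte[] render(String text, Charset responseCharset) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(new OutputStreamWriter(body, responseCharset));
        out.print(text);
        out.flush();
        return body.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587";
        // ISO-8859-1 has no Chinese characters, so each one is replaced by '?'.
        byte[] latin1 = render(text, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // ??
        // With UTF-8 the bytes can be decoded back to the original text.
        byte[] utf8 = render(text, StandardCharsets.UTF_8);
        System.out.println(text.equals(new String(utf8, StandardCharsets.UTF_8))); // true
    }
}
```

Note that with ISO-8859-1 the information is lost on the server side already, so no browser setting can recover it.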
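2. Handling garbled text in the request
When the user submits a form, the browser encodes the Chinese characters according to the encoding of the current page and sends the content to the server as a request message. After Tomcat receives the request message, it handles the form data differently depending on where in the message the data is located.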

When the form's submission method is POST, the form data is placed in the body of the request message, and Tomcat decodes this part with ISO-8859-1 by default. In this case call request.setCharacterEncoding("UTF-8") before reading any parameter to tell the server to decode the request body as UTF-8.
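What the request character encoding changes can be sketched with URLDecoder, which performs the same two steps (percent-decoding, then charset decoding) on a form value. The class name PostBodyDemo and the method decodeValue are hypothetical, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PostBodyDemo {
    // Decodes one URL-encoded form value with the charset the server has been told to use.
    static String decodeValue(String encoded, Charset requestCharset) {
        return URLDecoder.decode(encoded, requestCharset);
    }

    public static void main(String[] args) {
        // Two Chinese characters as a UTF-8 page would submit them in a form body.
        String encoded = "%E4%B8%AD%E6%96%87";
        // Default behaviour described above: six meaningless ISO-8859-1 characters.
        System.out.println(decodeValue(encoded, StandardCharsets.ISO_8859_1).length()); // 6
        // After telling the server to use UTF-8: the original two characters.
        System.out.println("\u4E2D\u6587".equals(
                decodeValue(encoded, StandardCharsets.UTF_8)));                         // true
    }
}
```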
When the submission method is GET, the form data is placed in the request line and is additionally URL-encoded. When Tomcat receives this data, it first URL-decodes it and then decodes the resulting bytes with ISO-8859-1 by default (this is the default in Tomcat 7 and earlier; Tomcat 8 and later use UTF-8 for the URI by default). The String returned by request.getParameter() has therefore been decoded as ISO-8859-1 and is garbled. To fix it, convert the string back to a byte array using ISO-8859-1, then decode that byte array as UTF-8.
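The repair step for GET parameters can be sketched as follows. The method name repair and the class name GetParameterFixDemo are hypothetical; the garbled input is produced here by decoding UTF-8 bytes as ISO-8859-1, which imitates what getParameter() returns in the scenario described above.

```java
import java.nio.charset.StandardCharsets;

class GetParameterFixDemo {
    // Turns the wrongly decoded string back into its original bytes, then decodes them as UTF-8.
    static String repair(String wronglyDecoded) {
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587";
        // Imitates the server decoding UTF-8 bytes with ISO-8859-1.
        String garbled = new String(text.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(text.equals(garbled));         // false
        System.out.println(text.equals(repair(garbled))); // true
    }
}
```

This works because ISO-8859-1 maps every byte to exactly one character, so the wrong decoding loses no information. Applying the repair to a string that was already decoded correctly would damage it, so it should only be used when the server really decoded with ISO-8859-1.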
3. Handling garbled text in cookies
Cookies are carried in the Cookie request header and the Set-Cookie response header. Under the HTTP protocol, request and response headers may only contain ASCII characters; Chinese characters are unsafe and cannot be used directly. So a Chinese value must be encoded before it is stored in a cookie. We can encode it with the encode() method of the URLEncoder utility class when writing the cookie and decode it with the decode() method of the URLDecoder utility class when reading it.

When creating the cookie, encode the username string with URLEncoder.encode() and pass the encoded value to the Cookie constructor.
When reading the cookie, decode the value with URLDecoder.decode() to recover the username string.
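The encode and decode steps for a cookie value can be sketched like this. The helper names toCookieValue and fromCookieValue are hypothetical, UTF-8 is an assumed choice of charset, and the Charset overloads of URLEncoder.encode and URLDecoder.decode require Java 10 or later. In a servlet, the encoded string is what would be passed to new Cookie("username", ...).

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CookieValueDemo {
    // Converts arbitrary text to an ASCII-only string that is safe to put in a header.
    static String toCookieValue(String raw) {
        return URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }

    // Restores the original text from a value produced by toCookieValue.
    static String fromCookieValue(String cookieValue) {
        return URLDecoder.decode(cookieValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String username = "\u4E2D\u6587";
        String encoded = toCookieValue(username);
        System.out.println(encoded);                                    // %E4%B8%AD%E6%96%87
        System.out.println(username.equals(fromCookieValue(encoded)));  // true
    }
}
```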
Third, the summary
The examples above are only a starting point. As long as we grasp the key point, that both ends must use the same encoding, and understand the nature of garbled text, the problem is no longer anything to worry about.

Origin blog.51cto.com/14473726/2437704