In-depth analysis of Chinese garbled characters in web development

Overview
Some time ago several colleagues asked me about Chinese garbled characters, and their questions were all similar. I had long wanted to write an article like this but never made the time for it (in truth, I was lazy). This is that summary.
We know that the http protocol is request-response based, and the usual garbled-character problems hide somewhere in that question and answer. If you understand the path the characters travel along the way, what happens to them at each step, and how characters are converted between encodings, then any annoying garbled-character problem can be solved.
Below, based on my own work experience, I will describe the garbled-character problems encountered during development in terms of the http protocol.

The problem of garbled characters in the response

More than 2,000 years ago, Confucius saw Yan Hui take some rice from the pot while cooking and reproached him for eating in secret. Yan Hui explained that he had not: something dirty had fallen into the pot, and he had fished out the dirtied rice and eaten it rather than waste it. Confucius later sighed that even what one sees with one's own eyes is not always to be trusted.

Likewise, when you see garbled response content in the browser, you assume the program is spitting out garbled characters and that the fix is to modify the program. Is that really the case? To illustrate the problem, I wrote a simple program that simulates a web application: all it does is output the two characters "中国" ("China") encoded in utf-8.
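A minimal sketch of such a program (an assumption on my part, since the original source is not shown; any server that writes these bytes behaves the same):

import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SimpleServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                try (Socket client = server.accept()) {
                    byte[] body = "中国".getBytes(StandardCharsets.UTF_8); // e4 b8 ad e5 9b bd
                    OutputStream out = client.getOutputStream();
                    out.write(("HTTP/1.1 200 OK\r\n"
                             + "Content-Length: " + body.length + "\r\n"
                             + "\r\n").getBytes(StandardCharsets.ISO_8859_1));
                    // Deliberately no Content-Type header: the browser must guess.
                    out.write(body);
                    out.flush();
                }
            }
        }
    }
}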

Below we use Firefox and Chrome to access the program.

Using Firefox to visit http://localhost:8080:

(screenshot)

Using Chrome to visit http://localhost:8080:

(screenshot)
As you can see, the two browsers show different results for the same output. Firefox shows garbled characters in the page body but the correct characters in its "Response" tab below; Chrome is the opposite, displaying the text correctly in the page body while its "response" tab shows garbled characters. The garbled output of the two browsers is also inconsistent: Firefox displays three garbled characters, Chrome displays six.

As mentioned above, this web program outputs the two characters "中国" in utf-8 encoding. Is the output converted to some other encoding along the way? To find out, I need to look at the raw bytes the program emits. The built-in tools of Firefox and Chrome cannot show the raw bytes, so I use wireshark to capture the packets of each browser.
In utf-8, the two characters "中国" are encoded as e4b8ad (中) and e59bbd (国). If we see these six bytes in the captured packets, the program's output is fine.
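As a quick cross-check (my own addition, not part of the original demonstration), the expected byte sequence can be printed in java:

import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        for (byte b : "中国".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x", b & 0xff); // prints e4b8ade59bbd
        }
        System.out.println();
    }
}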

Packet capture for Firefox:

(screenshot)

Packet capture for Chrome:

(screenshot)

Through wireshark you can see that the results for the two browsers are the same: the Data part is e4b8ade59bbd, consistent with our expectation. The difference is that Firefox used 404 bytes to send its request while Chrome used 494; this is simply because the two browsers send different request headers, such as User-Agent.

Since there is nothing wrong with the program's output, why does the browser display garbled characters? We know that an http program also carries some response headers along with the content it emits. The headers here are as follows:

(screenshot)
You can see that there is only a Content-Length header, indicating the byte length of the content. How to interpret those six bytes the browser is not told, so it can only "guess". The http protocol is at heart a character-oriented protocol; since the response headers say nothing further, the browser assumes by default that the content is character data, and the remaining problem is to "guess" which character encoding those six bytes are in. As the displays show, Chrome guessed the correct utf-8 encoding for the browser window but used a wrong encoding to interpret the six bytes in its "response" tab, while Firefox "guessed" correctly in its "Response" tab but used the wrong encoding in the browser window.

Note that the word "guess" here is not strictly accurate. Each character encoding in fact has its own specific rules, and if you are familiar with all of them, then given a byte sequence you can often deduce which character encoding it is.


Knowing the cause, the problem is easy to solve. The http protocol has a Content-Type header, which can specify both the type of the content and its character encoding. Now we add the response header Content-Type: text/plain; charset=utf-8 to the output and access http://localhost:8080 with both browsers again. The response headers now look like this:

(screenshot)
At this time, both Firefox's browser window and Chrome's "response" tab show the correct characters.
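In the raw-socket sketch above this amounts to writing one extra header line; in a servlet (shown only as an illustrative equivalent, not the original program) the idiomatic calls would be:

// Declare the content type and charset before writing the body,
// so the browser no longer has to guess.
resp.setContentType("text/plain; charset=utf-8");
resp.getOutputStream().write("中国".getBytes("utf-8"));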

The conclusion so far seems to be: as long as the encoding specified by charset is consistent with the encoding of the output content, nothing garbled will be displayed. Let's access our resource another way, using telnet and curl against http://localhost:8080.

Via telnet:

(screenshot)

Because my web program does not process any http request headers, its default behavior is to output the content as soon as the tcp connection is established. That is why, in the telnet session, none of the request headers required by the http protocol were sent, yet the content was still returned.

As the figure shows, charset=utf-8 is present and correct, and I made no change to the program, which means the encoding of the program's output matches the encoding specified by Content-Type. Yet we still do not see the correct characters.

Access through curl:

(screenshot)

The response headers and content are displayed just as with telnet, and the content is garbled.

So was the conclusion we reached through the browser, that consistent output encoding and charset means no garbled characters, simply wrong? Of course not, but the qualifier "in a browser" must be added to it. The issue is easy to understand if we split an http response into two steps: data acquisition and data interpretation. In the data-acquisition step there is no difference between a browser, telnet, and curl: establish a tcp connection, then read the data the web program returns. The difference lies in the data-interpretation step. The browser follows the http specification, which says that the charset in the response's Content-Type header is the actual encoding of the response content, so the browser displays the characters correctly. In our demonstration, telnet and curl are only responsible for the data-acquisition step; interpretation is handled by the terminal in which the command runs, and the terminal has no relationship with the http protocol. It simply displays the content using its preconfigured encoding.

The following is the result after I set the terminal's encoding to utf-8 and then accessed the program with curl again:

(screenshot)

No change was made to the program, yet the garbled characters have disappeared.

Would it really be a problem, then, to leave the encoding out of the response headers?

Set the program's response header Content-Type to text/html without setting a charset, then access it in the two browsers.

Access in Firefox: (screenshot)

Access in Chrome: (screenshot)

You can see that garbled characters appear in Firefox but not in Chrome. Now let's change the program's output to:
<html><head><meta charset="utf-8"></head>中国</html>
and access it again with both browsers.

Firefox access: (screenshot) The gibberish is gone.

Chrome access: (screenshot) Displayed correctly.

As the four pictures above show, we did not specify the content's encoding in the response header, yet no garbling appears. The reason is the combination of Content-Type: text/html in the response header and the <meta charset="utf-8"> tag in the response content: the tag makes the html content describe itself. The markup language xml can be self-describing in the same way, for example through the declaration <?xml version="1.0" encoding="utf-8"?>; so when the response content is xml, even without a charset, the correct encoding can be conveyed through self-description.

One last note: when handling character content that has no charset, different browsers behave differently, and even different versions of the same browser do not necessarily behave the same. So the garbled output I show here may not reproduce on your machine. To make sure no browser has a problem, it is best to add a charset to the response header and keep it consistent with the actual encoding of the content. And if the http resource is not meant to be opened directly in a browser but serves as an interface for other systems to call, then when charset is not specified some other means must be used to tell the caller the content's encoding.

The problem of garbled characters in the request (Request) process
Garbled characters in the request process mainly appear in two places: the encoding used when the request is sent, and the encoding used when the web application receives the request and decodes it. Which encoding is used to send the request depends mainly on the client, so here we use browsers and telnet as clients to illustrate. For the web application layer we use tomcat as the example; if you are not using tomcat at work, the decoding behavior may differ from what is described here, but the principle is the same.

Before starting, a note on the composition of the URL:
{http://localhost:8080[/app/servletpath]}?(name=xxx)
{}: represents the URL
[]: represents the URI
(): represents the query parameters

Using tomcat's default settings, we receive the request with the following code:
@Override
public void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    System.out.println("name: "+req.getParameter("name"));
    System.out.println("queryString: "+req.getQueryString());
    System.out.println("pathInfo: "+req.getPathInfo());
    System.out.println("requestURL: "+req.getRequestURL());

}

Enter http://localhost:8080/app/中国/?name=中国 directly in chrome and the result is as follows:
        name: ä¸å›½
        queryString: name=%E4%B8%AD%E5%9B%BD
        pathInfo: /app/ä¸å›½/
        requestURL: http://localhost:8080/app/%E4%B8%AD%E5%9B%BD/

From the printed information we can see that in queryString and requestURL, the Chinese that chrome sent was percent-encoded according to utf-8 (on percent encoding, see http://deyimsf.iteye.com/blog/2312462). From this we judge that the encoding was correct when the request was sent, but a decoding error occurred in Request.getParameter() and Request.getPathInfo(). In the tomcat documentation there is a parameter URIEncoding, explained as follows:
      This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used.

That is, tomcat uses the encoding specified by URIEncoding to decode the percent-decoded URI bytes, and if it is not specified, ISO-8859-1 is used. From this explanation we know the cause of the garbled characters: the correct encoding was not specified via URIEncoding. Let's set URIEncoding to utf-8 and see what happens. The configuration in tomcat's server.xml file is as follows:
  <Connector port="8080" protocol="HTTP/1.1"
                connectionTimeout="20000"
                redirectPort="8443" URIEncoding="utf-8"/>

Enter http://localhost:8080/app/中国/?name=中国 in chrome again; the result is as follows:
    name: 中国
    queryString: name=%E4%B8%AD%E5%9B%BD
    pathInfo: /app/中国/
    requestURL: http://localhost:8080/app/%E4%B8%AD%E5%9B%BD/

You can see that the garbled characters have disappeared, including those of the parameter name, which shows that URIEncoding also applies to the query string.

In the example above we saw that before chrome sends the request, it percent-encodes all the Chinese using utf-8. In fact, to guarantee the absence of garbled characters in production, requests containing Chinese must be percent-encoded (also called URL encoding). As for why percent encoding is needed, see an earlier article of mine, http://deyimsf.iteye.com/admin/blogs/1776082, which gives a brief explanation.

The http protocol only stipulates that a request must be encoded when it is sent; it does not stipulate which character encoding to use, so chrome's behavior here does not represent all browsers. Some browsers may even encode the URI part and the query-string part of the same request differently. For example, in my work I have encountered a browser that used GBK encoding for the URI part (without percent encoding) while percent-encoding the query string as utf-8. The way to guard against this is to percent-encode every place where Chinese appears, using one definite character encoding (such as utf-8), before sending any request.
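In java, for example, java.net.URLEncoder does exactly this (a small illustration of the rule above, not code from the original article):

import java.net.URLEncoder;

public class PercentEncode {
    public static void main(String[] args) throws Exception {
        // Always encode the Chinese with one fixed charset before building the request.
        String encoded = URLEncoder.encode("中国", "utf-8");
        System.out.println(encoded); // %E4%B8%AD%E5%9B%BD
        System.out.println("http://localhost:8080/app/" + encoded + "/?name=" + encoded);
    }
}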

The problem of character encoding in the request body

The garbled characters discussed above all appeared in the URL and the query string. Another place prone to garbled characters is the http request body: submitting a form with the post method in http puts the input parameters into the request body.

The code used by the server to receive post requests is very simple, as follows:
@Override
public void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        System.out.println("name: "+ req.getParameter("name"));
}

Very simple: the input parameter is printed to the console as soon as it is received.

Post access in Firefox:

(screenshot)

Post access in Chrome:

(screenshot)

Then click the submit button in each of the two browsers. After submitting in Firefox, the backend prints:
    name: ÖÐ¹ú
After submitting in Chrome, the backend prints:
    name: &#20013;&#22269;
Both submissions come out garbled, and the two garbled results differ. Since the server-side program is identical in both cases, we can infer from this that the two browsers must use different encodings when sending the request. Let's use wireshark to see the raw encoding each browser uses for the request body.

Wireshark capture of the request sent by Firefox:

(screenshot)

Wireshark capture of the request sent by Chrome:

(screenshot)

Looking at the blue area at the bottom of the two captures, the Firefox request contains
    name=%D6%D0%B9%FA
(the gbk encoding of "中国", percent-encoded), while the Chrome request contains
    name=%26%2320013%3B%26%2322269%3B
(the html numeric character references for "中国", percent-encoded). What the two have in common is that both browsers percent-encoded the value of the parameter name; what differs is the character encoding underneath. Each browser percent-encodes the input parameters with whatever encoding it considers "correct". Is there a way to make different browsers use the same encoding when sending post requests? One simple, blunt way is to control the post submission with js and, before submitting, percent-encode all input parameters with one unified character encoding (such as utf-8).
Now let's look at another method. Before submitting, we took a screenshot in each of the two browsers, showing the http response headers returned when the form page was fetched in Firefox and Chrome. The two pictures contain only the same three response headers: Server, Content-Length, and Date. Now let's add Content-Type: text/html; charset=utf-8 to this http response, then enter "中国" in both browsers and press the submit button.

At this point you can see that the request sent by both browsers has become the percent-encoded form
name=%E4%B8%AD%E5%9B%BD
which is utf-8.

After both browsers submit, the backend receives
name: ä¸å›½
Still garbled, but now at least the two browsers produce the same result.

Here the backend obtains the parameter value exactly as we did for the query string, with Request.getParameter(), and tomcat's URIEncoding is set to utf-8 as before. The browsers use the same encoding rules to send the request and the backend receives the parameter the same way; the only difference is the http request method, get in one case and post in the other. So we can draw a conclusion: URIEncoding does not work for the post method. What is needed here is the Request.setCharacterEncoding() method, which works only on the request body.
The server-side code becomes as follows:
@Override
public void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    req.setCharacterEncoding("utf-8");
    System.out.println("name: "+req.getParameter("name"));
}

Note that the Request.setCharacterEncoding() call must come before any call to Request.getParameter().

Using the Content-Type request header to specify the character encoding

So far we have used Content-Type as a response header, specifying the character encoding of the response content. In fact this http header can also be used in a request, where it specifies the character encoding of the request body. Now let's comment out the Request.setCharacterEncoding() call on the server and use the telnet program to simulate a browser sending the request. The simulated session looks like this:

(screenshot)
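For readers without the screenshot, here is a sketch in java of an equivalent raw request (my reconstruction, assuming the servlet is mounted at /app/servletpath as in the URL note above; the original simply typed the same bytes into telnet):

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RawPost {
    public static void main(String[] args) throws Exception {
        // "中国" percent-encoded as utf-8
        String body = "name=%E4%B8%AD%E5%9B%BD";
        String request =
              "POST /app/servletpath HTTP/1.1\r\n"
            + "Host: localhost:8080\r\n"
            // The charset here tells the server how to decode the request body.
            + "Content-Type: application/x-www-form-urlencoded; charset=utf-8\r\n"
            + "Content-Length: " + body.length() + "\r\n"
            + "Connection: close\r\n"
            + "\r\n"
            + body;
        try (Socket socket = new Socket("localhost", 8080)) {
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.ISO_8859_1));
            out.flush();
        }
    }
}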
You can see that charset=utf-8 has been added to the Content-Type header. Looking at the backend now, the correctly decoded value is printed:
    name: 中国

The final conclusion: when http submits a form with the post method, the encoding used to send the request is determined by the charset in the Content-Type response header of the page the form came from; if that response sets no charset, the browser decides according to its own "preference". When the server parses the content of the request body, the decoding charset is specified either by the Request.setCharacterEncoding() method (j2ee) or by the request's Content-Type header.

Questions about ISO8859-1

In the previous sections we introduced three ways to set the encoding the server uses when parsing characters, so as to avoid garbling during decoding: URIEncoding, setCharacterEncoding(), and Content-Type. If none of the three is used, tomcat by default decodes characters as ISO8859-1. The server program is modified as follows:
@Override
public void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    // Undo tomcat's default ISO8859-1 decoding: recover the original bytes,
    // then decode them as utf-8.
    System.out.println("name: " + new String(req.getParameter("name").getBytes("iso8859-1"), "utf-8"));
}
@Override
public void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    doGet(req, resp);
}



On the client side we use the chrome browser:

(screenshot)

Everywhere else the defaults are used: URIEncoding is not set in tomcat, there is no Request.setCharacterEncoding() in the code, and there is no charset in the request's Content-Type header. Now try all the access methods mentioned earlier (get requests from several browsers and post requests from several browsers), provided the Chinese is percent-encoded when the request is sent. After trying them all, you will find that whichever method is used, as long as the value of the parameter name is encoded as utf-8 (matching the utf-8 used in the backend doGet method), no garbled characters appear in the backend. Doesn't that feel magical? Let's analyze this magical phenomenon by walking down to the underlying layers of character encoding.

If a character comes out garbled between input and output, an encoding conversion must have happened somewhere in between. In our current test case, six encoding conversions take place:
    1. The browser percent-encodes the characters
    2. Tomcat decodes the percent encoding
    3. The ISO8859-1 encoding is converted to java internal code
    4. Java internal code is converted to ISO8859-1 encoding
    5. The byte array is decoded as utf-8 into java internal code
    6. Java internal code is converted to the output encoding

Before explaining the six conversions, let's first fix some notation:
*Character: written literally, e.g. the characters "a", "中"
*Byte: written in hexadecimal with the 0x prefix, e.g. the ascii character "a" is the byte 0x61
*String.getBytes("utf-8"): converts java internal code to utf-8 encoding
*new String(bytes[], "utf-8"): treats the byte array as utf-8 encoding and converts it to java internal code

The browser percent-encodes the characters

We already know that the two characters "中国" are 0xE4B8AD and 0xE59BBD in utf-8, three bytes per character. After percent encoding they become %E4%B8%AD and %E5%9B%BD. As you can see, percent encoding is lossless with respect to the original encoding: it simply turns each original byte into % followed by that byte's hexadecimal representation. For example the byte 0xE4 is percent-encoded as %E4, so one byte becomes three.

Tomcat decodes the percent encoding

Decoding percent encoding is just as simple: remove the percent sign and combine the two hexadecimal digits after it back into a single byte, so %E4 decodes to the byte 0xE4. At this point the two characters "中国" have become 0xE4B8AD and 0xE59BBD again.
For percent encoding in detail, see http://deyimsf.iteye.com/admin/blogs/2312462.
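In java the same decoding step can be reproduced with java.net.URLDecoder (my own illustration; decoding with iso8859-1 mimics tomcat's default behavior):

import java.net.URLDecoder;

public class PercentDecode {
    public static void main(String[] args) throws Exception {
        // Decoded with the right charset: the two characters "中国".
        System.out.println(URLDecoder.decode("%E4%B8%AD%E5%9B%BD", "utf-8"));
        // Decoded as iso8859-1 (tomcat's default): six wrong characters,
        // but the underlying bytes 0xE4B8ADE59BBD are preserved.
        System.out.println(URLDecoder.decode("%E4%B8%AD%E5%9B%BD", "iso8859-1"));
    }
}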

ISO8859-1 to java internal code

ISO8859-1 can be simply understood as an upgraded ascii. ascii uses only the low 7 bits of a byte, with the high bit always 0, so it can represent at most 128 characters. ISO8859-1 is a single-byte character set like ascii; the difference is that it also uses the highest bit, adding a number of western European characters (such as ± and ÷). For details see https://zh.wikipedia.org/wiki/ISO/IEC_8859-1.

The java internal code we speak of here is the encoding used for characters in memory while a java program runs, namely the utf-16 encoding defined by the unicode standard. Processing characters in java means converting various character encodings to java internal code, and java internal code back to various character encodings. A simple analogy: java handles characters the way an interpreter translates languages. An interpreter whose native language is Chinese and who is fluent in Japanese and English will, when converting Japanese to English or English to Japanese, presumably first convert into the native language and from there into the target language. You might object that a truly capable interpreter needs no native pivot language, or that an interpreter may have more than one native language; but most of our computer languages today have exactly one "native language".
Having introduced ISO8859-1 and java internal code (utf-16), we can discuss the conversion between them. utf-16 encodes unicode code point values as sequences of 16-bit (two-byte) integers, representing each unicode character in two or four bytes. As mentioned, ISO8859-1 is an 8-bit single-byte encoding, so utf-16 and ISO8859-1 are incompatible encodings; but utf-16 contains all the characters of ISO8859-1, so a correspondence exists between them.

Look at these two charts in the unicode documentation:
http://www.unicode.org/charts/PDF/U0000.pdf
http://www.unicode.org/charts/PDF/U0080.pdf
As you can see, every character representable in ISO8859-1 can be found in the unicode character set, and its unicode code point is simply the original encoding with eight 0 bits prepended: the character "a" is 0x61 in ISO8859-1 and 0x0061 as a unicode code point. With this correspondence, encoding conversion can be carried out.

After step 2 above (tomcat decodes the percent encoding), the two characters "中国" sit in memory as 0xE4B8ADE59BBD, exactly six bytes. We know this is really the utf-8 byte sequence of the two characters, but since we never told tomcat which character encoding the sequence is in, tomcat assumes it is an ISO8859-1 sequence and passes it to the java program as such. What the java program then does is convert this byte sequence to utf-16 according to ISO8859-1. The correspondence after the conversion is:
ISO8859-1  0xE4    0xB8    0xAD    0xE5    0x9B    0xBD
UTF-16     0x00E4  0x00B8  0x00AD  0x00E5  0x009B  0x00BD
The original two characters have become six characters in java; the original six bytes have become twelve.

Java internal code to ISO8859-1 encoding

This step happens when our example program executes
System.out.println("name: " + new String(req.getParameter("name").getBytes("iso8859-1"), "utf-8"));

The call getBytes("iso8859-1") converts utf-16 to ISO8859-1. From the correspondence table in step 3 (ISO8859-1 to java internal code) you can see that converting utf-16 to ISO8859-1 merely strips the eight leading 0 bits from each character. After the conversion, the six characters become the bytes 0xE4B8ADE59BBD again. Although the bytes were interpreted with the wrong encoding during the two conversions, none of the original byte information was lost.
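The whole round trip (steps 3 to 5) can be demonstrated with a few self-contained lines of java:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] utf8 = "中国".getBytes(StandardCharsets.UTF_8);        // 0xE4B8AD 0xE59BBD
        // Step 3: the bytes are wrongly decoded as ISO8859-1 -> six utf-16 characters.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());                           // 6
        // Step 4: encoding back to ISO8859-1 recovers the original six bytes intact.
        byte[] back = wrong.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(utf8, back));                // true
        // Step 5: decoding those bytes as utf-8 yields the correct two characters.
        System.out.println(new String(back, StandardCharsets.UTF_8)); // 中国
    }
}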

Decoding the byte array as utf-8 into java internal code

This step is the new String(0xE4B8ADE59BBD, "utf-8") in the example program above. Since our byte array really is utf-8 encoded, decoding it as utf-8 naturally succeeds. The correspondence after the conversion:
UTF-8   0xE4B8AD  0xE59BBD
UTF-16  0x4E2D    0x56FD

At this point the two characters "中国" are correctly represented inside java.

Java internal code to the output encoding

This step is executed by the System.out.println(...) call in the example program. The two characters "中国" are now correctly represented in java as utf-16, and the last step is output to the outside world, i.e. translation back out of the "native language". Here we used java's built-in println method, which writes in the platform's default encoding; if your platform environment is Chinese, for example, the output may be GBK. If you don't want the platform encoding and prefer to decide the output encoding yourself, it is as simple as:
System.out.write("中国".getBytes("some character encoding"));












