Analyzing Chinese garbled characters through the Tomcat source code (1)


This article was first published on the WeChat public account "The Beauty of Algorithms and Programming". Follow the account to keep up with the rest of this series.

In this series of blog posts, we look at the various annoying Chinese garbled-character problems you may run into in Java web development. You may already know how to fix garbled Chinese characters in some cases, but not necessarily why they occur. Understanding the cause of a problem is often more important than knowing the fix. We will walk through the Tomcat source code to help you understand the underlying reasons for these garbled characters.

1 Problem description

There are two JSP files. The first, input.jsp, is very simple: it contains a form with a text input named content and a submit button, and the form is submitted to result.jsp for processing.

<%@ page contentType="text/html;charset=UTF-8" language="java" %>
<html>
<head>
    <title>Title</title>
</head>
<body>

<form action="result.jsp" method="post">

    <input type="text" name="content"/>
    <input type="submit" value="send"/>
</form>
</body>
</html>

The second file is called result.jsp, which accepts the content of the input box in the input.jsp form and displays it on the page.

<%@ page contentType="text/html;charset=UTF-8" language="java" %>
<html>
<head>
    <title>Title</title>
</head>
<body>

<%
    String content = request.getParameter("content");
%>

<p><%=content%></p>
</body>
</html>

 

Neither JSP file should pose any difficulty. Once they are written, we deploy and run them. We visit http://localhost:8080/input.jsp, enter "你好" (Chinese for "Hello") in the form, and click submit. To our surprise, the result page shows a string of characters that no human can read: "ä½ å¥½".

2 Source code analysis

Why do the garbled characters above occur? This is probably one of the most common problems people hit when they first start Java web development. You may already know how to fix it, but if an interviewer asked you why the garbled characters appear, could you answer?

There is only one line of Java code in result.jsp, so the problem must be there.

What's going on in the line request.getParameter("content")?

Let's take a look at it bit by bit.

 

2.1 RequestFacade analysis

The first thing to figure out is the type of the request variable, so that we can locate its getParameter() method.

Some readers will say this is easy: the type of the request variable is ServletRequest.

Think about that answer, though. It does not actually help, because ServletRequest is an interface and contains no implementation code, so we still cannot find the code of the getParameter() method.

How can we find the real type of the request object?

This calls for a very important skill: breakpoint debugging.

Put a breakpoint on that line of code and start a debug session. When execution stops at the breakpoint, hover the mouse over request and the debugger shows its runtime type.

You should now be able to confirm that the real type of the request object is RequestFacade, so the code we need to look at is the getParameter() method of the RequestFacade class, shown below:

RequestFacade
@Override
public String getParameter(String name) {

    if (request == null) {
        throw new IllegalStateException(
                        sm.getString("requestFacade.nullRequest"));
    }

    if (Globals.IS_SECURITY_ENABLED) {
        return AccessController.doPrivileged(
            new GetParameterPrivilegedAction(name));
    } else {
        return request.getParameter(name);
    }
}

Note that the request field used here is not the same type as the request variable we just inspected.

Its type is org.apache.catalina.connector.Request, so next we analyze the getParameter() method of that class.

2.2 connector.Request analysis


org.apache.catalina.connector.Request
@Override
public String getParameter(String name) {

    if (!parametersParsed) {
        parseParameters();
    }

    return coyoteRequest.getParameters().getParameter(name);

}

From the code above, we can see that the parameters are parsed first if they have not been parsed already, and the value for name is then read from the parsed parameters.

The parseParameters() method is long, so we only show the code relevant to our problem.

String enc = getCharacterEncoding();

boolean useBodyEncodingForURI = connector.getUseBodyEncodingForURI();
if (enc != null) {
    parameters.setEncoding(enc);
    if (useBodyEncodingForURI) {
        parameters.setQueryStringEncoding(enc);
    }
} else {
    parameters.setEncoding
        (org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
    if (useBodyEncodingForURI) {
        parameters.setQueryStringEncoding
            (org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
    }
}

 

The relevant logic comes down to two cases:

First, the value of the enc variable is obtained. If it is not null, the parameter encoding is set to enc; when useBodyEncodingForURI is true, the query string encoding is set to enc as well.

If it is null, the encoding is set to a constant, whose value is:

public final class Constants {

    public static final String DEFAULT_CHARACTER_ENCODING="ISO-8859-1";



}

At this point the problem seems to be solved, but one critical question remains:

What exactly is the value of this enc variable?

2.3 What is enc

First let's look at getCharacterEncoding() in org.apache.catalina.connector.Request:

@Override
public String getCharacterEncoding() {
    String result = coyoteRequest.getCharacterEncoding();
    if (result == null) {
        Context context = getContext();
        if (context != null) {
            result = context.getRequestCharacterEncoding();
        }
    }
    return result;
}

 

From the code above, we can see that the encoding is looked up in two places: first in the coyoteRequest object, and then in the context.

(1) What happens in coyoteRequest


public String getCharacterEncoding() {

    if (charEncoding != null) {
        return charEncoding;
    }

    charEncoding = getCharsetFromContentType(getContentType());

    return charEncoding;
}

public String getContentType() {
    contentType();
    if ((contentTypeMB == null) || contentTypeMB.isNull()) {
        return null;
    }
    return contentTypeMB.toString();
}

 

public MessageBytes contentType() {
    if (contentTypeMB == null) {
        contentTypeMB = headers.getValue("content-type");
    }
    return contentTypeMB;
}

 

private static String getCharsetFromContentType(String contentType) {

    if (contentType == null) {
        return (null);
    }
    int start = contentType.indexOf("charset=");
    if (start < 0) {
        return (null);
    }
    String encoding = contentType.substring(start + 8);
    int end = encoding.indexOf(';');
    if (end >= 0) {
        encoding = encoding.substring(0, end);
    }
    encoding = encoding.trim();
    if ((encoding.length() > 2) && (encoding.startsWith("\""))
        && (encoding.endsWith("\""))) {
        encoding = encoding.substring(1, encoding.length() - 1);
    }
    return (encoding.trim());

}

These four methods show that coyoteRequest first returns charEncoding if it has already been set. Otherwise it looks for a charset parameter in the Content-Type HTTP header and, if one is present, uses it as enc.
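To see what this lookup yields for a few typical headers, here is a small self-contained sketch that mirrors the logic of getCharsetFromContentType() shown above. The class name ContentTypeCharsetDemo, the method name charsetFrom, and the sample header values are assumptions made for illustration; this is not Tomcat's own class.

```java
// Mirrors the charset lookup shown above; class name and sample headers are illustrative.
class ContentTypeCharsetDemo {

    static String charsetFrom(String contentType) {
        if (contentType == null) {
            return null;
        }
        int start = contentType.indexOf("charset=");
        if (start < 0) {
            return null;                              // header carries no charset parameter
        }
        String encoding = contentType.substring(start + "charset=".length());
        int end = encoding.indexOf(';');
        if (end >= 0) {
            encoding = encoding.substring(0, end);    // drop any parameters that follow
        }
        encoding = encoding.trim();
        if (encoding.length() > 2 && encoding.startsWith("\"") && encoding.endsWith("\"")) {
            encoding = encoding.substring(1, encoding.length() - 1);  // strip surrounding quotes
        }
        return encoding.trim();
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html;charset=utf-8"));                              // utf-8
        System.out.println(charsetFrom("application/x-www-form-urlencoded; charset=\"UTF-8\"")); // UTF-8
        System.out.println(charsetFrom("application/x-www-form-urlencoded"));                    // null
    }
}
```

The last case matters for our example: browsers typically submit forms with Content-Type: application/x-www-form-urlencoded and no charset parameter, so this step finds nothing and the lookup moves on.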

(2) What happens in the context

 

StandardContext

@Override
public String getRequestCharacterEncoding() {
    return requestEncoding;
}

 

context refers to the context container of the current web application. So when is the requestEncoding attribute set?

Starting with Servlet 4.0, we can set the request-character-encoding element directly in web.xml. Note that this element is only available from Servlet 4.0 onward.

<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee     http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd"
         version="4.0">

    <request-character-encoding>UTF-8</request-character-encoding>
</web-app>

 

2.4 Summary of encoding setup process

We have analyzed the whole process, so let's summarize what Tomcat does when it handles request.getParameter(). The original post illustrates this with two figures; the steps are described below.

First, Tomcat checks whether the enc variable is set. If it is not, the encoding falls back to the default value ISO-8859-1.

The enc variable itself is determined as follows:

The first step checks whether charEncoding has been set on the request object. This is probably the most common fix for garbled characters, and it must run before the first getParameter() call:

request.setCharacterEncoding("utf-8");

If it is not set, the second step checks whether the Content-Type HTTP header carries a charset value. Readers familiar with the HTTP protocol will recognize this form:

Content-Type: text/html;charset=utf-8

If that is not set either, the third step checks whether the requestEncoding property is set on the context container, that is, whether the request-character-encoding element is set in web.xml. This element is only available from Servlet 4.0 onward.
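The three steps and the final fallback can be condensed into a single lookup chain. The sketch below is only a summary of the order described above; the class EncodingResolutionDemo, the method resolve, and its parameter names are hypothetical and do not exist in Tomcat.

```java
// Hypothetical condensation of the lookup order; not Tomcat's actual code.
class EncodingResolutionDemo {
    static final String DEFAULT_CHARACTER_ENCODING = "ISO-8859-1";

    static String resolve(String explicitEncoding,     // step 1: request.setCharacterEncoding(...)
                          String contentTypeCharset,   // step 2: charset from the Content-Type header
                          String contextEncoding) {    // step 3: request-character-encoding in web.xml
        if (explicitEncoding != null) {
            return explicitEncoding;
        }
        if (contentTypeCharset != null) {
            return contentTypeCharset;
        }
        if (contextEncoding != null) {
            return contextEncoding;
        }
        return DEFAULT_CHARACTER_ENCODING;             // nothing set anywhere: use the default
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null, null));         // ISO-8859-1
        System.out.println(resolve(null, null, "UTF-8"));      // UTF-8
        System.out.println(resolve("utf-8", "GBK", "UTF-8"));  // utf-8
    }
}
```

The first call corresponds to our example: nothing is configured, so the parameters end up being decoded as ISO-8859-1.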

2.5 Reasons for garbled characters

After all this analysis, we still have not explained why the garbled characters appear.

We know that the page encoding of both input.jsp and result.jsp is UTF-8, while the processing step in between uses ISO-8859-1 by default. As the figure in the original post illustrates, the two ends are UTF-8 and the middle is ISO-8859-1, so the encodings are inconsistent and garbled characters naturally appear.
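The mismatch can be reproduced with plain JDK calls, without Tomcat. The sketch below encodes the text as UTF-8 (as the page does) and decodes the bytes as ISO-8859-1 (as the default processing step does). The class MojibakeDemo and its method names are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Reproduces the encoding mismatch with JDK string conversions; not Tomcat code.
class MojibakeDemo {

    // Encode as the UTF-8 page does, then decode as the ISO-8859-1 default does.
    static String garble(String text) {
        byte[] sent = text.getBytes(StandardCharsets.UTF_8);
        return new String(sent, StandardCharsets.ISO_8859_1);
    }

    // Undo the wrong decoding. ISO-8859-1 maps every byte to exactly one char,
    // so the original bytes can be recovered and decoded correctly.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "\u4f60\u597d";   // "你好", written as escapes to avoid source-encoding issues
        String garbled = garble(input);
        System.out.println(garbled.length());               // 6: one char per UTF-8 byte
        System.out.println(repair(garbled).equals(input));  // true
    }
}
```

The six characters produced by garble() are exactly the "ä½ å¥½" seen on the result page (the third one is a non-breaking space). The repair() method is only there to show that no bytes were lost along the way.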

How do we fix the garbled characters?

From §2.4, there are two ways:

The first is to set both input.jsp and result.jsp to ISO-8859-1, so that all three steps use ISO-8859-1.

The second is to change the encoding of the intermediate processing step to UTF-8.

How? By any of the ways of setting the enc variable described in §2.4.
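Whichever way is chosen, the encoding has to be in place before the first getParameter() call, because the parametersParsed guard seen in §2.2 means the parameters are decoded only once. A minimal sketch of that parse-once behaviour follows; the class ParseOnceDemo and its members are hypothetical stand-ins, and decoding raw bytes with new String(...) is a simplification of what Tomcat actually does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the parse-once guard; not Tomcat's actual code.
class ParseOnceDemo {
    private final byte[] rawBytes;                           // bytes received from the browser
    private Charset encoding = StandardCharsets.ISO_8859_1;  // default when nothing else is set
    private boolean parametersParsed = false;
    private String value;

    ParseOnceDemo(byte[] rawBytes) {
        this.rawBytes = rawBytes;
    }

    void setEncoding(Charset encoding) {
        this.encoding = encoding;
    }

    String getParameter() {
        if (!parametersParsed) {                 // decode on first access only
            value = new String(rawBytes, encoding);
            parametersParsed = true;
        }
        return value;                            // later calls reuse the cached result
    }

    public static void main(String[] args) {
        String text = "\u4f60\u597d";            // "你好"
        byte[] body = text.getBytes(StandardCharsets.UTF_8);

        ParseOnceDemo early = new ParseOnceDemo(body);
        early.setEncoding(StandardCharsets.UTF_8);   // set before the first read
        System.out.println(early.getParameter().equals(text));  // true

        ParseOnceDemo late = new ParseOnceDemo(body);
        late.getParameter();                         // first read fixes the decoding
        late.setEncoding(StandardCharsets.UTF_8);    // too late in this sketch
        System.out.println(late.getParameter().equals(text));   // false
    }
}
```

This matches the Servlet API contract, which states that setCharacterEncoding() has no effect once request parameters have been read.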

3 Summary

From the perspective of the Tomcat source code, this article analyzed why Chinese text submitted through a form, with no encoding settings beyond the defaults, shows up as garbled characters on the result page. The root cause is that the encoding used at the two ends differs from the encoding used in the processing step between them. The article also outlined several ways to fix the problem. Once you understand from the source code why garbled characters appear, they should no longer be a mystery.
