Chinese encoding problem in Java Web (part 2)

Encoding involved in Java Web

For Chinese text, encoding comes into play wherever there is I/O. As mentioned earlier, I/O operations involve encoding, and most garbled text is caused by network I/O, because almost all applications involve network operations and data is transmitted over the network as bytes, so all data must be serializable into bytes. In Java, a class must implement the Serializable interface for its objects to be serialized.
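As a minimal sketch of why the char-to-byte step matters, the following turns a string into bytes with one charset and reads the bytes back with two different ones. The class name CharsetRoundTrip and the choice of UTF-8 as the wire encoding are assumptions made for illustration only; the string is written with \u escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {
    // Encode text into the bytes that would travel over the network.
    // UTF-8 is an assumption here; any charset could play this role.
    static byte[] toWire(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u541B\u5C71"; // the two characters of "君山"
        byte[] wire = toWire(original);

        // Decoding with the same charset restores the text.
        String sameCharset = new String(wire, StandardCharsets.UTF_8);
        // Decoding with a different charset maps each byte to one Latin-1 char.
        String wrongCharset = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println(wire.length);                   // 6
        System.out.println(original.equals(sameCharset));  // true
        System.out.println(original.equals(wrongCharset)); // false
        System.out.println(wrongCharset.length());         // 6
    }
}
```

Two characters become six bytes, and a reader that assumes ISO-8859-1 sees six unrelated characters instead of two Chinese ones.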

The user initiates an HTTP request from the browser; the parts that need to be encoded are the URL, the Cookie, and the Parameter. After the server receives the HTTP request, it parses the HTTP protocol, and the URI, the Cookie, and the POST form parameters need to be decoded. The server may also need to read data from a database or from text files stored locally or elsewhere on the network, and this data may have encoding problems too. After the Servlet has processed all the requested data, the data must be encoded again and sent through the Socket to the browser that made the request, where the browser decodes it into text.

URL encoding and decoding

When the browser encodes a URL, it converts non-ASCII characters into bytes according to some character encoding, writes each byte as a two-digit hexadecimal number, and puts "%" in front of each of those hexadecimal bytes.
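The rule just described can be sketched in a few lines of Java. This is only an illustration, not what any browser actually runs: the class PercentEncodeSketch and its method are hypothetical, and the sketch leaves every ASCII character untouched, whereas real browsers also escape some ASCII characters such as spaces.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PercentEncodeSketch {
    // Percent-encode every non-ASCII character of text using the given charset.
    static String percentEncode(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int len = Character.charCount(codePoint);
            if (codePoint < 0x80) {
                out.append((char) codePoint); // ASCII is copied as is in this sketch
            } else {
                // Convert the character to bytes, then write each byte as %XX.
                byte[] bytes = text.substring(i, i + len).getBytes(charset);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "\u541B\u5C71"; // "君山"
        System.out.println(percentEncode(name, StandardCharsets.UTF_8)); // %E5%90%9B%E5%B1%B1
        // GBK is not guaranteed to be present on every JVM, so check first.
        if (Charset.isSupported("GBK")) {
            System.out.println(percentEncode(name, Charset.forName("GBK")));
        }
    }
}
```

The same two characters produce different percent sequences under different charsets, which is why the server has to decode with the charset the browser used.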

The following are the test results for http://localhost:8080/examples/servlets/servlet/君山?author=Junshan in the Chinese version of Firefox 3.6.12:

Encoding and decoding of HTTP Header

When the client initiates an HTTP request, besides the URL above it may also pass other parameters in the Header, such as Cookie, redirectPath, and so on. These user-set values may have encoding problems as well. How does Tomcat decode them?
Items in the Header are also decoded when request.getHeader is called. If the requested Header item has not been decoded yet, the toString method of MessageBytes is called, and the default encoding used for this conversion from byte to char is again ISO-8859-1. We cannot set any other decoding format for the Header, so if the Header contains non-ASCII characters, they will come out garbled.
The same applies when we add a Header: do not pass non-ASCII characters in it. If you must, encode the characters with org.apache.catalina.util.URLEncoder first and then add them to the Header, so that the information is not lost in transit between the browser and the server. When we need to access these items, we simply decode them with the corresponding character set.
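The advice above can be sketched as follows. Note the substitution: this example uses the JDK's java.net.URLEncoder and java.net.URLDecoder instead of org.apache.catalina.util.URLEncoder so that it runs without Tomcat on the classpath. The class HeaderValueCodec, its methods, and the use of UTF-8 are assumptions for illustration; simulateTransport only imitates a sender writing UTF-8 bytes and a container decoding them as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class HeaderValueCodec {
    // Turn an arbitrary value into pure ASCII before it is put into a header.
    static String encodeForHeader(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Restore the value after it has been read from the header.
    static String decodeFromHeader(String headerValue) {
        try {
            return URLDecoder.decode(headerValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Imitate the trip: written as UTF-8 bytes, read back as ISO-8859-1 chars.
    static String simulateTransport(String headerValue) {
        byte[] wire = headerValue.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String raw = "\u541B\u5C71"; // "君山"
        String encoded = encodeForHeader(raw);
        System.out.println(encoded); // %E5%90%9B%E5%B1%B1
        // The ASCII form survives the ISO-8859-1 decoding step unchanged.
        System.out.println(raw.equals(decodeFromHeader(simulateTransport(encoded)))); // true
        // The raw value does not.
        System.out.println(raw.equals(simulateTransport(raw))); // false
    }
}
```

Because the encoded value contains only ASCII characters, it reads the same under ISO-8859-1, so the receiver can decode it afterwards with the charset both sides agreed on.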

Encoding and decoding of POST forms

As mentioned earlier, the parameters submitted by a POST form are decoded when request.getParameter is called for the first time. Unlike the QueryString, POST form parameters are passed to the server in the HTTP BODY. When we click the submit button on the page, the browser first encodes the parameters filled into the form according to the charset of the ContentType and then submits them to the server. The server also uses the charset in the ContentType to decode them. Parameters submitted through a POST form therefore generally do not cause problems, and this charset is one we set ourselves: it can be set with request.setCharacterEncoding(charset).
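To make the server side of this concrete without depending on a Servlet container, here is a rough sketch that decodes a form body using the charset named in the Content-Type. It is not Tomcat's code: the class FormBodySketch, its methods, and the ISO-8859-1 fallback are assumptions for illustration. In a real Servlet the equivalent step is calling request.setCharacterEncoding(charset) before the first request.getParameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class FormBodySketch {
    // Fallback used when the Content-Type names no charset (an assumption).
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Extract the charset parameter from a Content-Type value, if present.
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    return p.substring("charset=".length()).trim().replace("\"", "");
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    // Decode name=value pairs of a form body with the Content-Type's charset.
    static Map<String, String> parse(String body, String contentType) {
        String charset = charsetOf(contentType);
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            try {
                params.put(URLDecoder.decode(name, charset), URLDecoder.decode(value, charset));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unsupported charset: " + charset, e);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String body = "author=%E5%90%9B%E5%B1%B1";
        String withCharset = "application/x-www-form-urlencoded; charset=UTF-8";
        System.out.println("\u541B\u5C71".equals(parse(body, withCharset).get("author"))); // true
        // Without a charset the fallback applies and the value is garbled.
        String noCharset = "application/x-www-form-urlencoded";
        System.out.println(parse(body, noCharset).get("author").length()); // 6
    }
}
```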

In addition, for parameters of type multipart/form-data, that is, uploaded files, the character set defined by the ContentType is also used. Note that the uploaded file is transferred to a local temporary directory on the server as a byte stream, and this process does not involve character encoding; the actual encoding happens when the file content is added to the parameters. If that character set cannot be used, the default encoding ISO-8859-1 is used instead.

Encoding and decoding of HTTP BODY

Once the resource requested by the user has been obtained successfully, the content is returned to the client browser through the Response. It must be encoded on the way out and then decoded in the browser. The character set for this step can be set with response.setCharacterEncoding; it overrides the value of request.getCharacterEncoding and is returned to the client in the Content-Type of the Header. When the browser receives the returned socket stream, it decodes it with the charset from the Content-Type. If the Content-Type of the returned HTTP Header has no charset, the browser decodes according to the charset in the HTML tag <meta HTTP-equiv="Content-Type" content="text/html; charset=GBK" />. If that is not defined either, the browser uses its default encoding.
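The lookup order described in this paragraph (Content-Type header first, then the meta tag, then the browser default) can be modeled roughly like this. It is a simplified sketch and not how any real browser is implemented: the class ResponseCharsetSketch, its methods, and the crude regular expression that scans for "charset=" are all assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ResponseCharsetSketch {
    // Very loose matcher for "charset=NAME"; good enough for a sketch only.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Return the first charset name found in text, or null if there is none.
    static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Header charset wins, then the meta tag, then the browser default.
    static String resolve(String contentTypeHeader, String html, String browserDefault) {
        String fromHeader = find(contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        String fromMeta = find(html);
        return fromMeta != null ? fromMeta : browserDefault;
    }

    public static void main(String[] args) {
        String meta = "<meta HTTP-equiv=\"Content-Type\" content=\"text/html; charset=GBK\" />";
        System.out.println(resolve("text/html; charset=UTF-8", meta, "ISO-8859-1")); // UTF-8
        System.out.println(resolve("text/html", meta, "ISO-8859-1"));                // GBK
        System.out.println(resolve("text/html", "<html></html>", "ISO-8859-1"));     // ISO-8859-1
    }
}
```

Setting the charset on the response is therefore the most reliable option, since it takes precedence over whatever the page itself declares.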



In summary, garbled characters are caused by using inconsistent character sets for encoding and decoding when converting from char to byte or from byte to char. Because a single operation often involves several rounds of encoding and decoding, each problem has to be analyzed case by case.






Origin blog.csdn.net/liushulin183/article/details/50211071