Common solutions to garbled-character problems in Java web development

Recently, while working on a small demo, I ran into a problem with garbled characters. Along the way I found a blog post with a comprehensive summary. I am sharing it here together with my own additions.

---------- Reposted content ---------

 ------------------------------------------------------

1. Garbled characters caused by file page encoding


Every file (java, js, jsp, html, and so on) is saved in some encoding. Its contents display correctly when the file is opened in that encoding and appear garbled when it is opened in another.
In Eclipse, each project has an encoding setting (Text file encoding), which on a Chinese-language Windows system generally defaults to GBK, the platform encoding. A good habit is to set a new project's encoding to UTF-8 right after creating it.
The reason is simple: UTF-8 can represent the characters of every language in the world, so it is an international encoding with strong versatility. The common character sets GBK, GB2312, and UTF-8 are related as follows:
GBK is an extension of the national standard GB2312 and remains compatible with it. Text in GBK or GB2312 is converted to and from UTF-8 by going through Unicode.
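
To illustrate that last point, here is a minimal sketch. The class and method names are made up for this example and are not from the original article. In Java a String is already Unicode, so converting GBK bytes to UTF-8 is simply a decode followed by an encode. The sample text is written with \u escapes so that the result does not depend on the encoding this source file is saved in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the original article.
class CharsetBridge {

    private static final Charset GBK = Charset.forName("GBK");

    // GBK bytes -> Unicode (a Java String) -> UTF-8 bytes.
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        String unicode = new String(gbkBytes, GBK);      // decode: bytes -> Unicode
        return unicode.getBytes(StandardCharsets.UTF_8); // encode: Unicode -> bytes
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters
        byte[] gbk = text.getBytes(GBK);
        byte[] utf8 = gbkToUtf8(gbk);
        System.out.println("GBK bytes:   " + gbk.length);
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("Matches direct UTF-8 encoding: "
                + Arrays.equals(utf8, text.getBytes(StandardCharsets.UTF_8)));
    }
}
```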

If you are interested in learning more, you can refer to the article linked below, which is well written.

http://www.cnblogs.com/xiaomia/archive/2010/11/28/1890072.html

2. Garbled characters caused by string conversion of different character sets

Whenever a String is stored or transmitted, it is turned into a byte array. Different character sets produce byte arrays of different lengths for the same text. If the bytes are later decoded with a character set other than the one used to encode them, the result is garbled.

For example, consider the following code:

import java.nio.charset.Charset;

public class TestCharset {

    public static void main(String[] args) {
        String strChineseString = "中文"; // two Chinese characters
        String encoding = System.getProperty("file.encoding");
        System.out.println("The system default character set is: " + encoding);
        System.out.println(strChineseString.getBytes(Charset.forName("GBK")).length);
        System.out.println(strChineseString.getBytes(Charset.forName("UTF-8")).length);
        System.out.println(strChineseString.getBytes().length); // platform default charset
    }
}

The output is:

The system default character set is: UTF-8  
4  
6  
6  

As you can see, encoding with GBK and with UTF-8 produces byte arrays of different lengths. The reason is that UTF-8 uses 3 bytes for each of these Chinese characters, while GBK uses 2. Because the default character set here is UTF-8, getBytes() without arguments returns the same length as encoding with UTF-8. For more background on character sets, see the article linked in section 1.

Description of the getBytes methods in the JDK:
getBytes()  Encodes this String into a sequence of bytes using the platform's default charset, storing the result in a new byte array.

getBytes(Charset charset)  Encodes this String into a sequence of bytes using the given charset, storing the result in a new byte array.

Internally, a String has its own representation (Unicode). Once getBytes is called, however, the resulting byte array is encoded in one specific character set, and no further conversion is needed.

Once you have such a byte array, you can pass it to a String constructor to build the transcoded String.

A test example follows:

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class TestCharset {

    public static void main(String[] args) throws UnsupportedEncodingException {
        String strChineseString = "中文"; // two Chinese characters
        byte[] byteGBK = strChineseString.getBytes(Charset.forName("GBK"));
        byte[] byteUTF8 = strChineseString.getBytes(Charset.forName("utf-8"));
        System.out.println(new String(byteGBK, "GBK"));    // same charset: correct
        System.out.println(new String(byteGBK, "utf-8"));  // mismatch: garbled
        System.out.println("**************************");
        System.out.println(new String(byteUTF8, "utf-8")); // same charset: correct
        System.out.println(new String(byteUTF8, "GBK"));   // mismatch: garbled
    }
}

The output is:

中文
����
**************************
中文
涓枃

As you can see, whichever character set was used to encode a String into bytes must also be used to decode those bytes back into a String; otherwise the result is garbled.

Simply put, only a conversion that follows the pattern below is guaranteed not to be garbled.

String strSource = "The string you want to transcode";  
String strSomeEncoding = "utf-8"; //For example utf-8  
String strTarget = new String (strSource.getBytes(Charset.forName(strSomeEncoding)), strSomeEncoding);  

Description of the String constructors in the JDK:
String(byte[] bytes)   Constructs a new String by decoding the specified array of bytes using the platform's default charset.
String(byte[] bytes, Charset charset)   Constructs a new String by decoding the specified array of bytes using the specified charset.
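
The String constructors silently replace undecodable bytes with the replacement character, which is why a mismatch shows up as garbled text rather than as an error. When you would rather detect the mismatch, a CharsetDecoder can be told to report it. The sketch below is only an illustration; StrictDecoder and decodeStrict are names invented here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original article.
class StrictDecoder {

    // Decodes bytes with the given charset, but throws instead of silently
    // substituting a replacement character when the bytes are not valid.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] gbkBytes = "\u4e2d\u6587".getBytes(gbk);
        try {
            // Same charset for encoding and decoding: succeeds.
            System.out.println(decodeStrict(gbkBytes, gbk));
            // GBK bytes decoded as UTF-8: reported as an error.
            System.out.println(decodeStrict(gbkBytes, StandardCharsets.UTF_8));
        } catch (CharacterCodingException e) {
            System.out.println("Bytes are not valid in the requested charset: " + e);
        }
    }
}
```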

3. Chinese garbled characters caused by Socket network transmission

When communicating over a Socket there are several ways to send data; both PrintStream and PrintWriter can be used. Sending English text is fine, but Chinese text may arrive garbled. There are many explanations of this online. After actual testing, I found that the problem again comes down to bytes versus characters.

As we all know, Java has byte streams and character streams; a character (char) is 16 bits and a byte is 8 bits. PrintStream is a byte stream, and PrintWriter is a character stream.
A String is held as 16-bit Unicode characters, so it has to be encoded into bytes before it goes over the network. In my experience, text written through a PrintWriter is more portable across platforms, while with PrintStream the character set is more likely to end up wrong and the text garbled.

Put another way: PrintStream works at the byte level and PrintWriter at the character level. If the data is handled one byte at a time, or decoded with a character set other than the one the sender used, a Chinese character (which takes more than one byte) comes out garbled. In general, use PrintWriter when dealing with Chinese.

In the final test on the website, no garbled characters appeared when using PrintWriter. The code is as follows:

import java.io.BufferedReader;  
import java.io.DataOutputStream;  
import java.io.IOException;  
import java.io.OutputStreamWriter;  
import java.io.PrintWriter;  
import java.net.Socket;  
  
public class TestSocket {  
  
    public static void main(String[] args) throws IOException {  
        Socket socket = new Socket();  
        DataOutputStream dos = null;  
        PrintWriter pw = null;        
        BufferedReader in = null;  
        String responseXml = "要传输的中文";  // the Chinese text to send
        //..........  
        dos = new DataOutputStream(socket.getOutputStream());  
        pw = new PrintWriter(new OutputStreamWriter(dos));  // Writer without auto-flush
        pw.println(responseXml);  
        pw.flush();  
    }  
}  

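Note that you need to use PrintWriter's println method rather than write; otherwise the server side will not read the data. The reason is that println appends a line terminator after the string when writing, while write does not.
For the specific differences between println and write, see the discussion at the following URL: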
http://www.oschina.net/question/101123_17855
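
One way to take the platform default out of the picture is to name the character set explicitly on both the writing and the reading side. The sketch below is an illustration under assumptions: ExplicitCharsetDemo and roundTrip are invented names, and in-memory streams stand in for socket.getOutputStream() and socket.getInputStream() so that the example runs without a server.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the byte array plays the role of the network.
class ExplicitCharsetDemo {

    // Writes one line using writeCharset, then reads it back using readCharset.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset)
            throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        PrintWriter pw = new PrintWriter(new OutputStreamWriter(wire, writeCharset));
        pw.println(text); // println adds the line terminator that readLine() looks for
        pw.flush();

        BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(wire.toByteArray()), readCharset));
        return in.readLine();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4e2d\u6587"; // two Chinese characters
        Charset gbk = Charset.forName("GBK");
        // Both ends agree on UTF-8: the text survives.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
        // Sender uses UTF-8, receiver uses GBK: the text is garbled.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8, gbk)));
    }
}
```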

4. Garbled Chinese displayed in JSP

Sometimes a JSP page shows garbled characters when displaying Chinese. In most cases the cause is the character set configuration or the page encoding. As long as the following settings are all correct, garbled characters generally do not appear.
  • a. Add the following directive at the top of the JSP page:
<%@ page contentType="text/html; charset=utf-8" language="java" errorPage="" %>  

  • b. Add the following tag inside the HTML head:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />  

  • c. Make sure the encoding in which the JSP file itself is saved matches the two charsets above, as discussed in section 1.

The character set above can be chosen according to your needs; it does not have to be utf-8. However, because utf-8 supports all languages well, Chinese in particular, it is the recommended choice. I have run into the problem that the character "Kau" could not be displayed properly on a GB2312-encoded page.
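
Whether a given character can be represented at all in a character set can be checked from Java before choosing the page charset. The following is a small sketch; EncodableCheck is an invented name, and the Korean syllable is only a stand-in for a character that GB2312 lacks, not the character from the article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original article.
class EncodableCheck {

    // True if every character of text can be represented in the given charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        String chinese = "\u4e2d\u6587"; // two common Chinese characters
        String hangul = "\ud55c";        // a Korean syllable, absent from GB2312

        System.out.println("GB2312 can encode the Chinese text: " + canEncode(chinese, gb2312));
        System.out.println("GB2312 can encode the Korean syllable: " + canEncode(hangul, gb2312));
        System.out.println("UTF-8 can encode the Korean syllable: "
                + canEncode(hangul, StandardCharsets.UTF_8));
    }
}
```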

5. Chinese passed via POST and GET arrives garbled on the backend

The frontend can pass Chinese to the backend with either the GET or the POST method.

  • a. The GET case:
     With GET, the Chinese text is passed as part of the URL.

     In a js file, you can use the following code to encode it:

var url = "http://www.baidu.com/s?industry=编码";
url = encodeURI(url);

    In a JSP file, you can use the following statements.
    At the top of the page, import:

<%@ page import="java.net.URLEncoder" %>  

      Where encoding is needed, use URLEncoder:

<a href="xxxxx.xx?industry=<%=URLEncoder.encode("http://www.baidu.com/s?wd=编码", "UTF-8")%>">

    Whichever method is used, the backend should obtain the Chinese text with the following code:

request.setCharacterEncoding("utf-8");
// Re-decoding is only needed when the container decoded the value as ISO-8859-1
String industry = new String(
        request.getParameter("industry").getBytes("ISO8859-1"), "UTF-8");

[Notes]
1. request.setCharacterEncoding specifies the encoding of the submitted content. Once it is set, getParameter() returns the correct string directly. If it is not set, ISO-8859-1 is used by default. The setting applies to the request body, which is why the GET parameter above is additionally re-decoded. For consistency, the submitting side should also specify the encoding it sends in.
2. The second statement in the code above seems to contradict the pattern given in section 2. I struggled with this for a long time and finally found that ISO8859-1 is an older encoding, usually called Latin-1. It is a single-byte encoding in which every byte value maps to exactly one character, matching the computer's most basic unit of representation, so transcoding through it generally causes no problems: getBytes("ISO8859-1") gives back exactly the bytes that were received.
ISO-8859-1 is the default character set that servlet containers use to decode request data, while GB2312 is a standard Chinese character set. For operations that involve network transmission, such as submitting a form, text that was decoded as ISO-8859-1 has to be converted to the character set the page actually uses (GB2312, for example) before it is displayed. Otherwise it is garbled, because the two character sets are incompatible. To save trouble, it is recommended to use UTF-8 throughout.
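
The round trip described in note 2 can be reproduced outside a servlet container. In this sketch, which uses invented names (Latin1RoundTrip and its methods), URLDecoder with ISO-8859-1 stands in for a container that decodes the query string as Latin-1; a real container's decoding code is different, but the effect on the bytes is the same.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original article.
class Latin1RoundTrip {

    // What the page does: percent-encode the value as UTF-8.
    static String encodeForUrl(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }

    // Stand-in for a container that decodes the query string as ISO-8859-1.
    static String decodeAsLatin1(String encoded) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, "ISO-8859-1");
    }

    // The article's fix: recover the raw bytes, then decode them as UTF-8.
    static String recover(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "\u7f16\u7801"; // the parameter value used in the article
        String onTheWire = encodeForUrl(original);
        String garbled = decodeAsLatin1(onTheWire);
        System.out.println("On the wire: " + onTheWire);
        System.out.println("Garbled equals original: " + garbled.equals(original));
        System.out.println("Recovered equals original: " + recover(garbled).equals(original));
    }
}
```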

  • b. The POST case:
    POST is relatively simple. You only need to specify the character set in the Content-Type header where the POST call is made, for example:
xmlHttp.open("post", url, true);
xmlHttp.setRequestHeader("Content-Type", "text/xml; charset=utf-8");
xmlHttp.send(param);

   Here param is the data to be sent.

   The backend is handled the same way as for the GET method. Make sure the sending and receiving sides use the same character set.

   If anything is unclear, see the following article, which is well written.
   http://www.cnblogs.com/qiuyi21/articles/1089555.html

6. Garbled Chinese sent from the backend to the frontend

Here is a function that sends information to the frontend without garbling. The core idea is to set the character set of the response. The code is as follows:

/**
 * @Function:writeResponse
 * @Description: ajax method returns a string
 * @param str:json
 * @return: true: output succeeded, false: output failed
 */  
public boolean writeResponse(String str) {
    boolean ret = true;
    try {
        HttpServletResponse response = ServletActionContext.getResponse();
        // Set the charset before calling getWriter(); afterwards it has no effect
        response.setContentType("text/html;charset=utf-8");
        PrintWriter pw = response.getWriter();
        pw.print(str);
        pw.close();
    } catch (Exception e) {
        ret = false;
        e.printStackTrace();
    }
    return ret;
}

7. Garbled file names when downloading files

Anyone who has implemented file download knows that the names of downloaded files easily come out garbled. The reason is that the encoding used for the response is not specified.

The following code helps you achieve downloads without garbled names.

HttpServletResponse response = ServletActionContext.getResponse();
response.reset(); // clears the buffer, status code and headers, so call it first
response.setContentType("text/html;charset=utf-8");
String header = "attachment; filename=" + picName;
// Common workaround: carry the UTF-8 bytes through the ISO-8859-1 header unchanged
header = new String(header.getBytes("UTF-8"), "ISO8859-1");
response.setHeader("Content-disposition", header);

The core code is just a few statements. Note that the order of reset and setContentType matters: reset must come first.
reset clears the buffer together with the status code and any headers already set, so a content type set before it would be discarded.
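
Another approach you may want to try is the filename* parameter defined by RFC 6266 and RFC 5987, which carries the name as percent-encoded UTF-8 and therefore keeps the header pure ASCII. The sketch below is an assumption-laden illustration, not the original author's method; ContentDispositionSketch and attachment are invented names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the original article.
class ContentDispositionSketch {

    // Builds an ASCII-only Content-Disposition value for a download.
    static String attachment(String fileName) throws UnsupportedEncodingException {
        String encoded = URLEncoder.encode(fileName, "UTF-8")
                .replace("+", "%20")  // URLEncoder writes spaces as '+'
                .replace("*", "%2A"); // URLEncoder leaves '*' unescaped
        return "attachment; filename*=UTF-8''" + encoded;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // A file name made of two Chinese characters, a space and an ASCII part.
        System.out.println(attachment("\u4e2d\u6587 a.txt"));
    }
}
```

With the variable from the code above, this would be used as response.setHeader("Content-Disposition", ContentDispositionSketch.attachment(picName)). Older browsers may not understand filename*, so treat it as an option to test rather than a guaranteed fix.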

 

My additions:

-------------------------------------

8. Garbled Chinese in audio and video tags

When we use the audio and video tags in HTML, playback may fail. Debugging shows that a path containing Chinese characters becomes garbled when it is passed between the frontend and the server. From what I found, the Tomcat server does not use the correct decoding when it parses the request URI and its parameters. Tomcat's default decoding here is ISO-8859-1 (this is the default in Tomcat 7 and earlier), so we need to modify the server.xml file in the conf folder of the Tomcat server. The specific steps are as follows:

Open the conf folder of the Tomcat server, open the server.xml file, and add the URIEncoding attribute to the Connector, as follows:

<Connector port="8080" protocol="HTTP/1.1"  
        connectionTimeout="20000"  
        redirectPort="8443" 
        URIEncoding="UTF-8"
 />  
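
On the sending side, it can also help to emit the media URL in ASCII-only form, with the Chinese characters percent-encoded as UTF-8, which is the form that URIEncoding="UTF-8" tells Tomcat to expect. A minimal sketch follows; the class name, the method name and the /media/ path are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of the original article.
class MediaUrlSketch {

    // Returns the path with non-ASCII characters written as UTF-8
    // percent-escapes, suitable for a src attribute.
    static String toAsciiPath(String path) throws URISyntaxException {
        return new URI(null, null, path, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // A made-up media path containing two Chinese characters.
        System.out.println(toAsciiPath("/media/\u4e2d\u6587.mp3"));
    }
}
```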

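9. Database encoding

In real-world development, garbled characters also appear if the database encoding differs from the encoding used in the project. To avoid garbled Chinese, make sure the database, the server, and the project's frontend and backend all use the same character encoding.

If any reader knows of other solutions to garbled characters, or of other situations in which they occur, you are welcome to leave a comment.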

