Analysis

JDBC

With JDBC, we usually set the following in the connection string:

useUnicode=true&characterEncoding=utf8

This specifies the encoding used when talking to the database: whenever the developer reads or updates data, it is encoded as utf8.

MySQL's utf8 differs from UTF-8 in the usual sense: it stores at most 3 bytes per character, so MySQL needs utf8mb4 to store emoji.
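To make the 3-byte limit concrete, here is a minimal sketch that counts how many bytes standard UTF-8 needs for a single character. The class and method names are made up for illustration and are not part of JDBC or MySQL.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8Width {

    /** Returns the number of bytes standard UTF-8 needs for the first code point of s. */
    static int utf8Bytes(String s) {
        int codePoint = s.codePointAt(0);
        String single = new String(Character.toChars(codePoint));
        return single.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the result does not depend on the source file encoding.
        System.out.println(utf8Bytes("a"));            // ASCII letter
        System.out.println(utf8Bytes("\u4E2D"));       // CJK character U+4E2D
        System.out.println(utf8Bytes("\uD83D\uDE00")); // emoji U+1F600 (a surrogate pair in Java)
    }
}
```

The emoji needs 4 bytes, one more than MySQL's utf8 can hold per character, which is why the column and connection charset have to be utf8mb4.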

Servlet

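In the usual sense a JSP is a servlet too (each page is compiled into one). We are usually asked to declare the following at the top of a JSP: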

<%@ page contentType="text/html; charset=utf-8" %>
<%@ page language="java" pageEncoding="UTF-8" %>

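Note that pageEncoding is set as well.

The JSP compiler uses pageEncoding when it reads the JSP file and generates the corresponding servlet class.

Thoughtful readers may wonder: how does the encoding we use when writing code map to the class files produced by the Java compiler? Will a mismatch there cause problems too?

Yes! We usually set it in Maven, for example: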

<build> 
    <plugins> 
        <plugin> 
            <groupId>org.apache.maven.plugins</groupId> 
            <artifactId>maven-compiler-plugin</artifactId> 
            <version>${maven-compiler-plugin.version}</version> 
            <configuration> 
                <source>1.7</source> 
                <target>1.7</target> 
                <encoding>UTF-8</encoding> 
            </configuration> 
        </plugin> 
    </plugins> 
</build>

When it is not set, the platform default is used, which is given by:

System.getProperty("file.encoding")
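To see what that default is on a given machine, a small sketch like the following can help. The class name is hypothetical. On JDK 18 and later the default is UTF-8 unless overridden, while older JDKs derive it from the operating system locale, so the output varies between environments.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class DefaultEncodingInfo {

    /** Describes the platform default encoding as seen by the running JVM. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once locally and once on the build server is a quick way to find out whether the two disagree.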

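With Java compilation covered, here are the encoding-related methods in the servlet API. The first is on the request, the second on the response: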

/**
 * Overrides the name of the character encoding used in the body of this
 * request. This method must be called prior to reading request parameters
 * or reading input using getReader().
 *
 * @param env
 *            a <code>String</code> containing the name of the character
 *            encoding.
 * @throws java.io.UnsupportedEncodingException
 *             if this is not a valid encoding
 */
public void setCharacterEncoding(String env)
        throws java.io.UnsupportedEncodingException;

/**
 * Sets the character encoding (MIME charset) of the response being sent to
 * the client, for example, to UTF-8. If the character encoding has already
 * been set by {@link #setContentType} or {@link #setLocale}, this method
 * overrides it. Calling {@link #setContentType} with the
 * <code>String</code> of <code>text/html</code> and calling this method
 * with the <code>String</code> of <code>UTF-8</code> is equivalent with
 * calling <code>setContentType</code> with the <code>String</code> of
 * <code>text/html; charset=UTF-8</code>.
 * <p>
 * This method can be called repeatedly to change the character encoding.
 * This method has no effect if it is called after <code>getWriter</code>
 * has been called or after the response has been committed.
 * <p>
 * Containers must communicate the character encoding used for the servlet
 * response's writer to the client if the protocol provides a way for doing
 * so. In the case of HTTP, the character encoding is communicated as part
 * of the <code>Content-Type</code> header for text media types. Note that
 * the character encoding cannot be communicated via HTTP headers if the
 * servlet does not specify a content type; however, it is still used to
 * encode text written via the servlet response's writer.
 *
 * @param charset
 *            a String specifying only the character set defined by IANA
 *            Character Sets
 *            (http://www.iana.org/assignments/character-sets)
 * @see #setContentType #setLocale
 * @since 2.4
 */
public void setCharacterEncoding(String charset);

These two methods clearly define the encoding of the request and the encoding of the response.
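Why the request encoding has to be set before the parameters are read can be sketched without a servlet container. The example below uses java.net.URLDecoder as a stand-in for the container's parameter parsing (an assumption for illustration, not how any particular container is implemented), and the class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical stand-in for a container decoding form parameters.
final class ParamDecodingDemo {

    /** Percent-encodes a value the way a browser would for the given charset. */
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Decodes a percent-encoded value with the charset the server believes is in use. */
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587";        // two CJK characters
        String raw = encode(original, "UTF-8");  // what goes over the wire
        System.out.println(raw);
        // Same bytes, two different interpretations on the server side.
        System.out.println(decode(raw, "UTF-8").equals(original));
        System.out.println(decode(raw, "ISO-8859-1").equals(original));
    }
}
```

Once the bytes have been decoded with the wrong charset the parameter value is already garbled, which matches the Javadoc's requirement that setCharacterEncoding be called before reading parameters.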

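In many cases, to avoid garbled text, we use Spring's CharacterEncodingFilter. Its core code is as follows.

What magic does this filter have? Why does adding it mean text is "never" garbled?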

@Override
protected void doFilterInternal(
      HttpServletRequest request, HttpServletResponse response, FilterChain filterChain)
      throws ServletException, IOException {
 
   String encoding = getEncoding();
   if (encoding != null) {
      if (isForceRequestEncoding() || request.getCharacterEncoding() == null) {
         request.setCharacterEncoding(encoding);
      }
      if (isForceResponseEncoding()) {
         response.setCharacterEncoding(encoding);
      }
   }
   filterChain.doFilter(request, response);
}

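That's all? It merely sets an encoding. So for a real request, what matters most is that the client sets the correct encoding. [Setting the encoding here only guards against encoding errors caused by no encoding having been set.]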

For example, Tomcat obtains the encoding like this:

/**
 * Get the character encoding used for this request.
 *
 * @return The value set via {@link #setCharacterEncoding(String)} or if no
 *         call has been made to that method try to obtain if from the
 *         content type.
 *
 * @throws UnsupportedEncodingException If the user agent has specified an
 *         invalid character encoding
 */
public Charset getCharset() throws UnsupportedEncodingException {
    if (charset == null) {
        getCharacterEncoding();
        if (characterEncoding != null) {
            charset = B2CConverter.getCharset(characterEncoding);
        }
    }
 
    return charset;
}
 
/**
 * Get the character encoding used for this request.
 *
 * @return The value set via {@link #setCharacterEncoding(String)} or if no
 *         call has been made to that method try to obtain if from the
 *         content type.
 */
public String getCharacterEncoding() {
    if (characterEncoding == null) {
        characterEncoding = getCharsetFromContentType(getContentType());
    }
 
    return characterEncoding;
}

Clearly, when no charset has been set explicitly, it is taken from the Content-Type first [but not every client sets one].
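The lookup order can be sketched as below. This is a simplified illustration and not Tomcat's getCharsetFromContentType. The class, the method names, and the caller-supplied fallback are all assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Simplified, hypothetical sketch; not Tomcat's implementation.
final class ContentTypeCharset {

    /** Extracts the charset parameter from a Content-Type value, or returns null. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding quotes, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Explicit setting wins, then the Content-Type header, then the fallback. */
    static Charset resolve(String explicit, String contentType, Charset fallback) {
        String name = explicit != null ? explicit : charsetFromContentType(contentType);
        return name != null ? Charset.forName(name) : fallback;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(resolve(null, "application/x-www-form-urlencoded", StandardCharsets.ISO_8859_1));
    }
}
```

The second call shows the risky case: the client sent no charset, so whatever fallback the server picks decides how the bytes are read.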

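So if the client sends a GBK stream and the backend then applies UTF-8..... The reverse holds too, of course. Take this Taobao developer page, for example: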

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK">
<link rel="stylesheet" type="text/css" href="http://assets.taobaocdn.com/css/app/magic/magic.css"/>
</head>
<body>
<div class="stat-error">
<p>鎮ㄦ墍璇锋眰鐨勬湇鍔′笉瀛樺湪銆�</p>
</div>
</body>
</html>

But once we switch the decoding to UTF-8, the text becomes readable.
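The effect can be reproduced in a few lines, assuming (as the page above suggests) that the bytes were produced as UTF-8 but declared as GBK. The class name is hypothetical, and the sketch relies on the GBK charset being available in the running JDK, which is the case for standard full JDK distributions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class MojibakeDemo {

    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeGbk(byte[] bytes) {
        // GBK is an extended charset; lookup fails if the runtime does not ship it.
        return new String(bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587";       // two CJK characters
        byte[] wire = utf8Bytes(original);       // what the server actually sent
        System.out.println(decodeGbk(wire));     // what a client sees when told "charset=GBK"
        System.out.println(decodeUtf8(wire));    // what it sees after switching to UTF-8
    }
}
```

The bytes never change between the two println calls; only the charset used to interpret them does.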

IO

Encoding also matters when we read files or convert strings from one encoding to another.

The most common case is String's getBytes:

public byte[] getBytes(Charset charset) {
    if (charset == null) throw new NullPointerException();
    return StringCoding.encode(charset, value, 0, value.length);
}
 
/**
 * Encodes this {@code String} into a sequence of bytes using the
 * platform's default charset, storing the result into a new byte array.
 *
 * <p> The behavior of this method when this string cannot be encoded in
 * the default charset is unspecified.  The {@link
 * java.nio.charset.CharsetEncoder} class should be used when more control
 * over the encoding process is required.
 *
 * @return  The resultant byte array
 *
 * @since      JDK1.1
 */
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}

As you can see, when no explicit charset is given:

static byte[] encode(char[] ca, int off, int len) {
    String csn = Charset.defaultCharset().name();
    try {
        // use charset name encode() variant which provides caching.
        return encode(csn, ca, off, len);
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    try {
        return encode("ISO-8859-1", ca, off, len);
    } catch (UnsupportedEncodingException x) {
        // If this code is hit during VM initialization, MessageUtils is
        // the only way we will be able to get any kind of error message.
        MessageUtils.err("ISO-8859-1 charset not available: "
                         + x.toString());
        // If we can not find ISO-8859-1 (a required encoding) then things
        // are seriously wrong with the installation.
        System.exit(1);
        return null;
    }
}

it simply uses the default encoding:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            String csn = AccessController.doPrivileged(
                new GetPropertyAction("file.encoding"));
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

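In other words, the result depends on system configuration. If the default charset cannot be looked up, defaultCharset() falls back to UTF-8, and if encoding with the default charset name fails, encode() falls back to ISO-8859-1. So specify the charset explicitly whenever you can.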
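A minimal sketch of that advice, with hypothetical class and method names: every conversion between String and bytes names its charset, both in memory and when going through a file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class ExplicitCharset {

    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);       // never the no-arg getBytes()
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8); // never new String(bytes)
    }

    /** Writes text to a temporary file and reads it back, naming UTF-8 on both sides. */
    static String roundTripThroughFile(String text) {
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(tmp, toUtf8(text));
                return fromUtf8(Files.readAllBytes(tmp));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "encoding: \u7F16\u7801";
        System.out.println(roundTripThroughFile(text).equals(text));
    }
}
```

Because no call depends on file.encoding, the result is the same whatever the machine's locale is.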

