Getting to Know Encoding (Part 6)

Background

When building Java applications, we usually run into encoding in the following places:

  1. JDBC
  2. Servlet
  3. IO

Analysis

JDBC

For JDBC we usually set the following parameters in the connection string:

useUnicode=true&characterEncoding=utf8

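This specifies the encoding used on the database connection. Whenever the developer reads or updates data, utf8 is used to encode it.

MySQL's utf8 differs from UTF-8 in the usual sense: it stores at most 3 bytes per character, so emoji can only be stored with MySQL's utf8mb4.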
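The byte counts behind that limitation can be checked with a small sketch. The class and method names are illustrative only; the point is that a character outside the Basic Multilingual Plane needs 4 bytes in UTF-8, which is more than a 3-byte-per-character column encoding can hold.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteLengths {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String han = "\u4e2d";          // a CJK character inside the Basic Multilingual Plane
        String emoji = "\uD83D\uDE00";  // U+1F600 (grinning face), outside the BMP
        System.out.println(utf8Length(han));    // 3 bytes: fits in MySQL utf8
        System.out.println(utf8Length(emoji));  // 4 bytes: needs utf8mb4
    }
}
```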

Servlet

Broadly speaking, a JSP is also a servlet. We usually require the following declarations at the top of a JSP:

<%@ page contentType="text/html; charset=utf-8" %>
<%@ page language="java" pageEncoding="UTF-8" %>

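Note that pageEncoding is set as well.

When the JSP compiler reads the JSP file and generates the corresponding servlet class, it uses pageEncoding to decode the file.

A thoughtful reader may wonder how the encoding of the source code we write maps to the class files the Java compiler generates. If the compiler assumes a different encoding from the one the source was saved in, will that cause problems too?

Yes! That is why we usually configure it in Maven, for example: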

<build> 
    <plugins> 
        <plugin> 
            <groupId>org.apache.maven.plugins</groupId> 
            <artifactId>maven-compiler-plugin</artifactId> 
            <version>${maven-compiler-plugin.version}</version> 
            <configuration> 
                <source>1.7</source> 
                <target>1.7</target> 
                <encoding>UTF-8</encoding> 
            </configuration> 
        </plugin> 
    </plugins> 
</build> 

When it is not set, the platform default is used, which can be read with:

System.getProperty("file.encoding")
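To see which default a given machine would fall back to, a tiny probe like the one below can help. This is only a sketch: the class name is made up, and the values printed depend entirely on the JVM version, the operating system locale, and any -Dfile.encoding flag.

```java
import java.nio.charset.Charset;

class DefaultEncodingProbe {
    /** Describes the encoding-related defaults of the running JVM. */
    static String describe() {
        String prop = System.getProperty("file.encoding"); // may vary per machine
        Charset def = Charset.defaultCharset();             // what the JVM actually uses
        return "file.encoding=" + prop + ", defaultCharset=" + def.name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it on a build server and on a developer laptop is a quick way to spot a mismatch before it shows up as garbled string literals.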

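With Java compilation covered, let's look at the encoding-related methods in the servlet API. There is one on the request and one on the response: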

/**
 * Overrides the name of the character encoding used in the body of this
 * request. This method must be called prior to reading request parameters
 * or reading input using getReader().
 *
 * @param env
 *            a <code>String</code> containing the name of the character
 *            encoding.
 * @throws java.io.UnsupportedEncodingException
 *             if this is not a valid encoding
 */
public void setCharacterEncoding(String env)
        throws java.io.UnsupportedEncodingException;
/**
 * Sets the character encoding (MIME charset) of the response being sent to
 * the client, for example, to UTF-8. If the character encoding has already
 * been set by {@link #setContentType} or {@link #setLocale}, this method
 * overrides it. Calling {@link #setContentType} with the
 * <code>String</code> of <code>text/html</code> and calling this method
 * with the <code>String</code> of <code>UTF-8</code> is equivalent with
 * calling <code>setContentType</code> with the <code>String</code> of
 * <code>text/html; charset=UTF-8</code>.
 * <p>
 * This method can be called repeatedly to change the character encoding.
 * This method has no effect if it is called after <code>getWriter</code>
 * has been called or after the response has been committed.
 * <p>
 * Containers must communicate the character encoding used for the servlet
 * response's writer to the client if the protocol provides a way for doing
 * so. In the case of HTTP, the character encoding is communicated as part
 * of the <code>Content-Type</code> header for text media types. Note that
 * the character encoding cannot be communicated via HTTP headers if the
 * servlet does not specify a content type; however, it is still used to
 * encode text written via the servlet response's writer.
 *
 * @param charset
 *            a String specifying only the character set defined by IANA
 *            Character Sets
 *            (http://www.iana.org/assignments/character-sets)
 * @see #setContentType #setLocale
 * @since 2.4
 */
public void setCharacterEncoding(String charset);

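These two methods clearly define the encoding of the request and the encoding of the response.

In many cases, to avoid garbled text, we use the CharacterEncodingFilter provided by Spring.

What magic does this filter have? Why does adding it mean the text "won't" be garbled?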

@Override
protected void doFilterInternal(
      HttpServletRequest request, HttpServletResponse response, FilterChain filterChain)
      throws ServletException, IOException {
 
   String encoding = getEncoding();
   if (encoding != null) {
      if (isForceRequestEncoding() || request.getCharacterEncoding() == null) {
         request.setCharacterEncoding(encoding);
      }
      if (isForceResponseEncoding()) {
         response.setCharacterEncoding(encoding);
      }
   }
   filterChain.doFilter(request, response);
}

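That's all! It merely sets an encoding. So for a real request, what matters most is that the client sets the correct encoding. (The encoding set here only guards against problems that arise when no encoding was specified at all.)

For example, Tomcat obtains the encoding as follows: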
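The request branch of the filter can be modelled without any servlet dependency. The sketch below is a hypothetical stand-alone helper, not part of Spring: it only reproduces the condition shown in doFilterInternal, where `current` stands for request.getCharacterEncoding(), `configured` for getEncoding(), and `force` for isForceRequestEncoding().

```java
class EncodingDecision {
    /**
     * Returns the encoding the request ends up with after the filter's
     * request branch: the configured value wins only when forced or when
     * the request carries no encoding of its own.
     */
    static String resolveRequestEncoding(String current, String configured, boolean force) {
        if (configured != null && (force || current == null)) {
            return configured;
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(resolveRequestEncoding(null, "UTF-8", false));  // UTF-8
        System.out.println(resolveRequestEncoding("GBK", "UTF-8", false)); // GBK
        System.out.println(resolveRequestEncoding("GBK", "UTF-8", true));  // UTF-8
    }
}
```

The second case is the interesting one: without the force flag, a client that declares GBK keeps GBK, so the filter acts as a fallback rather than an override.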

/**
 * Get the character encoding used for this request.
 *
 * @return The value set via {@link #setCharacterEncoding(String)} or if no
 *         call has been made to that method try to obtain if from the
 *         content type.
 *
 * @throws UnsupportedEncodingException If the user agent has specified an
 *         invalid character encoding
 */
public Charset getCharset() throws UnsupportedEncodingException {
    if (charset == null) {
        getCharacterEncoding();
        if (characterEncoding != null) {
            charset = B2CConverter.getCharset(characterEncoding);
        }
    }
 
    return charset;
}
 
/**
 * Get the character encoding used for this request.
 *
 * @return The value set via {@link #setCharacterEncoding(String)} or if no
 *         call has been made to that method try to obtain if from the
 *         content type.
 */
public String getCharacterEncoding() {
    if (characterEncoding == null) {
        characterEncoding = getCharsetFromContentType(getContentType());
    }
 
    return characterEncoding;
}

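Clearly, when no charset has been set explicitly, it is taken from the Content-Type first. (Not every client sets one, though.)

So what happens if the client sends a GBK stream while the backend is set to UTF-8? The text gets garbled, and the same holds the other way round. Take this page from the Taobao developer site, for example: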
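How a charset might be pulled out of a Content-Type value can be sketched as follows. This is a simplified, hypothetical helper written for illustration; it is not Tomcat's getCharsetFromContentType and ignores edge cases a real parser has to handle.

```java
import java.util.Locale;

class ContentTypeCharset {
    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type itself
            String p = parts[i].trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=GBK")); // GBK
        System.out.println(charsetOf("application/json"));       // null
    }
}
```

The null result in the second call is exactly the situation where a fallback such as CharacterEncodingFilter becomes relevant.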

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK">
<link rel="stylesheet" type="text/css" href="http://assets.taobaocdn.com/css/app/magic/magic.css"/>
</head>
<body>
<div class="stat-error">
<p>鎮ㄦ墍璇锋眰鐨勬湇鍔′笉瀛樺湪銆�</p>
</div>
</body>
</html>

But when we switch the decoding to UTF-8, the text becomes readable.
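The effect can be reproduced in a few lines. The sketch below assumes the message in the page is the UTF-8 encoding of 您所请求的服务不存在 ("the service you requested does not exist"), written with Unicode escapes so the example does not depend on the source file's own encoding. It also assumes the GBK charset is available in the running JVM, which is typical for a full JDK but not guaranteed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Assumption: the JVM ships the GBK charset (part of the extended charsets).
    private static final Charset GBK = Charset.forName("GBK");

    /** Encodes as UTF-8 but decodes as GBK: the mismatch that produces garbled text. */
    static String garble(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), GBK);
    }

    /** Encodes and decodes with the same charset: the text survives. */
    static String roundTrip(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u60A8\u6240\u8BF7\u6C42\u7684\u670D\u52A1\u4E0D\u5B58\u5728";
        System.out.println(garble(original));    // unreadable characters
        System.out.println(roundTrip(original)); // the original message
    }
}
```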

IO

Encoding also comes into play when we read files or convert strings between encodings.

The most common entry point is String's getBytes:

public byte[] getBytes(Charset charset) {
    if (charset == null) throw new NullPointerException();
    return StringCoding.encode(charset, value, 0, value.length);
}
 
/**
 * Encodes this {@code String} into a sequence of bytes using the
 * platform's default charset, storing the result into a new byte array.
 *
 * <p> The behavior of this method when this string cannot be encoded in
 * the default charset is unspecified.  The {@link
 * java.nio.charset.CharsetEncoder} class should be used when more control
 * over the encoding process is required.
 *
 * @return  The resultant byte array
 *
 * @since      JDK1.1
 */
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}

As you can see, when no explicit charset is passed, the following method is called:

static byte[] encode(char[] ca, int off, int len) {
    String csn = Charset.defaultCharset().name();
    try {
        // use charset name encode() variant which provides caching.
        return encode(csn, ca, off, len);
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    try {
        return encode("ISO-8859-1", ca, off, len);
    } catch (UnsupportedEncodingException x) {
        // If this code is hit during VM initialization, MessageUtils is
        // the only way we will be able to get any kind of error message.
        MessageUtils.err("ISO-8859-1 charset not available: "
                         + x.toString());
        // If we can not find ISO-8859-1 (a required encoding) then things
        // are seriously wrong with the installation.
        System.exit(1);
        return null;
    }
}

In other words, it uses the default charset:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            String csn = AccessController.doPrivileged(
                new GetPropertyAction("file.encoding"));
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

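So it all depends on the system configuration. If the default charset turns out to be unusable, getBytes() ultimately falls back to ISO-8859-1, so specify the charset explicitly whenever possible.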
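A short sketch shows why the explicit form is safer. The class name is illustrative; the behaviour relied on is that String.getBytes(Charset) replaces characters the target charset cannot represent with that charset's replacement byte, which for ISO-8859-1 is '?'.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587"; // two CJK characters
        System.out.println(utf8(s).length);         // 6: three bytes per character
        System.out.println(latin1(s).length);       // 2: one byte per character
        System.out.println((char) latin1(s)[0]);    // ?: the original character is lost
    }
}
```

With an explicit charset the result is the same on every machine, whereas a bare getBytes() could silently produce either of these outcomes depending on the platform default.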


Reposted from my.oschina.net/qixiaobo025/blog/1816334