Background
When building Java applications, we usually run into encoding issues in the following scenarios:
- jdbc
- servlet
- io
Analysis
jdbc
For JDBC, we usually set the following parameters in the connection string:
useUnicode=true&characterEncoding=utf8
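This specifies the encoding used for the database connection, so data is encoded as UTF-8 whenever the developer reads or updates it.
Note that MySQL's utf8 differs from UTF-8 in the usual sense: MySQL's utf8 stores at most 3 bytes per character, so utf8mb4 is required to store emoji.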
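To make the utf8 versus utf8mb4 difference concrete, here is a minimal sketch that checks how many bytes a character needs in standard UTF-8. The class and method names (Utf8mb4Check, utf8Length, fitsThreeByteUtf8) are made up for illustration and are not part of any MySQL driver.

```java
import java.nio.charset.StandardCharsets;

class Utf8mb4Check {
    // Number of bytes the string occupies when encoded as standard UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if every code point lies in the Basic Multilingual Plane, i.e. needs
    // at most 3 bytes in UTF-8 and therefore fits MySQL's 3-byte utf8.
    static boolean fitsThreeByteUtf8(String s) {
        return s.codePoints().allMatch(cp -> cp <= 0xFFFF);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, written as a surrogate pair
        System.out.println("UTF-8 bytes: " + utf8Length(emoji));
        System.out.println("fits 3-byte utf8: " + fitsThreeByteUtf8(emoji));
    }
}
```

A check like this could be run before an insert to decide whether a column must be utf8mb4, although in practice switching the column and connection to utf8mb4 is the simpler choice.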
servlet
Broadly speaking, a JSP is also a servlet. We usually require the following directives at the top of a JSP:
<%@ page contentType="text/html; charset=utf-8" %>
<%@ page language="java" pageEncoding="UTF-8" %>
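Note that pageEncoding is set as well.
When the JSP compiler reads the JSP file and generates the corresponding servlet class, it uses pageEncoding to decode the file.
Thoughtful readers may ask: how does the encoding we write our source code in map to the class files the Java compiler produces? If the two encodings differ, will that cause problems too?
Yes! We usually configure this in Maven, for example: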
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>${maven-compiler-plugin.version}</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
        <encoding>UTF-8</encoding>
      </configuration>
    </plugin>
  </plugins>
</build>
Of course, when no encoding is configured, the platform default is used, which is given by:
System.getProperty("file.encoding")
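As a rough illustration of what goes wrong when the source file encoding and the compiler's encoding disagree, the sketch below writes text with one charset and reads it back with another. The class SourceEncodingMismatch and its reinterpret helper are hypothetical names; the string literal uses a Unicode escape so the example itself does not depend on the file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceEncodingMismatch {
    // Encode text with one charset, then decode the same bytes with another,
    // which is roughly what happens when a file is saved as one encoding
    // and read as a different one.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        String literal = "\u00E9"; // the single character e-acute
        String misread = reinterpret(literal,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // The two UTF-8 bytes are read as two separate Latin-1 characters.
        System.out.println(literal.length() + " char became " + misread.length());
    }
}
```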
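With Java compilation covered, let's look at the encoding-related methods in the Servlet API. The first one below belongs to the request, the second to the response: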
/**
* Overrides the name of the character encoding used in the body of this
* request. This method must be called prior to reading request parameters
* or reading input using getReader().
*
* @param env
* a <code>String</code> containing the name of the character
* encoding.
* @throws java.io.UnsupportedEncodingException
* if this is not a valid encoding
*/
public void setCharacterEncoding(String env)
throws java.io.UnsupportedEncodingException;
/**
* Sets the character encoding (MIME charset) of the response being sent to
* the client, for example, to UTF-8. If the character encoding has already
* been set by {@link #setContentType} or {@link #setLocale}, this method
* overrides it. Calling {@link #setContentType} with the
* <code>String</code> of <code>text/html</code> and calling this method
* with the <code>String</code> of <code>UTF-8</code> is equivalent with
* calling <code>setContentType</code> with the <code>String</code> of
* <code>text/html; charset=UTF-8</code>.
* <p>
* This method can be called repeatedly to change the character encoding.
* This method has no effect if it is called after <code>getWriter</code>
* has been called or after the response has been committed.
* <p>
* Containers must communicate the character encoding used for the servlet
* response's writer to the client if the protocol provides a way for doing
* so. In the case of HTTP, the character encoding is communicated as part
* of the <code>Content-Type</code> header for text media types. Note that
* the character encoding cannot be communicated via HTTP headers if the
* servlet does not specify a content type; however, it is still used to
* encode text written via the servlet response's writer.
*
* @param charset
* a String specifying only the character set defined by IANA
* Character Sets
* (http://www.iana.org/assignments/character-sets)
* @see #setContentType #setLocale
* @since 2.4
*/
public void setCharacterEncoding(String charset);
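These two methods make it explicit that the request encoding and the response encoding are specified separately.
In many cases, to avoid garbled text, we rely on Spring's CharacterEncodingFilter to set them.
What magic does this filter have? Why does adding it mean text "won't" be garbled?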
@Override
protected void doFilterInternal(
        HttpServletRequest request, HttpServletResponse response, FilterChain filterChain)
        throws ServletException, IOException {
    String encoding = getEncoding();
    if (encoding != null) {
        if (isForceRequestEncoding() || request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(encoding);
        }
        if (isForceResponseEncoding()) {
            response.setCharacterEncoding(encoding);
        }
    }
    filterChain.doFilter(request, response);
}
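That's all! It merely sets an encoding. So for a real request, what matters most is that the client uses and declares the correct encoding (the encoding set here only guards against encoding errors when none was declared).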
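The request-side condition in the filter can be modelled without the Servlet API. This is only a sketch of the decision logic: EncodingDecision and effectiveRequestEncoding are hypothetical names, with declared standing in for request.getCharacterEncoding(), configured for getEncoding(), and force for isForceRequestEncoding().

```java
class EncodingDecision {
    // Returns the encoding the request ends up labelled with. The configured
    // encoding is applied only when it is forced or when the request declared
    // none; otherwise the declared encoding is kept.
    static String effectiveRequestEncoding(String declared, String configured, boolean force) {
        if (configured != null && (force || declared == null)) {
            return configured;
        }
        return declared;
    }

    public static void main(String[] args) {
        System.out.println(effectiveRequestEncoding(null, "UTF-8", false));
        System.out.println(effectiveRequestEncoding("GBK", "UTF-8", false));
        System.out.println(effectiveRequestEncoding("GBK", "UTF-8", true));
    }
}
```

Note that in every case only the label changes. The bytes the client sent are never transcoded, which is why forcing UTF-8 onto a GBK body still produces garbled text.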
For example, Tomcat obtains the encoding as follows:
/**
* Get the character encoding used for this request.
*
* @return The value set via {@link #setCharacterEncoding(String)} or if no
* call has been made to that method try to obtain if from the
* content type.
*
* @throws UnsupportedEncodingException If the user agent has specified an
* invalid character encoding
*/
public Charset getCharset() throws UnsupportedEncodingException {
if (charset == null) {
getCharacterEncoding();
if (characterEncoding != null) {
charset = B2CConverter.getCharset(characterEncoding);
}
}
return charset;
}
/**
* Get the character encoding used for this request.
*
* @return The value set via {@link #setCharacterEncoding(String)} or if no
* call has been made to that method try to obtain if from the
* content type.
*/
public String getCharacterEncoding() {
if (characterEncoding == null) {
characterEncoding = getCharsetFromContentType(getContentType());
}
return characterEncoding;
}
Clearly, when no charset has been set explicitly, it is taken from the Content-Type (but not every client sets one).
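For readers who have not looked at how a charset is carried in Content-Type, here is a simplified sketch of extracting it. It is not Tomcat's actual getCharsetFromContentType implementation; ContentTypeCharset and charsetOf are hypothetical names, and the parsing ignores edge cases such as semicolons inside quoted values.

```java
import java.util.Locale;

class ContentTypeCharset {
    // Returns the value of the charset parameter in a Content-Type header
    // value, or null when the header is missing or carries no charset.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=GBK"));
        System.out.println(charsetOf("text/html"));
    }
}
```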
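So imagine what happens when the client sends a GBK stream and the backend then sets UTF-8... and of course the reverse is just as bad. Take Taobao's developer page as an example: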
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK">
<link rel="stylesheet" type="text/css" href="http://assets.taobaocdn.com/css/app/magic/magic.css"/>
</head>
<body>
<div class="stat-error">
<p>鎮ㄦ墍璇锋眰鐨勬湇鍔′笉瀛樺湪銆�</p>
</div>
</body>
</html>
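But once we switch the decoding to UTF-8, the message becomes readable: 您所请求的服务不存在。 ("The service you requested does not exist.")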
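The garbling in the page above can be reproduced in a few lines. This sketch assumes the response body was really UTF-8 while the page declared GBK, and that the JDK in use ships the GBK charset (standard JDK distributions do). MojibakeDemo and decode are hypothetical names, and the text is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decode raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // First two characters of the page's message, as Unicode escapes.
        String original = "\u60A8\u6240";
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong charset gives garbled text.
        System.out.println(decode(body, Charset.forName("GBK")));
        // Decoding with the charset the bytes were written in restores it.
        System.out.println(decode(body, StandardCharsets.UTF_8));
    }
}
```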
IO
Encoding also matters when we read files or convert strings between encodings.
The most common case is String's getBytes:
public byte[] getBytes(Charset charset) {
if (charset == null) throw new NullPointerException();
return StringCoding.encode(charset, value, 0, value.length);
}
/**
* Encodes this {@code String} into a sequence of bytes using the
* platform's default charset, storing the result into a new byte array.
*
* <p> The behavior of this method when this string cannot be encoded in
* the default charset is unspecified. The {@link
* java.nio.charset.CharsetEncoder} class should be used when more control
* over the encoding process is required.
*
* @return The resultant byte array
*
* @since JDK1.1
*/
public byte[] getBytes() {
return StringCoding.encode(value, 0, value.length);
}
As we can see, when no explicit encoding is passed:
static byte[] encode(char[] ca, int off, int len) {
String csn = Charset.defaultCharset().name();
try {
// use charset name encode() variant which provides caching.
return encode(csn, ca, off, len);
} catch (UnsupportedEncodingException x) {
warnUnsupportedCharset(csn);
}
try {
return encode("ISO-8859-1", ca, off, len);
} catch (UnsupportedEncodingException x) {
// If this code is hit during VM initialization, MessageUtils is
// the only way we will be able to get any kind of error message.
MessageUtils.err("ISO-8859-1 charset not available: "
+ x.toString());
// If we can not find ISO-8859-1 (a required encoding) then things
// are seriously wrong with the installation.
System.exit(1);
return null;
}
}
it actually uses the default encoding:
public static Charset defaultCharset() {
if (defaultCharset == null) {
synchronized (Charset.class) {
String csn = AccessController.doPrivileged(
new GetPropertyAction("file.encoding"));
Charset cs = lookup(csn);
if (cs != null)
defaultCharset = cs;
else
defaultCharset = forName("UTF-8");
}
}
return defaultCharset;
}
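So the result depends on the system configuration. If file.encoding names a charset that cannot be found, defaultCharset() falls back to UTF-8, and if encoding with the default charset fails, StringCoding.encode ends up using ISO-8859-1. So specify the encoding explicitly whenever possible.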
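A short sketch of why the explicit form is safer: the same string round-trips cleanly through UTF-8 but is silently damaged by ISO-8859-1, which cannot represent Chinese characters. ExplicitCharset and survivesRoundTrip are hypothetical names used only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // True if encoding and then decoding with the same charset returns the
    // original string, i.e. the charset can represent every character in it.
    static boolean survivesRoundTrip(String s, Charset charset) {
        return new String(s.getBytes(charset), charset).equals(s);
    }

    public static void main(String[] args) {
        String text = "\u4E2D"; // one Chinese character, as a Unicode escape

        // Always pass the charset instead of relying on file.encoding.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 bytes: " + utf8.length
                + ", round trip ok: " + survivesRoundTrip(text, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 bytes: " + latin1.length
                + ", round trip ok: " + survivesRoundTrip(text, StandardCharsets.ISO_8859_1));
    }
}
```

With ISO-8859-1 the unmappable character is replaced by a single '?' byte, so the loss happens without any exception being thrown.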