Java encoding and decoding process

On a recent project I sometimes ran into garbled Chinese text. I searched online and found plenty of material, but most of it only gives a fix, without explaining why the fix is needed or what principle it rests on.

The most typical case is the database connection URL, which we usually keep in db.properties on the classpath. Even though the Java code uses UTF-8, the JSP pages use UTF-8, and the database is set to UTF-8, Chinese text inserted into the database can still come out garbled. The usual fix is to append the encoding used by the connection, UTF-8, to the database connection URL. But why does that work?
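As a rough illustration of that fix, the sketch below loads a made-up db.properties entry and reads the URL back. The article does not name the database, so this assumes MySQL with Connector/J, whose URL accepts the `useUnicode` and `characterEncoding` parameters. The key `jdbc.url`, the host, the port, the schema name `demo` and the class name `DbUrlExample` are all placeholders.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class DbUrlExample {
    // Hypothetical db.properties content; host, port and schema are placeholders.
    static final String DB_PROPERTIES =
            "jdbc.url=jdbc:mysql://localhost:3306/demo?useUnicode=true&characterEncoding=UTF-8\n";

    // Parses the properties text and returns the connection URL.
    static String loadUrl() {
        Properties props = new Properties();
        try {
            props.load(new StringReader(DB_PROPERTIES));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty("jdbc.url");
    }

    public static void main(String[] args) {
        // The encoding travels with the URL, so the driver knows how to
        // convert Java strings to bytes on this connection.
        System.out.println(loadUrl());
    }
}
```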

Let's go through Java encoding: why we need encoding, which encodings exist, how to encode and decode, why Chinese text becomes garbled, and how to fix it.

1. Why encode

To answer this we have to go back to how a computer represents the symbols we humans understand, that is, the languages we use. There are many human languages and therefore a great many symbols, far more than one basic storage unit of the computer, the byte, can represent. The symbols therefore have to be split up or translated before the computer can handle our language. We can think of it this way: the only language the computer understands is English, and any other language must be translated into English before the computer can use it. That translation process is encoding.

In short, the reason for encoding is this: the smallest unit of storage in a computer is the byte, that is, 8 bits, which can represent only 256 values (0-255). Humans have far too many symbols for a single byte to represent them all.

To resolve this conflict a new data type, char, is needed, and going from char to byte requires encoding.
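The gap between char and byte can be seen with a small sketch. The class name `CharVsByte` and the helper `utf8Length` are made up for this example, and the Chinese character is written as a Unicode escape so the file compiles regardless of source encoding.

```java
import java.nio.charset.StandardCharsets;

class CharVsByte {
    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        char c = '\u4e2d'; // the character "中"
        System.out.println((int) c);          // 20013, far beyond 0-255
        System.out.println(Character.SIZE);   // 16 bits in a char
        System.out.println(Byte.SIZE);        // 8 bits in a byte
        System.out.println(utf8Length("A"));       // 1
        System.out.println(utf8Length("\u4e2d"));  // 3
    }
}
```

One char holds the value 20013, but storing it takes more than one byte, and how many bytes depends on the encoding chosen.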

2. Common encodings

Once we accept that translation is needed, how is it done? Computers offer many ways to translate; common ones are ASCII, ISO-8859-1, GB2312, GBK, UTF-8, and UTF-16. Each can be seen as a dictionary: it lays down conversion rules, and following those rules lets the computer represent our characters correctly. Several of today's encodings, such as GB2312, GBK, UTF-8, and UTF-16, can all represent the same character, so which one should we use to store characters? That depends on other factors, chiefly whether storage space or encoding efficiency matters more.
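To make the storage trade-off concrete, the following sketch prints how many bytes the same two Chinese characters need under each encoding listed above. The class name `EncodingSizes` and the method `encodedLength` are hypothetical. GB2312 and GBK are not guaranteed to be present on every Java runtime, so the sketch checks `Charset.isSupported` first and returns -1 when a charset is missing.

```java
import java.nio.charset.Charset;

class EncodingSizes {
    // Returns the encoded length in bytes, or -1 if this JVM lacks the charset.
    static int encodedLength(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // "中文", written as escapes
        String[] names = {"ISO-8859-1", "GB2312", "GBK", "UTF-8", "UTF-16"};
        for (String name : names) {
            System.out.println(name + ": " + encodedLength(text, name) + " bytes");
        }
    }
}
```

GB2312 and GBK use 2 bytes per Chinese character and UTF-8 uses 3. Java's UTF-16 encoder uses 2 bytes per character here plus a 2-byte byte-order mark. ISO-8859-1 reports 2 bytes only because each unmappable character is replaced by a single `?`.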

3. How to encode and decode

Take a String as an example:

String s = "This is a Chinese character string";
byte[] b = s.getBytes("UTF-8");
String n = new String(b, "UTF-8");

Encoding: converts the language users understand into a form the computer understands. It is generally used for transmission or storage, so it concerns the computer's operations rather than the user.

Decoding: interprets the bytes back into the language users understand.

4. Why Chinese text becomes garbled

First case: the text is encoded with a character set that does not recognize Chinese. This is relatively rare.

Second case: the text is encoded with one character set but decoded with another. This is more common. For example, the Java code encodes in UTF-8, but when the data reaches the database, the database decodes it as GBK, and the Chinese text comes out garbled.
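Both cases can be reproduced in a few lines. This is a minimal sketch with made-up names (`GarbledDemo` and its methods). It uses ISO-8859-1 as the mismatched charset instead of the GBK mentioned above, because ISO-8859-1 is guaranteed to exist on every JVM and maps every byte to exactly one char, which also makes the damage reversible.

```java
import java.nio.charset.StandardCharsets;

class GarbledDemo {
    // Case 1: the charset has no mapping for Chinese, so each char becomes '?'.
    static String encodeWithUnsupportingCharset(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Case 2: encode with UTF-8 but decode with a different charset.
    static String decodeWithWrongCharset(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Undo case 2: recover the original bytes, then decode them correctly.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // "中文", written as escapes
        System.out.println(encodeWithUnsupportingCharset(text)); // ??
        String garbled = decodeWithWrongCharset(text);
        System.out.println(garbled.length());                    // 6
        System.out.println(repair(garbled).equals(text));        // true
    }
}
```

In the first case the information is lost for good, since the bytes stored are just `?`. In the second case the bytes are intact and only their interpretation is wrong, so the text can be recovered if the bytes are decoded again with the right charset.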

Using Charset

Charset charset=Charset.forName("UTF-8");
ByteBuffer byteBuffer=charset.encode(string);
CharBuffer charBuffer=charset.decode(byteBuffer);
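A runnable version of the three lines above might look like the sketch below. The class name `CharsetRoundTrip` and the method `roundTrip` are hypothetical, and the input text is supplied as a parameter.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CharsetRoundTrip {
    // Encodes the text with the named charset, then decodes the bytes again.
    static String roundTrip(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        ByteBuffer byteBuffer = charset.encode(text);        // chars -> bytes
        CharBuffer charBuffer = charset.decode(byteBuffer);  // bytes -> chars
        return charBuffer.toString();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // "中文", written as escapes
        System.out.println(roundTrip(text, "UTF-8").equals(text)); // true
        System.out.println(roundTrip(text, "US-ASCII"));           // ??
    }
}
```

`Charset.encode` silently replaces characters the charset cannot represent, which is why the US-ASCII round trip returns question marks instead of failing.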

5. How Java encodes and decodes

String name = "I am 小明";
// toHex is a helper not shown in the article; presumably it prints the values in hexadecimal
toHex(name.toCharArray());
try {
    byte[] iso8859 = name.getBytes("ISO-8859-1");
    toHex(iso8859);
    byte[] gb2312 = name.getBytes("GB2312");
    toHex(gb2312);
    byte[] gbk = name.getBytes("GBK");
    toHex(gbk);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
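The toHex helper called above is not shown in the article. One possible stand-in is sketched here, with a made-up class name `HexDump`. It returns the hex text instead of printing it, which is an assumption and not necessarily how the original helper behaved.

```java
import java.nio.charset.StandardCharsets;

class HexDump {
    // Formats each char as a 4-digit hex value (one UTF-16 code unit each).
    static String toHex(char[] chars) {
        StringBuilder sb = new StringBuilder();
        for (char c : chars) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) c));
        }
        return sb.toString();
    }

    // Formats each byte as a 2-digit hex value; the mask removes the sign.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "I am \u5c0f\u660e"; // "I am 小明", written as escapes
        System.out.println(toHex(name.toCharArray()));
        System.out.println(toHex(name.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(toHex(name.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Comparing the lines shows that the ASCII part of the string is identical in every encoding, while the two Chinese characters collapse to `3f` (`?`) under ISO-8859-1.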
Next, look at what getBytes does internally:

String str = "小米";
byte[] b = str.getBytes("UTF-8");

String.getBytes rejects a null charset name and hands the work to StringCoding.encode:

public byte[] getBytes(String charsetName)
        throws UnsupportedEncodingException {
    if (charsetName == null) throw new NullPointerException();
    return StringCoding.encode(charsetName, value, 0, value.length);
}

StringCoding.encode reuses a previously stored StringEncoder when it matches the requested charset name; otherwise it looks up the Charset and creates a new encoder:

static byte[] encode(String charsetName, char[] ca, int off, int len)
        throws UnsupportedEncodingException
{
    StringEncoder se = deref(encoder);
    String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
    if ((se == null) || !(csn.equals(se.requestedCharsetName())
                          || csn.equals(se.charsetName()))) {
        se = null;
        try {
            Charset cs = lookupCharset(csn); // create the Charset instance
            if (cs != null)
                se = new StringEncoder(cs, csn);
        } catch (IllegalCharsetNameException x) {}
        if (se == null)
            throw new UnsupportedEncodingException (csn);
        set(encoder, se);
    }
    return se.encode(ca, off, len);
}

lookupCharset returns the Charset instance, or null if the name is not supported:

private static Charset lookupCharset(String csn) {
    if (Charset.isSupported(csn)) {
        try {
            return Charset.forName(csn);
        } catch (UnsupportedCharsetException x) {
            throw new Error(x);
        }
    }
    return null;
}

The StringEncoder constructor creates a CharsetEncoder that replaces malformed input and unmappable characters instead of reporting an error:

private StringEncoder(Charset cs, String rcn) {
    this.requestedCharsetName = rcn;
    this.cs = cs;
    this.ce = cs.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    this.isTrusted = (cs.getClass().getClassLoader0() == null);
}
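The REPLACE setting in the constructor above explains why getBytes never fails on characters the charset cannot represent. The sketch below compares it with CodingErrorAction.REPORT using the public CharsetEncoder API. The class name `ReplaceVsReport` and its method names are made up for illustration, and ISO-8859-1 is chosen only because it cannot represent Chinese.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ReplaceVsReport {
    // Same error actions as the StringEncoder constructor: substitute, never fail.
    static byte[] encodeReplacing(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE; wrapped to keep the signature simple.
            throw new IllegalStateException(e);
        }
    }

    // Strict variant: returns false when some character cannot be encoded.
    static boolean canEncodeStrictly(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] replaced = encodeReplacing("a\u5c0f"); // "a小"
        System.out.println(replaced.length);               // 2
        System.out.println((char) replaced[1]);            // ?
        System.out.println(canEncodeStrictly("abc"));      // true
        System.out.println(canEncodeStrictly("\u5c0f"));   // false
    }
}
```

With REPLACE the unmappable character is silently turned into `?`, which is the first garbling case from section 4. With REPORT the same input raises an exception, so the problem surfaces at encoding time.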

Original: https://blog.csdn.net/u010627840/article/details/50407575



Origin www.cnblogs.com/xubao/p/11058129.html