Chinese character encoding problems in Java HTTP network transport

1. new String(str.getBytes("utf-8"), "iso-8859-1") in Java, explained in detail

Assume the Chinese characters are stored in str.

  1. new String(str.getBytes("gbk"), "gbk") can be split into two steps:

    • Step one: byte[] bytes = str.getBytes("gbk")

      This tells the Java virtual machine to convert the Chinese text into a byte array using GBK. Each Chinese character corresponds to two bytes.

    • Step two: String s = new String(bytes, "gbk")

      This tells the virtual machine to assemble the byte array into characters using GBK, two bytes per Chinese character. After this step, s holds the same Chinese characters as the original str.

  2. With new String(str.getBytes("gbk"), "iso-8859-1"):

    • The second step becomes:

      String s = new String(bytes, "iso-8859-1")

      Here every single byte is assembled into one character, so s ends up as a run of unreadable characters (often displayed as "?"). Each of them can be treated as a special character that still carries its original byte value, so no information is lost and the original can be restored.

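  3. With new String(str.getBytes("gbk"), "utf-8"):

    • The second step becomes:

      String s = new String(bytes, "utf-8")

      Here the decoder tries to assemble the bytes as UTF-8 sequences, in which a Chinese character takes three bytes. GBK bytes are generally not valid UTF-8, so s comes out garbled rather than equal to the original str. Malformed sequences are replaced during decoding, so the original generally cannot be recovered from s.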
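The three cases above can be tried out with a short sketch. The class name GbkDecodeCases and the helper recode are made up for this illustration, and the example assumes the JDK in use ships the GBK charset. The character 中 is written as \u4e2d so the result does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GbkDecodeCases {
    // Hypothetical helper: encode s with one charset, then decode the same bytes with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String str = "\u4e2d";                // U+4E2D, one Chinese character

        // Case 1: GBK bytes decoded as GBK give back the original string.
        System.out.println(recode(str, gbk, gbk).equals(str)); // true

        // Case 2: GBK bytes decoded as ISO-8859-1 give one character per byte,
        // and re-encoding with ISO-8859-1 returns exactly the same bytes.
        String latin1 = recode(str, gbk, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length()); // 2
        System.out.println(recode(latin1, StandardCharsets.ISO_8859_1, gbk).equals(str)); // true

        // Case 3: GBK bytes decoded as UTF-8 are malformed input, so the decoder
        // substitutes U+FFFD and the original bytes are gone.
        String broken = recode(str, gbk, StandardCharsets.UTF_8);
        System.out.println(broken.equals(str));            // false
        System.out.println(broken.indexOf('\uFFFD') >= 0); // true
    }
}
```

Only the ISO-8859-1 detour is reversible, because that charset maps every byte value to exactly one character.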

In actual network transmission, Chinese characters are usually sent over the Internet in UTF-8; the commonly cited benefit of this approach is that it saves bandwidth. In Internet Explorer, the Advanced tab of Internet Options has a setting to always send data as UTF-8.

Note that when a byte array is assembled into characters with ISO-8859-1, each of those characters whose byte value is 0x80 or higher takes two bytes once it is encoded again as UTF-8.
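A minimal sketch of that size change, assuming the GBK charset is available; the class Latin1Utf8Size and its helper utf8SizeOfLatin1 are names invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Utf8Size {
    // Decode raw bytes as ISO-8859-1, then count how many bytes the result needs in UTF-8.
    static int utf8SizeOfLatin1(byte[] raw) {
        String latin1 = new String(raw, StandardCharsets.ISO_8859_1);
        return latin1.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The two GBK bytes of U+4E2D are both above 0x7F.
        byte[] gbkBytes = "\u4e2d".getBytes(Charset.forName("GBK"));
        System.out.println(gbkBytes.length);            // 2
        System.out.println(utf8SizeOfLatin1(gbkBytes)); // 4: two bytes per character

        // An ASCII byte stays a single byte in UTF-8.
        System.out.println(utf8SizeOfLatin1(new byte[] {0x41})); // 1
    }
}
```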

== getBytes() method ==

  • In Java, String's getBytes() method returns a byte array in the platform's default encoding. This means the result can differ from one operating system or JVM configuration to another!

    String.getBytes(String charsetName) returns the byte array that represents the string in the specified encoding, for example:
    byte[] b_gbk = "中".getBytes("GBK");
    byte[] b_utf8 = "中".getBytes("UTF-8");
    byte[] b_iso88591 = "中".getBytes("ISO8859-1");
    These return the byte arrays for the Chinese character "中" in GBK, UTF-8 and ISO8859-1 respectively. In this case:

    b_gbk has length 2,

    b_utf8 has length 3,

    b_iso88591 has length 1.
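The lengths listed above can be checked with a small sketch. The class GetBytesLengths and the helper encodedLength are illustrative names, and the example uses the getBytes(Charset) overload so that no checked exception has to be handled. It also prints Charset.defaultCharset(), which is what the no-argument getBytes() falls back to.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesLengths {
    // Number of bytes needed to represent s in the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D

        // The charset used by the no-argument getBytes(); varies by platform and JVM settings.
        System.out.println("default charset: " + Charset.defaultCharset());

        System.out.println(encodedLength(zhong, Charset.forName("GBK")));      // 2
        System.out.println(encodedLength(zhong, StandardCharsets.UTF_8));      // 3
        System.out.println(encodedLength(zhong, StandardCharsets.ISO_8859_1)); // 1 (a '?' placeholder)
    }
}
```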

== new String(byte[], charsetName) method ==

  • As the opposite of getBytes(), new String(byte[], charsetName) can be used to restore the character "中".

    new String(byte[], charsetName) parses the byte[] into a string using the encoding named by charsetName.
    String s_gbk = new String(b_gbk, "GBK");
    String s_utf8 = new String(b_utf8, "UTF-8");
    String s_iso88591 = new String(b_iso88591, "ISO8859-1");
    Print s_gbk, s_utf8 and s_iso88591 and you will find that s_gbk and s_utf8 are both "中", while s_iso88591 alone is an unrecognized character (in other words, garbled). Why can "中" not be restored after going through ISO8859-1? The reason is simple: the ISO8859-1 code table contains no Chinese characters, so "中".getBytes("ISO8859-1") cannot produce a correct code value for "中" in ISO8859-1, and new String() therefore has nothing to restore it from.
    So when using String.getBytes(String charsetName) to obtain a byte[], make sure the characters in the String actually exist in that charset's code table; only then can the resulting byte[] be restored correctly.
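One way to check up front whether a charset's code table contains the characters is CharsetEncoder.canEncode. The sketch below is only an illustration of that idea, not something the original text prescribes; the class CanEncodeCheck and its two helpers are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CanEncodeCheck {
    // True if every character of s exists in the charset's code table.
    static boolean canRepresent(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    // Encode and decode with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D

        System.out.println(canRepresent(zhong, StandardCharsets.UTF_8));      // true
        System.out.println(canRepresent(zhong, StandardCharsets.ISO_8859_1)); // false

        // UTF-8 survives the round trip; ISO-8859-1 only yields the '?' placeholder.
        System.out.println(roundTrip(zhong, StandardCharsets.UTF_8).equals(zhong)); // true
        System.out.println(roundTrip(zhong, StandardCharsets.ISO_8859_1));          // ?
    }
}
```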

    Note:

    Sometimes Chinese text has to satisfy a special requirement (for example, HTTP header content must be ISO8859-1 encoded). In that case the Chinese characters can be carried byte by byte, for example:
    String s_iso88591 = new String("中".getBytes("UTF-8"), "ISO8859-1");
    The resulting s_iso88591 is actually a string of three ISO8859-1 characters. After these characters are transferred to the destination, the receiving program reverses the process with
    String s_utf8 = new String(s_iso88591.getBytes("ISO8859-1"), "UTF-8");
    to get back the correct Chinese character "中". This both complies with the protocol and supports Chinese.
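The wrap and unwrap steps from this note could be packaged as a pair of helpers. This is a sketch under the assumption that sender and receiver both agree on UTF-8 as the underlying encoding; the class HeaderValueCodec and the method names wrap and unwrap are invented here.

```java
import java.nio.charset.StandardCharsets;

class HeaderValueCodec {
    // Sender side: carry the UTF-8 bytes of text as ISO-8859-1 characters.
    static String wrap(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Receiver side: turn the ISO-8859-1 characters back into bytes and decode them as UTF-8.
    static String unwrap(String headerValue) {
        return new String(headerValue.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D

        String wrapped = wrap(zhong);
        System.out.println(wrapped.length());               // 3: one character per UTF-8 byte
        System.out.println(unwrap(wrapped).equals(zhong));  // true
    }
}
```

Every character in the wrapped string has a code point of at most 0xFF, which is why it fits into an ISO-8859-1 field.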

2. Encoding and decoding Chinese characters in network requests: URLEncoder.encode() and URLDecoder.decode()

Demo

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
 
public class JavaStudy {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode
        String strUTF = "上海";
        String encode = URLEncoder.encode(strUTF, "utf-8");
        System.out.println(encode);//%E4%B8%8A%E6%B5%B7

        // Decode
        String decoStr = "%E4%B8%8A%E6%B5%B7";
        String decode = URLDecoder.decode(decoStr, "utf-8");
        System.out.println(decode);//上海
    }
}

Notes

  1. == URLEncoder.encode(String s, String enc) ==
    Encodes the string into application/x-www-form-urlencoded format using the specified encoding scheme.

    Use it when sending a request.

    == URLDecoder.decode(String s, String enc) ==
    Decodes an application/x-www-form-urlencoded string using the specified encoding scheme.

    Use it when receiving a request.

  2. The encoding used for encoding and decoding must be the same.
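To see why the two sides must agree, the sketch below decodes the same percent-encoded string once with the matching encoding and once with a different one. The class UrlCodecMismatch and its wrapper methods are hypothetical; they only exist to turn the checked UnsupportedEncodingException into an unchecked one. GBK is assumed to be available, and 上海 is written as \u4e0a\u6d77 to keep the source file ASCII-only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlCodecMismatch {
    // Thin wrapper around URLEncoder.encode that rethrows an unknown encoding as unchecked.
    static String encode(String s, String enc) {
        try {
            return URLEncoder.encode(s, enc);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown encoding: " + enc, e);
        }
    }

    // Thin wrapper around URLDecoder.decode, same exception handling.
    static String decode(String s, String enc) {
        try {
            return URLDecoder.decode(s, enc);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown encoding: " + enc, e);
        }
    }

    public static void main(String[] args) {
        String shanghai = "\u4e0a\u6d77"; // the two characters used in the demo above

        String encoded = encode(shanghai, "utf-8");
        System.out.println(encoded); // %E4%B8%8A%E6%B5%B7

        // Matching encoding on both sides restores the text.
        System.out.println(decode(encoded, "utf-8").equals(shanghai)); // true

        // A different encoding on the receiving side yields other characters.
        System.out.println(decode(encoded, "GBK").equals(shanghai));   // false
    }
}
```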

Origin www.cnblogs.com/qzkuan/p/12077061.html