[Basics] Java encoding issues, explained so you are no longer confused


If you are a programmer working in 2003 and you do not understand the basics of characters, character sets, encodings, and Unicode, then be careful: if I catch you, I will punish you by making you peel onions for six months in a submarine. (source: the Internet)


1. ASCII encoding

In the 1960s, the United States developed a character encoding that laid down uniform rules for the relationship between English characters and bits. This is called ASCII, and it is still in use today. ASCII defines a total of 128 characters; for example, the space character "SPACE" is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control characters) occupy only the last 7 bits of a byte, with the leading bit uniformly set to 0. Codes 0 to 31 are control characters such as carriage return, line feed, and delete; codes 32 to 126 are printable characters that can be typed on a keyboard and displayed.
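
As a small illustration (a minimal sketch of my own, not from the original article), in Java a char can be cast to an int to see its numeric code, which for ASCII characters matches the values just described:

    public class AsciiDemo {
        public static void main(String[] args) {
            System.out.println((int) ' ');  // 32
            System.out.println((int) 'A');  // 65
            System.out.println((char) 65);  // A
        }
    }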

For English, 128 symbols are enough, but for other languages they are not. For example, French places accent marks above some letters, which cannot be represented with ASCII. As a result, some European countries decided to use the idle highest bit of the byte to encode new symbols; for example, é in French was encoded as 130 (binary 10000010). With this scheme the encodings used in European countries could represent up to 256 symbols.

However, this created a new problem. Different countries use different letters, so even though they all use 256-symbol encodings, the letters those codes represent are not the same. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In any case, in all these encodings the symbols 0-127 are the same; only the range 128-255 differs.

For the languages of Asian countries, even more symbols are used; there are around 100,000 Chinese characters alone. One byte can represent only 256 symbols, which is definitely not enough, so multiple bytes must be used to express one symbol. For example, the common Simplified Chinese encoding GB2312 uses two bytes per character, so in theory it can represent up to 65,536 symbols.

2. Unicode encoding

Imagine an encoding that included every symbol in the world, with each symbol given a unique code; then the problems above would not appear. Unicode is such an encoding.

Unicode is a very large character set; at present it can accommodate more than one million symbols. Each symbol has a different code point; for example, U+0639 represents the Arabic letter Ain, U+0041 the English capital letter A, and U+4E25 the Chinese character 严 ("strict").

It should be noted that Unicode is only a set of symbols: it specifies each symbol's binary code point but not how that code point should be stored. This creates two problems:

  • The first problem: how do we tell Unicode apart from ASCII? How does the computer know that three bytes represent a single symbol rather than three separate symbols?
  • The second problem: we already know that one byte is enough for English letters. If Unicode uniformly required every symbol to be represented with three or four bytes, then every English letter would necessarily be preceded by two or three bytes of zeros, which is a huge waste of storage; text files would become two or three times larger, which is unacceptable.

Remember: Unicode is only a standard for mapping characters to numbers. It does not limit the number of characters it supports, nor does it require a character to occupy two, three, or any other particular number of bytes. How a Unicode character is encoded into bytes in memory is a separate topic, defined by the UTFs (Unicode Transformation Formats).
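
To make this concrete, here is a minimal sketch (my own example, not from the original text) that prints Unicode code points from Java, independently of any byte encoding:

    public class CodePointDemo {
        public static void main(String[] args) {
            System.out.printf("U+%04X%n", "A".codePointAt(0));   // U+0041
            System.out.printf("U+%04X%n", "严".codePointAt(0));  // U+4E25
        }
    }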

3. UTF-8 encoding

The spread of the Internet pushed strongly for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are basically not used on the Internet. To repeat the relationship: UTF-8 is one implementation of Unicode.

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It was created by Ken Thompson in 1992 and is now standardized as RFC 3629. UTF-8 encodes Unicode characters with 1 to 4 bytes. It allows a single web page to display Simplified Chinese, Traditional Chinese, and other languages (such as English, Japanese, and Korean) together.

The biggest feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, and the byte length varies with the symbol (UTF-8 covers a 2^21 code space, more than 2 million characters).

The UTF-8 encoding rules are very simple; there are only two:

  1. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the symbol's Unicode code point. Therefore, for English letters, UTF-8 is identical to ASCII.

  2. For a symbol that takes n bytes (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each of the following bytes are set to 10. The remaining bits, not mentioned so far, are filled with the symbol's Unicode code point.

The following table summarizes the encoding rules; the letter x marks the bits available for the code point.

Unicode code point range (hexadecimal) | UTF-8 encoding (binary)
---------------------------------------+-------------------------------------
1 byte:  0000 0000 - 0000 007F         | 0xxxxxxx
2 bytes: 0000 0080 - 0000 07FF         | 110xxxxx 10xxxxxx
3 bytes: 0000 0800 - 0000 FFFF         | 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 0001 0000 - 0010 FFFF         | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Below, using the Chinese character 严 ("strict") as an example, let's demonstrate how UTF-8 encoding works.

Known "strict" unicode is 4E25 (100111000100101), according to the table, can be found in the range of 4E25 third row (0000 0800-0000 FFFF), so "strict" UTF-8 encoding requires three bytes that the format is "1110xxxx 10xxxxxx 10xxxxxx". Then, from the "strict" last bit Start, fill in the format of x from back to front, the extra bit 0s. This resulted in a "strict" UTF-8 encoding is "11100100 1,011,100,010,100,101", converted to hexadecimal is E4B8A5.

4. Differences between UTF-8, UTF-16, and UTF-32

First, let's make one concept clear: Unicode is a character set that defines a unique code point for every character in the world. It merely assigns a binary code to each symbol and says nothing about the detailed rules for storing it. UTF-8, UTF-16, and UTF-32 are storage formats defined on top of Unicode. (To borrow an analogy from communications: a signal, analogous to a Unicode code point, can be encoded into different voltage levels by different encoding methods.)

4.1 UCS-2 and UCS-4

Unicode was born to unify all the languages of the world. Every Unicode character corresponds to a value called a code point. Code point values are normally written in the format U+ABCD. The mapping between characters and code points is UCS-2 (Universal Character Set coded in 2 octets). As the name implies, UCS-2 uses two bytes to represent a code point, covering the range U+0000 to U+FFFF.

To represent more characters, UCS-4 was also proposed, which represents a code point with four bytes. Its range is U+00000000 to U+7FFFFFFF, and U+00000000 to U+0000FFFF is the same as UCS-2.

Note that UCS-2 and UCS-4 only specify the correspondence between characters and code points; they do not specify how the code points are stored in a computer. The storage formats are known as UTFs (Unicode Transformation Formats), of which UTF-16 and UTF-8 are the most widely used.

4.2 UTF-16

UTF-16 is defined by RFC 2781; it uses two bytes to represent a code point. It is not hard to guess that UTF-16 corresponds exactly to UCS-2: the code point defined by UCS-2 is simply stored as-is, in either Big Endian or Little Endian byte order. UTF-16 comes in three flavours: UTF-16, UTF-16BE (Big Endian), and UTF-16LE (Little Endian). UTF-16BE and UTF-16LE are self-explanatory, while plain UTF-16 needs to indicate whether the file is Big Endian or Little Endian by placing a character called the BOM (Byte Order Mark) at the beginning of the file. The BOM is the character U+FEFF. The BOM is in fact a clever little trick: since U+FFFE is not defined in UCS-2, whenever the byte sequence FE FF or FF FE appears it can be assumed to be U+FEFF, and from that the byte order (Big Endian or Little Endian) can be determined.

A BOM (Byte Order Mark) at the beginning of a document tells the reader the document's byte order. UTF-8 does not need a BOM to indicate byte order, but a BOM may still be used to mark the encoding: the UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so if a recipient receives a byte stream beginning with EF BB BF, it knows the stream is UTF-8 encoded. Only UTF-16 actually needs the BOM, because it encodes code points in the BMP as two-byte units and therefore must identify little endian versus big endian.
 

Little endian and big endian byte order
Byte order (little endian or big endian) is a convention about how a sequence of bytes (called a word) is read and stored in memory. It means that when you ask the computer to store the letter A (two bytes) in memory using UTF-16, the byte-order scheme it uses determines whether the first byte is placed before or after the second byte. That is still a little abstract, so let's look at an example: when you save the same content (here the letters "hello") with UTF-16 on different systems, the bytes might look like this:
00 68 00 65 00 6C 00 6C 00 6F (big endian: the high-order byte is stored first)
68 00 65 00 6C 00 6C 00 6F 00 (little endian: the low-order byte is stored first)

The byte-order scheme is just a matter of preference among microprocessor architects; for example, Intel uses little endian while Motorola uses big endian.

for example. "ABC" with the result that the three characters encoded in various ways as follows:

Encoding type          | Encoded bytes
UTF-16BE               | 00 41 00 42 00 43
UTF-16LE               | 41 00 42 00 43 00
UTF-16 (Big Endian)    | FE FF 00 41 00 42 00 43
UTF-16 (Little Endian) | FF FE 41 00 42 00 43 00
UTF-16 (without BOM)   | 00 41 00 42 00 43
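
The table above can be reproduced with Java's built-in charsets. This is a small sketch of my own (note that Java's "UTF-16" charset writes a big-endian BOM FE FF when encoding):

    import java.nio.charset.StandardCharsets;

    public class Utf16Demo {
        public static void main(String[] args) {
            dump("UTF-16BE", "ABC".getBytes(StandardCharsets.UTF_16BE));
            dump("UTF-16LE", "ABC".getBytes(StandardCharsets.UTF_16LE));
            dump("UTF-16  ", "ABC".getBytes(StandardCharsets.UTF_16));  // BOM + big endian
        }

        static void dump(String name, byte[] bytes) {
            StringBuilder sb = new StringBuilder(name + ": ");
            for (byte b : bytes) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(sb);
        }
    }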

4.3 UTF-32

UTF-32 represents code points with four bytes, so it can represent every UCS-4 code point directly without the more elaborate algorithm that UTF-16 needs. Like UTF-16, UTF-32 also comes in three encodings, UTF-32, UTF-32BE, and UTF-32LE, and plain UTF-32 likewise needs a BOM character.

4.4 How a text editor knows the encoding of a text file

When a piece of software opens a text file, the first thing it has to do is determine which character set and encoding the text was saved with. Software generally uses three methods to determine the character set and encoding of the text:

  1. Detect the identification header (BOM); see the sketch after this list:

    EF BB BF        UTF-8
    FE FF           UTF-16/UCS-2, big endian
    FF FE           UTF-16/UCS-2, little endian
    FF FE 00 00     UTF-32/UCS-4, little endian
    00 00 FE FF     UTF-32/UCS-4, big endian

  2. The software guesses the encoding of the current file on its own, according to the coding rules of each encoding.

  3. The software prompts the user to specify the encoding of the current file.
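
Method 1 can be sketched in a few lines of Java. This is a rough illustration under my own assumptions (the helper name detectByBom is made up; real detection libraries handle many more cases):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class BomSniffer {

        // Compare the first bytes of the file against the BOM table above.
        static String detectByBom(byte[] h) {
            if (h.length >= 4 && h[0] == 0x00 && h[1] == 0x00
                    && (h[2] & 0xFF) == 0xFE && (h[3] & 0xFF) == 0xFF) return "UTF-32BE";
            if (h.length >= 4 && (h[0] & 0xFF) == 0xFF && (h[1] & 0xFF) == 0xFE
                    && h[2] == 0x00 && h[3] == 0x00) return "UTF-32LE";
            if (h.length >= 3 && (h[0] & 0xFF) == 0xEF && (h[1] & 0xFF) == 0xBB
                    && (h[2] & 0xFF) == 0xBF) return "UTF-8";
            if (h.length >= 2 && (h[0] & 0xFF) == 0xFE && (h[1] & 0xFF) == 0xFF) return "UTF-16BE";
            if (h.length >= 2 && (h[0] & 0xFF) == 0xFF && (h[1] & 0xFF) == 0xFE) return "UTF-16LE";
            return "no BOM found (fall back to guessing or asking the user)";
        }

        public static void main(String[] args) throws IOException {
            System.out.println(detectByBom(Files.readAllBytes(Paths.get(args[0]))));
        }
    }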

5. Differences between GB2312, GBK, and GB18030

GB2312 is an extension of ASCII and uses two bytes per Chinese character. Characters below 128 keep their original meaning, but when two bytes that are both greater than 127 appear together, they represent one Chinese character: the leading byte (the high byte) runs from 0xA1 to 0xF7 and the trailing byte (the low byte) from 0xA1 to 0xFE, which allows roughly 7,000 Simplified Chinese characters to be combined. These codes also incorporate mathematical symbols, Roman and Greek letters, and Japanese kana; even the digits, punctuation, and letters that already exist in ASCII were re-encoded as two-byte codes. These are the so-called "full-width" characters, while the original characters below 128 are called "half-width".

The characters GB2312 could represent were still not enough, so GBK appeared. GBK is an extension of GB2312 and also uses 2 bytes, but it no longer requires the low byte to be a code greater than 127: as long as the first byte is greater than 127, it marks the start of a Chinese character, regardless of whether the following byte falls within the extended range. The encoding scheme produced by this expansion is called GBK. GBK includes all of GB2312 and adds nearly 20,000 new characters (including Traditional Chinese characters) and symbols.

GB18030 uses a variable-length encoding: a character may be 1, 2, or 4 bytes. It is an extension of GBK and GB2312 and is fully compatible with both.
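
Here is a small sketch of my own comparing these encodings from Java (it assumes the JRE ships the GB2312/GBK/GB18030 charsets, which standard Oracle/OpenJDK builds normally do):

    import java.io.UnsupportedEncodingException;

    public class GbDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String s = "严A";  // one Chinese character plus one ASCII letter
            for (String charset : new String[] {"GB2312", "GBK", "GB18030", "UTF-8"}) {
                byte[] bytes = s.getBytes(charset);
                StringBuilder hex = new StringBuilder();
                for (byte b : bytes) {
                    hex.append(String.format("%02X ", b & 0xFF));
                }
                System.out.println(charset + " (" + bytes.length + " bytes): " + hex);
            }
        }
    }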

From the introduction above we can see that Unicode is a worldwide standard that defines a code table for all the language symbols in the world, whereas GBK, GB2312, and so on are encodings mainly for Chinese characters.

6. Encoding problems in Java

We know that encoding generally comes up in character-to-byte or byte-to-character conversion, and the scenarios that need this conversion are mainly I/O, including disk I/O and network I/O. Most garbled-text problems are caused by network I/O.

A user initiates an HTTP request from the browser; the places that need encoding are the URL, Cookies, and Parameters. After the server receives the HTTP request, it parses the HTTP protocol, and the URI, Cookies, and POST form parameters need to be decoded. The server may also need to read data from a database, from local or networked text files, or from elsewhere, and that data may also have encoding problems. When the Servlet has finished processing all the request data, the result needs to be encoded again and sent to the user's browser through the Socket, and then decoded back into text by the browser.

As we can see, an HTTP request involves a lot of encoding and decoding. What rules do they each follow? The following elaborates on them one by one.

URL encoding and decoding
A user submits a URL, and this URL may contain Chinese, so it needs to be encoded. How is this URL encoded? According to what rules? And how is it decoded? A URL can be broken into the parts Port, Context Path, Servlet Path, PathInfo, and QueryString.

The Port is configured on Tomcat's Connector, the Context Path in the Context configuration, and the Servlet Path in the Web application's web.xml:

    <servlet-mapping> 
            <servlet-name>junshanExample</servlet-name> 
            <url-pattern>/servlets/servlet/*</url-pattern> 
     </servlet-mapping> 

The PathInfo is the concrete Servlet we request, and the QueryString is the parameters to be passed. Note that here the URL is typed directly into the browser, so it is requested with the GET method; with a POST request the QueryString would instead be submitted to the server through the form body, which is described later.

Both the PathInfo and the QueryString in this example contain Chinese. When we enter such a URL directly in the browser, how do the browser and the server encode and parse it? To verify how the browser encodes the URL, we use the FireFox browser and observe the actual content of the URL request with the HTTPFox plug-in. The test URL is http://localhost:8080/examples/servlets/servlet/君山?author=君山, and the result for the Chinese text in FireFox 3.6.12 is:

The encodings of "君山" come out as e5 90 9b e5 b1 b1 and be fd c9 bd. Checking against the encodings described earlier, the PathInfo was encoded with UTF-8 while the QueryString was encoded with GBK. As for why there is a "%": the URL encoding specification RFC 3986 says the browser encodes non-ASCII characters into hexadecimal bytes according to some encoding format and then puts "%" in front of each hexadecimal byte, so the final URL ends up in this percent-encoded form.
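
The same percent-encoding can be reproduced with java.net.URLEncoder (strictly speaking it implements application/x-www-form-urlencoded, but for these characters the output is the same percent form). A minimal sketch of my own:

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class UrlEncodeDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            System.out.println(URLEncoder.encode("君山", "UTF-8"));  // %E5%90%9B%E5%B1%B1
            System.out.println(URLEncoder.encode("君山", "GBK"));    // %BE%FD%C9%BD
        }
    }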

From the test results above we can see that the browser encodes the PathInfo and the QueryString differently, and different browsers may encode the PathInfo differently as well, which makes decoding on the server very difficult. Let's take Tomcat as an example and look at how Tomcat decodes such a URL.

protected void convertURI(MessageBytes uri, Request request)
        throws Exception {
    ByteChunk bc = uri.getByteChunk();
    int length = bc.getLength();
    CharChunk cc = uri.getCharChunk();
    cc.allocate(length, -1);

    // Use the charset configured with URIEncoding on the Connector, if any
    String enc = connector.getURIEncoding();
    if (enc != null) {
        B2CConverter conv = request.getURIConverter();
        try {
            if (conv == null) {
                conv = new B2CConverter(enc);
                request.setURIConverter(conv);
            }
        } catch (IOException e) {...}
        if (conv != null) {
            try {
                conv.convert(bc, cc, cc.getBuffer().length - cc.getEnd());
                uri.setChars(cc.getBuffer(), cc.getStart(), cc.getLength());
                return;
            } catch (IOException e) {...}
        }
    }

    // Default encoding: fast conversion (byte -> char 1:1, i.e. ISO-8859-1)
    byte[] bbuf = bc.getBuffer();
    char[] cbuf = cc.getBuffer();
    int start = bc.getStart();
    for (int i = 0; i < length; i++) {
        cbuf[i] = (char) (bbuf[i + start] & 0xff);
    }
    uri.setChars(cbuf, 0, length);
}

From the code above we can see that the URI portion of the URL is decoded with the character set defined by URIEncoding on the Connector; if it is not defined, the default encoding ISO-8859-1 is used. So if the URL contains Chinese, it is best to set URIEncoding to UTF-8.

How is the QueryString parsed, then? The QueryString of a GET request and the form parameters of a POST request are both stored as Parameters, and the parameter values are retrieved with request.getParameter. They are decoded the first time request.getParameter is called: request.getParameter calls the parseParameters method of org.apache.catalina.connector.Request, which decodes the parameters passed by both GET and POST, although the character sets used to decode them may differ. The decoding of POST forms is described later; where is the character set used to decode the QueryString defined? The QueryString is transmitted to the server in the HTTP header and is also part of the URL, so is it decoded with the same character set as the URI? From the fact that the browser encoded the PathInfo and the QueryString differently, we can already guess that the character sets used to decode them will not be the same. Indeed, the QueryString is decoded either with the charset defined in the ContentType header or with the default ISO-8859-1; to make the ContentType charset take effect, you must set useBodyEncodingForURI to true on the Connector. The name of this configuration item is somewhat confusing: it does not mean the whole URI is decoded with the BodyEncoding, only the QueryString, so pay special attention to this.

As the URL encoding and decoding process described above shows, things are rather complex, and the encoding and decoding steps are not fully controllable from our application. So in our applications we should try to avoid non-ASCII characters in URLs; otherwise we are very likely to run into garbled text. And of course it is best to set the URIEncoding and useBodyEncodingForURI parameters on the server's Connector.

Encoding and decoding of HTTP Headers
When a client initiates an HTTP request, besides the URL above, other parameters such as Cookies and a redirectPath may be passed in the Header. The values set by the user there may also have encoding problems. How does Tomcat decode them?

Header entries are decoded when request.getHeader is called. If the requested Header entry has not been decoded yet, the toString method of MessageBytes is invoked, and the default encoding this method uses to convert from byte to char is ISO-8859-1. We cannot configure the Header to be decoded with any other format, so if you put non-ASCII characters into a Header they will certainly come out garbled.

The same applies when we add Headers: do not pass non-ASCII characters in a Header. If we really must, we can first encode these characters with org.apache.catalina.util.URLEncoder before adding them to the Header; then no information is lost on the way from browser to server, and when we want to read these values we decode them with the corresponding character set.
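
A hedged sketch of this workaround, using the standard java.net.URLEncoder/URLDecoder instead of Tomcat's internal org.apache.catalina.util.URLEncoder; the header name X-File-Name is only an example:

    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;
    import java.net.URLEncoder;

    public class HeaderValueDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String original = "严选.txt";
            // Sending side: percent-encode before calling response.setHeader("X-File-Name", safe)
            String safe = URLEncoder.encode(original, "UTF-8");
            System.out.println(safe);
            // Receiving side: decode what request.getHeader("X-File-Name") returned
            String restored = URLDecoder.decode(safe, "UTF-8");
            System.out.println(restored.equals(original));  // true
        }
    }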

Encoding and decoding of POST forms
As mentioned earlier, the decoding of parameters submitted by a POST form also happens the first time request.getParameter is called. POST form parameters are passed differently from the QueryString: they are passed to the server in the HTTP BODY. When we click the submit button on the page, the browser first encodes the form parameters according to the charset of the ContentType and then submits them to the server, and the server also uses the charset of the ContentType to decode them. So parameters submitted through a POST form generally do not cause problems, and this character set encoding is under our own control: it can be set with request.setCharacterEncoding(charset).
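
A minimal Servlet sketch of this point (the class name and parameter name are illustrative; it uses the javax.servlet API of the Tomcat versions discussed in this article):

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class FormServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // Must be called BEFORE the first getParameter call, otherwise the POST body
            // has already been decoded with the container's default charset.
            request.setCharacterEncoding("UTF-8");
            String author = request.getParameter("author");  // decoded with UTF-8
            // ... use author ...
        }
    }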

In addition, for parameters of type multipart/form-data, i.e., uploaded files, the encoding also uses the character set defined by ContentType. Note that the uploaded file itself is transferred to the server's local temporary directory as a byte stream, a process that involves no character encoding; encoding actually happens only when the file's contents are added to the parameters, and if they cannot be encoded with that character set, the default encoding ISO-8859-1 is used.

Encoding and decoding of the HTTP BODY
When the resource requested by the user has been obtained successfully, the content is returned to the client browser through the Response. This process is first encoded on the server and then decoded by the browser. The character set for this encoding and decoding can be set with response.setCharacterEncoding, which overrides the value of request.getCharacterEncoding and is returned to the client through the Content-Type of the HTTP Header; the browser decodes the returned byte stream from the socket using the charset in Content-Type. If the returned HTTP Header does not set a charset in Content-Type, the browser decodes according to the charset declared in the HTML <meta> tag, and if that is not defined either, the browser decodes with its default encoding.
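
A short sketch of the response side (again my own example using the javax.servlet API). Setting the charset on the response selects the encoder used by getWriter() and also puts charset=UTF-8 into the Content-Type header that the browser will use for decoding:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class HelloServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            response.setContentType("text/html");
            response.setCharacterEncoding("UTF-8");  // must come before getWriter()
            response.getWriter().println("<html><body>严</body></html>");
        }
    }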

Other places where encoding needs attention
Besides URL and parameter encoding problems, there are many other places on the server side where encoding may be involved, such as reading XML files, Velocity templates, JSP, or data from a database.

The encoding format of an XML file can be specified with:

    <?xml version="1.0" encoding="UTF-8"?> 

The encoding format of Velocity templates is set with:

services.VelocityService.input.encoding=UTF-8 

The JSP encoding format is set with:

<%@page contentType="text/html; charset=UTF-8"%>

A client accesses the database through the JDBC driver. The encoding JDBC uses to access the data must be consistent with the database's built-in encoding; for MySQL, for example, it can be specified by setting the JDBC URL:

url="jdbc:mysql://localhost:3306/DB?useUnicode=true&characterEncoding=GBK"

8. Analyzing garbled-text problems

Now let's see: when we run into garbled text, how should we deal with it? The only cause of garbled text is that the character sets used for encoding and decoding in char-to-byte or byte-to-char conversions are inconsistent. Because a single operation often involves several rounds of encoding and decoding, it can be hard to tell which step went wrong when the garbled text appears. In my experience, checking step by step from the very beginning towards the source of the problem is usually the fastest way.
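
As a concrete illustration of the usual failure mode, here is a small sketch of my own that reproduces garbled text by decoding bytes with the wrong charset, and the common repair attempt (which only works when the wrong decoding did not lose information, as with ISO-8859-1):

    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            byte[] utf8Bytes = "严".getBytes(StandardCharsets.UTF_8);  // E4 B8 A5

            // Wrong: decoding UTF-8 bytes as ISO-8859-1 produces garbled text.
            String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
            System.out.println(garbled);   // prints something like "ä¸¥"

            // ISO-8859-1 maps bytes to chars 1:1, so the original bytes can be
            // recovered and re-decoded with the correct charset.
            String repaired = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                                         StandardCharsets.UTF_8);
            System.out.println(repaired);  // 严
        }
    }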


Why can Java only identify up to 65,535 characters?

Unicode itself is only a standard, not a particular implementation, and it does not limit the number of bytes. The Unicode version in practical use for a long time corresponded to UCS-2, a 16-bit code space, and could therefore represent at most 65,535 characters. But those 60,000-odd characters are really not enough; in fact Unicode now defines more than 100,000 characters (the 100,000th, adopted in 2005, was for the Malayalam language). The current Unicode version is 6.3, released on September 30, 2013. Java's char type is still a 16-bit UTF-16 code unit, so a single char can only hold code points in that 65,535 range; characters beyond it are represented with a pair of chars (a surrogate pair).
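
A small sketch of my own showing the consequence in code: a char in Java is a UTF-16 code unit, so a code point outside the BMP occupies two chars (a surrogate pair) in a String:

    public class SurrogateDemo {
        public static void main(String[] args) {
            String s = new String(Character.toChars(0x1D11E));    // MUSICAL SYMBOL G CLEF
            System.out.println(s.length());                       // 2 char code units
            System.out.println(s.codePointCount(0, s.length()));  // 1 code point
        }
    }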


Source: www.cnblogs.com/54chensongxia/p/11650841.html