String encoding and decoding

Character: a symbol that people use, in the abstract sense. For example: '1', '中', 'a'.

Byte: the unit of data storage in a computer, an 8-bit binary number. It is a concrete piece of storage space.

Character set: which characters are in use, that is, which Chinese characters, letters and symbols a standard includes. The collection of characters it contains is called a "character set".

Encoding: the rule that specifies whether each character is stored in one byte or in several bytes, and which byte values are used. This rule is called an "encoding".

The names we usually call "character sets", such as GB2312, GBK and JIS, denote not only a collection of characters but also an encoding.

 1. Basic knowledge of coding

 

1. ISO-8859-1 is a single-byte encoding. It can represent at most the code values 0-255 and is used for English and other Western European languages. For example, the code for the letter 'a' is 0x61 = 97. The range of characters ISO-8859-1 can represent is clearly narrow, and it cannot represent Chinese characters. However, because it is a single-byte encoding, which matches the computer's most basic unit of representation, ISO-8859-1 is still used in many situations, and many protocols use it by default. For example, the two characters "中文" have no ISO-8859-1 encoding. In GB2312 they are the two double-byte codes "d6d0 cec4"; when those bytes are treated as ISO-8859-1 they are split into 4 single bytes: "d6 d0 ce c4" (in fact, storage is always handled byte by byte anyway). In UTF-8 they are the 6 bytes "e4 b8 ad e6 96 87". Obviously, such a byte-wise representation can only be interpreted correctly if the encoding it is actually based on is known.
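The byte values quoted above can be checked with a few lines of Java. This is a minimal sketch: the class name HexDump and the helper hex are made up for illustration, and it assumes the running JDK provides the GBK charset, which standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HexDump {
    /** Formats bytes as lowercase two-digit hex values separated by spaces. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters 中文, written as escapes so the source file encoding does not matter.
        String text = "\u4E2D\u6587";
        // Assumption: the GBK charset is available in this JDK.
        System.out.println(hex(text.getBytes(Charset.forName("GBK"))));     // d6 d0 ce c4
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));     // e4 b8 ad e6 96 87
        System.out.println(hex("a".getBytes(StandardCharsets.ISO_8859_1))); // 61
    }
}
```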

 

2. GB2312/GBK These are the Chinese national standard encodings, designed to represent Chinese characters. Chinese characters take two bytes, while English letters take one byte with the same values as in ISO-8859-1 (strictly speaking, the compatibility covers only the ASCII range 0x00-0x7F). GBK can represent both traditional and simplified characters, whereas GB2312 can represent only simplified characters; GBK is backward compatible with GB2312.

 

3. Unicode This is the most unified character set and can represent the characters of all languages. The encoding meant here is its double-byte form (UCS-2/UTF-16; there is also a four-byte form, UTF-32), which uses two bytes even for English letters. It is therefore not byte-compatible with ISO-8859-1, nor with any other encoding. For characters in the ISO-8859-1 range, however, the Unicode code simply gains a leading 0 byte: the letter 'a' is "00 61". Fixed-length encodings are convenient for a computer to process (note that GB2312/GBK is not a fixed-length encoding), and Unicode can represent all characters, so much software, Java included, uses Unicode internally. Strictly speaking, the UTF-16 form used by Java is fixed-length only for characters in the Basic Multilingual Plane; characters outside it take two 16-bit units (a surrogate pair).

 

4. UTF The double-byte Unicode encoding is not compatible with ISO-8859-1 and tends to take more space, because even English letters need two bytes. That makes it inconvenient for transmission and storage, which is why the UTF encodings, UTF-8 in particular, were created. UTF-8 is compatible with ASCII (the lower half of ISO-8859-1) and can represent the characters of all languages. It is, however, a variable-length encoding: the original design allowed 1-6 bytes per character, and the current standard limits it to 1-4 bytes. UTF-8 also has a simple built-in checking property, since lead bytes and continuation bytes have distinct bit patterns. Generally, English letters take one byte and Chinese characters take three. Note that UTF-8 saves space only relative to the double-byte Unicode encoding. If the text is known to be almost entirely Chinese, GB2312/GBK, at two bytes per Chinese character and one per ASCII character, is the most economical of the encodings discussed here. On the other hand, even though UTF-8 uses 3 bytes per Chinese character, a Chinese web page is usually smaller in UTF-8 than in double-byte Unicode, because the page contains a lot of English characters in its markup.
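To make the size comparison concrete, the following sketch counts the bytes that different encodings need for an English letter, a Chinese character and a small piece of markup. The class name EncodedLength, the helper byteLength and the sample strings are illustrative assumptions, and the GBK charset is assumed to be available in the JDK. UTF_16BE is used rather than UTF_16 so that no byte order mark is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLength {
    /** Number of bytes needed to store text in the given charset. */
    public static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumed to be available
        String letter = "a";
        String hanzi = "\u4E2D";             // 中
        String page = "<p>\u4E2D\u6587</p>"; // ASCII markup around two Chinese characters

        for (String s : new String[] {letter, hanzi, page}) {
            System.out.printf("chars=%d UTF-16BE=%d UTF-8=%d GBK=%d%n",
                    s.length(),
                    byteLength(s, StandardCharsets.UTF_16BE),
                    byteLength(s, StandardCharsets.UTF_8),
                    byteLength(s, gbk));
        }
    }
}
```

For the markup sample, the seven ASCII characters cost one byte each in UTF-8 and GBK but two each in UTF-16, which is why UTF-8 comes out smaller than UTF-16 even though each Chinese character costs three bytes.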

 

2. JAVA's processing of characters

 

1. getBytes(charset) This is a standard method of Java strings. It encodes the characters of the string according to charset and returns the result as bytes. Note that in memory Java always stores strings as Unicode (UTF-16). When a Java program obtains a string from an input stream, a file or a string literal, a character-encoding conversion takes place. For example, the InputStreamReader constructor lets you specify the encoding; if none is specified for a stream or file, the system default encoding is used to decode the data, and string literals are decoded by the compiler according to the encoding of the source file. Consider the following code:

① String str = "中";

② byte[] bytes = str.getBytes();

③ bytes = str.getBytes("ISO-8859-1");

Statement ①: assigns a string literal containing the single character "中" to str, an object of the String class. In the source file the literal is stored in the encoding the file was saved with; on a Chinese Windows system this is usually GBK, in which "中" is 0xD6D0. When the literal is turned into a String, Java converts "中" from GBK to Unicode, where its code is 0x4E2D, so the in-memory representation of str during execution is 0x4E2D.

Statement ②: obtains the byte form of str. The getBytes(String encoding) method takes the name of the encoding in which the bytes are wanted. This statement passes no argument, so the default encoding is used (GBK on the system assumed here; on JDK 18 and later the default is UTF-8 unless configured otherwise). The bytes obtained are therefore the GBK form of "中", that is, bytes[0]=0xD6, bytes[1]=0xD0.

Statement ③: differs from statement ② in that the encoding is specified. ISO-8859-1, commonly known as Latin-1, encodes each character in 8 bits, so its code space holds only 256 characters: basic ASCII plus some additional Western European characters. The character "中" cannot be in this character set, so the Java virtual machine finds no ISO-8859-1 code for it. In this case a single question mark (?, 0x3f) is returned, so bytes.length is 1 and bytes[0]=0x3f. Note that getBytes(String) declares the checked exception UnsupportedEncodingException.
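The three statements can be reproduced without depending on the platform default by naming every charset explicitly. This is a sketch under assumptions: the class GetBytesDemo and its helpers encode and hex are illustrative names, GBK is assumed to be available in the JDK, and the character is written as a Unicode escape so that the encoding of the source file plays no role.

```java
import java.nio.charset.Charset;

public class GetBytesDemo {
    /**
     * Encodes text in the named charset. Characters the charset cannot
     * represent are replaced by its replacement byte, '?' (0x3F) for ISO-8859-1.
     */
    public static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    /** Formats bytes as lowercase two-digit hex values separated by spaces. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "\u4E2D"; // 中
        System.out.println(Integer.toHexString(str.charAt(0))); // 4e2d, the in-memory code unit
        System.out.println(hex(encode(str, "GBK")));            // d6 d0
        System.out.println(hex(encode(str, "UTF-8")));          // e4 b8 ad
        System.out.println(hex(encode(str, "ISO-8859-1")));     // 3f
    }
}
```

Passing a Charset object, as encode does, avoids the checked UnsupportedEncodingException that the String-name overload declares.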

 

2. new String(byte[] bytes, String encoding) The getBytes() method obtains a byte array from a string. To go the other way and obtain a string from a byte array, use new String(byte[] bytes, String encoding), which decodes the data in the byte array bytes according to encoding and creates a new String object.

① byte[] bytes = {(byte)0xD6, (byte)0xD0, (byte)0x31};

② String str = new String(bytes);

③ str = new String(bytes, "ISO-8859-1");

Statement ①: defines a byte array.

Statement ②: decodes the data in the byte array into a string using the default encoding (GBK here). In GBK, 0xD6 0xD0 means "中" and 0x31 means the character "1" (GBK is compatible with ASCII, but not with the part of ISO-8859-1 beyond ASCII), so the value of str is "中1".

Statement ③: decodes the byte data as ISO-8859-1. In ISO-8859-1 every byte is one character, so the byte array is interpreted as a string of three characters. Each byte value maps to the Unicode character with the same code, so 0xD6 becomes "Ö" (U+00D6), 0xD0 becomes "Ð" (U+00D0), and 0x31 becomes "1" (ISO-8859-1 is also compatible with ASCII). The value of str is therefore the garbled string "ÖÐ1" rather than "中1". No information is lost, however: calling getBytes("ISO-8859-1") on this string returns the original three bytes.
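The effect of decoding the same three bytes with two different charsets can be seen in the sketch below. The class NewStringDemo and the helpers decode and codePoints are illustrative names, and GBK is assumed to be available in the JDK. Code points are printed instead of the characters themselves so that the console encoding cannot distort the result. Decoding as ISO-8859-1 maps each byte to the character with the same code value.

```java
import java.nio.charset.Charset;

public class NewStringDemo {
    /** Decodes bytes using the named charset. */
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Lists the code points of a string, for example "U+4E2D U+0031". */
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0x31};
        System.out.println(codePoints(decode(bytes, "GBK")));        // U+4E2D U+0031
        System.out.println(codePoints(decode(bytes, "ISO-8859-1"))); // U+00D6 U+00D0 U+0031
    }
}
```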

 

3. setCharacterEncoding() This method sets the encoding of an HTTP request or response. For a request, it specifies the encoding of the submitted content; once it is set, getParameter() returns correctly decoded strings directly. If it is not set, ISO-8859-1 is used by default and the result needs further processing. Note that setCharacterEncoding() must be called before any call to getParameter(), and that it takes effect only for the POST method, not for GET. The likely reason is that the first getParameter() call makes the container parse all the submitted content with the encoding in effect at that moment; later getParameter() calls do not parse it again, so a subsequent setCharacterEncoding() has no effect. When a form is submitted with GET, the submitted content is part of the URL and has already been decoded when the request is first processed, so setCharacterEncoding() naturally has no effect either. For a response, it specifies the encoding of the output, and the setting is also passed to the browser to tell it which encoding the output uses.

 

3. Page encoding

Page encoding has two aspects: the encoding of the page file itself, that is, the encoding it is saved in, and the encoding the client browser uses to display the page.

 

 1. Page save encoding format

 

1). The encoding of an HTML page depends on the encoding option chosen when the file is saved. Most web editing software lets you choose the encoding, and the default is the local encoding. To reduce encoding problems, it is best to save pages as UTF-8.

 

2). A JSP page uses the following directive to declare the encoding of the JSP source file. Add this line at the top of the JSP source file: <%@ page pageEncoding="xxx" %>, where xxx can be GB2312, GBK, UTF-8 (unlike MySQL, where the name is UTF8), and so on. The default value is ISO-8859-1. The encoding used when saving the file should be the same as xxx.

 

2. Page display encoding (notifies the client browser what character set encoding to use to display the page)

 

1). In HTML, set the page display encoding with a meta tag: <meta http-equiv="Content-Type" content="text/html; charset=xxx">

 

2). To set the page display encoding method in Servlet, use response.setContentType("text/html; charset=xxx"); to specify the generated page encoding.

 

3). In JSP, set the page display encoding with <%@ page contentType="text/html; charset=xxx" %>. The default character set is ISO-8859-1.

 

3. Page input encoding Setting the page display encoding also determines the encoding used for input on the page. If no page encoding is specified, the default encoding of the operating system is used.

 

4. Form parameter encoding When data is entered through a form, it is processed as follows: user input (GBK: d6d0 cec4) -> browser (GBK: d6d0 cec4) -> web server, decoded as ISO-8859-1 (00d6 00d0 00ce 00c4) -> class. The class has to repair the value: getBytes("iso8859-1") yields d6 d0 ce c4, new String(bytes, "gbk") interprets these bytes as d6d0 cec4, and the Unicode representation in memory is 4e2d 6587.

1. The encoding of the user's input depends on the encoding specified by the page.

 

2. From the browser to the web server, the form can specify the character set used for the submitted content; otherwise the encoding specified by the page is used. However, if parameters are typed directly into the URL, the encoding is often that of the operating system, because the page is not involved at that point.

3. What the web server receives is a byte stream. By default, getParameter() decodes it as ISO-8859-1, which gives an incorrect result that needs to be repaired. If the encoding is set in advance with request.setCharacterEncoding(), the correct result is obtained directly.
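The repair step described above, getBytes("iso8859-1") followed by new String(bytes, "gbk"), can be simulated without a servlet container. This is a sketch under assumptions: the class FormParameterRecovery and the method recover are illustrative names, the container's behaviour is imitated by decoding the browser's bytes as ISO-8859-1, and GBK is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FormParameterRecovery {
    /**
     * Repairs a string that was decoded as ISO-8859-1 although the bytes were
     * really in another charset. ISO-8859-1 maps every byte to one character
     * and back, so the original bytes can be recovered without loss.
     */
    public static String recover(String misdecoded, Charset actual) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumed to be available
        String typed = "\u4E2D\u6587";        // 中文, as entered in the form

        byte[] sent = typed.getBytes(gbk);    // d6 d0 ce c4 on the wire
        // What getParameter() would return with the default ISO-8859-1 decoding:
        String seenByServlet = new String(sent, StandardCharsets.ISO_8859_1);

        System.out.println(seenByServlet.length());                    // 4
        System.out.println(recover(seenByServlet, gbk).equals(typed)); // true
    }
}
```

The same trick does not work if the wrong decoding was done with a charset that cannot represent every byte value, because the original bytes are then already lost.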

 

5. Database coding

 

1. MySQL's character sets MySQL supports multiple character sets and conversion between them, which eases porting and multilingual support. Character sets can be set at the server level, the database level, the table level and the column level. The character set is ultimately used only by columns that store characters. For example, if column col1 of table1 has a character type, col1 uses a character set; if column col2 of table1 is of type int, the notion of a character set does not apply to col2. The server-level, database-level and table-level character sets are all just defaults for the column character set.

MySQL always has a server character set, which can be specified by a startup parameter, at compile time, or in the configuration file. The server character set serves only as the default at the database level: when a database is created you can specify its character set, and if you do not, the server's character set is used. Likewise, a table created without an explicit character set takes the database's, and a column created without one takes the table's. Normally you only need to set the server-level character set, and the database, table and column levels inherit it. Because UTF8 covers the widest range of characters, we generally set the MySQL server-level character set to UTF8. (Note that MySQL's utf8 stores at most three bytes per character; utf8mb4 is needed for characters outside the Basic Multilingual Plane.)

 

2. MySQL's storage mechanism When a client (the mysql command line, JDBC, PHP, CGI, etc.) connects to MySQL, it must state which character set the data it sends uses, namely character_set_client. The odd thing about MySQL is that the received data is not converted straight to the character set it is stored in. It is first converted to the character set given by the character_set_connection variable, and only then to the database's default character set, character_set_database, for storage. When the data is output, it is converted to the character set given by character_set_results. The three connection variables do the following:

character_set_client: the character set in which the client sends queries
character_set_connection: the character set into which the server converts received query strings
character_set_results: the character set into which the server converts result data

3. Java and the database When a Java program connects to the database through JDBC, the two URL properties useUnicode=true and characterEncoding=utf-8 set the encoding used by the client. With MySQL 4.1 or later and MySQL JDBC Driver 3.0.16 or later, useUnicode=true&characterEncoding=GBK is not needed in the JDBC URL: on connecting, the driver automatically detects the encoding given by the server variable character_set_server and assigns it to character_set_client and character_set_connection.
Use the following method to view the encodings in effect for SQL sent from the JDBC client to the server:

    // requires: import java.sql.*;
    public void select() throws SQLException {
        String url = "jdbc:mysql://localhost/database";
        // try-with-resources closes the connection, statement and result set
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW VARIABLES LIKE 'character_set_%'")) {
            while (rs.next()) {
                // column 1 is the variable name, column 2 its value
                System.out.println(rs.getString(1) + "," + rs.getString(2));
            }
        }
    }

 

6. Detailed process of Java encoding conversion

Our common Java programs fall into the following categories:

* Classes that run directly on the console (including GUI classes)
* JSP code classes (note: a JSP is a variant of a Servlet class)
* Servlet classes
* EJB classes
* Other supporting classes that cannot be run directly

Any of these class files may contain Chinese strings, and the first three kinds often interact directly with users to input and output characters. For example, a JSP or Servlet receives characters sent by the client, and these may include Chinese characters. Whatever role these Java classes play, their life cycle is as follows:

* The programmer picks a suitable editor on some operating system, writes the source code and saves it with the .java extension; for example, editing a Java source file with Notepad on Chinese win2k.
* The programmer compiles the source with javac.exe from the JDK to produce .class files (JSP files are compiled by the container, which calls the JDK).
* The classes are run directly, or deployed to a web container and run, and produce their output.

So how do the JDK and JVM encode and decode these files during each of these steps? Here we take the Chinese Windows XP operating system as an example to illustrate how Java classes are encoded and decoded.

In the first step, we write a Java source file (any of the five kinds of Java program above) with an editor such as Notepad on Chinese win2k. The file is saved in GBK, the operating system's default encoding (which Java reports as the file.encoding property), producing a .java file. In other words, before compilation the Java source file is stored in the operating system's default file.encoding encoding. To see the value of file.encoding, you can use the following code:

    public class ShowSystemDefaultEncoding {
        public static void main(String[] args) {
            // file.encoding holds the JVM's default character encoding
            String encoding = System.getProperty("file.encoding");
            System.out.println(encoding);
        }
    }

In the second step, we compile the Java source with javac.exe from the JDK. Because the JDK is an international product, if we do not pass the -encoding option to state the encoding of the source file, javac first obtains the default encoding of the operating system, that is, the file.encoding parameter (GBK on win2k). It then converts the source from the file.encoding encoding to Java's internal Unicode format in memory and compiles the converted source into a class, which at this point exists in memory in Unicode form. Finally, the JDK writes the compiled class to disk as the .class file we see. The .class file we end up with therefore holds its contents in a Unicode-based format (string constants are stored in a modified UTF-8 form): the Chinese strings from our source are in it, but they have been converted from the file.encoding encoding to Unicode.

JSP source files are handled differently in this step. The web container calls the JSP compiler, which first checks whether the JSP file declares a file encoding. If it does not, the JSP compiler calls the JDK to convert the JSP file, using the JVM's default character encoding (the file.encoding of the operating system on which the web container runs), into a temporary Servlet class, then compiles it into a Unicode-format class and saves it in a temporary folder.
For example, on Chinese win2k the web container converts the JSP file from GBK to Unicode and then compiles it into a temporarily stored Servlet class that responds to user requests.

The third step is to run the classes compiled in the second step. There are three cases:

A. Classes that run directly on the console
B. EJB classes and supporting classes that cannot be run directly (such as JavaBean classes)
C. JSP code and Servlet classes

A. Classes that run directly on the console. Running such a class requires a JVM, so a JRE must be installed on the operating system. The process is as follows: java starts the JVM, which reads the class file from disk into memory, where it is held in Unicode form, and runs it. If the class needs user input, it decodes the input string with the file.encoding encoding by default, converts it to Unicode and keeps it in memory (the user can set the encoding of the input stream). When the program has run, the resulting string (in Unicode) is handed back to the JVM, and the JRE converts it to the file.encoding encoding (the user can set the encoding of the output stream), passes it to the operating system's display interface and outputs it. Every one of these conversions must use the correct encoding, or the final output will be garbled.

B. EJB classes and supporting classes that cannot be run directly (such as JavaBean classes). These classes generally do not interact with the user directly for input and output; they exchange data with other classes. After being compiled in the second step they are stored as classes with Unicode content, and as long as nothing is lost when parameters are passed between them and other classes, they work correctly.

C. JSP code and Servlet classes. After the second step a JSP file has also been converted into a Servlet class file. Unlike a standard Servlet it lives not in the classes directory but in the web container's temporary directory, so in this step we treat it as a Servlet too. When a client requests a Servlet, the web container calls its JVM to run it. The JVM first reads the Servlet class from disk and loads it into memory, where it is held in Unicode form, and then runs it. If the running Servlet needs to accept characters from the client, such as values entered in a form or passed in the URL, and the program has not set the encoding for incoming parameters, the web container decodes them as ISO-8859-1 by default, converts them to Unicode and keeps them in the container's memory. When the Servlet has run, it produces output as Unicode strings (HTML markup, user output and so on). The container then sends this output to the client browser: in the encoding the program specified for output if there is one, and otherwise in ISO-8859-1 by default.
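Case A mentions that the encoding of the input and output streams can be set explicitly instead of relying on file.encoding. A minimal sketch of that idea follows, using in-memory streams in place of the console or a file. The class ExplicitStreamCharset and its methods read and write are illustrative names, and GBK is assumed to be available in the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitStreamCharset {
    /** Decodes bytes into a String through an InputStreamReader with an explicit charset. */
    public static String read(byte[] input, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Encodes a String into bytes through an OutputStreamWriter with an explicit charset. */
    public static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // the writer has been closed and flushed at this point
    }

    public static void main(String[] args) {
        byte[] gbkInput = {(byte) 0xD6, (byte) 0xD0};        // 中 in GBK
        String text = read(gbkInput, Charset.forName("GBK")); // held as Unicode in memory
        byte[] utf8Output = write(text, StandardCharsets.UTF_8);
        System.out.println(utf8Output.length);                // 3
    }
}
```

Because both directions name their charset, the result does not change when the program is moved to a machine with a different default encoding.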

 

7. JSP compilation process (taking Tomcat as an example)

 

1. Tomcat first reads the code of the entire JSP page and writes it to a new Java file. When reading a JSP file, Tomcat first looks at the file's pageEncoding attribute and reads the file in the encoding it specifies. If pageEncoding is not specified, Tomcat uses the charset specified in contentType. If contentType does not specify one either, it uses the default ISO-8859-1 encoding.

 

2. After Tomcat reads the JSP file, it will use UTF-8 encoding to write these contents into a new file, and then compile.

 

3. When the JSP file is displayed, the MIME type and charset specified in contentType are used. If charset is not specified, the encoding specified in pageEncoding is used. If pageEncoding is not specified, the default ISO-8859-1 encoding is used.
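The two fallback chains in steps 1 and 3 can be written down as a small lookup. The sketch below only illustrates the rules as stated here and is not Tomcat's actual code; the class JspCharsetResolution and the methods charsetOf, sourceCharset and responseCharset are hypothetical names.

```java
import java.util.Locale;

public class JspCharsetResolution {
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset parameter of a value such as "text/html; charset=UTF-8", or null if absent. */
    public static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Step 1: encoding used to read the JSP source: pageEncoding, else contentType charset, else the default. */
    public static String sourceCharset(String pageEncoding, String contentType) {
        if (pageEncoding != null) {
            return pageEncoding;
        }
        String fromContentType = charsetOf(contentType);
        return fromContentType != null ? fromContentType : DEFAULT_CHARSET;
    }

    /** Step 3: encoding announced to the browser: contentType charset, else pageEncoding, else the default. */
    public static String responseCharset(String pageEncoding, String contentType) {
        String fromContentType = charsetOf(contentType);
        if (fromContentType != null) {
            return fromContentType;
        }
        return pageEncoding != null ? pageEncoding : DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        System.out.println(sourceCharset("GBK", "text/html; charset=UTF-8"));   // GBK
        System.out.println(responseCharset("GBK", "text/html; charset=UTF-8")); // UTF-8
        System.out.println(responseCharset(null, "text/html"));                 // ISO-8859-1
    }
}
```

The two methods differ only in which attribute is consulted first, which is why a page can be read in one encoding and sent to the browser in another.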
