Conversion of String and byte[] in Java

String s = "fs123fdsa"; // a String variable

byte[] b = s.getBytes(); // String converted to byte[] using the platform default encoding

String t = new String(b); // byte[] converted back to String using the platform default encoding

  Java developers often run into garbled Chinese text and other encoding problems, especially when the contents of a String need to be re-encoded. To solve these problems, you must understand how Java handles strings.

1. "Characters" are represented by numbers

  Let's first revisit how a computer handles "characters". Everyone needs to keep this principle firmly in mind, and when writing Java programs in particular there is no room for vagueness. Computers represent everything as numbers, and characters are no exception. For example, to display the Arabic numeral "3", a PC does not store the number 3. It stores hexadecimal 0x33, and that same 0x33 is what sits in memory or gets written to a file. If you don't believe it, create a text file containing only "3" and open it in a hex editor such as UltraEdit to look at its raw bytes.
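The point can be checked from Java itself. The following is a minimal sketch; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class DigitAsNumber {
    // Returns the single byte that US-ASCII uses for the given character.
    static int asciiByte(char c) {
        byte[] encoded = String.valueOf(c).getBytes(StandardCharsets.US_ASCII);
        return encoded[0] & 0xFF; // mask so the byte is read as unsigned
    }

    public static void main(String[] args) {
        // prints: '3' is stored as 0x33
        System.out.printf("'3' is stored as 0x%02X%n", asciiByte('3'));
    }
}
```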

 

2. All "characters" must be represented by numbers + coding table

  This raises a question: why must "3" be represented by 0x33? Why not 0x43, or simply 0x03? In fact any number would do, but everyone has agreed to use the ASCII table (the American Standard Code for Information Interchange) to decide which number stands for which character. Similarly, to represent Chinese characters, China has defined its own code tables, of which GB2312 is the most widely used. For example, the Chinese character "dang" (当) is represented by two 8-bit numbers, 0xB5 and 0xB1. So if the program that displays the characters does not know which encoding table a sequence of numbers was encoded with, it cannot tell which characters they are. If you process the numbers with the wrong table, the resulting characters are likely to be completely wrong. For example, an English system may have no GB2312 table. Give it 0xB5, 0xB1 and it falls back to its own default table (every operating system has one), typically an 8-bit extension of ASCII such as ISO-8859-1, and displays two strange symbols, because that is what those two numbers mean in its table. Likewise, on a Traditional Chinese system the code table is BIG5, and what is displayed is some other Chinese character, not "dang".
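Here is a small sketch of the same two bytes read through three different tables. It assumes the runtime ships the GB2312 and Big5 charsets (standard JDK builds normally do), and the class and helper names are illustrative only. Code points are printed in hex so the result does not depend on what the console can display.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameBytesDifferentTables {
    // The two bytes that GB2312 uses for the character dang (U+5F53).
    static final byte[] DANG_BYTES = {(byte) 0xB5, (byte) 0xB1};

    // Interprets the bytes with the given encoding table.
    static String decode(byte[] bytes, Charset table) {
        return new String(bytes, table);
    }

    // Formats each char of the string as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // GB2312: one character, U+5F53 (dang)
        System.out.println("GB2312:     " + codePoints(decode(DANG_BYTES, Charset.forName("GB2312"))));
        // ISO-8859-1: two symbols, U+00B5 (micro sign) and U+00B1 (plus-minus sign)
        System.out.println("ISO-8859-1: " + codePoints(decode(DANG_BYTES, StandardCharsets.ISO_8859_1)));
        // Big5: some other character, not U+5F53
        System.out.println("Big5:       " + codePoints(decode(DANG_BYTES, Charset.forName("Big5"))));
    }
}
```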

 

3. Unicode lets the world speak one language

  After reading the above, do you feel that having so many languages in the world, each with its own encoding table, is a lot of trouble? Even Chinese alone has two popular tables, GB2312 and BIG5. To use characters from both you have to convert between them, which is tedious. Worse, if you want to write an article that mixes the scripts of many languages, the program that processes it must be told which encoding standard each character uses. If you want to search for a character in the article, you must also say which character in which encoding you are looking for. Otherwise, a search for the Chinese character "dang" as 0xB5, 0xB1 may well turn up unrelated Japanese or Polish characters that happen to be represented by the same numbers. That is trouble enough!

  So people decided it would be better for everyone to use the same encoding standard. Every character gets its own place in one table, and programs that process text only need to follow that table. However, a table that contains every character is going to be large. ASCII has only 128 characters in total, covering English letters, digits and symbols. Add Chinese and there are suddenly tens of thousands more, so the space needed to store one character grows as well. Unicode, as originally specified, represents each character with two 8-bit numbers. Think about it: 256 x 256 = 65536, a very large number, enough to hold the writing of the whole world. (Unicode was later extended beyond 65536 characters, but the 16-bit design is the one that matters for the discussion here.) Some people point out that Chinese alone may have more than 60,000 characters, before counting every other script, but the designers judged that most of them are rarely used, so that is how it was settled. Note that although GB2312 and Unicode both use two 8-bit numbers for a Chinese character, the assignments are different. For example, 0xB5 0xB1 read as a Unicode number is not "dang" but a character from another country's script.
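The arithmetic and the difference between the two tables can be illustrated with a short sketch. It assumes the GB2312 charset is available in the runtime, and the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;

class UnicodeVersusGb2312 {
    // Formats bytes as space-separated hex, for example "B5 B1".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the string with GB2312 and shows the resulting bytes in hex.
    static String gb2312Hex(String s) {
        return hex(s.getBytes(Charset.forName("GB2312")));
    }

    public static void main(String[] args) {
        String dang = "\u5F53"; // the character dang

        // Number of distinct values two 8-bit numbers can hold: prints 65536
        System.out.println(256 * 256);

        // The Unicode number of dang: prints U+5F53
        System.out.printf("U+%04X%n", (int) dang.charAt(0));

        // The GB2312 numbers of the same character: prints B5 B1
        System.out.println(gb2312Hex(dang));

        // The Unicode number 0xB5B1 belongs to a Hangul (Korean) block, not to Chinese.
        System.out.println(Character.UnicodeBlock.of('\uB5B1'));
    }
}
```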

 

4. A brief look at how C handles characters

  Let's talk about C strings. C was born before Java. Its basic data types include no string type, only char[]. In other words, C puts the sequence of characters into a byte array and is done with it. C does not care which characters are in the array or which encoding standard they follow. Nor is its char necessarily 8 bits wide. Sometimes it is 16 bits, depending on the machine and operating system. Therefore the person writing the program must know which encoding table the contents of the char[] being processed were encoded with. Otherwise, asking whether two strings from different countries hold the same characters is meaningless!

 

5. How Java processes characters

  The world keeps improving, and Java is an example. Java finally has a String class, which is the best tool for solving character problems. One basic point about Java: a String object never needs to be told which encoding table it uses. How does it know by itself which characters a bunch of numbers represent? Because the character data inside a String is always stored as Unicode. To represent a single character, Java also has the char data type, whose size is fixed at 16 bits (two bytes), giving a range of 0 to 65535, so that one char corresponds to one Unicode character. To get at the Unicode numbers in a String, you can call getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) to fill a char[]. Each element of that char[] is the number assigned to one character of the String by the Unicode table.
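A minimal sketch of reading those numbers with getChars follows. The wrapper class and method are illustrative names, not part of any library.

```java
class CharsOfString {
    // Copies every char of the string into a new array using String.getChars.
    static char[] unicodeNumbers(String s) {
        char[] dst = new char[s.length()];
        s.getChars(0, s.length(), dst, 0);
        return dst;
    }

    public static void main(String[] args) {
        // The digit 3 followed by the character dang.
        for (char c : unicodeNumbers("3\u5F53")) {
            // prints U+0033, then U+5F53
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```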

  Unfortunately, most systems and programs do not process characters as Unicode, and Java programs constantly exchange data with them, so whenever you receive or send characters you must pay attention to the relationship between the other side's encoding and Unicode. Suppose you receive two numbers from the network or a file: 0xB5, 0xB1. A Java program cannot know whether they are Chinese, Japanese or English. If you do not specify the encoding table for them, Java uses the default encoding table of the current system. If the two numbers were sent from Chinese Windows 98 and the Java program runs on English Linux, you get the so-called garbled-text problem: Java decodes the two numbers with the English system's default table, so the String obtained from new String(new byte[]{(byte) 0xB5, (byte) 0xB1}) is not the Chinese character "dang" but two strange Western symbols. However, if you know the two numbers are Chinese, you can name the table with new String(new byte[]{(byte) 0xB5, (byte) 0xB1}, "GB2312"), and the new String really is the character "dang".

  In the other direction, to display a Java String containing "dang" on Chinese Windows 98, you must output the character as the two 8-bit numbers 0xB5, 0xB1, whether you are writing to a file or sending to a browser. How do you output "dang" in GB2312? String.getBytes("GB2312") does it. So remember one thing: all information exchanged with the outside world is exchanged as byte[]. You will notice that most of the I/O classes in Java have methods that take byte[] as parameters and return values. However, there are also some poorly designed APIs that offer no byte[]-based way to exchange information, which causes headaches for programmers working across text platforms. The servlet API's HttpServletRequest.getParameter() is one of them. Fortunately, JSP and servlets also provide a way to specify the encoding table beforehand (such as ServletRequest.setCharacterEncoding()), so this problem can be solved fairly simply.
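The round trip across the byte[] boundary might look like the sketch below. The method names receive and send are hypothetical, the GB2312 charset is assumed to be installed, and the Charset overloads are used instead of the string-name overloads only to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class BoundaryConversion {
    // Assumption for this sketch: the other side speaks GB2312.
    static final Charset WIRE = Charset.forName("GB2312");

    // Incoming bytes become a String by naming the table they were encoded with.
    static String receive(byte[] wire) {
        return new String(wire, WIRE);
    }

    // An outgoing String becomes bytes in the table the other side expects.
    static byte[] send(String text) {
        return text.getBytes(WIRE);
    }

    public static void main(String[] args) {
        byte[] wire = {(byte) 0xB5, (byte) 0xB1};
        String text = receive(wire);

        System.out.println(text.equals("\u5F53"));           // true
        System.out.println(Arrays.equals(send(text), wire)); // true

        // What new String(bytes) and getBytes() would silently use instead:
        System.out.println(Charset.defaultCharset());
    }
}
```

Naming the table at both ends keeps the result independent of the machine the program happens to run on.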

 

6. Some incorrect ways of handling Java Chinese problems found on the Internet

  The first is the most common: whatever the content, create the string with new String(..., "ISO-8859-1") and then output it using the default encoding (the server is usually an English system). The resulting String does not hold the real characters as Unicode. The bytes have simply been copied one by one into the String's char[]. Once the runtime environment changes, you are forced to modify a lot of code. Nor can text in several different encodings be handled in the same string this way.
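What this trick actually produces can be seen in a short sketch, assuming the GB2312 charset is available. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Smuggling {
    // The trick described above: every byte is copied into its own char.
    static String smuggle(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] gb = {(byte) 0xB5, (byte) 0xB1}; // GB2312 bytes of dang

        String fake = smuggle(gb);
        String real = new String(gb, Charset.forName("GB2312"));

        System.out.println(fake.length());         // 2: two chars, U+00B5 and U+00B1
        System.out.println(real.length());         // 1: the single character U+5F53
        System.out.println(fake.equals("\u5F53")); // false: the String does not hold dang

        // The original bytes only come back if the output side uses ISO-8859-1 again.
        System.out.println(Arrays.equals(fake.getBytes(StandardCharsets.ISO_8859_1), gb)); // true
    }
}
```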

  The second: convert a string from one encoding, such as GB2312, to another, such as UTF-8, and then create a String from the result directly with new String(...). The characters that end up in such a String are undetermined, and it represents different characters on different systems. If you require others to exchange information with you through a "UTF-8 format" String, you have in effect broken the rule Java established for compatibility across languages. At its root, this mistake treats a string the way C code does, as raw memory you are free to encode however you like, and ignores the fact that a Java String has only one encoding. If you really want to choose the encoding freely, use byte[] or char[] and the problem goes away.
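By contrast, a conversion between encodings that keeps the encoded data in byte[] and uses String only as the Unicode middle step could be sketched as follows. The transcode helper is a hypothetical name, and the GB2312 charset is assumed to be installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // bytes in one table -> Unicode String -> bytes in another table
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String unicode = new String(input, from); // decode with the source table
        return unicode.getBytes(to);              // encode with the target table
    }

    public static void main(String[] args) {
        byte[] gb = {(byte) 0xB5, (byte) 0xB1}; // GB2312 bytes of dang
        byte[] utf8 = transcode(gb, Charset.forName("GB2312"), StandardCharsets.UTF_8);

        // prints: E5 BD 93
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The UTF-8 result stays a byte[] and is never wrapped in another String, so there is no moment at which a String holds anything other than Unicode characters.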

 

Everything above, besides being the basis for solving Java's Chinese-text problems, is basic computer knowledge that should have been mastered many years ago.

 

Article source: https://www.cnblogs.com/fuzhaoyang56/archive/2013/05/24/3096471.html
