Character encoding problems in C and Java programs


Out of respect for the original author, the original address is:
http://blog.csdn.net/lpali/article/details/5405203



String s = "fs123fdsa";    // a String variable
byte[] b = s.getBytes();   // String converted to byte[] (uses the platform default encoding)
String t = new String(b);  // byte[] converted back to String (again the default encoding)

Java programmers often run into garbled Chinese text and other encoding problems, especially when the content of a String has to be re-encoded. To solve these problems, you need to understand how Java handles strings.

1. "Characters" are represented by numbers

Let's first review how computers handle "characters". Everyone should keep this principle firmly in mind, especially when writing Java programs, where there is no room for vagueness. We know that computers use numbers to represent everything, and "characters" are no exception. For example, to display the Arabic numeral "3", a PC does not store the number 3; it stores the hexadecimal value 0x33. Whether the character is held in memory or written to a file, what is actually stored is 0x33. If you don't believe it, create a text file containing just "3" and open it in the hex view of an editor such as UltraEdit.
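
The same check can be made from Java itself. The following is only a minimal sketch; the class name CharAsNumber and the helper codeOf are made up for this illustration.

```java
class CharAsNumber {
    // Returns the number that stands for a character, formatted as hex.
    static String codeOf(char c) {
        return "0x" + Integer.toHexString(c).toUpperCase();
    }

    public static void main(String[] args) {
        // The character '3' is stored as the number 0x33, not as the number 3.
        System.out.println(codeOf('3'));
        System.out.println(codeOf('A'));
    }
}
```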


2. All "characters" must be represented by numbers + an encoding table

This raises a question: why must 0x33 represent "3"? Why not 0x43, or simply 0x03? In fact any number could be used, but everyone has settled on the ASCII table (the American Standard Code for Information Interchange) to decide which number stands for each character. Similarly, to represent Chinese characters, China has defined its own code tables, of which GB2312 is the most widely used. For example, the Chinese character "当" (dang) is represented by two 8-bit numbers, 0xB5 and 0xB1. So if the program that displays the characters does not know which encoding table a sequence of numbers was encoded with, it cannot tell what the characters are. If it processes the numbers with the wrong table, the resulting characters are likely to be completely wrong. For example, an English system has no GB2312 table, so if you hand it 0xB5, 0xB1 it interprets each byte with its own default table (the operating system usually has one, typically ASCII or, more precisely, an 8-bit extension of it such as ISO-8859-1). The result is two strange symbols, because in that table each of the two numbers is a symbol of its own. Likewise, on a Traditional Chinese system the code table is BIG5, and what is displayed is some other, unrelated Chinese character, not "当".
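
A short sketch of this effect in Java: the same two bytes are decoded once with GB2312 and once with a single-byte Western table. ISO-8859-1 is used here as a stand-in for the "English system" table, which is an assumption of this example, as are the class and method names. It also assumes the running JDK ships the GB2312 charset, which is not guaranteed on every runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameBytesDifferentTables {
    // The two bytes discussed in the text.
    static final byte[] BYTES = { (byte) 0xB5, (byte) 0xB1 };

    // Interprets the bytes according to the given encoding table.
    static String decode(byte[] bytes, Charset table) {
        return new String(bytes, table);
    }

    public static void main(String[] args) {
        // Assumption: GB2312 is available in this JDK.
        Charset gb2312 = Charset.forName("GB2312");
        String chinese = decode(BYTES, gb2312);                     // one Chinese character
        String western = decode(BYTES, StandardCharsets.ISO_8859_1); // two symbols
        System.out.println(chinese.length() + " char(s): " + chinese);
        System.out.println(western.length() + " char(s): " + western);
    }
}
```

What the console actually shows depends on the terminal's own encoding, which is the same problem one level further out.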



3. Unicode lets the whole world speak one language

After reading the above, doesn't it seem troublesome that the world has so many languages, each with its own encoding table? Even Chinese has two popular tables, GB2312 and BIG5. When you need to work with characters in different Chinese encodings, you have to convert back and forth, which is a real nuisance. And that is not all: writing an article that contains text in several languages is also awkward, because the program that processes the article must be told which encoding standard each character follows. To search for a word in such an article, you must also say which word in which encoding you mean. Otherwise, when you search for the Chinese character "当" as 0xB5, 0xB1, you may well get back unrelated Japanese or Polish text that happens to be represented by the same numbers. Troublesome enough!

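So people thought: why not have everyone use the same encoding standard, in which every script has its place, so that programs handling text only need to follow this one table? But a table that contains every script is large. English letters plus digits fit within 128 entries, but adding Chinese suddenly brings in tens of thousands more, so the space needed to store one character grows as well. Unicode as originally designed represents each character with two 8-bit numbers, which gives 256 x 256 = 65536 possible values, a number large enough to take in the world's commonly used scripts. Of course, some people pointed out that Chinese alone may have more than 60,000 characters, never mind the other scripts, but the designers judged that the commonly used Chinese characters are far fewer, and so it was settled. (Unicode was later extended beyond 65536 code points, which is why Java represents some rarer characters with two char values.) Note that although GB2312 and this 16-bit Unicode both use two 8-bit numbers for a Chinese character, the actual assignments differ: 0xB5, 0xB1 in Unicode is not "当" but a character from another country's script.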
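
To make the numbers concrete, here is a small sketch that looks up where the values fall in the Unicode table. The class and method names are invented for this example; the character is written as a Unicode escape so that the result does not depend on the encoding of the source file.

```java
class UnicodeNumbers {
    // Unicode number (code point) of the first character of s.
    static int codePointOf(String s) {
        return s.codePointAt(0);
    }

    // The Unicode block (roughly, the script region) a number falls into.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        String dang = "\u5F53"; // the Chinese character discussed in the text
        System.out.println(Integer.toHexString(codePointOf(dang)));
        System.out.println(blockOf(0x5F53)); // where the Chinese character lives
        System.out.println(blockOf(0xB5B1)); // where the GB2312 byte pair would land
        // Number of 16-bit char values needed for a code point.
        System.out.println(Character.charCount(0x5F53));
        System.out.println(Character.charCount(0x20000));
    }
}
```

In Unicode the number 0xB5B1 falls in the Hangul syllables block, while the Chinese character sits at 0x5F53 in the CJK unified ideographs block.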



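4. How C handles characters so simply

Let's talk about C strings. C was born before Java, and its basic data types do not include a string type; it only has char[]. In other words, C simply places the characters one after another into a byte array and is done. C does not care what text is in the array, nor which encoding standard that text follows. A char is not even guaranteed to be 8 bits wide; on some platforms it is 16 bits, depending on the specific machine and operating system. So the programmer must know which encoding table the contents of a given char[] are expressed in. Keep in mind that comparing text from two different encodings to see whether it is "the same" is meaningless!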





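5. How Java handles characters

The world does move forward, and Java is an example. Java finally has a String class, which is the best tool for solving character problems. A basic point in Java is that a String object does not need an encoding table specified! How does it know which character each number stands for? Because the character data inside a String is stored in Unicode (as UTF-16). To represent a single character, Java also has the char data type, whose size is fixed at two 8-bit numbers (16 bits), that is, 0 to 65535, so that it corresponds to one Unicode character in that range. To obtain the Unicode numbers of a String, call getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) to fill a char[]; the values in that char[] are the String's characters, encoded according to the Unicode table.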
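
A minimal sketch of the getChars call described above; the wrapper class StringToChars and the helper unicodeValues are hypothetical names used only for this illustration.

```java
class StringToChars {
    // Copies every char of s into a new array using String.getChars.
    static char[] unicodeValues(String s) {
        char[] dst = new char[s.length()];
        s.getChars(0, s.length(), dst, 0);
        return dst;
    }

    public static void main(String[] args) {
        // One Latin letter followed by the Chinese character from the text.
        for (char c : unicodeValues("a\u5F53")) {
            // Each char is a 16-bit Unicode value, whatever the platform encoding is.
            System.out.println("0x" + Integer.toHexString(c));
        }
    }
}
```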

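Unfortunately, the vast majority of systems and programs today do not process characters as Unicode, and a Java program always has to exchange data with other programs and systems. So whenever it receives or sends a character, it must pay attention to how the current system's encoding relates to Unicode. Suppose you receive the numbers 0xB5, 0xB1 from the network or a file. The Java program does not know whether these two bytes are Chinese, Japanese, or English. If you do not state the encoding table for them, Java uses the current system's default encoding table. If the bytes were sent from a Chinese Windows 98 machine and the Java program runs on an English Linux machine, you get the so-called garbled-text problem: Java processes the two bytes with the English system's default table (ASCII or a single-byte extension of it), and the String obtained from new String(new byte[]{(byte)0xB5, (byte)0xB1}) no longer represents the Chinese character "当" but two strange Western symbols. If you know the bytes are definitely Chinese, you can specify new String(new byte[]{(byte)0xB5, (byte)0xB1}, "GB2312"), and only then is the new String really the character "当".

Likewise, to display a Java String containing "当" on Chinese Windows 98, you must output the character as the two 8-bit numbers 0xB5, 0xB1, whether you are writing a file or sending output to a browser. How do you output "当" in GB2312? String.getBytes("GB2312") does it! So remember one thing: all information exchanged with the outside world travels as byte[]! Notice that most of Java's I/O classes have methods that take or return byte[]. However, there are also quite a few carelessly designed APIs that offer no byte[]-based way to exchange information, which gives programmers on non-English platforms a headache. The Servlet API's HttpServletRequest.getParameter() is one of them. Fortunately, some JSP/Servlet containers also provide a way to specify the encoding table beforehand, which makes the problem fairly easy to solve.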
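
The boundary rule can be sketched as a pair of helpers that always name the encoding table when bytes enter or leave the program. The class BoundaryCodec and the methods receive and send are invented for this example, and it assumes the JDK in use provides the GB2312 charset.

```java
import java.nio.charset.Charset;

class BoundaryCodec {
    // Assumption: the peer system speaks GB2312 and this JDK supports it.
    static final Charset GB2312 = Charset.forName("GB2312");

    // Bytes coming in from a file or socket become a Unicode String.
    static String receive(byte[] wire) {
        return new String(wire, GB2312);
    }

    // A Unicode String becomes bytes in the peer's encoding on the way out.
    static byte[] send(String text) {
        return text.getBytes(GB2312);
    }

    public static void main(String[] args) {
        byte[] incoming = { (byte) 0xB5, (byte) 0xB1 };
        String text = receive(incoming);
        for (byte b : send(text)) {
            // Mask to print the byte as an unsigned hex value.
            System.out.println("0x" + Integer.toHexString(b & 0xFF));
        }
    }
}
```

Because neither helper relies on the platform default, the result should be the same on a Chinese Windows machine and an English Linux machine.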





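6. Some incorrect approaches to the Java Chinese-text problem found online

The most common one is to build every string with new String(..., "ISO-8859-1"), whatever the content, and then output it using the default encoding (servers usually run an English system). The String you are working with then does not represent the real characters in Unicode; the byte array has simply been copied by force into the String's char[]. As soon as your runtime environment changes, you are forced to modify a large amount of code. It also becomes impossible to handle text from several different encodings within the same string.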
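
The difference between this trick and a proper decode can be seen in a few lines. This is only an illustrative sketch with made-up names (Latin1Hack, hack, proper), and it assumes GB2312 is available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Hack {
    // The trick criticized in the text: every byte is copied into its own char.
    static String hack(byte[] gbBytes) {
        return new String(gbBytes, StandardCharsets.ISO_8859_1);
    }

    // Decoding with the table the bytes were really written in.
    static String proper(byte[] gbBytes) {
        return new String(gbBytes, Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        byte[] gbBytes = { (byte) 0xB5, (byte) 0xB1 };
        // The hacked String has two chars, neither of them the Chinese character.
        System.out.println(hack(gbBytes).length());
        // The properly decoded String has one char, the real character.
        System.out.println(proper(gbBytes).length());
        System.out.println(hack(gbBytes).equals(proper(gbBytes)));
    }
}
```

Methods such as length(), indexOf() or equals() therefore give misleading answers on the hacked String, even if it happens to display correctly after being written back out as ISO-8859-1.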

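The other is to convert a string in one encoding, say GB2312, into another, say UTF-8, and then create the String with plain new String(...) without stating that the bytes are UTF-8. The characters that end up in the String are then indeterminate: they represent different characters on different systems. Asking other people to exchange information using a "UTF-8 format" String already breaks the rule Java established to accommodate all languages. The underlying mistake is to keep thinking in the C way, treating a string purely as storage that you can encode however you like, and ignoring that a Java String has only one encoding. If you really want to encode freely, byte[] or char[] solves the problem completely.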
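
By contrast, a conversion between two encodings that respects this rule keeps the encoded data in byte[] and uses the String only as the Unicode middle step. The sketch below uses an invented class Transcode and method gb2312ToUtf8, and assumes the JDK supports GB2312.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // GB2312 bytes in, UTF-8 bytes out; the String in between is plain Unicode.
    static byte[] gb2312ToUtf8(byte[] gbBytes) {
        String text = new String(gbBytes, Charset.forName("GB2312"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = gb2312ToUtf8(new byte[] { (byte) 0xB5, (byte) 0xB1 });
        for (byte b : utf8) {
            // Mask to print each byte as an unsigned hex value.
            System.out.println("0x" + Integer.toHexString(b & 0xFF));
        }
        // Reading the result back requires naming UTF-8 explicitly.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).length());
    }
}
```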
