20200227 Java characters, bytes, and encoding


The development of character encoding

From the point of view of multi-language support in computers, the development can be divided into three stages:

  • Stage One (ASCII): at first, computers supported only English; other languages could not be stored or displayed on them. Example system: English DOS.
  • Stage Two (ANSI encodings, localized): to let computers support more languages, two bytes in the 0x80~0xFF range are typically used to represent one character. For example, in a Chinese operating system the character '中' uses the two bytes [0xD6, 0xD0]. Different countries and regions defined their own standards, which produced GB2312, BIG5, JIS and other encodings. These extended encodings, which use two bytes to represent one character of the local script, are collectively called ANSI encodings. On a Simplified Chinese system "ANSI encoding" means GB2312; on a Japanese system it means JIS. The various ANSI encodings are incompatible with one another, so when information is exchanged internationally, text belonging to two different languages cannot be stored in the same ANSI-encoded byte stream. Example systems: Chinese DOS, Chinese Windows 95/98, Japanese Windows 95/98.
  • Stage Three (UNICODE, international): to make international information exchange easier, international organizations defined the UNICODE character set, which gives each character of every language a single, unique number, meeting the requirements of cross-language, cross-platform text processing. Example systems: Windows NT/2000/XP, Linux, Java.

How strings are stored in memory:

  • In the ASCII stage, a string of single-byte characters uses one byte per character (SBCS). For example, "Bob123" occupies 6 bytes in memory.
  • In the ANSI stage, which supports multiple languages, each character is represented by one or more bytes (MBCS), so strings stored this way are also called multi-byte strings. For example, "中文123" occupies 7 bytes in memory on Chinese Windows 95: 2 bytes for each Chinese character and 1 byte for each letter or digit.
  • Once UNICODE is adopted, the computer stores each character of a string as its number in the UNICODE character set. Computers currently use 2 bytes (16 bits) for each number, so characters stored this way are also called wide characters. For example, on Windows 2000 the string "中文123" is actually stored as 5 numbers, 10 bytes in total. (The sketch below reproduces this contrast in Java.)
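
A minimal Java sketch of the three storage styles, assuming the JVM provides the GBK charset (the class name is illustrative):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StorageSizeDemo {
    public static void main(String[] args) {
        // SBCS: every ASCII character occupies exactly one byte.
        System.out.println("Bob123".getBytes(StandardCharsets.US_ASCII).length); // 6

        // MBCS / ANSI (GBK): each Chinese character takes 2 bytes, each ASCII character 1 byte.
        System.out.println("中文123".getBytes(Charset.forName("GBK")).length); // 7

        // Wide characters: UTF-16BE stores each of these 5 characters in 2 bytes.
        System.out.println("中文123".getBytes(StandardCharsets.UTF_16BE).length); // 10
    }
}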

ASCII encoding and Chinese encodings

ASCII (American Standard Code for Information Interchange) is the American standard for information interchange; it has an official ASCII code table.

An ASCII byte has eight bits, so it can represent a total of 256 (2 to the 8th power) different states.

The first 32 states, numbered starting from 0, were each assigned a special purpose: whenever a terminal or printer received one of these agreed-upon bytes, it performed an agreed-upon action. On 0x0A (line feed) the terminal moved to a new line; on 0x1B the printer printed highlighted text, or the terminal displayed letters in color. The byte values below 0x20 are therefore called "control codes."

Then spaces, punctuation marks, digits, and upper- and lower-case letters were each assigned consecutive byte states, continuing up to number 127, so a computer could now use different byte values to store English text.

Later, much like the building of the Tower of Babel, computers came into use all over the world, but many countries do not use the English alphabet, and many of their letters are not in ASCII. To store their own text, they decided to use the unoccupied values after 127 to represent these new letters and symbols, and also added many shapes needed for drawing tables, such as horizontal lines, vertical lines, and crosses, numbering states all the way up to 255. The characters from 128 to 255 are called the "extended character set."

China created the two-byte encoding "GB2312", and later the likewise two-byte "GBK"; "GBK" was subsequently extended into "GB18030". These encoding standards are collectively known as "DBCS" (Double Byte Character Set).

The biggest feature of the DBCS family of standards is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme: if a byte is 127 or less it is interpreted as ASCII, and if it is greater than 127, it and the byte that follows it together represent one Chinese character.
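
A small Java sketch of this mixed layout, assuming the JVM provides the GBK charset (GBK is a superset of GB2312; the class name is illustrative):

import java.nio.charset.Charset;

public class DbcsBytesDemo {
    public static void main(String[] args) {
        // "中A" encoded with GBK: the Chinese character becomes two bytes,
        // both above 127, while the ASCII letter remains a single byte.
        byte[] bytes = "中A".getBytes(Charset.forName("GBK"));
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF); // prints: D6 D0 41
        }
        System.out.println();
    }
}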

Other countries and regions likewise created their own encodings to suit their own languages. Everyone was happy at first, but then the problem appeared: computers in different countries need to communicate, and suddenly your program's resources show up as garbled text on my machine, while I cannot use your resources at all.

So ISO (the International Organization for Standardization) decided to tackle the problem. Their approach was very simple: scrap all the regional encoding schemes and create a new one that includes every letter and symbol of every culture on Earth. They called it the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as "UNICODE".

For the "half-width" characters that were already in ASCII (that is, the first 128 characters), UNICODE keeps the original code values but extends their length from 8 bits to 16 bits, while the characters of all other cultures and languages are re-encoded in a unified way. Since the "half-width" English symbols only need the low 8 bits, the high 8 bits are always 0, so saving plain English text with this scheme wastes twice as much space.

Characters, bytes, and strings

The key to understanding encodings is to understand the concepts of "character" and "byte" precisely. The two are easily confused, so let us distinguish them here:

  • Character: a mark used by people, a symbol in the abstract sense. Examples: '1', '中', 'a', '$', '¥', ...
  • Byte: the unit in which a computer stores data, an 8-bit binary number; a very concrete piece of storage. Examples: 0x01, 0x45, 0xFA, ...
  • ANSI string: in memory, if the "characters" exist in the form of ANSI-encoded bytes, where one character may be represented by one or more bytes, we call the string an ANSI string or multi-byte string. Example: "中文123" (7 bytes).
  • UNICODE string: in memory, if the "characters" exist as their UNICODE numbers, we call the string a UNICODE string or wide-character string. Example: L"中文123" (10 bytes).

Because different ANSI encoding standards have different rules, for a given multi-byte string we must know which encoding it uses before we can know which "characters" it contains. For a UNICODE string, by contrast, the "characters" it represents are always the same, regardless of the environment.
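
A minimal sketch of this point: the very same two bytes yield different "characters" depending on which encoding we assume (GBK stands in here for the Chinese ANSI encoding; the class name is illustrative):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SameBytesDifferentCharacters {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xD6, (byte) 0xD0};

        // Interpreted as a multi-byte (ANSI) string with the GBK encoding:
        System.out.println(new String(bytes, Charset.forName("GBK"))); // 中

        // Interpreted one byte per character (iso-8859-1):
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // ÖÐ
    }
}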

Character set and encoding

The ANSI encoding standards created by different countries and regions specify only the "characters" needed by their own languages. For example, the Chinese character standard GB2312 does not specify how Korean characters are stored. An ANSI encoding standard covers two things:

  1. Which characters are used, that is, which Chinese characters, letters and symbols are included in the standard. The set of included "characters" is called the "character set."
  2. A rule stating whether each "character" is stored in one byte or several bytes, and which byte values are used to store it. This rule is called the "encoding."

When countries and regions created their encoding standards, the "set of characters" and the "encoding" were generally defined at the same time. Therefore, what we usually call a "character set", such as GB2312, GBK or JIS, carries not only the meaning of "a collection of characters" but also that of an "encoding".

The "UNICODE character set" contains all the "characters" used by every language. There are many encoding standards for the UNICODE character set, such as UTF-8, UTF-7, UTF-16, UnicodeLittle, UnicodeBig, and so on.

Introduction to common encodings

Here we briefly introduce the rules of the common encodings, as preparation for the later sections. Based on the characteristics of their encoding rules, we divide all encodings into three categories:

  • Single-byte character encodings (ISO-8859-1): the simplest encoding rule: each byte is taken directly as the UNICODE number of one character. For example, converting the two bytes [0xD6, 0xD0] to a string with iso-8859-1 directly yields the two UNICODE characters [0x00D6, 0x00D0], i.e. "ÖÐ". Conversely, when a UNICODE string is converted to an iso-8859-1 byte string, only the characters 0 to 255 can be converted correctly.
  • ANSI encodings (GB2312, BIG5, Shift_JIS, ISO-8859-2, ...): when a UNICODE string is converted to a "byte string" with an ANSI encoding, one UNICODE character may become one or more bytes, according to the respective encoding. Conversely, when a byte string is converted back to a string, several bytes may become a single character. For example, the two bytes [0xD6, 0xD0] converted through GB2312 yield the single character [0x4E2D], i.e. '中'. Characteristics of the "ANSI encodings": 1. each "ANSI encoding standard" can only handle the UNICODE characters of its own language range; 2. the mapping between "UNICODE characters" and the "converted bytes" is defined by an arbitrary table.
  • UNICODE encodings (UTF-8, UTF-16, UnicodeBig, ...): as with the "ANSI encodings", when a string is converted to a "byte string" with a UNICODE encoding, one UNICODE character may become one or more bytes. Unlike the "ANSI encodings": 1. these "UNICODE encodings" can handle all UNICODE characters; 2. the mapping between a "UNICODE character" and its "converted bytes" can be obtained by calculation.

In practice we do not need to trace exactly which bytes a particular encoding turns a given character into; we only need to understand that an "encoding" is what converts "characters" into "bytes". For the "UNICODE encodings", because the bytes can be obtained by calculation, on special occasions we may want to find out how a particular "UNICODE encoding" actually works.
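
To make the "by calculation" point concrete: the UTF-8 bytes of a character can be computed directly from its UNICODE number, whereas an ANSI encoding such as GB2312/GBK needs a lookup table. A small sketch using the standard 3-byte UTF-8 pattern for characters in the range U+0800 to U+FFFF (the class name is illustrative):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByCalculation {
    public static void main(String[] args) {
        int cp = '中'; // UNICODE number 0x4E2D

        // 3-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx
        byte[] computed = {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
        };

        byte[] library = "中".getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(computed, library)); // true, both are E4 B8 AD
        // The GBK bytes of the same character, D6 D0, cannot be computed this way;
        // they come from a table defined by the standard.
    }
}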

Unicode and UTF

UNICODE uses two bytes to represent one character, so it can form 65,536 different characters in total, which is roughly enough to cover the symbols of every culture in the world. How UNICODE is transmitted over a network also has to be considered, so the many transmission-oriented UTF (UCS Transfer Format) standards appeared. As the name suggests, UTF-8 transmits data 8 bits at a time and UTF-16 transmits 16 bits at a time. For reliability during transmission, the mapping from UNICODE to UTF is not a direct copy but goes through certain algorithms and rules. Transmitting encoded characters over a network also involves the big-endian / little-endian problem.
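
A small sketch of both points, using Java's built-in charsets (U+4E2D is the number of '中'; Java's plain "UTF-16" charset writes a big-endian BOM, FE FF, when encoding; the class name is illustrative):

import java.nio.charset.StandardCharsets;

public class UtfAndEndiannessDemo {
    public static void main(String[] args) {
        print("中".getBytes(StandardCharsets.UTF_8));    // E4 B8 AD     (3 bytes)
        print("中".getBytes(StandardCharsets.UTF_16BE)); // 4E 2D        (big-endian)
        print("中".getBytes(StandardCharsets.UTF_16LE)); // 2D 4E        (little-endian)
        print("中".getBytes(StandardCharsets.UTF_16));   // FE FF 4E 2D  (BOM + big-endian)
    }

    // Prints a byte array as unsigned hexadecimal values.
    static void print(byte[] bytes) {
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}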

Characters and bytes in Java

The corresponding Java types and operations (a short usage sketch follows this list):

  • Character: char
  • Byte: byte
  • ANSI string: byte[]
  • UNICODE string: String
  • Byte string → string: string = new String(bytes, "encoding")
  • String → byte string: bytes = string.getBytes("encoding")
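
A minimal usage sketch of the two conversion rows above, assuming the GBK charset is available (the class name is illustrative):

import java.nio.charset.Charset;
import java.util.Arrays;

public class ConversionDemo {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");

        // byte string -> string
        byte[] bytes = {(byte) 0xD6, (byte) 0xD0};
        String s = new String(bytes, gbk);
        System.out.println(s); // 中

        // string -> byte string
        System.out.println(Arrays.toString(s.getBytes(gbk))); // [-42, -48], i.e. 0xD6, 0xD0
    }
}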

Encoding in Java

As mentioned above, the encoding Java uses is UNICODE. How can we "see" this encoding concretely? We can use Java's escape character \.

1. If we use the escape character \ directly to turn a number into a character, the number that follows must be octal.

Moreover, it can only express values that fit in one byte, i.e. 0 to 255, as follows:

  • Octal escape sequence: \ + octal digits; range '\000' to '\377' (decimal 0 to 255)
  • \0: the null character

Someone may ask: isn't Unicode two bytes per character, so why is one byte enough here? In fact Java converts it here into a two-byte Unicode character. Remember that it is Unicode: do not assume ASCII just because one byte suffices, as the following code shows:

System.out.println('\367'); // prints ÷
// octal 367 is 247 in decimal
System.out.println((int) '÷'); // prints the decimal value: 247
// number 247 corresponds to the character ÷ in both the Latin-1 (extended ASCII) table and Unicode

2. Unicode escape sequence: \u + 4 hexadecimal digits; the corresponding decimal range is 0 to 65535.

  • \u0000: the null character
  • \u0000 to \uFFFF: every character that appears on our computers falls within this range (see the sketch below)
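
A tiny sketch of the \u escape (U+4E2D is the UNICODE number of '中'; the class name is illustrative):

public class UnicodeEscapeDemo {
    public static void main(String[] args) {
        System.out.println('\u4e2d'); // 中
        System.out.println('\u4e2d' == '中'); // true
        System.out.println((int) '中'); // 20013, i.e. 0x4E2D
    }
}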

3. Special characters: there are just 3.

Because the double quote ", the single quote ' and the backslash \ all have special meanings in Java (double quotes enclose strings, single quotes enclose characters, and the backslash is the escape character), we put an escape character \ in front of them so that they stand for themselves:

  • \": double quote
  • \': single quote
  • \\: backslash

4. Control characters: there are 5.

The escape character \ followed by a fixed letter forms five sequences in Java, each representing a control operation:

  • \r: carriage return, moves to the far left of the current line.
  • \n: line feed, moves down one line without moving left or right.
  • \f: form feed, advances to a new page.
  • \t: horizontal tab.
  • \b: backspace.

On Linux, \n by itself serves as the line ending (carriage return + line feed).

On Windows, \r\n serves as the line ending (carriage return + line feed).
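
In Java the platform's line ending can be queried instead of hard-coding \n or \r\n; a minimal sketch (the class name is illustrative):

public class LineSeparatorDemo {
    public static void main(String[] args) {
        // "\n" on Linux/macOS, "\r\n" on Windows
        String sep = System.lineSeparator();
        for (char c : sep.toCharArray()) {
            System.out.printf("%02X ", (int) c); // 0A   or   0D 0A
        }
        System.out.println();
    }
}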

Half-width and full-width characters in Chinese

In the DBCS family of encodings, the digits, punctuation marks and letters that already existed in ASCII were all re-encoded with two-byte codes; these are the so-called "full-width" characters, while the original characters below number 127 are called "half-width" characters. So a letter encoded with one byte is "half-width" and the same letter encoded with two bytes is "full-width". The two forms even look different on screen.
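
The difference is visible from Java as well. A minimal sketch, assuming the JVM provides the GBK charset (Ａ below is the full-width letter, U+FF21; the class name is illustrative):

import java.nio.charset.Charset;

public class FullWidthDemo {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");

        System.out.println((int) 'A');  // 65     (half-width, U+0041)
        System.out.println((int) 'Ａ'); // 65313  (full-width, U+FF21)

        System.out.println("A".getBytes(gbk).length);  // 1 byte in GBK (half-width)
        System.out.println("Ａ".getBytes(gbk).length); // 2 bytes in GBK (full-width)
    }
}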

Misunderstandings, the causes of mojibake, and how to fix it

Common misunderstandings

Misunderstandings about encoding:

  • Misunderstanding 1: when converting a "byte string" into a "UNICODE string", for example when reading a text file or receiving text over the network, it is easy to simply treat the byte string as a single-byte string and convert it with the rule "one byte = one character". In fact, in a non-English environment the byte string should be treated as an ANSI string and decoded with the appropriate encoding to obtain the UNICODE string; it may take "several bytes" to produce "one character". Programmers who have always developed in an English-only environment easily fall into this misunderstanding.
  • Misunderstanding 2: in non-UNICODE environments such as DOS and Windows 98, strings exist as bytes in some ANSI encoding, and a string in byte form can only be used correctly if we know which encoding it is in. This gave us the habitual notion of "the encoding of a string". Once UNICODE is supported, a Java String stores the "numbers" of its characters, not "bytes in some encoding", so the concept of "the encoding of a string" no longer applies. Encoding only comes into play when converting between a "string" and a "byte string", or when treating a "byte string" as an ANSI string. Quite a few people hold this misunderstanding.

Here we can see that "Misunderstanding 1", i.e. converting with the rule "one byte = one character", is in fact equivalent to decoding with iso-8859-1. Therefore we often use bytes = string.getBytes("iso-8859-1") to reverse the operation and recover the original "byte string", and then use the correct ANSI encoding, e.g. string = new String(bytes, "GB2312"), to obtain the correct "UNICODE string".
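
Put together as a runnable sketch, assuming the GB2312 charset is available (the "wrong" one-byte-per-character decode is simulated deliberately with iso-8859-1; the class name is illustrative):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepairDemo {
    public static void main(String[] args) {
        byte[] original = {(byte) 0xD6, (byte) 0xD0}; // the GB2312 bytes of "中"

        // Mis-decode: every byte taken as one character (equivalent to iso-8859-1).
        String garbled = new String(original, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // ÖÐ

        // Repair: recover the original bytes, then decode with the correct encoding.
        byte[] recovered = garbled.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(recovered, Charset.forName("GB2312"))); // 中
    }
}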

Mojibake when a non-UNICODE program is moved between language environments

Strings in a non-UNICODE program exist in some ANSI encoding. If the language environment at run time differs from the one at development time, the ANSI strings will fail to display correctly.

For example, the UI of a non-UNICODE Japanese program developed in a Japanese environment will show garbled text when it is run in a Chinese environment. If the program's UI strings were stored as UNICODE instead, the Japanese text would display correctly when run in the Chinese environment.

For practical reasons we sometimes have to run non-UNICODE Japanese software on a Chinese operating system. In that case we can use tools such as NJStar (南极星) or AppLocale to temporarily simulate a different language environment.

Submitting strings from a web page

When a form on a page submits a string, the string is first converted to a byte string using the page's encoding, and then each byte is submitted to the web server in the form "%XX". For example, when a page encoded in GB2312 submits the string "中", the content sent to the server is "%D6%D0".

On the server side, the web server converts the received "%D6%D0" back to the two bytes [0xD6, 0xD0] and then, using the GB2312 encoding rules, obtains the character "中".

In a Tomcat server, when request.getParameter() returns garbled text it is usually because of "Misunderstanding 1" described above. By default, when "%D6%D0" is submitted to Tomcat, request.getParameter() returns the two UNICODE characters [0x00D6, 0x00D0] rather than the single character "中". We therefore need bytes = string.getBytes("iso-8859-1") to recover the original byte string, and then string = new String(bytes, "GB2312") to obtain the correct string "中".
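
A standalone simulation of that scenario (no servlet container involved; note that recent Tomcat versions use different defaults, so treat this only as an illustration of the repair step). The iso-8859-1 decode imitates the classic container default, and the GB2312 charset is assumed to be available:

import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FormParameterDemo {
    public static void main(String[] args) throws Exception {
        String submitted = "%D6%D0"; // what a GB2312 page submits for "中"

        // What a default iso-8859-1 decode would hand back: the two characters ÖÐ.
        String wrong = URLDecoder.decode(submitted, StandardCharsets.ISO_8859_1.name());
        System.out.println(wrong); // ÖÐ

        // Repair: back to the original bytes, then decode with GB2312.
        byte[] bytes = wrong.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(bytes, Charset.forName("GB2312"))); // 中
    }
}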

Reading strings from a database

When a string is read from a database server through a database client (e.g. ODBC or JDBC), the client needs to learn from the server which ANSI encoding is in use. When the database server sends a byte stream to the client, the client is responsible for decoding the byte stream into a UNICODE string using the correct encoding.

If the strings read from the database come out garbled while the data stored in the database is correct, the cause is again usually "Misunderstanding 1". The fix is the same: string = new String(string.getBytes("iso-8859-1"), "GB2312"), i.e. recover the original byte string and decode it again with the correct encoding.

Correcting a few mistaken ideas

Misconception: "Is ISO-8859-1 an international encoding?"

No. iso-8859-1 is merely the simplest of the single-byte character sets, namely the encoding in which the "byte value" equals the "UNICODE character number". When we need to convert a "byte string" into a "string" but do not yet know which ANSI encoding it is in, temporarily treating "each byte" as "one character" does not lose any information; we can later recover the original byte string with bytes = string.getBytes("iso-8859-1").
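
The "no information loss" claim can be verified directly; a minimal sketch (the class name is illustrative):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTripDemo {
    public static void main(String[] args) {
        byte[] original = {(byte) 0xD6, (byte) 0xD0, 0x31, 0x32, 0x33};

        // Decode with iso-8859-1 and encode back: the bytes survive unchanged.
        String temp = new String(original, StandardCharsets.ISO_8859_1);
        byte[] back = temp.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(original, back)); // true
    }
}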

Misconception: "In Java, how do I find out the internal encoding of a string?"

In Java, the string class java.lang.String handles UNICODE strings, not ANSI strings. We only need to regard a string as "a sequence of abstract symbols"; the question of a string's internal encoding therefore does not arise.

Examples

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingExamples {

    public static void main(String[] args) {
        // -------------------- Charset and StandardCharsets --------------------------- //
        System.out.println(Charset.defaultCharset()); // UTF-8 (platform dependent)
        System.out.println(Charset.availableCharsets()); // every charset this JVM supports
        System.out.println(Charset.isSupported("utf8")); // true
        System.out.println(StandardCharsets.UTF_8); // UTF-8

        System.out.println("中文123".getBytes(StandardCharsets.UTF_8).length); // 9
        System.out.println("中文123".getBytes(Charset.forName("gbk")).length); // 7

        // ---------------------- ÷ ------------------------- //
        int x = 0367; // octal notation
        char c = '\367'; // octal escape sequence
        System.out.println((char) x); // ÷
        System.out.println(c); // ÷

        // octal 367 is 247 in decimal
        System.out.println('÷'); // ÷
        System.out.println((int) '÷'); // 247
        System.out.println((char) 247); // ÷

        // ---------------------- 0 ------------------------- //
        System.out.println('\0'); // (null character, nothing visible)
        System.out.println((int) '\0'); // 0
        System.out.println((char) 0); // (nothing visible)

        // ----------------------------------------------- //
        // octal 367 is hexadecimal 00F7, so this prints ÷ as well
        System.out.println('\u00f7'); // ÷

        System.out.println(Integer.toBinaryString('÷')); // binary: 11110111
        System.out.println(Integer.toOctalString('÷')); // octal: 367
        System.out.println(Integer.toHexString('÷')); // hexadecimal: f7
        System.out.println(Integer.toString('÷')); // decimal: 247
        System.out.println(Integer.toString('÷', 4)); // base 4: 3313

        // -------------------- special characters --------------------------- //
        System.out.println('\"'); // "
        System.out.println('\''); // '
        System.out.println('\\'); // \

        // -------------------- control characters --------------------------- //
        System.out.println("aa\rbb"); // bb
        System.out.println("cc\r\ndd"); // cc (new line) dd
        System.out.println("12\r34\n56\f78\t90\b12"); // 34 (new line) 5678  912

        // ----------------------------------------------- //
        byte[] bytes = "中".getBytes(Charset.forName("gbk"));
        for (byte aByte : bytes) {
            // the byte is sign-extended to int, hence the leading "ffffff"
            System.out.println(Integer.toHexString(aByte));
        }
        // ffffffd6
        // ffffffd0

        byte bytes1 = 0xffffffd6; // the int constant -42, i.e. the bit pattern 0xD6
        byte bytes2 = 0xffffffd0; // the int constant -48, i.e. the bit pattern 0xD0
        byte[] bytes3 = new byte[]{bytes1, bytes2};
        System.out.println(new String(bytes3, Charset.forName("gbk"))); // 中
    }
}

References

Origin: www.cnblogs.com/huangwenjie/p/12372482.html