Java: data types and character encoding (ASCII, Unicode, and UTF-8)

Mechanical hard disk hardware architecture (for background): https://diy.pconline.com.cn/cpu/study_cpu/1009/2215404_all.html

 

One, data storage units

1. Bit

https://www.bilibili.com/video/av55918101

Computer data is stored on the hard disk. Take a mechanical hard disk as an example: it is made of magnetic material, and each tiny magnet in it has two poles, N and S, which can represent two states. These can be read as 0 and 1. This is the smallest storage unit of a computer, called a bit.

2. Byte

A disk contains many such tiny magnets, so it can represent many 0s and 1s. A sequence of 0s and 1s is exactly a binary number.

A binary number on its own has little meaning. In the 1960s, the United States developed a character encoding that standardized the mapping between English characters and binary numbers. This is the ASCII code, which is still in use today.

ASCII defines codes for 128 characters in total. For example, the space (SPACE) is 32 (00100000) and the capital letter A is 65 (01000001). These 128 symbols (including 33 control characters that cannot be printed: codes 0 to 31 and 127) occupy only the low 7 bits of a byte; the leading bit is always 0.

In ASCII, 1 byte = 8 bits, and a byte equal to 8 bits later became the default convention.
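
As a small illustration of the ASCII values above, the following sketch prints the decimal code and the 8-bit pattern for the space and the capital letter A. The class name `AsciiDemo` and the helper `toBinary8` are hypothetical names chosen for this example.

```java
class AsciiDemo {
    // Hypothetical helper: formats a character code as an 8-bit binary string.
    // Intended for codes below 256; larger codes are returned unpadded.
    static String toBinary8(char c) {
        String bits = Integer.toBinaryString(c);
        // Left-pad with zeros to a full byte.
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println((int) ' ' + " " + toBinary8(' ')); // prints "32 00100000"
        System.out.println((int) 'A' + " " + toBinary8('A')); // prints "65 01000001"
    }
}
```

In both cases the leading bit is 0, matching the rule that ASCII only uses the low 7 bits.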

 

Two, encoding

1. Character

A character is not a storage unit but a symbol of a written language. This raises the question of code tables and encodings.

For English, the 128 symbols of the ASCII code table are enough, but 128 symbols are not enough to represent other languages.

The common code table for Simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.

2. Unicode code table

There are many countries in the world and many different code tables, so the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know which code table it uses; if it is interpreted with the wrong code table, the text comes out garbled.

Imagine a single code table that includes every symbol in the world, with each symbol given a unique code. The garbled-text problem would then disappear. This is Unicode, a code table for all symbols.
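
The garbling described above can be reproduced in a few lines. This is a minimal sketch, assuming the text was written as UTF-8 and then read back as ISO-8859-1; the class name `GarbledDemo` and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class GarbledDemo {
    // Encodes the text as UTF-8, then decodes the bytes with the wrong code table.
    static String decodeWrong(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encodes and decodes with the same code table, so the text survives.
    static String decodeRight(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "café", written with an escape to avoid source-encoding issues
        System.out.println(decodeRight(original).equals(original)); // prints "true"
        System.out.println(decodeWrong(original).equals(original)); // prints "false"
        System.out.println(decodeWrong(original).length());         // prints "5"
    }
}
```

The two bytes that UTF-8 uses for the last letter are read as two separate ISO-8859-1 characters, which is why the wrongly decoded string is one character longer.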

Unicode can now accommodate more than one million symbols, and each symbol has a different code. For example, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 (yan). The full symbol table can be looked up on Unicode.org or in a dedicated Chinese character table.

A traditional code table corresponds to a single encoding method: ASCII uses 8 bits per character, and GB2312 uses 16 bits per character. With Unicode, characters near the front of the table have small binary values, and characters further back have large ones.

If every character were encoded with the maximum number of bits, the characters near the front would be padded with many leading zeros, wasting a lot of space. So once the Unicode code table is adopted, how many bits the computer should use to encode each character becomes a problem.
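
The code points mentioned above can be checked from Java with `String.codePointAt`. A minimal sketch follows; the class name `CodePointDemo` and the helper `toUPlus` are hypothetical.

```java
class CodePointDemo {
    // Formats the first code point of a string in the usual U+XXXX notation.
    static String toUPlus(String s) {
        int codePoint = s.codePointAt(0);
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus("A"));      // prints "U+0041"
        System.out.println(toUPlus("\u4E25")); // prints "U+4E25"
    }
}
```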

3. UTF-8 encoding (defined in RFC 3629)

First, UTF-8 is one implementation of Unicode. Other implementations include UTF-16 (two or four bytes per character; its fixed two-byte predecessor is UCS-2) and UTF-32 (four bytes per character, also known as UCS-4).

The biggest feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, and the byte length varies with the symbol.

The UTF-8 encoding rules are simple; there are only two:

  • For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. Therefore, for English letters, UTF-8 and ASCII are identical.
  • For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code.

The following table summarizes the encoding rules; the letter x marks the bits available for encoding.
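
To see the variable length in practice, the sketch below counts the bytes that Java produces for a few symbols in UTF-8 and, for comparison, in UTF-16 (big-endian, without a byte order mark). The class name `LengthDemo` and its helpers are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class LengthDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // UTF_16BE is used so that no byte order mark is added to the output.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String latin = "A";                                      // U+0041
        String accented = "\u00E9";                              // U+00E9
        String han = "\u4E25";                                   // U+4E25
        String emoji = new String(Character.toChars(0x1F600));   // U+1F600

        System.out.println(utf8Length(latin) + " " + utf16Length(latin));       // prints "1 2"
        System.out.println(utf8Length(accented) + " " + utf16Length(accented)); // prints "2 2"
        System.out.println(utf8Length(han) + " " + utf16Length(han));           // prints "3 2"
        System.out.println(utf8Length(emoji) + " " + utf16Length(emoji));       // prints "4 4"
    }
}
```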

Unicode symbol range | UTF-8 encoding
(hex)                | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
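
The table translates almost line by line into bit operations. The following is a sketch for illustration, not a replacement for the JDK's own encoder: the class `Utf8Encoder` and its `encode` method are hypothetical names, and the sketch does not reject the surrogate range D800-DFFF as a strict encoder would.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Encoder {
    // Encodes one Unicode code point following the four rows of the table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {          // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x4E25, 0x1F600};
        for (int cp : samples) {
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            // Compare the hand-written encoder with the JDK's UTF-8 encoder.
            System.out.println(Arrays.equals(encode(cp), expected)); // prints "true" each time
        }
    }
}
```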

Take the Chinese character 严 (yan) as an example:

  • The Unicode code of 严 is 4E25 (100111000100101).
  • According to the table, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx.
  • Starting from the last binary digit of 严, fill the x positions of the format from back to front, and pad the leftover positions with 0. This gives the UTF-8 encoding of 严 as 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
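
The result of this worked example can be checked against the JDK's own UTF-8 encoder. A minimal sketch; the class name `YanDemo` and its helpers are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class YanDemo {
    // Formats bytes as upper-case hexadecimal, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf8Hex(String s) {
        return toHex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u4E25")); // prints "E4B8A5"
    }
}
```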

 

Three, Java basic data types

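About boolean storage

The Java virtual machine maps boolean to int, using 1 to represent true and 0 to represent false. 
A boolean therefore carries 1 bit of information, but it does not occupy just 1 bit of storage: per the JVM specification, boolean values are handled as int, and in Oracle's implementation boolean arrays are stored as byte arrays with 8 bits per element. https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-2.html#jvms-2.3.4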

About floating-point range

https://blog.csdn.net/shichimiyasatone/article/details/85276316

About autoboxing and unboxing

https://www.cnblogs.com/dolphin0520/p/3780005.html

About type conversions

  • When operands of different precision take part in a calculation, Java converts the lower-precision operand to the higher-precision type and then performs the calculation. The result has the higher-precision type.
  • When converting a higher-precision value to a lower-precision type, check whether the value can be represented in the lower-precision type.
  • Low precision to high precision: automatic type conversion ((byte, short, char) - int - long - float - double)
  • High precision to low precision: explicit cast
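
The conversion rules above can be sketched as follows. The class name `ConversionDemo` and the helper `fitsInInt` are hypothetical; the helper shows one way to do the range check before narrowing a long to an int.

```java
class ConversionDemo {
    // Checks whether a long value can be represented as an int without loss.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int i = 3;
        double d = 0.5;
        double sum = i + d;            // i is widened to double automatically
        System.out.println(sum);       // prints "3.5"

        long big = 3_000_000_000L;
        System.out.println(fitsInInt(big)); // prints "false"
        int truncated = (int) big;     // explicit cast keeps only the low 32 bits
        System.out.println(truncated); // prints "-1294967296"

        byte b = (byte) 200;           // 200 is outside the byte range -128..127
        System.out.println(b);         // prints "-56"
    }
}
```

The cast compiles in every case; it is the range check that tells you whether the narrowed value still means the same thing.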

 


https://docs.oracle.com/javase/specs/jls/se8/html/jls-4.html#jls-4.2.1

https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html

http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

http://www.ruanyifeng.com/blog/2014/12/unicode.html


Origin www.cnblogs.com/jhxxb/p/11154925.html