Bit, byte, ASCII, Unicode, UTF-32, UTF-8: telling them apart
"1 byte is 8 bits, a letter is 1 byte, a Chinese character is 2 bytes"
- Many people have heard or seen the sentence above
- But most people do not know why it holds (me included, until two days ago)
- Parts of this sentence are not correct
bit
- Chinese: 比特
- 0 or 1
- The smallest unit in a computer
byte
- Chinese: 字节
- 1 byte == 8 bits (why not 2 or 4? This is answered later)
Aside: bit and byte are nicely chosen names
ASCII code
- A standard developed in the United States so that computers can recognize, store, and read characters
- It defines characters such as the digit '7', the letter 'a', and control characters for operations (such as delete and enter)
- 128 characters in total, each with a unique code (see ascii.pdf)
- The smallest unit in a computer is the bit, so bits are used to store characters
- So how many bits are needed to store one character?
  - 2 bits can hold up to 4 values: 00 01 10 11
  - 4 bits can hold up to 16 values: 0000 0001 0010 0011 ...... 1111
  - 7 bits can hold up to 128 values: 0000000 0000001 0000010 0000011 ...... 1111111
  - 8 bits can hold up to 256 values: 00000000 00000001 00000010 00000011 ...... 11111111
- In fact, 7 bits are enough for 128 characters, but 8 bits ended up being used because the eighth bit served as a parity bit (the parity bit is no longer used for this)
- The byte was originally the unit used to represent one character (a char); the convention that finally became popular is 1 byte = 8 bits
- Exception: the GSM 7-bit default encoding
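The counting argument above can be checked with a short Python sketch: n bits give 2**n distinct values, so 7 bits cover all 128 ASCII codes. The helper names `values_for_bits` and `with_even_parity` are made up for this illustration, and even parity is only one possible convention, assumed here because the text does not say which one was used.

```python
def values_for_bits(n: int) -> int:
    """Number of distinct values that n bits can represent."""
    return 2 ** n


def with_even_parity(code: int) -> int:
    """Place an even-parity bit in the eighth (top) bit of a 7-bit code.

    Assumption for illustration: even parity, so the total number of
    1 bits in the resulting byte is always even.
    """
    ones = bin(code).count("1")
    parity = ones % 2              # 1 when the 7-bit code has an odd number of 1s
    return (parity << 7) | code


for n in (2, 4, 7, 8):
    print(n, "bits ->", values_for_bits(n), "values")

# ASCII codes of the characters mentioned above, shown as 7-bit binary,
# followed by the 8-bit form with the parity bit added.
for ch in ("7", "a"):
    code = ord(ch)
    print(ch, code, format(code, "07b"), format(with_even_parity(code), "08b"))
```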
GB2312, GBK, GB18030, and other encodings
- If the whole world used only English, ASCII would be enough
- But China, Japan, and Korea also want computers to display text in their own languages
- So China developed GB2312, Japan developed Shift_JIS, and Korea developed EUC_KR
- Each country developed its own encoding, usually using 2 to 4 bytes to represent one character
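As a rough illustration of the point above, the sketch below encodes the same character with Python's built-in `gb2312`, `shift_jis`, and `euc_kr` codecs. The sample character "中" is an assumption chosen only because it exists in all three standards. Each encoding produces different bytes, so bytes written under one standard turn into garbage when read under another.

```python
text = "中"  # sample character, chosen because all three standards contain it
encodings = ("gb2312", "shift_jis", "euc_kr")

for name in encodings:
    data = text.encode(name)
    print(name, data.hex(), len(data), "bytes")

# ASCII characters keep their single-byte ASCII value in each encoding.
ascii_compatible = all("A".encode(name) == b"A" for name in encodings)
print("ASCII compatible:", ascii_compatible)

# Reading GB2312 bytes as if they were Shift_JIS gives unrelated characters.
garbled = text.encode("gb2312").decode("shift_jis", errors="replace")
print(repr(text), "->", repr(garbled))
```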
Unicode character set
- Because every country's encoding is different, text easily ends up garbled
- So the countries share a single table that assigns a number to every character (the formal term is code point)
- This is how Unicode was born (Chinese names: 统一码, 万国码, 单一码)
- Unicode provides 1,114,112 code points, enough to hold all the text in the world
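- 1,114,112 in binary is 10001 00000000 00000000, so 21 bits are enough to number every code point
- Unicode® 13.0.0, the latest version at the time of writing, was released on March 10, 2020 (see Unicode®13.0.0.pdf)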
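The numbers above can be verified in Python, where `ord` maps a character to its code point and `chr` maps back. This is only a small check, not part of the original notes; the constant name `MAX_CODE_POINT` is made up for the example, and the sample characters are 'z' (122) and '邹' (37049), the two code points used later in this article.

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # 1,114,111, the last Unicode code point

# Python's own upper limit for code points matches the Unicode limit.
print(sys.maxunicode == MAX_CODE_POINT)

print(MAX_CODE_POINT + 1)            # total number of code points
print(MAX_CODE_POINT.bit_length())   # bits needed for the largest code point

# Character -> code point, shown in decimal, hexadecimal, and binary.
for ch in ("z", "邹"):
    cp = ord(ch)
    print(ch, cp, hex(cp), format(cp, "b"))

# Code point -> character.
print(chr(122), chr(37049))
```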
UTF-32 (or UCS-4) encoding
- Every character is stored, read, and transferred as 4 bytes (32 bits)
- Advantages
  - The number of characters can be obtained quickly
  - The data of any given character can be located quickly
- Disadvantages
  - Data that would fit in 1 byte (8 bits) needs 4 bytes (32 bits), which wastes space
  - Storing, reading, and transferring are slower
UTF-8 encoding
- Uses 1 to 4 bytes to represent one character
- UTF-8 divides the Unicode code point range into four intervals
Interval | Decimal | Binary |
---|---|---|
1 | 0-127 | 00000000 to 01111111 |
2 | 128-2047 | 00000000 10000000 to 00000111 11111111 |
3 | 2048-65535 | 00001000 00000000 to 11111111 11111111 |
4 | 65536-1114111 | 00000001 00000000 00000000 to 00010000 11111111 11111111 |
- The four intervals correspond to four formulas
Formula |
---|
0xxxxxxx |
110xxxxx 10xxxxxx |
1110xxxx 10xxxxxx 10xxxxxx |
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
- Example
Character | Decimal | Binary | Interval |
---|---|---|---|
z | 122 | 01111010 | 1 |
邹 (Zou) | 37049 | 10010000 10111001 | 3 |
- The character 'z' falls in the first interval, so the first formula applies
- Fill the binary of 'z', 01111010, from right to left into 0xxxxxxx
- The resulting UTF-8 code of 'z' is 01111010
- The character '邹' (Zou) falls in the third interval, so the third formula applies
- Fill the binary of '邹', 10010000 10111001, from right to left into 1110xxxx 10xxxxxx 10xxxxxx
- The resulting UTF-8 code of '邹' is 11101001 10000010 10111001
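The fill-in procedure above can be sketched as a hand-written encoder and compared against Python's built-in `str.encode("utf-8")`. The function name `utf8_encode` is made up for this illustration, and the sketch deliberately ignores details the notes do not cover, such as rejecting the surrogate range.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling its bits into the matching formula."""
    if code_point < 0x80:        # interval 1: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # interval 2: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x10000:     # interval 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0x10FFFF:   # interval 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0b111111),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("code point is outside the Unicode range")


for ch in ("z", "邹"):
    encoded = utf8_encode(ord(ch))
    bits = " ".join(format(b, "08b") for b in encoded)
    # The last column confirms agreement with Python's own encoder.
    print(ch, bits, encoded == ch.encode("utf-8"))
```

Taking the low 6 bits first and shifting right is the code equivalent of filling the formula from right to left.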
- Advantages
  - Compatible with ASCII
  - Characters below 128 need only 1 byte (8 bits) to store, read, and transfer
  - The leading bits of a character's first byte show how many bytes that character occupies
- Disadvantages
  - The number of characters in the data cannot be obtained directly
  - The position of a given character in the binary data cannot be located directly
Summary
- The claim that a Chinese character is 2 bytes is wrong under the UTF-8 encoding rules
- These are my personal views
- Please point out any errors or omissions
Github:https://github.com/QiangZou