Bit, byte, ASCII, Unicode, UTF-32, UTF-8: telling them apart


"1 byte is 8 bits, a letter is 1 byte, a Chinese character is 2 bytes"

  • Many people have heard or seen the sentence above
  • But most people do not know why it is so (myself included, until two days ago)
  • Some of the claims in this sentence are not correct

bit

  • Chinese: 比特 (bit)
  • Its value is 0 or 1
  • The smallest unit of data in a computer

byte

  • Chinese: 字节 (byte)
  • 1 byte == 8 bits (why 8 and not 2 or 4 is answered further down)

An aside: "bit" and "byte" are nicely chosen names

ASCII code

  • A character standard developed in the United States so that computers can recognize, store, and read characters
  • Characters defined: digits such as '7', letters such as 'a', and control characters (for example, delete and enter)
  • 128 characters in total, each with a unique code (see ascii.pdf)
  • The smallest unit in a computer is the bit, so characters are stored as bits
  • So how many bits are needed to store one character?
    • 2 bits can hold up to 4 values: 00 01 10 11
    • 4 bits can hold up to 16 values: 0000 0001 0010 0011 ...... 1111
    • 7 bits can hold up to 128 values: 0000000 0000001 0000010 0000011 ...... 1111111
    • 8 bits can hold up to 256 values: 00000000 00000001 00000010 00000011 ...... 11111111
  • In fact 7 bits are enough for 128 characters, but 8 bits ended up being used because the eighth bit served as a parity bit (the parity bit is no longer needed today)
  • The byte was originally the unit used to represent one character (char); 8 bits per byte eventually became the prevailing convention
  • Exception: the GSM default alphabet uses a 7-bit encoding
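
The bit-counting argument above can be checked with a short sketch. The helper names `values_for_bits` and `with_even_parity` are made up for illustration, and even parity is an assumption (the text only says the eighth bit was a parity bit, not which kind).

```python
def values_for_bits(n: int) -> int:
    """Number of distinct values that n bits can represent."""
    return 2 ** n


for n in (2, 4, 7, 8):
    print(n, "bits ->", values_for_bits(n), "values")

# Every ASCII code fits in 7 bits.
code = ord("a")                # numeric ASCII code of 'a'
bits7 = format(code, "07b")    # the same code as a 7-digit binary string


def with_even_parity(code: int) -> int:
    """Place an even-parity bit (an assumed scheme) in the eighth bit.

    The parity bit is set so that the total number of 1 bits is even.
    """
    parity = bin(code).count("1") % 2
    return (parity << 7) | code


print(bits7, format(with_even_parity(code), "08b"))
```

Seven bits give exactly the 128 values ASCII needs; the eighth bit is what was free to carry parity.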

GB2312, GBK, GB18030, and other national encodings

  • If the whole world used only English, ASCII would be enough
  • But China, Japan, and Korea also want computers to display text in their own languages
  • So China developed GB2312, Japan developed Shift_JIS, and Korea developed EUC_KR
  • Each country developed its own encoding, usually using 2 to 4 bytes to represent one character
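
As a rough illustration of how these national encodings disagree, the sketch below encodes one character with Python's built-in `gb2312`, `shift_jis`, and `euc_kr` codecs. The sample character 中 is my own choice, picked only because all three character sets contain it.

```python
text = "中"  # sample character, chosen for illustration

# The same character becomes different bytes in each national encoding.
encoded = {name: text.encode(name) for name in ("gb2312", "shift_jis", "euc_kr")}
for name, raw in encoded.items():
    print(name, raw.hex())

# Reading GB2312 bytes with the wrong table yields different characters.
garbled = encoded["gb2312"].decode("shift_jis", errors="replace")
print(repr(garbled))
```

Decoding bytes with a table other than the one that produced them is how garbled text arises.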

Unicode character set

  • Every country's encoding is different, which easily leads to garbled text
  • So why not have all countries share one table that assigns a number to every character (formally called a code point)
  • Thus Unicode was born (also called the unified code, universal code, or single code)
  • Unicode reserves 1,114,112 code points to hold all the world's text
  • The code points run from 0 to 1,114,111 (hexadecimal 10FFFF, binary 1 0000 1111 1111 1111 1111), so a code point needs at most 21 bits
  • Unicode 13.0.0 was released on March 10, 2020 and was the latest version at the time of writing (see Unicode®13.0.0.pdf)
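
Code points can be inspected directly in Python with the built-in `ord` and `chr`. This is only a small sketch; the sample characters are 'a', '7', and 邹 (the character with code point 37049 used as an example later in this post).

```python
# ord() maps a character to its Unicode code point; chr() goes the other way.
for ch in ("a", "7", "邹"):
    cp = ord(ch)
    print(ch, cp, hex(cp), bin(cp))

max_cp = 0x10FFFF            # largest Unicode code point
total = max_cp + 1           # number of code points, counting 0
print(total, "code points;", max_cp.bit_length(), "bits for the largest one")
```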

UTF-32 (or UCS-4) encoding

  • Specifies that every character is stored, read, and transferred as 4 bytes (32 bits)
  • Advantages
    • The number of characters can be obtained quickly (byte length divided by 4)
    • The character at a given position can be accessed quickly
  • Disadvantages
    • Data that would fit in 1 byte (8 bits) needs 4 bytes (32 bits), which wastes space
    • Slower to store, read, and transfer
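
The trade-off above can be sketched with Python's `utf-32-be` codec (big-endian, without a byte order mark). The helper `char_at` is a hypothetical name introduced here for illustration.

```python
text = "z邹"
data = text.encode("utf-32-be")   # every character takes exactly 4 bytes

# Counting characters is a single division.
char_count = len(data) // 4


def char_at(data: bytes, index: int) -> str:
    """Return the character at `index` by slicing its fixed 4-byte cell."""
    return data[index * 4:(index + 1) * 4].decode("utf-32-be")


print(char_count, char_at(data, 1))

# The cost: 'z' fits in one byte as ASCII but takes four here.
print(len("z".encode("ascii")), len("z".encode("utf-32-be")))
```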

UTF-8 encoding

  • Uses 1 to 4 bytes to represent one character

  • UTF-8 divides the Unicode code points into four ranges

Decimal        Binary
0-127          (00000000 00000000 00000000 00000000)-(00000000 00000000 00000000 01111111)
128-2047       (00000000 00000000 00000000 10000000)-(00000000 00000000 00000111 11111111)
2048-65535     (00000000 00000000 00001000 00000000)-(00000000 00000000 11111111 11111111)
65536-1114111  (00000000 00000001 00000000 00000000)-(00000000 00010000 11111111 11111111)
  • The four ranges correspond to four formulas
Formula
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  • For example
Character  Decimal  Binary             Range
z          122      01111010           1
邹 (Zou)   37049    10010000 10111001  3
  • The character 'z' falls in the first range, so the first formula applies
    • Fill the binary of 'z', 1111010, into 0xxxxxxx from right to left
    • The resulting UTF-8 encoding of 'z' is 01111010
  • The character '邹' falls in the third range, so the third formula applies
    • Fill the binary of '邹', 10010000 10111001, into 1110xxxx 10xxxxxx 10xxxxxx from right to left
    • The resulting UTF-8 encoding of '邹' is 11101001 10000010 10111001
  • Advantages
    • Compatible with ASCII
    • Characters with code points below 128 need only one byte (8 bits) to store, read, and transfer
    • The number of leading 1 bits in a character's first byte tells how many bytes the character occupies
  • Disadvantages
    • The number of characters in the data cannot be obtained directly
    • The position of a given character in the binary data cannot be located directly
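
The four formulas and the leading-bit rule can be written out as a minimal sketch. The function names are hypothetical, and the encoder is simplified: it does not reject the surrogate range 0xD800-0xDFFF as a strict encoder would. The examples are 'z' (122) and 邹 (37049, the 'Zou' character discussed above).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling the formula for its range."""
    if cp < 0x80:                      # 0-127: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 128-2047: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:                   # 2048-65535: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0x10FFFF:                 # 65536-1114111: 11110xxx plus three 10xxxxxx
        return bytes([0b11110000 | (cp >> 18),
                      0b10000000 | ((cp >> 12) & 0b111111),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    raise ValueError("code point out of range")


def sequence_length(first_byte: int) -> int:
    """Read a character's byte count from the leading bits of its first byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not the first byte of a character")


def count_chars(data: bytes) -> int:
    """Count characters by hopping from first byte to first byte."""
    count, i = 0, 0
    while i < len(data):
        i += sequence_length(data[i])
        count += 1
    return count


for ch in ("z", "邹"):
    raw = utf8_encode(ord(ch))
    print(ch, " ".join(format(b, "08b") for b in raw))
print(count_chars("z邹".encode("utf-8")))
```

`count_chars` has to walk the data from the start, which is the disadvantage listed above: neither the character count nor the position of the n-th character is available without scanning.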

Summary

  • "A Chinese character is 2 bytes" is wrong under the UTF-8 encoding rules (for example, '邹' takes 3 bytes)
  • These are my personal views
  • Corrections for any errors or omissions are welcome

GitHub: https://github.com/QiangZou

Blog: https://www.cnblogs.com/zouqiang/

Origin www.cnblogs.com/zouqiang/p/12628602.html