Go language basic types: the character type (repost)

https://blog.csdn.net/FHD994603831/article/details/92435724


The character type
Go has no dedicated character type; to store a single character (letter), a byte is generally used.
A traditional string is a fixed-length sequence of joined characters. Go strings differ in one respect: a traditional string is composed of characters, while a Go string is composed of bytes.
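A minimal sketch of this distinction (the variable names are illustrative):

```go
package main

import "fmt"

func main() {
	// A single ASCII character is typically stored in a byte.
	var c byte = 'A'
	fmt.Println(c)        // prints the code value: 65
	fmt.Printf("%c\n", c) // prints the character: A

	// A Go string is a sequence of bytes, not of characters:
	// indexing yields the individual bytes.
	s := "ab"
	fmt.Println(s[0], s[1]) // 97 98
}
```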

Unicode
Long ago the world was simpler; at least the computer world had only one character set, ASCII: the American Standard Code for Information Interchange. ASCII, or more precisely US-ASCII, uses 7 bits to represent 128 characters: the upper- and lowercase English letters, digits, punctuation marks, and device control characters. For early computer programs this was sufficient, but it meant that many users in other parts of the world could not use their own writing systems directly. With the growth of the Internet, data mixing many languages has become very common (translator's note: for example, an English text or its Chinese translation may contain ASCII, Chinese, Japanese, and characters from other languages). How can such rich and varied multilingual text data be handled effectively?

The answer is Unicode (http://unicode.org), which collects all of the world's symbol systems, including accents and other diacritical marks, tabs and carriage returns, and many more obscure symbols. Each symbol is assigned a unique Unicode code point; a Unicode code point corresponds to the Go type rune (note: rune is an alias for int32).

Version 8 of the Unicode standard collects more than 120,000 characters, covering over 100 languages. How do computer programs and data represent them? The general-purpose data type for a single Unicode code point is int32, which in Go corresponds to the rune type; rune is its synonym (the word rune means a runic character).
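A short illustration that rune is an alias for int32 (the names are illustrative):

```go
package main

import "fmt"

func main() {
	var r rune = '世'  // a rune holds a Unicode code point
	var i int32 = r    // rune is an alias for int32, so this assignment needs no conversion
	fmt.Println(r, i)  // 19990 19990

	// %U prints the code point in U+hhhh notation.
	fmt.Printf("%c %U\n", r, r) // 世 U+4E16
}
```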

We could represent a sequence of runes as a sequence of int32 values. In this encoding, called UTF-32 or UCS-4, every Unicode code point uses the same 32-bit size. It is simple and uniform, but it wastes a great deal of storage, because most computer-readable text is ASCII, where each character needs only 8 bits, i.e. one byte. Even the characters in common use number far fewer than 65,536, so commonly used characters could be expressed in a 16-bit encoding. Is there a better approach?

UTF-8
UTF-8 is a variable-length encoding that encodes Unicode code points as sequences of bytes. UTF-8 was co-invented by Ken Thompson and Rob Pike, the fathers of the Go language, and is now a Unicode standard. UTF-8 uses 1 to 4 bytes to represent each Unicode code point: ASCII characters need only 1 byte, and most common characters use 2 or 3 bytes. The high-order bits of the first byte of each encoded symbol indicate how many bytes the encoding occupies in total. If the high bit of the first byte is 0, it is a 7-bit ASCII character that still occupies a single byte, so the encoding is compatible with traditional ASCII. If the high bits of the first byte are 110, the character requires 2 bytes, and each subsequent byte begins with the bits 10. Larger Unicode code points are handled with a similar scheme.

0xxxxxxx                             runes 0-127     (ASCII)
110xxxxx 10xxxxxx                    128-2047        (values <128 unused)
1110xxxx 10xxxxxx 10xxxxxx           2048-65535      (values <2048 unused)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  65536-0x10ffff  (other values unused)
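The bit patterns in the table can be exercised by hand. Below is a sketch that encodes a code point following the table and compares the result with the standard library (encodeUTF8 is a hypothetical helper name for this article, not a library function):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeUTF8 encodes a code point following the bit patterns in the table above.
func encodeUTF8(r rune) []byte {
	switch {
	case r < 0x80: // 1 byte: 0xxxxxxx
		return []byte{byte(r)}
	case r < 0x800: // 2 bytes: 110xxxxx 10xxxxxx
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r&0x3F)}
	case r < 0x10000: // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{0xE0 | byte(r>>12), 0x80 | byte(r>>6&0x3F), 0x80 | byte(r&0x3F)}
	default: // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{0xF0 | byte(r>>18), 0x80 | byte(r>>12&0x3F),
			0x80 | byte(r>>6&0x3F), 0x80 | byte(r&0x3F)}
	}
}

func main() {
	for _, r := range []rune{'A', 'é', '世', '😀'} {
		buf := make([]byte, 4)
		n := utf8.EncodeRune(buf, r)
		fmt.Printf("%U: %x (stdlib: %x)\n", r, encodeUTF8(r), buf[:n])
	}
}
```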
A variable-length encoding means we cannot directly access the n-th character of a string by index, but UTF-8 brings many compensating advantages. First, the encoding is compact and fully compatible with ASCII. It is also self-synchronizing: by backing up at most three bytes we can always find the starting byte of the current character's encoding. It is a prefix code as well, so it can be decoded from left to right without ambiguity and without lookahead (translator's note: encodings such as GBK can be ambiguous if the starting position is unknown). No character's encoding is a substring of another character's encoding, or of any other sequence, so we can search for a character simply by searching for its bytes, without worrying about interference from the surrounding context. The byte order of UTF-8 matches the order of the corresponding Unicode code points, so UTF-8 byte sequences can be sorted directly. And since UTF-8 contains no embedded NUL (zero) bytes, it works well with programming languages that use NUL to terminate strings.

Go source files use UTF-8 encoding, and UTF-8 is also the text encoding that Go handles best. The unicode package provides many functions for working with runes (such as distinguishing letters from digits, or converting between upper and lower case), and the unicode/utf8 package provides functions for encoding and decoding runes as UTF-8 byte sequences.

Many Unicode characters are hard to type directly on a keyboard, many have a similar appearance to others, and some are even invisible (note: Chinese and Japanese have many characters that look alike but differ). Go string literals therefore support Unicode escapes, which let us specify special characters by their Unicode code points. There are two forms: \uhhhh for a 16-bit code point value and \Uhhhhhhhh for a 32-bit value, where each h is a hexadecimal digit; the 32-bit form is rarely needed. Each form denotes the UTF-8 encoding of the given code point. For example, the following string literals all denote the same value:

"世界"
"\xe4\xb8\x96\xe7\x95\x8c"
"\u4e16\u754c"
"\U00004e16\U0000754c"
The three escape sequences above provide alternative notations for the first string, but all four literals have the same value.

Unicode escapes can also be used in rune literals. The following three literals are equivalent:

'世' '\u4e16' '\U00004e16'
A code point with a value below 256 may be written as a single hexadecimal byte escape, e.g. \x41 for the character 'A', but larger code points must use the \u or \U escape forms. Consequently, '\xe4\xb8\x96' is not a legal rune literal, even though those three bytes are a valid UTF-8 encoding of a valid code point.

Thanks to the excellent design of UTF-8, many string operations need no decoding at all. We can test whether one string is a prefix of another without decoding:

func HasPrefix(s, prefix string) bool {
	return len(s) >= len(prefix) && s[:len(prefix)] == prefix
}
Or test for a suffix:

func HasSuffix(s, suffix string) bool {
	return len(s) >= len(suffix) && s[len(s)-len(suffix):] == suffix
}
Or test whether a substring is contained:

func Contains(s, substr string) bool {
	for i := 0; i < len(s); i++ {
		if HasPrefix(s[i:], substr) {
			return true
		}
	}
	return false
}
The processing logic is the same whether we treat the text as raw bytes or as UTF-8-encoded text; this is not true of many other encodings. (The functions above come from the strings package; the real implementation of strings.Contains includes a hashing-based optimization.)

On the other hand, if we really care about individual Unicode characters, we must use other mechanisms. Consider the string from the first example above, a mixture of Chinese and Western characters. Figure 3.5 shows its memory representation. The string contains 13 bytes in UTF-8 form, but interpreted as Unicode it contains only 9 characters:

import "unicode/utf8"

s := "Hello, 世界"
fmt.Println(len(s))                    // "13"
fmt.Println(utf8.RuneCountInString(s)) // "9"
To process the actual characters we need a UTF-8 decoder. The unicode/utf8 package provides one that we can use like this:

for i := 0; i < len(s); {
	r, size := utf8.DecodeRuneInString(s[i:])
	fmt.Printf("%d\t%c\n", i, r)
	i += size
}
Each call to DecodeRuneInString returns a rune r and a length: r is the character itself, and the length is the number of bytes its UTF-8 encoding occupies. The length is used to advance the index i to the next character in the string. But this pattern is clumsy, and we need it constantly. Fortunately, Go's range loop, when applied to a string, performs the UTF-8 decoding implicitly. The loop below produces the output illustrated in Figure 3.5; notice that for non-ASCII characters the index advances by more than one byte.


for i, r := range "Hello, 世界" {
	fmt.Printf("%d\t%q\t%d\n", i, r, r)
}
We can use a simple loop like this to count the number of characters in a string:

n := 0
for _, _ = range s {
	n++
}
As with the other loop forms, we can omit the variables we do not need:

n := 0
for range s {
	n++
}
Or we can simply call utf8.RuneCountInString(s).

As mentioned earlier, it is mostly a matter of convention that a text string is interpreted as UTF-8-encoded data, but for a range loop over a string it is more than a convention: it is guaranteed. What happens if we range over a string containing arbitrary binary data, or UTF-8 data containing errors?

Each time a UTF-8 decoder, whether called explicitly via utf8.DecodeRuneInString or implicitly inside a range loop, encounters erroneous UTF-8 input, it produces the special Unicode replacement character '\uFFFD', usually printed as a black hexagonal or diamond shape containing a white question mark. When a program encounters this character, it is usually a danger sign, indicating that the input is not a well-formed UTF-8 string.

UTF-8 is very convenient as an interchange format, but inside a program a sequence of runes may be more convenient, because runes are all the same size and therefore easily support array indexing and slicing.

A []rune conversion applied to a UTF-8-encoded string returns the sequence of Unicode code points that the string encodes:

// "program" in Japanese katakana
s := "プログラム"
fmt.Printf("% x\n", s) // "e3 83 97 e3 83 ad e3 82 b0 e3 83 a9 e3 83 a0"
r := []rune(s)
fmt.Printf("%x\n", r)  // "[30d7 30ed 30b0 30e9 30e0]"
(The verb % x in the first Printf, with a space before the x, inserts a space between each pair of hex digits.)

If a slice or array of Unicode characters of type []rune is converted to a string, it produces the concatenation of their UTF-8 encodings:

fmt.Println(string(r)) // "プログラム"
Converting an integer value to a string interprets the integer as a rune value and yields a UTF-8 string containing only the character for that Unicode code point:

fmt.Println(string(65))     // "A", not "65"
fmt.Println(string(0x4eac)) // "京"
If the code point is invalid, the replacement character \uFFFD is substituted:

fmt.Println(string(1234567)) // "�"
Character type details
A character constant is a single character enclosed in single quotes (''). For example: var c1 byte = 'a', var c2 int = '中', var c3 byte = '9'.
Go allows the escape character '\' to turn the character that follows it into a special character constant. For example: var c4 byte = '\n' // '\n' represents a newline.
Go characters use UTF-8 encoding: letters take 1 byte, Chinese characters take 3 bytes.
In Go, a character is essentially an integer; when printed directly, the output is the character's code value under its encoding [see: ASCII, Unicode, UTF-8].

We can assign a number directly to a variable and then print it with the %c format verb to output the Unicode character corresponding to that number.

Character values can take part in arithmetic; they are equivalent to integers, because each character has a corresponding Unicode code value.
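The three points above in one small sketch (the values are illustrative):

```go
package main

import "fmt"

func main() {
	var c1 byte = 'a'
	fmt.Println(c1) // 97: printed directly, a character shows its code value

	var n int = 0x4e2d
	fmt.Printf("%c\n", n) // 中: %c prints the character for a code point

	// Characters can take part in arithmetic, since they are integers.
	fmt.Println('a' + 1)      // 98
	fmt.Printf("%c\n", 'a'+1) // b
}
```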


What is the difference between UTF-8 and Unicode?
Unicode is a character set; ASCII is also a character set.

A character set assigns a unique ID to each character, and every character we commonly use has a unique ID in the Unicode character set. For example, in the earlier example, 'a' is encoded as 97 in both Unicode and ASCII. The character '你' has the Unicode encoding 20320, but in the character sets of different countries its ID will differ. In Unicode, however, a character's ID never changes.
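These IDs can be checked directly in Go:

```go
package main

import "fmt"

func main() {
	fmt.Println('a')  // 97: the same ID in ASCII and Unicode
	fmt.Println('你') // 20320: this character's Unicode code point

	// %U shows the conventional U+hhhh notation.
	fmt.Printf("%U %U\n", 'a', '你') // U+0061 U+4F60
}
```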

UTF-8 is an encoding rule that encodes a character's Unicode ID in a particular way. UTF-8 is a variable-length encoding, using 1 to 4 bytes. The rules are roughly:

0xxxxxxx represents the characters 0 to 127 and is compatible with the ASCII character set.
Code points from 128 to 0x10ffff represent other characters and use multiple bytes.
Under these rules, each character of Latin-script languages still occupies one byte, while each Chinese character occupies three bytes.

Broadly speaking, Unicode refers to both the character set and the encoding rules defined by the standard, i.e. the Unicode character set together with encodings such as UTF-8 and UTF-16.

References:
"The Go Programming Language" (Chinese translation): https://github.com/994603831/gopl-zh.github.com
----------------
Disclaimer: this is an original article by the CSDN blogger "ice north whirlwind", licensed under the CC 4.0 BY-SA copyright agreement; reproductions must include the original source link and this statement.
Original link: https://blog.csdn.net/FHD994603831/article/details/92435724
