Unlike other languages, Go has no dedicated character type; in Go, characters are just a special case of integers.
Why are characters just integers? Because byte and rune, the two types Go uses to represent characters, are both aliases for integer types. We can see this in the Go source code:
// byte is an alias for uint8 and is equivalent to uint8 in all ways. It is
// used, by convention, to distinguish byte values from 8-bit unsigned
// integer values.
type byte = uint8
// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32
- byte is an alias for uint8. It is 1 byte long and is used to represent ASCII characters.
- rune is an alias for int32. It is 4 bytes long and is used to represent a Unicode code point (which a string stores in UTF-8 encoding).
Tip: Unicode assigns a number to each symbol, starting from 0. That number is called a "code point".
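To make the alias relationship concrete, here is a minimal sketch. The helper name sizes is made up for this illustration; everything else is standard Go. It shows that a uint8 can be assigned to a byte (and an int32 to a rune) with no conversion, and that the two types occupy 1 and 4 bytes.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sizes is a hypothetical helper for this example: it reports the
// storage size, in bytes, of a byte value and of a rune value.
func sizes() (byteSize, runeSize uintptr) {
	var b byte
	var r rune
	return unsafe.Sizeof(b), unsafe.Sizeof(r)
}

func main() {
	var u uint8 = 106
	var b byte = u // no conversion needed: byte and uint8 are the same type

	var i int32 = 0x4E16
	var r rune = i // likewise, rune and int32 are the same type

	byteSize, runeSize := sizes()
	fmt.Printf("%c %c\n", b, r)     // j 世
	fmt.Println(byteSize, runeSize) // 1 4
}
```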
Representing characters
So how do we represent a character in Go?
Go uses single quotes to represent a character, for example 'j'.
byte
If you want a character of type byte, declare the variable with the byte type explicitly:
var byteC byte = 'j'
And because byte is really just uint8, the value can be used directly as an integer. In a format string, %c prints it as a character and %d prints it as an integer:
// Declare a character of type byte
var byteC byte = 'j'
fmt.Printf("character %c corresponds to integer %d\n", byteC, byteC)
// Output: character j corresponds to integer 106
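Since a byte is just an integer, ordinary arithmetic works on it. The sketch below is only an illustration of that point: toUpperASCII is a hypothetical helper written for this example, and it handles ASCII letters only. In real code you would normally reach for the standard library instead.

```go
package main

import "fmt"

// toUpperASCII is a hypothetical helper: it converts a lowercase ASCII
// letter to uppercase using plain integer arithmetic on the byte value.
// Any other byte is returned unchanged.
func toUpperASCII(c byte) byte {
	if c >= 'a' && c <= 'z' {
		return c - ('a' - 'A') // 'a' - 'A' is 32
	}
	return c
}

func main() {
	var byteC byte = 'j'
	upper := toUpperASCII(byteC)
	fmt.Printf("%c %d\n", upper, upper) // J 74
}
```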
rune
Just as with byte, if you want a character of type rune, declare the variable with the rune type explicitly:
var runeC rune = 'J'
But if you declare a character variable without specifying either type, Go defaults to rune:
runeC := 'J'
fmt.Printf("character %c has type %T\n", runeC, runeC)
// Output: character J has type int32
Why do we need two types?
At this point you might ask: if both types are used to represent characters, why do we need two of them?
We know that byte occupies one byte, so it can represent ASCII characters. UTF-8, however, is a variable-length encoding in which a character takes anywhere from 1 to 4 bytes. A single byte clearly cannot hold such a character, and even if you use several byte values, the raw bytes alone do not tell you where one character ends and the next begins unless you decode the UTF-8 yourself.
So if you carelessly slice a string that contains Chinese characters, the output will be garbled:
testString := "你好,世界"
fmt.Println(testString[:2]) // garbled output, because only the first two bytes are taken
fmt.Println(testString[:3]) // prints "你": one Chinese character is represented by three bytes
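If you do have to work at the byte level, the standard unicode/utf8 package can tell you how many bytes a character occupies. Here is a minimal sketch; firstRuneWidth is a hypothetical wrapper written for this example, while utf8.DecodeRuneInString and utf8.ValidString are the real library functions.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRuneWidth is a hypothetical wrapper: it returns the first
// character of s and the number of bytes that character occupies
// in the string's UTF-8 encoding.
func firstRuneWidth(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	testString := "你好"
	r, size := firstRuneWidth(testString)
	fmt.Printf("%c occupies %d bytes\n", r, size) // 你 occupies 3 bytes

	// Slicing at a character boundary keeps the text intact.
	fmt.Println(testString[:size]) // 你

	// Slicing in the middle of a character leaves invalid UTF-8.
	fmt.Println(utf8.ValidString(testString[:2])) // false
}
```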
This is where rune comes in. Use []rune() to convert the string into a slice of Unicode code points and slice that instead; this way you do not need to think about how many bytes each UTF-8 character in the string takes:
testString := "你好,世界"
fmt.Println(string([]rune(testString)[:2])) // Output: 你好
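The same idea can be wrapped in a small function. This is only a sketch: truncateRunes is a hypothetical helper name, and the example uses a string without punctuation so the byte count is easy to verify by hand. It also contrasts len(), which counts bytes, with utf8.RuneCountInString, which counts characters.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes is a hypothetical helper: it returns the first n
// characters (code points) of s, clamping n to the valid range so
// that it never panics on short strings.
func truncateRunes(s string, n int) string {
	runes := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(runes) {
		n = len(runes)
	}
	return string(runes[:n])
}

func main() {
	testString := "你好世界"
	fmt.Println(len(testString))                    // 12 (bytes)
	fmt.Println(utf8.RuneCountInString(testString)) // 4 (characters)
	fmt.Println(truncateRunes(testString, 2))       // 你好
	fmt.Println(truncateRunes(testString, 100))     // 你好世界
}
```

Note that []rune(s) copies the whole string, so for very long strings a byte-offset approach may be preferable.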
Tip: Unicode, like ASCII, is a character set; UTF-8 is an encoding.
Traversing strings
There are two ways to traverse a string: by index, or with range.
Index traversal
In Go, strings are stored in UTF-8 encoding. The len() function returns the length of the string in bytes, and indexing a string is likewise done by byte. So if the string contains characters that take more than one byte in UTF-8, index traversal prints garbled output:
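// Needs imports: fmt, reflect
// Note: the comma in the string is the full-width "," (3 bytes in UTF-8).
testString := "Hello,世界"
for i := 0; i < len(testString); i++ {
	c := testString[i] // c is a single byte (uint8)
	fmt.Printf("%c has type %s\n", c, reflect.TypeOf(c))
}
/* Output:
H has type uint8 (ASCII characters print normally)
e has type uint8
l has type uint8
l has type uint8
o has type uint8
ï has type uint8 (the garbled output starts here)
¼ has type uint8
 has type uint8 (unprintable byte)
ä has type uint8
¸ has type uint8
 has type uint8 (unprintable byte)
ç has type uint8
 has type uint8 (unprintable byte)
 has type uint8 (unprintable byte)
*/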
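If you really need an index-based loop, one option is to decode a character at each position and advance the index by that character's width. The sketch below assumes this approach; decodeByIndex is a hypothetical helper, and utf8.DecodeRuneInString is the standard library function doing the actual decoding.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeByIndex is a hypothetical helper: it walks s by byte index,
// decodes one character at each step, and returns the characters found.
func decodeByIndex(s string) []rune {
	var out []rune
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
		i += size // advance by the width of the character just decoded
	}
	return out
}

func main() {
	for _, r := range decodeByIndex("Hello世界") {
		fmt.Printf("%c ", r)
	}
	fmt.Println()
}
```

For invalid UTF-8 input, DecodeRuneInString returns utf8.RuneError with a width of 1, so the loop still makes progress.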
range
Traversing with range yields characters of type rune:
// Needs imports: fmt, reflect
testString := "Hello,世界"
for _, c := range testString {
	fmt.Printf("%c has type %s\n", c, reflect.TypeOf(c))
}
/* Output:
H has type int32
e has type int32
l has type int32
l has type int32
o has type int32
, has type int32
世 has type int32
界 has type int32
*/
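One detail worth knowing: the first value range yields is the byte offset at which each character starts, not a character count, so it jumps by more than one after a multi-byte character. A minimal sketch (runeOffsets is a hypothetical helper written for this example):

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper: it returns the byte offset at
// which each character of s starts, as reported by range.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	testString := "Go世界"
	for i, c := range testString {
		fmt.Printf("byte offset %d: %c\n", i, c)
	}
	fmt.Println(runeOffsets(testString)) // [0 1 2 5]
}
```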
Summary
- Go has no dedicated character type. A character is a sequence of bytes: it may be a single byte (an ASCII character) or several bytes (other Unicode characters encoded in UTF-8).
- byte is an alias for uint8. It is 1 byte long and is used to represent ASCII characters.
- rune is an alias for int32. It is 4 bytes long and is used to represent a Unicode code point.
- Slicing a string operates on bytes.
- Indexing a string yields bytes.
- To iterate over characters as rune values, traverse the string with range.