Characters in Go: representation and string traversal

Unlike many other languages, Go has no dedicated character type; in Go, characters are simply integers.

Why are characters simply integers? Because the two types Go uses to represent characters, byte and rune, are both aliases for integer types. In the Go source code we can see:

// byte is an alias for uint8 and is equivalent to uint8 in all ways. It is
// used, by convention, to distinguish byte values from 8-bit unsigned
// integer values.
type byte = uint8

// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32
  • byte is an alias for uint8, 1 byte long, used to represent ASCII characters
  • rune is an alias for int32, 4 bytes long, used to represent a Unicode code point

Tips: Unicode assigns a number to every symbol, starting from 0; this number is called a "code point".
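
To make the idea of a code point concrete, here is a minimal sketch that prints a few characters together with their code points. The helper name codePoint is hypothetical and exists only for this illustration; it relies on the %U verb of fmt.

```go
package main

import "fmt"

// codePoint formats a rune in the conventional U+XXXX notation.
// (codePoint is a hypothetical helper used only in this sketch.)
func codePoint(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	// Each rune is just the integer that Unicode assigns to the symbol.
	for _, r := range []rune{'A', 'j', '世', '界'} {
		fmt.Printf("%c -> %s\n", r, codePoint(r))
	}
}
```

Here 'A' is reported as U+0041 and '世' as U+4E16.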

Representation of characters

So how do we represent a character in Go?

Go uses single quotes to denote a character, for example 'j'.

byte

To declare a character of type byte, give the variable the type byte explicitly:

var byteC byte = 'j'

Because byte is essentially uint8, it can be used directly as an integer value. In a format string, %c prints the value as a character and %d prints it as an integer:

// Declare a character of type byte
var byteC byte = 'j'
fmt.Printf("character %c corresponds to integer %d\n", byteC, byteC)
// Output: character j corresponds to integer 106
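
Since a byte is just an integer, ordinary arithmetic works on characters. The following is a small sketch of that idea; toUpperASCII is a hypothetical helper written for this example, and it only handles ASCII letters.

```go
package main

import "fmt"

// toUpperASCII converts a lowercase ASCII letter to uppercase using plain
// integer arithmetic; any other byte is returned unchanged.
// (Hypothetical helper, for illustration only.)
func toUpperASCII(b byte) byte {
	if b >= 'a' && b <= 'z' {
		return b - ('a' - 'A') // the two cases are 32 apart in ASCII
	}
	return b
}

func main() {
	var byteC byte = 'j'
	fmt.Printf("%c -> %c\n", byteC, toUpperASCII(byteC)) // j -> J
	fmt.Println(byteC + 1)                               // 107
}
```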

rune

As with byte, to declare a character of type rune, give the variable the type rune explicitly:

var runeC rune = 'J'

But if you declare a character variable without specifying a type, Go defaults to rune:

runeC := 'J'
fmt.Printf("character %c has type %T\n", runeC, runeC)
// Output: character J has type int32

Why do we need two types?

At this point you might ask: if both are used to represent characters, why do we need two types?

We know that a byte occupies one byte, so it can represent ASCII characters. UTF-8, however, is a variable-length encoding in which a character takes anywhere from 1 to 4 bytes. A single byte clearly cannot represent such characters, and even if you use several bytes, you have no way of knowing in advance how many bytes the UTF-8 character you are handling occupies.
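
As a rough illustration of the variable length, the sketch below asks the standard unicode/utf8 package how many bytes a few characters need. The wrapper name encodedLen is an assumption made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes r occupies when encoded as UTF-8.
// (Hypothetical wrapper around utf8.RuneLen, for illustration only.)
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// 'A' is ASCII, '\u00e9' is é, '世' is a CJK character,
	// and '\U0001F600' is an emoji outside the Basic Multilingual Plane.
	for _, r := range []rune{'A', '\u00e9', '世', '\U0001F600'} {
		fmt.Printf("%c takes %d byte(s) in UTF-8\n", r, encodedLen(r))
	}
}
```

The four characters take 1, 2, 3 and 4 bytes respectively.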

So if you naively slice a string that contains Chinese, you get garbled output:

testString := "你好,世界"
fmt.Println(testString[:2]) // prints garbage, because only the first two bytes were taken
fmt.Println(testString[:3]) // prints 你: one Chinese character is represented by three bytes

This is where rune helps. Use []rune() to convert the string into a slice of Unicode code points and slice that instead; then you no longer need to think about how many bytes each UTF-8 character in the string occupies:

testString := "你好,世界"
fmt.Println(string([]rune(testString)[:2])) // prints: 你好

Tips: Unicode, like ASCII, is a character set; UTF-8 is an encoding.
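
The difference between the character set and the encoding shows up when the same string is viewed as bytes and as runes. Below is a minimal sketch; firstRunes is a hypothetical helper that wraps the []rune() conversion with bounds checks.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRunes returns the first n characters (runes) of s,
// or all of s if it has fewer than n characters.
// (Hypothetical helper, for illustration only.)
func firstRunes(s string, n int) string {
	r := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(r) {
		n = len(r)
	}
	return string(r[:n])
}

func main() {
	s := "你好"
	fmt.Println(len(s))                    // 6: length in UTF-8 bytes
	fmt.Println(utf8.RuneCountInString(s)) // 2: number of code points
	fmt.Println([]byte(s))                 // the UTF-8 encoded bytes
	fmt.Println([]rune(s))                 // the Unicode code points
	fmt.Println(firstRunes("你好,世界", 2))
}
```

Note that []rune(s) copies the whole string, so for long strings it may be preferable to walk the string and stop early.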

Traversing strings

There are two ways to traverse a string: by index, or with range.

Index traversal

In Go, strings are stored in UTF-8, so len() returns the length of the string in bytes, and indexing the string yields one byte at a time. As a result, if the string contains multi-byte UTF-8 characters, the output is garbled:

testString := "Hello,世界"

for i := 0; i < len(testString); i++ {
	c := testString[i]
	fmt.Printf("%c has type %s\n", c, reflect.TypeOf(c))
}

/* Output (the exact glyphs shown for bytes above 0x7F depend on the terminal):
H has type uint8 (ASCII characters come out fine)
e has type uint8
l has type uint8
l has type uint8
o has type uint8
ï has type uint8 (garbled output starts here)
¼ has type uint8
Œ has type uint8
ä has type uint8
¸ has type uint8
– has type uint8
ç has type uint8
• has type uint8
Œ has type uint8
*/
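
If an index-based loop is still wanted, one possible approach (not taken from the original article) is to decode a full character at each position with utf8.DecodeRuneInString and advance by its size. A minimal sketch, with decodeAll as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll walks s by byte index, but decodes one complete rune at each
// step and advances the index by the number of bytes that rune occupies.
// (Hypothetical helper, for illustration only.)
func decodeAll(s string) []rune {
	var out []rune
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
		i += size
	}
	return out
}

func main() {
	for _, r := range decodeAll("Go世界") {
		fmt.Printf("%c ", r)
	}
	fmt.Println()
}
```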

range

Traversing with range yields characters of type rune:

testString := "Hello,世界"

for _, c := range testString {
	fmt.Printf("%c has type %s\n", c, reflect.TypeOf(c))
}

/* Output:
H has type int32
e has type int32
l has type int32
l has type int32
o has type int32
, has type int32
世 has type int32
界 has type int32
*/
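
The first value produced by range, discarded with _ above, is the byte offset at which each character starts, so it jumps by more than one for multi-byte characters. A small sketch of this, with runeOffsets as a hypothetical helper:

```go
package main

import "fmt"

// runeOffsets returns the byte offset at which each character of s starts,
// as reported by the index variable of a range loop.
// (Hypothetical helper, for illustration only.)
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "Go世界"
	for i, c := range s {
		fmt.Printf("byte offset %d: %c\n", i, c)
	}
	fmt.Println(runeOffsets(s)) // [0 1 2 5]
}
```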

Summary

  • Go has no dedicated character type. A character is a sequence of bytes: a single byte for characters in the ASCII set, or several bytes for other Unicode characters
  • byte is an alias for uint8, 1 byte long, used to represent ASCII characters
  • rune is an alias for int32, 4 bytes long, used to represent a Unicode code point
  • Slicing a string operates on bytes
  • Indexing a string yields bytes
  • To iterate over characters as runes, traverse the string with range


Origin juejin.im/post/5dd0a718e51d453daa0e428c