Reference blog: Truncating and counting characters in UTF-8 strings in Lua [repost]
Requirement
Extract a substring by character count rather than by byte count
function(string, start position, intercept length) utf8sub("Hello 1 world haha",2,5) = good 1 world ha utf8sub("1 hello 1 world haha",2,5) = hello 1 world utf8sub("Hello world 1 haha",1,5) = hello world 1 utf8sub("12345678",3,5) = 34567 utf8sub("øpø hello pix",2,5) = pø hello p
Wrong approaches
I found some algorithms on the Internet, but none of them is correct: either the output is garbled, or the algorithm assumes that every Chinese character occupies a fixed 4 bytes, which is not general enough.
1. string.sub(s, 1, length * 4)
Using `string.sub(s, 1, length * 4)` directly, as suggested online, is wrong, because a fixed multiplier assumes that every character occupies the same number of bytes. In a mixed Chinese and English string this does not hold: in UTF-8 the characters of `你好1世界` occupy `3,3,1,3,3` bytes. Taking 2 characters as 2*4 = 8 bytes covers 3+3+1+1, so the result contains three whole characters plus only the first byte of `世`, and that dangling byte shows up as garbled output.
2. if byte>128 then index = index + 4
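Both wrong approaches can be reproduced in a few lines. The following is only a sketch: `naive_count` is a hypothetical name for the second approach, and the source file is assumed to be saved as UTF-8 so that the string literal holds UTF-8 bytes.

```lua
local s = "你好1世界"   -- 3 + 3 + 1 + 3 + 3 = 13 bytes in UTF-8

-- Wrong way 1: fixed multiplier, asking for 2 characters.
local cut = string.sub(s, 1, 2 * 4)
print(#cut)                       -- 8
print(cut:sub(1, 7) == "你好1")   -- true: three whole characters came back
print(cut:byte(8))                -- 228: lead byte of "世" with nothing after it

-- Wrong way 2: fixed stride of 4 for every non-ASCII byte.
local function naive_count(str)
  local index, count = 1, 0
  while index <= #str do
    if str:byte(index) > 128 then
      index = index + 4
    else
      index = index + 1
    end
    count = count + 1
  end
  return count
end

print(naive_count(s))             -- 4, although the string holds 5 characters
```

In the second case the walk lands in the middle of `好` after the first step, so every later position is misaligned as well.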
The key to the problem
1. UTF-8 characters are variable-length
2. The character length follows a regular pattern
As listed in Literal Character Encodings, UTF-8 is an encoding scheme for the Unicode character set. Its variable-length encoding works as follows:
One byte: 0********
Two bytes: 110*****, 10******
Three bytes: 1110****, 10******, 10******
Four bytes: 11110***, 10******, 10******, 10******
Five bytes: 111110**, 10******, 10******, 10******, 10******
Six bytes: 1111110*, 10******, 10******, 10******, 10******, 10******
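The bit patterns above can be checked against real bytes. A minimal sketch, assuming the source file is saved as UTF-8; `to_binary` and `dump` are hypothetical helper names, not part of the original post:

```lua
-- Render one byte (0-255) as an 8-digit binary string.
local function to_binary(byte)
  local bits = {}
  for i = 7, 0, -1 do
    bits[#bits + 1] = string.format("%d", math.floor(byte / 2 ^ i) % 2)
  end
  return table.concat(bits)
end

-- Render every byte of a string in binary, separated by spaces.
local function dump(ch)
  local parts = {}
  for i = 1, #ch do
    parts[i] = to_binary(ch:byte(i))
  end
  return table.concat(parts, " ")
end

print(dump("A"))    -- 01000001                    (one byte:    0*******)
print(dump("ø"))    -- 11000011 10111000           (two bytes:   110*****)
print(dump("你"))   -- 11100100 10111101 10100000  (three bytes: 1110****)
```

Every byte after the first starts with `10`, which is what makes the first byte enough to tell the length.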
Therefore, once you have the byte string, you only need the first byte of a character to determine how many bytes that character occupies: compare its value against the lead-byte patterns above. Note that current UTF-8 (RFC 3629) allows at most four bytes per character; the five- and six-byte forms belong to the original design and are no longer valid, although checking for them is harmless.
The code is as follows:
local funciton charsize(ch) if not ch then return 0 elseif ch >=252 then return 6 elseif ch >= 248 and ch < 252 then return 5 elseif ch >= 240 and ch < 248 then return 4 elseif ch >= 224 and ch < 240 then return 3 elseif ch >= 192 and ch < 224 then return 2 elseif ch < 192 then return 1 end end
-- Calculate the number of characters in the utf8 string, each character is calculated as one character -- eg utf8len("1 hello") => 3 function utf8len(str) local len = 0 local aNum = 0 -- the number of letters local hNum = 0 -- the number of Chinese characters local currentIndex = 1 while currentIndex <= #str do local char = string.byte(str, currentIndex) local cs = charsize(char) currentIndex = currentIndex + cs only = only +1 if cs == 1 then aNum = aNum + 1 elseif cs >= 2 then hNum = hNum + 1 end end return len, aNum, hNum end
-- intercept utf8 string -- str: String to intercept -- startChar: start character subscript, starting from 1 -- numChars: length of characters to be intercepted function utf8sub(str, startChar, numChars) local startIndex = 1 while startChar > 1 do local char = string.byte(str, startIndex) startIndex = startIndex + chsize(char) startChar = startChar - 1 end local currentIndex = startIndex while numChars > 0 and currentIndex <= #str do local char = string.byte(str, currentIndex) currentIndex = currentIndex + chsize(char) numChars = numChars -1 end return str:sub(startIndex, currentIndex - 1) end -- self-test function test() -- test utf8len assert(utf8len("Hello 1 world haha") == 7) assert(utf8len("Hello world 1 haha") == 8) assert(utf8len("Hello world 1 haha") == 9) assert(utf8len("12345678") == 8) assert(utf8len("øpø hello pix") == 8) -- test utf8sub assert(utf8sub("Hello 1 world haha",2,5) == "Good 1 world haha") assert(utf8sub("1 hello 1 world haha",2,5) == "hello 1 world") assert(utf8sub("Hello 1 world haha",2,6) == "Hello 1 world") assert(utf8sub("Hello world 1 haha",1,5) == "Hello world 1") assert(utf8sub("12345678",3,5) == "34567") assert(utf8sub("øpø hello pix",2,5) == "pø hello p") print("all test succ") end test()