Extracting substrings from UTF-8 strings with mixed Chinese and English text in Lua

 

Reference blog: Interception and word count of UTF8 strings in lua [reprint]

Requirement

Extract a substring by character count rather than by byte count


utf8sub(string, start position, number of characters)

utf8sub("Hello 1 world haha",2,5) = good 1 world ha
utf8sub("1 hello 1 world haha",2,5) = hello 1 world
utf8sub("Hello world 1 haha",1,5) = hello world 1
utf8sub("12345678",3,5)    =    34567
utf8sub("øpø hello pix",2,5) = pø hello p


Wrong approaches

I found several algorithms online, but none of them is correct: either the output is garbled, or the algorithm assumes that every Chinese character occupies 4 bytes, which is not general enough.

1. string.sub(s, 1, length*4)

  Using `string.sub(s, 1, length*4)` directly, as some posts suggest, is wrong, because UTF-8 characters do not have a fixed width. Common Chinese characters occupy 3 bytes and ASCII characters 1 byte, so the byte lengths of the characters in `你好1世界` are `3,3,1,3,3`. No multiplier works on mixed strings: with 3 bytes per character, taking 4 characters gives 4*3 = 3+3+1+3+2, so only the first 2 bytes of `界` are taken and the output is garbled; with 4 bytes per character, 4*4 = 16 bytes is more than the whole 13-byte string, so 5 characters come back instead of 4.

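2. if byte>128 then index = index + 4

  This treats every non-ASCII character as 4 bytes long, so it skips too far on 2-byte characters such as `ø` and on 3-byte characters, which include most common Chinese characters.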
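The first pitfall can be reproduced in a few lines. This is only a sketch, and it assumes the source file is saved as UTF-8 so that the string literals contain UTF-8 bytes; the variable names are illustrative.

```lua
-- Byte-based slicing on a mixed string (file assumed to be saved as UTF-8).
local s = "你好1世界"

-- '#' counts bytes, not characters: 3 + 3 + 1 + 3 + 3 = 13.
print(#s)                        --> 13

-- Cutting at a fixed 3 bytes per character stops inside the last character.
local cut = string.sub(s, 1, 4 * 3)
print(#cut)                      --> 12

-- The result is not the first four characters: it carries two dangling
-- bytes of 界 after 你好1世, which a terminal shows as garbage.
print(cut == "你好1世")           --> false
print(#cut - #"你好1世")          --> 2
```

The two extra bytes are the start of `界` without its final byte, which is why the output looks garbled.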

The key to the problem

1. UTF-8 characters have variable length

2. The byte length of a character follows a regular pattern

 

As described in the notes on character encodings, UTF-8 is an encoding scheme for the Unicode character set. Its variable-length encoding works as follows:

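One byte: 0*******

Two bytes: 110*****, 10******

Three bytes: 1110****, 10******, 10******

Four bytes: 11110***, 10******, 10******, 10******

Five bytes: 111110**, 10******, 10******, 10******, 10******

Six bytes: 1111110*, 10******, 10******, 10******, 10******, 10******

(The five- and six-byte forms belong to the original UTF-8 design; RFC 3629 limits UTF-8 to four bytes, so they do not occur in valid modern text.)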
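These bit patterns can be checked by dumping the bytes of a few characters in binary. The sketch below assumes the source file is saved as UTF-8; `bits` and `dump` are hypothetical helper names used only for this illustration.

```lua
-- Render one byte (0..255) as an 8-character binary string.
local function bits(b)
    local out = {}
    for i = 7, 0, -1 do
        out[#out + 1] = (math.floor(b / 2 ^ i) % 2 == 1) and "1" or "0"
    end
    return table.concat(out)
end

-- Render every byte of a string in binary, separated by spaces.
local function dump(s)
    local out = {}
    for i = 1, #s do
        out[#out + 1] = bits(string.byte(s, i))
    end
    return table.concat(out, " ")
end

print(dump("a"))    --> 01100001
print(dump("ø"))    --> 11000011 10111000
print(dump("你"))   --> 11100100 10111101 10100000
```

The first byte of each character starts with 0, 110 or 1110 respectively, and every following byte starts with 10.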

Therefore, to determine how many bytes a UTF-8 character occupies, you only need to read its first byte: according to the rules above, the value of that byte tells you the length of the whole character.

The code is as follows:


-- Return the byte length of a UTF-8 character, given the value of its first byte.
-- Returns 0 when ch is nil (reading past the end of the string).
local function charsize(ch)
    if not ch then return 0
    elseif ch >= 252 then return 6
    elseif ch >= 248 then return 5
    elseif ch >= 240 then return 4
    elseif ch >= 224 then return 3
    elseif ch >= 192 then return 2
    else return 1 -- ASCII; stray continuation bytes also advance by one
    end
end


-- Count the characters in a UTF-8 string; each character counts as one,
-- regardless of how many bytes it occupies.
-- e.g. utf8len("1你好") => 3
-- Returns: total count, single-byte character count, multi-byte character count
function utf8len(str)
    local len = 0
    local aNum = 0 -- number of single-byte characters (letters, digits, ...)
    local hNum = 0 -- number of multi-byte characters (e.g. Chinese)
    local currentIndex = 1
    while currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        local cs = charsize(char)
        currentIndex = currentIndex + cs
        len = len + 1
        if cs == 1 then
            aNum = aNum + 1
        elseif cs >= 2 then
            hNum = hNum + 1
        end
    end
    return len, aNum, hNum
end


-- Extract a substring from a UTF-8 string, counting in characters
-- str: the string to extract from
-- startChar: index of the first character to take, starting from 1
-- numChars: number of characters to take
function utf8sub(str, startChar, numChars)
    -- Skip startChar - 1 characters to find the starting byte index.
    local startIndex = 1
    while startChar > 1 do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + charsize(char)
        startChar = startChar - 1
    end

    local currentIndex = startIndex

    -- Advance over numChars characters, stopping at the end of the string.
    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + charsize(char)
        numChars = numChars - 1
    end
    return str:sub(startIndex, currentIndex - 1)
end

-- self-test
function test()
    -- test utf8len
    assert(utf8len("Hello 1 world haha") == 7)
    assert(utf8len("Hello world 1 haha") == 8)
    assert(utf8len("Hello world 1 haha") == 9)
    assert(utf8len("12345678") == 8)
    assert(utf8len("øpø hello pix") == 8)

    -- test utf8sub
    assert(utf8sub("Hello 1 world haha",2,5) == "Good 1 world haha")
    assert(utf8sub("1 hello 1 world haha",2,5) == "hello 1 world")
    assert(utf8sub("Hello 1 world haha",2,6) == "Hello 1 world")
    assert(utf8sub("Hello world 1 haha",1,5) == "Hello world 1")
    assert(utf8sub("12345678",3,5) == "34567")
    assert(utf8sub("øpø hello pix",2,5) == "pø hello p")

    print("all test succ")
end

test()


 
