C++: keeping only the Chinese characters of a string (GBK, UTF-8)

Recently I have been working on sensitive-word handling for a game. To strengthen the masking process, I needed to strip everything except the Chinese characters from a string: digits, symbols, letters and so on.

First, I looked up some information and wrote a function:

Example: returns only the Chinese characters of the input string (GBK encoding):

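#include <string>

// Keeps only the double-byte (Chinese) characters of a GBK-encoded string.
// Assumes well-formed GBK: every byte with the high bit set is followed by
// the second byte of the same character.
std::string StrWithOutSymbol(const std::string &source)
{
    std::string sourceWithOutSymbol;

    int i = 0;
    while (source[i] != 0)
    {
        if (source[i] & 0x80)
        {
            // High bit set: first byte of a 2-byte character, copy both bytes.
            sourceWithOutSymbol += source[i];
            sourceWithOutSymbol += source[i + 1];
            i += 2;
        }
        else
        {
            // Single-byte character (digit, letter, symbol): skip it.
            i++;
        }
    }
    return sourceWithOutSymbol;
}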

The function works by using the test ord($str) & 0x80 (written here in PHP notation) to detect Chinese characters.

0x80 is 1000 0000 in binary: only the most significant bit is 1. A set top bit marks the byte as part of a Chinese character. In this encoding a Chinese character occupies 2 bytes but represents a single character.

"On Windows, the Simplified Chinese character set uses both 1-byte and 2-byte encodings. When the high byte is in the range 0x00~0x7F, the character takes one byte; when the high byte is 0x80 or above, the character takes 2 bytes."

When you find a byte whose value is greater than 0x7F, it must be part of a Chinese character (pieced together with another byte). So how do we test for "greater than 0x7F"?
The number right after 0x7F (0111 1111) is 0x80 (1000 0000), so any byte greater than 0x7F must have its most significant bit set to 1. We only need to check whether the highest bit is 1.
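As a small illustration of the rule quoted above, here is a sketch that counts the characters of a GBK string by stepping 2 bytes whenever the high bit is set and 1 byte otherwise. The name gbkCharCount is made up for this example, and the sketch assumes the input really is well-formed GBK.

```cpp
#include <cstddef>
#include <string>

// Counts characters (not bytes) in a GBK-encoded string.
// gbkCharCount is a hypothetical helper, not part of the original project.
// Assumption: the input is well-formed GBK, so a byte with the high bit set
// is always the first byte of a 2-byte character.
inline std::size_t gbkCharCount(const std::string &s)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size())
    {
        // High bit set: 2-byte character. Otherwise: 1-byte character.
        i += (static_cast<unsigned char>(s[i]) & 0x80) ? 2 : 1;
        ++count;
    }
    return count;
}
```

For example, gbkCharCount("A\xD6\xD0" "B") returns 3: the letter A, one Chinese character stored as the two bytes D6 D0, and the letter B.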

How to test it:
Use bitwise AND (a result bit is 1 only when both input bits are 1, otherwise it is 0).
For example, to test whether the third bit of a number is 1, AND it with 4 (100); to test whether the second bit is 1, AND it with 2 (10).
In the same way, to test whether the eighth bit is 1, AND the number with 0x80 (1000 0000).
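The bit test described above can be sketched in C++ like this. The helper name isBitSet and the convention that position 1 is the lowest bit are my own choices for the example, picked to match the "third bit", "eighth bit" wording in the text.

```cpp
#include <cstdint>

// Returns true if bit number `position` of `value` is 1.
// Positions are counted from 1 at the least significant bit, so position 3
// corresponds to the mask 4 (100) and position 8 to the mask 0x80.
// isBitSet is a hypothetical helper name used only for this illustration.
constexpr bool isBitSet(std::uint8_t value, int position)
{
    return (value & (1u << (position - 1))) != 0;
}
```

With this helper, isBitSet(4, 3) is true, isBitSet(4, 2) is false, and isBitSet(b, 8) is the same test as b & 0x80.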

Why not simply test > 0x7F here? That may work in PHP, but in other, strongly typed languages the most significant bit of a signed byte is the sign bit (in C++, plain char is signed on many platforms). A byte with that bit set is negative, so it can never be greater than 0x7F, the largest value a signed byte can hold.
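To make the signedness point concrete, here is a sketch of two ways to write the test in C++ so that it does not depend on whether plain char is signed. Both function names are invented for this example.

```cpp
// Two portable ways to ask "is this byte greater than 0x7F?".
// Both give the same answer whether plain `char` is signed or unsigned.
// The function names are hypothetical, chosen for this illustration.

inline bool highBitByCompare(char c)
{
    // Convert to unsigned first, so the values 0x80..0xFF stay positive
    // and the comparison behaves as intended.
    return static_cast<unsigned char>(c) > 0x7F;
}

inline bool highBitByMask(char c)
{
    // The mask looks only at the top bit of the byte, so the sign of `c`
    // does not affect the result.
    return (c & 0x80) != 0;
}
```

The mask version is the one the function above uses. The compare version works only because of the cast to unsigned char.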


Another example:
The ASCII code of a is 97 (1100001)
The ASCII code of A is 65 (1000001)

The ASCII code of b is 98 (1100010)
The ASCII code of B is 66 (1000010)

A pattern emerges: among the letters a-z, the sixth bit of a lowercase letter is always 1, while the sixth bit of the matching uppercase letter is 0. We can use this to tell the case apart.
All it takes is to AND the letter with 0x20 (100000) and check the result:
if (ord($a) & 0x20) {
        // lowercase
}

How do we turn a letter into uppercase, that is, change the sixth bit from 1 to 0?
$a = 'a';
$a = chr(ord($a) & (~0x20));
echo $a; // A
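The same case trick can be sketched in C++. The helper names are invented for this example, and all three functions assume the argument is already known to be an ASCII letter (A-Z or a-z). For other characters the sixth bit means something else.

```cpp
// ASCII case handling through bit 6 (mask 0x20).
// Assumption: `c` is an ASCII letter. These hypothetical helpers do not
// check that, so digits or symbols would give meaningless results.

constexpr bool isLowerLetter(char c)
{
    return (c & 0x20) != 0;              // bit 6 set: lowercase
}

constexpr char toUpperLetter(char c)
{
    return static_cast<char>(c & ~0x20); // clear bit 6
}

constexpr char toLowerLetter(char c)
{
    return static_cast<char>(c | 0x20);  // set bit 6
}
```

For instance, toUpperLetter('a') gives 'A' (97 & ~32 = 65), and applying it to a letter that is already uppercase leaves it unchanged.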

 

Full of confidence, I added the function to the project, clicked Run, and typed some Chinese to test it. The project crashed?! Array index out of bounds?!

Why? I tracked down where the error occurred and found that cocos-lua, which I am using, passes strings to C++ encoded as UTF-8. So I went and read up on the UTF-8 encoding rules:

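UTF-8 encoding rules: a single-byte character has 0 as its highest bit. In a multi-byte character, the number of consecutive 1 bits at the top of the first byte gives the total number of bytes in the encoding, and each of the remaining bytes begins with 10. The UTF-8 conversion table is as follows:

Code point range        Byte 1      Byte 2      Byte 3      Byte 4
U+0000  to U+007F       0xxxxxxx
U+0080  to U+07FF       110xxxxx    10xxxxxx
U+0800  to U+FFFF       1110xxxx    10xxxxxx    10xxxxxx
U+10000 to U+10FFFF     11110xxx    10xxxxxx    10xxxxxx    10xxxxxx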

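The first-byte rule can be sketched as a small C++ helper that returns how many bytes a UTF-8 sequence occupies, judging only by its first byte. The name utf8SequenceLength is my own, and the sketch follows the current 4-byte limit of UTF-8.

```cpp
#include <cstddef>

// Returns the total length in bytes of the UTF-8 sequence that starts with
// `lead`, or 0 if `lead` cannot start a sequence (a continuation byte of
// the form 10xxxxxx, or a byte of the form 11111xxx).
// utf8SequenceLength is a hypothetical helper for illustration only.
inline std::size_t utf8SequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}
```

Most Chinese characters encode with a first byte of the form 1110xxxx, so the helper returns 3 for them, for example for the byte 0xE4.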
My earlier function assumed GBK, where a Chinese character always takes exactly two bytes. In UTF-8 a Chinese character may take 3 or 4 bytes (the original UTF-8 design even allowed 5 and 6), so the earlier function can step past the end of the string. A UTF-8 encoded string needs separate handling, so I wrote a new function:

A function that keeps only the Chinese characters of a UTF-8 encoded string:

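#include <cstddef>
#include <string>

// Keeps only the 3- and 4-byte UTF-8 sequences (where Chinese characters
// live) and drops 1- and 2-byte characters.
std::string censorStrWithOutSymbol(const std::string &source)
{
    std::string sourceWithOutSymbol;

    std::size_t i = 0;
    while (i < source.size())
    {
        if ((source[i] & 0xE0) == 0xE0)
        {
            // First byte 111xxxxx: 4 bytes if the next bit is set, otherwise 3.
            int byteCount = (source[i] & 0x10) ? 4 : 3;
            // Stop at the end of the string if the sequence is truncated.
            for (int a = 0; a < byteCount && i < source.size(); a++)
            {
                sourceWithOutSymbol += source[i];
                i++;
            }
        }
        else if ((source[i] & 0xC0) == 0xC0)
        {
            // First byte 110xxxxx: a 2-byte character, skip it.
            i += 2;
        }
        else
        {
            // ASCII byte or stray continuation byte: skip one byte.
            i += 1;
        }
    }
    return sourceWithOutSymbol;
}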
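One caveat: 3-byte UTF-8 sequences also cover characters that are not Chinese, such as full-width punctuation, so the function above keeps those as well. If that matters, a stricter variant could decode each sequence and check the code point. The sketch below does this for the CJK Unified Ideographs block (U+4E00 to U+9FFF) only. The names isCjkIdeograph and keepCjkIdeographs are invented here, and leaving out the extension blocks is an assumption of this example, not something the project above does.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: true for code points in the CJK Unified Ideographs
// block (U+4E00..U+9FFF). Extension blocks are deliberately left out.
inline bool isCjkIdeograph(std::uint32_t cp)
{
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Keeps only the characters accepted by isCjkIdeograph(). Everything else,
// including malformed or truncated sequences, is dropped.
inline std::string keepCjkIdeographs(const std::string &source)
{
    std::string result;
    std::size_t i = 0;
    while (i < source.size())
    {
        const unsigned char lead = static_cast<unsigned char>(source[i]);
        std::size_t length = 1;
        std::uint32_t cp = lead;
        if ((lead & 0xE0) == 0xC0)      { length = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; }

        if (i + length > source.size())
            break; // truncated sequence at the end of the input

        // Fold the payload bits of each continuation byte into the code point.
        bool wellFormed = true;
        for (std::size_t k = 1; k < length; ++k)
        {
            const unsigned char next =
                static_cast<unsigned char>(source[i + k]);
            if ((next & 0xC0) != 0x80) { wellFormed = false; break; }
            cp = (cp << 6) | (next & 0x3F);
        }
        if (!wellFormed)
        {
            i += 1; // skip the bad first byte and try to resynchronize
            continue;
        }

        if (isCjkIdeograph(cp))
            result.append(source, i, length);
        i += length;
    }
    return result;
}
```

With this variant, a full-width comma (U+FF0C, bytes EF BC 8C) is dropped, while the simpler function above would keep it.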

I clicked Run, and it worked! Very satisfying.

Origin www.cnblogs.com/kpxy/p/11256791.html