Data Processing - Katakana → Hiragana conversion algorithm

Disclaimer: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reproducing it, please include the original source link and this statement.
This link: https://blog.csdn.net/a2111221996/article/details/102730177

A few days ago I finished a requirement: write an algorithm that converts every katakana character contained in a data set into hiragana. Some investigation showed that Japanese katakana comes in several kinds: full-width katakana, small full-width katakana, half-width katakana, and the katakana phonetic extensions. The requirement also says that no data may be lost during conversion; the data set is large, and any other characters it contains must not be converted, but they must not be dropped either. So I decided the algorithm had to recognize digits, letters, spaces, hiragana, katakana, the kanji used in Japanese, and so on, which in turn meant establishing how many bytes each of these characters occupies in the data set; only after that could the conversion table the requirement needs be built. The complication is that the strings in the data set are stored as multi-byte strings, while strings in the build environment are stored as single-byte strings, so if I initialized the table directly in the build environment the byte lengths would conflict and the required conversion could not be done.
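To make these categories concrete, here is a small illustration of my own (not from the original post): full-width ア is U+30A2, small full-width ァ is U+30A1, half-width ｱ is U+FF71, and the katakana phonetic extensions start at U+31F0 (e.g. ㇰ). ア and ァ map to the hiragana あ (U+3042) and ぁ (U+3041) at a fixed offset of 0x60, but the half-width and phonetic-extension forms do not, which is presumably part of why a lookup table is needed rather than simple code-point arithmetic.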
Based on this, I first tried converting the strings in the data set into single-byte strings, using the READ_ConvStringToStringEx function; for the underlying principle you can look up the MultiByteToWideChar and WideCharToMultiByte functions, which I will not go into here. But this ran into a difficulty: I could not determine how much memory the characters of the converted string occupy, because this approach is only valid for multi-byte strings. At that point the plan had to change again.
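For readers who want to see what the MultiByteToWideChar / WideCharToMultiByte route looks like, here is a minimal sketch of the usual two-call pattern for converting UTF-8 to UTF-16 and back on Windows. This is my own illustration of the Win32 functions named above, not the internal READ_ConvStringToStringEx helper.

#include <string>
#include <windows.h>

// Sketch only: UTF-8 -> UTF-16 using the usual "ask for the size, then convert" pattern.
std::wstring Utf8ToWide(const std::string& utf8)
{
    // With cbMultiByte = -1 the terminating null is included in the returned length.
    const int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<size_t>(len), L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
    wide.resize(static_cast<size_t>(len) - 1);   // drop the embedded terminating null
    return wide;
}

// Sketch only: UTF-16 -> UTF-8, the reverse direction.
std::string WideToUtf8(const std::wstring& wide)
{
    const int len = ::WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, nullptr, 0, nullptr, nullptr);
    if (len <= 0) return std::string();
    std::string utf8(static_cast<size_t>(len), '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, &utf8[0], len, nullptr, nullptr);
    utf8.resize(static_cast<size_t>(len) - 1);   // drop the embedded terminating null
    return utf8;
}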
In the end I settled on an external file: if the table is imported from an external file, I can define its encoding myself. Since I had not clearly established which Japanese encoding the build environment uses, and it provides no function for converting its single-byte strings into multi-byte strings, building the comparison table externally basically wrapped up my investigation. The next step was to write the algorithm. The data in the table and in the data set are UTF-8 encoded, and as we know the UTF-8 byte layout is as follows:
1 byte 0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
In UTF-8 every katakana occupies 3 bytes, so I wrote the recognition algorithm so that each character taken from the data is either copied through directly or routed into the conversion loops according to its byte length. The following is how the byte length of the character at index i of the string is determined.

const unsigned char c = static_cast<unsigned char>(strInput.at(i));  // lead byte of the character at index i
int len = 0;
if ((c & 0x80) == 0x00) len = 1;                                                        // 0xxxxxxx: 1 byte
else if ((c & 0x80) && (c & 0x40) && ((c & 0x20) == 0x00)) len = 2;                     // 110xxxxx: 2 bytes
else if ((c & 0x80) && (c & 0x40) && (c & 0x20) && ((c & 0x10) == 0x00)) len = 3;       // 1110xxxx: 3 bytes
else if ((c & 0x80) && (c & 0x40) && (c & 0x20) && (c & 0x10) && ((c & 0x08) == 0x00)) len = 4;  // 11110xxx: 4 bytes

0x80 1000 0000
0x40 0100 0000
0x20 0010 0000
0x10 0001 0000
0x08 0000 1000
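As a quick sanity check (my own worked example, not from the original post): the full-width katakana カ is U+30AB, which UTF-8 encodes as the three bytes E3 82 AB. The lead byte 0xE3 is 1110 0011, matching the 1110xxxx pattern above: its 0x80, 0x40, and 0x20 bits are set and its 0x10 bit is clear, so the 3-byte branch fires. The corresponding hiragana か (U+304B) encodes as E3 81 8B, also three bytes.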
If the length is 3 bytes, the character enters the loops I wrote. With performance in mind I put the loops with the highest hit rate first; if a character misses in one loop it moves on to the next, which keeps the common cases fast. Because the data set also contains 3-byte characters that are not katakana, these additional checks are what guarantee no data is lost. One more thing to note in the program: before moving on to the next character, remember to advance the value of i by the byte length of the character just handled, and compare i with the length of the string being processed; when the two are equal the outermost loop of the algorithm must end, otherwise things will go wrong. With that, the algorithm was complete. I ran into plenty of problems while testing and set a lot of breakpoints, but everything was eventually resolved and the requirement was implemented successfully. I will hold off on posting the source code.
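Since the post holds back the source, here is a minimal sketch of how the overall loop described above could look, under assumptions of my own: the table lives in an external UTF-8 file with one katakana-TAB-hiragana pair per line, the byte-length test is the combined-mask form of the bit checks shown earlier, and a single map lookup stands in for the hit-rate-ordered loop chain the post describes. The function and file names (LoadKanaTable, ConvertKatakanaToHiragana) are mine, not the original code's.

#include <fstream>
#include <map>
#include <string>

// Assumed table format: a UTF-8 text file with one "katakana<TAB>hiragana" pair per line.
std::map<std::string, std::string> LoadKanaTable(const std::string& path)
{
    std::map<std::string, std::string> table;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type tab = line.find('\t');
        if (tab != std::string::npos)
            table[line.substr(0, tab)] = line.substr(tab + 1);
    }
    return table;
}

// Walk the UTF-8 input one character at a time, advancing by the length taken
// from the lead byte, and replace 3-byte sequences that hit the table.
std::string ConvertKatakanaToHiragana(const std::string& strInput,
                                      const std::map<std::string, std::string>& table)
{
    std::string out;
    std::string::size_type i = 0;
    while (i < strInput.size()) {                    // stop exactly when i reaches the string length
        const unsigned char c = static_cast<unsigned char>(strInput[i]);
        std::string::size_type len = 1;
        if      ((c & 0x80) == 0x00) len = 1;        // 0xxxxxxx (mask form of the bit tests above)
        else if ((c & 0xE0) == 0xC0) len = 2;        // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3;        // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4;        // 11110xxx

        const std::string ch = strInput.substr(i, len);
        if (len == 3) {
            // Hiragana, kanji, etc. are also 3 bytes, so only a table hit is converted;
            // everything else is copied through unchanged so no data is lost.
            const std::map<std::string, std::string>::const_iterator it = table.find(ch);
            out += (it != table.end()) ? it->second : ch;
        } else {
            out += ch;                               // digits, letters, spaces, ... pass through
        }
        i += len;                                    // advance i by this character's byte length
    }
    return out;
}

A call would then look like ConvertKatakanaToHiragana(record, LoadKanaTable("kana_table.txt")), where the file name is just a placeholder.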
