Pinyin analysis and search: automatic analysis of mixed Pinyin and Hanzi input (including polyphonic words and pinyin abbreviations)

I recently needed pinyin search at work. I put together an implementation based on examples found online, and I am sharing it here.

This code can recognize input that mixes pinyin and Chinese characters, including pinyin abbreviations (for example: xiug手机h --> 修改手机号, "modify mobile phone number")

Without further ado, let's get started:

1. First, we need a table that maps Chinese words to their pinyin, and a table of word clicks (to record how commonly each word is used)

 

PinyinWord table

CREATE TABLE "public"."pinyinword" (
    "id" text COLLATE "default" NOT NULL,
    "word" text COLLATE "default" NOT NULL,
    "whole" text COLLATE "default" NOT NULL,
    "acronym" text COLLATE "default" NOT NULL,
    "wordlength" int4 NOT NULL,
    "wholelength" int4 NOT NULL,
    "acronymlength" int4 NOT NULL
)

 WordClick table

CREATE TABLE "public"."wordclick" (
    "wordcontent" text COLLATE "default",
    "id" text COLLATE "default" NOT NULL
)

You need to populate the tables with data yourself.
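Because the tables have to be filled by hand, it may help to see how the derived columns of pinyinword relate to each other. The following is a minimal sketch, not the author's loader: the class `PinyinWordRow` and its `of` method are hypothetical, the `id` column is left out, and the syllables are passed in explicitly instead of being looked up with a library such as pinyin4j.

```java
import java.util.List;

/** Hypothetical in-memory image of one pinyinword row (id omitted). */
final class PinyinWordRow {
    final String word;       // the Chinese word
    final String whole;      // full pinyin, e.g. "xiugai"
    final String acronym;    // first letter of each syllable, e.g. "xg"
    final int wordLength;
    final int wholeLength;
    final int acronymLength;

    private PinyinWordRow(String word, String whole, String acronym) {
        this.word = word;
        this.whole = whole;
        this.acronym = acronym;
        this.wordLength = word.length();
        this.wholeLength = whole.length();
        this.acronymLength = acronym.length();
    }

    /** Builds a row from a word and its syllables, one syllable per character. */
    static PinyinWordRow of(String word, List<String> syllables) {
        StringBuilder whole = new StringBuilder();
        StringBuilder acronym = new StringBuilder();
        for (String syllable : syllables) {
            whole.append(syllable);
            acronym.append(syllable.charAt(0));
        }
        return new PinyinWordRow(word, whole.toString(), acronym.toString());
    }
}
```

For the word 修改 with the syllables xiu and gai, this gives whole = xiugai, acronym = xg, wordlength = 2, wholelength = 6 and acronymlength = 2.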

2. Next, two data types are introduced. They play an important role in analyzing the input.

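/**
* Lexeme: one token cut from the user's input
*/
public class Lexeme {
    private String content; // token text
    private LexemeType lexemeType; // token type

    // Getters used by the code later in this article
    public String getContent() { return content; }
    public LexemeType getLexemeType() { return lexemeType; }
}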

public enum LexemeType {
    CHINESE, // Chinese character
    WHOLE, // full pinyin syllable
    ACRONYM // pinyin abbreviation (initial letter)
}

 

/**
* Chinese sentence (wraps and processes the user's input)
*/
public class ChineseSentence {
    private String content; // raw user input
    private List<Lexeme> sentenceUnits; // lexemes that make up content
    private SentenceType sentenceType; // lowest lexeme level in the sentence (no setter; assigned by initSentenceType())

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public List<Lexeme> getSentenceUnits() {
        return sentenceUnits;
    }

    public SentenceType getSentenceType() {
        return sentenceType;
    }

    public void setSentenceUnits (List <Lexeme> sentenceUnits) {
        this.sentenceUnits = sentenceUnits;
        initSentenceType();
    }

    private void initSentenceType() {
        sentenceType = SentenceType.CHINESE_SENTENCE;
        for (Lexeme lexeme : sentenceUnits) {
            if (lexeme.getLexemeType() == LexemeType.ACRONYM) {
                sentenceType = SentenceType.ACRONYM_SENTENCE;
                break;
            } else if (lexeme.getLexemeType() == LexemeType.WHOLE
                    && sentenceType == SentenceType.CHINESE_SENTENCE) {
                sentenceType = SentenceType.WHOLE_SENTENCE;
            }
        }
    }
}

 

3. The next step is to process the user input (xiug手机h): use a regular expression to split it into lexemes (Lexeme) and build a sentence from them

//Regular expression (copied from the Internet and made some modifications to identify Chinese and suspected pinyin)
private static final String SUSPECTED_PINYIN_REGEX
    = "[\\u4e00-\\u9fa5]|(sh|ch|zh|[^aoeiuv])?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?";

This regular expression may cut out pinyin combinations that do not exist, such as jvao.

Such a token is split directly into j, v, a, o (find a library of valid pinyin syllables and check whether each token that is cut out belongs to it).

After splitting the input and attaching a lexemeType to each token, we have the list of lexemes that makes up the sentence.
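The splitting step can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the class `PinyinTokenizer`, its `tokenize` method and the tiny `SYLLABLES` set are hypothetical, and tokens are returned as plain "text:TYPE" strings to keep the example self-contained. Only the regular expression is taken from the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PinyinTokenizer {
    // The pattern from the article: one Chinese character, or a suspected pinyin syllable.
    private static final Pattern SUSPECTED_PINYIN = Pattern.compile(
        "[\\u4e00-\\u9fa5]|(sh|ch|zh|[^aoeiuv])?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?");

    // Hypothetical, deliberately tiny syllable library; a real one lists every valid syllable.
    private static final Set<String> SYLLABLES =
        new HashSet<>(Arrays.asList("xiu", "gai", "shou", "ji", "hao"));

    /** Returns the tokens of the input as "text:TYPE" strings, e.g. "xiu:WHOLE". */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = SUSPECTED_PINYIN.matcher(input);
        while (matcher.find()) {
            String text = matcher.group();
            if (text.isEmpty()) {
                continue; // every part of the pattern is optional, so empty matches occur
            }
            if (text.matches("[\\u4e00-\\u9fa5]")) {
                tokens.add(text + ":CHINESE");
            } else if (SYLLABLES.contains(text)) {
                tokens.add(text + ":WHOLE");
            } else {
                // Not a known syllable (for example "jvao" or a lone "g"):
                // emit one ACRONYM token per character.
                for (char c : text.toCharArray()) {
                    tokens.add(c + ":ACRONYM");
                }
            }
        }
        return tokens;
    }
}
```

With this sketch, xiug手机h is split into xiu (WHOLE), g (ACRONYM), 手 and 机 (CHINESE) and h (ACRONYM), while jvao falls through to four single-letter ACRONYM tokens. Spaces, digits and uppercase letters are not treated specially here.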

4. The next step is to analyze the lexemes in the sentence one by one

First, a brief explanation of how the analysis works, starting with the query conditions.

The first query parameter is lexemeType. It specifies the lexeme level used for the search, which must be the lowest lexeme level found in the sentence.

For example: '修改' --> LexemeType.CHINESE

              'xiu改'  --> LexemeType.WHOLE

              '修g'  --> LexemeType.ACRONYM
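The "lowest level" rule in these examples can be sketched as a fold over the lexeme types. The enum `LexemeLevel` below is a hypothetical stand-in for LexemeType, and `lower` is only a guess at what a helper like LexemeType.changeDown does; it assumes the constants are declared from highest to lowest level.

```java
import java.util.List;

/** Hypothetical copy of the three lexeme types, ordered from highest to lowest level. */
enum LexemeLevel {
    CHINESE, WHOLE, ACRONYM;

    /** Returns whichever of the two levels is lower; null means "nothing seen yet". */
    static LexemeLevel lower(LexemeLevel soFar, LexemeLevel current) {
        if (soFar == null || current.ordinal() > soFar.ordinal()) {
            return current;
        }
        return soFar;
    }

    /** Folds the levels of all lexemes down to the level the search must use. */
    static LexemeLevel lowestOf(List<LexemeLevel> levels) {
        LexemeLevel lowest = null;
        for (LexemeLevel level : levels) {
            lowest = lower(lowest, level);
        }
        return lowest;
    }
}
```

For 'xiu改' the levels are WHOLE then CHINESE, so the result is WHOLE; for '修g' they are CHINESE then ACRONYM, so the result is ACRONYM.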

The three parameters whose names end in Search (chineseSearch, wholeSearch and acronymSearch) are the search terms. They are used as prefix patterns in SQL like operations, so that an index can be used

and pinyinword.acronym like #{acronymSearch} || '%'

Since the user's input may contain Chinese characters or full pinyin, the results also need to be filtered on these two types

For example, if the user enters '修g', we search at the lowest lexeme level with like 'xg%'. This may also find '鞋柜' (xiegui, "shoe cabinet"), so I use chineseFilter and pinyinFilter to filter the results (here chineseFilter becomes '%修%'), and the query condition becomes

and pinyinword.acronym like #{acronymSearch} || '%'
and pinyinword.word like #{chineseFilter}

This way '鞋柜' is no longer found
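The effect of the extra filter can be reproduced in memory. The sketch below is not how the database evaluates the query; `LikeFilterDemo`, `like` and `search` are hypothetical names, and `like` only emulates SQL LIKE for patterns that use the % wildcard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class LikeFilterDemo {
    /** Emulates SQL LIKE for patterns that use only the % wildcard (no _ and no escapes). */
    static boolean like(String value, String pattern) {
        String[] parts = pattern.split("%", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each % matches any run of characters
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(value).matches();
    }

    /** Each row is {word, acronym}; chineseFilter may be null to skip the filter. */
    static List<String> search(List<String[]> rows, String acronymSearch, String chineseFilter) {
        List<String> hits = new ArrayList<>();
        for (String[] row : rows) {
            boolean prefixHit = like(row[1], acronymSearch + "%");
            boolean filterHit = chineseFilter == null || like(row[0], chineseFilter);
            if (prefixHit && filterHit) {
                hits.add(row[0]);
            }
        }
        return hits;
    }
}
```

With the rows 修改 (acronym xg) and 鞋柜 (acronym xg), searching for xg alone returns both words, while adding the filter '%修%' leaves only 修改.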

Let's take a look at the MyBatis SQL. It joins the wordclick table to get the click count of each word, which is used for sorting

<select id="searchByClickCount" resultType="model.value.WordClickCount" parameterType="model.options.PinyinWordAnalyzeSearchOptions">
        select
        pw.word word, count(wc.id) clickCount
        from
        PinyinWord pw left join wordclick wc on wc.wordcontent = pw.word
        where 1=1
        <choose>
            <when test="lexemeType.equals('CHINESE')">
                <if test="chineseSearch!=null">
                    and pw.word like #{chineseSearch} || '%'
                </if>
                group by pw.word
                order by clickCount desc
                <if test="paging">
                    limit 5 offset 0
                </if>
            </when>
            <when test="lexemeType.equals('WHOLE')">
                <if test="wholeSearch!=null">
                    and pw.whole like #{wholeSearch} || '%'
                </if>
                <if test="chineseFilter!=null">
                    and pw.word like #{chineseFilter}
                </if>
                group by pw.word
                order by clickCount desc
                <if test="paging">
                    limit 5 offset 0
                </if>
            </when>
            <otherwise>
                <if test="acronymSearch!=null">
                    and pw.acronym like #{acronymSearch} || '%'
                </if>
                <if test="chineseFilter!=null">
                    and pw.word like #{chineseFilter}
                </if>
                <if test="pinyinFilter!=null">
                    and pw.whole like #{pinyinFilter}
                </if>
                group by pw.word
                order by clickCount desc
                <if test="paging">
                    limit 5 offset 0
                </if>
            </otherwise>
        </choose>
    </select>

The next step is to analyze the user's input and generate the query conditions.

This is only part of the code, but it is enough to understand how the query conditions are generated

        // Accumulated conditions. The two filters are assumed to start with "%",
        // so that a filter nobody appended to still matches every row.
        StringBuilder chineseSearch = new StringBuilder();
        StringBuilder wholeSearch = new StringBuilder();
        StringBuilder acronymSearch = new StringBuilder();
        StringBuilder chineseFilter = new StringBuilder("%");
        StringBuilder pinyinFilter = new StringBuilder("%");

        LexemeType currentLexemeType; // type of the current lexeme
        LexemeType lastLexemeType = null; // lowest lexeme type seen so far
        List<Lexeme> lexemes = sentence.getSentenceUnits(); // lexemes of the sentence, in input order
        for (int i = 0; i < lexemes.size(); i++) {
            Lexeme lexeme = lexemes.get(i);
            currentLexemeType = lexeme.getLexemeType();

            String content = lexeme.getContent();
            switch (currentLexemeType) {
                case CHINESE: // the current lexeme is a Chinese character
                    String pinyin = convertSmartAll(content); // convert to pinyin (pinyin4j)
                    chineseSearch.append(content); // append to chineseSearch
                    wholeSearch.append(pinyin); // append to wholeSearch
                    acronymSearch.append(pinyin.charAt(0)); // append the initial to acronymSearch
                    chineseFilter.append(content).append("%"); // append to chineseFilter
                    break;
                case WHOLE: // full pinyin: handled like Chinese, without the Chinese fields
                    wholeSearch.append(content);
                    acronymSearch.append(content.charAt(0));
                    pinyinFilter.append(content).append("%");
                    break;
                case ACRONYM: // abbreviation: only the acronym search grows
                    acronymSearch.append(content);
                    break;
            }
            // Keep the lower of lastLexemeType and currentLexemeType, because the search must use the lowest lexeme level
            lastLexemeType = LexemeType.changeDown(lastLexemeType, currentLexemeType);
            // Build the search options from the conditions accumulated so far
            PinyinWordAnalyzeSearchOptions options = new PinyinWordAnalyzeSearchOptions(
                    chineseSearch.toString(), wholeSearch.toString(), acronymSearch.toString(),
                    chineseFilter.toString(), pinyinFilter.toString(), lastLexemeType
            );
            // Run the query for the input analyzed so far
            List<WordClickCount> wordClickCounts = mapper.searchByClickCount(options);
            // ... (the rest of the loop body is not shown)
        }
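To make the accumulation easier to follow without a database or pinyin4j, here is a self-contained sketch of the same idea. Everything in it is an assumption made for illustration: the names `SearchConditionSketch`, `Token`, `Conditions` and `build` are hypothetical, the `toPinyin` function stands in for the pinyin4j-based conversion, and both filters are assumed to start as "%".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class SearchConditionSketch {
    enum Type { CHINESE, WHOLE, ACRONYM } // ordered from highest to lowest level

    static final class Token {
        final String content;
        final Type type;
        Token(String content, Type type) { this.content = content; this.type = type; }
    }

    /** Snapshot of the accumulated conditions after one more token. */
    static final class Conditions {
        final String chineseSearch, wholeSearch, acronymSearch, chineseFilter, pinyinFilter;
        final Type lowest;
        Conditions(String chineseSearch, String wholeSearch, String acronymSearch,
                   String chineseFilter, String pinyinFilter, Type lowest) {
            this.chineseSearch = chineseSearch;
            this.wholeSearch = wholeSearch;
            this.acronymSearch = acronymSearch;
            this.chineseFilter = chineseFilter;
            this.pinyinFilter = pinyinFilter;
            this.lowest = lowest;
        }
    }

    /** toPinyin maps one Chinese character to its syllable (stand-in for pinyin4j). */
    static List<Conditions> build(List<Token> tokens, Function<String, String> toPinyin) {
        StringBuilder chineseSearch = new StringBuilder();
        StringBuilder wholeSearch = new StringBuilder();
        StringBuilder acronymSearch = new StringBuilder();
        StringBuilder chineseFilter = new StringBuilder("%"); // assumed initial value
        StringBuilder pinyinFilter = new StringBuilder("%");  // assumed initial value
        Type lowest = null;
        List<Conditions> snapshots = new ArrayList<>();
        for (Token token : tokens) {
            switch (token.type) {
                case CHINESE:
                    String pinyin = toPinyin.apply(token.content);
                    chineseSearch.append(token.content);
                    wholeSearch.append(pinyin);
                    acronymSearch.append(pinyin.charAt(0));
                    chineseFilter.append(token.content).append("%");
                    break;
                case WHOLE:
                    wholeSearch.append(token.content);
                    acronymSearch.append(token.content.charAt(0));
                    pinyinFilter.append(token.content).append("%");
                    break;
                default: // ACRONYM
                    acronymSearch.append(token.content);
                    break;
            }
            if (lowest == null || token.type.ordinal() > lowest.ordinal()) {
                lowest = token.type; // remember the lowest level seen so far
            }
            snapshots.add(new Conditions(chineseSearch.toString(), wholeSearch.toString(),
                    acronymSearch.toString(), chineseFilter.toString(), pinyinFilter.toString(), lowest));
        }
        return snapshots;
    }
}
```

For the input '修g', the first snapshot searches at the CHINESE level with chineseSearch = 修, and the second one searches at the ACRONYM level with acronymSearch = xg and chineseFilter = %修%, which matches the example discussed earlier.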

 

Let's test it:

@Test
    public void analyzeAndSearchTest() throws Exception {
        List<List<WordClickCount>> results = pinyinWordService.analyzeSearch("xiugaishoujihao"); // warm-up call to initialize pinyin4j
        long start1 = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            long start = System.currentTimeMillis();
            List<List<WordClickCount>> results1 = pinyinWordService.analyzeSearch("xiugais机haoqyxgai修改");
            long end = System.currentTimeMillis();
            System.out.println(end - start + " ms");
        }
        long end1 = System.currentTimeMillis();
        System.out.println(end1 - start1 + " ms");
    }

 Test Results

The test analyzes the same 11-lexeme user input 100 times. The total cost is 22.1 seconds, an average of 221 ms per call, which is not bad

 

Future optimization:

The SQL uses a table join and a count operation. When the amount of data is relatively large, consider adding a click-count field to pinyinword and updating it from wordclick once a day, so that the query only needs the pinyinword table.
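The aggregation behind such a daily update can be sketched in a few lines. This is only an illustration of the counting step: `ClickCountRollup` and `countByWord` are hypothetical names, the input list stands for the wordcontent values read from wordclick, and writing the counts back to a new pinyinword column is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClickCountRollup {
    /** Counts clicks per word; each list entry is the wordcontent of one wordclick row. */
    static Map<String, Long> countByWord(List<String> clickedWords) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : clickedWords) {
            counts.merge(word, 1L, Long::sum); // start at 1, then add 1 per further click
        }
        return counts;
    }
}
```

Each entry of the resulting map would then become one update of the precomputed count, so the search query no longer has to join and count at request time.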

 

