I recently needed pinyin search for my work, so I put together an implementation based on examples found online, and I'll share it here.
The code recognizes mixed input of Chinese characters and pinyin, including pinyin acronyms (for example: xiug手机h --> 修改手机号, i.e. "change mobile number").
Without further ado, let's get started:
1. First we need a table mapping Chinese words to their pinyin, plus a table of word clicks (recording how frequently each word is chosen).
PinyinWord table
CREATE TABLE "public"."pinyinword" (
  "id" text COLLATE "default" NOT NULL,
  "word" text COLLATE "default" NOT NULL,
  "whole" text COLLATE "default" NOT NULL,
  "acronym" text COLLATE "default" NOT NULL,
  "wordlength" int4 NOT NULL,
  "wholelength" int4 NOT NULL,
  "acronymlength" int4 NOT NULL
)
WordClick table
CREATE TABLE "public"."wordclick" (
  "wordcontent" text COLLATE "default",
  "id" text COLLATE "default" NOT NULL
)
Populate both tables with your own data.
2. Next come two data types that play an important role in analyzing the input.
/**
 * Word element (lexeme)
 */
public class Lexeme {
    private String content;        // letter content
    private LexemeType lexemeType; // lexeme type
}

public enum LexemeType {
    CHINESE, // Chinese character
    WHOLE,   // full pinyin syllable
    ACRONYM  // pinyin initial (acronym)
}
/**
 * Chinese sentence (wraps and classifies the user input)
 */
public class ChineseSentence {
    private String content;             // raw user input
    private List<Lexeme> sentenceUnits; // lexemes contained in content
    private SentenceType sentenceType;  // lowest lexeme level in the sentence (no setter; assigned in initSentenceType())

    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
    public List<Lexeme> getSentenceUnits() { return sentenceUnits; }
    public SentenceType getSentenceType() { return sentenceType; }

    public void setSentenceUnits(List<Lexeme> sentenceUnits) {
        this.sentenceUnits = sentenceUnits;
        initSentenceType();
    }

    private void initSentenceType() {
        sentenceType = SentenceType.CHINESE_SENTENCE;
        for (Lexeme lexeme : sentenceUnits) {
            if (lexeme.getLexemeType() == LexemeType.ACRONYM) {
                sentenceType = SentenceType.ACRONYM_SENTENCE;
                break;
            } else if (lexeme.getLexemeType() == LexemeType.WHOLE
                    && sentenceType == SentenceType.CHINESE_SENTENCE) {
                sentenceType = SentenceType.WHOLE_SENTENCE;
            }
        }
    }
}
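The SentenceType enum and the LexemeType.changeDown helper used later are never shown in the post. A minimal sketch that is consistent with initSentenceType() and the analysis loop might look like this (the ordinal-based ordering logic is my assumption, not the original code):

```java
// Sketch only: the original post does not show SentenceType or changeDown,
// so the ordering logic below is an assumption consistent with their usage.
enum SentenceType {
    CHINESE_SENTENCE, // all lexemes are Chinese characters
    WHOLE_SENTENCE,   // contains full-pinyin lexemes but no acronyms
    ACRONYM_SENTENCE  // contains at least one acronym lexeme
}

enum LexemeType {
    CHINESE, WHOLE, ACRONYM; // declared from highest level to lowest

    // Return the lower (less specific) of the accumulated level and the
    // current lexeme's level, so mixed input is searched at its weakest level.
    static LexemeType changeDown(LexemeType last, LexemeType current) {
        if (last == null) return current;
        return current.ordinal() > last.ordinal() ? current : last;
    }
}
```

With this ordering, a sentence containing even one acronym letter is searched at the ACRONYM level, which matches the behavior described in step 4.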
3. The next step processes the user input (e.g. xiug手机h), using a regular expression to decompose it into lexemes (Lexeme) and build a sentence.
// Regular expression (adapted from an online example; matches a Chinese character or a suspected pinyin syllable)
private static final String SUSPECTED_PINYIN_REGEX =
    "[\\u4e00-\\u9fa5]|(sh|ch|zh|[^aoeiuv])?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?";
Note that this regex can produce letter combinations that are not legal pinyin, such as jvao. Such a token is split into the single letters j, v, a, o; to decide, each candidate syllable is checked against a table of legal pinyin combinations, and it is kept whole only if it belongs to the table.
After splitting the input and attaching a LexemeType to each token, we get the following result.
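The decomposition step can be sketched as follows. This is a simplified standalone version: the valid-syllable set here is a tiny illustrative subset of a real pinyin table, and the regex is the one above with the formatting spaces removed.

```java
import java.util.*;
import java.util.regex.*;

// Simplified sketch of the tokenization step. VALID_SYLLABLES is an
// illustrative subset; a real implementation would load the full table
// of legal pinyin syllables.
class PinyinTokenizer {
    static final Pattern SUSPECTED_PINYIN = Pattern.compile(
        "[\\u4e00-\\u9fa5]|(sh|ch|zh|[^aoeiuv])?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?");

    static final Set<String> VALID_SYLLABLES = new HashSet<>(Arrays.asList(
        "xiu", "gai", "shou", "ji", "hao"));

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = SUSPECTED_PINYIN.matcher(input);
        while (m.find()) {
            String token = m.group();
            if (token.isEmpty()) continue;      // every part is optional, so empty matches occur
            if (token.matches("[\\u4e00-\\u9fa5]")) {
                tokens.add(token);              // a Chinese character
            } else if (token.length() == 1 || VALID_SYLLABLES.contains(token)) {
                tokens.add(token);              // a legal syllable or a single acronym letter
            } else {
                for (char c : token.toCharArray()) {
                    tokens.add(String.valueOf(c)); // e.g. "jvao" -> j, v, a, o
                }
            }
        }
        return tokens;
    }
}
```

For example, tokenize("xiug") yields [xiu, g], while tokenize("jvao") falls back to [j, v, a, o] because jvao is not in the syllable table.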
4. The next step is to analyze the words in the sentence one by one
First, a brief explanation of how the analysis works, starting with the query conditions.
A note on the query parameters. The first is lexemeType, which specifies the word level to search at; it must match the lowest lexeme level in the sentence.
For example: '修改' --> LexemeType.CHINESE
'xiu改' --> LexemeType.WHOLE
'修g' --> LexemeType.ACRONYM
The three *Search parameters are used in SQL like clauses with a trailing wildcard, so the index can still be hit:
and pinyinword.acronym like #{acronymSearch} || '%'
Since the sentences entered by the user may mix Chinese and pinyin, two extra filter conditions are needed.
For example, if the user enters '修g', we search at the lowest word level with like 'xg%'. That could also match 鞋柜 ("shoe cabinet", whose acronym is also xg), so I use chineseFilter and pinyinFilter to narrow the results (here chineseFilter becomes '修%'), and the query condition becomes
and pinyinword.acronym like #{acronymSearch} || '%' and pinyinword.word like #{chineseFilter}
This way 鞋柜 is no longer matched.
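The construction of these like patterns for the '修g' example can be sketched like this (the class name is illustrative, and the Chinese-to-pinyin step is hard-coded instead of calling pinyin4j, to keep the example self-contained):

```java
// Sketch of query-condition generation for the input 修g.
// "修" -> "xiu" is hard-coded here; the real code calls pinyin4j.
class SearchOptionsDemo {
    static String[] build() {
        StringBuilder acronymSearch = new StringBuilder();
        StringBuilder chineseFilter = new StringBuilder();

        // Lexeme 1: the Chinese character 修
        String content = "修";
        String pinyin = "xiu";                      // convertSmartAll(content)
        acronymSearch.append(pinyin.charAt(0));     // acronymSearch = "x"
        chineseFilter.append(content).append("%");  // chineseFilter = "修%"

        // Lexeme 2: the acronym letter g
        acronymSearch.append("g");                  // acronymSearch = "xg"

        // In SQL this becomes: acronym like 'xg%' and word like '修%'
        return new String[] { acronymSearch + "%", chineseFilter.toString() };
    }
}
```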
Now let's look at the MyBatis SQL. The wordclick table is joined so that each word's click count can be used for sorting:
<select id="searchByClickCount" resultType="model.value.WordClickCount"
        parameterType="model.options.PinyinWordAnalyzeSearchOptions">
  select pw.word word, count(wc.id) clickCount
  from PinyinWord pw
  left join wordclick wc on wc.wordcontent = pw.word
  where 1=1
  <choose>
    <when test="lexemeType.equals('CHINESE')">
      <if test="chineseSearch!=null">
        and pw.word like #{chineseSearch} || '%'
      </if>
      group by pw.word
      order by clickCount desc
      <if test="paging"> limit 5 offset 0 </if>
    </when>
    <when test="lexemeType.equals('WHOLE')">
      <if test="wholeSearch!=null">
        and pw.whole like #{wholeSearch} || '%'
      </if>
      <if test="chineseFilter!=null">
        and pw.word like #{chineseFilter}
      </if>
      group by pw.word
      order by clickCount desc
      <if test="paging"> limit 5 offset 0 </if>
    </when>
    <otherwise>
      <if test="acronymSearch!=null">
        and pw.acronym like #{acronymSearch} || '%'
      </if>
      <if test="chineseFilter!=null">
        and pw.word like #{chineseFilter}
      </if>
      <if test="pinyinFilter!=null">
        and pw.whole like #{pinyinFilter}
      </if>
      group by pw.word
      order by clickCount desc
      <if test="paging"> limit 5 offset 0 </if>
    </otherwise>
  </choose>
</select>
Then the user's input is analyzed to generate the query conditions. The following excerpt is enough to show how they are built:
LexemeType currentLexemeType;     // type of the current lexeme
LexemeType lastLexemeType = null; // lowest lexeme level accumulated so far
List<Lexeme> lexemes = sentence.getSentenceUnits();

for (int i = 0; i < lexemes.size(); i++) {
    Lexeme lexeme = lexemes.get(i);
    currentLexemeType = lexeme.getLexemeType();
    String content = lexeme.getContent();
    switch (currentLexemeType) {
        case CHINESE: // the current lexeme is a Chinese character
            String pinyin = convertSmartAll(content);  // convert to pinyin (pinyin4j)
            chineseSearch.append(content);             // append to chineseSearch
            wholeSearch.append(pinyin);                // append to wholeSearch
            acronymSearch.append(pinyin.charAt(0));    // append to acronymSearch
            chineseFilter.append(content).append("%"); // append to chineseFilter
            break;
        case WHOLE: // a full pinyin syllable, handled analogously
            wholeSearch.append(content);
            acronymSearch.append(content.charAt(0));
            pinyinFilter.append(content).append("%");
            break;
        case ACRONYM: // a single pinyin initial
            acronymSearch.append(content);
            break;
    }
    // Lower lastLexemeType to the weaker of itself and the current type,
    // because the search must run at the lowest lexeme level
    lastLexemeType = LexemeType.changeDown(lastLexemeType, currentLexemeType);
}

// build the search options
PinyinWordAnalyzeSearchOptions options = new PinyinWordAnalyzeSearchOptions(
    chineseSearch.toString(),
    wholeSearch.toString(),
    acronymSearch.toString(),
    chineseFilter.toString(),
    pinyinFilter.toString(),
    lastLexemeType
);

// and the result is out
List<WordClickCount> wordClickCounts = mapper.searchByClickCount(options);
Let's run a quick test:
@Test
public void analyzeAndSearchTest() throws Exception {
    // warm-up call to initialize pinyin4j
    List<List<WordClickCount>> results = pinyinWordService.analyzeSearch("xiugaishoujihao");

    long start1 = System.currentTimeMillis();
    for (int i = 0; i < 100; i++) {
        long start = System.currentTimeMillis();
        List<List<WordClickCount>> results1 =
            pinyinWordService.analyzeSearch("xiugais机haoqyxgai修改");
        long end = System.currentTimeMillis();
        System.out.println(end - start + " ms");
    }
    long end1 = System.currentTimeMillis();
    System.out.println(end1 - start1 + " ms");
}
Test Results
The test ran 100 searches over an input that decomposes into 11 lexemes; the total cost was 22.1 seconds, about 221 ms per call, which is acceptable.
Future optimization:
The SQL uses a join and a count aggregation. When the data volume grows, consider adding a click-count column to pinyinword and updating it from wordclick once a day, so that the search only queries the single pinyinword table.
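A sketch of that denormalization (the clickcount column name and the daily update statement are illustrative, not from the original project):

```sql
-- Hypothetical: add a denormalized click counter to pinyinword
ALTER TABLE "public"."pinyinword" ADD COLUMN "clickcount" int4 NOT NULL DEFAULT 0;

-- Run once a day: fold wordclick totals into pinyinword,
-- after which searches can order by pw.clickcount without the join
UPDATE "public"."pinyinword" pw
SET "clickcount" = (
  SELECT count(wc.id)
  FROM "public"."wordclick" wc
  WHERE wc.wordcontent = pw.word
);
```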