Java implements sensitive word filtering

[Transfer] http://blog.csdn.net/chenssy/article/details/26961957

The original text is from: http://cmsblogs.com/?p=1031. Respect the author's achievements, please indicate the source when reprinting! --

-- Personal site: http://cmsblogs.com      

At the same time, there is no state transition, no action, and some are just Query (search). We can think that through S query U, V, through U query V, P, through V query UP. With such a transition we can turn state transitions into lookups using Java collections.
It is true that the following sensitive words have been added to our sensitive thesaurus: Japanese, Japanese devils, Mao Zedong. So what kind of structure do I need to build into?
First: query day--->{ben}, query this--->{person, devil}, query person--->{null}, query ghost--->{child}. The structure is as follows:
In this way, we build our sensitive thesaurus into a tree similar to one by one, so that when we judge whether a word is a sensitive word, the matching range of the search is greatly reduced. For example, if we want to judge the Japanese, we can confirm that the tree that needs to be retrieved is based on the first word, and then retrieve it in this tree.
         But how to tell if a sensitive word has ended? Use the flag to determine.
         So the key to this is how to build such a sensitive word tree. Below I have used HashMap in Java as an example to implement the DFA algorithm. The specific process is as follows:
Japanese, Japanese devils as an example
         1. Query "Day" in the hashMap to see if it exists in the hashMap. If it does not exist, it proves that the sensitive word starting with "Day" does not exist yet, so we directly build such a tree. Skip to 3.
         2. If it is found in the hashMap, it indicates that there are sensitive words starting with "day", set hashMap = hashMap.get("day"), skip to 1, and match "ben" and "person" in turn.
         3. Determine whether the word is the last word in the word. If it indicates the end of the sensitive word, set the flag bit isEnd = 1, otherwise set the flag bit isEnd = 0;

  we have implemented a simple method for the sensitive word database, so how to realize the retrieval? The retrieval process is nothing more than the get implementation of hashMap, and if found, it proves that the word is a sensitive word, otherwise it is not a sensitive word. The process is as follows: Suppose we match "Long Live the Chinese People".
         1. The first word "中" can be found in hashMap. Get a new map = hashMap.get("").
         2. If map == null, it is not a sensitive word. Otherwise
         , skip to 3 3. Get isEnd in the map, and judge whether the word is the last one by whether isEnd is equal to 1. If isEnd == 1, the word is a sensitive word, otherwise skip to 1.
         Through this step, we can judge that "Chinese people" is a sensitive word, but if we enter "Chinese woman", it is not a sensitive word.

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=326449760&siteId=291194637