Sensitive word filtering based on the DFA algorithm

This post is adapted from "Analysis of sensitive word filtering algorithms (C++)", linked below. I made a few modifications on top of it to suit my own situation.

https://blog.csdn.net/u012755940/article/details/51689401?utm_source=app

To improve lookup efficiency, the sensitive words are stored in a tree structure. Each node has a map member that maps a string to the corresponding WordNode. 
For example, if the sensitive word list contains words such as "shooter" and "pistol", reading them in produces the tree structure shown in the figure. 
[Figure: tree structure built from the sensitive word list]

std::map uses operator< both to decide whether two keys are the same and to order them, and then inserts the element at the appropriate position in its internal tree. 
The WordNode class below mainly implements node insertion and lookup.

WordNode.h

#ifndef __WORDNODE_H__
#define __WORDNODE_H__

#define PACE       1

#include <string>
#include <map>
#include <stdio.h>

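// One node of the keyword tree. Each node holds one PACE-sized piece of a
// keyword and a map from the next piece to its child node.
class CWordNode
{
public:
    CWordNode(std::string character);
    CWordNode() : m_parent(NULL) {}
    ~CWordNode();
    std::string getCharacter() const { return m_character; }
    // Returns the child for nextCharacter, or NULL if there is none.
    CWordNode* findChild(std::string& nextCharacter);
    // Adds a child for nextCharacter; returns NULL if it already exists.
    CWordNode* insertChild(std::string& nextCharacter);
private:
    friend class CWordTree;
    typedef std::map<std::string, CWordNode> _TreeMap;
    typedef std::map<std::string, CWordNode>::iterator _TreeMapIterator;

    std::string m_character;
    _TreeMap m_map;
    CWordNode* m_parent;   // never assigned by this code
};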

#endif

WordNode.cpp

#include "WordNode.h"

using namespace std;

CWordNode::~CWordNode()
{

}

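CWordNode::CWordNode(std::string character)
    : m_parent(NULL)
{
    // Only a piece of exactly PACE bytes is stored; anything else leaves
    // m_character empty.
    if (character.size() == PACE)
    {
        m_character.assign(character);
    }
}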


CWordNode* CWordNode::findChild(std::string& nextCharacter)
{
    _TreeMapIterator TreeMapIt = m_map.find(nextCharacter);
    if (TreeMapIt == m_map.end())
    {
        return NULL;
    }
    else
    {
        return &TreeMapIt->second;
    }
}

CWordNode* CWordNode::insertChild(std::string& nextCharacter)
{
    // map::insert reports whether the key was new, so one lookup is enough.
    pair<_TreeMapIterator, bool> result =
        m_map.insert(pair<std::string, CWordNode>(nextCharacter, CWordNode(nextCharacter)));
    if (result.second)
    {
        return &result.first->second;
    }
    return NULL;   // the child already existed
}

In addition, note this line:

#define PACE 1

In the original article PACE was 2, because a Chinese character in GBK occupies two bytes. The original article also says that PACE should be changed to 1 if mixed Chinese and English text, or pure English text, has to be handled. 
After trying it, I found that PACE 1 works whether the text is Chinese, English, or a mix of both, and the results are correct. The only difference is that for Chinese text the string stored in each node is no longer a complete character but a single byte of one.
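
To make the effect of PACE concrete, here is a minimal sketch. The helper name splitByPace is hypothetical and not part of the original code; it simply cuts a string into PACE-sized pieces the same way the repeated substr(0, PACE) calls in the tree code do. It assumes the input is UTF-8, where a common Chinese character takes three bytes rather than the two bytes of GBK.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the original code): split `word` into chunks
// of `pace` bytes, mirroring substr(0, PACE) / substr(PACE) in the tree code.
std::vector<std::string> splitByPace(const std::string& word, std::size_t pace)
{
    std::vector<std::string> pieces;
    if (pace == 0)
    {
        return pieces;                            // avoid an endless loop
    }
    for (std::size_t i = 0; i < word.size(); i += pace)
    {
        pieces.push_back(word.substr(i, pace));   // the last piece may be shorter
    }
    return pieces;
}
```

With a pace of 1, the UTF-8 bytes of one Chinese character become three single-byte pieces, and "gun" becomes three pieces as well. Matching still works because keywords and input text are split in exactly the same way.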

Next, implement the tree. WordNodes are always created under a parent, and the first parent is the root node m_emptyRoot; keywords are then added to the tree by the following rule. Suppose m_emptyRoot starts out empty and the keyword is "敏感词" ("sensitive word"): a branch '敏' -> '感' -> '词' is created. If "敏感度" ("sensitivity") is added next, its first two characters are the same, so on top of the existing '敏' -> '感' -> '词' a new branch grows from '感', namely '敏' -> '感' -> '度'. The two branches share '敏' -> '感'.
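
The prefix sharing described above can be sketched with a stripped-down node type. SketchNode, sketchInsert and sketchCount are illustrative names, not part of the original classes, and the sketch keys children by a single byte instead of a std::string. Like CWordNode, it stores child nodes by value inside a std::map.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Simplified stand-in for CWordNode: one child per next byte.
struct SketchNode
{
    std::map<char, SketchNode> children;
};

// Walk down from `root`, creating missing children. Existing children are
// reused, which is what makes two keywords share their common prefix.
void sketchInsert(SketchNode& root, const std::string& word)
{
    SketchNode* node = &root;
    for (std::size_t i = 0; i < word.size(); ++i)
    {
        node = &node->children[word[i]];   // operator[] creates the child if absent
    }
}

// Count every node below `node` (the node itself is not counted).
std::size_t sketchCount(const SketchNode& node)
{
    std::size_t total = 0;
    for (std::map<char, SketchNode>::const_iterator it = node.children.begin();
         it != node.children.end(); ++it)
    {
        total += 1 + sketchCount(it->second);
    }
    return total;
}
```

Inserting "abc" creates three nodes; inserting "abd" afterwards adds only one more, because 'a' -> 'b' is already there.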

The following code implements the WordTree class, which builds and queries the tree.

WordTree.h

#ifndef __WORDTREE_H__
#define __WORDTREE_H__

#include "WordNode.h"

class CWordTree
{
public:
    CWordTree();
    ~CWordTree();

    int nCount;   // after a successful find(): matched length in PACE units, minus 1
    CWordNode* insert(std::string &keyWord);
    CWordNode* insert(const char* keyword);
    CWordNode* find(std::string& keyword);
private:
    CWordNode m_emptyRoot;
    int m_pace;
    CWordNode* insert(CWordNode* parent, std::string& keyword);
    CWordNode* insertBranch(CWordNode* parent, std::string& keyword);
    CWordNode* find(CWordNode* parent, std::string& keyword);
};

#endif // __WORDTREE_H__

WordTree.cpp

#include "WordTree.h"

CWordTree::CWordTree()
:nCount(0)
,m_pace(PACE)
{

}

CWordTree::~CWordTree()
{
}

CWordNode* CWordTree::insert(std::string &keyWord)
{
    return insert(&m_emptyRoot, keyWord);
}

CWordNode* CWordTree::insert(const char* keyWord)
{
    std::string wordstr(keyWord);
    return insert(wordstr);
}

CWordNode* CWordTree::insert(CWordNode* parent, std::string& keyWord)
{
    if (keyWord.size() == 0)
    {
        return NULL;
    }
    std::string firstChar = keyWord.substr(0, PACE);
    CWordNode* firstNode = parent->findChild(firstChar);
    if (firstNode == NULL)
    {
        return insertBranch(parent, keyWord);
    }
    std::string restChar = keyWord.substr(PACE, keyWord.size());
    return insert(firstNode, restChar);
}

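CWordNode* CWordTree::find(std::string& keyWord)
{
    // Reset the counter so that an earlier successful find() cannot leave
    // a stale match length behind.
    nCount = 0;
    return find(&m_emptyRoot, keyWord);
}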

CWordNode* CWordTree::find(CWordNode* parent, std::string& keyWord)
{
    std::string firstChar = keyWord.substr(0, PACE);
    CWordNode* firstNode = parent->findChild(firstChar);
    if (firstNode == NULL)
    {
        // No keyword continues with this piece.
        nCount = 0;
        return NULL;
    }
    std::string restChar = keyWord.substr(PACE, keyWord.size());
    if (firstNode->m_map.empty())
    {
        // Reached a leaf: a complete keyword has been matched.
        return firstNode;
    }
    if (keyWord.size() == PACE)
    {
        // The input ended before a leaf was reached.
        return NULL;
    }
    nCount++;
    return find(firstNode, restChar);
}

CWordNode* CWordTree::insertBranch(CWordNode* parent, std::string& keyWord)
{
    std::string firstChar = keyWord.substr(0, PACE);
    CWordNode* firstNode = parent->insertChild(firstChar);
    if (firstNode != NULL)
    {
        std::string restChar = keyWord.substr(PACE, keyWord.size());
        if (!restChar.empty())
        {
            return insertBranch(firstNode, restChar);
        }
    }
    return NULL;
}

Finally, the tree is used to implement the sensitive word filter. CWordFilter::censorStr(std::string &source) is the filtering function: given the input string source, if it contains a sensitive word, the word is replaced with "**".

CWordFilter::loadFile(const char* filepath) loads the sensitive words from a file and builds the WordTree. Here I use a txt file.

The WordFilter class is implemented below.

WordFilter.h

#ifndef __WORDFILTER_H__
#define __WORDFILTER_H__

#include "WordTree.h"
#include "base/CCRef.h"

USING_NS_CC;

class CWordFilter : public Ref
{
public:
    ~CWordFilter();
    bool loadFile(const char* filepath);
    bool censorStr(std::string &source);
    bool censorStrWithOutSymbol(const std::string &source);
    static CWordFilter* getInstance();
    static void release();
private:
    std::string string_To_UTF8(const std::string & str);
    std::string UTF8_To_string(const std::string & str);
    CWordFilter();
    static CWordFilter* m_pInstance;
    CWordTree m_WordTree;
};



#endif // __WORDFILTER_H__

WordFilter.cpp

#include "WordFilter.h"
#include <ctype.h>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <istream>

using namespace std;

USING_NS_CC;

CWordFilter* CWordFilter::m_pInstance = nullptr;
CWordFilter::CWordFilter()
{
}

CWordFilter::~CWordFilter()
{
}

CWordFilter* CWordFilter::getInstance()
{
    if (m_pInstance == NULL)
    {
        m_pInstance = new CWordFilter();
    }
    return m_pInstance;
}

void CWordFilter::release()
{
    if (m_pInstance)
    {
        delete m_pInstance;
    }
    m_pInstance = NULL;
}

bool CWordFilter::loadFile(const char* filepath)
{
    ifstream infile(filepath, ios::in);

    if (!infile)
    {
        return false;
    }

    // One keyword per line.
    string read;
    while (getline(infile, read))
    {
#if (CC_TARGET_PLATFORM == CC_PLATFORM_ANDROID || CC_TARGET_PLATFORM == CC_PLATFORM_IOS)
        // On these platforms each line comes back with one extra trailing
        // character; drop it before inserting.
        if (!read.empty())
        {
            read.erase(read.size() - 1);
        }
#endif
        m_WordTree.insert(read);
    }

    infile.close();
    return true;
}

bool CWordFilter::censorStr(string &source)
{
    int length = source.size();
    for (int i = 0; i < length; i += 1)
    {
        // Try to match a keyword starting at byte i.
        string substring = source.substr(i, length - i);
        if (m_WordTree.find(substring) != NULL)
        {
            // nCount + 1 is the matched length in bytes (PACE is 1).
            source.replace(i, (m_WordTree.nCount + 1), "**");
            // Only the first match is replaced; the function returns at once.
            return true;
        }
    }
    return false;
}

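bool CWordFilter::censorStrWithOutSymbol(const std::string &source)
{
    // Keep only 3- and 4-byte UTF-8 sequences (e.g. Chinese characters);
    // ASCII and 2-byte sequences are dropped. The filter then runs on this
    // copy, so source itself is not modified.
    string sourceWithOutSymbol;

    size_t i = 0;
    const size_t size = source.size();
    while (i < size && source[i] != 0)
    {
        unsigned char lead = static_cast<unsigned char>(source[i]);
        if ((lead & 0xE0) == 0xE0)        // 111xxxxx: 3- or 4-byte sequence
        {
            size_t byteCount = (lead & 0x10) ? 4 : 3;
            // Copy the sequence, stopping early if the string is truncated.
            for (size_t a = 0; a < byteCount && i < size; a++)
            {
                sourceWithOutSymbol += source[i];
                i++;
            }
        }
        else if ((lead & 0xC0) == 0xC0)   // 110xxxxx: 2-byte sequence
        {
            i += 2;
        }
        else                              // ASCII or a stray continuation byte
        {
            i += 1;
        }
    }
    return censorStr(sourceWithOutSymbol);
}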
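
censorStr above returns as soon as the first sensitive word has been replaced. If every occurrence should be masked, the same scan can simply continue after each replacement. The following is only a sketch of that idea: matchLengthAt is a hypothetical stand-in for the tree lookup (it checks a plain list of words), and censorAll is an illustrative name, not part of CWordFilter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the tree lookup: length in bytes of the first
// keyword that matches `text` starting at `pos`, or 0 if none does.
std::size_t matchLengthAt(const std::vector<std::string>& words,
                          const std::string& text, std::size_t pos)
{
    for (std::size_t w = 0; w < words.size(); ++w)
    {
        const std::string& word = words[w];
        if (!word.empty() && text.compare(pos, word.size(), word) == 0)
        {
            return word.size();
        }
    }
    return 0;
}

// Replace every match with "**" and report whether anything was replaced.
bool censorAll(const std::vector<std::string>& words, std::string& text)
{
    bool found = false;
    std::size_t i = 0;
    while (i < text.size())
    {
        std::size_t len = matchLengthAt(words, text, i);
        if (len > 0)
        {
            text.replace(i, len, "**");
            i += 2;        // continue right after the inserted "**"
            found = true;
        }
        else
        {
            ++i;
        }
    }
    return found;
}
```

Because the string changes length on each replacement, the loop re-reads text.size() on every iteration instead of caching it.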

One thing to point out: I do Cocos2d-x mobile game client development, so the program has to be ported to Android and iOS. When the txt file is read line by line to build the sensitive word tree, the string returned by getline(infile, read) carries an extra trailing character, for example "gunmen\0". That no longer matches the "gunmen" inside the text being checked, "...gunmen...", so the word cannot be detected. So far I have only seen this on Android and iOS; under VS on Windows the problem does not occur. So I preprocess each string that is read: the trailing character is removed before the next step.
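
One possible explanation, which is an assumption and not something the original post confirms, is that the word file uses Windows-style CRLF line endings: a text-mode stream on Windows translates "\r\n" to "\n", while on Android and iOS the '\r' stays at the end of each line. Under that assumption, a more defensive variant removes the last character only when it really is a '\r' (or a '\0'), on every platform. readKeywords is a hypothetical helper used only for this sketch.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read one keyword per line from `in`, dropping a single
// trailing '\r' or '\0' if present and skipping lines that end up empty.
std::vector<std::string> readKeywords(std::istream& in)
{
    std::vector<std::string> words;
    std::string line;
    while (std::getline(in, line))
    {
        if (!line.empty())
        {
            char last = line[line.size() - 1];
            if (last == '\r' || last == '\0')
            {
                line.erase(line.size() - 1);
            }
        }
        if (!line.empty())
        {
            words.push_back(line);
        }
    }
    return words;
}
```

Since the function takes any std::istream, it can be fed from a std::ifstream for the real file or from a std::istringstream when trying it out.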

I call this from Lua, and the strings Lua passes to C++ are UTF-8 encoded. So when stripping symbols I do not use a simple (a & 0x80) test to decide what to keep; the high bits of each lead byte are checked instead.
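
The lead-byte check can be written as a small standalone function. This is a sketch of the general UTF-8 rule, not code from the original post, and utf8SequenceLength is an illustrative name.

```cpp
#include <cstddef>

// Length in bytes of the UTF-8 sequence that starts with `lead`, or 0 if
// `lead` is a continuation byte (10xxxxxx) or not a valid lead byte.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;   // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx: most Chinese characters
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;
}
```

A plain (a & 0x80) test only separates ASCII from everything else; it cannot tell how many bytes follow, which is why the lead byte's upper bits are inspected.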


Origin www.cnblogs.com/kpxy/p/11256682.html