Determine whether a text contains a word in a dictionary
Bloom filter
When is a Bloom filter needed? When high memory usage must be avoided.
First, look at some common examples:
- In word processing software, checking whether an English word is spelled correctly
- At the FBI, checking whether a suspect's name is already on the list of suspects
- In a web crawler, checking whether a URL has already been visited
- In Yahoo, Gmail, and other mail services, filtering spam
These examples have one thing in common: each needs to determine whether an element is in a collection.
Conventional approaches
- Array
- List
- Tree, balanced binary tree, Trie
- Map (red-black tree)
- Hash table
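As a rough illustration of the conventional approaches listed above, here is a minimal sketch that answers the same membership question with a list, a sorted list plus binary search, and a hash-based set. The sample words and the helper names (`in_list`, `in_sorted_list`, `in_hash_set`) are made up for this example.

```python
import bisect

# Hypothetical sample dictionary, for illustration only.
words = ["apple", "banana", "cherry", "grape", "melon"]

def in_list(word, items):
    # Linear scan: O(n) time per lookup.
    return word in items

def in_sorted_list(word, sorted_items):
    # Binary search over a sorted list: O(log n) time per lookup.
    i = bisect.bisect_left(sorted_items, word)
    return i < len(sorted_items) and sorted_items[i] == word

def in_hash_set(word, item_set):
    # Hash table lookup: O(1) on average.
    return word in item_set

sorted_words = sorted(words)
word_set = set(words)

print(in_list("cherry", words))
print(in_sorted_list("kiwi", sorted_words))
print(in_hash_set("melon", word_set))
```

All three give exact answers, but each keeps every element in memory in full, which is what becomes costly for very large collections.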
For a dictionary small enough to fit in memory, a plain set is sufficient, as follows:
    import jieba

    def check(s):
        # Dictionary file; each line is read as one entry
        huangfan_path = 'path/to/dict.txt'
        # Register the entries with jieba so they are segmented as whole tokens
        jieba.load_userdict(huangfan_path)
        huangfan_words_dict = set()
        with open(huangfan_path, 'rb') as fr:
            for line in fr:
                huangfan_words_dict.add(line.strip().decode('utf-8'))
        # Tokens of s that also appear in the dictionary
        return set(jieba.lcut(s)) & huangfan_words_dict
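When the dictionary is too large to hold comfortably in a set, a Bloom filter trades exactness for memory: it stores only a bit array, never reports an added item as missing, but may occasionally report an absent item as present. The following is a minimal sketch of the idea, not a tuned implementation. The class name `BloomFilter`, the default sizes (`num_bits=8192`, `num_hashes=4`), and the use of SHA-256 with double hashing are all assumptions chosen for illustration.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k derived hash positions."""

    def __init__(self, num_bits=8192, num_hashes=4):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        # Set every bit the item maps to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Present only if all of the item's bits are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for w in ["apple", "banana", "cherry"]:
    bf.add(w)

print("banana" in bf)  # True: an added item is never reported as missing
print("kiwi" in bf)    # normally False, but a false positive is possible
```

In the `check` function above, such a filter could stand in for the set when the dictionary is very large, as long as an occasional false positive is acceptable or is confirmed against the full dictionary afterwards.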