Search Engines: Document Tokenization and Corpus Processing
Tokenization: the process of splitting a given string into a sequence of subsequences, where each subsequence is called a token.
The user's query should be tokenized in the same way before being sent to the backend for retrieval.
(1) Symbol removal
Python
Python's string module exposes string.punctuation, which contains all ASCII punctuation marks.
Substituting those characters away can be done with the re module's re.sub.
import re
import string

punctuation_string = string.punctuation
# re.escape keeps characters such as ']' and '\' from being read as regex syntax
lines = re.sub('[{}]'.format(re.escape(punctuation_string)), " ", text)
Input: 'I believe you!And you?'
Output: 'I believe you And you '
JAVA
Use a regular expression with Java's String replacement; the character class below covers both English and Chinese punctuation.
Matcher m = Pattern.compile("[\n`~!@#$%^&*()+=|{}':;',\\[\\].<>/?~!@#¥%……&*()——+|{}【】‘;:”“’。, ·、?]").matcher(text);
String newtext = m.replaceAll("").trim();
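A self-contained version of the Python removal above; the sample text is the one used in this step's example.

```python
import re
import string

text = 'I believe you!And you?'
# string.punctuation holds all ASCII punctuation; re.escape keeps
# characters such as ']' and '\' from being read as regex syntax.
pattern = '[{}]'.format(re.escape(string.punctuation))
lines = re.sub(pattern, ' ', text)
print(lines)  # -> 'I believe you And you '
```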
(2) Case conversion
Python
The words 'Apple' and 'apple' should be treated as the same term, so we convert the text to lowercase.
text = text.lower()
Input: 'I believe you And you '
Output: 'i believe you and you '
JAVA
text = text.toLowerCase();
(3) Splitting English text
Python
Unlike Chinese, English already has natural separators (whitespace) between words, so splitting is straightforward.
lsplit = lines.split()
Input: 'i believe you and you '
Output: ['i', 'believe', 'you', 'and', 'you']
JAVA
String[] tokens = text.split("\\s+");
System.out.println(Arrays.toString(tokens));
(4) Remove stop words
Words that are extremely common carry little value when matching documents against user needs, so they are removed from the vocabulary entirely.
Python
Very common terms such as 'i', 'you', and 'and' are treated as stop words.
Because they appear in nearly every document, they are excluded from the statistics.
from stop_words import get_stop_words

en_stop = get_stop_words('en')
stopped_tokens = [i for i in token if i not in en_stop]
Input: ['i', 'believe', 'you', 'and', 'you']
Output: ['believe']
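The stop_words package above is one option; a minimal sketch with a hand-written list works the same way (the set below is illustrative, much smaller than the package's actual list):

```python
# Tiny illustrative stop-word set; the stop_words package ships a far longer one.
en_stop = {'i', 'you', 'and', 'the', 'of'}

token = ['i', 'believe', 'you', 'and', 'you']
# Keep only tokens that are not stop words.
stopped_tokens = [t for t in token if t not in en_stop]
print(stopped_tokens)  # -> ['believe']
```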
(5) Stemming
Because languages inflect words differently, documents often contain several forms of the same word;
when we search for one form, documents containing the other forms should also be returned.
The words 'believes' and 'believe' should be treated as the same term, so we reduce both to a common stem.
A stem is not necessarily a grammatically correct word; it is just a way of mapping different forms of a word to the same string.
Python
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
texts = [p_stemmer.stem(i) for i in stopped_tokens]
Input: ['believe', 'really', 'wait', 'believes']
Output: ['believ', 'realli', 'wait', 'believ']
JAVA
PorterStemmer stem = new PorterStemmer();
stem.setCurrent("happyness");
stem.stem();
String result = stem.getCurrent();
(6) Remove duplicates
The terms that make up the term dictionary must be unique, so we finish with a simple deduplication (dict.fromkeys preserves order).
word_list = list(dict.fromkeys(word_list))
Input: ['believ', 'realli', 'wait', 'believ']
Output: ['believ', 'realli', 'wait']
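Putting the six steps together, here is a minimal end-to-end sketch. To keep it self-contained, a hand-picked stop-word set and a toy suffix stripper stand in for the stop_words package and NLTK's PorterStemmer used above; a real pipeline would swap those back in.

```python
import re
import string

# Tiny illustrative stop-word set; a real system would use a full list.
STOP_WORDS = {'i', 'you', 'and', 'the', 'of'}

def toy_stem(word):
    # Toy suffix stripper standing in for the Porter stemmer,
    # so this sketch runs without NLTK installed.
    for suffix in ('es', 'e', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # (1) Replace punctuation with spaces (re.escape guards regex metacharacters).
    text = re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', text)
    # (2) Lowercase.
    text = text.lower()
    # (3) Split on whitespace.
    tokens = text.split()
    # (4) Drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # (5) Stem.
    tokens = [toy_stem(t) for t in tokens]
    # (6) Deduplicate while preserving order.
    return list(dict.fromkeys(tokens))

print(preprocess('I believe you!And you?'))  # -> ['believ']
```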