Search engine: simple document tokenization and corpus processing (Python/Java)

Search Engines: Document Tokenization and Corpus Processing

Tokenization: the process of splitting a given string into a sequence of substrings, where each substring is called a token

The user's input to the search engine should be tokenized before it is sent to the backend for retrieval.

(1) Symbol removal

Python

The punctuation constant in Python's string module contains all the English punctuation marks.

Symbols can be removed with a regular-expression substitution that replaces each of them with a space.

import re
import string

text = 'I believe you!And you?'
punctuation_string = string.punctuation
lines = re.sub('[{}]'.format(re.escape(punctuation_string)), ' ', text)
Input 'I believe you!And you?'
Output 'I believe you And you '

JAVA

Use a regular expression together with Java's string replacement methods.

Matcher m = Pattern.compile("[\n`~!@#$%^&*()+=|{}':;',\\[\\].<>/?~!@#¥%……&*()——+|{}【】‘;:”“’。, ·、?]").matcher(text);
String newtext = m.replaceAll(" ").trim();
(2) Case conversion

Python

The words 'Apple' and 'apple' should be treated as the same term, so we convert everything to lowercase.

lines = lines.lower()
Input 'I believe you And you '
Output 'i believe you and you '

JAVA

String lower = text.toLowerCase();
(3) English word splitting

Python

Unlike Chinese, English needs no elaborate word segmentation: words already come with a natural separator, the space.

lsplit = lines.split()
Input 'i believe you and you '
Output ['i', 'believe', 'you', 'and', 'you']

JAVA

Arrays.toString(text.split("\\s+"))
(4) Remove stop words

Words that are too common have little value for matching documents against user needs, so they are removed from the vocabulary entirely.

Python

Very common terms such as 'you', 'i' and 'and' are treated as stop words.

Because they appear in almost every document, they carry no discriminating power and are excluded from the statistics.

from stop_words import get_stop_words

en_stop = get_stop_words('en')
stopped_tokens = [i for i in lsplit if i not in en_stop]
Input ['i', 'believe', 'you', 'and', 'you']
Output ['believe']
(5) Stemming

Because of grammatical variation, documents often contain different forms of the same word.

When a user searches for one form of a word, documents containing its other forms should also be returned.

The words 'believes' and 'believe' should be regarded as the same term, so we reduce both to a common stem.

A stem is not necessarily a grammatically correct word; it is simply a way of mapping different forms of a word onto the same string.

Python

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
texts = [p_stemmer.stem(i) for i in stopped_tokens]
Input ['believe', 'really', 'wait', 'believes']
Output ['believ', 'realli', 'wait', 'believ']

JAVA

PorterStemmer stem = new PorterStemmer();
stem.setCurrent("happyness");
stem.stem();
String result = stem.getCurrent();
(6) Remove duplicates

Terms in the term dictionary must be unique, so the final step is a simple deduplication.

word_list = list(dict.fromkeys(word_list))
Input ['believ', 'realli', 'wait', 'believ']
Output ['believ', 'realli', 'wait']
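Putting the six steps together, here is a minimal end-to-end sketch of the pipeline. To keep it standard-library-only, it assumes a small hand-picked stop-word list (a real pipeline would use get_stop_words('en')) and a naive suffix-stripping stand-in for PorterStemmer:

```python
import re
import string

# Sample stop words; in practice use the full list from get_stop_words('en').
STOP_WORDS = {'i', 'me', 'you', 'and', 'a', 'the', 'is', 'are'}

def naive_stem(word):
    # Crude stand-in for PorterStemmer: strip a few common suffixes.
    for suffix in ('ing', 'es', 's', 'e'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # (1) Symbol removal: replace every punctuation mark with a space
    text = re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', text)
    # (2) Case conversion
    text = text.lower()
    # (3) Split on whitespace
    tokens = text.split()
    # (4) Remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # (5) Stemming
    tokens = [naive_stem(t) for t in tokens]
    # (6) Deduplicate while preserving first-seen order
    return list(dict.fromkeys(tokens))

print(preprocess('I believe you!And you?'))  # ['believ']
```

The same sequence of steps applies both to documents at indexing time and to the user's query at retrieval time, so that both sides meet in the same term space.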

Origin blog.csdn.net/yt266666/article/details/127427285