2. Text normalization

Before any further analysis or NLP, you first need to normalize the corpus of text documents. For this we reuse the standard normalization modules from earlier, and add a few new techniques specific to this content.

After analyzing several corpora, a number of new words were carefully selected and added to the stopword list, as the following code shows:

import nltk

# Start from NLTK's built-in English stopwords and extend the list
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list = stopword_list + ['mr', 'mrs', 'come', 'go', 'get',
                                 'tell', 'listen', 'one', 'two', 'three',
                                 'four', 'five', 'six', 'seven', 'eight',
                                 'nine', 'zero', 'join', 'find', 'make',
                                 'say', 'ask', 'tell', 'see', 'try', 'back',
                                 'also']

As you can see, most of the newly added words are generic verbs and nouns that carry little meaning on their own. Adding them to the stopword list is useful when extracting features for text clustering. The normalization pipeline also gains a new function that uses a regular expression to keep only tokens containing alphabetic characters, as follows:

import re

def keep_text_characters(text):
    # Keep only tokens that contain at least one alphabetic character,
    # dropping pure numbers and punctuation
    filtered_tokens = []
    tokens = tokenize_text(text)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
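
To illustrate, a quick hypothetical run (assuming tokenize_text() splits the string into word and punctuation tokens, as nltk.word_tokenize does) would drop pure numbers and punctuation:

sample = "in 2019 , over 9000 papers on NLP !"
print(keep_text_characters(sample))
# in over papers on NLP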

This new function, together with the functions used earlier (contraction expansion, HTML unescaping, tokenization, stopword and special-character removal, and lemmatization), is combined into the final normalization function, as follows:

def normalize_corpus(corpus, lemmatize=True,
                     only_text_chars=False,
                     tokenize=False):

    normalized_corpus = []
    for text in corpus:
        # Decode HTML entities and expand contractions such as "don't"
        text = html_parser.unescape(text)
        text = expand_contractions(text, CONTRACTION_MAP)
        if lemmatize:
            text = lemmatize_text(text)
        else:
            text = text.lower()
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        # Optionally keep only tokens containing alphabetic characters
        if only_text_chars:
            text = keep_text_characters(text)

        # Return token lists or whole strings, depending on the caller
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
    return normalized_corpus

As can be seen, this function is very similar to the one discussed before; it merely adds the keep_text_characters() call to retain only text characters, which runs when the only_text_chars parameter is set to True.
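
To round off, here is a hypothetical usage sketch (the two-document corpus is made up, and the helper functions are the ones referenced above):

corpus = ["Mr. Smith said he'd come back at 5 o'clock!",
          "She asked him to join the 3 o'clock meeting."]

# Lemmatized strings containing only alphabetic tokens
norm_corpus = normalize_corpus(corpus, only_text_chars=True)

# The same pipeline, but returning token lists instead of strings
norm_tokens = normalize_corpus(corpus, tokenize=True)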
