"Hands-on learning deep learning" series of notes - text preprocessing

Text preprocessing

Text is a type of sequence data: an article can be viewed as a sequence of characters or a sequence of words. Here we cover the common preprocessing steps for text data. Preprocessing generally involves four steps:

  1. Read the text
  2. Tokenize
  3. Build a vocabulary (dictionary) that maps each word to a unique index
  4. Convert the text from a sequence of words to a sequence of indices, for easy input to the model

step1: read the text

import collections
import re

def read_time_machine():
    # read the corpus line by line: lowercase each line and replace every
    # run of non-letter characters with a single space
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines


lines = read_time_machine()
print('# sentences %d' % len(lines))
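To make the cleaning step concrete, here is a small illustration of what the regular expression does; the sample sentence below is made up for demonstration:

import re

sample = "The Time-Machine, by H. G. Wells [1898]"
# lowercase, then replace every run of non-letter characters with one space
print(re.sub('[^a-z]+', ' ', sample.strip().lower()))
# -> 'the time machine by h g wells ' (note the trailing space)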

step2: tokenize

def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens"""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
tokens[0:2]
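The same function can also produce character-level tokens; a quick sketch:

tokenize(['the time machine'], token='char')
# [['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm', 'a', 'c', 'h', 'i', 'n', 'e']]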

step3: build the vocabulary

We build a vocabulary (dictionary) that maps each word to a unique index, so that the text is easier for the model to process.

class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # count word frequencies over the whole corpus
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        self.idx_to_token += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # returns a Counter mapping each token to its number of occurrences

Next we try building a vocabulary from the Time Machine corpus:

vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[0:10])
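As a quick check of the Vocab interface defined above (a minimal sketch; the exact output depends on the corpus):

print(len(vocab))                  # vocabulary size, including '<unk>'
print(vocab['the'])                # index of a single token (the unk index if absent)
print(vocab.to_tokens([0, 1, 2]))  # map indices back to tokens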

step4: convert words to indices

for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
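Putting the four steps together, the whole corpus can be flattened into one long index sequence. The helper below is a sketch added for illustration and is not part of the original notes:

def corpus_to_indices(tokens, vocab):
    # flatten the tokenized sentences into a single sequence of word indices
    return [vocab[tk] for line in tokens for tk in line]

corpus = corpus_to_indices(tokens, vocab)
print(len(corpus), corpus[:10])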

Tokenizing with existing tools

The tokenization approach described above is very simple and has at least the following drawbacks:

  1. Punctuation can usually provide semantic information, but our approach simply discards it
  2. Words such as "shouldn't" and "doesn't" are handled incorrectly
  3. Words such as "Mr." and "Dr." are handled incorrectly

We could fix these problems by introducing more complex rules, but in fact there are a number of existing tools that tokenize text well. Here we briefly introduce two of them: spaCy and NLTK.

The following is a simple example:

text = "Mr. Chen doesn't agree with my suggestion."

1. spaCy

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])
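If the en_core_web_sm model is not installed in your environment, it can usually be fetched once from the command line with python -m spacy download en_core_web_sm before running the snippet above.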

2. NLTK

from nltk.tokenize import word_tokenize
from nltk import data
data.path.append('/home/kesci/input/nltk_data3784/nltk_data')
print(word_tokenize(text))
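Note that word_tokenize relies on NLTK's Punkt tokenizer models; the data.path.append call above points to a pre-downloaded copy on the platform. On a fresh machine you would typically fetch the models first, for example:

import nltk
nltk.download('punkt')  # downloads the Punkt tokenizer models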
