Kaggle-Quora Insincere Questions Classification-Solution

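Over the winter break I took part in a Kaggle competition, QIQC. It was the first Kaggle competition I entered seriously, and we finished with a silver medal. Thanks to Xiaoyu for his help.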

Competition link: https://www.kaggle.com/c/quora-insincere-questions-classification
Environment:

tensorflow 1.12.0
Keras 2.2.4
torch 1.0.0

1. Data preprocessing

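Data preprocessing is a very important step in this kind of task. The data was crawled directly, so it is very dirty and has to be cleaned first. Preprocessing also lowers the out-of-vocabulary (OOV) rate: with it, we brought OOV down from 30% to 1%.
The preprocessing steps we used were: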

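Case conversion;
Replacing quotation marks;
Replacing misspelled words and contractions;
Separating punctuation marks from words with spaces;
Replacing numbers;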
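The OOV figures above can be checked with a small helper. The following is a minimal sketch, not our competition code: `build_vocab` and `oov_rate` are hypothetical names, the embedding vocabulary is a toy dictionary, and it reports both the share of distinct words and the share of all tokens that are missing, since either could be meant by "OOV".

```python
from collections import Counter

def build_vocab(texts):
    """Count how often each whitespace-separated token occurs."""
    vocab = Counter()
    for text in texts:
        vocab.update(text.split())
    return vocab

def oov_rate(vocab, embeddings_index):
    """Return (share of distinct words, share of all tokens) missing from the embedding."""
    oov_words = [w for w in vocab if w not in embeddings_index]
    oov_tokens = sum(vocab[w] for w in oov_words)
    return len(oov_words) / len(vocab), oov_tokens / sum(vocab.values())

# Toy data: a hypothetical four-entry embedding vocabulary (vectors omitted)
embeddings_index = {"what": None, "is": None, "this": None, "?": None}
raw = ["What is this?", "what is this"]
cleaned = ["what is this ?", "what is this"]

print("raw    :", oov_rate(build_vocab(raw), embeddings_index))
print("cleaned:", oov_rate(build_vocab(cleaned), embeddings_index))
```

Running the helper before and after each preprocessing step shows which step actually removes out-of-vocabulary words.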

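Data preprocessing is something of a dark art: how well it works depends on the task and the data distribution, and the same preprocessing can make a large difference from one competition to another. Paper [1] gives a detailed summary of preprocessing, backed by extensive experiments on the toxic comment classification competition.
The full code is in [4]; the preprocessing part is shown below.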

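import re

# `contraction`, `mispell`, `punc` and `puncs` are defined outside this snippet (see [4]).

def replace_quote(text):
    # Normalise the different apostrophe-like characters to a plain "'"
    quote = ['´', '‘', '’', "`"]
    for s in quote:
        text = text.replace(s, "'")
    return text

def re_mapping(mapping):
    # Escape the keys so that regex metacharacters (".", "?", ...) match literally
    res = re.compile('(%s)' % '|'.join(re.escape(k) for k in mapping.keys()))
    return res

# Merge both dictionaries; if a key appears in both, the entry from `mispell` wins
mapping = {**contraction, **mispell}
re_map = re_mapping(mapping)
def replace_mapping(text):
    def replace(match):
        return mapping[match.group(0)]
    return re_map.sub(replace, text)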

re_punc = re_mapping(punc)
def replace_punc(text):
    def replace(match):
        return punc[match.group(0)]
    return re_punc.sub(replace, text)

def sep_punc(x):
    for p in puncs:
        x = x.replace(p, f' {p} ')
    return x

def replace_numbers(x):
    # Longest pattern first, so long numbers are masked as a whole
    x = re.sub(r'[0-9]{5,}', '#####', x)
    x = re.sub(r'[0-9]{4}', '####', x)
    x = re.sub(r'[0-9]{3}', '###', x)
    x = re.sub(r'[0-9]{2}', '##', x)
    return x

from tqdm import tqdm
tqdm.pandas()  # registers Series.progress_apply

# Replace quote
train['question_text'] = train['question_text'].progress_apply(replace_quote)
test['question_text'] = test['question_text'].progress_apply(replace_quote)
print("Replace quote done")

# Replace mapping(contraction & mispell)
train['question_text'] = train['question_text'].progress_apply(replace_mapping)
test['question_text'] = test['question_text'].progress_apply(replace_mapping)
print("Replace mapping done")

# Replace punc
# train['question_text'] = train['question_text'].progress_apply(replace_punc)
# test['question_text'] = test['question_text'].progress_apply(replace_punc)
# print("Replace punc done")

# Sep punc
train['question_text'] = train['question_text'].progress_apply(sep_punc)
test['question_text'] = test['question_text'].progress_apply(sep_punc)
print("Sep punc done")

# Replace numbers
train['question_text'] = train['question_text'].progress_apply(replace_numbers)
test['question_text'] = test['question_text'].progress_apply(replace_numbers)
print("Replace numbers done")
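To make the effect of the pipeline concrete, here is a small self-contained sketch that chains the same steps, in the same order, on a single sentence. The three dictionaries are tiny hypothetical stand-ins for the real ones in [4], and `preprocess` is a name introduced only for this illustration.

```python
import re

# Hypothetical miniature dictionaries; the real ones are much larger (see [4])
contraction = {"what's": "what is", "can't": "cannot"}
mispell = {"colour": "color"}
puncs = ["?", ",", "."]

mapping = {**contraction, **mispell}
re_map = re.compile("(%s)" % "|".join(re.escape(k) for k in mapping))

def preprocess(text):
    # 1. normalise apostrophe-like quotes
    for q in ["´", "‘", "’", "`"]:
        text = text.replace(q, "'")
    # 2. expand contractions and fix misspellings
    text = re_map.sub(lambda m: mapping[m.group(0)], text)
    # 3. put spaces around punctuation
    for p in puncs:
        text = text.replace(p, f" {p} ")
    # 4. mask numbers, longest pattern first
    text = re.sub(r"[0-9]{5,}", "#####", text)
    text = re.sub(r"[0-9]{4}", "####", text)
    text = re.sub(r"[0-9]{3}", "###", text)
    text = re.sub(r"[0-9]{2}", "##", text)
    return text

print(repr(preprocess("what’s the colour of 2019?")))
```

The masked number tokens such as `####` are only useful if the chosen embedding contains them, so it is worth checking the OOV rate again after this step.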

2. Word embeddings

The competition provided four sets of pretrained word embeddings:

GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html
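The GloVe, paragram and wiki-news files are plain text with one token per line followed by its vector, while the GoogleNews file is in binary word2vec format. As a rough sketch of how a text file can be turned into the `embeddings_index` dictionary used below, under the assumption that fields are separated by single spaces: `load_embeddings` is a hypothetical helper, and the "file" here is a three-dimensional toy held in memory.

```python
import io
import numpy as np

def get_coefs(word, *arr):
    # One parsed line: the token and its vector
    return word, np.asarray(arr, dtype="float32")

def load_embeddings(lines, emb_size=300):
    """Build {word: vector} from lines of the form 'word v1 v2 ... vN'."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != emb_size + 1:
            continue  # skip header lines such as '999994 300' and malformed rows
        word, vec = get_coefs(*parts)
        index[word] = vec
    return index

# Toy stand-in for an embedding file, with 3 dimensions instead of 300
toy_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
embeddings_index = load_embeddings(toy_file, emb_size=3)
print(sorted(embeddings_index), embeddings_index["hello"])
```

With a real file, the `lines` argument would be an open file handle; the right encoding and error handling depend on the file and are left out here.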

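In this kind of competition, word embeddings play a very important role; their effect on the score is comparable to that of the model itself. Papers [2, 3] compare in detail how word embeddings can be combined. Given the training-time limit, we ended up averaging paragram and GloVe. In local tests, concatenating the two embeddings gave better results but exceeded the time limit. Someone in the discussion section suggested reducing the dimensionality of the embeddings, which cuts the runtime significantly while preserving accuracy; Kaggle really is full of experts, and there is a lot to learn from them. The top solutions combined the two embeddings with a weighted average, for example 0.7 GloVe + 0.3 paragram, which is worth learning from.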

import numpy as np

def get_coefs(word, *arr):
    # Parse one line of an embedding file: a word followed by its vector
    return word, np.asarray(arr, dtype='float32')

def build_emb(embeddings_index, max_features, word_index):
    # np.stack needs a sequence, not a dict view
    all_embs = np.stack(list(embeddings_index.values()))
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    emb_size = all_embs.shape[1]

    nb_words = min(max_features, len(word_index))
    # Words missing from the embedding keep a random vector with matching statistics
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, emb_size))
    for word, i in word_index.items():
        if i >= nb_words: continue  # stay inside the matrix, also for small vocabularies
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

emb_glove = build_emb(embeddings_index_glove, max_features, word_index)
emb_para = build_emb(embeddings_index_para, max_features, word_index)
# Plain (unweighted) average of the two embedding matrices
emb = np.mean([emb_glove, emb_para], axis=0)
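The other combinations mentioned in this section can be sketched in a few lines of NumPy. This is an illustration only: the matrices are random stand-ins for `emb_glove` and `emb_para`, `reduce_dim` is a hypothetical helper, and PCA via SVD is my own choice of reduction method, since the discussion post did not necessarily use it.

```python
import numpy as np

rng = np.random.RandomState(0)
# Toy stand-ins for emb_glove and emb_para: 1000 words, 300 dimensions each
emb_glove = rng.normal(size=(1000, 300)).astype("float32")
emb_para = rng.normal(size=(1000, 300)).astype("float32")

# Weighted average with the example weights 0.7 / 0.3
emb_weighted = 0.7 * emb_glove + 0.3 * emb_para

# Concatenation doubles the embedding width and therefore the cost of the first layer
emb_concat = np.concatenate([emb_glove, emb_para], axis=1)

def reduce_dim(emb, n_components):
    """Project an embedding matrix onto its top principal components (PCA via SVD)."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

emb_reduced = reduce_dim(emb_concat, 300)
print(emb_weighted.shape, emb_concat.shape, emb_reduced.shape)
```

Averaging keeps the width at 300, while concatenation followed by reduction trades a one-off SVD for a narrower input to the recurrent layers.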

3. Features

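Besides feeding the text directly to a neural network, another common approach is to extract features from the text and feed them into the NN model as well. The features we extracted were: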

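Number of words and number of characters;
Number of capitalized words and number of all-uppercase words;
The ratio of the above counts to the total number of words;
Number of "?", "," and "!";
Maximum and mean word length;
Number of stopwords;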

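# `unpunc` and `punctuation` are defined outside this snippet (see [4]);
# progress_apply requires tqdm.pandas() to have been called.
def add_features(df):
    df['question_text'] = df['question_text'].progress_apply(lambda x: str(x))
    df['num_chars'] = df['question_text'].progress_apply(len)
    df['num_words'] = df.question_text.str.count(r'\S+')

    # Number of uppercase characters (not words), relative to the number of words
    df['num_capital'] = df['question_text'].progress_apply(lambda x: sum(1 for c in x if c.isupper()))
    df['capital_rate'] = df['num_capital'] / df['num_words']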

    df['num_uniquewords'] = df['question_text'].progress_apply(lambda x: len(set(x.split())))
    # df['unique_rate'] = df['num_uniquewords'] / df['num_words']

    # df["num_titlewords"] = df["question_text"].progress_apply(lambda x: len([w for w in x.split() if w.istitle()]))
    # df['title_rate'] = df['num_titlewords'] / df['num_words']
    
    # df["num_upperwords"] = df["question_text"].progress_apply(lambda x: len([w for w in x.split() if w.isupper()]))
    # df['upper_rate'] = df['num_upperwords'] / df['num_words']
    
    df["num_exc"] = df["question_text"].progress_apply(lambda x: x.count("!")).astype('uint16')
    df["num_q"] = df['question_text'].progress_apply(lambda x: x.count("?")).astype('uint16')
    df["num_,"] = df['question_text'].progress_apply(lambda x: x.count(",")).astype('uint16')
    df["num_."] = df['question_text'].progress_apply(lambda x: x.count(".")).astype('uint16')
    # `or [0]` and `default=0` guard against questions that contain no tokens
    df["mean_word_len"] = df["question_text"].progress_apply(lambda x: np.mean([len(w) for w in x.split()] or [0]))
    df["max_word_len"] = df['question_text'].progress_apply(lambda x: max((len(w) for w in x.split()), default=0))

    df["num_unpunc"] = df["question_text"].progress_apply(lambda x: sum(x.count(p) for p in unpunc)).astype('uint16')
    df["num_punc"] = df["question_text"].progress_apply(lambda x: sum(x.count(p) for p in punctuation)).astype('uint16')
    # df["num_mispell"] = df["question_text"].progress_apply(lambda x: sum(x.count(p) for p in mispell)).astype('uint16')

    return df
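The feature list also mentions capitalized words, all-uppercase words and stopwords, which are commented out or absent in the snippet above. A minimal sketch of how they could be computed follows; `add_word_features` and the tiny `STOPWORDS` set are hypothetical, and a fuller stopword list would be used in practice.

```python
import pandas as pd

# Hypothetical miniature stopword set, for illustration only
STOPWORDS = {"the", "is", "a", "of", "why", "do", "so"}

def add_word_features(df):
    words = df["question_text"].astype(str).str.split()
    # Clip to 1 so that empty questions do not cause a division by zero
    n_words = words.apply(len).clip(lower=1)
    df["num_titlewords"] = words.apply(lambda ws: sum(w.istitle() for w in ws))
    df["num_upperwords"] = words.apply(lambda ws: sum(w.isupper() for w in ws))
    df["num_stopwords"] = words.apply(lambda ws: sum(w.lower() in STOPWORDS for w in ws))
    df["title_rate"] = df["num_titlewords"] / n_words
    df["upper_rate"] = df["num_upperwords"] / n_words
    return df

toy = pd.DataFrame({"question_text": ["Why do Americans LOVE the NFL so much ?", ""]})
print(add_word_features(toy))
```

Note that these counts depend on when they are computed: after lowercasing, the capitalization features are all zero, so they have to be extracted from the raw text.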

4. Model

The model we finally chose is a bidirectional LSTM with attention; its parameters and structure are as follows:

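from keras.initializers import glorot_normal
from keras.layers import (Bidirectional, CuDNNGRU, CuDNNLSTM, Dense, Dropout, Embedding,
                          GlobalAveragePooling1D, GlobalMaxPooling1D, Input,
                          SpatialDropout1D, concatenate)
from keras.models import Model

# `Attention` is a custom layer; it, `feature_cols` and `SEED` are not shown in this snippet (see [4]).
class LstmFAtn():
    def model(self, embedding_matrix, maxlen, max_features):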
        inp_seq = Input(shape=(maxlen,), name='seq')
        inp_feature = Input(shape=(len(feature_cols),), name='feature')
        emb_size = embedding_matrix.shape[1]
        x_emb = Embedding(max_features, emb_size, weights=[embedding_matrix], trainable=False)(inp_seq)
        x = SpatialDropout1D(0.2)(x_emb)
        x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(x)
        y = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)

        atn_1 = Attention(maxlen)(x)
        atn_2 = Attention(maxlen)(y)
        avg_pool = GlobalAveragePooling1D()(y)
        max_pool = GlobalMaxPooling1D()(y)

        x = concatenate([atn_1, atn_2, avg_pool, max_pool, inp_feature])
        x = Dense(32, activation='relu', kernel_initializer=glorot_normal(seed=SEED))(x)
        x = Dropout(0.1)(x)

        output = Dense(1, activation="sigmoid")(x)
        model = Model(inputs=[inp_seq, inp_feature], outputs=output)
        model.compile(loss='binary_crossentropy', optimizer='adam')
        return model
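The `Attention` layer used above is not shown in this post. As a rough illustration of what such a layer typically computes, here is a NumPy sketch of the attention pooling commonly seen in public kernels: a tanh score per time step, normalised over the sequence, then a weighted sum. Treat the exact formula as an assumption rather than our layer's definition; `attention_pool` is a hypothetical name.

```python
import numpy as np

def attention_pool(x, w, b):
    """x: (batch, steps, dim), w: (dim,), b: (steps,) -> (batch, dim)."""
    e = np.tanh(x @ w + b)                  # one score per time step
    a = np.exp(e)
    a = a / a.sum(axis=1, keepdims=True)    # normalise the scores over the steps
    return (x * a[..., None]).sum(axis=1)   # weighted sum of the time steps

rng = np.random.RandomState(0)
x = rng.normal(size=(2, 5, 8))              # 2 sequences, 5 steps, 8 features
out = attention_pool(x, w=rng.normal(size=8), b=np.zeros(5))
print(out.shape)
```

With all-zero weights every step gets the same score, and the result reduces to plain average pooling, which is why attention is often concatenated with the average and max pools as in the model above.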

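The competition imposed a 2-hour runtime limit. Keras ran noticeably slower than PyTorch, but in our experiments the Keras version was more stable, so in the end we submitted both a Keras version and a PyTorch version. The hyperparameters above are not optimal; they are a compromise between the time limit and performance. Another model that was common in the public kernels:
[Figure: another common kernel model; image not available]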

5. Reflections

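When I have time I will write up the whole competition process and its details. Toward the end there was a lot we did not get around to, and midway I even thought about giving up. Many thanks to my teammates, who did a great deal of the work and kept us going to the finish. Kaggle is a very good platform for learning: it is more mature than the many competition platforms in China, and you can learn more from it.
Things we regret not getting done: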

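Feature selection was not implemented;
LSA + LDA features fused with the NN were not tuned to their best;
Many of the key preprocessing steps from the paper were not done;
Blending used only a simple average; we did not try more sophisticated ensembling methods, and it was a single model with 5 folds;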
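For reference, the simple fold averaging mentioned in the last item, and one slightly richer alternative, can be sketched as follows. All numbers are hypothetical: `fold_preds` stands for the test-set probabilities of the five fold models, and weighting folds by validation score is just one possible next step, not something we tried.

```python
import numpy as np

# Hypothetical test-set probabilities from 5 fold models (rows) for 4 questions (columns)
fold_preds = np.array([
    [0.10, 0.80, 0.40, 0.05],
    [0.20, 0.70, 0.60, 0.10],
    [0.15, 0.90, 0.50, 0.05],
    [0.05, 0.85, 0.45, 0.00],
    [0.25, 0.75, 0.55, 0.05],
])

# Simple average over the folds
simple = fold_preds.mean(axis=0)

# Alternative: weight each fold by a (hypothetical) validation score
val_scores = np.array([0.68, 0.69, 0.70, 0.67, 0.69])
weights = val_scores / val_scores.sum()
weighted = weights @ fold_preds

print(simple)
print(weighted)
```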

I hope to do better in the next competition; a few regrets make the experience all the more memorable. The code will be on GitHub [4].

[1] Is preprocessing of text really worth your time for toxic comment classification?
[2] Dynamic Meta-Embeddings for Improved Sentence Representations
[3] Frustratingly Easy Meta-Embedding – Computing Meta-Embeddings by Averaging Source Word Embeddings
[4] https://github.com/linxid/Competition/tree/master/QIQC