Kaggle Quora Insincere Questions: A Summary


3rd place:

https://www.kaggle.com/wowfattie/3rd-place

Uses a spell checker built on word embeddings:

https://www.kaggle.com/cpmpml/spell-checker-using-word2vec
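
For reference, the idea in that kernel can be sketched as a Norvig-style corrector that ranks edit-distance-1 candidates by their frequency rank in the embedding vocabulary. `words_rank` (word -> rank, lower = more frequent, e.g. taken from the word2vec index order) is an assumed input here, not the kernel's exact code:

    def edits1(word):
        """All strings one edit away from `word` (Norvig's classic recipe)."""
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correction(word, words_rank):
        # keep the word unchanged if no edit-distance-1 candidate is known
        candidates = [w for w in edits1(word) if w in words_rank] or [word]
        return min(candidates, key=lambda w: words_rank.get(w, float('inf')))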


4th place:

https://www.kaggle.com/tks0123456789/pme-ema-6-x-8-pochs

I haven't studied this one closely yet.


13th place:

https://www.kaggle.com/canming/ensemble-mean-iii-64-36

Embedding weighting (note that np.mean divides the sum by 2, so the effective weights are 0.64 and 0.36, matching the kernel name):

    np.mean((1.28 * embedding_matrix_1, 0.72 * embedding_matrix_3), axis=0)

The different models:

    poolRNN(spatialdropout=0.2, gru_units=128, weight_decay=0.04)
    LSTM_GRU(spatialdropout=0.20, rnn_units=64, weight_decay=0.07)
    BiLSTM_CNN(spatialdropout=0.2, rnn_units=128, filters=[100, 80, 30, 12], weight_decay=0.10)
    singleRNN(spatialdropout=0.20, rnn_units=120, weight_decay=0.08)
Each model is run three times and the predictions of every run are saved.

Questions:
1. The kernel has a very large number of hyperparameters, and it is not clear how to tune them all.
2. It uses an AttentionWeightedAverage(Layer), but it is not clear how the attention weights are controlled or how the decay rate should be chosen (see the sketch below).
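
For what it's worth, the core of AttentionWeightedAverage is small: learn one score per timestep, softmax over time, and return the weighted average. A minimal PyTorch sketch (the kernel's layer is a Keras one, so details differ):

    import torch
    import torch.nn as nn

    class AttentionWeightedAverage(nn.Module):
        def __init__(self, hidden_dim):
            super().__init__()
            self.scorer = nn.Linear(hidden_dim, 1, bias=False)

        def forward(self, x):                    # x: (batch, seq_len, hidden)
            weights = torch.softmax(self.scorer(x), dim=1)
            return (weights * x).sum(dim=1)      # (batch, hidden)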

15th place:
https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/80540

The first model is an RCNN.
The second model is LSTM(128) + GRU(96) + MaxPooling1D + Dropout(0.1).
The third model is LSTM(128) + GRU(64) + Conv1D + maxpooling_concatenate.
The fourth model is LSTM(128) + GRU(64) + Conv1D + Attention.

We used word vectors formed by concatenating the GloVe and fastText embeddings (see the sketch below).
We set maxfeatures = None and maxlen = 57.
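
A minimal sketch of that concatenation, assuming `glove_matrix` and `fasttext_matrix` are already-loaded (vocab_size, 300) arrays:

    import numpy as np

    # each word ends up with a 600-d vector (300-d GloVe + 300-d fastText)
    embedding_matrix = np.concatenate([glove_matrix, fasttext_matrix], axis=1)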

The effort is mainly spent on model ensembling.
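
A standard blending recipe for such an ensemble is to average the models' validation probabilities and scan thresholds for the best F1; `pred1`..`pred4` and `y_val` below are assumptions, not the team's actual code:

    import numpy as np
    from sklearn.metrics import f1_score

    blend = np.mean([pred1, pred2, pred3, pred4], axis=0)
    thresholds = np.arange(0.10, 0.60, 0.01)
    best_t = max(thresholds, key=lambda t: f1_score(y_val, blend > t))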

18th place:
https://www.kaggle.com/kentaronakanishi/18th-place-solution
Gradually increases the batch_size at every epoch.
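
A minimal PyTorch sketch of the trick, rebuilding the DataLoader each epoch with a doubled batch size; the growth schedule and the names `train_dataset`, `model`, `optimizer`, `loss_fn` are assumptions, not the kernel's exact code:

    from torch.utils.data import DataLoader

    for epoch in range(5):
        # batch size doubles every epoch: 512, 1024, 2048, ...
        loader = DataLoader(train_dataset, batch_size=512 * 2 ** epoch, shuffle=True)
        for x_batch, y_batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()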


20th place:
https://www.kaggle.com/jihangz/lt-conc-g-f-lg-mean-g-p-light

    loss_fn1 = torch.nn.BCEWithLogitsLoss()
    loss_fn2 = f1_loss
    optimizer1 = torch.optim.Adam(model1.parameters(), lr=0.0035)
    scheduler1 = CosineLRWithRestarts(optimizer1, batch_size, len(x_train_fold), restart_period=4, t_mult=1, verbose=True)
    optimizer2 = torch.optim.Adam(model2.parameters(), lr=0.0035)
    scheduler2 = CosineLRWithRestarts(optimizer2, batch_size, len(x_train_fold), restart_period=4, t_mult=1, verbose=True)
The two models use two different loss functions, and the network is optimized with a mixed loss (BCE + F1 loss).
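
The kernel's `f1_loss` is not shown in this snippet; a common differentiable formulation ("soft F1": use probabilities in place of hard 0/1 predictions and minimize 1 - F1) looks like this sketch:

    import torch

    def f1_loss(logits, labels, eps=1e-7):
        probs = torch.sigmoid(logits)
        tp = (probs * labels).sum()          # soft true positives
        fp = (probs * (1 - labels)).sum()    # soft false positives
        fn = ((1 - probs) * labels).sum()    # soft false negatives
        f1 = 2 * tp / (2 * tp + fp + fn + eps)
        return 1 - f1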

22nd place:
Uses part-of-speech (POS) tagging to resolve word-sense ambiguity.
https://www.kaggle.com/ryches/22nd-place-solution-6-models-pos-tagging

These choices actually seemed to make some sense given that we have a CNN model, our strongest LSTM/GRU models, use our strongest embedding 3 times, and use POS tagging as an augmentor/differentiator to our weaker embeddings.

The idea: the embedding matrix built with POS tags differs from the one built without them, so the same word gets a different vector for each part of speech it appears as.
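
A minimal sketch of that idea, keying tokens on (word, POS) pairs so each part of speech gets its own embedding row; the `word_TAG` token format and the NLTK tagger are assumptions about the kernel's approach:

    import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data

    def pos_tokenize(text):
        # "book" as a noun and "book" as a verb become distinct tokens
        tokens = nltk.word_tokenize(text)
        return [f"{word}_{tag}" for word, tag in nltk.pos_tag(tokens)]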

27th place:
https://www.kaggle.com/dicksonchin93/kfold-tfidf-trial

One of its models uses TF-IDF features for training.
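
A minimal sketch of a TF-IDF ensemble member, assuming `train_df` is the competition's train.csv; the vectorizer settings and the linear model are illustrative, not the kernel's exact choices:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=3, sublinear_tf=True)
    X = vectorizer.fit_transform(train_df["question_text"])
    clf = LogisticRegression(solver="liblinear").fit(X, train_df["target"])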

29th place:
https://www.kaggle.com/luudactam/final-sub

    import pandas as pd
    from sklearn.model_selection import train_test_split

    neg1, neg2 = train_test_split(negative, test_size=0.5, random_state=C*100)
    df1, df2 = pd.concat([neg1, positive], ignore_index=True), pd.concat([neg2, positive], ignore_index=True)
The negatives are split in half and each half is paired with the full positive set, which effectively oversamples the positive class; the two resulting datasets are then used for training.

Question: I tried oversampling class 0 in my own solution, so why didn't it work?


79th place (my own solution):

https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/79414

This discussion mentions that Quora contains many misclassified examples involving gender and race; I went through the dataset myself and found that this is indeed the case:

  • girls hate me , but they hate me even more when boys are around me , what do i do ?
  • are muslims doing love jihad sex pervert ?
  • will sociopaths have sex with women who are unattractive ?
  • why do so many quora readers seem to be ignorant of web searching for answers ?
  • how can a man with an md and a phd be mean to his patients and assault them for being transgender ?
  • what percentage of the anti - trumpers here are russian bots ?
  • are women attracted to men 's anus ?
  • are [unk] stupid ?

Most of these sentences involving gender and race carry a subjective element, which is a major root cause of the noise in the labels.

The 'insincere' threshold probably shifts across these categories, so I think a binary [0, 1] feature should be used to mark sentences that fall into them (a sketch follows).
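
A sketch of what such a feature could look like, with a purely illustrative keyword list (nothing here comes from an actual solution):

    # 0/1 flag for questions touching gender/race topics, to be appended
    # to the model's dense inputs alongside the text features
    SENSITIVE = {"girls", "boys", "women", "men", "muslims", "transgender"}

    def sensitive_flag(text):
        return int(any(tok in SENSITIVE for tok in text.lower().split()))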
