Kaggle: Santander Customer Transaction Prediction

Competition page:
https://www.kaggle.com/c/santander-customer-transaction-prediction

1. Post-Competition Summary

1.1 Learning from Others' Solutions

1.1.1 List of Fake Samples and Public/Private LB split

https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split
Statistically, the test set looks very similar to the training set, but the unique-value counts differ a lot. The hypothesis is therefore that part of the test set was synthesized by sampling feature values from real test samples. On that basis the test set can be split into 100,000 fake examples and 100,000 real examples. Assuming further that the sampling was done after the public/private LB split, the real examples can then be divided into 50,000 public + 50,000 private.
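
A minimal re-sketch of the kernel's check (not the original code; it assumes test_df is the raw test frame with the 200 var_* columns): a row containing at least one value that occurs exactly once in its column is treated as real, while a row with no column-unique values must have been assembled by sampling and is treated as fake.

import numpy as np
import pandas as pd

features = [c for c in test_df.columns if c.startswith('var_')]

# flag, for every cell, whether its value is unique within its column
unique_flags = np.zeros((len(test_df), len(features)), dtype=bool)
for i, var in enumerate(features):
    counts = test_df[var].value_counts()
    unique_values = counts[counts == 1].index
    unique_flags[:, i] = test_df[var].isin(unique_values).values

# rows with at least one column-unique value are real, the rest are synthetic
real_idx = np.where(unique_flags.sum(axis=1) > 0)[0]
fake_idx = np.where(unique_flags.sum(axis=1) == 0)[0]
print(len(real_idx), len(fake_idx))   # roughly 100000 / 100000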

1.1.2 giba single model public 0.9245 private 0.9234

https://www.kaggle.com/titericz/giba-single-model-public-0-9245-private-0-9234

>Reverse features
I don't understand why features that correlate negatively with the target are reversed. Presumably it is because all 200 variables are later stacked into the same four columns: flipping the sign gives every variable the same orientation with respect to the target, so patterns learned from one variable transfer to the others (see the discussion comment quoted at the end).

#Reverse features
for var in features:
    if np.corrcoef( train_df['target'], train_df[var] )[1][0] < 0:
        train_df[var] = train_df[var] * -1
        test_df[var]  = test_df[var]  * -1

>Feature generation
For each original variable, four features are built: the raw value, its count (frequency), a feature_id, and its rank. (I don't fully understand the role of feature_id; the discussion comment quoted at the end suggests it lets LGBM tailor its splits to each individual variable.) The raw value is also normalized further down. The 200 variables are then stacked vertically, giving a (40000000, 4) matrix (200,000 rows x 200 variables).
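
The code below also uses a var_stats lookup that the excerpt never defines. A plausible construction (an assumption on my part; the original kernel may also include test-set values in these counts) is a per-variable value-count table:

# frequency of each value of each variable, used as the "hist"/count feature below
features = ['var_%d' % i for i in range(200)]
var_stats = {var: train_df[var].value_counts() for var in features}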

def var_to_feat(vr, var_stats, feat_id ):
    new_df = pd.DataFrame()
    new_df["var"] = vr.values                            # raw value
    new_df["hist"] = pd.Series(vr).map(var_stats)        # frequency (count) of that value
    new_df["feature_id"] = feat_id                       # which of the 200 variables this row came from
    new_df["var_rank"] = new_df["var"].rank()/200000.    # rank normalized by the 200,000 rows
    return new_df.values

TARGET = np.array( list(train_df['target'].values) * 200 )   # one copy of the target per variable (200 stacked copies)

TRAIN = []
var_mean = {}
var_var  = {}
for var in features:
    tmp = var_to_feat(train_df[var], var_stats[var], int(var[4:]) )   # int(var[4:]) extracts the id from 'var_XX'
    var_mean[var] = np.mean(tmp[:,0])
    var_var[var]  = np.var(tmp[:,0])
    tmp[:,0] = (tmp[:,0]-var_mean[var])/var_var[var]   # normalize the raw value (divided by the variance, as in the original kernel)
    TRAIN.append( tmp )
TRAIN = np.vstack( TRAIN )   # stack the 200 variables vertically -> (40000000, 4)

del train_df
_=gc.collect()

print( TRAIN.shape, len( TARGET ) )

>LGBM model
Train an LGBM model on the stacked matrix with the parameters below, using 10-fold stratified CV.

model = lgb.LGBMClassifier(**{
     'learning_rate': 0.04,
     'num_leaves': 31,
     'max_bin': 1023,
     'min_child_samples': 1000,
     'reg_alpha': 0.1,
     'reg_lambda': 0.2,
     'feature_fraction': 1.0,
     'bagging_freq': 1,
     'bagging_fraction': 0.85,
     'objective': 'binary',
     'n_jobs': -1,
     'n_estimators':200,})

MODELS = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=11111)
for fold_, (train_indexes, valid_indexes) in enumerate(skf.split(TRAIN, TARGET)):
    print('Fold:', fold_ )
    model = model.fit( TRAIN[train_indexes], TARGET[train_indexes],
                      eval_set = (TRAIN[valid_indexes], TARGET[valid_indexes]),
                      verbose = 10,
                      eval_metric='auc',
                      early_stopping_rounds=25,
                      categorical_feature = [2] )
    MODELS.append( model )
    
del TRAIN, TARGET
_=gc.collect()

>Prediction
The same feature engineering is applied to the test data; every fold model predicts on every variable, and the predictions are transformed with log(x) - log(1-x) (the logit) before averaging. Why? A likely reason: the logit converts each per-variable probability into log-odds, so taking the mean combines the 200 roughly independent variables in a Naive-Bayes-like way, which matches the "mean logit" remark in the comment quoted at the end.

Why also apply sub['target'] = sub['target'].rank() / 200000. at the end?
The author's answer: rank or not it produces the same score since the metric is rank based (AUC). I used rank just to normalize to the range [0-1]
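
A quick numeric check of that answer on toy data (scikit-learn assumed here; this is not part of the original kernel): AUC depends only on the ordering of the scores, so a monotonic transform such as rank()/N leaves it unchanged.

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=1000)         # toy binary labels
scores = rng.normal(size=1000) + y       # toy scores correlated with the labels

print( roc_auc_score(y, scores) )                                   # AUC on raw scores
print( roc_auc_score(y, pd.Series(scores).rank() / len(scores)) )   # identical AUC after the rank transform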

from scipy.special import logit   # logit(x) = log(x) - log(1 - x); combine predictions in log-odds space

ypred = np.zeros( (200000,200) )            # one column of predictions per variable
for feat,var in enumerate(features):
    tmp = var_to_feat(test_df[var], var_stats[var], int(var[4:]) )
    tmp[:,0] = (tmp[:,0]-var_mean[var])/var_var[var]
    for model_id in range(10):              # average the 10 fold models
        model = MODELS[model_id]
        ypred[:,feat] += model.predict_proba( tmp )[:,1] / 10.
ypred = np.mean( logit(ypred), axis=1 )     # combine the 200 per-variable probabilities via mean logit

sub = test_df[['ID_code']]
sub['target'] = ypred
sub['target'] = sub['target'].rank() / 200000.
sub.to_csv('golden_sub.csv', index=False)
print( sub.head(10) )

A comment from the kernel's discussion thread sums up why the approach works:

I studied your code some more. This is a brilliant solution !! Reversing some variables and stacking all of them into 4 columns is really ingenious. It simulates ideas from an NN convolution where the model can use patterns it learns from one variable to assist in its pattern detection of another variable. This also prevents LGBM from modeling spurious interactions between variables. But it’s more advanced than a convolution (that uses the same weights for all variables) because you provide column 3 which has the original variable’s number (0-199), so your LGBM can customize its prediction for each variable. Lastly you combine everything back together mathematically accurate by using mean logit. Very very nice. Setting the frequency count as a categorical value is a nice touch which allows LGBM to efficiently divide the different distributions. You maximized the modeling ability of an LGBM duplicating other participants’ success with NNs. I am quite impressed !!

To be continued...

Reposted from blog.csdn.net/yuandong_D/article/details/89392079