A Kaggle competitor's own account: how I made the top 0.3% in a competition



This article is reprinted from QbitAI (量子位); secondary reposting is prohibited.



For researchers who work with data every day, becoming a top Kaggle Grandmaster is a dream.

But Kaggle competitions attract huge numbers of teams every year, with a single competition often drawing thousands of entrants, so how do you stand out?

Recently, Lavanya Shukla, co-founder of the automated data preparation and collaboration platform Dataland, shared on her blog how she eventually reached the top 0.3% in a Kaggle competition.


She said on Twitter that the guide is packed with practical material, and readers gave it a thumbs-up. One commenter said the write-up is excellent, and that they hadn't realized ridge regression was so powerful!


First, the original notebook:

https://www.kaggle.com/lavanyashukla01/how-i-made-top-0-3-on-a-kaggle-competition

QbitAI has translated and summarized her key points below. Long-article warning: you may want to save it first and work through the code later.


Getting started in a data science competition can be a huge undertaking, so I wrote up my experience of reaching a top 0.3% score in the classic Kaggle competition House Prices: Advanced Regression Techniques.

Feel free to fork this notebook, and to get hands-on with the code yourself.

Good luck!

Goal

  • Each row in the dataset describes the characteristics of a house.

  • Our goal is to predict the sale price from these characteristics.

  • Models are judged by the root mean square error (RMSE) between the logarithm of the predicted sale price and the logarithm of the actual sale price. Taking logs means that errors on expensive houses and on cheap houses affect the score equally (a small sketch of this metric follows the list).
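
As a quick illustration of the metric (a minimal sketch, not taken from the author's notebook), the score can be computed like this, assuming `y_true` and `y_pred` hold the actual and predicted sale prices:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between log(1 + actual) and log(1 + predicted) sale prices."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
```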

Key features of the model training process

  • Cross-validation: I used 12-fold cross-validation.

  • Models: on each cross-validation fold I fit 7 models (ridge, SVR, gradient boosting, random forest, XGBoost and LightGBM regressors, among others).

  • Stacking: in addition, I trained a meta-model with a StackingCVRegressor, using XGBoost as the meta-learner (see the sketch after this list).

  • Blending: all of the trained models overfit the training data to varying degrees, so for the final prediction I blend their predictions together to get a more robust estimate.
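
Here is a minimal sketch of that model line-up and the stacked meta-model, assuming scikit-learn, xgboost, lightgbm and mlxtend are available; the hyperparameters are placeholders, not the author's tuned values.

```python
from sklearn.linear_model import RidgeCV
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from mlxtend.regressor import StackingCVRegressor

# Base regressors (illustrative hyperparameters)
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0])
svr = SVR(C=20, epsilon=0.008, gamma=0.0003)
gbr = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05)
rf = RandomForestRegressor(n_estimators=1200)
xgb = XGBRegressor(n_estimators=3000, learning_rate=0.01)
lgbm = LGBMRegressor(n_estimators=4000, learning_rate=0.01)

# Meta-model: XGBoost stacked on top of the base regressors
stack = StackingCVRegressor(
    regressors=(ridge, svr, gbr, rf, xgb, lgbm),
    meta_regressor=xgb,
    use_features_in_secondary=True,
)
```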

Model performance

The blended model achieves an RMSLE (root mean squared log error) of 0.075, far better than any of the individual models.

This is the model I used to make the final predictions.


Now that we know the basics, let's get started.


EDA

Goal

Each row in the dataset describes the characteristics of a house.

Our goal is to predict the sale price from these characteristics.


Sale price: the variable we want to predict.
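
As a minimal sketch of this first look (assuming the competition's train.csv and test.csv are in the working directory; not the author's exact code):

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)          # dataset sizes
print(train["SalePrice"].describe())    # summary statistics of the target
```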


Working with the features

First, let's visualize the features in the dataset.


Then we plot the relationships between these features, as well as their relationship to the sale price.


Let's plot the relationship between the sale price and some of the features in the dataset.
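
A minimal sketch of such plots with seaborn (the features chosen here are illustrative, not exhaustive):

```python
import matplotlib.pyplot as plt
import seaborn as sns

for col in ["OverallQual", "GrLivArea", "YearBuilt", "TotalBsmtSF"]:
    sns.scatterplot(x=train[col], y=train["SalePrice"])
    plt.title(f"SalePrice vs {col}")
    plt.show()
```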


Feature engineering

Let's look at the distribution of house sale prices.



We can see that the sale price distribution is skewed to the right. This is a problem because most ML models do not handle non-normally distributed data well.

We can apply a log(1 + x) transform to correct the skew.
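
A minimal sketch of this transform with numpy (np.log1p computes log(1 + x)), assuming the `train` DataFrame from above:

```python
import numpy as np

train["SalePrice"] = np.log1p(train["SalePrice"])   # correct the right skew
y = train["SalePrice"].reset_index(drop=True)       # log-scale target for modeling
```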


Plot the distribution of the sale price again:


Now the sale price is roughly normally distributed.


Filling in missing values


Now we can fill in the missing values for each feature.
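
A minimal sketch of this step, assuming the train and test features are stacked into one `features` DataFrame; the per-column rules shown here are illustrative, and the original notebook is more detailed:

```python
# Combine train and test features so both get the same treatment
features = pd.concat([train.drop("SalePrice", axis=1), test]).reset_index(drop=True)

# For these columns, NA simply means the house lacks that feature
for col in ("PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"):
    features[col] = features[col].fillna("None")

# Impute lot frontage with the median of the house's neighborhood
features["LotFrontage"] = features.groupby("Neighborhood")["LotFrontage"] \
    .transform(lambda x: x.fillna(x.median()))

# Remaining numeric gaps -> 0, remaining categorical gaps -> most frequent value
num_cols = features.select_dtypes(include=["number"]).columns
obj_cols = features.select_dtypes(include=["object"]).columns
features[num_cols] = features[num_cols].fillna(0)
features[obj_cols] = features[obj_cols].fillna(features[obj_cols].mode().iloc[0])
```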


And with that, there are no missing values left.

Dealing with skewed features


We use the scipy function boxcox1p to compute a Box-Cox transformation. Our goal is a simple transformation that makes the data closer to normal.
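
A minimal sketch of this step (the 0.5 skew threshold is a common choice, not necessarily the author's exact value):

```python
from scipy.stats import skew, boxcox_normmax
from scipy.special import boxcox1p

num_cols = features.select_dtypes(include=["number"]).columns
skewness = features[num_cols].apply(lambda x: skew(x.dropna()))
high_skew = skewness[skewness.abs() > 0.5].index

for col in high_skew:
    # boxcox_normmax finds the lambda that makes the column closest to normal
    features[col] = boxcox1p(features[col], boxcox_normmax(features[col] + 1))
```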


Now all of the features look normally distributed.

Creating interesting features

ML models struggle to recognize more complex patterns on their own, so we can help them by creating features based on our intuition about the dataset, for example each house's total floor area, number of bathrooms, and porch area.
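
A minimal sketch of a few such features, using the standard column names from the House Prices dataset (the original notebook creates more of them):

```python
features["TotalSF"] = (features["TotalBsmtSF"] + features["1stFlrSF"]
                       + features["2ndFlrSF"])
features["TotalBathrooms"] = (features["FullBath"] + 0.5 * features["HalfBath"]
                              + features["BsmtFullBath"]
                              + 0.5 * features["BsmtHalfBath"])
features["TotalPorchSF"] = (features["OpenPorchSF"] + features["EnclosedPorch"]
                            + features["3SsnPorch"] + features["ScreenPorch"])
```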


Feature transformations

We create more features by taking log and square transforms of the numeric features.
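
A minimal sketch of this expansion, building on the running `features` DataFrame (the column list is illustrative):

```python
for col in ["LotArea", "GrLivArea", "TotalSF"]:
    features[col + "_log"] = np.log1p(features[col])   # log-transformed copy
    features[col + "_sq"] = features[col] ** 2          # squared copy
```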


Encoding categorical features

Because most models can only handle numeric inputs, we encode the categorical features numerically.
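
One common approach is one-hot encoding with pandas; a minimal sketch, not necessarily the exact encoding used in the notebook:

```python
features = pd.get_dummies(features).reset_index(drop=True)
print(features.shape)   # many more columns after one-hot encoding
```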


Recreating the training and test sets
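
A minimal sketch, assuming `features` stacks the training rows first and `y` is the log-scale target from earlier:

```python
X = features.iloc[:len(y), :]        # training features
X_test = features.iloc[len(y):, :]   # test features for the final submission
print(X.shape, y.shape, X_test.shape)
```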


Let's visualize some of the features we will train the model on.


Training the models

Set up cross-validation and define the error metric
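
A minimal sketch of the 12-fold setup and an RMSLE-style scorer with scikit-learn:

```python
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=12, shuffle=True, random_state=42)

def cv_rmse(model, X=X, y=y):
    # y is already on the log scale, so plain RMSE here matches the competition RMSLE
    return np.sqrt(-cross_val_score(model, X, y,
                                    scoring="neg_mean_squared_error", cv=kf))
```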


Set up the models


Train the models

Get the cross-validation score for each model.
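
A minimal sketch that scores each model from the earlier setup with the `cv_rmse` helper:

```python
models = {"ridge": ridge, "svr": svr, "gbr": gbr,
          "rf": rf, "xgb": xgb, "lgbm": lgbm}
for name, model in models.items():
    scores = cv_rmse(model)
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```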


Blend the models and get predictions
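
A minimal sketch of a weighted blend, assuming each model (and the `stack` meta-model) has already been fit on X and y; the weights are illustrative and would be tuned against the CV scores:

```python
def blended_predictions(X_new):
    # Weighted average of the individual models and the stacked meta-model
    return (0.15 * ridge.predict(X_new) + 0.15 * svr.predict(X_new)
            + 0.15 * gbr.predict(X_new) + 0.10 * rf.predict(X_new)
            + 0.15 * xgb.predict(X_new) + 0.15 * lgbm.predict(X_new)
            + 0.15 * stack.predict(np.array(X_new)))
```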


Identify the best-performing model


The blended model has an RMSLE of 0.075, far better than the other models, so this is the model I used to make the final predictions.

Submit the predictions
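
A minimal sketch of building the submission file, assuming the standard sample_submission.csv format; np.expm1 reverses the earlier log1p transform:

```python
submission = pd.read_csv("sample_submission.csv")
submission["SalePrice"] = np.expm1(blended_predictions(X_test))
submission.to_csv("submission.csv", index=False)
```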


Links

Original article:
https://www.kaggle.com/lavanyashukla01/how-i-made-top-0-3-on-a-kaggle-competition

The author's blog:
https://lavanya.ai/





Origin blog.csdn.net/BTUJACK/article/details/92871249