This article is reprinted from QbitAI (量子位); secondary reprints are prohibited.
Becoming a top-ranked Kaggle Grandmaster is the dream of many researchers who work with data every day.
But Kaggle hosts many competitions each year, and a single contest often draws thousands of entrants, so how do you stand out?
Recently, Lavanya Shukla, co-founder of the automated data preparation and collaboration platform Dataland, shared on her blog how she finished in the top 0.3% of a Kaggle competition.
On Twitter she described the guide as pure substance, and readers gave it a thumbs-up; one commenter wrote that the piece is great and that they had no idea ridge regression was so powerful.
*First, the original link:*
*https://www.kaggle.com/lavanyashukla01/how-i-made-top-0-3-on-a-kaggle-competition*
QbitAI has translated and summarized her key points below. Long-read warning: you may want to save it first and come back to the code later.
Getting started in a data science competition is a daunting amount of work, so I have written up how I achieved a top 0.3% score on Kaggle's classic House Prices: Advanced Regression Techniques competition.
Feel free to fork this notebook, and to get hands-on with the code yourself.
Good luck!
Goal
Each row in the dataset describes the features of one house.
Our goal is to predict the sale price from these features.
Models are scored by the root mean squared error (RMSE) between the predicted and actual sale prices, computed on a logarithmic scale. The log transform ensures that errors on expensive houses and on cheap houses affect the score comparably.
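As a rough sketch (my own toy example, not the author's code), the log-scale RMSE can be computed with NumPy:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between log-transformed prices."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# A prediction off by the same *ratio* scores roughly the same
# at any price level, which is the point of taking logs first:
print(rmsle(np.array([100_000.0]), np.array([110_000.0])))
print(rmsle(np.array([1_000_000.0]), np.array([1_100_000.0])))
```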
Key features of the model training process
Cross-validation: 12-fold cross-validation is used.
Models: 7 models are fit in each cross-validation run (including ridge, SVR, gradient boosting, random forest, XGBoost, and LightGBM regressors).
Stacking: in addition, I trained a StackingCVRegressor with XGBoost as the meta-learner.
Blending: all of the trained models overfit the training data to some degree, so for the final prediction I blend their predictions together to get a more robust result.
Model performance
As the chart below shows, the blended model's RMSLE (root mean squared logarithmic error) is 0.075, far better than any individual model.
This is the model I used for the final predictions:
Now that we have the lay of the land, let's get started:
EDA
Goal
Each row in the dataset describes the features of one house.
Our goal is to predict the sale price from these features.
SalePrice: the variable we want to predict
Feature processing
First, let's visualize the features in the dataset:
and plot the relationships among the features, as well as their relationship with the sale price.
Let's plot the sale price against some of the features in the dataset.
Feature engineering
Let's look at the distribution of house sale prices.
We can see that the sale price is right-skewed, which is a problem because most ML models do not handle non-normally distributed data well.
We can apply a log(1 + x) transform to correct the skew.
Plot the sale price distribution again:
Now the sale price is normally distributed.
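The transform itself is one line; this sketch (toy prices, with the `SalePrice` column name from the competition data) also shows that it is exactly invertible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"SalePrice": [120_000.0, 180_000.0, 755_000.0]})

# log(1 + x) compresses the long right tail of expensive houses
df["SalePrice"] = np.log1p(df["SalePrice"])

# expm1 undoes log1p exactly, which we need to recover real prices later
prices = np.expm1(df["SalePrice"])
```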
Filling in missing values
Now we can fill in the missing values for each feature.
With that, there are no missing values left.
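A hedged sketch of per-column imputation (the column names come from the House Prices data, but the fill strategies here are common defaults, not necessarily the author's):

```python
import pandas as pd

df = pd.DataFrame({
    "PoolQC":      [None, "Gd", None],        # NA here means "no pool"
    "LotFrontage": [65.0, None, 80.0],        # numeric: fill with the median
    "Electrical":  ["SBrkr", None, "FuseA"],  # categorical: fill with the mode
})

df["PoolQC"] = df["PoolQC"].fillna("None")
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])
```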
Fixing skewed features
We use the scipy function boxcox1p to compute the Box-Cox transformation. The goal is to find a simple transformation that normalizes the data.
Now all of the features look normally distributed.
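A minimal sketch of the Box-Cox step; the 0.5 skew threshold and the fixed lambda of 0.15 are common choices in published kernels, not confirmed from the original:

```python
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew

df = pd.DataFrame({"GrLivArea": [334.0, 1262.0, 1786.0, 5642.0],
                   "OverallQual": [5.0, 6.0, 7.0, 8.0]})

# Transform only the features that are noticeably skewed
skewed = [c for c in df.columns if abs(skew(df[c])) > 0.5]
for col in skewed:
    df[col] = boxcox1p(df[col], 0.15)  # fixed lambda, a common choice
```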
Creating interesting features
ML models have a hard time recognizing more complex patterns, so we can help our model by creating features based on our intuition about the dataset, such as each house's total floor area, number of bathrooms, and porch area.
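A sketch of the kind of intuition-driven features described, using House Prices column names; the exact combinations are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "TotalBsmtSF": [800], "1stFlrSF": [900], "2ndFlrSF": [700],
    "FullBath": [2], "HalfBath": [1],
    "OpenPorchSF": [40], "EnclosedPorch": [0], "ScreenPorch": [120],
})

# Aggregate related raw columns into features a model can use directly
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df["TotalBath"] = df["FullBath"] + 0.5 * df["HalfBath"]
df["TotalPorchSF"] = df["OpenPorchSF"] + df["EnclosedPorch"] + df["ScreenPorch"]
```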
Feature transformations
We create still more features by computing log and square transforms of the numeric features.
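A minimal sketch (the `_log`/`_sq` suffixes are my naming, and the columns are a toy subset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"LotArea": [8450.0, 9600.0], "GrLivArea": [1710.0, 1262.0]})

# Derived columns give linear models access to non-linear shapes
for col in ["LotArea", "GrLivArea"]:
    df[col + "_log"] = np.log1p(df[col])
    df[col + "_sq"] = df[col] ** 2
```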
Encoding categorical features
Since most models can only handle numeric input, we encode the categorical features numerically.
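One common way to do this is one-hot encoding with pandas; this is a sketch, and the author's notebook may use a different encoder:

```python
import pandas as pd

df = pd.DataFrame({"Neighborhood": ["NAmes", "CollgCr", "NAmes"],
                   "GrLivArea": [1200, 1500, 1100]})

# Each category becomes its own 0/1 column; numeric columns pass through
encoded = pd.get_dummies(df)
print(list(encoded.columns))
```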
Recreating the training and test sets
Visualize some of the features we will train the model on.
Training the models
Set up cross-validation and define the error metric
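A sketch of a 12-fold setup with scikit-learn; the synthetic data, the Ridge model, and the helper name `cv_rmse` are my own (sklearn scorers maximize, so MSE comes back negated):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=12, shuffle=True, random_state=42)

def cv_rmse(model, X, y):
    # Flip the sign of neg_mean_squared_error, then take the root
    mse = -cross_val_score(model, X, y, cv=kf,
                           scoring="neg_mean_squared_error")
    return np.sqrt(mse)

scores = cv_rmse(Ridge(alpha=1.0), X, y)
print(scores.mean(), scores.std())
```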
Set up the models
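The original uses mlxtend's StackingCVRegressor; as a dependency-light stand-in, scikit-learn's StackingRegressor illustrates the same wiring (the data is synthetic and the hyperparameters are placeholders, not tuned values):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

# Base models; the original notebook also adds XGBoost and LightGBM
base_models = [
    ("ridge", Ridge(alpha=10.0)),
    ("svr", SVR(C=20.0, epsilon=0.01)),
    ("gbr", GradientBoostingRegressor(n_estimators=100, random_state=42)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
]

# The meta-model learns to combine the base models' out-of-fold predictions
stack = StackingRegressor(estimators=base_models,
                          final_estimator=Ridge(), cv=5)
stack.fit(X, y)
preds = stack.predict(X)
```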
Train the models
Get the cross-validation score for each model.
Blend the models to get predictions
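A sketch of the blending step as a hand-weighted average; the predictions and weights here are illustrative, not the author's values:

```python
import numpy as np

# Hypothetical test-set predictions from three trained models
ridge_pred = np.array([200_000.0, 150_000.0])
xgb_pred = np.array([210_000.0, 145_000.0])
stack_pred = np.array([205_000.0, 148_000.0])

# Weights are hand-picked and must sum to 1 so the blend
# stays on the same price scale as the inputs
blended = 0.3 * ridge_pred + 0.3 * xgb_pred + 0.4 * stack_pred
```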
Identifying the best-performing model
As the chart above shows, the blended model's RMSLE is 0.075, far better than the other models. This is the model I used for the final predictions.
Submitting the predictions
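A sketch of the submission step: since the model was trained on log1p(price), predictions must be inverted with expm1 before writing the Kaggle-format CSV (`Id`/`SalePrice` columns per the competition; the IDs and values below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical log-scale predictions for three test houses
test_ids = [1461, 1462, 1463]
log_preds = np.array([11.7, 12.0, 12.2])

submission = pd.DataFrame({
    "Id": test_ids,
    # Undo the log1p applied to the target during training
    "SalePrice": np.expm1(log_preds),
})
submission.to_csv("submission.csv", index=False)
```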
Links
Original article:
https://www.kaggle.com/lavanyashukla01/how-i-made-top-0-3-on-a-kaggle-competition
Lavanya's blog:
https://lavanya.ai/