iFLYTEK Greenhouse Temperature Prediction Challenge: A Personal Summary (not a top finish)

Preface

This was the first data competition I participated in from start to finish. I gained a great deal from it, got a comprehensive picture of how data competitions work, and watched the front-runners battle it out. This article is a review of the competition; since I did little in the semi-finals, it mainly summarizes the tricks I picked up in the preliminary round. Discussion and feedback are welcome! Of course, I would also like to thank Yumen and Aze for sharing their baselines and ideas!

Here are the main members of the team:

  • Yi Lei, second-year graduate student in Operations Research and Cybernetics at Beijing Jiaotong University
  • yiyang, graduate student in Applied Statistics at Donghua University
  • Concentrated (handle), second-year graduate student in Computer Technology at Shanghai Normal University

1. Problem overview: a time-series task

Competition website: http://challenge.xfyun.cn/topic/info?type=temperature

Because the rules of the preliminary and semi-final rounds are almost completely different, they are effectively two separate competitions. The preliminary round can be framed as a multivariate regression problem, while the semi-final becomes a gap-filling task based on time-series forecasting. Embarrassingly, our team was busy with other things during the semi-finals, and essentially only one teammate worked on it occasionally, so this article mainly summarizes my feature engineering for the preliminary round.


2. Preliminary round: current values and feature "time travel" allowed

In the preliminary round our team members had not yet met, so we each modeled independently, and all three of us used single XGBoost models, scoring 0.104, 0.106, and 0.117. The upside of this was that our models differed in meaningful ways, which set the stage for later ensembling. After weighted fusion and tuning, our team's preliminary A-leaderboard score settled at 0.10034, ranked 36th; when scores switched to the B leaderboard, we rose to 28th (28/771) and qualified for the semi-finals.
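The weighted fusion step described above can be sketched as follows; the weight vector and prediction arrays here are purely illustrative, not the team's actual values:

```python
import numpy as np

def weighted_blend(preds, weights):
    """Blend several models' prediction vectors with a convex weight vector."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    # stack the prediction vectors as columns, then take the weighted row-wise average
    return np.average(np.column_stack(preds), axis=1, weights=weights)

# e.g. three single-model prediction vectors (toy values)
p1, p2, p3 = np.array([0.1, 0.2]), np.array([0.12, 0.18]), np.array([0.11, 0.22])
blend = weighted_blend([p1, p2, p3], weights=[0.4, 0.35, 0.25])
```

In practice the weights can be tuned on the validation set, giving more weight to the stronger single model.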

(A-leaderboard and B-leaderboard screenshots)

Below I summarize the work I did on data processing and feature engineering. Much of it borrows from Yumen's baseline; I did not build bucketing features and rarely used feature crossing. (Single-model score: 0.106, roughly 60th on the A leaderboard; I suspect the B-leaderboard rank would have been noticeably better.)
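For reference, the bucketing feature that was skipped could look something like this minimal sketch; the column name follows the competition data, but the bin count and equal-frequency choice are assumptions:

```python
import pandas as pd

# toy outdoor-temperature readings
df = pd.DataFrame({'outdoorTemp': [12.1, 15.3, 18.7, 22.4, 25.0, 28.9]})

# equal-frequency binning into 3 buckets; labels=False yields integer bucket ids
# that can be fed to a tree model as a coarse categorical feature
df['outdoorTemp_bin'] = pd.qcut(df['outdoorTemp'], q=3, labels=False)
```

Equal-width bins via `pd.cut` would be the other common variant.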

  1. Missing-value imputation (forward fill with `fillna`)
# fill missing sensor readings by carrying the last valid observation forward
f = ['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo']
train_df[f] = train_df[f].fillna(method='ffill')
test_df[f] = test_df[f].fillna(method='ffill')
  2. Outlier replacement (assuming the data are roughly normal, flag values outside the 3σ range; here they are replaced with the mean of the neighboring readings)
# replace pressure outliers with the mean of the previous and next readings
for f in tqdm(['indoorAtmo', 'outdoorAtmo']):
    upper = data_df[f].mean() + 3 * data_df[f].std()
    lower = data_df[f].mean() - 3 * data_df[f].std()
    # note: assumes no outlier falls on the first or last row
    for i in data_df[data_df[f] > upper].index:
        data_df.loc[i, f] = (data_df.loc[i-1, f] + data_df.loc[i+1, f]) / 2
    for i in data_df[data_df[f] < lower].index:
        data_df.loc[i, f] = (data_df.loc[i-1, f] + data_df.loc[i+1, f]) / 2
  3. Same-minute value one hour earlier and its difference (using `shift`)
# value at the same minute one hour earlier
# (train rows are 1 minute apart, test rows 30 minutes apart, so both shifts span one hour)
for f in tqdm(['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo']):
    train_df['ago_1hour_{}'.format(f)] = train_df[f].shift(1*60)
    test_df['ago_1hour_{}'.format(f)] = test_df[f].shift(1*2)
  4. Rolling mean over the previous half hour (using `rolling`)
# rolling-window means (window size follows each set's sampling rate)
for f in tqdm(['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo']):
    train_rolling = train_df[f].rolling(window=30)
    train_df['rolling_{}_mean'.format(f)] = train_rolling.mean()
    test_rolling = test_df[f].rolling(window=2)
    test_df['rolling_{}_mean'.format(f)] = test_rolling.mean()
  5. Basic aggregation features by month, day, and hour, including mean, median, etc. (`groupby`; involves feature crossing)
# basic aggregates grouped by month, day, and hour
group_feats = []
for f in tqdm(['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo']):
    data_df['MDH_{}_medi'.format(f)] = data_df.groupby(['month','day','hour'])[f].transform('median')
    data_df['MDH_{}_mean'.format(f)] = data_df.groupby(['month','day','hour'])[f].transform('mean')
    # ... other statistics (max, min, std, etc.)
    group_feats.append('MDH_{}_mean'.format(f))
  6. Basic cross features (mainly ratios and differences)
# pairwise ratio features
for f1 in tqdm(['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo'] + group_feats):
    for f2 in ['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo'] + group_feats:
        if f1 != f2:
            colname = '{}_{}_ratio'.format(f1, f2)
            data_df[colname] = data_df[f1].values / data_df[f2].values

# pairwise difference features (column with the larger mean minus the smaller)
for f1 in tqdm(['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo']):
    for f2 in ['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo']:
        if (f1 != f2) & (data_df[f1].mean() > data_df[f2].mean()):
            colname = '{}_{}_differ'.format(f1, f2)
            data_df[colname] = data_df[f1].values - data_df[f2].values
  7. Basic cross features of the one-hour-earlier values
# ratios between the one-hour-earlier values
for f1 in tqdm(['ago_1hour_outdoorTemp','ago_1hour_outdoorHum','ago_1hour_outdoorAtmo','ago_1hour_indoorHum','ago_1hour_indoorAtmo']):
    for f2 in ['ago_1hour_outdoorTemp','ago_1hour_outdoorHum','ago_1hour_outdoorAtmo','ago_1hour_indoorHum','ago_1hour_indoorAtmo']:
        if f1 != f2:
            # f1 and f2 already carry the ago_1hour_ prefix, so don't prepend it again
            colname = '{}_{}_ratio'.format(f1, f2)
            data_df[colname] = data_df[f1].values / data_df[f2].values
  8. Historical information extraction: mean of each basic feature (raw and cross) at the same hour over the previous n days (`dt` is a day index)
# extract 2-day historical means at the same hour
data_df['dt'] = data_df['day'].values + (data_df['month'].values - 3) * 31  # day index counted from March
for f in tqdm(['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo'] + ratio_feats):
    tmp_df = pd.DataFrame()
    for t in range(15, 45):
        tmp = data_df[data_df['dt'].isin([t-1, t-2])].groupby(['hour'])[f].agg(['mean']).reset_index()
        tmp.columns = ['hour', 'hit2days_{}_mean'.format(f)]
        tmp['dt'] = t
        tmp_df = pd.concat([tmp_df, tmp], ignore_index=True)
    data_df = data_df.merge(tmp_df, on=['dt','hour'], how='left')
  9. Historical information extraction: hourly means of each basic feature (raw and cross) over the preceding hours, plus their differences (`dh` is an hour index; uses `diff`)
# differences of the previous 1-hour and 2-hour means
data_df['dh'] = data_df['hour'].values + (data_df['dt'].values - 14) * 24
for f in tqdm(['outdoorTemp','outdoorHum','outdoorAtmo','indoorHum','indoorAtmo'] + ratio_feats):
    tmp_df = data_df.groupby(['dh'])[f].agg(['mean']).reset_index()
    tmp_df.columns = ['dh', 'bef_{}_mean'.format(f)]
    tmp_df['diff1_{}_mean'.format(f)] = tmp_df['bef_{}_mean'.format(f)].diff(1)
    tmp_df['diff2_{}_mean'.format(f)] = tmp_df['bef_{}_mean'.format(f)].diff(2)
    data_df = data_df.merge(tmp_df, on=['dh'], how='left')

3. Semi-finals: current values not allowed, feature "time travel" forbidden

Because the semi-final task changed substantially from the preliminary round, and each of us had a tight schedule, there is little to report. For better solutions, watch for the open-source code of the top finishers! (I'm waiting for it too!)


4. Summary

  1. Data and features determine the ceiling of machine learning; models and algorithms only approach that ceiling. Feature engineering matters above all!
  2. A simple weighted fusion of models with very different feature engineering can yield excellent results!
  3. Feature construction for time-series problems is a deep subject in its own right!
  4. Doing data science competitions and watching your rank shift between the A and B leaderboards every day is a fascinating experience!

Origin blog.csdn.net/xylbill97/article/details/108706620