DataWhale Machine Learning Summer Camp Phase 3
- New User Prediction Challenge
Learning Record II (2023.08.23) - Visual Analysis
2023.08.17
- Ran the baseline end to end, then switched to a LightGBM baseline; without any added features, online score: 0.52214
- Added the baseline features; online score: 0.78176
- Brute-force derived features and fine-tuned the model parameters; online score: 0.86068
2023.08.23
- Data analysis and derived features: 0.87488
- Derived features and model tuning: 0.89817
Communication and sharing video:
[DataWhale "User Addition Prediction Challenge" Communication and Sharing - Bilibili] https://b23.tv/zZMLtFG
1. Understanding the task
The features in this competition fall into three dimensions:
- Behavioral dimension: eid, udmap
  - The keys in udmap are processed into categorical features
- Time dimension: common_ts
  - Timestamp features extracted: day, hour, minute
- User dimension: x1~x8
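The feature extraction listed above could look roughly like the sketch below. It is not the competition baseline: it assumes that common_ts is a Unix timestamp in milliseconds and that udmap holds a JSON string (or a placeholder such as 'unknown'). The helper names parse_udmap and add_basic_features and the demo rows are made up for illustration.

```python
import json

import pandas as pd


def parse_udmap(raw):
    """Return the udmap dict, or an empty dict when the value is not valid JSON."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


def add_basic_features(df):
    """Add day/hour/minute columns and one categorical column per udmap key."""
    out = df.copy()

    # Assumption: common_ts is a Unix timestamp in milliseconds.
    ts = pd.to_datetime(out["common_ts"], unit="ms")
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    out["minute"] = ts.dt.minute

    # Assumption: udmap is a JSON string; anything else yields no keys.
    parsed = out["udmap"].map(parse_udmap)
    keys = sorted({k for d in parsed for k in d})
    for key in keys:
        values = parsed.map(lambda d, key=key: d.get(key))
        out[f"udmap_{key}"] = values.astype("category")
    return out


# Made-up rows, only to show the shape of the result.
demo = pd.DataFrame({
    "eid": [26, 26],
    "udmap": ['{"key1": "3", "key2": "7"}', "unknown"],
    "common_ts": [1689673468244, 1689082941469],
})
features = add_basic_features(demo)
print(features[["day", "hour", "minute", "udmap_key1", "udmap_key2"]])
```

Rows whose udmap cannot be parsed get missing values in every udmap_* column, which is one way missing values can end up in the extracted udmap features.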
2. Data visualization analysis
The plotting code below needs some setup first. For details, see the following notebook:
https://www.kaggle.com/code/jcaliz/ps-s03e02-a-complete-eda/notebook
The notebook offers a rich set of visual analysis code and ideas that are worth borrowing.
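As a rough idea of what that setup involves, the plotting function relies on two globals: all_df, which stacks the training and test rows and labels them with a 'set' column, and colors, a palette list. The sketch below is an assumption about their shape, not a copy of the linked notebook; the tiny train_df and test_df tables and the hex colours are placeholders.

```python
import pandas as pd

# Hypothetical stand-ins for the competition's train and test tables.
train_df = pd.DataFrame({"x1": [4, 4, 1], "target": [0, 1, 0]})
test_df = pd.DataFrame({"x1": [4, 2]})

# Stack both tables and record where each row came from.
# Test rows have no target, so their target value is NaN after the concat.
all_df = pd.concat(
    [train_df.assign(set="train"), test_df.assign(set="test")],
    ignore_index=True,
)

# Placeholder palette; index 4 is a pink, used for the target-mean markers.
colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#e377c2"]

print(all_df)
```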
Drawing code:
import matplotlib.pyplot as plt
import seaborn as sns

# `all_df` (train and test rows stacked, labelled by a 'set' column) and the
# `colors` palette are expected to come from the setup in the linked notebook.
def plot_cate_large(col):
    # Percentage of each category within the train set and within the test set
    data_to_plot = (
        all_df.groupby('set')[col]
        .value_counts(True)*100
    )
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.barplot(
        data=data_to_plot.rename('Percent').reset_index(),
        hue='set', x=col, y='Percent', ax=ax,
        orient='v',
        hue_order=['train', 'test']
    )
    x_ticklabels = [x.get_text() for x in ax.get_xticklabels()]

    # Secondary axis to show mean of target
    ax2 = ax.twinx()
    scatter_data = all_df.groupby(col)['target'].mean()
    scatter_data.index = scatter_data.index.astype(str)
    ax2.plot(
        x_ticklabels,
        scatter_data.loc[x_ticklabels],
        linestyle='', marker='.', color=colors[4],
        markersize=15
    )
    ax2.set_ylim([0, 1])

    # Set x-axis tick labels every 5th value
    x_ticks_indices = range(0, len(x_ticklabels), 5)
    ax.set_xticks(x_ticks_indices)
    ax.set_xticklabels(x_ticklabels[::5], rotation=45, ha='right')

    # titles
    ax.set_title(f'{col}')
    ax.set_ylabel('Percent')
    ax.set_xlabel(col)

    # Replace the default legend with a single one above the plot
    legend = ax.get_legend()
    if legend is not None:
        # `legendHandles` was renamed to `legend_handles` in Matplotlib 3.7
        if hasattr(legend, 'legend_handles'):
            handles = legend.legend_handles
        else:
            handles = legend.legendHandles
        labels = [x.get_text() for x in legend.get_texts()]
        legend.remove()
    else:
        handles, labels = ax.get_legend_handles_labels()
    plt.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, 1.08), fontsize=12)
    plt.tight_layout()
    plt.show()
2.1 Distribution Analysis of User Dimensional Features
Notes on reading the plots:
- The plots show the distribution of the discrete variables ['eid', 'x3', 'x4', 'x5', 'x1', 'x2', 'x6', 'x7', 'x8']. Blue is the training set and yellow is the test set; the two distributions are basically the same.
- The pink points are the mean of target for each category value in the training set, that is, the proportion of target=1.
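The same quantities can be checked as numbers instead of read off a plot. The sketch below tabulates, for one column, each category's share in the training and test sets and the mean of target on the training rows. The function name category_summary and the demo table are hypothetical; it assumes a combined frame with 'set' and 'target' columns like the one used by the plotting code.

```python
import pandas as pd


def category_summary(all_df, col):
    """Per-category share (%) in each set plus the mean target on training rows."""
    share = (
        all_df.groupby("set")[col]
        .value_counts(normalize=True)
        .mul(100)
        .unstack("set", fill_value=0)
    )
    train_rows = all_df[all_df["set"] == "train"]
    # Categories that never occur in the training set get NaN here.
    share["target_mean"] = train_rows.groupby(col)["target"].mean()
    return share


# Made-up data: four training rows and two test rows.
demo_df = pd.DataFrame({
    "set": ["train"] * 4 + ["test"] * 2,
    "x1": [4, 4, 1, 4, 4, 2],
    "target": [0, 1, 0, 1, None, None],
})
summary = category_summary(demo_df, "x1")
print(summary)
```

A large gap between the train and test columns for some category would be a sign of distribution shift that the bar chart might hide when there are many categories.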
This graph mainly analyzes discrete variables with a small number of categories:
- The training and test sets have closely matching distributions.
- x1 is mainly concentrated at x1=4; x2 is distributed fairly evenly; x6 is basically concentrated on the two values 1 and 4; x7 is distributed fairly evenly and may be a key feature; x8 may be a gender feature and is less important.
- udmap_key: the features extracted from udmap contain missing values.
- x3: mainly concentrated at 41; that share is too large, so the feature importance is very low.
- x4: the target distribution differs across its categories, so it may be a key feature.
- x5: as with x4, target varies greatly across its categories, so it may be a key feature. However, it has too many distinct values, so take care to avoid sparsity when deriving features from it.
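Two common ways to keep a high-cardinality column such as x5 from producing sparse derived features are to merge rare categories into one bucket and to replace each category by its frequency. The sketch below shows both; it is an illustration rather than what was used for the scores above, and the function names, min_count, and other_label are hypothetical choices.

```python
import pandas as pd


def merge_rare_categories(series, min_count=50, other_label=-1):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)


def frequency_encode(series):
    """Map each category to its relative frequency in the series."""
    return series.map(series.value_counts(normalize=True))


# Made-up values standing in for x5.
x5 = pd.Series([1, 1, 1, 2, 2, 3])
merged = merge_rare_categories(x5, min_count=2)
freq = frequency_encode(x5)
print(merged.tolist())
print(freq.round(3).tolist())
```

Frequency encoding keeps the column as a single numeric feature, while the rare-bucket approach keeps it categorical with fewer levels; which works better here would need to be checked on the validation score.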
2.2 Distribution Analysis of Time Features
The plots mainly show how day and hour, extracted from common_ts, change:
- The value of day is strongly related to user growth: new users increase markedly at 10, 14 and 17.
- Old users show a corresponding growing trend; the plot covers the changes in new and old users from day=10 to day=18.
- The numbers of new and old users follow basically the same trend across the time periods of each day.
- Looking further at the raw data, the three peaks appear because these three time periods contain more data than the others.
- A further step is to plot, for each time period, its share of the whole day's user count, to analyze the data in more depth.
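The last suggestion can be sketched as follows: count rows per day, hour and target, then divide by the day's total so that days with more data no longer dominate. This is a minimal illustration that assumes day and hour columns have already been extracted from common_ts; the function name hourly_share and the demo rows are made up.

```python
import pandas as pd


def hourly_share(df):
    """Share of each day's rows that fall in each hour, split by target."""
    counts = (
        df.groupby(["day", "hour", "target"])
        .size()
        .rename("count")
        .reset_index()
    )
    # Total number of rows on the same day, repeated on every row of that day.
    day_total = counts.groupby("day")["count"].transform("sum")
    counts["share_of_day"] = counts["count"] / day_total
    return counts


# Made-up rows: day 10 has four records, day 11 has two.
demo_ts = pd.DataFrame({
    "day": [10, 10, 10, 10, 11, 11],
    "hour": [8, 8, 9, 9, 8, 8],
    "target": [1, 0, 0, 0, 1, 1],
})
shares = hourly_share(demo_ts)
print(shares)
```

Plotting share_of_day against hour, with one line per target value, gives the normalized view described in the last bullet.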