DataWhale Machine Learning Summer Camp Phase III - Task 2: Visual Analysis

DataWhale Machine Learning Summer Camp Phase 3: New User Prediction Challenge


Learning Record II (2023.08.23) - Visual Analysis

2023.08.17
Ran through the baseline and switched to a LightGBM baseline. Without any added features, online score: 0.52214;
with the baseline features added, online score: 0.78176;
with brute-force derived features and fine-tuned model parameters, online score: 0.86068
2023.08.23
Data analysis and derived features: 0.87488
Derived features and model tuning: 0.89817

Discussion and sharing video:
[DataWhale "User Addition Prediction Challenge" Communication and Sharing - Bilibili] https://b23.tv/zZMLtFG

1. Understanding the problem


The features in this competition fall into three main dimensions:

  • Behavioral dimension: eid, udmap
    • The keys of udmap are processed into a categorical feature
  • Time dimension: common_ts
    • Timestamp features extracted: day, hour, minute
  • User dimension: x1~x8
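
As a rough illustration of the two derived features listed above, the sketch below extracts day, hour, and minute from common_ts and turns the keys of udmap into one categorical column. It assumes that common_ts is a Unix timestamp in milliseconds (read as UTC) and that udmap is either a JSON object string or a non-JSON placeholder such as "unknown". The helper names (udmap_key_signature, add_basic_features) and the toy rows are made up for the example, and joining the key names into one string is only one possible encoding, not necessarily the one used for the scores above.

```python
import json

import pandas as pd


def udmap_key_signature(raw):
    """Name the keys present in a udmap string, or return None if it is not a JSON object."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return None  # e.g. the placeholder "unknown" or a missing value
    if not isinstance(parsed, dict):
        return None
    return "|".join(sorted(parsed))


def add_basic_features(df):
    """Add day/hour/minute and a categorical udmap_key column (assumed input formats)."""
    out = df.copy()
    ts = pd.to_datetime(out["common_ts"], unit="ms")  # assumption: milliseconds
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    out["minute"] = ts.dt.minute
    out["udmap_key"] = out["udmap"].map(udmap_key_signature).astype("category")
    return out


# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "common_ts": [1689673468244, 1689082941469],
    "udmap": ['{"key3":"67804","key2":"650"}', "unknown"],
})
features = add_basic_features(toy)
print(features[["day", "hour", "minute", "udmap_key"]])
```

Rows whose udmap cannot be parsed end up with a missing udmap_key, which is one way missing values can appear in that column.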

2. Data visualization analysis

Some setup is needed before the plotting code below will run. For details, see the following notebook:
https://www.kaggle.com/code/jcaliz/ps-s03e02-a-complete-eda/notebook
It contains a rich set of visual analysis code and ideas that are worth consulting.
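
The plotting function relies on two names that it does not define: all_df and colors. A minimal sketch of that setup, assuming all_df is simply the training and test frames stacked with a 'set' column, and colors is any list with at least five entries; the toy frames and hex values here are placeholders, not the notebook's actual settings:

```python
import pandas as pd

# Toy stand-ins for the real train/test tables (illustrative values only)
train_df = pd.DataFrame({"x1": [4, 4, 1], "target": [0, 1, 0]})
test_df = pd.DataFrame({"x1": [4, 2]})

# Stack both sets and remember where each row came from;
# test rows have no label, so their target becomes NaN
all_df = pd.concat(
    [train_df.assign(set="train"), test_df.assign(set="test")],
    ignore_index=True,
)

# Any palette works; index 4 is used for the target-mean markers
colors = ["#4c72b0", "#dd8452", "#55a868", "#c44e52", "#da8bc3"]
```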

Drawing code:

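import matplotlib.pyplot as plt
import seaborn as sns

# `all_df` (train and test rows with a 'set' column) and `colors` (a list of
# plot colors) are assumed to come from the setup step described above.
def plot_cate_large(col):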
    data_to_plot = (
        all_df.groupby('set')[col]
        .value_counts(True)*100
    )

    fig, ax = plt.subplots(figsize=(10, 6))

    sns.barplot(
        data=data_to_plot.rename('Percent').reset_index(),
        hue='set', x=col, y='Percent', ax=ax,
        orient='v',
        hue_order=['train', 'test']
    )

    x_ticklabels = [x.get_text() for x in ax.get_xticklabels()]

    # Secondary axis to show mean of target
    ax2 = ax.twinx()
    scatter_data = all_df.groupby(col)['target'].mean()
    scatter_data.index = scatter_data.index.astype(str)

    ax2.plot(
        x_ticklabels,
        scatter_data.loc[x_ticklabels],
        linestyle='', marker='.', color=colors[4],
        markersize=15
    )
    ax2.set_ylim([0, 1])

    # Set x-axis tick labels every 5th value
    x_ticks_indices = range(0, len(x_ticklabels), 5)
    ax.set_xticks(x_ticks_indices)
    ax.set_xticklabels(x_ticklabels[::5], rotation=45, ha='right')

    # titles
    ax.set_title(f'{col}')
    ax.set_ylabel('Percent')
    ax.set_xlabel(col)

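    # Collect the legend entries so that a single legend is drawn at the end
    handles = []
    labels = []
    legend = ax.get_legend()
    if legend is not None:
        # `legendHandles` was renamed to `legend_handles` in Matplotlib 3.7
        handles += (legend.legend_handles if hasattr(legend, 'legend_handles')
                    else legend.legendHandles)
        labels += [x.get_text() for x in legend.get_texts()]
    else:
        handles += ax.get_legend_handles_labels()[0]
        labels += ax.get_legend_handles_labels()[1]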

    ax.legend().remove()

    plt.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, 1.08), fontsize=12)
    plt.tight_layout()
    plt.show()

2.1 Distribution analysis of user-dimension features

Notes on the plots:

  1. Examine the distributions of the discrete variables ['eid', 'x3', 'x4', 'x5', 'x1', 'x2', 'x6', 'x7', 'x8']. Blue is the training set and yellow is the test set; the two distributions are basically the same
  2. The pink points show the mean of target for each category value in the training set, that is, the proportion of target=1
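
The numbers behind such a plot can also be inspected as a table. A minimal sketch, assuming an all_df with 'set' and 'target' columns where test rows have a missing target; the function name category_summary and the toy data are made up for the example:

```python
import pandas as pd


def category_summary(all_df, col):
    """Percent of rows per category in each set, plus the target mean per category."""
    percent = (
        all_df.groupby("set")[col]
        .value_counts(normalize=True)
        .mul(100)
        .unstack("set")  # one column per set: 'train' and 'test'
    )
    # Test rows have NaN targets, so the mean only reflects training rows
    target_rate = all_df.groupby(col)["target"].mean().rename("target_rate")
    return percent.join(target_rate)


# Toy data, invented for illustration only
all_df = pd.DataFrame({
    "set": ["train"] * 4 + ["test"] * 2,
    "x1": [4, 4, 1, 4, 4, 1],
    "target": [1, 0, 0, 1, None, None],
})
summary = category_summary(all_df, "x1")
print(summary)
```

Comparing the train and test columns row by row gives the same consistency check as the blue and yellow bars, and target_rate corresponds to the pink points.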

The first set of plots covers the discrete variables with a small number of categories:

  • The training and test distributions are fairly consistent
  • x1 is mainly concentrated at x1=4; x2 is distributed fairly uniformly; x6 is basically concentrated on the two values 1 and 4; x7 is distributed fairly uniformly and may be a key feature
  • x8 may be a gender feature and is less important
  • udmap_key is an extracted feature and contains missing values


  • x3 is mainly concentrated on the value 41; that share is so large that the feature's importance is very low
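
How dominant a single value is can be checked directly. A small sketch, with a made-up helper name (dominant_share) and toy data rather than the real x3 column:

```python
import pandas as pd


def dominant_share(series):
    """Return the most frequent value and the fraction of rows it covers."""
    shares = series.value_counts(normalize=True, dropna=False)
    return shares.index[0], float(shares.iloc[0])


# Toy column: nine rows equal to 41 and one other value
toy_x3 = pd.Series([41] * 9 + [7])
value, share = dominant_share(toy_x3)
print(value, share)
```

A feature whose top value covers nearly every row carries little information for splitting, which fits the low importance noted above.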


  • x4: the target distribution varies across its categories, so it may be a key feature
  • x5: as with x4, the target distribution varies greatly across categories, so it may be a key feature; however, it has too many categories, so care should be taken to avoid sparsity when deriving features
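
One common way to limit sparsity for a high-cardinality column such as x5 is to merge rare categories into a single bucket before deriving features. This is only a sketch of that general idea, not the approach behind the scores above; the helper name collapse_rare, the threshold, and the placeholder value -1 are all assumptions:

```python
import pandas as pd


def collapse_rare(series, min_count=50, other=-1):
    """Replace categories seen fewer than min_count times with a shared placeholder."""
    counts = series.map(series.value_counts())  # per-row frequency of its own category
    return series.where(counts >= min_count, other)


# Toy column: category 1 is frequent, 2 and 3 are rare
toy_x5 = pd.Series([1, 1, 1, 2, 3])
collapsed = collapse_rare(toy_x5, min_count=2)
print(collapsed.tolist())
```

The threshold is best chosen by looking at the category counts in the real data; the placeholder value should not collide with an existing category.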

2.2 Distribution analysis of time features

This section mainly plots how the data changes over the day and hour extracted from common_ts


  • The value of day is strongly related to user growth: new users increase markedly at day = 10, 14, and 17
  • Old users show a corresponding upward trend; the changes in new and old users are plotted from day=10 to day=18
  • New and old user counts follow basically the same trend across each time period of the day
  • A closer look at the raw data shows that the three peaks appear because these three periods contain more data than the others
  • A further step is to plot each time period's share of the whole day's user count to analyze the data in more detail
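
The share suggested in the last bullet can be computed before plotting. A minimal sketch, assuming day and hour columns have already been extracted from common_ts; the function name hourly_share and the toy rows are made up:

```python
import pandas as pd


def hourly_share(df):
    """Fraction of each day's rows that fall in each hour."""
    counts = df.groupby(["day", "hour"]).size()
    day_totals = counts.groupby(level="day").transform("sum")
    return counts / day_totals


# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "day": [10, 10, 10, 11],
    "hour": [8, 8, 9, 8],
})
share = hourly_share(toy)
print(share)
```

Because each day is normalized by its own total, days with more data no longer dominate the picture, so the hourly pattern can be compared across days.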


Origin blog.csdn.net/qq_38869560/article/details/132461146