DataWhale Machine Learning Summer Camp Phase 3
- New User Prediction Challenge


Study Record 1 (2023.08.18)

Already ran the baseline and switched it to a LightGBM baseline: without adding any features, the online score is 0.52214;
with the baseline features added, the online score is 0.78176;
with brute-force feature derivation and model parameter fine-tuning, the online score is 0.86068.

1. Understanding the problem

The competition data consists of about 620,000 training samples and 200,000 test samples, with 13 fields in total.

  • uuid is the unique identifier of a sample;
  • eid is the access behavior ID;
  • udmap is a behavior attribute, where key1 to key9 represent different behavior attributes, such as project name, project id, and other related fields;
  • common_ts is the time the application access record occurred (millisecond timestamp);
  • the remaining fields x1 to x8 are user-related attributes that have been anonymized;
  • target is the prediction target, i.e., whether the user is a new user.
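A minimal loading sketch (the file names train.csv and test.csv are my assumption; adjust to the actual competition paths). It also converts the millisecond timestamp so that time-based features can be derived later:

import pandas as pd

# Assumed file names; replace with the actual competition data paths
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# common_ts is a millisecond timestamp; convert it to datetime for time features
train_data['common_ts'] = pd.to_datetime(train_data['common_ts'], unit='ms')
test_data['common_ts'] = pd.to_datetime(test_data['common_ts'], unit='ms')

print(train_data.shape, test_data.shape)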

2. Missing value analysis

print('-----Missing Values-----')
print(train_data.isnull().sum())

print('\n')
print('-----Classes-------')
display(pd.merge(
    train_data.target.value_counts().rename('count'),
    train_data.target.value_counts(True).rename('%').mul(100),
    left_index=True,
    right_index=True
))

Analysis: the data has no missing values; there are 533,155 (85.943394%) negative samples and 87,201 (14.056606%) positive samples.

Handling the imbalanced class distribution:

  • threshold shift (a minimal sketch follows the code below)
  • set per-sample weights, e.g. for LightGBM:

weight_0 = 1.0  # weight for the majority class
weight_1 = 8.0  # weight for the minority class
dtrain = lgb.Dataset(X_train, label=y_train,
                     weight=y_train.map({0: weight_0, 1: weight_1}))
dval = lgb.Dataset(X_val, label=y_val,
                   weight=y_val.map({0: weight_0, 1: weight_1}))
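Threshold shift means tuning the decision threshold instead of using the default 0.5. A minimal sketch of my own (assuming a trained LightGBM booster `model` and the validation split `X_val`, `y_val`): scan candidate thresholds and keep the one that maximizes validation F1.

import numpy as np
from sklearn.metrics import f1_score

# Predicted positive-class probabilities on the validation set
val_prob = model.predict(X_val)

# Scan candidate thresholds and keep the one with the best validation F1
thresholds = np.arange(0.10, 0.90, 0.01)
scores = [f1_score(y_val, (val_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f'best threshold: {best_t:.2f}, F1: {max(scores):.5f}')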

3. Simple feature extraction

Behavior-related features: feature extraction around eid and udmap

  • value feature extraction from udmap: already given in the baseline (a hedged reconstruction appears after the key-extraction code below)
  • key feature extraction from udmap:
import json

def extract_keys_as_string(row):
    if row == 'unknown':
        return None
    else:
        parsed_data = json.loads(row)
        keys = list(parsed_data.keys())
        keys_string = '_'.join(keys)  # join the keys with an underscore
        return keys_string

train_df['udmap_key'] = train_df['udmap'].apply(extract_keys_as_string)
train_df['udmap_key'].value_counts()

[Figure: value counts of udmap_key]
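The value extraction mentioned above is already in the baseline; for reference only, a sketch along the same lines (my reconstruction, not the baseline's exact code) parses udmap into one column per key:

import json
import numpy as np

def extract_values(row):
    # Return a length-9 vector with the value of key1..key9 (0 if absent);
    # the key values in this data are numeric IDs
    v = np.zeros(9)
    if row == 'unknown':
        return v
    d = json.loads(row)
    for i in range(1, 10):
        v[i - 1] = d.get(f'key{i}', 0)
    return v

value_cols = [f'key{i}' for i in range(1, 10)]
train_df[value_cols] = np.vstack(train_df['udmap'].apply(extract_values))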

Observe the correspondence between eid and udmap_key

train_df.groupby('eid')['udmap_key'].unique()

[Figure: unique udmap_key values for each eid]

Analysis: eid and udmap_key are strongly correlated, almost in one-to-one correspondence, so behavior-related features can be constructed around eid, key, and value.
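One concrete direction (my own sketch, not the baseline): count and target-mean features around eid, with the same pattern applying to udmap_key and the key1..key9 value columns.

# Frequency of each eid: how common the access behavior is
eid_count = train_df['eid'].value_counts()
train_df['eid_count'] = train_df['eid'].map(eid_count)

# Mean target per eid: the historical new-user rate of the behavior
# (in practice, compute this out-of-fold to avoid target leakage)
eid_target_mean = train_df.groupby('eid')['target'].mean()
train_df['eid_target_mean'] = train_df['eid'].map(eid_target_mean)

# Same pattern for the extracted key combination
key_count = train_df['udmap_key'].value_counts()
train_df['udmap_key_count'] = train_df['udmap_key'].map(key_count)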

4. Data visualization

Discrete variables

Inspect the cardinality and values of each feature:

for i in train_data.columns:
    if train_data[i].nunique() < 10:
        print(f'{i}, {train_data[i].nunique()}: {train_data[i].unique()}')
    else:
        print(f'{i}, {train_data[i].nunique()}: {train_data[i].unique()[:10]}')

[Figure: cardinality and sample values of each column]
Analysis:

  • ['eid', 'x3', 'x4', 'x5'] are category features with many distinct values

  • ['x1', 'x2', 'x6', 'x7', 'x8'] are category features with few distinct values, and x8 is almost certainly a gender feature (see the LightGBM sketch after this list)
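Since these columns are categories rather than ordered numbers, LightGBM can consume them natively. A minimal sketch (the feature list is my reading of the analysis above, not the competition baseline):

import lightgbm as lgb

cat_cols = ['eid', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']

# LightGBM expects int codes or pandas 'category' dtype for categorical columns
for c in cat_cols:
    X_train[c] = X_train[c].astype('category')
    X_val[c] = X_val[c].astype('category')

dtrain = lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols)
dval = lgb.Dataset(X_val, label=y_val, categorical_feature=cat_cols)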

Discrete Variable Distribution Analysis

Study the distribution of the discrete variables ['eid', 'x3', 'x4', 'x5', 'x1', 'x2', 'x6', 'x7', 'x8']. Blue is the training set and yellow is the test set; the two distributions are basically the same. Each pink point is the mean target over one category value in the training set, i.e., the proportion of target=1.

Drawing code:

import matplotlib.pyplot as plt
import seaborn as sns

colors = sns.color_palette('tab10')  # assumption: the original notebook defines its own `colors`

def plot_cate_large(col):
    data_to_plot = (
        all_df.groupby('set')[col]
        .value_counts(True)*100
    )

    fig, ax = plt.subplots(figsize=(10, 6))

    sns.barplot(
        data=data_to_plot.rename('Percent').reset_index(),
        hue='set', x=col, y='Percent', ax=ax,
        orient='v',
        hue_order=['train', 'test']
    )

    x_ticklabels = [x.get_text() for x in ax.get_xticklabels()]

    # Secondary axis to show mean of target
    ax2 = ax.twinx()
    scatter_data = all_df.groupby(col)['target'].mean()
    scatter_data.index = scatter_data.index.astype(str)

    ax2.plot(
        x_ticklabels,
        scatter_data.loc[x_ticklabels],
        linestyle='', marker='.', color=colors[4],
        markersize=15
    )
    ax2.set_ylim([0, 1])

    # Set x-axis tick labels every 5th value
    x_ticks_indices = range(0, len(x_ticklabels), 5)
    ax.set_xticks(x_ticks_indices)
    ax.set_xticklabels(x_ticklabels[::5], rotation=45, ha='right')

    # titles
    ax.set_title(f'{col}')
    ax.set_ylabel('Percent')
    ax.set_xlabel(col)

    # remove axes to show only one at the end
    handles = []
    labels = []
    if ax.get_legend() is not None:
        handles += ax.get_legend().legendHandles
        labels += [x.get_text() for x in ax.get_legend().get_texts()]
    else:
        handles += ax.get_legend_handles_labels()[0]
        labels += ax.get_legend_handles_labels()[1]

    ax.legend().remove()

    plt.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, 1.08), fontsize=12)
    plt.tight_layout()
    plt.show()
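The function is then called once per discrete variable, e.g.:

for col in ['eid', 'x3', 'x4', 'x5', 'x1', 'x2', 'x6', 'x7', 'x8']:
    plot_cate_large(col)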

[Figures: train/test distribution plots for each discrete variable]

The next step is to analyze the data and build features.
