New User Addition Prediction Challenge study notes (iFlytek)

New User Addition Prediction Challenge:

2023 iFLYTEK AI Developer Competition-iFLYTEK Open Platform

Organizer: iFlytek

1. Competition background

iFlytek's open platform provides AI capabilities and solutions for different industries and scenarios, empowering developers' products and applications and helping developers solve practical problems with AI, so that products can listen, speak, see, recognize, understand, and think.

Predicting new user additions is a key step in analyzing usage scenarios and forecasting user growth, and it supports subsequent iterative upgrades of products and applications.

2. Competition tasks

This competition provides a large volume of application data from the iFlytek open platform as training samples. Participants need to build models based on the provided samples to predict new user additions.

3. Data description

The competition data consists of approximately 620,000 training samples and 200,000 test samples, with 13 fields in total. uuid is the unique identifier of a sample; eid is the access behavior ID; udmap is a behavior-attribute field whose keys key1 to key9 represent different behavior attributes, such as project name, project id, and other related fields; common_ts is the time at which the application access record occurred (a millisecond timestamp); the remaining fields x1 to x8 are anonymized user-related attributes. The field target is the prediction target, i.e. whether the user is a new user. The evaluation metric for this competition is f1_score; the higher the score, the better.

4. Practical process

This study was carried out through the Datawhale learning platform, building on the baseline it provides.

1. Data preprocessing
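Before preprocessing, the training and test data are assumed to be loaded with pandas; a minimal sketch (the file names are assumptions, not taken from the competition page):

import pandas as pd
import numpy as np

# Adjust the file names to the actual competition data files
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')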

# Overview of every column, including the non-numeric ones
train_data.describe(include='all')

Data preprocessing mainly covers missing value handling, outlier handling, and memory optimization.

1.1 Missing value processing

  • For categorical features: fill with the mode (the most frequent value), or fill in a new category such as 0, -1, or negative infinity.
  • For numerical features: fill with the mean, median, mode, maximum, or minimum; which statistic to choose depends on the specific problem.
  • For ordered data (such as time series): fill with the adjacent next or previous value.
  • Model-based filling: plain filling only produces a single typical value and ignores interactions with the other features. Instead, the column containing missing values can be modeled and its missing entries predicted. Although this method is more complicated and intuitively better than direct filling, its effect in an actual competition still needs to be tested. A minimal sketch of these filling strategies is shown after this list.
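A minimal sketch of the filling strategies above with pandas (df and the column names are purely illustrative):

# Categorical: fill with the mode, or with a new category such as -1
df['cat_col'] = df['cat_col'].fillna(df['cat_col'].mode()[0])
# alternatively: df['cat_col'] = df['cat_col'].fillna(-1)

# Numerical: fill with a chosen statistic, e.g. the median
df['num_col'] = df['num_col'].fillna(df['num_col'].median())

# Ordered data (e.g. a time series): forward / backward fill
df['ts_col'] = df['ts_col'].ffill()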

There are no missing values in this data, so there is no need to process missing values for the time being.

1.2 Outlier processing

  • Delete records containing outliers. This removes the uncertainty introduced by samples with outliers, but reduces the sample size.
  • Treat as missing values. Treat outliers as missing values and handle them with the missing value methods above. This groups outliers into one category and increases the usability of the data, but mixing outliers with missing values can affect the accuracy of the data.
  • Mean (median) correction. An outlier can be corrected with the average of the values belonging to the same category. The advantages and disadvantages are the same as treating it as a missing value.
  • No processing. Mine the data directly with the outliers left in place. Whether this works depends on the source of the outliers: if they are caused by input errors they will hurt the result, while if they are genuine records, keeping them preserves the most authentic and trustworthy information. A simple detection sketch is shown after this list.
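A simple sketch of IQR-based outlier detection with pandas (df and the column name are illustrative; the 1.5*IQR rule is just one common convention):

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
q1, q3 = df['num_col'].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df['num_col'] < q1 - 1.5 * iqr) | (df['num_col'] > q3 + 1.5 * iqr)

# Option 1: drop the flagged rows
df_clean = df[~outlier_mask]
# Option 2: treat them as missing values and reuse the filling strategies above
df.loc[outlier_mask, 'num_col'] = None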

1.3 Memory optimization

When using pandas on small data sets (under 100 MB), performance is generally not an issue. With larger data (100 MB to several GB), performance problems lengthen the running time, and operations may fail outright due to insufficient memory. After the subsequent feature engineering, this data set occupies a large amount of memory, so it is memory-optimized to improve running speed and model performance.

The following code reduces the data's memory footprint by optimizing numeric types:

# Downcast integer columns to the smallest unsigned integer type that fits
train_int = train_data.select_dtypes(include=['int'])
converted_int = train_int.apply(pd.to_numeric, downcast='unsigned')

# Downcast float columns to float32 where possible
train_float = train_data.select_dtypes(include=['float'])
converted_float = train_float.apply(pd.to_numeric, downcast='float')

# Write the downcast columns back into a copy of the original frame
optimized_train = train_data.copy()
optimized_train[converted_int.columns] = converted_int
optimized_train[converted_float.columns] = converted_float

The code uses pd.to_numeric() with the downcast argument to shrink numeric types, and DataFrame.select_dtypes() to select the integer and float columns whose dtypes are then optimized.
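To confirm the effect, the memory usage before and after can be compared; a small sketch (the helper name is my own):

def mem_mb(df):
    # total memory usage in megabytes, including object columns
    return df.memory_usage(deep=True).sum() / 1024 ** 2

print(f'before: {mem_mb(train_data):.2f} MB')
print(f'after:  {mem_mb(optimized_train):.2f} MB')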

2. Feature transformation

2.1 Non-numeric variable processing

The udmap field in the original data is a dictionary-like string; it is a behavior attribute whose keys key1 to key9 represent different behavior attributes, such as project name, project id, and other related fields. It is expanded into nine feature columns, key1 to key9, recording the value of each key (0 if the key is absent). A udmap_isunknown feature is then generated to indicate whether udmap is empty ('unknown'). Finally, the processed udmap features are concatenated with the original data to form a new data frame. The code is implemented as follows:

# Define the function udmap_onethot to expand the 'udmap' column into 9 key columns
def udmap_onethot(d):
    v = np.zeros(9)  # create an array of 9 zeros
    if d == 'unknown':  # if the 'udmap' value is 'unknown'
        return v  # return the zero array
    d = eval(d)  # parse the 'udmap' string into a dictionary (ast.literal_eval would be a safer alternative)
    for i in range(1, 10):  # iterate over 'key1' to 'key9'; note that 10 itself is excluded
        if 'key' + str(i) in d:  # if the current key exists in the dictionary
            v[i-1] = d['key' + str(i)]  # store the dictionary value at the corresponding index

    return v  # return the expanded array

# Use apply() to run udmap_onethot on the 'udmap' column of every sample
# np.vstack() stacks the results into a single array
train_udmap_df = pd.DataFrame(np.vstack(train_data['udmap'].apply(udmap_onethot)))
test_udmap_df = pd.DataFrame(np.vstack(test_data['udmap'].apply(udmap_onethot)))
# Name the columns of the new feature DataFrames
train_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]
test_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]
# Concatenate the expanded udmap features with the original data along the column axis
train_data = pd.concat([train_data, train_udmap_df], axis=1)
test_data = pd.concat([test_data, test_udmap_df], axis=1)

# Compare each sample's 'udmap' value with the string 'unknown', giving a boolean Series
# astype(int) converts the booleans to integers (0 or 1) for later numeric calculations and analysis
train_data['udmap_isunknown'] = (train_data['udmap'] == 'unknown').astype(int)
test_data['udmap_isunknown'] = (test_data['udmap'] == 'unknown').astype(int)

For the timestamp common_ts, extract features such as minute, hour, time period of the day, day of the month, day of the week, and whether it is a weekend, to fully exploit the time information and generate new features.

# Convert the millisecond timestamps to datetime
train_data['common_ts'] = pd.to_datetime(train_data['common_ts'], unit='ms')
test_data['common_ts'] = pd.to_datetime(test_data['common_ts'], unit='ms')

train_data['minute'] = train_data['common_ts'].dt.minute
test_data['minute'] = test_data['common_ts'].dt.minute

train_data['hour'] = train_data['common_ts'].dt.hour
test_data['hour'] = test_data['common_ts'].dt.hour

def encode_time_period(time):
    hour = time.hour

    if 2 <= hour < 5:
        return 1  # early morning
    elif 5 <= hour < 8:
        return 2  # morning
    elif 8 <= hour < 12:
        return 3  # forenoon
    elif 12 <= hour < 14:
        return 4  # noon
    elif 14 <= hour < 18:
        return 5  # afternoon
    elif 18 <= hour < 22:
        return 6  # evening
    else:
        return 7  # late night

train_data['period'] = train_data['common_ts'].apply(encode_time_period)
test_data['period'] = test_data['common_ts'].apply(encode_time_period)

train_data['day'] = train_data['common_ts'].dt.day
test_data['day'] = test_data['common_ts'].dt.day

train_data['weekday'] = train_data['common_ts'].dt.weekday
test_data['weekday'] = test_data['common_ts'].dt.weekday

train_data['weekend'] = train_data['common_ts'].apply(lambda x: 1 if x.dayofweek in [5, 6] else 0)
test_data['weekend'] = test_data['common_ts'].apply(lambda x: 1 if x.dayofweek in [5, 6] else 0)

2.2 Making continuous variables dimensionless

Standardization: feature values are assumed to roughly follow a normal distribution; after standardization they follow the standard normal distribution. The simplest transformation is zero-mean (z-score) normalization.

Interval scaling: Scale the feature value interval to a specific range, such as [0,1].
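For reference, a minimal sketch of both transforms with scikit-learn (this step is not part of the baseline, and the choice of columns here is purely illustrative):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()   # zero-mean, unit-variance standardization
minmax = MinMaxScaler()     # interval scaling to [0, 1]

# Illustrative only: scale two of the anonymized numeric attributes
x_std = scaler.fit_transform(train_data[['x1', 'x2']])
x_scaled = minmax.fit_transform(train_data[['x1', 'x2']])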

Scaling individual features is key when building models such as linear regression, KNN, and neural networks, but it has no effect on tree-based models such as decision trees. Since I plan to train and predict with random forest, XGBoost, LightGBM, CatBoost and similar models, this step is not applied here.

2.3 Discretization of continuous variables

Discretized features are very robust to abnormal data and make it easier to explore correlations in the data. Here supervised discretization is used, implemented with the scorecardpy library developed by Dr. Shichen Xie.

The scorecardpy library is a commonly used Python library for credit scorecards. It is the Python version of the R package scorecard and aims to make the development of traditional credit risk scorecard models easier and more effective by providing functions for common tasks. Its main functions are as follows:

  • Data division (split_df)
  • Filter variables (var_filter())
  • Decision tree binning (woebin, woebin_plot, woebin_adj, woebin_ply)
  • Score conversion (scorecard, scorecard_ply)
  • Model evaluation (perf_eva, perf_psi)

Reference links:

Logistic regression: German Credit risk control scorecard model based on Scorecardpy library - Zhihu

Full interpretation of scorecard modeling tool scorecardpy - Zhihu

GitHub - ShichenXie/scorecardpy: Scorecard Development in python, scorecard

In this study, the decision tree binning method from this library is used, combined with the decision tree model in sklearn, to bin some of the continuous variables.

from sklearn.tree import DecisionTreeClassifier

# Bin key1 with a shallow decision tree: each sample's bin value is the
# predicted positive-class probability of the leaf it falls into
dtree = DecisionTreeClassifier(max_depth=2)
dtree.fit(train_data[['key1']], train_data['target'])
train_data['key1_bin'] = dtree.predict_proba(train_data.key1.to_frame())[:, 1].round(4)
test_data['key1_bin'] = dtree.predict_proba(test_data.key1.to_frame())[:, 1].round(4)

# Bin key6 in the same way
dtree = DecisionTreeClassifier(max_depth=2)
dtree.fit(train_data[['key6']], train_data['target'])
train_data['key6_bin'] = dtree.predict_proba(train_data.key6.to_frame())[:, 1].round(4)
test_data['key6_bin'] = dtree.predict_proba(test_data.key6.to_frame())[:, 1].round(4)

# x3 gets a deeper tree, i.e. more bins
dtree = DecisionTreeClassifier(max_depth=4)
dtree.fit(train_data[['x3']], train_data['target'])
train_data['x3_bin'] = dtree.predict_proba(train_data.x3.to_frame())[:, 1].round(4)
test_data['x3_bin'] = dtree.predict_proba(test_data.x3.to_frame())[:, 1].round(4)
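The split thresholds each tree has learned (i.e. the bin edges) can be inspected if desired; a small sketch, refitting a depth-2 tree on key1 for illustration:

# Print the cut points learned for key1 (leaf nodes are marked with feature = -2)
dtree = DecisionTreeClassifier(max_depth=2)
dtree.fit(train_data[['key1']], train_data['target'])
print(sorted(dtree.tree_.threshold[dtree.tree_.feature >= 0]))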

import scorecardpy as sc

# Learn supervised (decision-tree) bins on the training set; woebin returns
# the binning rules for each variable
bins = sc.woebin(train_data[[
    'x4', 'x5', 'key2', 'key3', 'key4',
    'key5', 'minute', 'hour', 'period', 'day', 'weekday', 'target'
]],
                 y='target')

# Apply the learned bins (WoE transformation) to the training set
train_data[[
    'x4_bin', 'x5_bin',
    'key2_bin', 'key3_bin', 'key4_bin', 'key5_bin', 'minute_bin', 'hour_bin', 'period_bin',
    'day_bin', 'weekday_bin'
]] = sc.woebin_ply(
    train_data[[
        'x4', 'x5', 'key2', 'key3',
        'key4', 'key5', 'minute', 'hour', 'period', 'day', 'weekday'
    ]], bins)

# Apply the same bins to the test set
test_data[[
    'x4_bin', 'x5_bin',
    'key2_bin', 'key3_bin', 'key4_bin', 'key5_bin', 'minute_bin', 'hour_bin', 'period_bin',
    'day_bin', 'weekday_bin'
]] = sc.woebin_ply(
    test_data[[
        'x4', 'x5', 'key2', 'key3',
        'key4', 'key5', 'minute', 'hour', 'period', 'day', 'weekday'
    ]], bins)
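The bins learned by woebin can also be visualized with woebin_plot (listed among the package functions above) for a quick sanity check of the cut points:

# One plot per binned variable
sc.woebin_plot(bins)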

eid is the access behavior ID and should be treated as a discrete variable. The frequency (number of occurrences) and the target mean of each eid can be extracted and added as new features. When using the target variable it is essential not to leak any validation set information: all target-encoded features should be computed on the training set only and then merged or mapped onto the validation and test sets. Even though the target is present in the validation set, it must not enter any encoding computation, otherwise the validation error estimate will be overly optimistic.

# Extract the eid frequency feature
# map() maps each sample's eid to the frequency count of that eid in the training data
# train_data['eid'].value_counts() returns the occurrence count of each eid
train_data['eid_freq'] = train_data['eid'].map(train_data['eid'].value_counts())
test_data['eid_freq'] = test_data['eid'].map(train_data['eid'].value_counts())
# Extract the eid target-mean feature
# groupby() groups the rows by eid, then the target mean of each eid group is computed
# train_data.groupby('eid')['target'].mean() returns the target mean of each eid group
train_data['eid_mean'] = train_data['eid'].map(train_data.groupby('eid')['target'].mean())
# The test set is again encoded using information from the training set only
test_data['eid_mean'] = test_data['eid'].map(train_data.groupby('eid')['target'].mean())
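To further reduce leakage of each row's own target value into eid_mean on the training set, an out-of-fold variant can be used; a minimal sketch (my own variation, not part of the baseline):

from sklearn.model_selection import KFold
import numpy as np

# Out-of-fold target encoding for eid: each fold is encoded with statistics
# computed on the other folds, so a row never sees its own target value
oof = np.zeros(len(train_data))
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fit_idx, enc_idx in kf.split(train_data):
    fold_means = train_data.iloc[fit_idx].groupby('eid')['target'].mean()
    global_mean = train_data['target'].iloc[fit_idx].mean()
    oof[enc_idx] = train_data['eid'].iloc[enc_idx].map(fold_means).fillna(global_mean).values
train_data['eid_mean_oof'] = oof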
