[MathorCup Big Data Challenge, Semi-final Problem A: Used Car Valuation] Ideas and Python Implementation


Update time: 8:30, March 10, 2022. The competition has ended; please see the competition summary. This article will no longer be updated.

Related Links

(1) Summary of the preliminary round and semi-finals, program code and paper

(2) Question 1 complete idea and Python implementation code download

(3) Question 2 complete idea and Python implementation code download

Problem statement

Question 1: Building on Question 2 of the preliminary round, how would you model the problem if you needed to estimate a vehicle's transaction cycle accurately? Use Attachment 4, "Store Transaction Training Data", to build a transaction cycle prediction model, make predictions for Attachment 5, "Store Transaction Verification Data", and save the results in the Attachment 6 file, "Store Transaction Model Results". Be careful not to modify the format. Attachment 5 contains only fields 1 to 4 of Attachment 4. All carid values in Attachment 5, together with their related information, are included in Attachment 2, "Valuation Verification Data".

Question 2: When selling vehicles, a store must not only predict the future transaction cycle of the vehicles in stock accurately but also manage its inventory effectively (assuming that the store's site and staffing remain unchanged during the evaluation period), keeping costs under control (vehicles incur capital occupancy costs and parking space occupancy costs) so as to maximize the store's sales profit. Price is a very important factor in whether a vehicle sells. When managing inventory, the store needs to set or adjust vehicle prices according to the condition of the vehicles in stock and of newly received vehicles. On the one hand, hot-selling vehicles should be sold at a suitable price to preserve the store's profit. On the other hand, the prices of slow-selling vehicles need to be reduced to avoid greater losses. Assume that you are the store manager and that what you can decide is when to adjust the price of a given vehicle and by how much, so that the store's business goal (maximizing gross profit while minimizing costs) is achieved; labor costs and other staff-related costs are not considered here. Abstract the problem into a mathematical model yourself, build the store management model, and give the solution ideas and algorithm steps for the model. Assume that the business goal is evaluated once a month.

Complete the preliminary paper based on the answers to questions 1 and 2, and clarify your ideas, models, methods and results.

1 Ideas

1.1 Question 1

Question 1 is a regression problem.

Use Attachment 4 as the training set and Attachment 5 as the test set, fit a LightGBM (LGB) regression model, and round the predicted values up. Pay attention to how the transaction cycle is calculated. . . . (omitted; please download the complete idea)
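The calculation of the transaction cycle is omitted above. As a minimal sketch, assuming the cycle is simply the number of days between pushDate and withdrawDate (this definition is an assumption, and the toy rows below are made up), it could look like this:

```python
import math

import pandas as pd

# Toy rows standing in for Attachment 4; the values are made up for illustration.
df = pd.DataFrame({
    'carid': [1, 2, 3],
    'pushDate': ['2021-03-01', '2021-03-05', '2021-03-10'],
    'withdrawDate': ['2021-03-04', None, '2021-03-10'],
})

# Keep only vehicles that were actually sold.
sold = df[df['withdrawDate'].notna()].copy()

# Assumption: transaction cycle = whole days between listing and sale.
sold['transcycle'] = (
    pd.to_datetime(sold['withdrawDate']) - pd.to_datetime(sold['pushDate'])
).dt.days

# Rounding fractional predictions up to whole days.
rounded = [math.ceil(p) for p in [2.1, 0.0, 6.999]]
```

Unsold vehicles have no withdrawDate, so they are dropped before the subtraction rather than being assigned an arbitrary cycle.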

For the regression model's features, besides the feature crossing in the baseline I provide below, there are other ways to construct features, as follows.

Reference: Methods of feature construction

(1) Basic transformations of a single variable: x, x^2, sqrt(x), log(x), scaling.

(2) If the distribution of the variable is long-tailed, apply the Box-Cox transformation (a log transformation is fast but not necessarily a good choice).

(3) You can also check residuals or log-odds (for linear models) to see whether the relationship is strongly nonlinear.

(4) For categorical variables with relatively high cardinality, it is useful to create a feature representing how often each category occurs. These frequencies can be expressed as counts or as percentages of the total.

(5) For each possible value of the variable, estimate the mean of the target variable and use the result as a new feature.

(6) Create a feature from the ratio of the target variable.

(7) Select the two most important variables, compute their second-order interactions with each other and with the other variables, add them to the model, and compare the results with those of the original linear model.

(8) If you want a smoother solution, you can apply a radial basis function (RBF) kernel. This is equivalent to applying a smooth transition.

(9) If you feel you need covariates, you can apply a polynomial kernel or add the covariates explicitly.

(10) High-cardinality features: in the preprocessing stage, convert them into numerical variables using out-of-fold averages.
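Items (4), (5) and (10) can be illustrated with a short sketch. The column name brand, the helper names and the toy values are hypothetical and are not taken from the competition data; the out-of-fold scheme shown is one common way to do it, not necessarily the one used in the full solution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def frequency_encode(series):
    """Map each category to its share of all rows."""
    return series.map(series.value_counts(normalize=True))


def oof_mean_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Out-of-fold mean of the target per category.

    Each row is encoded with statistics from the other folds only, so its own
    target value never leaks into its feature.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(df):
        fold_means = df.iloc[fit_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[enc_idx] = df.iloc[enc_idx][cat_col].map(fold_means).to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean.
    return encoded.fillna(global_mean)


toy = pd.DataFrame({
    'brand': ['a', 'a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'a'],
    'transcycle': [3, 5, 10, 12, 9, 1, 4, 11, 2, 6],
})
toy['brand_freq'] = frequency_encode(toy['brand'])
toy['brand_oof_mean'] = oof_mean_encode(toy, 'brand', 'transcycle')
```

For the test set, the category means would normally be computed once on the whole training set and then mapped onto the test rows.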

. . .
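To make item (2) concrete, the sketch below compares a log transformation with a Box-Cox transformation on synthetic long-tailed data. The data is randomly generated and only stands in for a price-like variable; scipy.stats.boxcox requires strictly positive input.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, strictly positive, long-tailed values standing in for a price.
prices = rng.lognormal(mean=2.0, sigma=0.8, size=2000)

log_prices = np.log1p(prices)
# With no lambda given, boxcox picks it by maximum likelihood and returns it.
boxcox_prices, lam = stats.boxcox(prices)

skew_raw = stats.skew(prices)
skew_log = stats.skew(log_prices)
skew_boxcox = stats.skew(boxcox_prices)
```

Comparing the three skewness values is a quick way to check which transformation brings the distribution closest to symmetric before deciding between them.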

1.2 Question 2

Requirement: a mathematical model covering three things: whether to cut the price, how much to cut, and when to cut.

Viewed simply, this can also be treated as a regression problem; treated in full, it is a planning (optimization) problem. Because no data is provided for this question, a planning approach amounts to purely theoretical mathematical modeling. My ideas and implementation are given below.
. . . Omitted, please download the complete idea
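As a purely illustrative sketch of the planning view, and not the omitted model, the code below searches over "when to cut" and "how much to cut" for a single vehicle. The demand curve, the elasticity, the holding cost and all numbers are hypothetical assumptions; prices are in arbitrary units and the 30-day horizon mirrors the monthly evaluation.

```python
import itertools
import math


def daily_sale_prob(price, ref_price, base=0.05, elasticity=6.0):
    """Hypothetical demand curve: pricing below the reference sells faster."""
    return min(1.0, base * math.exp(-elasticity * (price / ref_price - 1.0)))


def expected_profit(p0, cost, ref_price, cut_day, cut_rate,
                    horizon=30, hold_cost=0.01):
    """Expected gross profit minus holding cost over one evaluation period."""
    unsold = 1.0  # probability the vehicle is still in stock at the start of the day
    total = 0.0
    for day in range(horizon):
        price = p0 if day < cut_day else p0 * (1.0 - cut_rate)
        total -= unsold * hold_cost            # capital and parking cost for this day
        q = daily_sale_prob(price, ref_price)
        total += unsold * q * (price - cost)   # margin earned if sold today
        unsold *= 1.0 - q
    return total


def best_policy(p0, cost, ref_price, horizon=30):
    """Grid search over (cut_day, cut_rate); cut_day == horizon means never cut."""
    cut_days = range(horizon + 1)
    cut_rates = [0.0, 0.02, 0.05, 0.08, 0.10]
    return max(
        itertools.product(cut_days, cut_rates),
        key=lambda dr: expected_profit(p0, cost, ref_price, dr[0], dr[1], horizon),
    )


cut_day, cut_rate = best_policy(p0=10.5, cost=9.0, ref_price=10.0)
```

A store-level model would add shared constraints such as the number of parking spaces, which is where the problem stops being separable per vehicle and becomes a genuine planning problem.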

2 Implementation

2.1 TXT to CSV

import csv
import os
import pickle
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as st
import seaborn as sns
from pylab import mpl
from tqdm import tqdm

tqdm.pandas()
warnings.filterwarnings('ignore')
# plt.rcParams['font.sans-serif'] = ['STSong']
# mpl.rcParams['font.sans-serif'] = ['STSong']  # set the default font
mpl.rcParams['axes.unicode_minus'] = False

# Attachment 5 is a tab-separated text file without a header row
data = pd.read_csv('./data/附件5:门店交易验证数据.txt', sep='\t', header=None)
data.columns = ['carid', 'pushDate', 'pushPrice', 'updatePriceTimeJson']
data.to_csv('./data/file5.csv', index=False)

2.2 Data preprocessing

df4 = pd.read_csv('./data/file4.csv')
df5 = pd.read_csv('./data/file5.csv')

(1) Calculate the transaction cycle

# Drop samples that were not sold
df_trans = df4[df4.withdrawDate.notna()]

# ... omitted; please download the complete code: https://mianbaoduo.com/o/bread/YpiXlpZx

train_cols = ['pushDate', 'pushPrice', 'transcycle']
# .copy() so that columns added later do not write into a view of df_trans
df_train = df_trans[train_cols].copy()
test_cols = ['pushDate', 'pushPrice']
df_test = df5[test_cols].copy()
df_train


import numpy as np
import scipy.stats as st
import seaborn as sns
import matplotlib.pyplot as plt

# Compare the listing price distribution before and after log1p, each with a fitted normal curve
plt.figure(figsize=(14, 5))
plt.subplot(122)
plt.title('Normal fit - transformed', fontsize=20)
sns.distplot(np.log1p(df_train['pushPrice']), kde=False, fit=st.norm)
plt.xlabel('Listing price', fontsize=20)
plt.subplot(121)
plt.title('Normal fit - raw', fontsize=20)
sns.distplot(df_train['pushPrice'], kde=False, fit=st.norm)
plt.xlabel('Listing price', fontsize=20)
plt.savefig('img/上架价格正态分布拟合.png', dpi=300)
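sns.distplot has been deprecated since seaborn 0.11, so the plot above may emit warnings or stop working on newer versions. A minimal sketch of the same comparison using only matplotlib and scipy is shown below; the prices are synthetic stand-ins for pushPrice, and the Agg backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=2.0, sigma=0.8, size=2000)  # synthetic stand-in for pushPrice

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fits = {}
for ax, (title, values) in zip(axes, [('raw', prices), ('log1p', np.log1p(prices))]):
    mu, sigma = st.norm.fit(values)            # maximum-likelihood normal fit
    fits[title] = (mu, sigma)
    grid = np.linspace(values.min(), values.max(), 200)
    ax.hist(values, bins=40, density=True, alpha=0.6)  # normalized histogram
    ax.plot(grid, st.norm.pdf(grid, mu, sigma))        # fitted normal density
    ax.set_title(title)
plt.close(fig)
```

In a notebook, replacing plt.close(fig) with fig.savefig(...) or plt.show() displays the two panels.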


(2) Extract temporal features

# Date handling (extract year, month and day)
df_train['pushDate'] = pd.to_datetime(df_train['pushDate'])
df_test['pushDate'] = pd.to_datetime(df_test['pushDate'])
df_train['pushDate_year'] = df_train['pushDate'].dt.year
df_train['pushDate_month'] = df_train['pushDate'].dt.month
df_train['pushDate_day'] = df_train['pushDate'].dt.day

df_test['pushDate_year'] = df_test['pushDate'].dt.year
df_test['pushDate_month'] = df_test['pushDate'].dt.month
df_test['pushDate_day'] = df_test['pushDate'].dt.day

del df_train['pushDate']
del df_test['pushDate']

(3) Transforming the data distribution

df_train['pushPrice'] = np.log1p(df_train['pushPrice'])
df_test['pushPrice'] = np.log1p(df_test['pushPrice'])
df_train.columns

Index(['pushPrice', 'update_price', 'barging_times', 'barging_price', 'transcycle', 'pushDate_year', 'pushDate_month', 'pushDate_day'], dtype='object')

(4) Feature crossing

# Cross statistics: aggregate each numeric feature within each category
def mean_abs_dev(s):
    # Mean absolute deviation (the 'mad' aggregation was removed in pandas 2.0)
    return (s - s.mean()).abs().mean()

def cross_cat_num(df, num_col, cat_col):
    for f1 in tqdm(cat_col):
        for f2 in tqdm(num_col):
            # Named aggregation; dict-based renaming is rejected by pandas >= 1.0
            feat = df.groupby(f1)[f2].agg(**{
                '{}_{}_max'.format(f1, f2): 'max',
                '{}_{}_min'.format(f1, f2): 'min',
                '{}_{}_median'.format(f1, f2): 'median',
                '{}_{}_sum'.format(f1, f2): 'sum',
                '{}_{}_mad'.format(f1, f2): mean_abs_dev,
            }).reset_index()
            df = df.merge(feat, on=f1, how='left')
    return df

# Cross the numeric feature with the categorical features
cross_num = ['pushPrice']
cross_cat = ['pushDate_year', 'pushDate_month', 'pushDate_day']
data_train = cross_cat_num(df_train, cross_num, cross_cat)  # first-order crossing
data_test = cross_cat_num(df_test, cross_num, cross_cat)  # first-order crossing
data_train.shape

(8000, 20)
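To show what the crossing produces, here is a small self-contained sketch on toy data. The helper group_stats is a hypothetical, trimmed-down variant written for illustration (three statistics, one column pair), and the toy values are made up.

```python
import pandas as pd


def group_stats(df, num_col, cat_col):
    """Attach per-category statistics of num_col to every row."""
    named = {
        f'{cat_col}_{num_col}_max': 'max',
        f'{cat_col}_{num_col}_min': 'min',
        f'{cat_col}_{num_col}_median': 'median',
    }
    feat = df.groupby(cat_col)[num_col].agg(**named).reset_index()
    # A left merge keeps the original row order and row count.
    return df.merge(feat, on=cat_col, how='left')


toy = pd.DataFrame({
    'pushDate_month': [1, 1, 1, 2, 2],
    'pushPrice': [10.0, 12.0, 17.0, 8.0, 9.0],
})
crossed = group_stats(toy, 'pushPrice', 'pushDate_month')
```

Every row listed in the same month receives the same group statistics, so each crossing adds one column per statistic without changing the number of rows.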

2.3 Model training

(1) Model training

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

train = data_train
test = data_test
train_y = train['transcycle']
del train['transcycle']
scaler = StandardScaler()
train_x = scaler.fit_transform(train)
# Reuse the statistics fitted on the training set instead of refitting on the test set
test_x = scaler.transform(test)

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression_l1',
    'metric': 'mae',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1,
}

def MAE_metric(y_true, y_pred):
    return metrics.mean_absolute_error(y_true, y_pred)

folds = 5
kfold = KFold(n_splits=folds, shuffle=True, random_state=5421)
preds_lgb = np.zeros(len(test_x))
for fold, (trn_idx, val_idx) in enumerate(kfold.split(train_x, train_y)):
    print('-------fold {}-------'.format(fold))
    x_trn, y_trn = train_x[trn_idx], train_y.iloc[trn_idx]
    x_val, y_val = train_x[val_idx], train_y.iloc[val_idx]

    train_set = lgb.Dataset(x_trn, y_trn)
    val_set = lgb.Dataset(x_val, y_val)
    # lgb
    lgbmodel = ...  # omitted; please download the complete code

    val_pred_lgb = lgbmodel.predict(
        x_val, predict_disable_shape_check=True)
    # Average the test predictions over the folds
    preds_lgb += lgbmodel.predict(test_x,
                                  predict_disable_shape_check=True) / folds
    val_mae = MAE_metric(y_val, val_pred_lgb)
    print('lgb val_mae {}'.format(val_mae))

-------fold 0-------
lgb val_mae 0.808706443185115
-------fold 1-------
lgb val_mae 0.955760771009792
-------fold 2-------
lgb val_mae 0.897388380375197
-------fold 3-------
lgb val_mae 0.883798531878621
-------fold 4-------
lgb val_mae 0.878992579304203

(2) Store the prediction result as TXT

import math
file5 = pd.read_csv('./data/file5.csv')
submit_file  = pd.DataFrame(columns=['id'])
submit_file['id'] = file5['carid']
# Round up
submit_file['transcycle'] = [math.ceil(i) for i in list(preds_lgb)]
submit_file['transcycle'] = submit_file['transcycle'].astype(int)

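i = 0
# Append the predictions to the result file as tab-separated "id<TAB>transcycle" lines
with open('./submit/附件6:门店交易模型结果.txt', 'a+', encoding='utf-8') as f:
    for line in submit_file.values:
        if i == 0:
            # The first row is skipped
            i += 1
            continue
        else:
            i += 1
            f.write(str(line[0]) + '\t' + str(line[1]) + '\n')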
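As an alternative to the manual loop, pandas can write the tab-separated file directly. This is a sketch with made-up ids and predictions and a temporary output path; whether the Attachment 6 template expects a header row is not stated here, so header=False is an assumption.

```python
import math
import os
import tempfile

import pandas as pd

preds = [2.1, 7.0, 13.4]   # made-up fractional predictions
carids = [101, 102, 103]   # made-up vehicle ids

submit = pd.DataFrame({
    'id': carids,
    'transcycle': [math.ceil(p) for p in preds],  # round up to whole days
})

# Hypothetical output location; replace with the real result file path.
out_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
submit.to_csv(out_path, sep='\t', header=False, index=False)

with open(out_path, encoding='utf-8') as f:
    written = f.read()
```

Unlike opening the file in 'a+' mode, to_csv overwrites the target by default, so running the script twice does not duplicate rows.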

See the top link for complete ideas and code downloads


Origin blog.csdn.net/weixin_43935696/article/details/123343796