Data Mining from Scratch - Used Car Transaction Price Prediction (Day 3: Modeling and Parameter Tuning)

Reduce the space occupied by the data in memory

import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    # memory_usage() reports bytes, so convert to MB before printing
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the narrowest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # pick the narrowest float type whose range contains the column
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # string columns are stored as categories
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

# load the previously prepared feature CSV
sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))

Memory usage of dataframe is 59.22 MB
Memory usage after optimization is: 15.94 MB
Decreased by 73.1%
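As a rough illustration of where the saving comes from, the sketch below builds a small toy frame and compares its footprint before and after narrowing the dtypes by hand. The column names and sizes are made up for the example and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values chosen so that they fit in narrower dtypes.
toy = pd.DataFrame({
    "small_int": np.arange(1000, dtype=np.int64) % 100,   # values 0..99 fit in int8
    "small_float": np.linspace(0.0, 1.0, 1000),            # float64 by default
})

before = toy.memory_usage(index=False).sum()

slim = toy.copy()
slim["small_int"] = slim["small_int"].astype(np.int8)         # 8 bytes -> 1 byte per value
slim["small_float"] = slim["small_float"].astype(np.float32)  # 8 bytes -> 4 bytes per value

after = slim.memory_usage(index=False).sum()
print(f"{before} bytes -> {after} bytes")
```

One caveat worth keeping in mind: float16 carries only about three significant decimal digits, so downcasting to it trades precision for memory.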

continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model']]

Linear regression, five-fold cross-validation, and simulating real business conditions

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']

Simple modeling

from sklearn.linear_model import LinearRegression
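# The normalize=True argument was removed in scikit-learn 1.2; for ordinary
# least squares it does not change the predictions, so it is dropped here.
model = LinearRegression()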
model = model.fit(train_X, train_y)

Plotting feature v_9 against the label shows that the model's predictions (blue dots) are distributed quite differently from the true labels (black dots), and some predicted values are even below 0. This indicates that the model has problems.

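import matplotlib.pyplot as plt

# subsample_index is not defined in the text above; a random sample of 50 rows is assumed here
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from true price')
plt.show()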

Plotting the label (price) shows that it follows a long-tailed distribution, which is bad for modeling. The reason is that many models assume a normally distributed error term, and long-tailed data violate this assumption.

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
# np.quantile(train_y, 0.9) is the 90% quantile of train_y.
# The line below drops prices above that quantile, i.e. it truncates the long tail.
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])


Here we apply a log(x + 1) transform to the label, which brings it close to a normal distribution

train_y_ln = np.log(train_y + 1)
import seaborn as sns
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
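The effect of the transform can also be checked numerically rather than by eye. The sketch below is a minimal illustration on synthetic log-normal "prices" (an assumption standing in for the real price column): it computes the sample skewness, which is large for a long right tail and close to 0 for a symmetric, roughly normal distribution.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
# Synthetic long-tailed "prices" (log-normal); NOT the competition data.
fake_price = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

raw_skew = skewness(fake_price)
log_skew = skewness(np.log1p(fake_price))   # log1p(x) == log(x + 1)
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

On the real data the same helper could be applied to train_y and train_y_ln.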


Visualizing again, the predictions are now much closer to the true values, and no abnormal (negative) predictions appear

# refit on the log-transformed label before predicting
model = model.fit(train_X, train_y_ln)

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
# np.expm1 is the inverse of log(x + 1), so predictions are mapped back to prices
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()
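Because the label was transformed with log(x + 1), mapping predictions back to prices calls for exp(x) - 1 rather than plain exp(x). The small sketch below, on made-up prices, shows the round trip with NumPy's np.log1p and np.expm1 and the constant offset that plain np.exp leaves behind.

```python
import numpy as np

prices = np.array([0.0, 499.0, 12000.0])   # made-up example prices

log_prices = np.log1p(prices)       # same as np.log(prices + 1)
restored = np.expm1(log_prices)     # inverse transform: exp(x) - 1
shifted = np.exp(log_prices)        # plain exp leaves every value 1 too high

print(restored)
print(shifted)
```

For prices in the thousands an offset of 1 is barely visible in a scatter plot, but it does bias every prediction by the same amount.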


Five-fold cross-validation

When training a model, people usually split the whole data set into three parts (as with the MNIST handwritten digit data): a training set (train_set), a validation set (valid_set), and a test set (test_set). The split exists to make sure the measured training effect is trustworthy. The test set is easy to understand: it takes no part in training at all and is used only to observe the final performance. The training and validation sets are what the following discussion is about.
In practice, a model usually fits its own training data quite well (and the result is sensitive to initial conditions), but fits data outside the training set less satisfactorily. Therefore we usually do not use all of the data for training. Instead we hold out a part that takes no part in training and use it to test the parameters learned from the rest, which gives a relatively objective measure of how well those parameters fit data outside the training set. This idea is called cross-validation (Cross Validation).
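To make the splitting concrete, here is a minimal sketch using scikit-learn's KFold on ten made-up samples. It is only an illustration of how five folds partition the indices; cross_val_score with cv=5 performs an equivalent split internally for a regressor.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)   # ten made-up samples

kf = KFold(n_splits=5)                 # no shuffling: contiguous folds
folds = list(kf.split(X_toy))
for i, (train_idx, valid_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Each sample lands in the validation part of exactly one fold and in the training part of the other four.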

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer

Define a wrapper that log-transforms both the true values and the predictions before the metric is computed

def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))

Five-fold cross-validation of the linear regression model on the untransformed label

print('AVG:', np.mean(scores))

AVG: 1.3641908155886227 (average MAE over the 5 folds)

MAE (Mean Absolute Error) is the average of the absolute differences between the predicted and true values. It reflects the actual size of the prediction error well.
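As a worked example of the definition, MAE = (1/n) * sum(|y_i - yhat_i|), the sketch below computes it by hand on three made-up values and compares the result with scikit-learn's mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, 5.0, 10.0])   # made-up true values
y_pred = np.array([2.0, 7.0, 10.0])   # made-up predictions

mae_manual = np.mean(np.abs(y_true - y_pred))      # (1 + 2 + 0) / 3
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)
```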

Five-fold cross-validation of the linear regression model on the log-transformed label (error 0.19)

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))

AVG: 0.19382863663604424
The MAE decreased from 1.364 to 0.194, a large reduction in error

In practice, however, five-fold cross-validation can give an unrealistic picture on time-related data sets, because it lets the model train on data from the future

Using 2018 used-car prices to predict 2017 prices is clearly unreasonable, so we can instead split the data by time: the earliest 4/5 of the samples form the training set and the latest 1/5 form the validation set

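import datetime     # datetime is not actually used here; presumably the data set is already ordered by time
sample_feature = sample_feature.reset_index(drop=True)      # reset the index
split_point = len(sample_feature) // 5 * 4      # split point: the first 4/5 is used for training

# .iloc is end-exclusive, so the row at split_point goes only to the validation set
# (with .loc[:split_point] it would end up in both sets)
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()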

train_X = train[continuous_feature_names]
train_y_In = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_In = np.log(val['price'] + 1)

model = model.fit(train_X, train_y_In)
mean_absolute_error(val_y_In, model.predict(val_X))

- The MAE is 0.196, not very different from the five-fold cross-validation result
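The single 4/5 vs 1/5 split above can be generalized to several time-ordered splits. The sketch below uses scikit-learn's TimeSeriesSplit on ten made-up samples, assuming (as above) that the rows are already sorted from oldest to newest. This is an alternative to the manual split, not what the original code does.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(10).reshape(-1, 1)   # ten made-up samples, oldest first

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X_toy))
for train_idx, valid_idx in splits:
    print(f"train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

In every split the validation indices come strictly after the training indices, so the model is never evaluated on data older than what it was trained on. An instance of TimeSeriesSplit can be passed as the cv argument of cross_val_score.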

Plotting the learning curve and the validation curve

from sklearn.model_selection import learning_curve, validation_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)     # if ylim is given, use it as the y-axis range
    plt.xlabel('Training example')
    plt.ylabel('score')
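    # learning_curve returns the absolute training-set sizes; reusing the name
    # train_sizes makes the x-axis show sample counts instead of fractions
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,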
                                                           train_sizes=train_sizes,
                                                           scoring=make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # fill_between() arguments:
    #   train_sizes                           - x coordinates of the shaded band
    #   train_scores_mean - train_scores_std  - lower edge of the band
    #   train_scores_mean + train_scores_std  - upper edge of the band
    #   color                                 - color of the band
    #   alpha                                 - opacity of the band in [0, 1]; larger is more opaque
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color='r')
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color='g')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training score')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='g', label='Cross-validation score')
    plt.legend(loc='best')
    return plt

plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_In[:1000],
                    ylim=(0.0, 0.5), cv=5, n_jobs=1)
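The code above imports validation_curve but only plots a learning curve. As a hedged sketch of the other half, the example below runs validation_curve on synthetic data with a Ridge model and its alpha parameter. The data, the choice of Ridge, and the alpha grid are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))                       # synthetic features
y_toy = X_toy @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=200)

alphas = np.logspace(-3, 3, 7)                          # illustrative grid, not tuned
train_scores, valid_scores = validation_curve(
    Ridge(), X_toy, y_toy,
    param_name="alpha", param_range=alphas,
    cv=5, scoring="neg_mean_absolute_error",
)
# Scores are negated MAE, so flip the sign to get errors.
valid_mae = -valid_scores.mean(axis=1)
for a, m in zip(alphas, valid_mae):
    print(f"alpha={a:g}: validation MAE={m:.3f}")
```

Where a learning curve varies the amount of training data, a validation curve varies one hyperparameter and shows how the training and validation scores respond.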

Embedded feature selection - in most cases, feature selection is done with embedded methods

1. L1 regularization - Lasso regression -
The weights are constrained to a square region (in two dimensions). The minimum of the loss function often lies on a corner of the square, where many weights are exactly 0 (in higher dimensions), so the model becomes sparse: L1 produces a sparse weight matrix, which can then be used for feature selection.
2. L2 regularization - ridge regression -
The weights are constrained to a circular region (in two dimensions). Because a circle has no corners, the minimum of the loss function generally does not drive weights to exactly zero, but it does make them as small as possible. The result is a model whose parameters are all relatively small. Such a model is relatively simple, adapts to different data sets, and avoids overfitting to some extent.
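For reference, the two penalties can be written as objective functions. The forms below follow the way scikit-learn defines its Lasso and Ridge estimators, with weight vector w, n training samples, and regularization strength alpha. The intercept is omitted for brevity.

```latex
\text{Lasso (L1):}\quad \min_{w}\ \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
\qquad
\text{Ridge (L2):}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
```

The L1 term sums absolute values, whose constraint region has corners on the coordinate axes. The L2 term sums squares, whose constraint region is round.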

Let's compare the three models: plain linear regression, Lasso regression (L1 added), and ridge regression (L2 added)

from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
models = [LinearRegression(), Ridge(), Lasso()]
result = dict()     # dictionary that collects the results
for model in models:
    model_name = str(model).split('(')[0]   # strip the parentheses, keep only the class name
    scores = cross_val_score(model, X=train_X, y=train_y_In, verbose=0, cv=5,       # five-fold cross-validation
                             scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

model_Lr = LinearRegression().fit(train_X, train_y_In)
print('intercept:' + str(model_Lr.intercept_))
# recent seaborn versions require x and y as keyword arguments
sns.barplot(x=abs(model_Lr.coef_), y=continuous_feature_names)


We find that v_6, v_8, and v_9 carry large weights.
L2 regularization tends to make the weights as small as possible during fitting, so the final model has small parameters throughout. A model with small parameter values is generally considered relatively simple: it adapts to different data sets and avoids overfitting to some extent. Imagine a linear regression equation with a large parameter: a small shift in the data then changes the result a lot. If the parameters are small enough, a shift in the data has little effect on the result. The technical term for this is "resistance to perturbation".

- Ridge regression: more parameters contribute to the model, and they are all relatively small, which avoids overfitting to some extent and gives strong resistance to perturbation

model_Lasso = Lasso().fit(train_X, train_y_In)
print('intercept:' + str(model_Lasso.intercept_))
# recent seaborn versions require x and y as keyword arguments
sns.barplot(x=abs(model_Lasso.coef_), y=continuous_feature_names)

- Lasso regression: the two features power and used_time turn out to be very important. L1 regularization helps produce a sparse weight matrix, which can then be used for feature selection
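One way to turn the sparse Lasso weights into an actual feature selection step is scikit-learn's SelectFromModel. The sketch below is a minimal illustration on synthetic data in which only two of six features influence the target. The data and the value alpha=0.1 are assumptions for the example, not settings from this project.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 6))                       # six synthetic features
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])      # only features 0 and 3 matter
y_toy = X_toy @ true_w + rng.normal(scale=0.1, size=300)

# alpha=0.1 is an arbitrary illustrative value, not a tuned one.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X_toy, y_toy)
kept = np.flatnonzero(selector.get_support())
print("kept feature indices:", kept.tolist())
print("lasso coefficients:", selector.estimator_.coef_.round(2).tolist())
```

Calling selector.transform(X_toy) would then return only the kept columns, which could be passed on to any downstream model.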


Origin blog.csdn.net/fengshiyu1997/article/details/105234851