Rent prediction: practical model building (1) - understanding the competition problem

Understanding the data


Before building a model or doing any data analysis, you need a clear understanding of the data. Whether it is a competition or an enterprise project, getting to know the data always comes first, and it directly affects the final result. This article walks through a competition as its example, so understanding the data starts from the competition background; once you understand the background, you know what kind of model to build. The competition asks contestants to build a model on a given dataset and predict rents. The data categories in the dataset include rental listings, residential communities, second-hand housing, supporting infrastructure, housing, land, population, customers, and actual rents. This is a typical regression prediction problem.

 

1. Importing the dataset

# Import the warnings package and use a filter to ignore warning messages.
import warnings
warnings.filterwarnings('ignore')

# GBDT
from sklearn.ensemble import GradientBoostingRegressor
# XGBoost
import xgboost as xgb
# LightGBM
import lightgbm as lgb

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the data
data_train = pd.read_csv('./dataset/train_data.csv')
data_train['Type'] = 'Train'
data_test = pd.read_csv('./dataset/test_a.csv')
data_test['Type'] = 'Test'
data_all = pd.concat([data_train, data_test], ignore_index=True)
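
A quick sanity check right after loading, using only the objects created above, confirms the sizes and the train/test split inside data_all:

# Sanity check: sizes of the three frames and the Type split inside data_all
print(data_train.shape, data_test.shape, data_all.shape)
print(data_all['Type'].value_counts())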

 

2. A first look at the data

The info() function gives an overview of all columns, while describe() only summarizes the numeric ones. Let's take a look at the results.

data_train.info()

 

As can be seen, this dataset contains both numeric and discrete data. Two variables, pv and uv, represent "the number of page views by tenants in this plate for the month" and "the total number of tenants viewing pages in this plate for the month". The target variable is tradeMoney. Most columns are of int or float type, while some fields are of object type, that is, Chinese or English text, such as the rentType field; these will need to be processed later.


data_train.describe()
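
To confirm the split between text and numeric columns programmatically, here is a small sketch using pandas' select_dtypes; nothing in it is specific to the competition beyond the columns already loaded:

# Text (object) columns that will need encoding later
object_cols = data_train.select_dtypes(include='object').columns.tolist()
print('object columns:', object_cols)

# Numeric columns, including the target tradeMoney
numeric_cols = data_train.select_dtypes(include='number').columns.tolist()
print('numeric columns:', numeric_cols)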

 

3. Missing-value analysis

From the info() output above we already know which variables contain missing values. Here we analyze them with a small missing-value helper function.

def missing_values(df):
    alldata_na = df.isnull().sum().to_frame('missingNum')
    alldata_na['existNum'] = len(df) - alldata_na['missingNum']  # number of non-missing values per variable
    alldata_na['sum'] = len(df)  # total number of samples
    alldata_na['missingRatio'] = alldata_na['missingNum'] / len(df) * 100  # missing rate (%)
    alldata_na['dtype'] = df.dtypes
    # ascending defaults to True; here missingNum is sorted in descending order
    alldata_na = alldata_na[alldata_na['missingNum'] > 0].reset_index().sort_values(by=['missingNum', 'index'], ascending=[False, True])
    alldata_na.set_index('index', inplace=True)

    return alldata_na

missing_values(data_train)
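
The same helper can be pointed at the test set, and once the gaps are known, a generic first-pass fill might look like the sketch below. The strategy here (median for numeric columns, a 'missing' placeholder for text columns) is an assumption for illustration; the real treatment should follow each field's meaning.

missing_values(data_test)

# A generic first-pass fill; actual handling should depend on each field's meaning
def simple_fill(df):
    df = df.copy()
    for col in df.columns:
        if df[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())  # numeric: fill with the median
            else:
                df[col] = df[col].fillna('missing')  # text: fill with a placeholder label
    return df

data_train_filled = simple_fill(data_train)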

 

 

4. Monotonic feature columns

# Check for monotone feature columns (a monotone column is very likely a time column)
def increasing(vals):
    cnt = 0
    len_ = len(vals)
    for i in range(len_ - 1):
        if vals[i + 1] > vals[i]:
            cnt += 1
    return cnt

fea_cols = [col for col in data_train.columns]
for col in fea_cols:
    cnt = increasing(data_train[col].values)
    if cnt / data_train.shape[0] >= 0.55:
        print('monotone feature:', col)
        print('number of increasing steps:', cnt)
        print('ratio of increasing steps:', cnt / data_train.shape[0])

5. nunique distribution of the features

# nunique distribution of the categorical features
# Note: categorical_feas is not defined in the original snippet; it is assumed here
# to be the categorical columns discussed below
categorical_feas = ['rentType', 'houseType', 'houseFloor', 'region', 'plate',
                    'houseToward', 'houseDecoration', 'buildYear', 'communityName']

for feature in categorical_feas:
    print(feature + " has the following value distribution:")
    print(data_train[feature].value_counts())
    if feature != 'communityName':  # too many distinct values, skip the plot for now
        plt.hist(data_train[feature], bins=3)
        plt.show()
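
The cardinalities quoted below can also be read off in one line, as a complement to the value_counts loop above:

# Number of distinct values per categorical feature
print(data_train[categorical_feas].nunique().sort_values(ascending=False))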

 

 

rentType: 4 values, and the vast majority are an uninformative "unknown" type;
houseType: 104 values, the vast majority are 3 rooms or fewer;
houseFloor: 3 values, fairly evenly distributed;
region: 15 values;
plate: 66 values;
houseToward: 10 values;
houseDecoration: 4 values, more than half are "other";
buildYear: 80 values;
communityName: 4236 values, with a fairly sparse distribution.
This step prepares for the later data processing and feature engineering: first understand the meaning and distribution of each field, then handle each categorical variable differently according to its actual meaning.
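
As a preview of that later processing, here is one possible sketch of treating low- and high-cardinality columns differently; LabelEncoder for the small categorical fields and frequency encoding for communityName are illustrative choices, not the approach required by the competition:

from sklearn.preprocessing import LabelEncoder

data_encoded = data_train.copy()

# Low-cardinality fields: a simple label encoding is often enough
for col in ['rentType', 'houseFloor', 'houseToward', 'houseDecoration', 'region', 'plate']:
    data_encoded[col] = LabelEncoder().fit_transform(data_encoded[col].astype(str))

# communityName has 4236 sparse values: frequency encoding keeps it as one numeric column
freq = data_train['communityName'].value_counts()
data_encoded['communityName_freq'] = data_train['communityName'].map(freq)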
 

6. Feature values appearing at least 100 times

# Print the feature values that appear at least 100 times
for feature in categorical_feas:
    df_value_counts = pd.DataFrame(data_train[feature].value_counts())
    df_value_counts = df_value_counts.reset_index()
    df_value_counts.columns = [feature, 'counts'] # change column names
    print(df_value_counts[df_value_counts['counts'] >= 100])
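
A common follow-up to this table is to keep only the frequent levels and fold the rare ones into an 'other' bucket; a minimal sketch, where the 100 threshold mirrors the loop above and the 'other' label and houseType example are my own illustration:

# Fold values that appear fewer than 100 times into an 'other' bucket
def merge_rare(series, min_count=100, other='other'):
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)

data_train['houseType_merged'] = merge_rare(data_train['houseType'])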

 

 

7. Label distribution

# Label distribution
fig, axes = plt.subplots(2, 3, figsize=(20, 12))
sns.distplot(data_train['tradeMoney'], ax=axes[0][0])
sns.distplot(data_train[(data_train['tradeMoney'] <= 20000)]['tradeMoney'], ax=axes[0][1])
sns.distplot(data_train[(data_train['tradeMoney'] > 20000) & (data_train['tradeMoney'] <= 50000)]['tradeMoney'], ax=axes[0][2])
sns.distplot(data_train[(data_train['tradeMoney'] > 50000) & (data_train['tradeMoney'] <= 100000)]['tradeMoney'], ax=axes[1][0])
sns.distplot(data_train[(data_train['tradeMoney'] > 100000)]['tradeMoney'], ax=axes[1][1])
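
The panels above split tradeMoney at 20000, 50000 and 100000, so the target spans several orders of magnitude. A complementary view that is not part of the original analysis, but often useful for such long-tailed targets, is the distribution on a log scale:

# Target distribution on a log scale
sns.distplot(np.log1p(data_train['tradeMoney']))
plt.title('log1p(tradeMoney)')
plt.show()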

 

 



Origin blog.csdn.net/Moby97/article/details/103881014