Data Mining from Scratch: Used Car Transaction Price Prediction, Exploratory Data Analysis

0. Introduction

Exploratory data analysis (EDA) has the following objectives:

  • The main value of EDA is to become familiar with the data set, understand it, and validate it, so as to confirm that it can be used for subsequent machine learning or deep learning.
  • Once we understand the data set, the next step is to understand the relationships among the variables and between the variables and the target.
  • Guide data science practitioners through data processing and feature engineering, so that the structure of the data set and its features make the subsequent prediction problem more reliable.
  • Complete an exploratory analysis of the data, and check in with a summary of the data in charts or text.

For the code, see my GitHub.

1. Code Example

All code runs in Jupyter Notebook, so you can see the result of each data-processing step as you go.

1.1 Loading the data science and visualization libraries

  • Data science libraries: pandas, numpy, scipy
  • Visualization libraries: matplotlib, seaborn
  • Other: warnings, missingno
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

1.2 Loading data

## 1) Load the training set and the test set
path = './'
Train_data = pd.read_csv(path+'used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv(path+'used_car_testA_20200313.csv', sep=' ')

I put the data in the current folder; "./" stands for the current folder.
All features have been desensitized.

Field              Description
SaleID             Transaction ID, unique code
name               Car trade name, desensitized
regDate            Car registration date, e.g. 20160101 for January 1, 2016
model              Model code, desensitized
brand              Car brand, desensitized
bodyType           Body type: limousine: 0, microcar: 1, van: 2, bus: 3, convertible: 4, two-door car: 5, commercial vehicle: 6, mixer truck: 7
fuelType           Fuel type: gasoline: 0, diesel: 1, LPG: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6
gearbox            Transmission: manual: 0, automatic: 1
power              Engine power, in the range [0, 600]
kilometer          Distance the car has traveled, in units of 10,000 km
notRepairedDamage  Whether the car has damage that has not been repaired: yes: 0, no: 1
regionCode         Region code, desensitized
seller             Seller: individual: 0, non-individual: 1
offerType          Offer type: offer: 0, (unlabeled): 1
creatDate          Listing date, i.e. when the car went on sale
price              Used car transaction price (prediction target)
v series features  Anonymous features v0, v1, v2, ..., v14, 15 in total

Take a quick look at the first and last few rows of the data:

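## 2) Take a quick look at the data (head() + shape)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([Train_data.head(), Train_data.tail()])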

The test data can be viewed in the same way.

1.3 Data overview

  • View the first 5 rows: dataframe.head()
  • View information about the data, including each field's name, data type, and non-null count: data.info()
  • View summary statistics (count / mean / std / min / 25% / 50% / 75% / max): data.describe()
  • View the size of the dataframe: dataframe.shape
  • Sorting columns and arrays
    • Number the rows within each group of a column, in ascending (or descending) order: df.groupby(['column name']).cumcount()
    • Sort by the values of a column or row: sort_values(by="column name / row label")
    • Get the indices that would sort an array in ascending order: numpy.argsort(a) or a.argsort(). For descending order, negate the array first.
  • Summing data
    • a.sum(axis=1): for an array a, sum(axis=1) sums each row, and sum(axis=0) sums each column. Note that a NumPy array sums all of its elements when axis is omitted, whereas a pandas DataFrame defaults to axis=0.
  • Dictionary operations
    • Sort a dictionary by value; sorted returns a list of (key, value) pairs:

      sorted(dic.items(), key=lambda x: x[1], reverse=True)

      import operator
      sorted(dic.items(), key=operator.itemgetter(1), reverse=True)

    • The dictionary get function:

      dic.get(key, 0) is equivalent to an if ... else: if key is in the dictionary dic, it returns dic[key]; otherwise it returns 0.
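
The operations listed above can be tried together on a small example. This is only a sketch: the toy data frame and all variable names are made up for illustration, with column names borrowed from the field table.

```python
import operator

import numpy as np
import pandas as pd

# Hypothetical toy data; column names are borrowed from the field table.
df = pd.DataFrame({
    "brand": [0, 1, 0, 2, 1, 0],
    "price": [3500, 1200, 4800, 900, 2600, 4100],
})

print(df.shape)       # (rows, columns)
df.info()             # dtypes and non-null counts per column
print(df.describe())  # count / mean / std / min / quartiles / max

# Sort rows by one column, largest price first.
by_price = df.sort_values(by="price", ascending=False)

# Number the rows inside each brand group: 0, 1, 2, ...
rank_in_brand = df.groupby(["brand"]).cumcount()

# argsort returns indices, not values; negate the array for descending order.
a = np.array([30, 10, 20])
asc_idx = np.argsort(a)
desc_idx = np.argsort(-a)

# axis=1 sums each row, axis=0 sums each column.
m = np.array([[1, 2, 3], [4, 5, 6]])
row_sums = m.sum(axis=1)
col_sums = m.sum(axis=0)

# Count brands with dict.get, then sort the dictionary by value.
counts = {}
for b in df["brand"]:
    counts[b] = counts.get(b, 0) + 1
ranked = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
print(ranked)
```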

1.4 Missing data and outlier detection

Types of missing data

  • Missing completely at random (MCAR): the missingness is completely random. It does not depend on any variable, complete or incomplete, and does not bias the sample. Example: a missing home address.
  • Missing at random (MAR): the missingness is not completely random; it depends on other, fully observed variables. Example: missing financial data that is related to the company being a small enterprise.
  • Missing not at random (MNAR): the missingness depends on the value of the incomplete variable itself. Example: high-income people who are unwilling to report their family income.

For MAR and MNAR, simply deleting the affected records is inappropriate, for the reasons given above. Under MAR, the missing values can be estimated from the known variables; for MNAR there is no good solution yet.

Checking for missing values:

  • dataframe.isnull()
      Element-level check: every element is listed in its position, showing True if it is empty or NA and False otherwise.

  • dataframe.isnull().any()
      Column-level check: True if the column contains at least one empty or NA element, False otherwise.

  • missing = dataframe.columns[dataframe.isnull().any()].tolist()
      Lists the columns that contain empty or NA values.

  • dataframe[missing].isnull().sum()
      Counts the empty or NA values in each of those columns.

  • len(data["feature"][pd.isnull(data["feature"])]) / len(data)
      Fraction of missing values in a column.
    [Figure: approaches to handling missing values]
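
The missing-value checks above can be tried on a small frame. This is a minimal sketch; the toy data stands in for Train_data, and the names missing_count and missing_ratio are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the article this would be Train_data.
data = pd.DataFrame({
    "bodyType": [0.0, np.nan, 2.0, np.nan],
    "fuelType": [0.0, 1.0, np.nan, 4.0],
    "price": [3500, 1200, 4800, 900],
})

# Columns that contain at least one missing value.
missing = data.columns[data.isnull().any()].tolist()

# Number and fraction of missing values in those columns.
missing_count = data[missing].isnull().sum()
missing_ratio = data[missing].isnull().mean()

print(missing)
print(missing_count)
print(missing_ratio)
```

Taking the mean of the boolean mask gives the same fraction as dividing the count of nulls by len(data), for all columns at once.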

Whether in a competition or a real project, you will run into missing data. If the data set is small, you can look through it in Excel or other visualization software to get a rough idea of why values are missing. With a large data set, finding the pattern becomes much harder. This article uses missingno, a package for visualizing missing values. It is very simple, with only a few methods, and convenient to use, but it only works together with pandas; you could call it a powerful tool built on top of pandas.

1.5 Understanding the distribution of the target

Select the column you want to predict and analyze it.
If it does not follow a normal distribution, it should be transformed before regression. A log transformation works quite well, but the best fit is the unbounded Johnson (Johnson SU) distribution.
See the code for details.

  • The overall distribution (unbounded Johnson distribution, etc.)
  • View skewness and kurtosis
  • View the frequency of specific target values
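
The three checks above might look like the following. This is a sketch under stated assumptions, not the article's own code: the prices are synthetic log-normal values standing in for Train_data['price'], and the 10-bin grouping is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical right-skewed prices standing in for Train_data['price'].
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=2000), name="price")

# Skewness and kurtosis of the raw target.
raw_skew = price.skew()
raw_kurt = price.kurt()

# A log transform pulls in the long right tail.
log_price = np.log1p(price)
log_skew = log_price.skew()

# Fit a Johnson SU distribution; fit() returns (a, b, loc, scale).
johnson_params = st.johnsonsu.fit(price)

print(f"raw skew={raw_skew:.2f}, kurtosis={raw_kurt:.2f}")
print(f"log skew={log_skew:.2f}")
print("Johnson SU parameters:", johnson_params)

# Frequency of target values, grouped into 10 equal-width bins.
print(pd.cut(price, bins=10).value_counts().sort_index())
```

A skewness close to zero after the log transform suggests the transformed target is much closer to symmetric than the raw one.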

1.6 Splitting the features into categorical and numeric features, and viewing the unique-value distribution of the categorical features
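
A minimal sketch of this split, assuming the categorical columns are listed by hand. The toy rows and the choice of which columns count as categorical are illustrative assumptions based on the field descriptions.

```python
import pandas as pd

# Hypothetical toy rows; column names come from the field table.
df = pd.DataFrame({
    "brand": [0, 1, 0, 2, 1, 0],
    "gearbox": [0, 1, 1, 0, 0, 1],
    "power": [75, 110, 90, 60, 150, 101],
    "kilometer": [12.5, 15.0, 6.0, 15.0, 3.0, 9.0],
    "price": [3500, 1200, 4800, 900, 2600, 4100],
})

# Categorical columns are integer-coded, so dtype alone cannot identify them.
# They are listed by hand here (an assumption based on the field descriptions).
categorical_features = ["brand", "gearbox"]
numeric_features = [c for c in df.columns
                    if c not in categorical_features and c != "price"]

# Unique-value distribution of each categorical feature.
for col in categorical_features:
    print(f"{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts())
```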

1.7 Numeric feature analysis

  • Correlation analysis
  • View the skewness and kurtosis of each feature
  • Visualize the distribution of each numeric feature
  • Visualize the pairwise relationships between numeric features
  • Visualize the regression relationship between multiple variables
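
The first two items can be sketched with pandas alone. The data below is synthetic and constructed so that price rises with power and falls with kilometer; the coefficients are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: price rises with power and falls with kilometer.
rng = np.random.default_rng(42)
n = 500
power = rng.uniform(40, 300, size=n)
kilometer = rng.uniform(0.5, 15.0, size=n)
price = 100 * power - 300 * kilometer + rng.normal(0, 500, size=n)
df = pd.DataFrame({"power": power, "kilometer": kilometer, "price": price})

# Correlation of every numeric column with the target, strongest first.
corr_with_price = df.corr()["price"].sort_values(ascending=False)
print(corr_with_price)

# Skewness and kurtosis of each numeric feature.
for col in ["power", "kilometer"]:
    print(f"{col}: skew={df[col].skew():.2f}, kurtosis={df[col].kurt():.2f}")
```

In a notebook, the full matrix from df.corr() can be passed to sns.heatmap for a visual check, and sns.pairplot(df) draws the pairwise relationships.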

1.8 Categorical feature analysis

  • Unique-value distribution
  • Box plots of the categorical features
  • Violin plots of the categorical features
  • Bar charts of the categorical features
  • Frequency of each category of a feature (count plot)
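
Before plotting, it can help to look at the numbers that these plots draw. A small sketch with made-up data: the per-category counts behind a count plot, and the per-category price quartiles behind a box plot.

```python
import pandas as pd

# Hypothetical toy data: transmission type and price.
df = pd.DataFrame({
    "gearbox": [0, 0, 0, 0, 1, 1, 1, 1],
    "price": [1000, 2000, 3000, 4000, 3000, 5000, 7000, 9000],
})

# Frequency of each category: the numbers a count plot draws.
category_counts = df["gearbox"].value_counts().sort_index()

# Quartiles of price per category: the numbers a box plot draws.
price_quartiles = (
    df.groupby("gearbox")["price"].quantile([0.25, 0.5, 0.75]).unstack()
)

print(category_counts)
print(price_quartiles)
```

In a notebook, sns.boxplot(x="gearbox", y="price", data=df) and sns.countplot(x="gearbox", data=df) show the same information graphically.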

1.9 Generating a data report with pandas_profiling

Summary

The EDA steps given here are general ones. Whether in a real project or a competition, EDA is only the first step, and also the most basic one.

Next, the data should generally be analyzed together with feature engineering and the modeling results, drawing on the literature and your own understanding to make judgments and gain a deeper understanding of the practical problem.

Finally, by alternating data processing and EDA, the aim is to arrive at a better data structure and distribution, and at features that are strongly correlated with the target.

In machine learning, data exploration is generally referred to as EDA (Exploratory Data Analysis):

It is a data analysis method that explores existing data (especially raw data obtained from surveys or observations) under as few prior assumptions as possible, using plotting, tabulation, equation fitting, and the computation of characteristic quantities to discover the structure and patterns of the data.

Data exploration helps us find correlations between features of the data, which is useful for the feature construction that follows.

  1. Do a preliminary analysis of the data (look at the data directly, or use statistical functions such as .sum(), .mean(), .describe()), covering: the number of samples, the size of the training set, whether there are time features and whether it is a time-series problem, what each (non-anonymous) feature means, the feature types (string, int, float, time), the missing values of each feature (pay attention to how missingness appears in the data; some values are empty and some are symbols such as "NAN"), and the mean and variance of each feature.

  2. Analyze and record how to handle features with more than 30% of their values missing, which helps later model validation and tuning. Decide whether the feature should be filled (and how: mean fill, zero fill, mode fill, etc.), dropped, or whether the samples should first be grouped and predicted with different feature models.

  3. Analyze outliers specifically: whether the labels of samples with abnormal features are themselves abnormal (far from the mean, or special symbols), whether the outliers should be removed or replaced with normal values, and whether the anomaly comes from a recording error or from the machine itself.

  4. Analyze the label specifically, such as its distribution.

  5. Further analysis can be done by plotting the features, and by plotting features together with the label (statistical charts, scatter plots), to get an intuitive view of the feature distributions. This step can also reveal outliers in the data. Box plots show how far some feature values deviate. Plotting features against each other, and features against the label, helps analyze the relationships among them.
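
Steps 2 and 3 can be sketched as follows. The toy data is made up, the 30% threshold comes from step 2, and the 1.5 × IQR rule is the usual box-plot convention, used here as an assumption rather than something the steps prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 600 is planted as an outlier relative to the other rows.
df = pd.DataFrame({
    "bodyType": [0.0, np.nan, np.nan, 2.0, np.nan, 1.0, np.nan, 0.0],
    "power": [75.0, 80.0, 90.0, 85.0, 600.0, 78.0, np.nan, 88.0],
})

# Step 2: features with more than 30% of their values missing.
missing_ratio = df.isnull().mean()
high_missing = missing_ratio[missing_ratio > 0.3].index.tolist()

# Candidate fills for a sparsely missing feature: mean, zero, or mode.
power_mean_filled = df["power"].fillna(df["power"].mean())
power_zero_filled = df["power"].fillna(0)
power_mode_filled = df["power"].fillna(df["power"].mode()[0])

# Step 3: flag values outside the box-plot whiskers (1.5 * IQR, an assumed rule).
q1, q3 = df["power"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["power"] < q1 - 1.5 * iqr) | (df["power"] > q3 + 1.5 * iqr)
outliers = df.loc[is_outlier, "power"].tolist()

print("More than 30% missing:", high_missing)
print("Outliers in power:", outliers)
```

Whether a flagged value is removed, replaced, or kept still depends on the cause of the anomaly, as step 3 describes.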

Origin blog.csdn.net/Elenstone/article/details/105067891