Data Mining Task 2 - Data Analysis

Task 2 data analysis, from the Datawhale zero-basics data mining course

2. EDA - Exploratory Data Analysis

2.1 Goals of EDA

  • The main value of EDA lies in becoming familiar with the data set, understanding it, and validating it, so as to determine that the data obtained can be used for the subsequent machine learning or deep learning work.

  • Once we understand the data set, the next step is to understand the relationships among the variables and between the variables and the predicted value.

  • EDA guides data science practitioners through the subsequent steps of data processing and feature engineering, so that the structure and feature set of the data make the following prediction problem more reliable.

  • Complete the exploratory data analysis, and summarize the data with some charts or text for the check-in.

2.2 Introduction

  1. Load the data science and visualization libraries:
    • Data science libraries: pandas, numpy, scipy;
    • Visualization libraries: matplotlib, seaborn;
    • Others;
  2. Load the data:
    • Load the training set and the test set;
    • Briefly inspect the data (head() + shape);
  3. Data overview:
    • Get familiar with the relevant statistics via describe()
    • Get familiar with the data types via info()
  4. Check for missing and abnormal data
    • Check each column for NaN values
    • Outlier detection
  5. Understand the distribution of the predicted value
    • Overall distribution (Johnson SU unbounded distribution, etc.)
    • Check skewness and kurtosis
    • Check the specific frequency of the predicted value
  6. Split the features into categorical features and numeric features, and check the unique-value distribution of the categorical features
  7. Numeric feature analysis
    • Correlation analysis
    • Check the skewness and kurtosis of each feature
    • Visualize the distribution of each numeric feature
    • Visualize the relationships between the numeric features
    • Visualize the pairwise regression relationships among multiple variables
  8. Categorical feature analysis
    • unique-value distribution
    • Box plot visualization of the categorical features
    • Violin plot visualization of the categorical features
    • Bar plot visualization of the categorical feature categories
    • Frequency visualization of each categorical feature (count_plot)
  9. Generate a data report with pandas_profiling (a sketch of steps 6-9 follows this list)
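Steps 6-9 are not reached in the code section below (it stops at step 5), so here is a minimal sketch of what they might look like for this data set. The feature groupings are assumptions based on the field list in section 2.3.1, and the ProfileReport call follows the pandas_profiling v2-era API; none of this is the author's original code.

# sketch of steps 6-9, under the assumptions stated above
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Train_data = pd.read_csv("./used_car_train_20200313.csv", sep=' ')

# step 6: split into categorical and numeric features, check unique-value counts
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType',
                        'gearbox', 'notRepairedDamage', 'regionCode']
numeric_features = ['power', 'kilometer', 'price'] + ['v_' + str(i) for i in range(15)]
for cat in categorical_features:
    print(cat, 'has', Train_data[cat].nunique(), 'unique values')

# step 7: correlation analysis of the numeric features
correlation = Train_data[numeric_features].corr()
sns.heatmap(correlation, square=True)
plt.show()

# step 8: box plot of a categorical feature against the target
sns.boxplot(x='brand', y='price', data=Train_data)
plt.show()

# step 9: HTML data report via pandas_profiling
import pandas_profiling
report = pandas_profiling.ProfileReport(Train_data)
report.to_file('./example_report.html')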


2.3 Code Example

2.3.1 Load the data science and visualization libraries and import the data

import pandas as pd
import numpy as np
from tqdm import tqdm
import datetime
import time
import warnings
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns   # needed by the sns.distplot calls further below
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
%matplotlib inline

# raw = pd.read_csv("./used_car_train_20200313.csv", parse_dates=['regDate'])
Train_data = pd.read_csv("./used_car_train_20200313.csv", sep=' ', parse_dates=['regDate'])
Test_data = pd.read_csv("./used_car_testA_20200313.csv", sep=' ', parse_dates=['regDate'])
warnings.filterwarnings("ignore")
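Step 2 of the outline also calls for a brief look at the data via head() + shape, which the code above skips; a minimal sketch (the shape in the comment is what this competition's training set is expected to contain, but verify it yourself):

# briefly inspect the loaded data (head() + shape)
print(Train_data.head())
print(Train_data.shape)   # expected to be (150000, 31) for this competition's training set
print(Test_data.shape)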

All of the features have been desensitized (to make them easier to view):

  • name - car code
  • regDate - car registration date
  • model - model code
  • brand - brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - gearbox
  • power - engine power
  • kilometer - kilometers driven
  • notRepairedDamage - whether the car has unrepaired damage
  • regionCode - region code where the car is viewed
  • seller - seller
  • offerType - offer type
  • creatDate - time the ad was posted
  • price - car price
  • 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' - anonymous features (15 in total, v_0 through v_14)

2.3.2 Data Overview

  1. describe() gives the statistics of each column: the count, the mean, the standard deviation std, the minimum min, the 25%/50%/75% quantiles (including the median), and the maximum max. The main point of this information is to instantly grasp the rough range of the data and to judge whether each column contains abnormal values; for example, you will sometimes find that values such as 999, 9999, or -1 are actually just another way of expressing NaN, so this needs attention (a hypothetical sketch of handling them follows the describe() call below).
  2. info() shows the type of each column, which also helps to check whether there are special-symbol anomalies other than NaN.
## 1) get familiar with the relevant statistics via describe()
Train_data.describe()
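If describe() reveals sentinel values of the 999/9999/-1 kind mentioned above, they can be mapped to real NaNs. A hypothetical sketch (the 'power' column and the sentinel list are purely illustrative, not findings from this data set):

# illustrative only: map sentinel values that actually mean "missing" to NaN
Train_data['power'] = Train_data['power'].replace([-1, 999, 9999], np.nan)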


## 2) get familiar with the data types via info()
Train_data.info()


2.3.3 Check for missing and abnormal data

## 1) check each column for NaN values
Train_data.isnull().sum()


# NaN visualization
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()


The two lines above give a very intuitive view of which columns contain NaN, and the NaN counts can be printed. The main purpose is to see whether the number of NaNs is really large: if it is small, the values are usually filled; with tree models such as LightGBM the vacancies can be left as-is so that the tree optimizes around them itself; but if there are too many NaNs, dropping the column can be considered.
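A minimal sketch of the two options just described ('bodyType' is assumed here as an example of a column with few NaNs; 'some_column' is a hypothetical column with too many):

# few NaNs -> fill, e.g. with the mode ('bodyType' assumed as the example)
Train_data['bodyType'] = Train_data['bodyType'].fillna(Train_data['bodyType'].mode()[0])
# too many NaNs -> drop the column ('some_column' is hypothetical)
# Train_data = Train_data.drop(columns=['some_column'])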


# visualize the missing values
msno.matrix(Train_data.sample(250))


msno.bar(Train_data.sample(1000))


## 2) detect anomalies other than NaN (numeric variables outside a reasonable range, meaningless values in object columns)
Train_data['notRepairedDamage'].value_counts()


It can be seen that '-' is also a missing value. Since many models can handle NaN directly, we do not process it further here and simply replace '-' with NaN first.
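The replacement described above can be done with pandas' replace (a minimal sketch; the exact call is an assumption):

# replace the '-' placeholder with a real NaN so models can handle it uniformly
Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)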

## severely skewed categorical variables are generally of no help for prediction and can be considered for removal
Train_data['seller'].value_counts()
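If value_counts() confirms the skew, removal might look like this (a sketch; 'offerType' is assumed to be similarly skewed, so check its value_counts() first):

# drop the heavily skewed, near-constant categorical columns
del Train_data['seller']
del Train_data['offerType']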


2.3.4 Understand the distribution of the predicted value


## 1) overall distribution (Johnson SU unbounded distribution, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)


The price does not follow a normal distribution, so it must be transformed before doing regression. Although a log transform works quite well, the best fit is the Johnson SU unbounded distribution.
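A minimal sketch of the log transform mentioned above (np.log1p is an assumption; it is the usual choice since it handles a price of 0 safely):

# log-transform the price and compare the result against a fitted normal
price_log = np.log1p(Train_data['price'])
sns.distplot(price_log, kde=False, fit=st.norm)
plt.title('log1p(price) with fitted normal')
plt.show()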

## 2) check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
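The outline also asks for the specific frequency of the predicted value; a sketch of that check (the bin count is an arbitrary choice):

## 3) check the frequency of the predicted value (sketch)
plt.hist(Train_data['price'], bins=100, color='red')
plt.show()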


 (To be completed)


The main thing learned in this part is ways of observing the data (systematically learning some visualization tools).

The main reference: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.12.1cd8593aw4bbL5&postId=95457
