Datawhale Zero-Basics Introduction to Data Mining: Exploratory Data Analysis

Datawhale Zero-Basics Introduction: Data Analysis

Tags: data analysis, used car prediction


Note: this post adds my own ideas and comments on top of a tutorial prepared by a group of experts, with plenty of notes on the key terms. On the one hand this helps me learn the material more deeply; on the other hand it can lower the threshold for newcomers.

Exploratory data analysis

1. Objectives

  • Become familiar with the meaning of each field and understand the dataset
  • Use visualization to examine the relationship between the variables and the predicted value
  • Data processing and feature engineering

2. Content

2.1 Python libraries

  • Data science libraries (pandas, numpy, scipy)
  • Visualization libraries (matplotlib, seaborn)

2.2 Loading and observing the data

  • Load the training and test sets
  • Observe the data
  • Use the describe() method to get familiar with the variables
  • Use info() to get familiar with the data types
  • Check the NaN values in each column
  • Outlier detection

2.3 Understanding the distribution of the predicted value

  • The overall distribution
  • Check skewness and kurtosis
  • Check the frequency of specific predicted values
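
The three checks in 2.3 can be sketched with plain pandas. This is only a minimal illustration: the `price` values below are made up and stand in for the real price column.

```python
import pandas as pd

# Hypothetical stand-in for the price column; the values are invented.
price = pd.Series([500, 800, 800, 1200, 1500, 2500, 4000, 30000], name="price")

# Skewness > 0 means a long right tail; kurtosis measures how heavy the tails are.
skewness = price.skew()
kurtosis = price.kurt()
print("skewness:", skewness)
print("kurtosis:", kurtosis)

# Frequency of each specific value, most common first.
frequency = price.value_counts()
print(frequency.head())
```

A strongly positive skewness, as in this toy series, is the kind of signal this step is meant to surface.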

2.4 Splitting the features into categorical and numerical features, and checking the unique-value distribution of the categorical features

2.5 Numerical features

  • Correlation analysis
  • Check the skewness and kurtosis of each feature
  • Visualize the distribution of each numerical feature
  • Visualize the relationships between the numerical features
  • Visualize the regression relationships between multiple variables
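
As a rough sketch of the correlation and skewness checks listed in 2.5, the snippet below uses a tiny invented frame; the column names follow the dataset, but the numbers are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical numerical features plus the target; all values are invented.
df = pd.DataFrame({
    "power": [60, 75, 90, 110, 150, 200],
    "kilometer": [15.0, 12.5, 10.0, 8.0, 5.0, 2.0],
    "price": [1500, 2300, 3200, 4500, 7000, 12000],
})

# Pearson correlation of every feature with the target, strongest first.
corr = df.corr()
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)

# Skewness and kurtosis of every numerical column.
print(df.skew())
print(df.kurt())
```

On the real data the same `corr` matrix could be passed to a seaborn heatmap for the visual version of this step.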

2.6 Categorical features

  • Unique-value distribution
  • Box plots of the categorical features
  • Violin plots of the categorical features
  • Bar plots of the categorical features
  • Frequency of each category within each feature
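
Before plotting, the unique-value and frequency checks in 2.6 can be done numerically. A minimal sketch, assuming a few of the coded columns from the dataset and invented values:

```python
import pandas as pd

# Hypothetical coded categorical columns; the values are invented.
df = pd.DataFrame({
    "bodyType": [0, 1, 1, 2, 0, 1],
    "fuelType": [0, 0, 1, 0, 0, 1],
    "gearbox": [0, 1, 0, 0, 1, 0],
})
categorical = ["bodyType", "fuelType", "gearbox"]

# Number of distinct categories per feature.
unique_counts = df[categorical].nunique()
print(unique_counts)

# Frequency of each category of one feature.
body_freq = df["bodyType"].value_counts()
print(body_freq)
```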

2.7 Generating a data report with pandas_profiling


3. Sample Code and Notes

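3.1 Importing the Python packages

# pandas and numpy are two extremely handy data science libraries
import pandas as pd
import numpy as np
# matplotlib and seaborn are common visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Package for visualizing missing values
import missingno as msno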

3.2 Loading the training and test sets

# Mind the difference between relative and absolute paths
Train_data = pd.read_csv('./data/used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv('./data/used_car_testA_20200313.csv', sep=' ')
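
The files are space-separated rather than comma-separated, which is why `sep=' '` is passed. The sketch below shows the effect on a small in-memory sample; the two data rows are invented and only stand in for the real files.

```python
import io

import pandas as pd

# Invented two-row sample in the same space-separated layout as the CSV files.
raw = "SaleID name regDate price\n0 736 20040402 1850\n1 2262 20030301 3600\n"

# sep=' ' splits each line on single spaces instead of commas.
sample = pd.read_csv(io.StringIO(raw), sep=" ")
print(sample.shape)
print(sample.columns.tolist())
```

Without `sep=' '`, pandas would read each line as a single comma-free column.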

3.3 Features of the dataset

| Field | Meaning |
|---|---|
| SaleID | Transaction ID, unique code |
| name | Car trade name, desensitized |
| regDate | Car registration date, e.g. 20160101 for January 1, 2016 |
| model | Model code, desensitized |
| brand | Car brand, desensitized |
| bodyType | Body type: limousine: 0, mini car: 1, van: 2, bus: 3, convertible: 4, two-door car: 5, commercial vehicle: 6, mixer truck: 7 |
| fuelType | Fuel type: gasoline: 0, diesel: 1, LPG: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6 |
| gearbox | Transmission: manual: 0, automatic: 1 |
| power | Engine power, range [0, 600] |
| kilometer | Distance the car has traveled, in units of 10,000 km |
| notRepairedDamage | The car has damage that has not been repaired: yes: 0, no: 1 |
| regionCode | Region code, desensitized |
| seller | Seller: individual: 0, non-individual: 1 |
| offerType | Offer type: offer: 0, request: 1 |
| creatDate | Time the car went online, i.e. when it was put up for sale |
| price | Car price (the value to predict) |
| v_0, ..., v_14 | Anonymous features |

Transaction ID: the unique ID of a used-car transaction; the primary key.
Car trade name: for example "Audi A6L 2006 2.4 CVT Comfort" (my personal understanding).
Car registration date [1]: the date on which the vehicle administration office approves the owner's registration application (the owner provides an application form, proof of origin of the motor vehicle, the vehicle factory certificate, an exemption certificate, proof of compulsory insurance, etc.).
Model code [2]: according to the coding rules, a model code is composed of the enterprise code, the vehicle class code, the main parameters, the product serial number, and an enterprise-defined code.
Anonymous features: v_0 through v_14 are anonymous features constructed manually from the car's other transaction data after desensitization.
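
Since regDate is stored as an integer such as 20160101, it may help to see how it can be turned into a real date. This is a sketch under assumptions: the three values are invented, and the malformed last one is included only to show how `errors="coerce"` behaves.

```python
import pandas as pd

# Invented regDate values; the last one has an impossible month ("00").
df = pd.DataFrame({"regDate": [20160101, 20040402, 20070009]})

# Parse YYYYMMDD integers; values that do not form a valid date become NaT.
df["regDate_parsed"] = pd.to_datetime(
    df["regDate"].astype(str), format="%Y%m%d", errors="coerce"
)
print(df)
```

Counting the NaT entries afterwards is a quick way to see how many registration dates would need cleaning.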

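3.4 Observing the datasets

# Show the head and tail of the training set (Figure 1)
# pd.concat is used because DataFrame.append was removed in pandas 2.0
pd.concat([Train_data.head(), Train_data.tail()])
# Show the head and tail of the test set (Figure 2)
pd.concat([Test_data.head(), Test_data.tail()])

##### **Figure 1**
##### **Figure 2**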

As the figures above show, the training set has 150,000 rows (0-149999) and 31 columns; the test set has 50,000 rows and 30 columns (it lacks the price column, which is what we need to predict). Looking at the head and tail is especially handy when the dataset is too large to open in Excel: it quickly shows what the rows and columns look like and gives an intuitive feel for how the data is represented.
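
The row and column counts, and the fact that only the price column is missing from the test set, can also be checked directly. A minimal sketch on two tiny invented frames (`train` and `test` are hypothetical stand-ins for Train_data and Test_data):

```python
import pandas as pd

# Tiny invented stand-ins for the training and test sets.
train = pd.DataFrame({"SaleID": [0, 1, 2], "power": [60, 0, 163], "price": [1850, 3600, 6222]})
test = pd.DataFrame({"SaleID": [150000, 150001], "power": [313, 75]})

# (rows, columns) of each set.
print(train.shape, test.shape)

# Columns present in the training set but absent from the test set.
missing_in_test = [col for col in train.columns if col not in test.columns]
print(missing_in_test)
```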

3.5 Observing from a God's-eye view

Having seen what the whole dataset looks like, we probably all share the same bewildered thought: with a dataset this large, where do I even start? This is too hard!
This is where I step in and hand you two recommendations: the describe() function and the info() function.

# Show common summary statistics
Train_data.describe()

**Figure 3**
Official documentation for the describe function
Blog post on the describe function
For each column, describe() reports: count (the number of non-null values), mean, std (the standard deviation, not the variance), min, the 25%, 50% (median), and 75% quantiles, and max. This gives a rough picture of how the values are distributed across the whole dataset. Note that precision problems may show up in the reported values, regDate being one example.
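
To make the meaning of each row concrete, here is a small sketch on an invented `power` column; it also checks that std is the sample standard deviation, whose square is the variance.

```python
import pandas as pd

# Invented values for a single numerical column.
df = pd.DataFrame({"power": [0, 60, 75, 110, 150]})

stats = df["power"].describe()
print(stats)

# std is the sample standard deviation; squaring it gives the sample variance.
variance = df["power"].var()
print(stats["std"] ** 2, variance)
```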

# Show the non-null count and data type of each column
Train_data.info()

**Figure 4**
Official documentation for the info function
The info function is mainly used to check each column's data type and how its NaN values are distributed. This helps us later when deciding whether to fill or drop the NaN values.
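
The same information that info() prints can be pulled out as regular pandas objects, which is convenient for further processing. A sketch on an invented frame; note how a single NaN turns an integer-coded column into float64.

```python
import numpy as np
import pandas as pd

# Invented frame: bodyType holds integer codes but contains one NaN.
df = pd.DataFrame({
    "SaleID": [0, 1, 2],
    "bodyType": [1.0, np.nan, 2.0],
})

# Prints column names, non-null counts, and dtypes.
df.info()

# The same facts as Series objects.
non_null = df.notnull().sum()
dtypes = df.dtypes
print(non_null)
print(dtypes)
```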

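3.6 Viewing the distribution of NaN values

# Count the missing values in each column of the training set
Train_data.isnull().sum()

**Figure 5**
Visualizing the distribution of missing values

# Nullity matrix for a random sample of 250 rows, showing where values are missing
msno.matrix(Train_data.sample(250))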
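
When there are many columns, the isnull().sum() output is easier to read if it is limited to the columns that actually have missing values. A minimal pandas-only sketch with invented data:

```python
import numpy as np
import pandas as pd

# Invented frame with missing values in two of three columns.
df = pd.DataFrame({
    "bodyType": [1.0, np.nan, 2.0, np.nan],
    "fuelType": [0.0, 0.0, np.nan, 1.0],
    "power": [60, 75, 110, 150],
})

# Keep only columns with at least one missing value, largest count first.
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)

# Share of rows that are missing in each of those columns.
missing_ratio = missing / len(df)
print(missing_ratio)
```

On the real training set, `missing` could then be drawn as a bar chart with `missing.plot.bar()`.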



Origin blog.csdn.net/BigCabbageFy/article/details/105079989