"Data Mining Getting Started Series": Data Exploration

After obtaining a sample data set, and before we start data mining, we need a basic understanding of that data. We want to know whether the sample shows any obvious rules or trends, and whether it contains unusual values.

We can check the quality of the data, display it graphically, or compute some important summary statistics to understand a data set. This whole process of getting to know the data is called data exploration.

Before preprocessing, we need to understand the quality of the data so that preprocessing can be done effectively. Data quality analysis mainly checks whether the sample data set contains garbage data, also called dirty data. If dirty data is not handled well, it will seriously affect the results of the analysis. In enterprise data, dirty data commonly falls into the following types:

  1. Missing data
  2. Data anomalies
  3. Data inconsistencies
  4. Data duplication, etc.

Handling missing data

In data that a company has gathered into one place, some records are often incomplete for one reason or another, and some fields may be missing. During data mining this can significantly affect the results of the analysis. There are three common treatments for missing data:

  1. Delete missing values
  2. Fill the missing values
  3. No treatment
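The three treatments above can be sketched with pandas. This is a minimal illustration, not the article's own code: the column name `sales` and the values are made up, and filling with the mean is only one of several possible fill strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical sales figures; NaN marks a missing entry.
sales = pd.DataFrame({"sales": [2600.0, np.nan, 2800.0, 3000.0]})

# 1. Delete rows that contain a missing value.
dropped = sales.dropna()

# 2. Fill missing values, here with the mean of the observed values.
filled = sales.fillna(sales["sales"].mean())

# 3. No treatment: keep the NaN and let later steps decide what to do.
untouched = sales.copy()

print(len(dropped))
print(filled["sales"].tolist())
```

Deleting is simplest but discards information; filling keeps every row at the cost of introducing estimated values.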

Handling abnormal data

Because corporate data is often entered by hand, some values may be unreasonable. Outliers can strongly distort the results, so if they are not dealt with during preprocessing they will cause many adverse effects. We can detect abnormal data in the following ways.

  1. Simple statistical analysis
  2. 3σ principle
  3. Box plot analysis

1. Simple statistical analysis

A simple statistical analysis of the data set can already reveal outliers. For example, check whether the maximum and minimum values fall within a reasonable range.

2. 3σ principle

If the data are normally distributed, a measured value that deviates from the mean by more than three standard deviations is treated as an outlier; under a normal distribution, the probability of such a value is below 0.3%. If the data are not normally distributed, we can still describe how far a value lies from the mean as a multiple of the standard deviation to decide whether it is abnormal.
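As a rough sketch of the 3σ check, the snippet below flags values that lie more than three standard deviations from the mean. The numbers are invented for illustration, and it assumes the sample standard deviation (pandas' default) is an acceptable estimate of σ.

```python
import pandas as pd

# Hypothetical daily sales; the last value is deliberately extreme.
values = pd.Series(
    [2600, 2700, 2650, 2800, 2750, 2900, 2550, 2850, 2700, 2650,
     2750, 2800, 2600, 2950, 2700, 2650, 2800, 2750, 2700, 9106],
    dtype="float64",
)

mean = values.mean()
std = values.std()  # sample standard deviation (ddof=1)

# Flag values lying more than three standard deviations from the mean.
outliers = values[(values - mean).abs() > 3 * std]

print(outliers.tolist())
```

Note that the extreme value itself inflates both the mean and the standard deviation, so with very small samples this rule may fail to flag anything.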


3. Box plot analysis

A box plot provides a criterion for identifying outliers. From the measured values we compute the lower quartile and the upper quartile; the difference between them is the interquartile range (IQR), and half of the observations lie between the two quartiles. A value is usually flagged as an outlier if it lies more than 1.5 × IQR below the lower quartile or more than 1.5 × IQR above the upper quartile.


Box plot analysis places no requirement on the distribution of the data and gives a direct picture of how the original data are spread. Up to 25 percent of the data can lie arbitrarily far away without greatly disturbing the quartiles, so outliers do not distort the criterion itself. This robustness is an advantage when identifying outliers.
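The box plot criterion can also be applied numerically. The sketch below uses the common 1.5 × IQR convention on made-up values; the factor 1.5 and the sample numbers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sales values containing one very low and one very high entry.
values = pd.Series(
    [22, 2450, 2600, 2650, 2700, 2800, 3000, 3050, 9106],
    dtype="float64",
)

q_low = values.quantile(0.25)   # lower quartile
q_high = values.quantile(0.75)  # upper quartile
iqr = q_high - q_low            # interquartile range

# Fences of the usual box plot rule.
lower_fence = q_low - 1.5 * iqr
upper_fence = q_high + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

Because the quartiles barely move when a few extreme values are added, the fences stay stable, which is the robustness described above.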

Case study: data exploration

The following data set stores the daily sales figures of a business. We will run a statistical analysis on these sales data to find the outliers in them.

The file has been uploaded to Baidu network disk: https://pan.baidu.com/s/1aiFN3GdAngD4ylN3bvPKoA


Note: do not place the code under a path that contains Chinese characters!


# -*- coding: utf-8 -*-
# 1. Import the pandas library, aliased as pd, to read the test data set
import pandas as pd

# 2. Read the test data set with read_excel
# read_excel arguments:
#   1) the Excel file name
#   2) index_col: the column to use as the index (pandas reads the data into a
#      DataFrame, a table-like structure, so this is similar to choosing a primary key)
# The workbook's headers are in Chinese; the date column is assumed to be named '日期' ('date')
data = pd.read_excel('catering_sale.xls', index_col='日期')
print('original row count: ' + str(len(data)))

# 3. Use describe for a simple statistical analysis
describe = data.describe()

# 4. Print the descriptive statistics
print(describe)

The output is as follows:

                          销量 (sales)
count                      200.000000
mean (average)            2755.214700
std (standard deviation)   751.029772
min (minimum)               22.000000
25% (first quartile)      2451.975000
50% (median)              2655.850000
75% (third quartile)      3026.125000
max (maximum)             9106.440000

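count counts only non-null values. Here there are 200 non-null values, but the original row count is 201, which means one record has a missing value.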
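To see which record is incomplete, the row count can be compared with the non-null count and the missing rows selected directly. The snippet is a small stand-in for the real table: the column name `sales`, the dates, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Small stand-in for the sales table, indexed by date.
data = pd.DataFrame(
    {"sales": [2600.0, np.nan, 2800.0]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
)

total_rows = len(data)                        # counts every row
non_null = int(data["sales"].count())         # counts only non-null values
missing_rows = data[data["sales"].isnull()]   # rows whose value is missing

print(total_rows - non_null)
print(missing_rows.index.tolist())
```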

Next we use matplotlib to draw a box plot and find the outliers.

"""
Requirement 2:
Use matplotlib to draw a box plot and find the outliers
"""
# 1. Import the matplotlib plotting library
import matplotlib.pyplot as plot

# 2. Configure matplotlib drawing parameters
plot.rcParams['font.sans-serif'] = ['SimHei']    # display Chinese labels correctly
plot.rcParams['axes.unicode_minus'] = False      # display the minus sign correctly
# 3. Create a figure
plot.figure()
# 4. Draw the box plot
boxplot = data.boxplot()

plot.show()


The box plot marks the candidate outliers. Combining this with knowledge of the specific business, we can treat 22, 51, 60, 6607 and 9106 as outliers, and use them to define the filtering rules for the later steps.
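A filtering rule of this kind might look like the sketch below. The bounds 400 and 5000 are hypothetical thresholds chosen only so that the outliers listed above fall outside them; the real bounds depend on the business. The column name `sales` and the normal values are also made up.

```python
import pandas as pd

# Stand-in data containing the outliers named above plus a few normal values.
data = pd.DataFrame(
    {"sales": [22.0, 51.0, 60.0, 2600.0, 2800.0, 3000.0, 6607.0, 9106.0]}
)

# Hypothetical business rule: plausible daily sales lie within these bounds.
LOW, HIGH = 400.0, 5000.0

is_valid = data["sales"].between(LOW, HIGH)  # bounds are inclusive
cleaned = data[is_valid]                     # rows kept for analysis
rejected = data[~is_valid]                   # rows treated as outliers

print(rejected["sales"].tolist())
```

Keeping the rejected rows in a separate frame, rather than discarding them, makes it easy to review them with the business side later.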


Origin www.cnblogs.com/ilovezihan/p/12240918.html