Python Tianjie Fighting Skills--Data Exploration of Data Mining

Table of contents

1. Data quality analysis 

1. Missing value analysis

2. Outlier Analysis

2.1 Simple statistical analysis

2.2 The 3σ rule

2.3 Box plot analysis

2.4 Consistency Analysis

3. Analysis of data characteristics 

3.1 Distribution analysis

3.2 Comparative analysis 

3.3 Statistical analysis 

3.4 Periodic Analysis

3.5 Contribution Analysis

3.6 Correlation analysis


        After collecting the data, the next step is to check its quality and quantity to see whether it is adequate for the subsequent modeling. Here we explore the data from two angles: data quality analysis and data feature analysis.

1. Data quality analysis 

        Data quality analysis mainly checks whether the collected data contain dirty data, that is, data that do not meet requirements or cannot be analyzed directly. Dirty data include missing values, outliers, inconsistent values, duplicate records, and values containing special symbols (such as #, *, $).

1. Missing value analysis

        Missing data mainly take two forms: an entire record is missing, or a particular field within a record is missing.

        Causes of missing values: 1. Some information cannot be obtained, or is too costly to obtain. 2. Information was omitted during collection or entry. 3. The attribute value does not exist for that object.

        Effects of missing values: 1. Much useful information is lost in the data mining modeling stage. 2. The resulting model shows greater uncertainty. 3. Records containing null values can lead to unreliable output.
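As a minimal sketch of a missing value check, the snippet below counts missing values per column and lists the affected records with pandas. The small DataFrame and its column names are made up for illustration and are not the article's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "date": ["2015-03-01", "2015-02-28", "2015-02-27", "2015-02-26"],
    "sales": [51.0, np.nan, 3442.1, 3393.1],
})

missing_per_column = df.isnull().sum()            # number of missing values in each column
missing_rate = df.isnull().mean()                 # share of missing values in each column
rows_with_missing = df[df.isnull().any(axis=1)]   # records that contain at least one missing value

print(missing_per_column)
print(missing_rate)
print(rows_with_missing)
```

The per-column rate is often more useful than the raw count when deciding whether a column is still worth keeping.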

2. Outlier Analysis

        Outlier analysis checks whether the data contain entry errors or unreasonable values. Several methods are commonly used.

2.1 Simple statistical analysis

        Simple statistical analysis computes descriptive statistics for a variable and checks whether the values are reasonable given what we know about it. The most common statistics are the maximum and minimum: a customer age of 199, for example, is clearly unreasonable.

2.2 The 3σ rule

        If the data follow a normal distribution, the 3σ rule can be used to identify outliers: an outlier is a measured value that deviates from the mean by more than 3 standard deviations. Under a normal distribution the probability of such a value is about 0.003, so it is a rare event.
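The three-standard-deviation check can be sketched with NumPy as follows. The data here are synthetic (roughly normal values with one injected extreme value), so the numbers are only illustrative.

```python
import numpy as np

# Synthetic data for illustration: 200 roughly normal values plus one extreme value.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=2750, scale=300, size=200), 9106.44)

mean = values.mean()
std = values.std(ddof=1)  # sample standard deviation

# Flag values that lie more than 3 standard deviations from the mean.
is_outlier = np.abs(values - mean) > 3 * std
outliers = values[is_outlier]
print(outliers)
```

Note that extreme values inflate the mean and standard deviation themselves, which is one reason the quartile-based box plot criterion is often preferred.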

2.3 Box plot analysis

        The box plot criterion for outliers is based on the quartiles and the interquartile range. Quartiles are robust: up to 25% of the data can be moved arbitrarily far away without seriously disturbing them, so the outliers themselves have little influence on the criterion.

 After reading the data in Python, you can summarize it with the describe() method of the pandas library, for example:

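import pandas as pd

catering_sale = 'catering_sale.xls'
data = pd.read_excel(catering_sale, index_col='日期')  # use the '日期' (date) column as the index
print(data.describe())  # descriptive statistics of the '销量' (sales volume) column
print(len(data))  # total number of records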
"""
               销量
count   200.000000
mean   2755.214700
std     751.029772
min      22.000000
25%    2451.975000
50%    2655.850000
75%    3026.125000
max    9106.440000
201
"""

In the output, count (200) is one less than the number of records (201); the difference is the number of missing values. mean is the mean, std the standard deviation, min the minimum, max the maximum, and 25%, 50%, and 75% are the quartiles. These give a quick overview of the data. To view the data more intuitively or to check for outliers, you can also draw a box plot:

import pandas as pd
import matplotlib.pyplot as plt  # plotting library

catering_sale = 'catering_sale.xls'
data = pd.read_excel(catering_sale, index_col='日期')  # read the data, using the '日期' (date) column as the index

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can display the Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly

plt.figure()  # create the figure
p = data.boxplot(return_type='dict')  # draw the box plot directly with the DataFrame method
x = p['fliers'][0].get_xdata()  # 'fliers' holds the outlier points
y = p['fliers'][0].get_ydata()
y.sort()  # sort ascending; this modifies the array in place

# Annotate each outlier with its value; labels of nearby points are shifted horizontally to reduce overlap
for i in range(len(x)):
    if i > 0:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.05 - 0.8 / (y[i] - y[i-1]), y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))

plt.show()  # show the box plot

As the box plot shows, the 7 sales values that fall outside the upper and lower bounds may be outliers.
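The bounds of a box plot can also be computed directly. The sketch below assumes the common convention (also matplotlib's default whisker setting) that a value is flagged when it lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile. The sales figures are hypothetical.

```python
import pandas as pd

# Hypothetical sales figures, for illustration only.
sales = pd.Series([22.0, 2451.9, 2600.0, 2655.8, 2800.0, 3026.1, 3100.0, 9106.44])

q1 = sales.quantile(0.25)   # lower quartile
q3 = sales.quantile(0.75)   # upper quartile
iqr = q3 - q1               # interquartile range

lower = q1 - 1.5 * iqr      # lower bound
upper = q3 + 1.5 * iqr      # upper bound

outliers = sales[(sales < lower) | (sales > upper)]
print(outliers.tolist())    # [22.0, 9106.44]
```

Flagged values are only candidates: whether they are errors or genuine extremes still has to be judged against the business context.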

2.4 Consistency Analysis

        Data inconsistency refers to contradictory or incompatible data. Mining inconsistent data directly is likely to produce results that contradict reality.

3. Analysis of data characteristics 

        After the data quality analysis, we can analyze the characteristics of the data by drawing charts and computing summary statistics.

3.1 Distribution analysis

        Distribution analysis reveals the distribution characteristics and distribution type of the data. It is handled differently for quantitative data and for qualitative data.

1. Distribution analysis of quantitative data

        For quantitative data, we generally analyze the distribution through a frequency distribution, following these steps:

Find the range; decide the class width and the number of classes; determine the cut points; build the frequency distribution table; and draw the frequency histogram.

        Here is how to plot a frequency histogram:
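The steps before the histogram can be sketched with pandas as follows. The sales values and the class width of 500 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales values, for illustration only.
sale = pd.Series([45, 420, 510, 980, 1200, 1490, 1510, 2300, 2950, 3960])

value_range = sale.max() - sale.min()        # step 1: range
width = 500                                  # step 2: class width (an assumption)

# step 3: cut points aligned to multiples of the class width, covering min and max
start = (sale.min() // width) * width
stop = (sale.max() // width + 1) * width
edges = np.arange(start, stop + width, width)
n_bins = len(edges) - 1                      # resulting number of classes

# step 4: frequency distribution table with left-closed intervals [a, b)
groups = pd.cut(sale, bins=edges, right=False)
table = groups.value_counts().sort_index().to_frame("frequency")
table["relative_frequency"] = table["frequency"] / len(sale)
print(table)
```

Aligning the cut points to multiples of the class width keeps the interval labels easy to read.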

import pandas as pd
import matplotlib.pyplot as plt

catering_sale = 'catering_fish_congee.xls'
data = pd.read_excel(catering_sale, names=['date', 'sale'])  # read the data and name the columns 'date' and 'sale'

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can display the Chinese labels

d = 500  # class width
num_bins = round((max(data['sale']) - min(data['sale'])) / d)  # number of classes
plt.figure(figsize=(10, 6))  # figure size
plt.hist(data['sale'], num_bins)
plt.xticks(range(0, 4000, d))
plt.xlabel('sale分层')  # x label: "sale classes"
plt.grid()
plt.title('季度销售额频率分布直方图', fontsize=20)  # title: "Frequency histogram of quarterly sales"
plt.show()

2. Distribution analysis of qualitative data

        For qualitative data, pie charts or bar charts are usually used to describe the distribution, for example:

import pandas as pd
import matplotlib.pyplot as plt

catering_dish_profit = 'catering_dish_profit.xls'
data = pd.read_excel(catering_dish_profit)  # read the data

plt.rcParams['font.sans-serif'] = 'SimHei'  # font that can display the Chinese labels

# Pie chart
x = data['盈利']  # '盈利' = profit
labels = data['菜品名']  # '菜品名' = dish name
plt.figure(figsize=(8, 6))  # figure size
plt.pie(x, labels=labels)  # draw the pie chart
plt.title('菜品销售量分布(饼图)')  # title: "Dish sales distribution (pie chart)"
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()

# Bar chart
x = data['菜品名']
y = data['盈利']
plt.figure(figsize=(8, 4))  # figure size
plt.bar(x, y)
plt.xlabel('菜品')  # x label: "dish"
plt.ylabel('销量')  # y label: "sales volume"
plt.title('菜品销售量分布(条形图)')  # title: "Dish sales distribution (bar chart)"
plt.show()  # show the figure

3.2 Comparative analysis 

        Comparative analysis compares two related indicators to show and explain, in quantitative terms, the size and level of the object under study. It mainly takes the following forms:

1. Absolute number comparison

        Compare absolute numbers directly to find the difference.

2. Relative number comparison

        Relative number comparison divides one of two related indicators by the other. Relative numbers can be divided into structural, proportional, comparative, intensity, plan-completion, and dynamic relative numbers. For example:

 

import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can display the Chinese labels

# Compare monthly sales across departments A, B, and C
data = pd.read_excel("dish_sale.xls")
plt.figure(figsize=(8, 4))
plt.plot(data['月份'], data['A部门'], color='green', label='A部门', marker='o')  # '月份' = month, 'A部门' = department A
plt.plot(data['月份'], data['B部门'], color='red', label='B部门', marker='s')
plt.plot(data['月份'], data['C部门'], color='skyblue', label='C部门', marker='x')
plt.legend()  # show the legend
plt.ylabel('销售额(万元)')  # y label: "sales (10,000 yuan)"
plt.show()


# Compare department B's sales across years
data = pd.read_excel("dish_sale_b.xls")
plt.figure(figsize=(8, 4))
plt.plot(data['月份'], data['2012年'], color='green', label='2012年', marker='o')  # '2012年' = year 2012
plt.plot(data['月份'], data['2013年'], color='red', label='2013年', marker='s')
plt.plot(data['月份'], data['2014年'], color='skyblue', label='2014年', marker='x')
plt.legend()  # show the legend
plt.ylabel('销售额(万元)')  # y label: "sales (10,000 yuan)"
plt.show()
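Besides plotting, relative numbers can be computed directly. The sketch below uses hypothetical monthly sales for two departments and shows two of the forms named above: a structural relative number (each part's share of the whole) and a dynamic relative number (the same indicator compared across periods).

```python
import pandas as pd

# Hypothetical monthly sales (10,000 yuan) for two departments.
sales = pd.DataFrame(
    {"A": [8.0, 6.0, 9.0], "B": [7.0, 9.0, 6.0]},
    index=["Jan", "Feb", "Mar"],
)

# Structural relative number: each department's share of the monthly total.
share = sales.div(sales.sum(axis=1), axis=0)

# Dynamic relative number: each month's sales divided by the previous month's.
month_over_month = sales / sales.shift(1)

print(share)
print(month_over_month)
```

The first row of `month_over_month` is NaN because the first month has no earlier period to compare against.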

3.3 Statistical analysis 

        Statistical analysis describes quantitative data with summary statistics, usually from two aspects: central tendency and dispersion.

        1. Measures of central tendency

                Central tendency is mainly described by the mean, median, and mode.

        2. Measures of dispersion

                Dispersion is mainly described by the range, standard deviation, coefficient of variation, and interquartile range.
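The statistics listed above can be computed with pandas along these lines. The values are made up for illustration, and the coefficient of variation is taken here as the standard deviation divided by the mean.

```python
import pandas as pd

# Hypothetical values, for illustration only.
values = pd.Series([2, 4, 4, 5, 7, 9, 11])

# Central tendency
mean = values.mean()
median = values.median()
mode = values.mode().tolist()   # mode() can return several values, hence a list

# Dispersion
value_range = values.max() - values.min()
std = values.std()              # sample standard deviation (ddof=1)
cv = std / mean                 # coefficient of variation
iqr = values.quantile(0.75) - values.quantile(0.25)

print(mean, median, mode)
print(value_range, std, cv, iqr)
```

The coefficient of variation is unitless, which makes it useful for comparing the spread of variables measured on different scales.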

3.4 Periodic Analysis

        Periodic analysis explores whether a variable shows a periodic pattern over time, such as a yearly, seasonal, monthly, or weekly cycle.
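One simple way to look for a weekly cycle, sketched below on synthetic daily data, is to average the series by day of week and to check its autocorrelation at a lag of 7 days. The weekly pattern and noise level are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: 8 weeks starting on a Monday, with a built-in weekly pattern.
dates = pd.date_range("2023-01-02", periods=56, freq="D")
weekly_pattern = np.array([100, 110, 105, 115, 130, 180, 170])  # Monday to Sunday
rng = np.random.default_rng(0)
sales = pd.Series(np.tile(weekly_pattern, 8) + rng.normal(0, 5, 56), index=dates)

# Average by day of week (0 = Monday, 6 = Sunday).
by_weekday = sales.groupby(sales.index.dayofweek).mean()

# A high autocorrelation at lag 7 suggests a weekly cycle.
lag7 = sales.autocorr(lag=7)

print(by_weekday)
print(lag7)
```

On real data, plotting the series over time is usually the first step; these summaries then help confirm what the plot suggests.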

3.5 Contribution Analysis

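        Contribution analysis, also known as Pareto analysis, is based on the Pareto principle (the 20/80 rule): the same input yields different returns depending on where it is placed. For a company, for example, roughly 80% of the profit often comes from the most profitable 20% of products.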
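A Pareto-style contribution analysis can be sketched as follows: sort the items by contribution, compute the cumulative share, and find how many items are needed to reach 80% of the total. The dish names and profit figures are hypothetical.

```python
import pandas as pd

# Hypothetical profit per dish, for illustration only.
profit = pd.Series(
    [10, 45, 3, 30, 6, 1, 4, 1],
    index=["dish_3", "dish_1", "dish_6", "dish_2", "dish_4", "dish_7", "dish_5", "dish_8"],
)

profit = profit.sort_values(ascending=False)   # largest contributors first
cum_share = profit.cumsum() / profit.sum()     # cumulative share of total profit

# Number of items needed for the cumulative share to reach 80%.
n_key = int((cum_share < 0.8).sum()) + 1
key_items = profit.iloc[:n_key]

print(cum_share)
print(key_items)
```

A Pareto chart plots the sorted bars together with the cumulative share as a line, which makes the key items easy to spot.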

3.6 Correlation analysis

        Correlation analysis is the process of analyzing the strength of the linear relationship between continuous variables and expressing it with appropriate statistical indicators. Common methods are:

1. Draw a scatter plot directly. 2. Draw a scatter plot matrix. 3. Calculate a correlation coefficient (Pearson coefficient, Spearman coefficient, coefficient of determination).
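The coefficients can be computed with pandas as sketched below. The two columns are made up so that y increases with x but not linearly, which shows the difference between the Pearson and Spearman coefficients. The coefficient of determination is taken here as the square of the Pearson coefficient, which holds for a simple linear relationship between two variables.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})

pearson = df.corr(method="pearson").loc["x", "y"]    # linear correlation
spearman = df.corr(method="spearman").loc["x", "y"]  # rank correlation
r_squared = pearson ** 2                             # coefficient of determination

print(pearson, spearman, r_squared)
```

The Spearman coefficient equals 1 here because the ranks of x and y agree exactly, while the Pearson coefficient is slightly below 1 because the relationship is not a straight line.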


Origin blog.csdn.net/weixin_63009369/article/details/130020720