Data Mining for Beginners - Data Exploration (2): Distribution Analysis in Data Feature Analysis

Distribution analysis

Distribution analysis reveals the shape and type of a data set's distribution.
For quantitative data, you can build a frequency distribution table, draw a frequency distribution histogram, or draw a stem-and-leaf plot for an intuitive view;
for qualitative data, you can draw a pie chart or a bar chart to display the distribution.
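
The stem-and-leaf plot is the one tool in this list that the rest of the article does not demonstrate. Here is a minimal sketch in plain Python, assuming non-negative integer values; the function name `stem_and_leaf` and the sample numbers are hypothetical and do not come from the article's data set.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows as a sorted list of (stem, leaves) pairs.

    Assumes non-negative integers: the last digit is the leaf,
    the remaining leading digits form the stem.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)
        rows[stem].append(leaf)
    return sorted(rows.items())


# Hypothetical sample values, for illustration only
sample = [45, 47, 52, 58, 58, 61, 63, 70]
for stem, leaves in stem_and_leaf(sample):
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row shows one stem followed by its leaves, so the row lengths give a rough picture of the distribution while every original value stays readable.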

1. Distribution analysis of quantitative data

General steps:

  1. Find the range: range = maximum - minimum
  2. Determine the class width and the number of groups: the class width is the length of each interval, and the number of groups = range / class width, rounded up
  3. Determine the cut points: the cut points are the end points of each interval, so this step fixes where each group starts and ends
  4. Build the frequency distribution table
  5. Draw the frequency distribution histogram
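
Steps 1 to 3 can be sketched in a few lines of Python. The helper `make_cut_points` is hypothetical, and aligning the first cut point to a multiple of the class width is an assumption made here so that the edges come out as round numbers; under that assumption the group count is derived from the aligned start rather than from the raw range.

```python
import math


def make_cut_points(lo, hi, width):
    """Return (range, number of groups, cut points) for left-closed groups [a, b)."""
    value_range = hi - lo                       # step 1: range = maximum - minimum
    start = math.floor(lo / width) * width      # assumption: start at a multiple of width
    n_groups = int((hi - start) // width) + 1   # step 2: enough groups to contain hi
    edges = [start + i * width for i in range(n_groups + 1)]  # step 3: cut points
    return value_range, n_groups, edges


# Minimum 45, maximum 3960, class width 500
value_range, n_groups, edges = make_cut_points(45, 3960, 500)
print(value_range, n_groups, edges)
```

Because the groups are left-closed, the last edge is always strictly greater than the maximum, so no observation falls outside the final group.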

Principles to follow:

  1. The groups must be mutually exclusive
  2. Together, the groups must cover all of the data
  3. All groups should have the same width
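
These three principles can be checked mechanically once the cut points are chosen. The sketch below uses `pandas.IntervalIndex` for that purpose; the cut points and observations are hypothetical values picked for illustration.

```python
import pandas as pd

edges = [0, 500, 1000, 1500]                 # hypothetical cut points
values = pd.Series([45, 499, 500, 1499])     # hypothetical observations

# Left-closed intervals [a, b), built from consecutive cut points
intervals = pd.IntervalIndex.from_breaks(edges, closed='left')

mutually_exclusive = not intervals.is_overlapping           # principle 1
covers_all = pd.cut(values, intervals).notna().all()        # principle 2
equal_width = intervals.length.nunique() == 1               # principle 3

print(mutually_exclusive, covers_all, equal_width)
```

A value that falls outside every interval becomes `NaN` after `pd.cut`, which is why the coverage check looks for missing entries.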

Example:
Group the sales figures and compute the frequency of each group

  1. Find the range: 3960 - 45 = 3915
  2. Grouping: based on the business data, take a class width of 500; the number of groups is then 3915 / 500 = 7.83, rounded up to 8 groups.
  3. Cut points, in order: [0,500),[500,1000),[1000,1500),[1500,2000),[2000,2500),[2500,3000),[3000,3500),[3500,4000)
  4. Build the frequency distribution table
  5. Draw the frequency distribution histogram
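import pandas as pd  # import the required packages
import matplotlib.pyplot as plt

catering_sale = './data/catering_fish_congee.xls'  # catering sales data
data = pd.read_excel(catering_sale, names=['date', 'sale'])  # read the data and name the two columns

bins = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
labels = ['[0,500)', '[500,1000)', '[1000,1500)', '[1500,2000)',
          '[2000,2500)', '[2500,3000)', '[3000,3500)', '[3500,4000)']

# right=False makes each interval left-closed, matching the [a, b) labels
data['sale分层'] = pd.cut(data.sale, bins, labels=labels, right=False)
# count the rows in each group; observed=False keeps groups that are empty
aggResult = data.groupby(by=['sale分层'], observed=False)['sale'].agg(sale='size')

# frequency of each group, as a percentage
pAggResult = (aggResult / aggResult.sum()).round(2) * 100

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese labels
plt.figure(figsize=(10, 6))  # set the figure size
pAggResult['sale'].plot(kind='bar', width=0.8, fontsize=10)  # draw the frequency histogram
plt.title('季度销售额频率分布直方图', fontsize=20)  # "Frequency distribution histogram of quarterly sales"
plt.show()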
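
The code above goes straight to the plot, so step 4 (the frequency distribution table) never appears on screen. Here is a minimal sketch of how such a table could be built; the `sales` values are hypothetical stand-ins for the `sale` column, and the column names `count`, `frequency`, and `cumulative_frequency` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical sales figures standing in for the 'sale' column
sales = pd.Series([45, 320, 480, 760, 990, 1200, 2600, 3960])
bins = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]

groups = pd.cut(sales, bins, right=False)     # left-closed intervals [a, b)
counts = groups.value_counts().sort_index()   # one row per group, in bin order

table = pd.DataFrame({
    'count': counts,
    'frequency': counts / counts.sum(),
})
table['cumulative_frequency'] = table['frequency'].cumsum()
print(table)
```

Empty groups stay in the table with a count of 0 because `pd.cut` returns categorical data, which keeps every interval as a category.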

[Figure: frequency distribution histogram of sales]

Sample of the original data:

[Figure: sample rows of the original data]
Original data access:
Link: https://pan.baidu.com/s/1gA6KPwfI5Y26S2qm2Uipow
Extraction code: 2677

2. Distribution analysis of qualitative data

Qualitative data are usually grouped by the categories of the variable, and pie charts and bar charts can be used to describe their distribution.
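
Grouping by category amounts to counting how often each category occurs. The sketch below does this with `Series.value_counts`; the `orders` records are hypothetical (one row per dish sold) and are not taken from the article's file.

```python
import pandas as pd

# Hypothetical order records, one entry per dish sold
orders = pd.Series(
    ['congee', 'noodles', 'congee', 'dumplings', 'congee', 'noodles'],
    name='dish',
)

counts = orders.value_counts()                       # frequency of each category
percent = orders.value_counts(normalize=True) * 100  # share of each category, in percent

summary = pd.DataFrame({'count': counts, 'percent': percent.round(1)})
print(summary)
```

The `count` column is what a bar chart would plot, and the `percent` column corresponds to the sector sizes of a pie chart.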

import pandas as pd
import matplotlib.pyplot as plt

catering_dish_profit = './data/catering_dish_profit.xls'  # catering dish data
data = pd.read_excel(catering_dish_profit)  # read the data

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese labels

# draw the pie chart
x = data['盈利']  # profit column
labels = data['菜品名']  # dish name column
plt.figure(figsize=(8, 6))  # set the canvas size
plt.pie(x, labels=labels)  # draw the pie chart
plt.title('菜品销售量分布(饼图)')  # title: "Dish sales distribution (pie chart)"
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()

# draw the bar chart
x = data['菜品名']
y = data['盈利']
plt.figure(figsize=(8, 4))  # set the canvas size
plt.bar(x, y)
plt.xlabel('菜品')  # x-axis label: "Dish"
plt.ylabel('销量')  # y-axis label: "Sales volume"
plt.title('菜品销售量分布(条形图)')  # title: "Dish sales distribution (bar chart)"
plt.show()  # display the figure

[Figure: pie chart of the dish distribution]
Each sector of the pie chart represents the percentage or frequency of one category.
[Figure: bar chart of the dish distribution]
The height of each bar in the bar chart represents the percentage or frequency of one category; the bar width carries no meaning.

Sample of the original data:
[Figure: sample rows of the original data]
Original data access:
Link: https://pan.baidu.com/s/1zHip89y-AkN2smq3ZE-5pw
Extraction code: 2677

Origin blog.csdn.net/qq_45154565/article/details/109283869