Python data analysis and mining practice (data exploration)

After collecting a preliminary sample data set through observation and surveys, the next questions are: Do the quantity and quality of the sample data meet the requirements for model construction? Are there data states that were never anticipated? Are there obvious patterns or trends? How are the factors related to one another?

Data exploration is the process of analyzing the structure and regularities of a sample data set by checking its data quality, drawing charts, and computing characteristic statistics. It helps in choosing appropriate data preprocessing and modeling methods, and can even solve some problems that would otherwise be left to data mining.

Read the first data set:

import pandas as pd
import numpy as np

catering_sale = 'D:\\WeChat_Documents\\WeChat Files\\FileStorage\\File\\2023-02\\catering_fish_congee.xls'  # catering sales data
data = pd.read_excel(catering_sale, names=['date', 'sale'])  # read the data and rename the two columns to 'date' and 'sale'

print(data)
print(data.describe())  # summary statistics
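
Beyond `describe()`, a quick data-quality check usually counts missing values and duplicated rows. The following is a minimal sketch on a small made-up table; the sample values and the `quality_report` helper are illustrative assumptions, not part of the original data or code.

```python
import numpy as np
import pandas as pd


def quality_report(df):
    """Return basic data-quality counts for a DataFrame (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up sample standing in for the real sales table
sample = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-03"]),
    "sale": [510.0, np.nan, 620.0, 620.0],
})

report = quality_report(sample)
print(report)
```

The same helper could be called on the `data` frame read above to see whether any sales values are missing before further analysis.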

1. Outlier Analysis

A box plot provides a criterion for identifying outliers: a value is usually treated as an outlier if it is less than the lower quartile minus 1.5 times the interquartile range, or greater than the upper quartile plus 1.5 times the interquartile range. The code and the resulting figure are as follows:

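# Box plot
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

plt.figure()
p = data.boxplot(return_type='dict')  # draw the box plot and keep its artists
x = p['fliers'][0].get_xdata()  # 'fliers' holds the outlier points
y = p['fliers'][0].get_ydata()
y.sort()  # sort the outlier values in ascending order

# Label each outlier with its value; shift the text so neighbouring labels overlap less
for i in range(len(x)):
    if i > 0:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i]+0.05-0.8/(y[i]-y[i-1]), y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i]+0.08, y[i]))
plt.title('Sales box plot (3001)', fontsize=20)
plt.show()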

 As shown in the figure, 3960 is classified as an outlier.
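
The 1.5 × IQR rule described above can also be applied numerically, without reading the values off a chart. The sketch below uses made-up sales figures, and `iqr_outliers` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers) using the box-plot k * IQR rule."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = s[(s < lower) | (s > upper)]
    return lower, upper, outliers


# Made-up daily sales (illustrative only)
sales = [420, 480, 510, 530, 560, 600, 640, 3960]
lower, upper, outliers = iqr_outliers(sales)
print(lower, upper)
print(outliers.tolist())
```

Values outside the interval from `lower` to `upper` are the ones a box plot would draw as individual points.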

2. Analysis of data characteristics

Distribution analysis reveals the distribution characteristics and distribution type of the data. For quantitative data, to find out whether the distribution is symmetric or skewed and whether there are unusually large or small values, you can build a frequency distribution table, draw a frequency distribution histogram, or draw a stem-and-leaf plot for visual analysis. For qualitative data, pie charts and bar charts display the distribution visually.

The frequency distribution histogram is as follows:

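bins = [0,500,1000,1500,2000,2500,3000,3500,4000]
labels = ['[0,500)','[500,1000)','[1000,1500)','[1500,2000)',
       '[2000,2500)','[2500,3000)','[3000,3500)','[3500,4000)']

# right=False makes each bin left-closed, matching the '[a,b)' labels
data['sale分层'] = pd.cut(data.sale, bins, labels=labels, right=False)
# Number of days in each bin; observed=False keeps empty bins in the result
aggResult = data.groupby(by=['sale分层'], observed=False)['sale'].agg([("sale", np.size)])

pAggResult = round(aggResult / aggResult.sum(), 2) * 100  # share of days in each bin, in percent

# Draw the frequency distribution histogram
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.figure(figsize=(10, 6))  # set the figure size
pAggResult['sale'].plot(kind='bar', width=0.8, fontsize=10)
plt.title('Quarterly sales frequency distribution histogram (3001)', fontsize=20)
plt.show()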

As the figure shows, days with sales in [0, 500) account for a high proportion, and few days in these three months had sales above 2500.
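
The frequency distribution table behind such a histogram can be built directly. The following is a minimal sketch with made-up sales values (not the original data); it uses `right=False` so that each bin is closed on the left, for example [0, 500).

```python
import pandas as pd

# Made-up daily sales (illustrative only)
sales = pd.Series([120, 450, 500, 980, 1500, 2600, 3960])
bins = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]

# Assign each value to a left-closed bin
groups = pd.cut(sales, bins=bins, right=False)

# Count the values per bin, keeping the bins in ascending order
freq = groups.value_counts().sort_index()

table = pd.DataFrame({
    "count": freq,
    "percent": (freq / freq.sum() * 100).round(1),
})
print(table)
```

Note that the value 500 falls into [500, 1000) rather than [0, 500), which is what the left-closed labels promise.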

Load the second data set:

import pandas as pd
import matplotlib.pyplot as plt
catering_dish_profit = 'D:\\WeChat_Documents\\WeChat Files\\FileStorage\\File\\2023-02\\catering_dish_profit.xls'  # dish profit data
data = pd.read_excel(catering_dish_profit)

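The pie chart is as follows:

# Draw the pie chart
x = data['盈利']  # profit of each dish
labels = data['菜品名']  # dish names
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese dish names
plt.figure(figsize=(10, 6))
plt.pie(x, labels=labels)
plt.title('Dish profit distribution (3001)')
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()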

The bar chart is as follows:

# Bar chart
y = data['盈利']  # profit of each dish
x = data['菜品名']  # dish names
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese dish names
plt.figure(figsize=(10, 6))
plt.bar(x, y)
plt.xlabel('Dish')
plt.ylabel('Profit')
plt.title('Dish profit distribution (3001)')
plt.show()

The pie chart and bar chart above make it easy to see the share or frequency of each category in the whole.
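
The shares that a pie chart displays can also be computed explicitly. This is a small sketch with made-up dish names and profits, used only for illustration.

```python
import pandas as pd

# Made-up profit per dish (illustrative only)
profit = pd.Series({"A1": 9000, "A2": 6000, "A3": 3000, "A4": 2000})

# Share of each dish in total profit, in percent
share = (profit / profit.sum() * 100).round(1)
print(share)
```

To print these percentages on the chart itself, `plt.pie` accepts an `autopct` format string such as `'%1.1f%%'`.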

3. Correlation Analysis

The process of analyzing the strength of linear correlation between continuous variables and expressing it with appropriate statistical indicators is called correlation analysis.

For example, a scatter plot can be drawn as follows:

# Scatter plot
dish_names = data['菜品名']  # dish names
profits = data['盈利']  # profit of each dish
plt.figure()
plt.scatter(dish_names, profits, c='red', s=100, label='Profit')
plt.xlabel("Dish", fontdict={'size': 16})
plt.ylabel("Profit", fontdict={'size': 16})
plt.title("Dish profit distribution (3001)", fontdict={'size': 20})
plt.legend(loc='best')
plt.show()
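
The strength of a linear relationship between two continuous variables is commonly summarized with the Pearson correlation coefficient. The sketch below assumes two made-up columns of daily sales for hypothetical dishes `dish_a` and `dish_b`; neither comes from the original data.

```python
import pandas as pd

# Made-up daily sales of two dishes (illustrative only)
df = pd.DataFrame({
    "dish_a": [10, 15, 20, 25, 30],
    "dish_b": [12, 18, 25, 31, 40],
})

# Series.corr uses the Pearson coefficient by default
r = df["dish_a"].corr(df["dish_b"])
print(round(r, 3))

# Pairwise correlation matrix for all numeric columns
print(df.corr())
```

A value of `r` close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.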

 


Origin blog.csdn.net/m0_61463713/article/details/129220730