Python data analysis with a hospital drug sales case!

 

A basic data analysis workflow generally consists of the following stages: asking a question, understanding the data, cleaning the data, building the model, and visualizing the data.

This project walks through a detailed analysis of Chaoyang Hospital drug sales data following the process above.

1. Ask a question

 

Before analyzing the data, we need a clear objective. Without one, we flail around like headless flies once we have the data and cannot get started; a clear goal also helps us select data and carry out the analysis more efficiently.

The goal of this analysis is to derive the following business metrics from the sales data:

1) average monthly number of purchases

2) average monthly purchase amount

3) average transaction value (客单价)

4) purchase trend over time

With the objectives set, let's look at the data.

 

2. Understand the data

1) Import packages and load the data file

In [1]:

# Import the numpy and pandas packages
import numpy as np
import pandas as pd

# Load the data
salesDf = pd.read_excel('/home/kesci/input/medical9242/朝阳医院2018年销售数据.xlsx')

 

2) Check the basic properties of the imported data

In [2]:

# Check the type of the imported data
type(salesDf)

Out[2]:

pandas.core.frame.DataFrame

In [3]:

salesDf.dtypes

Out[3]:

购药时间    object
社保卡号    float64
商品编码    float64
商品名称    object
销售数量    float64
应收金额    float64
实收金额    float64
dtype: object

In [4]:

salesDf.shape

Out[4]:

(6578, 7)

In [5]:

# View the column names
salesDf.columns

Out[5]:

Index(['购药时间', '社保卡号', '商品编码', '商品名称', '销售数量', '应收金额', '实收金额'], dtype='object')

The columns are 购药时间 (drug purchase time), 社保卡号 (social security card number), 商品编码 (commodity code), 商品名称 (product name), 销售数量 (sales quantity), 应收金额 (amount receivable), and 实收金额 (amount received).

In [6]:

# Count the non-null values in each column
salesDf.count()

Out[6]:

购药时间    6576
社保卡号    6576
商品编码    6577
商品名称    6577
销售数量    6577
应收金额    6577
实收金额    6577
dtype: int64

In [7]:

# View the first five rows
salesDf.head()

Out[7]:

   购药时间            社保卡号          商品编码      商品名称     销售数量  应收金额   实收金额
0  2018-01-01 星期五  1.616528e+06  236701.0  强力VC银翘片  6.0   82.8   69.00
1  2018-01-02 星期六  1.616528e+06  236701.0  清热解毒口服液  1.0   28.0   24.64
2  2018-01-06 星期三  1.260283e+07  236701.0  感康       2.0   16.8   15.00
3  2018-01-11 星期一  1.007034e+10  236701.0  三九感冒灵    1.0   28.0   28.00
4  2018-01-15 星期五  1.015543e+08  236701.0  三九感冒灵    8.0   224.0  208.00

 

3. Data cleaning

Having obtained the data, we cannot begin the analysis right away. The data we receive usually does not fully meet our requirements: it may contain missing values and outliers, and these would bias the results of our analysis. Before the analysis, therefore, we need processing steps such as selecting a subset, filling in missing data, handling outliers, and converting data types. These all fall under data cleaning. In data analysis, up to 60% of the time is typically spent cleaning the data. The usual cleaning steps are: • Select a subset

• Rename columns

• Handle missing data

• Convert data types

• Sort the data

• Handle outliers

Some of these steps cannot be completed in a single pass; you may need to repeat them.

Now let's clean the pharmacy sales data.

1) Select a subset

The pharmacy sales data has only a few columns, so subset selection can be skipped here, and we start with renaming columns. (A sketch of what subset selection would look like is shown below.)
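For reference, if a subset were needed, a minimal sketch could look like this (the column choice and the new variable names are purely illustrative):

# Keep only the columns of interest (illustrative; this analysis keeps all columns)
subsetDf = salesDf[['购药时间', '销售数量', '实收金额']]
# Or select a range of rows by position, e.g. the first 100 records
firstRowsDf = salesDf.iloc[0:100, :]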

2) Rename columns

In a sales dataset, 销售时间 (sales time) is a more appropriate column name than 购药时间 (drug purchase time), so we rename this column first.

In [8]:

# 购药时间 -> 销售时间
nameChangeDict = {'购药时间': '销售时间'}
# inplace=True means the original DataFrame is modified in place
salesDf.rename(columns=nameChangeDict, inplace=True)

 

3) Handle missing data

There are several ways to handle missing data:

▪ Delete

When the missing records make up only a small share of the data, we usually simply delete them.

▪ Filling in a reasonable value

When deletion is not appropriate, we can fill the missing data with a reasonable value, such as the mean, the median, or a neighboring value; a sketch follows below.
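As a minimal, purely illustrative sketch of the filling approach (this analysis deletes the rows instead, as shown next):

# Fill missing amounts with the column mean (illustrative only)
salesDf['实收金额'] = salesDf['实收金额'].fillna(salesDf['实收金额'].mean())
# Or carry the nearest preceding value forward
salesDf['销售时间'] = salesDf['销售时间'].ffill()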

In [9]:

# First check which columns contain missing values
salesDf.isnull().any()

Out[9]:

销售时间    True
社保卡号    True
商品编码    True
商品名称    True
销售数量    True
应收金额    True
实收金额    True
dtype: bool

 

So every column contains missing values. In this sales data, 销售时间 and 社保卡号 are required fields, so we handle missing data only for these two columns, by deletion. Let's first see how many rows have 销售时间 or 社保卡号 missing, and then delete them.

In [10]:

# Look at the rows with missing values; isnull() is the usual way to find them
salesDf[salesDf[['销售时间', '社保卡号']].isnull().values == True]

Out[10]:

      销售时间            社保卡号        商品编码       商品名称  销售数量  应收金额  实收金额
6570  NaN             11778628.0  2367011.0  高特灵   10.0  56.0  56.00
6571  2018-04-25 星期二  NaN         2367011.0  高特灵   2.0   11.2  9.86
6574  NaN             NaN         NaN        NaN   NaN   NaN   NaN
6574  NaN             NaN         NaN        NaN   NaN   NaN   NaN

In [11]:

# Row 6574 has both 销售时间 and 社保卡号 missing, so it matches twice and
# appears twice above; drop the duplicates
naDf = salesDf[salesDf[['销售时间', '社保卡号']].isnull().values == True].drop_duplicates()
naDf

Out[11]:

      销售时间            社保卡号        商品编码       商品名称  销售数量  应收金额  实收金额
6570  NaN             11778628.0  2367011.0  高特灵   10.0  56.0  56.00
6571  2018-04-25 星期二  NaN         2367011.0  高特灵   2.0   11.2  9.86
6574  NaN             NaN         NaN        NaN   NaN   NaN   NaN

 

So there are three rows in total with 销售时间 or 社保卡号 missing. When the data volume is large, we can simply report the number of such rows rather than displaying their contents.

In [12]:

# Number of rows to be deleted
naDf.shape[0]

Out[12]:

3

 

Now delete these rows with missing data.

In [13]:

# Drop the rows where 销售时间 or 社保卡号 is missing
salesDf = salesDf.dropna(subset=['销售时间', '社保卡号'], how='any')
# Show the size of the dataset after the deletion
salesDf.shape

Out[13]:

(6575, 7)

 

After deleting rows we should refresh the index, otherwise problems can arise later.

In [14]:

# Reset the row index: after the deletion the index still holds the old row
# numbers, so rebuild it in order from 0 to N-1
salesDf = salesDf.reset_index(drop=True)

 

4) Convert data types

▪ Quantity and amount columns: convert from string type to numeric (float) type

In [15]:

salesDf['销售数量'] = salesDf['销售数量'].astype('float')
salesDf['应收金额'] = salesDf['应收金额'].astype('float')
salesDf['实收金额'] = salesDf['实收金额'].astype('float')
print('Data types after conversion:\n', salesDf.dtypes)

 

Data types after conversion:
销售时间    object
社保卡号    float64
商品编码    float64
商品名称    object
销售数量    float64
应收金额    float64
实收金额    float64
dtype: object

 

▪ Date column: convert from string type to date type. 销售时间 contains both the date and the day of the week, and we only need to keep the date part. Here we implement this with a custom function, dateChange.

In [16]:

# Date conversion function
def dateChange(dateSer):
    dateList = []
    for i in dateSer:
        # e.g. '2018-01-01 星期五' is split and only '2018-01-01' is kept
        dateStr = i.split(' ')[0]
        dateList.append(dateStr)
    dateChangeSer = pd.Series(dateList)
    return dateChangeSer

dateChangeSer = dateChange(salesDf['销售时间'])
dateChangeSer

Out[16]:

0       2018-01-01
1 2018-01-02
2 2018-01-06
3 2018-01-11
4 2018-01-15
5 2018-01-20
6 2018-01-31
7 2018-02-17
8 2018-02-22
9 2018-02-24
10 2018-03-05
11 2018-03-05
12 2018-03-05
13 2018-03-07
14 2018-03-09
15 2018-03-15
16 2018-03-15
17 2018-03-15
18 2018-03-20
19 2018-03-22
20 2018-03-23
21 2018-03-24
22 2018-03-24
23 2018-03-28
24 2018-03-29
25 2018-04-05
26 2018-04-07
27 2018-04-13
28 2018-04-22
29 2018-05-01
...
6545 2018-04-05
6546 2018-04-05
6547 2018-04-09
6548 2018-04-10
6549 2018-04-10
6550 2018-04-10
6551 2018-04-12
6552 2018-04-13
6553 2018-04-13
6554 2018-04-14
6555 2018-04-15
6556 2018-04-15
6557 2018-04-15
6558 2018-04-15
6559 2018-04-16
6560 2018-04-17
6561 2018-04-18
6562 2018-04-21
6563 2018-04-22
6564 2018-04-24
6565 2018-04-25
6566 2018-04-25
6567 2018-04-25
6568 2018-04-26
6569 2018-04-26
6570 2018-04-27
6571 2018-04-27
6572 2018-04-27
6573 2018-04-27
6574 2018-04-28
Length: 6575, dtype: object
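Incidentally, the same result can be obtained without an explicit loop by using pandas' vectorized string methods; a one-line sketch, assuming 销售时间 still holds strings such as '2018-01-01 星期五':

# Vectorized equivalent of dateChange: keep only the text before the first space
dateChangeSer = salesDf['销售时间'].str.split(' ').str[0]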

In [17]:

salesDf['销售时间'] = dateChangeSer
salesDf.head()

Out[17]:

   销售时间        社保卡号          商品编码      商品名称     销售数量  应收金额   实收金额
0  2018-01-01  1.616528e+06  236701.0  强力VC银翘片  6.0   82.8   69.00
1  2018-01-02  1.616528e+06  236701.0  清热解毒口服液  1.0   28.0   24.64
2  2018-01-06  1.260283e+07  236701.0  感康       2.0   16.8   15.00
3  2018-01-11  1.007034e+10  236701.0  三九感冒灵    1.0   28.0   28.00
4  2018-01-15  1.015543e+08  236701.0  三九感冒灵    8.0   224.0  208.00

 

Check whether the conversion introduced any new missing values.

In [18]:

salesDf['销售时间'].isnull().any()

Out[18]:

False

In [19]:

salesDf.dtypes

Out[19]:

销售时间    object
社保卡号    float64
商品编码    float64
商品名称    object
销售数量    float64
应收金额    float64
实收金额    float64
dtype: object

 

No new missing values appeared, so we continue and convert the 销售时间 column to a date type.

In [20]:

# errors='coerce' turns any unparseable dates into NaT
dateSer = pd.to_datetime(salesDf['销售时间'], format='%Y-%m-%d', errors='coerce')
dateSer

Out[20]:

0      2018-01-01
1 2018-01-02
2 2018-01-06
3 2018-01-11
4 2018-01-15
5 2018-01-20
6 2018-01-31
7 2018-02-17
8 2018-02-22
9 2018-02-24
10 2018-03-05
11 2018-03-05
12 2018-03-05
13 2018-03-07
14 2018-03-09
15 2018-03-15
16 2018-03-15
17 2018-03-15
18 2018-03-20
19 2018-03-22
20 2018-03-23
21 2018-03-24
22 2018-03-24
23 2018-03-28
24 2018-03-29
25 2018-04-05
26 2018-04-07
27 2018-04-13
28 2018-04-22
29 2018-05-01
...
6545 2018-04-05
6546 2018-04-05
6547 2018-04-09
6548 2018-04-10
6549 2018-04-10
6550 2018-04-10
6551 2018-04-12
6552 2018-04-13
6553 2018-04-13
6554 2018-04-14
6555 2018-04-15
6556 2018-04-15
6557 2018-04-15
6558 2018-04-15
6559 2018-04-16
6560 2018-04-17
6561 2018-04-18
6562 2018-04-21
6563 2018-04-22
6564 2018-04-24
6565 2018-04-25
6566 2018-04-25
6567 2018-04-25
6568 2018-04-26
6569 2018-04-26
6570   2018-04-27
6571 2018-04-27
6572 2018-04-27
6573 2018-04-27
6574 2018-04-28
Name: 销售时间, Length: 6575, dtype: datetime64[ns]

In [21]:

dateSer.isnull().any()

Out[21]:

True

In [22]:

# Show the coerced NaT values alongside the original date strings
compareDf = pd.DataFrame(dateSer[dateSer.isnull()], salesDf[dateSer.isnull()]['销售时间'])
compareDf

Out[22]:

            销售时间
销售时间
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT
2018-02-29 NaT

 

Looking at this data, the nulls arise because the date '2018-02-29' does not actually exist (2018 was not a leap year). In practice it is best to ask the business side about the cause, in case the dates were merely recorded incorrectly and should be corrected rather than discarded. Here we simply delete these rows.

In [23]:

salesDf['销售时间'] = dateSer
salesDf.dtypes

Out[23]:

销售时间    datetime64[ns]
社保卡号    float64
商品编码    float64
商品名称    object
销售数量    float64
应收金额    float64
实收金额    float64
dtype: object

In [24]:

salesDf = salesDf.dropna(subset=['销售时间', '社保卡号'], how='any')
salesDf.shape

Out[24]:

(6552, 7)

In [25]:

salesDf = salesDf.reset_index(drop=True)

 

5) Sort the data

Sales records are generally ordered by sales time, so let's sort the data accordingly.

In [26]:

# Sort by 销售时间
salesDf = salesDf.sort_values(by='销售时间')
# Refresh the index again
salesDf = salesDf.reset_index(drop=True)

 

6) Handle outliers

The descriptive statistics of the dataset below show records with a negative sales quantity, which is clearly unreasonable; we delete these records as well.

In [27]:

salesDf.describe()

Out[27]:

  社保卡号 商品编码 销售数量 应收金额 实收金额
count 6.552000e+03 6.552000e+03 6552.000000 6552.00000 6552.000000
mean 6.095150e+09 1.015031e+06 2.384158 50.43025 46.266972
std 4.888430e+09 5.119572e+05 2.374754 87.68075 81.043956
min 1.616528e+06 2.367010e+05 -10.000000 -374.00000 -374.000000
25% 1.014290e+08 8.614560e+05 1.000000 14.00000 12.320000
50% 1.001650e+10 8.615070e+05 2.000000 28.00000 26.500000
75% 1.004898e+10 8.687840e+05 2.000000 59.60000 53.000000
max 1.283612e+10 2.367012e+06 50.000000 2950.00000 2650.000000

In [28]:

# Remove outliers by filtering the data with a boolean condition
# Query condition
querySer = salesDf.loc[:, '销售数量'] > 0
# Apply the query condition
print('Before removing outliers:', salesDf.shape)
salesDf = salesDf.loc[querySer, :]
print('After removing outliers:', salesDf.shape)

 

Before removing outliers: (6552, 7)
After removing outliers: (6509, 7)

 

With the data cleaned, we can finally build our model. Of course, if further data problems turn up during modeling, we still need to go back and clean the data again.

 

4. Build the model

1) Business metric 1: average monthly purchases = total purchases / number of months

Total purchases: all purchases made by the same person on the same day count as one purchase. We therefore deduplicate on the combination of the 销售时间 and 社保卡号 columns: when both values are identical, only one record is kept and the duplicates are dropped.

Number of months: the data is already sorted by 销售时间, so subtracting the first record's date from the last record's date gives the time span, which we convert to months (here, 198 days // 30 = 6).

In [29]:

# Total number of purchases
kpDf = salesDf.drop_duplicates(subset=['销售时间', '社保卡号'])
total = kpDf.shape[0]
print('Total purchases:', total)

 

Total purchases: 5345

In [30]:

# Number of months
startDay = salesDf.loc[0, '销售时间']
print('Start date:', startDay)
endDay = salesDf.loc[salesDf.shape[0] - 1, '销售时间']
print('End date:', endDay)
monthCount = (endDay - startDay).days // 30
print('Number of months:', monthCount)

 

Start date: 2018-01-01 00:00:00
End date: 2018-07-18 00:00:00
Number of months: 6

In [31]:

# Business metric 1: average monthly purchases = total purchases / number of months
kpi1 = total / monthCount
print('Metric 1: average monthly purchases =', kpi1)

 

Metric 1: average monthly purchases = 890.8333333333334

 

2) Metric 2: average monthly revenue = total revenue / number of months

In [32]:

totalMoney = salesDf['实收金额'].sum()
kpi2 = totalMoney / monthCount
print('Metric 2: average monthly revenue =', kpi2)

 

Metric 2: average monthly revenue = 50672.494999999995

 

3) Metric 3: average transaction value (客单价) = total revenue / total purchases

In [33]:

# Equivalent to totalMoney / total
kpi3 = kpi2 / kpi1
print('Metric 3: average transaction value =', kpi3)

 

Metric 3: average transaction value = 56.88212722170252

 

4) Metric 4: purchase trend, plotted as a line chart

In [34]:

# Copy the data to another DataFrame first, to avoid modifying the cleaned data
# (plain assignment would only create a reference, so use .copy())
groupDf = salesDf.copy()
# Step 1: set the row index to the values of the 销售时间 column
groupDf.index = groupDf['销售时间']
groupDf.head()

Out[34]:

  销售时间 社保卡号 商品编码 商品名称 销售数量 应收金额 实收金额
销售时间              
2018-01-01 2018-01-01 1.616528e+06 236701.0 强力VC银翘片 6.0 82.8 69.0
2018-01-01 2018-01-01 1.078916e+08 861456.0 酒石酸美托洛尔片(倍他乐克) 2.0 14.0 12.6
2018-01-01 2018-01-01 1.616528e+06 861417.0 雷米普利片(瑞素坦) 1.0 28.5 28.5
2018-01-01 2018-01-01 1.007397e+10 866634.0 硝苯地平控释片(欣然) 6.0 111.0 92.5
2018-01-01 2018-01-01 1.001429e+10 866851.0 缬沙坦分散片(易达乐) 1.0 26.0 23.0

In [35]:

# Step 2: group by month
gb = groupDf.groupby(groupDf.index.month)
# Step 3: apply a function to compute the monthly totals
mounthDf = gb.sum()
mounthDf

Out[35]:

  社保卡号 商品编码 销售数量 应收金额 实收金额
销售时间          
1 6.257155e+12 1.073329e+09 2527.0 53561.6 49461.19
2 4.702493e+12 7.438598e+08 1858.0 42028.8 38790.38
3 6.124761e+12 1.007946e+09 2225.0 45318.0 41597.51
4 7.620230e+12 1.226705e+09 3010.0 54324.3 48812.70
5 5.898556e+12 1.004573e+09 2225.0 51263.4 46925.27
6 5.421001e+12 9.289637e+08 2328.0 52300.8 48327.70
7 3.608900e+12 6.259256e+08 1483.0 32568.0 30120.22

In [36]:

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Use a font that can display Chinese labels
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['font.serif'] = ['SimHei']
sns.set_style("darkgrid", {"font.sans-serif": ['simhei', 'Arial']})
%matplotlib inline

# Plot the monthly sales quantity
plt.plot(mounthDf['销售数量'], color='b')

Out[36]:

[<matplotlib.lines.Line2D at 0x7fa6e872e550>]

 

findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.

[Figure: line chart of the monthly 销售数量 (sales quantity), January through July]


April is the peak and February an early low; after April, sales trend steadily downward, and within the recorded period July reaches the lowest level.

Origin: www.cnblogs.com/Py1233/p/12626495.html