Pandas data analysis, book companion (2)
Table of contents
- DataFrame data processing and analysis
  - Dealing with outliers in supermarket transaction data
  - Dealing with missing values in supermarket transaction data
  - Dealing with duplicate values in supermarket transaction data
  - Using data differencing to view employee performance fluctuations
  - Using pivot tables and crosstabs to view performance summary data
  - Viewing employee performance by time period using resampling
DataFrame data processing and analysis
Dealing with outliers in supermarket transaction data
Import Data
import pandas as pd
# align columns containing East Asian characters
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)
# read all the data, using the default index
df = pd.read_excel('./超市营业额2.xlsx')
df[df.交易额 < 200]  # rows with transaction amounts below 200
# rows still below 200 after a 50% increase
df.loc[df.交易额 < 200, '交易额'] = df[df.交易额 < 200]['交易额'].map(lambda num: num * 1.5)
df[df.交易额 < 200]
# rows with transaction amounts above 3000
df[df['交易额'] > 3000]
# rows with transaction amounts below 200 or above 3000
df[(df.交易额 < 200) | (df.交易额 > 3000)]
# replace amounts below 200 with a fixed 200
df.loc[df.交易额 < 200, '交易额'] = 200
# replace amounts above 3000 with a fixed 3000
df.loc[df.交易额 > 3000, '交易额'] = 3000
# rows below 200 or above 3000 (should now be empty)
df[(df.交易额 < 200) | (df.交易额 > 3000)]
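The two fixed-value replacements above can also be collapsed into a single call to `Series.clip`, which applies a lower and an upper bound at once. A minimal sketch on made-up amounts:

```python
import pandas as pd

# hypothetical transaction amounts standing in for the 交易额 column
amounts = pd.Series([150, 250, 2800, 3500, 90])

# clip() floors values at 200 and caps them at 3000 in one step,
# equivalent to the two .loc assignments above
clipped = amounts.clip(lower=200, upper=3000)
print(clipped.tolist())  # [200, 250, 2800, 3000, 200]
```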
Dealing with missing values in supermarket transaction data
A DataFrame provides the dropna() method to discard rows containing missing values, and the fillna() method to fill missing values in batch.
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
- how='any' drops a row if it contains any missing value; how='all' drops a row only if all of its values are missing.
- thresh: keep only rows with at least this many non-missing values.
- subset: consider only these columns when looking for missing values.
fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
- value: the value (or dict/Series of values) used to fill missing entries.
- method: how to fill. 'pad'/'ffill' propagates the last valid value forward over the missing entries; 'backfill'/'bfill' fills missing entries with the next valid value that follows them.
- limit: the maximum number of consecutive missing values to fill when method is set.
- inplace: True modifies the original data in place; False returns a new DataFrame and leaves the original unchanged.
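The fill directions described for method are easiest to see on a tiny Series; recent pandas versions also expose them directly as the .ffill() and .bfill() methods, used in this sketch (the values are invented):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# ffill: propagate the last valid value forward over the gaps
print(s.ffill().tolist())        # [1.0, 1.0, 1.0, 4.0, 4.0]
# bfill: fill each gap with the next valid value; the trailing
# NaN has no later valid value, so it stays NaN
print(s.bfill().tolist())
# limit=1: fill at most one consecutive missing value per gap
print(s.ffill(limit=1).tolist())
```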
len(df)  # total number of rows
len(df.dropna())  # number of rows left after dropping missing values
df[df['交易额'].isnull()]  # rows containing missing values
# replace missing values with a fixed value
from copy import deepcopy
dff = deepcopy(df)  # deep copy, so the original df is untouched
dff.loc[dff.交易额.isnull(), '交易额'] = 1000
print(dff.iloc[[110, 124, 168], :])
# replace each missing value with that person's mean transaction amount
dff = deepcopy(df)
for i in dff[dff.交易额.isnull()].index:
    dff.loc[i, '交易额'] = round(dff.loc[dff.姓名 == dff.loc[i, '姓名'], '交易额'].mean())
print(dff.iloc[[110, 124, 168], :])
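The loop above can also be written without explicit iteration: groupby().transform('mean') broadcasts each person's mean back onto that person's rows, and fillna picks it up only where values are missing. A sketch on a toy frame with invented names and amounts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'姓名': ['A', 'A', 'B', 'B'],
                   '交易额': [100.0, np.nan, 300.0, 500.0]})

# per-person mean, aligned row by row with the original frame
per_person_mean = df.groupby('姓名')['交易额'].transform('mean')
df['交易额'] = df['交易额'].fillna(per_person_mean)
print(df['交易额'].tolist())  # [100.0, 100.0, 300.0, 500.0]
```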
# replace missing values with 80% of the overall mean
df.fillna({
    '交易额': round(df['交易额'].mean() * 0.8)}, inplace=True)  # modifies df in place
print(df.iloc[[110, 124, 168], :])
Dealing with duplicate values in supermarket transaction data
len(df)  # total number of rows
df[df.duplicated()]  # duplicated rows
# shifts where one person covers several counters at the same time
dff = df[['工号', '姓名', '日期', '时段']]
dff = dff[dff.duplicated()]
for row in dff.values:
    print(df[(df.工号 == row[0]) & (df.日期 == row[2]) & (df.时段 == row[3])])
df = df.drop_duplicates()  # drop the duplicated rows
print('Number of valid rows:', len(df))
# check for mistyped employee IDs or names
dff = df[['工号', '姓名']]
print(dff.drop_duplicates())
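How duplicated() flags rows is controlled by its keep parameter, which the code above leaves at the default; a small illustration on invented records:

```python
import pandas as pd

df = pd.DataFrame({'工号': [1001, 1001, 1002],
                   '姓名': ['甲', '甲', '乙']})

# keep='first' (the default): only the later copies are flagged
print(df.duplicated().tolist())            # [False, True, False]
# keep=False: every member of a duplicated group is flagged
print(df.duplicated(keep=False).tolist())  # [True, True, False]
# drop_duplicates keeps one representative row per group
print(len(df.drop_duplicates()))           # 2
```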
Using data differencing to view employee performance fluctuations
The diff() method computes discrete differences: diff(periods=1, axis=0).
- periods=1 with axis=0 means each row minus the row immediately above it.
- periods=2 with axis=0 means each row minus the row two positions above it.
- axis=0 differences vertically across rows; axis=1 differences horizontally across columns.
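The three cases above can be verified on a toy frame (the numbers are invented):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 13, 18], 'b': [1, 4, 9]})

# periods=1, axis=0 (the defaults): each row minus the previous row;
# the first row has nothing above it and becomes NaN
print(df.diff().iloc[1].tolist())           # [3.0, 3.0]
# periods=2: each row minus the row two positions above it
print(df.diff(periods=2).iloc[2].tolist())  # [8.0, 8.0]
# axis=1: each column minus the column to its left
print(df.diff(axis=1)['b'].tolist())
```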
# day-to-day change in total transaction amount
dff = df.groupby(by='日期')['交易额'].sum().diff()
# format with an explicit sign on positive numbers
print(dff.map(lambda num: '%+.2f' % num)[:5])
# day-to-day change in 张三's daily total
dff = df[df.姓名 == '张三'].groupby(by='日期')['交易额'].sum().diff()
print(dff.map(lambda num: '%+.2f' % num)[:5])
Using pivot tables and crosstabs to view performance summary data
# each person's daily total transaction amount
dff = df.groupby(by=['姓名', '日期'], as_index=False).sum()
dff = dff.pivot(index='姓名', columns='日期', values='交易额')
dff
# first 5 days of figures for employees whose total is below 50,000
dff[dff.sum(axis=1) < 50000].iloc[:, :5]
# names of employees whose total is below 50,000
print(dff[dff.sum(axis=1) < 50000].index.values)
# output: ['周七' '钱八']
# each person's daily totals, with margin sums
df.pivot_table(values='交易额', index='姓名', columns='日期', aggfunc='sum', margins=True)
# each person's total transaction amount at each counter
dff = df.groupby(by=['姓名', '柜台'], as_index=False).sum()
dff.pivot(index='姓名', columns='柜台', values='交易额')
# number of shifts per person per day
df.pivot_table(values='交易额', index='姓名', columns='日期', aggfunc='count', margins=True)
# number of shifts per person at each counter
df.pivot_table(values='交易额', index='姓名', columns='柜台', aggfunc='count', margins=True)
# number of shifts per person per day, via crosstab
pd.crosstab(df.姓名, df.日期, margins=True).iloc[:, :5]
# total number of shifts per person at each counter
pd.crosstab(df.姓名, df.柜台, margins=True)
# each person's total transaction amount at each counter
pd.crosstab(df.姓名, df.柜台, df.交易额, aggfunc='sum')
# each person's mean transaction amount at each counter, rounded to 2 decimals
pd.crosstab(df.姓名, df.柜台, df.交易额, aggfunc='mean').apply(lambda num: round(num, 2))
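For aggregations like the sums above, crosstab and pivot_table are two routes to the same table (when values and aggfunc are given, crosstab delegates to pivot_table internally). A sketch on invented records:

```python
import pandas as pd

df = pd.DataFrame({'姓名': ['甲', '甲', '乙'],
                   '柜台': ['A', 'B', 'A'],
                   '交易额': [100, 200, 300]})

ct = pd.crosstab(df.姓名, df.柜台, df.交易额, aggfunc='sum')
pt = df.pivot_table(values='交易额', index='姓名', columns='柜台',
                    aggfunc='sum')
# both give the same person-by-counter totals
print(ct.fillna(0).equals(pt.fillna(0)))  # True
```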
Viewing employee performance by time period using resampling
A resampling rule of '7D' means one sample every 7 days. label='left' labels each bucket with the start time of its sampling interval in the resulting index; label='right' labels it with the end time. The on parameter specifies which column to resample on; that column must hold datetime values.
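The effect of label on the resulting index is easiest to see with synthetic daily data (the dates and amounts below are invented): ten days of sales at 100 each fall into one full 7-day bucket and one partial 3-day bucket.

```python
import pandas as pd

df = pd.DataFrame({
    '日期': pd.date_range('2023-03-01', periods=10, freq='D'),
    '交易额': [100] * 10,
})

# label='left' (the default): buckets are labelled by their start date
left = df.resample('7D', on='日期')['交易额'].sum()
print(left.index[0].date(), left.tolist())    # 2023-03-01 [700, 300]
# label='right': the same buckets, labelled by their end date
right = df.resample('7D', on='日期', label='right')['交易额'].sum()
print(right.index[0].date(), right.tolist())  # 2023-03-08 [700, 300]
```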
df.日期 = pd.to_datetime(df.日期)
# total transaction amount per 7 days
df.resample('7D', on='日期').sum()['交易额']
# total per 7 days, labelled by the interval end
df.resample('7D', on='日期', label='right').sum()['交易额']
# mean transaction amount per 7 days, rounded to 2 decimals
# (select the column first so non-numeric columns are not averaged)
func = lambda num: round(num, 2)
df.resample('7D', on='日期', label='right')['交易额'].mean().map(func)
# mean per 7 days, computed manually with numpy
import numpy as np
func = lambda item: round(np.sum(item) / len(item), 2)
df.resample('7D', on='日期', label='right')['交易额'].apply(func)