pandas data analysis (3)

Continued from: pandas data analysis (2)

DataFrame data processing and analysis

Dealing with outliers in supermarket transaction data

Import Data

import pandas as pd
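# Align columns when values contain East Asian (wide) characters
pd.set_option('display.unicode.ambiguous_as_wide',True)
pd.set_option('display.unicode.east_asian_width',True)
# Read all the data, using the default index
df=pd.read_excel('./超市营业额2.xlsx')
df[df.交易额<200]  # rows with a transaction amount below 200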

# Raise amounts below 200 by 50%, then show the rows that are still below 200
df.loc[df.交易额<200,'交易额']=df[df.交易额<200]['交易额'].map(lambda num:num*1.5)
df[df.交易额<200]

# Rows with a transaction amount above 3000
df[df['交易额']>3000]

# Rows with a transaction amount below 200 or above 3000
df[(df.交易额<200)|(df.交易额>3000)]

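# Replace amounts below 200 with a fixed 200
df.loc[df.交易额<200,'交易额']=200
# Replace amounts above 3000 with a fixed 3000
df.loc[df.交易额>3000,'交易额']=3000
# Rows below 200 or above 3000 (none should remain)
df[(df.交易额<200)|(df.交易額>3000)] if False else df[(df.交易额<200)|(df.交易额>3000)]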
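The two fixed-value replacements above amount to clipping the column to the range [200, 3000]. As a minimal sketch, assuming a small made-up DataFrame in place of the Excel file, the same result can be obtained with a single clip() call:

```python
import pandas as pd

# Hypothetical sample data; only the column name 交易额 comes from the article.
sample = pd.DataFrame({'交易额': [150.0, 800.0, 3500.0, 200.0]})

# clip() raises values below `lower` to 200 and lowers values above `upper` to 3000.
clipped = sample['交易额'].clip(lower=200, upper=3000)
print(clipped.tolist())
```

Values already inside the range are left unchanged, so this is only a more compact spelling of the two df.loc assignments.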

Dealing with missing values in supermarket transaction data

A DataFrame supports the dropna() method for discarding rows that contain missing values, and the fillna() method for replacing missing values in bulk.

dropna(axis=0,how='any',thresh=None,subset=None,inplace=False)
  • how: 'any' drops a row if it contains at least one missing value; 'all' drops a row only if every value in it is missing.
  • thresh: the minimum number of non-missing values a row must contain to be kept.
  • subset: the columns to check when looking for missing values; other columns are ignored.
fillna(value=None,method=None,axis=None,inplace=False,limit=None,downcast=None,**kwargs)
  • value: the value used to replace missing values.
  • method: how to fill missing values. pad/ffill carries the last valid value forward into the missing values that follow it; backfill/bfill fills missing values with the next valid value that comes after them. Recent pandas releases (2.1 and later) deprecate this parameter in favour of the ffill() and bfill() methods.
  • limit: the maximum number of consecutive missing values to fill when method is set.
  • inplace: True modifies the original data in place; False returns a new DataFrame and leaves the original unchanged.
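To make the parameters above concrete, here is a small sketch on a made-up two-column DataFrame (the names demo, a and b are placeholders, not part of the supermarket data). It uses the ffill() and bfill() methods rather than fillna(method=...), since those work across pandas versions:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, rows 0 and 2 are partly missing.
demo = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [np.nan, np.nan, 3.0, 4.0],
})

rows_any = len(demo.dropna(how='any'))        # keeps only rows with no missing value
rows_all = len(demo.dropna(how='all'))        # drops only rows where every value is missing
rows_thresh = len(demo.dropna(thresh=2))      # keeps rows with at least 2 non-missing values
rows_subset = len(demo.dropna(subset=['a']))  # checks column 'a' only

# Forward fill, but fill at most one consecutive missing value.
forward = demo['a'].ffill(limit=1)
# Backward fill: each gap takes the next valid value after it.
backward = demo['a'].bfill()

print(rows_any, rows_all, rows_thresh, rows_subset)
print(forward.tolist(), backward.tolist())
```

With limit=1 the second of the two consecutive gaps in column a stays missing, which is the behaviour the limit parameter describes.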
len(df)  # total number of rows

len(df.dropna())  # number of rows left after dropping rows with missing values

df[df['交易额'].isnull()]  # rows whose transaction amount is missing

# Replace missing values with a fixed value
from copy import deepcopy
dff=deepcopy(df)  # deep copy, so the original df is not affected
dff.loc[dff.交易额.isnull(),'交易额']=1000
print(dff.iloc[[110,124,168],:])

# Replace each missing value with the mean transaction amount of that person
dff=deepcopy(df)
for i in dff[dff.交易额.isnull()].index:
    dff.loc[i,'交易额']=round(dff.loc[dff.姓名==dff.loc[i,'姓名'],'交易额'].mean())
print(dff.iloc[[110,124,168],:])
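The row-by-row loop above can also be expressed without explicit iteration. The following is a sketch only, on a small made-up table (the name 李四 and all amounts are invented for illustration), using groupby().transform('mean') to line up each person's mean with the original rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data: each person has one missing transaction amount.
sales = pd.DataFrame({
    '姓名': ['张三', '张三', '张三', '李四', '李四'],
    '交易额': [1000.0, np.nan, 2000.0, 500.0, np.nan],
})

# transform('mean') returns each person's mean, aligned with the original rows;
# missing values are skipped when the mean is computed.
person_mean = sales.groupby('姓名')['交易额'].transform('mean').round()

# fillna() with a Series fills each gap with the value at the same index.
filled = sales['交易额'].fillna(person_mean)
print(filled.tolist())
```

This gives the same per-person fill as the loop, and avoids recomputing the mean once per missing row.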

# Replace missing values with 80% of the overall mean
df.fillna({'交易额':round(df['交易额'].mean()*0.8)},inplace=True)  # modifies the original data
print(df.iloc[[110,124,168],:])

Dealing with duplicate values in supermarket transaction data

len(df)  # total number of rows

df[df.duplicated()]  # fully duplicated rows

# Shifts where one person covers several counters at the same time
dff=df[['工号','姓名','日期','时段']]
dff=dff[dff.duplicated()]
for row in dff.values:
    print(df[(df.工号==row[0])&(df.日期==row[2])&(df.时段==row[3])])
df=df.drop_duplicates()  # drop fully duplicated rows
print('Number of valid rows:',len(df))
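The search for overlapping shifts can be sketched more directly with the subset and keep parameters of duplicated(). The table below is made up for illustration (the employee IDs, dates, time slots and counter names are all placeholders):

```python
import pandas as pd

# Hypothetical schedule: employee 1001 is assigned to two counters in the same slot.
shifts = pd.DataFrame({
    '工号': [1001, 1001, 1002, 1002],
    '日期': ['2019-03-01'] * 4,
    '时段': ['AM', 'AM', 'AM', 'PM'],
    '柜台': ['A', 'B', 'B', 'B'],
})

# subset limits the comparison to these columns;
# keep=False marks every member of a duplicate group, not only the later ones.
overlap = shifts[shifts.duplicated(subset=['工号', '日期', '时段'], keep=False)]
print(overlap)
```

Because keep=False flags all rows of each group, both counter assignments appear side by side without a second lookup loop.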

# Check for employee IDs and names that were entered incorrectly
dff=df[['工号','姓名']]
print(dff.drop_duplicates())

Use data difference to view employee performance fluctuations

Differencing: diff(periods=1, axis=0)
periods=1 with axis=0 subtracts the immediately preceding row from each row.
periods=2 with axis=0 subtracts the row two positions above from each row.
axis=0 takes differences down the rows (vertically); axis=1 takes differences across the columns (horizontally).
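A tiny worked example may help here. The DataFrame below is invented purely to show what each combination of periods and axis computes:

```python
import pandas as pd

# Hypothetical numbers chosen so the differences are easy to check by hand.
table = pd.DataFrame({'x': [10, 15, 13, 20], 'y': [1, 4, 9, 16]})

d1 = table.diff()           # periods=1, axis=0: each row minus the previous row
d2 = table.diff(periods=2)  # each row minus the row two positions above
dh = table.diff(axis=1)     # each column minus the column to its left

print(d1)
print(d2)
print(dh)
```

Rows (or columns) that have nothing to subtract from come out as NaN, which is why the first row of d1 and the first two rows of d2 are missing.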

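# Day-to-day change in the total transaction amount
# (the column is selected before summing, so only 交易额 is aggregated)
dff=df.groupby(by='日期')['交易额'].sum().diff()
# Format with an explicit sign, so positive numbers get a leading plus
print(dff.map(lambda num:'%+.2f'%num)[:5])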

# Day-to-day change in the total transaction amount for 张三
dff=df[df.姓名=='张三'].groupby(by='日期')['交易额'].sum().diff()
print(dff.map(lambda num:'%+.2f'%num)[:5])

Use pivot tables and crosstabs to view performance summary data

# Total transaction amount per person per day
dff=df.groupby(by=['姓名','日期'],as_index=False)['交易额'].sum()
dff=dff.pivot(index='姓名',columns='日期',values='交易额')
dff

# First 5 days of results for employees whose total is below 50,000 yuan
dff[dff.sum(axis=1)<50000].iloc[:,:5]

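# Names of employees whose total is below 50,000 yuan
print(dff[dff.sum(axis=1)<50000].index.values)
['周七' '钱八']
# The same per-person, per-day totals with pivot_table; margins=True adds an 'All' row and column
df.pivot_table(values='交易额',index='姓名',columns='日期',aggfunc='sum',margins=True)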
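The two approaches used so far, groupby followed by pivot() and a single pivot_table() call, should produce the same table of totals. Here is a minimal sketch on invented records (the name 李四, the day labels d1 and d2, and the amounts are placeholders):

```python
import pandas as pd

# Hypothetical records: 张三 has two transactions on day d1.
records = pd.DataFrame({
    '姓名': ['张三', '张三', '张三', '李四'],
    '日期': ['d1', 'd1', 'd2', 'd1'],
    '交易额': [100, 200, 300, 400],
})

# Route 1: aggregate first, then reshape with pivot().
totals = records.groupby(['姓名', '日期'], as_index=False)['交易额'].sum()
wide_a = totals.pivot(index='姓名', columns='日期', values='交易额')

# Route 2: pivot_table() aggregates and reshapes in one step.
wide_b = records.pivot_table(values='交易额', index='姓名', columns='日期', aggfunc='sum')

print(wide_a)
print(wide_b)
```

pivot() on its own raises an error when an index/column pair occurs more than once, which is why the first route needs the groupby step; pivot_table() handles the repeats through aggfunc.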

# Total transaction amount per person at each counter
dff=df.groupby(by=['姓名','柜台'],as_index=False)['交易额'].sum()
dff.pivot(index='姓名',columns='柜台',values='交易额')

# Number of shifts per person per day
df.pivot_table(values='交易额',index='姓名',columns='日期',aggfunc='count',margins=True)

# Number of shifts per person at each counter
df.pivot_table(values='交易额',index='姓名',columns='柜台',aggfunc='count',margins=True)

# Number of shifts per person per day, using a crosstab (first 5 columns)
pd.crosstab(df.姓名,df.日期,margins=True).iloc[:,:5]

# Total number of shifts per person at each counter
pd.crosstab(df.姓名,df.柜台,margins=True)

# Total transaction amount per person at each counter
pd.crosstab(df.姓名,df.柜台,df.交易额,aggfunc='sum')

# Mean transaction amount per person at each counter
pd.crosstab(df.姓名,df.柜台,df.交易额,aggfunc='mean').round(2)  # keep two decimal places
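To show what the crosstab calls above compute, the sketch below builds a frequency table and a summed table from a small invented dataset (the name 李四, counters A and B, and the amounts are placeholders), and compares the frequency table with an equivalent groupby:

```python
import pandas as pd

# Hypothetical shift records.
shifts = pd.DataFrame({
    '姓名': ['张三', '张三', '李四', '李四', '李四'],
    '柜台': ['A', 'B', 'A', 'A', 'B'],
    '交易额': [100, 200, 300, 400, 500],
})

# With two arguments, crosstab counts how often each (姓名, 柜台) pair occurs.
counts = pd.crosstab(shifts['姓名'], shifts['柜台'])

# The same counts via groupby: size() counts rows, unstack() spreads 柜台 into columns.
same_counts = shifts.groupby(['姓名', '柜台']).size().unstack(fill_value=0)

# With values and aggfunc, crosstab aggregates a third column instead of counting.
sums = pd.crosstab(shifts['姓名'], shifts['柜台'], values=shifts['交易额'], aggfunc='sum')

print(counts)
print(sums)
```

The third positional argument in the article's calls is this values parameter, and it must always be paired with aggfunc.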

View employee performance by time period using resampling techniques

A resampling interval of '7D' groups the data into consecutive 7-day periods.
label='left' uses the start of each period as the index of the resulting DataFrame; label='right' uses the end of each period. on specifies the column to resample by, and that column must hold datetime values.
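The effect of label is easiest to see on evenly spaced data. This is a sketch with made-up values: 14 consecutive days starting on an arbitrarily chosen date, each with a transaction amount of 100:

```python
import pandas as pd

# Hypothetical data: one row per day for two full weeks.
daily = pd.DataFrame({
    '日期': pd.date_range('2019-03-01', periods=14, freq='D'),
    '交易额': [100] * 14,
})

# Default label='left': each 7-day bin is indexed by its first day.
left = daily.resample('7D', on='日期')['交易额'].sum()

# label='right': the same bins, indexed by the day on which each bin ends.
right = daily.resample('7D', on='日期', label='right')['交易额'].sum()

print(left)
print(right)
```

Only the index labels differ between the two results; the rows assigned to each bin, and therefore the sums, are identical.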

df.日期=pd.to_datetime(df.日期)
# Total turnover for each 7-day period
df.resample('7D',on='日期')['交易额'].sum()

# Total turnover for each 7-day period, labelled by the end of the period
df.resample('7D',on='日期',label='right')['交易额'].sum()

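# Mean turnover for each 7-day period, rounded to two decimal places
# (the numeric column is selected first, so text columns are not averaged)
df.resample('7D',on='日期',label='right')['交易额'].mean().round(2)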

# Mean turnover for each 7-day period, computed with a custom function
import numpy as np
func=lambda item:round(np.sum(item)/len(item),2)
df.resample('7D',on='日期',label='right')['交易额'].apply(func)

Origin blog.csdn.net/weixin_46322367/article/details/129584277