Pandas data analysis, book companion (2)
Table of contents
- DataFrame data processing and analysis
  - Dealing with outliers in supermarket transaction data
  - Dealing with missing values in supermarket transaction data
  - Dealing with duplicate values in supermarket transaction data
  - Using data differencing to view employee performance fluctuations
  - Using pivot tables and crosstabs to view performance summary data
  - Viewing employee performance by time period using resampling
DataFrame data processing and analysis
Dealing with outliers in supermarket transaction data
Import Data
import pandas as pd
# align columns containing East Asian characters
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)
# read all the data, using the default index
df = pd.read_excel('./超市营业额2.xlsx')
df[df.交易额 < 200]  # rows with transaction amounts below 200
# rows still below 200 after a 50% increase
df.loc[df.交易额 < 200, '交易额'] = df[df.交易额 < 200]['交易额'].map(lambda num: num * 1.5)
df[df.交易额 < 200]
# rows with transaction amounts above 3000
df[df['交易额'] > 3000]
# rows with transaction amounts below 200 or above 3000
df[(df.交易额 < 200) | (df.交易额 > 3000)]
# replace amounts below 200 with a fixed 200
df.loc[df.交易额 < 200, '交易额'] = 200
# replace amounts above 3000 with a fixed 3000
df.loc[df.交易额 > 3000, '交易额'] = 3000
# rows below 200 or above 3000 (should now be empty)
df[(df.交易额 < 200) | (df.交易额 > 3000)]
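The two fixed-value replacements above can also be collapsed into a single call to `Series.clip`, which applies a lower and an upper bound at once. A minimal sketch on made-up amounts:

```python
import pandas as pd

# hypothetical transaction amounts standing in for the 交易额 column
amounts = pd.Series([150, 250, 2800, 3500, 90])

# clip() floors values at 200 and caps them at 3000 in one step,
# equivalent to the two .loc assignments above
clipped = amounts.clip(lower=200, upper=3000)
print(clipped.tolist())  # [200, 250, 2800, 3000, 200]
```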
Dealing with missing values in supermarket transaction data
A DataFrame provides the dropna() method to discard rows containing missing values, and the fillna() method to fill missing values in batch.
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
- how='any' drops a row if it contains any missing value; how='all' drops a row only if all of its values are missing.
- thresh: keep only rows with at least this many non-missing values.
- subset: consider only these columns when looking for missing values.
fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
- value: the value (or dict/Series of values) used to fill missing entries.
- method: how to fill. 'pad'/'ffill' propagates the last valid value forward over the missing entries; 'backfill'/'bfill' fills missing entries with the next valid value that follows them.
- limit: the maximum number of consecutive missing values to fill when method is set.
- inplace: True modifies the original data in place; False returns a new DataFrame and leaves the original unchanged.
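The fill directions described for method are easiest to see on a tiny Series; recent pandas versions also expose them directly as the .ffill() and .bfill() methods, used in this sketch (the values are invented):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# ffill: propagate the last valid value forward over the gaps
print(s.ffill().tolist())        # [1.0, 1.0, 1.0, 4.0, 4.0]
# bfill: fill each gap with the next valid value; the trailing
# NaN has no later valid value, so it stays NaN
print(s.bfill().tolist())
# limit=1: fill at most one consecutive missing value per gap
print(s.ffill(limit=1).tolist())
```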
len(df)  # total number of rows
len(df.dropna())  # number of rows left after dropping missing values
df[df['交易额'].isnull()]  # rows containing missing values
# replace missing values with a fixed value
from copy import deepcopy
dff = deepcopy(df)  # deep copy, so the original df is untouched
dff.loc[dff.交易额.isnull(), '交易额'] = 1000
print(dff.iloc[[110, 124, 168], :])
# replace each missing value with that person's mean transaction amount
dff = deepcopy(df)
for i in dff[dff.交易额.isnull()].index:
    dff.loc[i, '交易额'] = round(dff.loc[dff.姓名 == dff.loc[i, '姓名'], '交易额'].mean())
print(dff.iloc[[110, 124, 168], :])
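The loop above can also be written without explicit iteration: groupby().transform('mean') broadcasts each person's mean back onto that person's rows, and fillna picks it up only where values are missing. A sketch on a toy frame with invented names and amounts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'姓名': ['A', 'A', 'B', 'B'],
                   '交易额': [100.0, np.nan, 300.0, 500.0]})

# per-person mean, aligned row by row with the original frame
per_person_mean = df.groupby('姓名')['交易额'].transform('mean')
df['交易额'] = df['交易额'].fillna(per_person_mean)
print(df['交易额'].tolist())  # [100.0, 100.0, 300.0, 500.0]
```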
# replace missing values with 80% of the overall mean
df.fillna({
    '交易额': round(df['交易额'].mean() * 0.8)}, inplace=True)  # modifies df in place
print(df.iloc[[110, 124, 168], :])
Dealing with duplicate values in supermarket transaction data
len(df)  # total number of rows
df[df.duplicated()]  # duplicated rows
# shifts where one person covers several counters at the same time
dff = df[['工号', '姓名', '日期', '时段']]
dff = dff[dff.duplicated()]
for row in dff.values:
    print(df[(df.工号 == row[0]) & (df.日期 == row[2]) & (df.时段 == row[3])])
df = df.drop_duplicates()  # drop the duplicated rows
print('Number of valid rows:', len(df))
# check for mistyped employee IDs or names
dff = df[['工号', '姓名']]
print(dff.drop_duplicates())
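How duplicated() flags rows is controlled by its keep parameter, which the code above leaves at the default; a small illustration on invented records:

```python
import pandas as pd

df = pd.DataFrame({'工号': [1001, 1001, 1002],
                   '姓名': ['甲', '甲', '乙']})

# keep='first' (the default): only the later copies are flagged
print(df.duplicated().tolist())            # [False, True, False]
# keep=False: every member of a duplicated group is flagged
print(df.duplicated(keep=False).tolist())  # [True, True, False]
# drop_duplicates keeps one representative row per group
print(len(df.drop_duplicates()))           # 2
```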
Using data differencing to view employee performance fluctuations
The diff() method computes discrete differences: diff(periods=1, axis=0).
- periods=1 with axis=0 means each row minus the row immediately above it.
- periods=2 with axis=0 means each row minus the row two positions above it.
- axis=0 differences vertically across rows; axis=1 differences horizontally across columns.
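The three cases above can be verified on a toy frame (the numbers are invented):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 13, 18], 'b': [1, 4, 9]})

# periods=1, axis=0 (the defaults): each row minus the previous row;
# the first row has nothing above it and becomes NaN
print(df.diff().iloc[1].tolist())           # [3.0, 3.0]
# periods=2: each row minus the row two positions above it
print(df.diff(periods=2).iloc[2].tolist())  # [8.0, 8.0]
# axis=1: each column minus the column to its left
print(df.diff(axis=1)['b'].tolist())
```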
# day-to-day change in total transaction amount
dff = df.groupby(by='日期')['交易额'].sum().diff()
# format with an explicit sign on positive numbers
print(dff.map(lambda num: '%+.2f' % num)[:5])
# day-to-day change in 张三's daily total
dff = df[df.姓名 == '张三'].groupby(by='日期')['交易额'].sum().diff()
print(dff.map(lambda num: '%+.2f' % num)[:5])
Using pivot tables and crosstabs to view performance summary data
# each person's daily total transaction amount
dff = df.groupby(by=['姓名', '日期'], as_index=False).sum()
dff = dff.pivot(index='姓名', columns='日期', values='交易额')
dff
# first 5 days of figures for employees whose total is below 50,000
dff[dff.sum(axis=1) < 50000].iloc[:, :5]
# names of employees whose total is below 50,000
print(dff[dff.sum(axis=1) < 50000].index.values)
# output: ['周七' '钱八']
# each person's daily totals, with margin sums
df.pivot_table(values='交易额', index='姓名', columns='日期', aggfunc='sum', margins=True)
# each person's total transaction amount at each counter
dff = df.groupby(by=['姓名', '柜台'], as_index=False).sum()
dff.pivot(index='姓名', columns='柜台', values='交易额')
# number of shifts per person per day
df.pivot_table(values='交易额', index='姓名', columns='日期', aggfunc='count', margins=True)
# number of shifts per person at each counter
df.pivot_table(values='交易额', index='姓名', columns='柜台', aggfunc='count', margins=True)
# number of shifts per person per day, via crosstab
pd.crosstab(df.姓名, df.日期, margins=True).iloc[:, :5]
# total number of shifts per person at each counter
pd.crosstab(df.姓名, df.柜台, margins=True)
# each person's total transaction amount at each counter
pd.crosstab(df.姓名, df.柜台, df.交易额, aggfunc='sum')
# each person's mean transaction amount at each counter, rounded to 2 decimals
pd.crosstab(df.姓名, df.柜台, df.交易额, aggfunc='mean').apply(lambda num: round(num, 2))
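For aggregations like the sums above, crosstab and pivot_table are two routes to the same table (when values and aggfunc are given, crosstab delegates to pivot_table internally). A sketch on invented records:

```python
import pandas as pd

df = pd.DataFrame({'姓名': ['甲', '甲', '乙'],
                   '柜台': ['A', 'B', 'A'],
                   '交易额': [100, 200, 300]})

ct = pd.crosstab(df.姓名, df.柜台, df.交易额, aggfunc='sum')
pt = df.pivot_table(values='交易额', index='姓名', columns='柜台',
                    aggfunc='sum')
# both give the same person-by-counter totals
print(ct.fillna(0).equals(pt.fillna(0)))  # True
```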
Viewing employee performance by time period using resampling
A resampling rule of '7D' means one sample every 7 days. label='left' labels each bucket with the start time of its sampling interval in the resulting index; label='right' labels it with the end time. The on parameter specifies which column to resample on; that column must hold datetime values.
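The effect of label on the resulting index is easiest to see with synthetic daily data (the dates and amounts below are invented): ten days of sales at 100 each fall into one full 7-day bucket and one partial 3-day bucket.

```python
import pandas as pd

df = pd.DataFrame({
    '日期': pd.date_range('2023-03-01', periods=10, freq='D'),
    '交易额': [100] * 10,
})

# label='left' (the default): buckets are labelled by their start date
left = df.resample('7D', on='日期')['交易额'].sum()
print(left.index[0].date(), left.tolist())    # 2023-03-01 [700, 300]
# label='right': the same buckets, labelled by their end date
right = df.resample('7D', on='日期', label='right')['交易额'].sum()
print(right.index[0].date(), right.tolist())  # 2023-03-08 [700, 300]
```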
df.日期 = pd.to_datetime(df.日期)
# total transaction amount per 7 days
df.resample('7D', on='日期').sum()['交易额']
# total per 7 days, labelled by the interval end
df.resample('7D', on='日期', label='right').sum()['交易额']
# mean transaction amount per 7 days, rounded to 2 decimals
# (select the column first so non-numeric columns are not averaged)
func = lambda num: round(num, 2)
df.resample('7D', on='日期', label='right')['交易额'].mean().map(func)
# mean per 7 days, computed manually with numpy
import numpy as np
func = lambda item: round(np.sum(item) / len(item), 2)
df.resample('7D', on='日期', label='right')['交易额'].apply(func)