[Python and data analysis experiment report] Basic application of Pandas data analysis


Task content

The data file data.csv records users' electricity consumption. It covers 200 users, numbered 1 to 200 (column CONS_NO in the code below). DATA_DATE is the date (for example, 2015/1/1 means January 1, 2015) and KWH is the electricity consumption. Use this data to complete the following tasks:

(1) Transpose the data into the layout of eg.csv (one row per user, one column per date), and fill missing values with NaN.
(2) Identify outliers in the data and replace them with NA.
(3) Compute the basic statistics of each user's electricity consumption: maximum, minimum, mean, median, sum, variance, skewness, and kurtosis (null values excluded).
(4) Take the day-to-day difference of each user's electricity consumption and compute the basic statistics of the differences, using the same statistics as in task (3).
(5) Compute the 5% quantile of each user's electricity consumption.
(6) Sum each user's electricity consumption by week (7 days per week), take the difference between consecutive weeks, and compute the basic statistics of the differences, using the same statistics as in task (3).
(7) For each user, find the maximum electricity consumption over the period and count the number of days on which consumption exceeds 0.9 × that maximum.
(8) Find the month in which each user's maximum and minimum consumption occur. If the maximum (or minimum) occurs in several months, output the month in which it occurs most often. Consumption is counted per day. For example, suppose the minimum for user 1 is 0 (perhaps nobody was home that day); several days in each of the 12 months may have a consumption of 0, and the month with the most such days is the one to output. The maximum is handled the same way.
(9) Treat each user's consumption in July and August as one batch of samples and the consumption in March and April as another, and compute the ratios between the two batches of the sum, the mean, the maximum, and the minimum.
(10) Combine all the features computed above into one table and display it.

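(1) Transpose the data into the layout of eg.csv, and fill missing values with NaN.

import pandas as pd
import numpy as np
data = pd.read_csv('data.csv', parse_dates=[1])
null_value = data.isna().sum()  # count the missing values in each column
print("Missing values in data:\n", null_value)
data = data.fillna(value=np.nan)  # np.nan is the portable spelling (the np.NAN alias was removed in NumPy 2.0)
# one row per user (CONS_NO), one column per date; absent (user, date) pairs become NaN
result = pd.pivot_table(data, index='CONS_NO', columns='DATA_DATE')
result.to_csv('eg.csv')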
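
To make the reshaping step concrete, here is a minimal sketch on a made-up three-row table. The column names follow data.csv; the values are hypothetical and only illustrate how an absent (user, date) pair turns into NaN after the pivot.

```python
import pandas as pd

# Hypothetical long-format sample using the same column names as data.csv
toy = pd.DataFrame({
    'CONS_NO': [1, 1, 2],
    'DATA_DATE': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-01']),
    'KWH': [1.5, 2.0, 3.0],
})

# One row per user, one column per date; user 2 has no reading on 2015-01-02
toy_wide = toy.pivot_table(index='CONS_NO', columns='DATA_DATE', values='KWH')
print(toy_wide)
```

Passing values='KWH' keeps the column labels flat (dates only), whereas the call above without values yields a two-level column index with KWH on top.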


(2) Identify outliers in the data and replace them with NA.

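u = data['KWH'].mean()  # mean
si = data['KWH'].std()  # standard deviation
# flag values outside mean ± 3σ (vectorized comparison instead of a row-wise apply)
three_si = (data['KWH'] > u + 3 * si) | (data['KWH'] < u - 3 * si)
# print(three_si)
result1 = data.loc[three_si, 'KWH']  # outliers identified by the 3σ rule
print("Outliers in data under the 3σ rule (first 10):\n", result1.head(10))
data.loc[three_si, 'KWH'] = np.nan  # in a float column NA is stored as NaN
# note: `result` was pivoted before this step, so it still holds the original values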
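
As a small worked example of the 3σ rule, the sketch below applies the same test to a made-up series of 20 readings (hypothetical values) and blanks the flagged entry with Series.mask. It is only an illustration of the technique, not a replacement for the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: 19 ordinary values and one extreme value
kwh = pd.Series([1.0] * 19 + [100.0])

mean, std = kwh.mean(), kwh.std()
# True where a value lies more than 3 standard deviations from the mean
is_outlier = (kwh - mean).abs() > 3 * std

# Series.mask replaces the flagged entries with NaN and leaves the rest untouched
cleaned = kwh.mask(is_outlier, np.nan)
print(is_outlier.sum(), cleaned.isna().sum())
```

The toy series is deliberately not tiny: with only a handful of points, no single value can lie more than 3 sample standard deviations from the mean, so the rule would never fire.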


(3) Calculate the basic statistics of each user's electricity consumption data, including: maximum value, minimum value, mean value, median, sum, variance, skewness, and kurtosis. (excluding null values)

def statistics(df):  # compute each statistic column by column and combine them into one table
    statistical_table = pd.concat([df.max(), df.min(), df.mean(), df.median(), df.sum(), df.var(), df.skew(), df.kurt()], axis=1)
    statistical_table.columns = ['max', 'min', 'mean', 'median', 'sum', 'var', 'skew', 'kurt']
    return statistical_table
# display settings so that wide tables print aligned
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)
pd.set_option('display.width', 180)
print("Basic statistics of each user's electricity consumption:\n", statistics(result.T))
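
The same eight statistics can also be computed straight from the long table, without pivoting first. The following is a sketch on hypothetical values; describe_user and user_stats are illustrative names, not part of the original code. Pandas reductions skip NaN by default, which matches the "null values excluded" requirement.

```python
import pandas as pd

# Hypothetical long-format sample; user 2 has one missing reading
toy = pd.DataFrame({
    'CONS_NO': [1, 1, 1, 1, 2, 2, 2, 2],
    'KWH': [1.0, 2.0, 3.0, 10.0, 4.0, 4.0, None, 8.0],
})

def describe_user(kwh):
    """Eight summary statistics of one user's readings (NaN is skipped)."""
    return pd.Series({
        'max': kwh.max(), 'min': kwh.min(), 'mean': kwh.mean(), 'median': kwh.median(),
        'sum': kwh.sum(), 'var': kwh.var(), 'skew': kwh.skew(), 'kurt': kwh.kurt(),
    })

# Iterating a grouped column yields (user, readings) pairs; transpose to get one row per user
user_stats = pd.DataFrame(
    {user: describe_user(kwh) for user, kwh in toy.groupby('CONS_NO')['KWH']}
).T
print(user_stats)
```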


(4) Take the day-to-day difference of each user's electricity consumption and compute the basic statistics of the differences, using the same statistics as in task (3).

difference = result.diff(axis=1)  # difference between consecutive days, per user
print("Basic statistics of each user's daily differences:\n", statistics(difference.T))
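
For readers unfamiliar with diff(axis=1), here is a minimal sketch on a hypothetical wide table with two users and three days. Each entry becomes the value minus the previous day's value, so the first day has nothing to subtract and is NaN.

```python
import pandas as pd

# Hypothetical wide table: one row per user, one column per day
wide = pd.DataFrame(
    [[1.0, 3.0, 6.0], [5.0, 5.0, 2.0]],
    index=pd.Index([1, 2], name='CONS_NO'),
    columns=pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03']),
)

# axis=1 differences along the columns, i.e. day over day within each user
daily_diff = wide.diff(axis=1)
print(daily_diff)
```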


(5) Calculate the 5% quantile of each user's electricity consumption data.

print("5% quantile of each user:\n", result.quantile(0.05, axis=1))  # DataFrame.quantile with axis=1 computes one quantile per row
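
A quantile usually falls between two observations, and pandas interpolates linearly by default. The sketch below uses two hypothetical series to show where the 5% point lands; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical user with 21 daily readings: 0, 1, ..., 20
s = pd.Series(range(21), dtype=float)

# Position 0.05 * (21 - 1) = 1, which is exactly the second-smallest value
q05 = s.quantile(0.05)

# With 5 readings the position is 0.05 * 4 = 0.2, a fifth of the way from 10 to 20
q05_small = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0]).quantile(0.05)
print(q05, q05_small)
```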


(6) Sum each user's electricity consumption by week (7 days per week), take the difference between consecutive weeks, and compute the basic statistics of the differences, using the same statistics as in task (3).

w_index = pd.PeriodIndex(data['DATA_DATE'], freq='W')  # map each date to its calendar week
sum_week = data.groupby(by=['CONS_NO', w_index])[['KWH']].sum()  # sum KWH only; the date column cannot be summed
week_table = pd.pivot_table(sum_week, index='DATA_DATE', columns='CONS_NO')
diff_week = week_table.diff(1)  # difference between consecutive weeks
print("Weekly sums of each user, differenced:\n", diff_week)
print("Basic statistics of each user's weekly differences:\n", statistics(diff_week))
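
The weekly step can be checked by hand on a tiny example. The sketch below builds two full weeks of hypothetical readings for one user, starting on a Monday, and uses Series.dt.to_period('W') as an equivalent way of labelling each date with its week.

```python
import pandas as pd

# Two full calendar weeks for one hypothetical user; 2015-01-05 is a Monday
dates = pd.date_range('2015-01-05', periods=14, freq='D')
toy = pd.DataFrame({
    'CONS_NO': [1] * 14,
    'DATA_DATE': dates,
    'KWH': [1.0] * 7 + [2.0] * 7,
})

# Label every date with its calendar week, then sum per user and week
week = toy['DATA_DATE'].dt.to_period('W')
weekly_sum = toy.groupby(['CONS_NO', week])['KWH'].sum().unstack('CONS_NO')

# Week-over-week change; the first week has no predecessor and becomes NaN
weekly_diff = weekly_sum.diff()
print(weekly_diff)
```

Under the default 'W' frequency a week runs from Monday to Sunday, so the first and last week of a real data set may cover fewer than 7 days.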


(7) For each user, find the maximum electricity consumption over the period and count the number of days on which consumption exceeds 0.9 × that maximum.

max_d = result.apply(lambda x:x>x.max()*0.9,axis=1).sum(axis=1)  # per row: compare with 0.9 × the row maximum, then count the True values
print("Number of days on which each user's consumption exceeds 0.9 × max:\n", max_d)
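
The row-wise apply can also be written as a single vectorized comparison. This is a sketch on hypothetical values, with threshold and days_above as illustrative names; note that NaN never counts as exceeding the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per user, one column per day
wide = pd.DataFrame(
    [[10.0, 9.5, 8.0, 2.0], [4.0, np.nan, 1.0, 3.0]],
    index=pd.Index([1, 2], name='CONS_NO'),
)

# One threshold per user: 0.9 times that user's maximum
threshold = 0.9 * wide.max(axis=1)

# Compare every row with its own threshold (axis=0 aligns on the row index), then count
days_above = wide.gt(threshold, axis=0).sum(axis=1)
print(days_above)
```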


(8) Find the month in which each user's maximum and minimum consumption occur. If the maximum (or minimum) occurs in several months, output the month in which it occurs most often. Consumption is counted per day. For example, suppose the minimum for user 1 is 0 (perhaps nobody was home that day); several days in each of the 12 months may have a consumption of 0, and the month with the most such days is the one to output. The maximum is handled the same way.

Maximum value:

data['MAX'] = data.groupby('CONS_NO')['KWH'].transform('max')  # each user's maximum, repeated on every row
data['flag'] = data['KWH'] == data['MAX']  # mark the records that equal the user's maximum
surface_max = data[data['flag']]  # keep only the records at the maximum
# print(surface_max)  # the selected records
key = pd.PeriodIndex(surface_max['DATA_DATE'], freq='M')
max_count = surface_max.groupby(by=['CONS_NO', key])['KWH'].count()  # group by month and count the maximum-value days
max_count_df = pd.DataFrame(max_count)
max_count_df_index = max_count_df.reset_index().groupby('CONS_NO')['KWH'].idxmax()  # row position of the month with the highest count
max_result = max_count_df.iloc[max_count_df_index]
max_result.columns = ['KWH_max_count']
# print(max_result)  # number of maximum-value days

key = pd.PeriodIndex(data['DATA_DATE'], freq='M')
month = data.groupby(by=['CONS_NO', key])['KWH'].max()  # monthly maximum per user
month_df = pd.DataFrame(month)
max_index = month_df.reset_index().groupby('CONS_NO')['KWH'].idxmax()  # row position of each user's overall maximum
max_value = month_df.iloc[max_index]
# print(max_value)  # the maximum values

max_result = max_result.copy()
max_result.loc[:, 'KWH_max'] = max_value['KWH'].values
print(max_result)

Minimum value:

data['MIN'] = data.groupby('CONS_NO')['KWH'].transform('min')  # each user's minimum, repeated on every row
data['flag'] = data['KWH'] == data['MIN']  # mark the records that equal the user's minimum
surface_min = data[data['flag']]  # keep only the records at the minimum
# print(surface_min)  # the selected records

key = pd.PeriodIndex(surface_min['DATA_DATE'], freq='M')
min_count = surface_min.groupby(by=['CONS_NO', key])['KWH'].count()  # group by month and count the minimum-value days
min_count_df = pd.DataFrame(min_count)
min_count_df_index = min_count_df.reset_index().groupby('CONS_NO')['KWH'].idxmax()  # row position of the month with the highest count
min_result = min_count_df.iloc[min_count_df_index]
min_result.columns = ['KWH_min_count']
# print(min_result)  # number of minimum-value days

key = pd.PeriodIndex(data['DATA_DATE'], freq='M')
month = data.groupby(by=['CONS_NO', key])['KWH'].min()  # monthly minimum per user
month_df = pd.DataFrame(month)
min_index = month_df.reset_index().groupby('CONS_NO')['KWH'].idxmin()  # row position of each user's overall minimum
min_value = month_df.iloc[min_index]
# print(min_value)  # the minimum values

min_result = min_result.copy()
min_result.loc[:, 'KWH_min'] = min_value['KWH'].values
print(min_result)  # the original printed `result` here, which is the pivot table from task (1)
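
The maximum and minimum blocks above differ only in the reduction used, so the logic can be folded into one helper. The following is a sketch under that assumption; month_of_extreme is a hypothetical name and the sample values are made up.

```python
import pandas as pd

# Hypothetical readings for one user: the minimum 0.0 occurs once in January
# and twice in February; the maximum 5.0 occurs twice in March
toy = pd.DataFrame({
    'CONS_NO': [1] * 5,
    'DATA_DATE': pd.to_datetime(
        ['2015-01-10', '2015-02-03', '2015-02-17', '2015-03-01', '2015-03-02']),
    'KWH': [0.0, 0.0, 0.0, 5.0, 5.0],
})

def month_of_extreme(df, how):
    """Per user, the month with the most days equal to the user's 'max' or 'min'."""
    extreme = df.groupby('CONS_NO')['KWH'].transform(how)
    hits = df[df['KWH'] == extreme]                      # days at the extreme value
    month = hits['DATA_DATE'].dt.to_period('M')
    counts = hits.groupby(['CONS_NO', month]).size().rename('days').reset_index()
    best = counts.loc[counts.groupby('CONS_NO')['days'].idxmax()]
    return best.set_index('CONS_NO')['DATA_DATE']

min_month = month_of_extreme(toy, 'min')
max_month = month_of_extreme(toy, 'max')
print(min_month, max_month, sep='\n')
```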


(9) Treat each user's consumption in July and August as one batch of samples and the consumption in March and April as another, and compute the ratios between the two batches of the sum, the mean, the maximum, and the minimum.

data = pd.read_csv('data.csv', parse_dates=[1])  # reload the data (the earlier `data` has been modified)
def date_filter(df):  # select the two batches of months; returns two tables
    idx = pd.IndexSlice
    mon78 = df.loc[idx[:,['2015-7','2015-8', '2016-7', '2016-8']],:]
    mon34 = df.loc[idx[:,['2015-3','2015-4', '2016-3', '2016-4']],:]
    return mon78, mon34

def date_merge(df_1, df_2, name):  # join the two batches per user and compute their ratio
    df_ratio = pd.merge(df_1, df_2, on='CONS_NO')
    df_ratio.columns = ['Jul-Aug', 'Mar-Apr']
    df_ratio[name] = df_ratio['Jul-Aug'] / df_ratio['Mar-Apr']
    return df_ratio

def analysis(df):  # statistics per user and month
    kwh = df['KWH']  # aggregate KWH only; the date column cannot be summed
    a_table = pd.concat([kwh.max(), kwh.min(), kwh.mean(), kwh.sum()], axis=1)
    a_table.columns = ['max', 'min', 'mean', 'sum']
    return a_table

def analysis2(df):  # statistics of the selected months, per user
    # note: the mean is the mean of the monthly means, which only approximates the mean over all days
    a2_table = pd.concat([df['max'].max(), df['min'].min(), df['mean'].mean(), df['sum'].sum()], axis=1)
    return a2_table

m_index = pd.PeriodIndex(data['DATA_DATE'], freq='M')
m_all = data.groupby(['CONS_NO', m_index])  # group by user and month
t = analysis(m_all)
mon78, mon34 = date_filter(t)
mon_78 = analysis2(mon78.groupby('CONS_NO'))
mon_34 = analysis2(mon34.groupby('CONS_NO'))
col = ['max', 'min', 'mean', 'sum']
names = ['max_ratio', 'min_ratio', 'mean_ratio', 'sum_ratio']
summary_table = pd.DataFrame()
for i in range(4):
    print("Ratio of each user's July-August %s to March-April %s:" % (col[i], col[i]))
    ratio_table = date_merge(mon_78[col[i]], mon_34[col[i]], names[i])
    summary_table[names[i]] = ratio_table[names[i]]
    print(ratio_table, '\n')
print('Ratio summary table:\n', summary_table)
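
A shorter route to the same four ratios is to filter the daily rows by calendar month and aggregate once per batch. The sketch below assumes that all years should be pooled, which differs from the explicit 2015/2016 month list above; batch_stats is a hypothetical helper and the values are made up.

```python
import pandas as pd

# Hypothetical readings for one user, one day in each relevant month
toy = pd.DataFrame({
    'CONS_NO': [1, 1, 1, 1],
    'DATA_DATE': pd.to_datetime(['2015-03-01', '2015-04-01', '2015-07-01', '2015-08-01']),
    'KWH': [2.0, 4.0, 6.0, 10.0],
})

def batch_stats(df, months):
    """Per-user sum, mean, max and min over all days in the given calendar months."""
    batch = df[df['DATA_DATE'].dt.month.isin(months)]
    return batch.groupby('CONS_NO')['KWH'].agg(['sum', 'mean', 'max', 'min'])

# Element-wise division aligns on user and on statistic name
ratios = batch_stats(toy, [7, 8]) / batch_stats(toy, [3, 4])
print(ratios)
```

Because the mean is taken over all days of the batch, it is the overall mean rather than a mean of monthly means.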

Output tables (screenshots not included): maximum ratio, minimum ratio, mean ratio, sum ratio, and the ratio summary table.

(10) Combine all the features computed above into one table and display it.

def rename_columns(df, add):  # append a suffix to every statistic column
    new_col = list()
    for i in df.columns:
        if i != 'CONS_NO':
            new_col.append(i+add)
        else:
            new_col.append(i)
    df.columns=new_col
    return df

df1 ,df2,df3,df4 ,df5= statistics(result.T),statistics(difference.T),statistics(diff_week),min_result.reset_index(),max_result.reset_index()
df6,df7= pd.DataFrame(result.quantile(0.05,axis=1)),pd.DataFrame(max_d)
adds = ['_raw', '_daily_diff', '_weekly_diff', '_min', '_max']
new = ['5% quantile', 'days with kwh > 0.9 max']
dfs=[df1,df2,df3,df4,df5,df6,df7]  # the tables to merge
all_feature_tables = summary_table
for i in range(len(dfs)):
    if i < 5:
        a = rename_columns(dfs[i],adds[i])
        all_feature_tables = pd.merge(all_feature_tables,a,on='CONS_NO')
    elif i < 7:
        dfs[i].columns=[new[i-5]]
        all_feature_tables = pd.merge(all_feature_tables,dfs[i],on='CONS_NO')

all_feature_tables  # display the combined table
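
When every feature table is indexed by CONS_NO, the combination can also be sketched with add_suffix and a single pd.concat instead of repeated merges. The tables below are hypothetical stand-ins with made-up values, not the real results.

```python
import pandas as pd

# Hypothetical per-user feature tables sharing the same CONS_NO index
idx = pd.Index([1, 2], name='CONS_NO')
raw_stats = pd.DataFrame({'max': [10.0, 4.0], 'min': [0.0, 1.0]}, index=idx)
diff_stats = pd.DataFrame({'max': [3.0, 2.0], 'min': [-5.0, -1.0]}, index=idx)
q05 = pd.Series([0.5, 1.2], index=idx, name='q05')

# add_suffix keeps equally named statistics apart; concat aligns the rows on CONS_NO
features = pd.concat(
    [raw_stats.add_suffix('_raw'), diff_stats.add_suffix('_daily_diff'), q05],
    axis=1,
)
print(features)
```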

Feature Summary Table


Origin blog.csdn.net/dyy7777777/article/details/125200015