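Data Cleaning

Reading the Data

import numpy as np
import pandas as pd

# sep: str, default ','; the field delimiter
# header: row number(s) to use as the column names, default 0; a list of rows gives multi-level column names
# keep_default_na: if True (the default), the built-in NaN markers are used and any na_values are added to them;
#                  if False, na_values replaces the built-in markers

people=pd.read_csv("./people.csv",sep=',',header=0,keep_default_na=True,parse_dates=['date'])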
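
The effect of these read_csv parameters is easier to see on a tiny input. The sketch below is only an illustration: the three rows in sample_csv are made up and merely imitate the shape of people.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for people.csv (not real data).
sample_csv = io.StringIO(
    "people_id,char_1,date,char_38\n"
    "ppl_100,type 2,2021-06-29,36\n"
    "ppl_200,,2021-01-06,76\n"
    "ppl_300,type 1,2022-07-27,99\n"
)

sample_people = pd.read_csv(
    sample_csv,
    sep=",",
    header=0,               # the first row holds the column names
    keep_default_na=True,   # empty fields are read as NaN
    parse_dates=["date"],   # parse this column into datetime64
)

print(sample_people.dtypes)
print(sample_people["char_1"].isna().sum())  # number of missing char_1 values
```

The empty char_1 field in the second row comes back as NaN, and the date column is parsed into a datetime type rather than left as a string.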

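Setting the Index

# set_index: make people_id the index
# drop: default True; the column used as the new index is removed from the columns (with False it is also kept as a regular column)
# append: default False; whether to append the column to the existing index
# inplace: default False; if True, modify the DataFrame in place instead of creating a new object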

people.set_index(keys=['people_id'],drop=True,append=False,inplace=True)
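
As a small sketch of what drop and append do, the toy frame below (made-up values, not the competition data) is indexed three different ways.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2"],
    "activity_id": ["act_1", "act_2"],
    "char_38": [36, 76],
})

# drop=True: people_id moves into the index and disappears from the columns.
dropped = toy.set_index(keys=["people_id"], drop=True)
# drop=False: people_id becomes the index and also stays as a column.
kept = toy.set_index(keys=["people_id"], drop=False)
# append=True: activity_id is added as a second index level.
two_level = dropped.set_index(keys=["activity_id"], append=True)

print(list(dropped.columns))
print(list(kept.columns))
print(list(two_level.index.names))
```

Because inplace is left at its default of False here, toy itself is never modified; each call returns a new DataFrame.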

act_train=pd.read_csv("./act_train.csv",sep=',',header=0,keep_default_na=True,parse_dates=['date'])
act_train.set_index(keys=['people_id'],drop=True,append=False,inplace=True)

act_train.head(10)

act_test=pd.read_csv("./act_test.csv",sep=',',header=0,keep_default_na=True,parse_dates=['date'])
act_test.set_index(keys=['people_id'],drop=True,append=False,inplace=True)

act_test.head(10)


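Merging the Data

Use DataFrame.merge() with people_id as the key to merge the activity data with the people data.

# Merge act_train / act_test with people, using people_id (the index) as the key
# Left join; columns with the same name get the suffixes _act and _people to tell them apart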

train_data=act_train.merge(people,how='left',left_index=True,right_index=True,suffixes=('_act', '_people'))
train_data.head(10)

test_data=act_test.merge(people,how='left',left_index=True,right_index=True,suffixes=('_act', '_people'))
test_data.head(10)
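
A minimal sketch of the same index-based left join on made-up data shows how the suffixes are applied and what happens to an activity whose people_id has no match. The frames toy_people and toy_act are hypothetical.

```python
import pandas as pd

# Made-up frames that share the column names char_1 and date.
toy_people = pd.DataFrame(
    {"char_1": ["type 2", "type 1"], "date": ["2021-06-29", "2021-01-06"]},
    index=pd.Index(["ppl_1", "ppl_2"], name="people_id"),
)
toy_act = pd.DataFrame(
    {
        "activity_id": ["act_1", "act_2", "act_3"],
        "char_1": ["type 5", "type 3", "type 9"],
        "date": ["2022-07-27", "2022-08-01", "2022-08-02"],
    },
    index=pd.Index(["ppl_1", "ppl_1", "ppl_3"], name="people_id"),
)

# Left join on the index: every activity row is kept,
# and overlapping column names receive the _act / _people suffixes.
toy_merged = toy_act.merge(
    toy_people, how="left", left_index=True, right_index=True,
    suffixes=("_act", "_people"),
)
print(toy_merged)
```

The row for ppl_3 survives the join, but its _people columns are NaN because that person is absent from toy_people.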


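Splitting the Data

According to the official description, type 1 activities have a different format from the other activity types,
so the data is split by activity type (activity_category).

Checking the Types and Counts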

train_data.activity_category.value_counts()

type 2    904683
type 5    490710
type 3    429408
type 4    207465
type 1    157615
type 6      4253
type 7      3157
Name: activity_category, dtype: int64

Performing the Split

types=['type %d'%i for i in range(1,8)]
train_datas={}
test_datas={}
for _type in types:
    # Drop the rows and columns that are entirely NaN.
    # dropna() no longer accepts a tuple for axis, so the two axes are handled by two chained calls.
    train_datas[_type]=train_data[train_data.activity_category==_type].dropna(axis=0, how='all').dropna(axis=1, how='all')
    test_datas[_type]=test_data[test_data.activity_category==_type].dropna(axis=0, how='all').dropna(axis=1, how='all')
    print(train_datas[_type].activity_category.unique())
    print(test_datas[_type].activity_category.unique())
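
To see why dropping all-NaN columns matters after the split, here is a sketch on made-up data. It uses groupby() instead of the boolean masks above, which is a substitution for brevity and not the method used in this article.

```python
import numpy as np
import pandas as pd

# Made-up rows: type 1 only fills char_1_act, type 2 only fills char_10_act.
toy = pd.DataFrame({
    "activity_category": ["type 1", "type 2", "type 1", "type 2"],
    "char_1_act": ["type 3", np.nan, "type 5", np.nan],
    "char_10_act": [np.nan, "type 76", np.nan, "type 1"],
    "outcome": [0, 1, 1, 0],
})

toy_splits = {}
for category, subset in toy.groupby("activity_category"):
    # Drop rows, then columns, that contain nothing but NaN.
    toy_splits[category] = subset.dropna(axis=0, how="all").dropna(axis=1, how="all")

for category, subset in toy_splits.items():
    print(category, list(subset.columns))
```

Each subset keeps only the columns that actually carry values for that activity type, which is why the per-type datasets end up with different column sets.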


Inspecting the Split Data

train_datas['type 1'].head(2)

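In each split dataset, activity_category takes a single value,
so the column can be dropped (the split in the previous step is effectively a natural clustering, with activity_category as the cluster id).

Dropping the activity_category Column

# Drop the activity_category column
# Using drop(): it can delete rows or columns
# print(frame.drop(['a']))
# print(frame.drop(['Ohio'], axis=1))
# drop() deletes rows by default; pass axis=1 to delete columns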

types=['type %d'%i for i in range(1,8)]
for _type in types:
    train_datas[_type].drop(['activity_category'],axis=1,inplace=True)
    test_datas[_type].drop(['activity_category'],axis=1,inplace=True)
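
The row/column behaviour of drop() described in the comments can be checked on a toy frame. The frame below is hypothetical and only borrows the labels 'a' and 'Ohio' from those comments.

```python
import pandas as pd

# Toy frame: row labels 'a' and 'c', column labels 'Ohio' and 'Texas'.
frame = pd.DataFrame({"Ohio": [0, 3], "Texas": [1, 4]}, index=["a", "c"])

rows_dropped = frame.drop(["a"])             # default axis=0 removes the row labelled 'a'
cols_dropped = frame.drop(["Ohio"], axis=1)  # axis=1 removes the column 'Ohio'

print(rows_dropped)
print(cols_dropped)
```

Without inplace=True both calls return new frames and leave frame unchanged.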

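Inspecting the Data After Dropping activity_category

Removing Unique Identifiers

First, turn the activity_id column into an index level.

# Use activity_id as an index level
# append=True keeps the existing people_id row index, producing a multi-level row index
# inplace=True modifies the data in place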

types=['type %d'%i for i in range(1,8)]
for _type in types:
    train_datas[_type].set_index(keys=['activity_id'], drop=True, append=True, inplace=True)
    test_datas[_type].set_index(keys=['activity_id'], drop=True, append=True, inplace=True)

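Inspecting the Data

Each row now corresponds to a unique index pair, the tuple (people_id, activity_id),
so the result contains no duplicate index entries.
This can be verified with the code below.

Verifying That the Index Has No Duplicates

# Verify that the index values are unique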

types=['type %d'%i for i in range(1,8)]
for _type in types:
    print(train_datas[_type].index.is_unique,end=',')
    print(test_datas[_type].index.is_unique,end=',' )
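
If is_unique ever printed False, the offending index pairs could be listed with Index.duplicated(). The sketch below demonstrates this on made-up data in which one row is deliberately repeated.

```python
import pandas as pd

# Made-up frame indexed by the pair (people_id, activity_id).
toy = pd.DataFrame(
    {
        "people_id": ["ppl_1", "ppl_1", "ppl_2"],
        "activity_id": ["act_1", "act_2", "act_1"],
        "outcome": [0, 1, 0],
    }
).set_index(["people_id", "activity_id"])
print(toy.index.is_unique)

# Append a copy of the first row to create a duplicate index pair on purpose.
with_dup = pd.concat([toy, toy.iloc[[0]]])
print(with_dup.index.is_unique)

# keep=False marks every member of a duplicated group, not only the later ones.
dup_pairs = with_dup.index[with_dup.index.duplicated(keep=False)].tolist()
print(dup_pairs)
```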

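This means people_id and activity_id do not need to be considered in later training.
As index values they are unique for every sample,
so they carry no information for the prediction.
Attributes like this are usually artificial,
such as the many ids in a database that mostly exist to serve as indexes.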

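Data Type Conversion

Many columns in the current result are not numeric: some are strings that contain text rather than just digits.
For example, a value in the date_act column:

	2022-07-27

These columns therefore need to be processed.

Checking the Data Type of Each Column

# Check the data type of each column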


pd.DataFrame({'train_1':train_datas['type 1'].dtypes,'train_2':train_datas['type 2'].dtypes,
              'train_3':train_datas['type 3'].dtypes,'train_4':train_datas['type 4'].dtypes,
              'train_5':train_datas['type 5'].dtypes,'train_6':train_datas['type 6'].dtypes,
              'train_7':train_datas['type 7'].dtypes,
              'test_1':test_datas['type 1'].dtypes,'test_2':test_datas['type 2'].dtypes,
              'test_3':test_datas['type 3'].dtypes,'test_4':test_datas['type 4'].dtypes,
              'test_5':test_datas['type 5'].dtypes,'test_6':test_datas['type 6'].dtypes,
              'test_7':test_datas['type 7'].dtypes,})
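
The same comparison table can be built without spelling out all fourteen entries. The sketch below is one possible shortening, shown on two hypothetical types with made-up frames; the names toy_train and toy_test are placeholders for train_datas and test_datas.

```python
import pandas as pd

# Made-up stand-ins for train_datas / test_datas with only two types.
toy_train = {
    "type 1": pd.DataFrame({"char_1_act": ["type 3"], "outcome": [0]}),
    "type 2": pd.DataFrame({"char_10_act": ["type 76"], "outcome": [1]}),
}
toy_test = {
    "type 1": pd.DataFrame({"char_1_act": ["type 5"]}),
    "type 2": pd.DataFrame({"char_10_act": ["type 1"]}),
}

# One column per (dataset, type); rows are the union of all column names.
dtype_table = pd.DataFrame({
    "%s_%s" % (prefix, name[-1]): frame.dtypes
    for prefix, frames in (("train", toy_train), ("test", toy_test))
    for name, frame in frames.items()
})
print(dtype_table)
```

A NaN cell in the table means that the column does not exist in that dataset, which is how the missing columns discussed next show up.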

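Here:

type 1 data has no char_10_act;
type 2~7 data has no char_1_act~char_9_act;

the test data has no outcome.


All of these columns need to be converted to np.float64 for later use.

Looking at the columns, the patterns are:

group_1 is a string of the form: group xxx
date_act/date_people are datetime64; each date is converted to the number of days since 1970-01-01 (a float)
char_1_act~char_10_act and char_1_people~char_9_people are strings of the form: type x
char_10_people and char_11~char_37 are boolean; they are converted to 0/1
outcome and char_38 are integers; outcome is the label (0 or 1) and char_38 is a continuous value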
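
The conversion rules listed above can be tried on a two-row toy frame first. This is only a sketch with made-up values; it uses pd.Timestamp and pd.Timedelta for the date arithmetic, which is one of several equivalent ways to write it.

```python
import numpy as np
import pandas as pd

# Made-up rows with one column for each kind of conversion.
toy = pd.DataFrame({
    "group_1": ["group 17304", "group 8688"],
    "date_act": pd.to_datetime(["1970-01-02", "2022-07-27"]),
    "char_1_people": ["type 2", "type 1"],
    "char_11": [True, False],
    "char_38": [36, 76],
})

converted = pd.DataFrame(index=toy.index)
# "group xxx" -> xxx as a float
converted["group_1"] = (
    toy["group_1"].str.replace("group", "", regex=False).str.strip().astype(np.float64)
)
# datetime64 -> days since 1970-01-01 as a float
converted["date_act"] = (toy["date_act"] - pd.Timestamp("1970-01-01")) / pd.Timedelta(days=1)
# "type x" -> x as a float
converted["char_1_people"] = (
    toy["char_1_people"].str.replace("type", "", regex=False).str.strip().astype(np.float64)
)
# booleans -> 1.0 / 0.0, integers -> floats
converted["char_11"] = toy["char_11"].astype(np.float64)
converted["char_38"] = toy["char_38"].astype(np.float64)

print(converted)
print(converted.dtypes)
```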

Data Cleaning

Use the vectorized string methods .str.replace() and .str.strip() of pandas objects; each returns a pandas.Series.
Then use Series.astype() to convert the numeric strings to floats.

# Data cleaning
# .str.replace() and .str.strip() are vectorized string methods that each return a pandas.Series;
# Series.astype() then converts the numeric strings to floats

str_col_list=['group_1']+['char_%d_act'%i for i in range(1,11)]+['char_%d_people'%i for i in range(1,10)]
bool_col_list=['char_10_people']+['char_%d'%i for i in range(11,38)]
types=['type %d'%i for i in range(1,8)]
for _type in types:
    for data_set in [train_datas,test_datas]:
        data_set[_type].date_act= (data_set[_type].date_act- np.datetime64('1970-01-01'))/ np.timedelta64(1, 'D')
        data_set[_type].date_people= (data_set[_type].date_people- np.datetime64('1970-01-01'))/ np.timedelta64(1,'D') 
        data_set[_type].group_1=data_set[_type].group_1.str.replace("group",'').str.strip().astype(np.float64)
        # Boolean columns: True/False become 1.0/0.0
        for col in bool_col_list:
            if col in data_set[_type]:
                data_set[_type][col]=data_set[_type][col].astype(np.float64)
        # Remaining string columns of the form "type x"
        for col in str_col_list[1:]:
            if col in data_set[_type]:
                data_set[_type][col]=data_set[_type][col].str.replace("type",'').str.strip().astype(np.float64)

        # Cast whatever is left (e.g. char_38, outcome) to float64
        data_set[_type]= data_set[_type].astype(np.float64)

Checking the Data Types Again

# Check that every column is now float64

types=['type %d'%i for i in range(1,8)]
for _type in types:
    print((train_datas[_type].dtypes==np.float64).all(),end=',')
    print((test_datas[_type].dtypes==np.float64).all(),end=',')

True,True,True,True,True,True,True,True,True,True,True,True,True,True,

Inspecting the Data


The Data_Cleaner Class

Based on the analysis above,
we write a Data_Cleaner class
that provides a .load_data() method returning the cleaned data.

import numpy as np
import pandas as pd
import pickle
import time
import os
def current_time():
    '''
    Format the current local time as a string

    :return: the current time as a string
    '''
    return time.strftime('%Y-%m-%d %X', time.localtime())
class Data_Cleaner:
    '''
    Data cleaner

    It is initialized with the names of the three files and exposes a single public method: load_data(), which returns the cleaned data.
    If the cleaned data already exists it is returned directly; otherwise a series of cleaning steps is run first.
    '''
    def __init__(self,people_file_name,act_train_file_name,act_test_file_name):
        '''

        :param people_file_name: file path of people.csv
        :param act_train_file_name: file path of act_train.csv
        :param act_test_file_name: file path of act_test.csv
        :return:
        '''
        self.p_fname=people_file_name
        self.train_fname=act_train_file_name
        self.test_fname=act_test_file_name
        self.types=['type %d'%i for i in range(1,8)]
        self.fname='output/cleaned_data'
    def load_data(self):
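        '''
        Load the cleaned data

        If the cleaned data already exists, it is returned directly. Otherwise the csv files are loaded, merged,
        split into type 1 ~ type 7 and converted to float, and the result is saved before being returned.

        :return: a tuple: (self.train_datas, self.test_datas)
        '''
        if(self._is_ready()):
            print("cleaned data is available!\n")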
            self._load_data()
        else:
            self._load_csv()
            self._merge_data()
            self._split_data()
            self._typecast_data()
            self._save_data()
        return self.train_datas,self.test_datas

    def _load_csv(self):
        '''
        Load the csv files

        :return:
        '''
        print("----- Begin run load_csv at %s -------"%current_time())
        self.people=pd.read_csv(self.p_fname,sep=',',header=0,keep_default_na=True,parse_dates=['date'])
        self.act_train=pd.read_csv(self.train_fname,sep=',',header=0,keep_default_na=True,parse_dates=['date'])
        self.act_test=pd.read_csv(self.test_fname,sep=',',header=0,keep_default_na=True,parse_dates=['date'])

        self.people.set_index(keys=['people_id'],drop=True,append=False,inplace=True)
        self.act_train.set_index(keys=['people_id'],drop=True,append=False,inplace=True)
        self.act_test.set_index(keys=['people_id'],drop=True,append=False,inplace=True)

        print("----- End run load_csv at %s -------"%current_time())
    def _merge_data(self):
        '''
        Merge the people data with the activity data

        :return:
        '''
        print("----- Begin run merge_data at %s -------"%current_time())
        self.train_data=self.act_train.merge(self.people,how='left',left_index=True,right_index=True,suffixes=('_act', '_people'))
        self.test_data=self.act_test.merge(self.people,how='left',left_index=True,right_index=True,suffixes=('_act', '_people'))
        print("----- End run merge_data at %s -------"%current_time())
    def _split_data(self):
        '''
        Split the data into type 1 ~ type 7

        :return:
        '''
        print("----- Begin run split_data at %s -------"%current_time())
        self.train_datas={}
        self.test_datas={}
        for _type in self.types:
            # Split by type; drop rows, then columns, that are entirely NaN
            # (dropna() no longer accepts a tuple for axis)
            self.train_datas[_type]=self.train_data[self.train_data.activity_category==_type].dropna(axis=0, how='all').dropna(axis=1, how='all')
            self.test_datas[_type]=self.test_data[self.test_data.activity_category==_type].dropna(axis=0, how='all').dropna(axis=1, how='all')
            # Drop the activity_category column
            self.train_datas[_type].drop(['activity_category'],axis=1,inplace=True)
            self.test_datas[_type].drop(['activity_category'],axis=1,inplace=True)
            # Use the activity_id column as an additional index level
            self.train_datas[_type].set_index(keys=['activity_id'], drop=True, append=True, inplace=True)
            self.test_datas[_type].set_index(keys=['activity_id'], drop=True, append=True, inplace=True)
        print("----- End run split_data at %s -------"%current_time())

    def _typecast_data(self):
        '''
        Convert the data types so that every column becomes a float

        :return:
        '''
        print("----- Begin run typecast_data at %s -------"%current_time())
        str_col_list=['group_1']+['char_%d_act'%i for i in range(1,11)]+['char_%d_people'%i for i in range(1,10)]
        bool_col_list=['char_10_people']+['char_%d'%i for i in range(11,38)]

        for _type in self.types:
            for data_set in [self.train_datas,self.test_datas]:
                # Date columns: days since 1970-01-01
                data_set[_type].date_act= (data_set[_type].date_act- np.datetime64('1970-01-01'))/ np.timedelta64(1, 'D')
                data_set[_type].date_people= (data_set[_type].date_people- np.datetime64('1970-01-01'))/ np.timedelta64(1,'D')
                # group column
                data_set[_type].group_1=data_set[_type].group_1.str.replace("group",'').str.strip().astype(np.float64)
                # Boolean columns
                for col in bool_col_list:
                    if col in data_set[_type]:data_set[_type][col]=data_set[_type][col].astype(np.float64)
                # Remaining string columns
                for col in str_col_list[1:]:
                    if col in data_set[_type]:data_set[_type][col]=data_set[_type][col].str.replace("type",'').str.strip().astype(np.float64)
                # Cast whatever is left (e.g. char_38, outcome) to float64.
                # This has to sit inside the data_set loop so that both train and test are converted.
                data_set[_type]= data_set[_type].astype(np.float64)
        print("----- End run typecast_data at %s -------"%current_time())
    def _is_ready(self):
        return os.path.exists(self.fname)
    def _save_data(self):
        print("----- Begin run save_data at %s -------"%current_time())
        # Create the output directory if needed; open() does not create it
        os.makedirs(os.path.dirname(self.fname) or '.',exist_ok=True)
        with open(self.fname,"wb") as file:
            pickle.dump([self.train_datas,self.test_datas],file=file)
        print("----- End run save_data at %s -------"%current_time())
    def _load_data(self):
        print("----- Begin run _load_data at %s -------"%current_time())
        with open(self.fname,"rb") as file:
            self.train_datas,self.test_datas=pickle.load(file)
        print("----- End run _load_data at %s -------"%current_time())

if __name__=='__main__':
    cleaner=Data_Cleaner("./Data/people.csv",'./Data/act_train.csv','./Data/act_test.csv')
    result=cleaner.load_data()
    for key,item in result[0].items():
        for col in item.columns:
            unique_value=item[col].unique()

            if(len(unique_value)<=100):
                print(col,':len=',len(unique_value),'\t;data=',unique_value)
            else:print(col,':len=',len(unique_value))

        print("\n=======\n")
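
The caching behaviour of load_data() (return the pickled result if it exists, otherwise build and save it) can be isolated into a small helper. The sketch below is a simplified illustration of that pattern, not part of the class above; load_or_build and expensive_build are hypothetical names.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, build_fn):
    """Return the cached object if cache_path exists, otherwise build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = build_fn()
    # Make sure the target directory exists before writing.
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result

calls = []

def expensive_build():
    # Stand-in for the load/merge/split/typecast pipeline.
    calls.append(1)
    return {"type 1": [1.0, 2.0]}, {"type 1": [3.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output", "cleaned_data")
    first = load_or_build(path, expensive_build)   # builds and saves
    second = load_or_build(path, expensive_build)  # served from the pickle file

print(first == second, len(calls))
```

One consequence of this design is that a stale cache file is never refreshed automatically; after changing any cleaning step, the file has to be deleted by hand.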

One-Hot Encoding

# One-hot encoding
# Inspect the set of values taken by each column


lambda_len=lambda x:len(x.unique())
lambda_data=lambda x:str(x.unique()) if(len(x.unique())<=3) else str(x.unique()[:3])+'...'
train_results={}
test_results={}
types=['type %d'%i for i in range(1,8)]
for _type in types:
    train_results[_type[-1]]=pd.DataFrame({'len':train_datas[_type].apply(lambda_len),
                        'data':train_datas[_type].apply(lambda_data)},
                        index=train_datas[_type].columns) 
    test_results[_type[-1]]=pd.DataFrame({'len':test_datas[_type].apply(lambda_len),
                        'data':test_datas[_type].apply(lambda_data)},
                        index=test_datas[_type].columns) 

train_12=train_results['1'].merge(train_results['2'],how='outer',left_index=True,right_index=True,suffixes=('_ta_1', '_ta_2')) 
train_34=train_results['3'].merge(train_results['4'],how='outer',left_index=True,right_index=True,suffixes=('_ta_3', '_ta_4')) 
train_56=train_results['5'].merge(train_results['6'],how='outer',left_index=True,right_index=True,suffixes=('_ta_5', '_ta_6')) 
train_test_77=train_results['7'].merge(test_results['7'],how='outer',left_index=True,right_index=True,suffixes=('_ta_7', '_tt_7')) 
test_12=test_results['1'].merge(test_results['2'],how='outer',left_index=True,right_index=True,suffixes=('_tt_1', '_tt_2')) 
test_34=test_results['3'].merge(test_results['4'],how='outer',left_index=True,right_index=True,suffixes=('_tt_3', '_tt_4')) 
test_56=test_results['5'].merge(test_results['6'],how='outer',left_index=True,right_index=True,suffixes=('_tt_5', '_tt_6')) 

train_12.merge(train_34,how='outer',left_index=True,right_index=True)\
    .merge(train_56,how='outer',left_index=True,right_index=True)  \
    .merge(train_test_77,how='outer',left_index=True,right_index=True)\
    .merge(test_12,how='outer',left_index=True,right_index=True) \
    .merge(test_34,how='outer',left_index=True,right_index=True) \
    .merge(test_56,how='outer',left_index=True,right_index=True)

ta: train
tt: test
suffix 1: type 1 (and likewise for 2~7)
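
For a single dataset, the same "number of distinct values plus a short preview" summary can be sketched more compactly. The frame below is made up, and preview is a hypothetical helper that plays the role of lambda_data above.

```python
import pandas as pd

# Made-up, already-cleaned columns.
toy = pd.DataFrame({
    "char_1_people": [2.0, 1.0, 2.0, 7.0],
    "char_11": [1.0, 0.0, 0.0, 1.0],
    "char_38": [36.0, 76.0, 99.0, 0.0],
})

def preview(column, limit=3):
    """Show at most `limit` distinct values, followed by '...' if there are more."""
    values = column.unique()
    text = str(values[:limit])
    return text if len(values) <= limit else text + "..."

summary = pd.DataFrame({
    "len": toy.nunique(dropna=False),  # same count as len(x.unique())
    "data": toy.apply(preview),
})
print(summary)
```

Columns with only a handful of distinct values are the natural candidates for one-hot encoding, while columns such as char_38 are better treated as continuous.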

Reordering the Columns

# Reorder the columns


from scipy.sparse import hstack,csr_matrix
from sklearn.preprocessing  import OneHotEncoder
def onehot_encode(train_datas,test_datas): 

    train_results={}
    test_results={}
    types=['type %d'%i for i in range(1,8)]
    for _type in types:
        if _type=='type 1':
            one_hot_cols=['char_%d_act'%i for i in range(1,10)]+\
            ['char_%d_people'%i for i in range(1,10)]
            train_end_cols=['group_1','date_act','date_people','char_38','outcome']
            test_end_cols=['group_1','date_act','date_people','char_38']
        else:
            one_hot_cols=['char_%d_people'%i for i in range(1,10)]
            train_end_cols=['group_1','char_10_act','date_act','date_people','char_38','outcome']
            test_end_cols=['group_1','char_10_act','date_act','date_people','char_38']
        
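        train_front_array=train_datas[_type][one_hot_cols].values # front block: columns to one-hot encode
        train_end_array=train_datas[_type][train_end_cols].values # end block: columns scaled later, plus outcome
        train_middle_array=train_datas[_type].drop(train_end_cols+one_hot_cols,axis=1,inplace=False).values # middle block: the remaining 0/1 columns
        
        test_front_array=test_datas[_type][one_hot_cols].values # front block
        test_end_array=test_datas[_type][test_end_cols].values # end block
        test_middle_array=test_datas[_type].drop(test_end_cols+one_hot_cols,axis=1,inplace=False).values # middle block

        # categorical_features and sparse are no longer accepted by recent scikit-learn releases;
        # by default every column is encoded and the result is a sparse matrix (csr_matrix).
        # handle_unknown='ignore' keeps transform() from failing on test values never seen in training.
        encoder=OneHotEncoder(handle_unknown='ignore')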
        train_result=hstack([encoder.fit_transform(train_front_array),csr_matrix(train_middle_array),csr_matrix(train_end_array)])
        test_result=hstack([encoder.transform(test_front_array),csr_matrix(test_middle_array),csr_matrix(test_end_array)])
        train_results[_type]=train_result
        test_results[_type]=test_result
    return train_results,test_results
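
The core of onehot_encode(), fitting the encoder on the training block and stacking the encoded block next to the untouched columns, can be sketched on a few made-up rows. The array names and values here are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical block (two columns) and remaining block.
train_cat = np.array([[1.0, 2.0], [3.0, 2.0], [1.0, 5.0]])
train_rest = np.array([[0.5, 0.0], [1.5, 1.0], [2.5, 1.0]])  # e.g. char_38, outcome
test_cat = np.array([[3.0, 5.0], [1.0, 2.0]])
test_rest = np.array([[0.1], [0.2]])                          # test has no outcome

encoder = OneHotEncoder()  # default output is a sparse matrix
# Fit on train only, then reuse the fitted categories for test.
train_encoded = hstack([encoder.fit_transform(train_cat), csr_matrix(train_rest)]).tocsr()
test_encoded = hstack([encoder.transform(test_cat), csr_matrix(test_rest)]).tocsr()

print(train_encoded.shape, test_encoded.shape)
print(train_encoded.toarray()[0])
```

Each categorical column expands into one indicator column per distinct training value (two each here), so the two categorical columns become four, and the remaining columns are appended unchanged on the right.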

Checking the Number of Features

# Check the number of features

types=['type %d'%i for i in range(1,8)]

print('before encode:\n')
for _type in types:
    print('train(type=%s):shape='%_type,train_datas[_type].shape)
    print('test(type=%s):shape='%_type,test_datas[_type].shape)
print('==============\n\n')    
train_results,test_results=onehot_encode(train_datas,test_datas)
print('after encode:\n')
for _type in types:
    print('train(type=%s):shape='%_type,train_results[_type].shape)
    print('test(type=%s):shape='%_type,test_results[_type].shape)
print('==============\n\n')


Normalization

# Normalization

from sklearn.preprocessing  import MaxAbsScaler
def scale(train_datas,test_datas): 
    train_results={}
    test_results={}
    types=['type %d'%i for i in range(1,8)]
    
    for _type in types:
        if _type=='type 1':
            train_last_index=5 # last 5 columns: group_1/date_act/date_people/char_38/outcome
            test_last_index=4 # last 4 columns: group_1/date_act/date_people/char_38
        else:
            train_last_index=6 # last 6 columns: group_1/char_10_act/date_act/date_people/char_38/outcome
            test_last_index=5 # last 5 columns: group_1/char_10_act/date_act/date_people/char_38
        
        scaler=MaxAbsScaler()
        train_array=train_datas[_type].toarray()        
        train_front=train_array[:,:-train_last_index]
        train_mid=scaler.fit_transform(train_array[:,-train_last_index:-1]) # outcome is not scaled
        train_end=train_array[:,-1].reshape((-1,1)) # outcome
        train_results[_type]=np.hstack((train_front,train_mid,train_end))
        
        test_array=test_datas[_type].toarray()
        test_front=test_array[:,:-test_last_index]
        test_end=scaler.transform(test_array[:,-test_last_index:])
        test_results[_type]=np.hstack((test_front,test_end))

    return train_results,test_results
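
What MaxAbsScaler does to the trailing columns can be verified on a small made-up block. As in scale(), the scaler is fitted on the training columns (without outcome) and then reused for the test columns.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up trailing columns; the last train column plays the role of outcome.
train_block = np.array([[10.0, -4.0, 0.0],
                        [5.0, 2.0, 1.0],
                        [-20.0, 1.0, 1.0]])
test_block = np.array([[40.0, 2.0],
                       [10.0, -8.0]])

scaler = MaxAbsScaler()
# Each column is divided by its largest absolute training value (20 and 4 here).
train_scaled = np.hstack((scaler.fit_transform(train_block[:, :-1]), train_block[:, -1:]))
test_scaled = scaler.transform(test_block)

print(train_scaled)
print(test_scaled)
```

The scaled training columns fall in [-1, 1], but test values can fall outside that range when they exceed the training maximum, as the 40.0 and -8.0 entries do here.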

Checking the Result After Normalization

# Check the result after normalization

ta_results,tt_results=scale(train_results,test_results)
types=['type %d'%i for i in range(1,8)]
for _type in types:
    print("Train(type=%s):"%_type,np.unique(ta_results[_type].max(axis=1)),np.unique(ta_results[_type].min(axis=1)))
    print("Test(type=%s):"%_type,np.unique(tt_results[_type].max(axis=1)),np.unique(tt_results[_type].min(axis=1)))


oifengo
