Machine Learning Data Preprocessing--Table Merging and Data Visualization

Data cleaning: merging tables and adding a timestamp

Extract the file names

Read the names of files of a specified type
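A minimal sketch of this step, assuming the goal is to list only the files with a given extension in a folder. The helper name `list_files_with_ext` and its parameters are hypothetical and do not come from the original post.

```python
import os


def list_files_with_ext(folder, ext=".csv"):
    """Return the sorted names of files in `folder` whose extension equals `ext`."""
    matches = []
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        # splitext splits "a_b.csv" into ("a_b", ".csv")
        if os.path.isfile(full_path) and os.path.splitext(name)[1].lower() == ext:
            matches.append(name)
    return sorted(matches)
```

Calling `list_files_with_ext(os.getcwd())` would list the CSV files in the current folder. The comparison is case-insensitive on the extension, which is a design choice of this sketch.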

Split the file name

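import os

path = os.getcwd()  # folder that contains the data files
names = os.listdir(path)
for name in names:
    name = os.path.splitext(name)[0]  # drop the extension
    print(name)
    flag = name.split('_')  # split the name into its underscore-separated fields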
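To make the result of the split concrete, here is a small sketch. It assumes file names follow a `prefix_series_time.csv` pattern, which is inferred from how the complete program later reads `flag[1]` and `flag[2]`. The sample name and the helper `split_file_name` are made up for illustration.

```python
import os


def split_file_name(name):
    """Split 'prefix_series_time.csv' into its underscore-separated fields."""
    stem, _ext = os.path.splitext(name)  # drop the extension
    return stem.split("_")


# Hypothetical file name used only for illustration
fields = split_file_name("sensor_A01_20200101120000.csv")
print(fields)  # ['sensor', 'A01', '20200101120000']
```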

[Figure: the original table]
The original table is shown in the figure above; it has no header row.

Add columns to the table and write the specified information into the columns
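A short sketch of this step, assuming the table already has column names and that the same value should be written into every row of the new column. The sample numbers and the values `'20200101120000'` and `'A01'` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for one small table
df = pd.DataFrame(
    [[20.1, 20.0, 21.3, 19.2], [20.4, 20.1, 21.5, 19.0]],
    columns=["temp", "tempavg", "tempmax", "tempmin"],
)

# Assigning a scalar broadcasts it to every row of the new column
df["time"] = "20200101120000"
df["series"] = "A01"
print(df)
```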

Define the header
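One way to define the header is to pass the column names while reading, which is also what the complete program does. The sketch below reads from an in-memory string instead of a real file so that it runs on its own; the two data rows are made up.

```python
import io

import pandas as pd

# Two headerless rows, standing in for the contents of a raw CSV file
raw = "20.1,20.0,21.3,19.2\n20.4,20.1,21.5,19.0\n"

# header=None tells pandas that the first line is data, not column names;
# names= supplies the column names to use instead
table = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["temp", "tempavg", "tempmax", "tempmin"],
)
print(table)
```

Without `header=None`, pandas would treat the first data row as the header and that row would be lost from the data.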

Merge the tables
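The complete program merges by appending each table to one CSV file. As an alternative sketch (not the method used in this post), tables that fit in memory can be merged with `pd.concat`. The two small tables here are hypothetical.

```python
import pandas as pd

# Two hypothetical tables with the same columns
df_a = pd.DataFrame({"temp": [20.1], "series": ["A01"]})
df_b = pd.DataFrame({"temp": [22.7], "series": ["B02"]})

# ignore_index=True renumbers the rows 0..n-1 in the merged table
merged = pd.concat([df_a, df_b], ignore_index=True)
print(merged)
```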

Complete program:
import os
import pandas as pd


path = os.getcwd()
names = os.listdir(path)
for name in names:
    index = name.rfind('.')
    csv = name[index:]
    if csv == '.csv':  # the folder contains other files; skip them, otherwise read_csv raises an error
        # the raw files have no header row, so supply the column names while reading
        df = pd.read_csv(name, header=None, names=['temp', 'tempavg', 'tempmax', 'tempmin'])
#        df.columns = ['temp', 'tempavg', 'tempmax', 'tempmin']  # alternative way to add the header; without one, adding columns in the next step is awkward
        name_new = name[:index]
        flag = name_new.split('_')
        print(flag)
        time = flag[2]
        series = flag[1]
        df['time'] = time
        df['series'] = series
        df.to_csv(name, index=False)  # save the changes; do not write the automatic index

for name in names:
    index = name.rfind('.')
    csv = name[index:]
    if csv == '.csv':  # the folder contains other files; skip them, otherwise read_csv raises an error
        print(csv)
        df = pd.read_csv(name)
        # append each table to the merged file without its header
        df.to_csv('allok.csv', encoding="utf_8_sig", header=False, index=False, mode='a+')

# name the columns of the merged table
df = pd.read_csv('allok.csv', header=None, names=['temp', 'tempavg', 'tempmax', 'tempmin', 'time', 'series'])
df.to_csv('allok.csv', index=True)

Index problem

The index that pandas adds by default starts at 0, not 1

import numpy as np
df.index = np.arange(1, len(df) + 1)  # arange excludes the stop value, so use len(df) + 1
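A self-contained sketch of renumbering the index from 1, using a made-up three-row table. Note that `np.arange(start, stop)` excludes `stop`, so the new index needs `len(df) + 1` as the stop value to provide exactly one label per row.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table
df = pd.DataFrame({"temp": [20.1, 20.4, 19.8]})

# The new index needs exactly len(df) labels: 1, 2, ..., len(df)
df.index = np.arange(1, len(df) + 1)
print(df.index.tolist())  # [1, 2, 3]
```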

Long numbers in the exported data are converted to scientific notation

Because the exported numbers were too long, they were turned into scientific notation, which led to rounding when the tables were merged later.
Appending a tab character to the value, so that it is kept as text, solved the problem:

df['time'] = str(time)+'\t'
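One consequence of this workaround is that the tab stays in the data. The sketch below, which uses an in-memory buffer and a made-up timestamp, shows one way to read the column back as text and strip the tab again. The `dtype` and `str.strip` steps are my addition, not part of the original workflow.

```python
import io

import pandas as pd

df = pd.DataFrame({"temp": [20.1, 20.4]})
# Hypothetical long timestamp; the trailing tab keeps the cell as text
df["time"] = str(20200101120000123) + "\t"

buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Read the column back as text and strip the tab to recover the digits
restored = pd.read_csv(buffer, dtype={"time": str})
restored["time"] = restored["time"].str.strip()
print(restored["time"].tolist())
```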

Advanced table merging (expanding feature rows into columns, then merging)

Original table layout (tens of thousands of such tables need to be merged):
[Figure: original table layout]

After merging, the result is a single summary table with separate columns for the average, maximum, and minimum of each parameter

import os
import pandas as pd

path = os.getcwd()
names = os.listdir(path)
i = 0
for name in names:
    index = name.rfind('.')
    csv = name[index:]
    if csv == '.csv':  # the folder contains other files; skip them, otherwise read_csv raises an error
        df = pd.read_csv(name)
        for _, row in df.iterrows():
            feature_name = row.iloc[0]  # positional access; row[0] is deprecated for labelled rows
            feature_avg = feature_name + '_avg'
            feature_min = feature_name + '_min'
            feature_max = feature_name + '_max'
            df[feature_avg] = str(row.iloc[1]) + '\t'
            df[feature_max] = str(row.iloc[2]) + '\t'
            df[feature_min] = str(row.iloc[3]) + '\t'  # the tab prevents scientific notation
        data = df.iloc[:1, 4:]  # first row of the newly added columns (the first four columns are the original ones)
        # write the header only for the first table
        data.to_csv('gather_operate.csv', encoding="utf_8_sig", header=(i == 0), index=False, mode='a+')
        i = i + 1
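The core of this step, turning several feature rows into one wide row, can be sketched without any file handling. This assumes each source table has four columns in the order name, average, maximum, minimum, which is inferred from the loop above. The column names, sample values, and the helper `flatten_features` are hypothetical, and the tab suffix is left out for brevity.

```python
import pandas as pd


def flatten_features(df):
    """Turn rows of (name, avg, max, min) into a single wide row."""
    wide = {}
    for row in df.itertuples(index=False):
        name = row[0]
        wide[name + "_avg"] = row[1]
        wide[name + "_max"] = row[2]
        wide[name + "_min"] = row[3]
    return pd.DataFrame([wide])


# Hypothetical table with two features
table = pd.DataFrame(
    [["temp", 20.0, 21.3, 19.2], ["pressure", 101.2, 101.9, 100.8]],
    columns=["feature", "avg", "max", "min"],
)
wide_row = flatten_features(table)
print(wide_row.columns.tolist())
```

Building the wide row in a dictionary avoids adding columns to the original table and then slicing them off again, which may be easier to follow when the number of original columns changes.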

Write the file names to a spreadsheet and save it in the parent directory

import os
import xlwt

path = os.getcwd()
dirs = os.listdir(path)

write = xlwt.Workbook()
sheet = write.add_sheet('sheet_name')
i = 0  # next row to write; also the number of CSV files found

for file in dirs:
    if os.path.splitext(file)[1] == '.csv':
        sheet.write(i, 0, file)  # row i, column 0
        i += 1
print(i)
write.save('../file_name.xls')  # '..' is the parent directory
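If the legacy `.xls` format is not required, the same list can be written as a CSV file with the standard library alone. This is an alternative sketch using the `csv` module instead of `xlwt`, not the approach used above. The helper `write_file_names` and its parameters are hypothetical.

```python
import csv
import os


def write_file_names(folder, out_path, ext=".csv"):
    """Write one matching file name per row to `out_path`; return the count."""
    names = sorted(
        n for n in os.listdir(folder) if os.path.splitext(n)[1] == ext
    )
    # utf_8_sig lets Excel recognise non-ASCII file names correctly
    with open(out_path, "w", newline="", encoding="utf_8_sig") as f:
        writer = csv.writer(f)
        for name in names:
            writer.writerow([name])
    return len(names)
```

For example, `write_file_names(os.getcwd(), '../file_name.csv')` would save the list in the parent directory, as in the snippet above.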

Chinese characters are garbled when pandas writes a CSV file

df.to_csv("cnn_predict_result.csv",encoding="utf_8_sig")
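What `utf_8_sig` changes can be checked directly: it prepends the three-byte UTF-8 byte order mark, which Excel uses to recognise the file as UTF-8. The sketch below writes a made-up table to a temporary folder and inspects the first bytes of the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical table with Chinese text
df = pd.DataFrame({"结果": ["正常", "异常"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "cnn_predict_result.csv")
    df.to_csv(out_path, index=False, encoding="utf_8_sig")
    with open(out_path, "rb") as f:
        first_bytes = f.read(3)

# The file starts with the UTF-8 byte order mark EF BB BF
print(first_bytes == b"\xef\xbb\xbf")
```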

Keep the table header when merging tables in batches

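# os, pd and names are defined as in the earlier snippets
i = 0
for name in names:
    index = name.rfind('.')
    csv = name[index:]
    if csv == '.csv':  # the folder contains other files; skip them, otherwise read_csv raises an error
        if i == 0:
            print("header")
            df = pd.read_csv(name)
            # keep the header when appending the first table; the output file name means "feature value data summary"
            df.to_csv('特征值数据汇总.csv', encoding="utf_8_sig", header=True, index=False, mode='a+')
        else:
            print(csv)
            df = pd.read_csv(name)
            df.to_csv('特征值数据汇总.csv', encoding="utf_8_sig", header=False, index=False, mode='a+')
        i = i + 1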
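The counter can be avoided by checking whether the output file already exists. This is a sketch of that idea, not the code used in this post; the helper `append_csv` is hypothetical.

```python
import os

import pandas as pd


def append_csv(df, out_path):
    """Append `df` to `out_path`, writing the header only if the file is new."""
    write_header = not os.path.exists(out_path)
    df.to_csv(
        out_path,
        mode="a",
        header=write_header,
        index=False,
        encoding="utf_8_sig",
    )
```

With this approach an output file left over from an earlier run has to be deleted first, otherwise new rows are appended to the old ones.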

Long numeric strings in pandas output are shown in scientific notation

The method suggested online, changing the cell format directly, does not help: after the file is closed and reopened, the values are shown in scientific notation again.
Later, I read an article that suggested this:
df['time']=[' %i' % i for i in df['time']]
Select the column to be modified and add a character to each value, such as the leading space in the line above or \t. My understanding is that the added character makes the value be treated as text.
Use Excel to sort