[pandas] Split cells that hold multiple values into multiple rows (explode), using a CSV file as the data source


1. Raw data (test.csv)

(Screenshot of test.csv, not reproduced here. The code below uses the columns 学号 (student ID), 别名 (alias), and 科目 (subject).)

2. Requirement

In the two columns 别名 (alias) and 科目 (subject), split every cell that holds several values into one row per value, delete the rows that end up empty, and save the result as CSV files.
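To make the requirement concrete, here is a minimal sketch of what explode does on its own. The sample values are made up for illustration, since the real test.csv is not reproduced in this post.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the article.
df = pd.DataFrame({
    "学号": [1, 2],
    "科目": [["math", "physics"], ["art"]],
})

# explode() emits one row per list element and repeats the other columns.
exploded = df.explode("科目")
print(exploded)
```

Note that explode expects each cell to hold a list-like object already, which is why the main code first turns the text of each cell into a list.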

3. Code

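import pandas as pd
import numpy as np  # only needed by the snippets in section 5

# Load the data
data = pd.read_csv('test.csv')
# Split cells that hold several values into one row per value (using explode)
labels = ['别名', '科目']
for label in labels:
    # Work on an explicit copy() so the assignment below cannot touch `data`
    # and does not trigger SettingWithCopyWarning
    df = data[['学号', label]].copy()
    # Strip the square brackets and split on commas; an empty "[]" cell becomes ['']
    # (the original also chained .replace('[]', 'NaN'), which can never match
    # once the brackets have been removed)
    df[label] = df[label].apply(lambda x: x.replace('[', '').replace(']', '').split(','))
    df = df.explode(label)
    # Calling df.dropna(axis=0, how='any') here removes nothing: the empty cells
    # hold the empty string '', which pandas does not treat as NaN
    print(df)
    df.to_csv(label + '.csv', index=False)
    # The cells are now split into rows, but the empty rows are still there (Figure 1)

    # Workaround: read the file back (empty fields are parsed as NaN),
    # drop those rows, and overwrite the file
    data1 = pd.read_csv(label + '.csv')
    data1 = data1.dropna(axis=0, how='any')
    data1.to_csv(label + '.csv', index=False)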

Figure 1: output of print(df) before the empty rows are removed (image not reproduced here)

After removing the empty rows:
(image not reproduced here)
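The difference between the two outputs can be reproduced without touching the disk. The sketch below uses made-up values and an in-memory buffer in place of the CSV file; it assumes the empty cells are empty strings, which is what stripping the brackets from "[]" produces.

```python
import io
import pandas as pd

# Hypothetical frame: the second cell is an empty string, not NaN.
df = pd.DataFrame({"学号": [1, 2], "科目": ["math", ""]})

# '' is an ordinary string, so dropna() keeps the row.
before = df.dropna()

# Round-trip through CSV text; the buffer stands in for label + '.csv'.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# read_csv parses the empty field as NaN, so dropna() now removes the row.
after = reloaded.dropna()
print(len(before), len(after))
```

The first count should still include the empty row, and the second should not.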

4. Summary

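To sum up: calling df = df.dropna(axis=0,how='any') right after explode removed nothing, and it only took effect once the csv file had been written and read back in. The likely reason is that an empty "[]" cell becomes an empty string once the brackets are stripped, and dropna treats an empty string as a normal value rather than a missing one. When the file is read again, read_csv parses the empty field as NaN, so dropna then works. Writing the file, reading it back, and then deleting NaN is a roundabout route, but a method that solves the problem is a good method.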
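If the round trip through the file is unwanted, one possible alternative is to turn the empty strings into real NaN before calling dropna. This is a sketch under the assumption that the empty cells are empty strings; the sample data is hypothetical, and str.strip / str.split are swapped in for the chained replace calls used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for test.csv
data = pd.DataFrame({
    "学号": [1, 2, 3],
    "科目": ["[math,physics]", "[]", "[art]"],
})

df = data[["学号", "科目"]].copy()
# Strip the surrounding brackets, then split on commas; "[]" becomes ['']
df["科目"] = df["科目"].str.strip("[]").str.split(",")
df = df.explode("科目", ignore_index=True)
# Convert empty strings into real NaN so that dropna can see them
df["科目"] = df["科目"].replace("", np.nan)
df = df.dropna(subset=["科目"])
print(df)
```

Filtering with df[df["科目"] != ""] would work as well; the replace step is used here only to keep dropna as the deletion step, as in the article.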

5. Some useful methods for processing CSV data:

# Some handy methods
# Read only a given range of columns from the csv file
#list_a = np.arange(4)
#data = pd.read_csv('test.csv',usecols=list_a)

#data = data.dropna(axis=1,how='all') # drop columns that are entirely NaN (for rows: axis=0)
#data = data.dropna(axis=1,how='any') # drop columns that contain any NaN (for rows: axis=0)
#data = data.drop(['年龄','学号'],axis=1) # drop the named columns
#data = data.fillna(0) # replace NaN in the table with 0
#data = data.dropna(axis='index',how='any',subset=['年龄','学号']) # drop rows where either 年龄 or 学号 is NaN
#data = data.dropna(axis='index',how='all',subset=['年龄','学号']) # drop rows where both 年龄 and 学号 are NaN
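The difference between how='any' and how='all' with subset is easy to mix up, so here is a small runnable sketch. The values are invented for illustration; only the column names 年龄 (age) and 学号 (student ID) come from the snippets above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 lacks 学号, row 2 lacks both.
data = pd.DataFrame({
    "学号": [1, np.nan, np.nan],
    "年龄": [20, 21, np.nan],
})

# how='any': drop a row if either of the two columns is NaN.
any_dropped = data.dropna(axis="index", how="any", subset=["年龄", "学号"])
# how='all': drop a row only if both columns are NaN.
all_dropped = data.dropna(axis="index", how="all", subset=["年龄", "学号"])

print(len(any_dropped), len(all_dropped))
```

With this data, how='any' should keep only the complete row, while how='all' should also keep the row that is missing just 学号.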

There is of course much more than this. For a broader overview, here is a summary written by someone else:
Summary of pandas usage

By the way, this is the reference link for this article:
Splitting multiple rows of data


Origin blog.csdn.net/qq_45067943/article/details/123272628