Pandas data processing and cleaning: missing data, duplicate data, type conversion, grouping and aggregation

Table of contents

Foreword

Handling of missing data

Duplicate data processing

Data type conversion

Changes to column names and indexes

Grouping and aggregation operations

Summary


Foreword

This article walks through common data processing and cleaning operations in Pandas: handling missing data, removing duplicate data, converting data types, renaming columns and index labels, and grouping and aggregating. Each operation comes with a short code example. These operations are a routine part of preparing data for analysis and modeling, and they make the data easier to understand and work with.


Handling of missing data

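Real-world data often contains missing values, which have to be either filled in or removed before further processing. Pandas provides fillna() to replace missing values and dropna() to drop the rows (or columns) that contain them. Both return a new DataFrame and leave the original unchanged unless the result is assigned back.

import pandas as pd
import numpy as np

# Create a DataFrame that contains missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, np.nan]})

# Fill every missing value with 0 (returns a new DataFrame; df is unchanged)
df.fillna(0)

# Drop every row that contains at least one missing value
df.dropna()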
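
Filling everything with 0 or dropping every incomplete row is often too coarse. The sketch below shows a few finer-grained options on the same data. The variable names (missing_counts, filled, dropped) and the choice of fill values are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, np.nan]})

# Count the missing values in each column before choosing a strategy.
missing_counts = df.isna().sum()

# Fill each column with its own value: the column mean for A, a constant for B.
# Column C is not listed, so its missing value is left in place.
filled = df.fillna({'A': df['A'].mean(), 'B': 0})

# Drop only the rows where column C is missing; gaps in A or B are tolerated.
dropped = df.dropna(subset=['C'])
```

A per-column dictionary keeps the fill rule next to the column it applies to, and subset limits dropna() to the columns that actually matter for the analysis.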

Duplicate data processing

Duplicate rows can skew analysis results, so they usually need to be removed. Pandas provides drop_duplicates() for this. By default it compares all columns and keeps the first occurrence of each duplicated row.

import pandas as pd

# Create a DataFrame that contains a duplicate row (rows 0 and 1 are identical)
df = pd.DataFrame({'A': [1, 1, 2, 3],
                   'B': [4, 4, 6, 6]})

# Remove duplicate rows, keeping the first occurrence of each
df.drop_duplicates()
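
Rows are only treated as duplicates when every compared column matches, so it can help to inspect the duplicates first and to say which columns count. The following is a minimal sketch. The data is chosen so that no row repeats in full, and the variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3],
                   'B': [4, 5, 6, 6]})

# duplicated() returns a boolean mask; no row here repeats in full,
# so a plain drop_duplicates() would remove nothing.
full_dupes = df.duplicated()

# Compare on column A only and keep the first row of each repeated value.
by_a_first = df.drop_duplicates(subset=['A'])

# Same comparison, but keep the last occurrence instead.
by_a_last = df.drop_duplicates(subset=['A'], keep='last')
```

Checking duplicated() before dropping makes it obvious whether the default all-column comparison matches the intended notion of a duplicate.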

Data type conversion

Columns often arrive with the wrong type, for example numbers stored as strings, and must be converted before they can be used in calculations. Pandas provides astype() to convert data types.

import pandas as pd

# Create a DataFrame whose column B holds numbers stored as strings
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['4', '5', '6']})

# Convert column B from strings to integers
df['B'] = df['B'].astype(int)
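
astype(int) raises an error as soon as one value cannot be parsed. When the input may be dirty, pd.to_numeric() with errors='coerce' is a common alternative. The sketch below assumes a hypothetical column containing the unparseable string 'five'.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['4', 'five', '6']})

# astype(int) would raise ValueError on 'five'.
# errors='coerce' turns unparseable values into NaN instead.
df['B'] = pd.to_numeric(df['B'], errors='coerce')

# The result is float64, because NaN cannot be stored in a plain int column.
b_dtype = str(df['B'].dtype)

# astype() also accepts dtype names as strings.
df['A'] = df['A'].astype('float64')
```

The coerced NaN values can then be handled with fillna() or dropna() like any other missing data.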

Changes to column names and indexes

Column names and index labels often need to be changed during processing. Pandas provides rename(), which accepts a mapping from old labels to new ones for the columns, the index, or both.

import pandas as pd

# Create a DataFrame with explicit column names and index labels
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]},
                  index=['a', 'b', 'c'])

# Rename column 'A' to 'new_A' and index label 'a' to 'new_a'
df = df.rename(columns={'A': 'new_A'}, index={'a': 'new_a'})
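
rename() also accepts a function in place of a mapping, and the index itself can be moved into or out of the columns. The sketch below illustrates both. The label 'key' and the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]},
                  index=['a', 'b', 'c'])

# Apply a function to every label: lowercase the columns, uppercase the index.
renamed = df.rename(columns=str.lower, index=str.upper)

# Name the index 'key', then move it into an ordinary column.
flat = df.rename_axis('key').reset_index()

# Promote an existing column to be the new index.
reindexed = flat.set_index('A')
```

A function is convenient when every label follows the same rule, while a mapping is clearer when only a few labels change.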

Grouping and aggregation operations

Data frequently needs to be split into groups and summarized per group. Pandas provides groupby() to form the groups and agg() to apply one or more aggregation functions to them.

import pandas as pd

# Create a DataFrame with two key columns (A, B) and one value column (C)
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'],
                   'B': ['x', 'y', 'x', 'y'],
                   'C': [1, 2, 3, 4]})

# Group the rows by each combination of A and B
grouped = df.groupby(['A', 'B'])

# Sum column C within each group
grouped.agg({'C': 'sum'})
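
agg() can compute several statistics at once. One option is named aggregation, where each keyword argument becomes an output column. The sketch below groups by column A only. The output column names (total, average, rows) are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'],
                   'B': ['x', 'y', 'x', 'y'],
                   'C': [1, 2, 3, 4]})

# Named aggregation: output_name=(source_column, aggregation_function).
summary = df.groupby('A').agg(
    total=('C', 'sum'),
    average=('C', 'mean'),
    rows=('C', 'count'),
)

# reset_index() turns the group key back into an ordinary column.
summary = summary.reset_index()
```

Naming the outputs up front avoids the nested column labels that a list of functions would otherwise produce.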


Summary

This article covered common data processing and cleaning operations in Pandas: handling missing data with fillna() and dropna(), removing duplicates with drop_duplicates(), converting types with astype(), renaming columns and index labels with rename(), and grouping and aggregating with groupby() and agg(). Each operation was illustrated with a code example.

Origin blog.csdn.net/alike_u/article/details/129836392