[100 Days Proficient in Python] Day 56: Python Data Analysis - Pandas Data Cleaning and Processing

Table of contents

Data cleaning and processing

1. Handling missing values

1.1 Remove missing values:

1.2 Fill missing values:

1.3 Interpolation:

2 Data type conversion

2.1 Data type conversion

2.2 Conversion of date and time:

2.3 Transformation of categorical data:

2.4 Conversion of custom data types:

3 Data deduplication

4 Data Merging and Joining

4.1 pd.concat()

4.2 pd.merge()

4.3 df.join()


Data cleaning and processing

        For data cleaning and processing, Pandas provides a variety of capabilities, including handling missing values, data type conversion, data deduplication, and data merging and joining. The following sections describe each of these with examples:

1. Handling missing values

There are several ways to handle missing values in Pandas, including removing missing values, filling missing values, and interpolation.
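
Before choosing one of these strategies, it usually helps to see where the missing values actually are. A minimal sketch, assuming a small example DataFrame:

import pandas as pd

data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

# Boolean mask of missing entries and the number of missing values per column
print(df.isna())
print(df.isna().sum())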

1.1 Remove missing values:

        Removing missing values is the simplest approach, but it can lead to data loss. You can use the dropna() method to drop rows or columns that contain missing values.

(1) Drop rows containing missing values:

import pandas as pd

data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

# Drop rows that contain missing values
df_cleaned = df.dropna()
print("Result of dropping rows with missing values:\n", df_cleaned)

(2) Drop columns containing missing values:

# Drop columns that contain missing values
df_cleaned_columns = df.dropna(axis=1)
print("Result of dropping columns with missing values:\n", df_cleaned_columns)

1.2 Fill missing values:

        Filling missing values replaces them with specific values. You can use the fillna() method to fill in missing values.

Fill missing values with specific values:

# Fill missing values with a specific value
df_filled = df.fillna(0)  # fill missing values with 0
print("Result of filling missing values with a specific value:\n", df_filled)

1.3 Interpolation:

        Interpolation estimates missing values from the values of known data points. Pandas provides a variety of interpolation methods, such as linear interpolation and polynomial interpolation.

(1) Linear interpolation:

Linear interpolation assumes a linear relationship between neighboring known data points and uses it to estimate the missing values. This is a simple and commonly used interpolation method.

import pandas as pd

data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

# Fill missing values with linear interpolation
df_interpolated = df.interpolate()
print("Result of filling missing values with linear interpolation:\n", df_interpolated)

(2) Polynomial interpolation:

Polynomial interpolation fits a polynomial to the known data points to estimate missing values; you can specify the order of the polynomial. Note that method='polynomial' requires SciPy to be installed.

# Fill missing values with polynomial interpolation (order 2)
df_poly_interpolated = df.interpolate(method='polynomial', order=2)
print("Result of filling missing values with polynomial interpolation:\n", df_poly_interpolated)

(3) Time series interpolation:

For time series data, you can use time-based interpolation (method='time'), which takes the spacing of the timestamps into account.

# Create an example DataFrame with a datetime index
data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, 8]}
dates = pd.date_range(start='2021-01-01', periods=4)
df_time_series = pd.DataFrame(data, index=dates)

# Fill missing values with time-based interpolation
df_time_series_interpolated = df_time_series.interpolate(method='time')
print("Result of filling missing values with time-based interpolation:\n", df_time_series_interpolated)

2 Data type conversion

        In Pandas, data type conversion means changing the data type of one or more columns to another type. Converting data types helps you adapt the data to specific analysis needs or ensure data consistency. Here are some common data type conversion operations with examples:

2.1 Data type conversion

  • Use the astype() method to convert a column from one data type to another, such as converting an integer column to a floating-point column.
  • Use pd.to_numeric() to convert a column to a numeric type, such as integer or float.

import pandas as pd

# Create an example DataFrame
data = {'A': [1, 2, 3],
        'B': ['4', '5', '6']}
df = pd.DataFrame(data)

# Convert column 'A' from integer to float
df['A'] = df['A'].astype(float)

# Convert column 'B' from string to integer
df['B'] = pd.to_numeric(df['B'])

print(df)

Data type conversion in a DataFrame:

df.astype(dtype, copy=True, errors='raise')
  • dtype: The target data type to convert to. This can be a NumPy data type (such as np.float32) or a Python data type (such as float or int).
  • copy (optional, default True): Specifies whether to return a copy (True) or modify the original DataFrame (False).
  • errors (optional, default 'raise'): Specifies how to handle conversion errors. If 'raise', an exception is raised; if 'ignore', the original data is returned when conversion fails. (To turn unconvertible values into NaN, use pd.to_numeric(..., errors='coerce') instead.)

Data type conversion in a Series:

s.astype(dtype, copy=True, errors='raise')

import pandas as pd

# Create an example DataFrame
data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': ['7', 'x', '9']}
df = pd.DataFrame(data)

# Convert column 'A' from integer to float
df['A'] = df['A'].astype(float)

# Convert column 'B' from integer to string
df['B'] = df['B'].astype(str)

# Convert column 'C' from string to numeric, setting unconvertible values to NaN,
# then cast to the nullable integer type Int64 (plain int cannot hold NaN)
df['C'] = pd.to_numeric(df['C'], errors='coerce').astype('Int64')

print(df.dtypes)

In the example above, we demonstrated how to use astype() and pd.to_numeric() to convert data types, including converting integers to floating-point numbers, integers to strings, and strings to numbers while handling conversion errors (unconvertible values become NaN).

2.2 Conversion of date and time:

  • Use pd.to_datetime() to convert a column to a datetime type so that you can perform date and time operations on it.

import pandas as pd

# Create an example DataFrame
data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03'],
        'Value': [10, 15, 20]}
df = pd.DataFrame(data)

# Convert the 'Date' column from strings to datetime
df['Date'] = pd.to_datetime(df['Date'])

print(df.dtypes)
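
Once the column has a datetime type, individual date components can be extracted through the .dt accessor; a brief sketch continuing from the df above (the new 'Year' and 'Weekday' columns are just for illustration):

# Extract date components from the converted 'Date' column
df['Year'] = df['Date'].dt.year
df['Weekday'] = df['Date'].dt.day_name()
print(df)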

2.3 Transformation of categorical data:

  • Use astype('category') to convert a column to the categorical data type, which is suitable for columns with a limited set of discrete values.

import pandas as pd

# Create an example DataFrame
data = {'Category': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)

# Convert the 'Category' column to the categorical data type
df['Category'] = df['Category'].astype('category')

print(df.dtypes)
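
A categorical column exposes its categories and underlying integer codes through the .cat accessor, and you can check how much memory it uses compared with a plain string column; a short sketch continuing from the df above:

# Inspect the categories and the underlying integer codes
print(df['Category'].cat.categories)
print(df['Category'].cat.codes)

# Compare memory usage against the plain-string representation
print(df['Category'].memory_usage(deep=True))
print(df['Category'].astype(str).memory_usage(deep=True))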

2.4 Conversion of custom data types:

  • You can use a custom function to convert the data to the desired type, for example by using the apply() method.

import pandas as pd

# Create an example DataFrame
data = {'Numbers': ['1', '2', '3', '4']}
df = pd.DataFrame(data)

# Custom function that converts strings to integers, applied to the 'Numbers' column
df['Numbers'] = df['Numbers'].apply(lambda x: int(x))

print(df.dtypes)
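
For messier input, the same apply() pattern can wrap any parsing logic you need. A hypothetical sketch (the 'Price' column and the parse_price helper are made up for illustration):

import pandas as pd

# Hypothetical data: prices stored as strings with a currency symbol and thousands separators
df_prices = pd.DataFrame({'Price': ['$1,200', '$850', '$2,399']})

def parse_price(text):
    # Strip the currency symbol and separators, then convert to float
    return float(text.replace('$', '').replace(',', ''))

df_prices['Price'] = df_prices['Price'].apply(parse_price)
print(df_prices.dtypes)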

3 Data deduplication

In Pandas, you can use the drop_duplicates() method to remove duplicate rows. This method returns a new DataFrame with the duplicate rows removed. Here is an example of how to perform data deduplication in Pandas:

import pandas as pd

# Create an example DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30]}
df = pd.DataFrame(data)

# Deduplicate based on all columns
df_no_duplicates = df.drop_duplicates()

print("Original DataFrame:")
print(df)

print("\nDeduplicated DataFrame:")
print(df_no_duplicates)

In the example above, the drop_duplicates() method deduplicates based on the content of all columns. If you want to deduplicate based on specific columns only, you can specify them with the subset parameter:

# Deduplicate based on the 'Name' column
df_no_duplicates_name = df.drop_duplicates(subset=['Name'])

print("DataFrame deduplicated on the 'Name' column:")
print(df_no_duplicates_name)

You can also use the keep parameter to control which duplicate is kept. For example, keep='first' (the default) keeps the first occurrence, while keep='last' keeps the last occurrence:

# Deduplicate on the 'Name' column, keeping the last occurrence
df_keep_last = df.drop_duplicates(subset=['Name'], keep='last')

print("DataFrame deduplicated on 'Name', keeping the last occurrence:")
print(df_keep_last)
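
If you would rather discard every row that is duplicated at all, keep=False drops all occurrences instead of retaining one; a small sketch with the same df:

# Drop all rows whose 'Name' appears more than once
df_unique_only = df.drop_duplicates(subset=['Name'], keep=False)

print("Rows with a unique 'Name':")
print(df_unique_only)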

These examples demonstrate how to deduplicate data using Pandas. Depending on your needs, you can choose different deduplication methods.

4 Data Merging and Joining

        In Pandas, you can use several methods for merging and joining data, which are often needed to combine multiple datasets for analysis. Here are some common merge and join operations with examples:

4.1 pd.concat()

  Used to stack multiple DataFrames along a specified axis (the row or column axis). By default, pd.concat() stacks DataFrames along the row axis (axis=0), i.e. appends them one after another in the row direction. If you want to stack them along the column axis instead, pass axis=1.

import pandas as pd

# Create two example DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

# Stack the two DataFrames along the row axis
result1 = pd.concat([df1, df2])

# Stack the two DataFrames along the column axis
result2 = pd.concat([df1, df2], axis=1)

print(result1)
print(result2)
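
Note that stacking along the row axis keeps each DataFrame's original row labels, so the combined result contains repeated index values (0, 1, 2, 0, 1, 2). If you want a fresh 0..n-1 index instead, a small sketch using ignore_index:

# Stack along the row axis and rebuild a clean 0..n-1 index
result3 = pd.concat([df1, df2], ignore_index=True)
print(result3)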


4.2 pd.merge()

Used to join two DataFrames together based on one or more keys (columns), similar to SQL's JOIN operation.

import pandas as pd

# Create two example DataFrames
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'value_left': ['V0', 'V1', 'V2']})

right = pd.DataFrame({'key': ['K1', 'K2', 'K3'],
                      'value_right': ['V3', 'V4', 'V5']})

# Merge on the 'key' column
result = pd.merge(left, right, on='key')

print(result)
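
By default pd.merge() performs an inner join, so only keys present in both DataFrames (K1 and K2 here) appear in the result. The join type can be changed with the how parameter; a short sketch:

# Keep all keys from both sides; unmatched values become NaN
result_outer = pd.merge(left, right, on='key', how='outer')

# Keep all keys from the left DataFrame only
result_left = pd.merge(left, right, on='key', how='left')

print(result_outer)
print(result_left)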


4.3 df.join()

        Used to join two DataFrames along their indexes. By default, df.join() performs a left join on the calling DataFrame's index.

import pandas as pd

# Create two example DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']}, index=['I0', 'I1', 'I2'])

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']}, index=['I1', 'I2', 'I3'])

# Join the two DataFrames along their indexes
result = df1.join(df2)

print(result)
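
Because df.join() defaults to a left join on the index, the row of df2 whose label is missing from df1 (I3 here) is dropped, and unmatched labels in df1 are filled with NaN. The how parameter changes this behavior; a brief sketch:

# Keep the union of both indexes
result_outer = df1.join(df2, how='outer')

# Keep only index labels present in both DataFrames
result_inner = df1.join(df2, how='inner')

print(result_outer)
print(result_inner)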


These are some common examples of data merge and join operations. Depending on your needs, you can choose an appropriate method to combine and join datasets. Pandas provides a wealth of options and parameters to meet different merging and joining needs.

Source: blog.csdn.net/qq_35831906/article/details/132708122