Data cleaning in Python with the pandas library

There is no one-size-fits-all template for data cleaning; every dataset brings its own problems. In practice you write several cleaning routines and call the appropriate one for the situation, for example a first-pass cleaning followed by a second pass.

1. Index deletion and basic statistical operations

import pandas as pd
import numpy as np

Create a 4x4 two-dimensional array holding the numbers 0 to 15, convert it into a DataFrame, and label the columns A, B, C, D:

df = pd.DataFrame(np.arange(0, 16).reshape(4, 4), columns=['A', 'B', 'C', 'D'])
print(df)

    A   B   C   D
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

Drop a row or column by its index or label:

# drop one row
print(df.drop(0, axis=0))
# drop one column
print(df.drop('A', axis=1))

    A   B   C   D
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
    B   C   D
0   1   2   3
1   5   6   7
2   9  10  11
3  13  14  15
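Note that drop() returns a new DataFrame by default; the original is unchanged unless you pass inplace=True or reassign the result. A minimal sketch of this behavior:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))

dropped = df.drop(0, axis=0)        # returns a new DataFrame without row 0
print(df.shape)                     # original is unchanged: (4, 4)
print(dropped.shape)                # (3, 4)

df.drop('A', axis=1, inplace=True)  # inplace=True modifies df itself
print(list(df.columns))             # ['B', 'C', 'D']
```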

# mean of each row
print(df.mean(axis=1))
# mean of each column
print(df.mean(axis=0))
# sum of each row
print(df.sum(axis=1))

0     1.5
1     5.5
2     9.5
3    13.5
dtype: float64
A    6.0
B    7.0
C    8.0
D    9.0
dtype: float64
0     6
1    22
2    38
3    54
dtype: int64

# add a new row with index 4 holding the sum of each column
df.loc[4] = df.sum(axis=0)
print(df)
# add a new column named sum_values holding the sum of each row
df['sum_values'] = df.sum(axis=1)
print(df)
print(df)

    A   B   C   D
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  24  28  32  36
    A   B   C   D  sum_values
0   0   1   2   3           6
1   4   5   6   7          22
2   8   9  10  11          38
3  12  13  14  15          54
4  24  28  32  36         120

2. Reading table data with pandas, detecting and handling null values

Drop null values: dropna(axis, how, inplace)
Fill null values: fillna(value, method, axis, inplace)
axis: which axis to operate on; 0 = rows, 1 = columns
how: when to drop; 'any' drops if any value is null, 'all' drops only if all values are null
inplace: True/False, whether to modify the current object in place
method: 'ffill' fills forward (from the previous value), 'bfill' fills backward (from the next value)
value: the value used for filling; can be a scalar, or a dict mapping column names to fill values
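To make the how parameter concrete, here is a minimal sketch (the column names a and b are made up for illustration):

```python
import numpy as np
import pandas as pd

# made-up sample: row 0 is complete, row 1 is partly null, row 2 is entirely null
df = pd.DataFrame({'a': [1.0, np.nan, np.nan], 'b': [2.0, 5.0, np.nan]})

any_dropped = df.dropna(axis=0, how='any')  # drop a row if ANY value is null
all_dropped = df.dropna(axis=0, how='all')  # drop a row only if ALL values are null
print(len(any_dropped))  # 1 -- only row 0 has no nulls
print(len(all_dropped))  # 2 -- only row 2 is entirely null
```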

Demonstration:

Create a table in Excel with columns 姓名 and 字段名, including a few blank cells, and save it as old.xlsx:

1. Read data and detect null values

df = pd.read_excel('./old.xlsx')  # skiprows=lambda x: x in [0, 0]
# basic information about the data
print(df.info())
print("检测空值:\n", df.isnull())
print("检测单列空值:", df['字段名'].isnull())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   姓名      4 non-null      object
 1   字段名     3 non-null      object
dtypes: object(2)
memory usage: 208.0+ bytes
None
检测空值:
       姓名    字段名
0  False  False
1  False  False
2  False   True
3  False   True
4   True  False
检测单列空值: 0    False
1    False
2     True
3     True
4    False
Name: 字段名, dtype: bool

# drop rows where every value is null
df.dropna(axis=0, how='all', inplace=True)
# drop columns where every value is null
df.dropna(axis=1, how='all', inplace=True)

2. Fill null values

# fill null values
df.fillna({'姓名': 'D'}, inplace=True)
df.fillna({'字段名': 0}, inplace=True)
df['字段名'] = df['字段名'].ffill()

Here the null values in the 姓名 column are filled with 'D', and the null values in the 字段名 column are filled with 0.

inplace controls whether the current object is modified in place; with True the change is kept when the table is finally written out. ffill means forward fill: each null takes the previous non-null value in the column.
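A minimal sketch of the two fill directions (the sample values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0] -- forward fill copies the previous value
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0] -- backward fill copies the next value
```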

3. Check whether values in a given column are duplicated

# check the 字段名 column for duplicates: True for a repeated value, False otherwise
print(df.duplicated('字段名'))

0    False
1    False
2    False
3     True
4    False
dtype: bool

4. Deduplicate

# drop rows whose 姓名 value is duplicated
print(df.drop_duplicates('姓名'))

  姓名  字段名
0  A   2.1
1  B   a_b
2  c     0
4  D    8行

3. Special processing of column data with string methods

Notes:

1. First get the str accessor of the Series, then call string methods on it

2. It can only be used on string columns, not on numeric columns

3. A DataFrame has no str accessor

4. series.str is an accessor (attribute), not Python's built-in str; the two are different

5. With expand=True, a split Series becomes a DataFrame
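A quick sketch of the str accessor and of expand=True (the sample data is made up):

```python
import pandas as pd

# made-up sample data
s = pd.Series(['a_b', 'c_d'])
print(s.str.upper().tolist())          # ['A_B', 'C_D']

parts = s.str.split('_', expand=True)  # expand=True turns the Series into a DataFrame
print(type(parts).__name__)            # DataFrame
print(parts[0].tolist())               # ['a', 'c'] -- first column of the split result
```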

df['字段名'] = df['字段名'].astype('str')
df['字段名'] = df['字段名'].str.replace('.', '', regex=False)  # treat the dot literally
df['姓名'] = df['姓名'].str.upper()
df['字段名'] = df['字段名'].str.split('_', expand=True)[0]  # [0] selects the first column of the expanded frame
df['字段名'] = df['字段名'].str.replace('行', '')
df.loc[1, '字段名'] = ord(df.loc[1, '字段名'])  # loc takes [row index, column label]
# type conversion: convert str to int, otherwise arithmetic is impossible
df['字段名'] = df['字段名'].astype('int64')

In the code above:

Line 1 uses astype() to convert the column to str (string) type.

Line 2 removes the decimal point from the value 2.1 in the old table. Take the str accessor of the column, df['字段名'].str (i.e. series.str), and call the string replace() to replace the dot with an empty string. Pass regex=False so the dot is treated literally; with regex=True a bare '.' is a regular expression that matches any character. Leaving regex unspecified in older pandas produced this warning:

FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  df['字段名'] = df['字段名'].str.replace('.', '')

Line 3: since c in the old table is lowercase, we uppercase the whole 姓名 column with the string method upper().

Line 4: the old table contains the value a_b; to keep just a, split on '_' with split(expand=True) and take the first column, [0].

Line 5: strip the character 行 from the value 8行 in the old table, leaving only the number.

Line 6: look up the value at row 1 of column 字段名, convert it to its integer code point with ord(), and assign it back. Note that loc takes the row index first and the column label second.

Line 7: convert the whole 字段名 column to int; everything was a string during processing, and it must be int for the calculations that follow.

The result looks like this:

  姓名  字段名
0  A    21
1  B    97
2  C     0
3  B     0
4  D     8

4. Calculate single statistical values

print("平均值 mean:", df['字段名'].mean())
print("最大值 max:", df['字段名'].max())
print("最小值 min:", df['字段名'].min())
print("中位数 median:", df['字段名'].median())
print("按值计数 value_counts:", df['字段名'].value_counts())

平均值 mean: 25.2
最大值 max: 97
最小值 min: 0
中位数 median: 8.0
按值计数 value_counts: 0     2
21    1
97    1
8     1
Name: 字段名, dtype: int64

5. Sorting data

1. Sort a Series

With ascending=False the sort is descending.

# sort the 字段名 column in ascending order
print(df['字段名'].sort_values(ascending=True))

2     0
3     0
4     8
0    21
1    97
Name: 字段名, dtype: int64
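For the descending case mentioned above, a minimal sketch (reusing the cleaned 字段名 values):

```python
import pandas as pd

s = pd.Series([21, 97, 0, 0, 8])
print(s.sort_values(ascending=False).tolist())  # [97, 21, 8, 0, 0]
```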

2. DataFrame single-column sorting

print(df.sort_values(by='字段名', ascending=True))

  姓名  字段名
2  C     0
3  B     0
4  D     8
0  A    21
1  B    97

3. DataFrame multi-column sorting

print(df.sort_values(by=['姓名', '字段名'], ascending=[True, True]))

  姓名  字段名
0  A    21
3  B     0
1  B    97
2  C     0
4  D     8

Save the data

df.to_excel('./new.xlsx', index=False)

View the new data

The key to data cleaning is applying these pieces together. The main reason to use Python is to simplify the cleaning process, choosing different tools for different levels of cleaning as the situation demands.

6. Complete code

# _*_ coding:utf-8 _*_
# @Time    : 2022/9/12 19:06
# @Author  : ice_Seattle
# @File    : 数据清洗.py
# @Software: PyCharm

import pandas as pd

# import the Excel file
df = pd.read_excel('./old.xlsx')  # skiprows=lambda x: x in [0, 0]
# skiprows: how many rows to skip
# basic information about the data
print(df.info())
print("检测空值:\n", df.isnull())
print("检测单列空值:", df['字段名'].isnull())
# handle null values
# drop rows where every value is null
df.dropna(axis=0, how='all', inplace=True)
# drop columns where every value is null
df.dropna(axis=1, how='all', inplace=True)
# fill null values
df.fillna({'姓名': 'D'}, inplace=True)
df.fillna({'字段名': 0}, inplace=True)
df['字段名'] = df['字段名'].ffill()
# check the 字段名 column for duplicates: True for a repeated value, False otherwise
print(df.duplicated('字段名'))
# drop rows whose 姓名 value is duplicated
print(df.drop_duplicates('姓名'))

df['字段名'] = df['字段名'].astype('str')
df['字段名'] = df['字段名'].str.replace('.', '', regex=False)  # treat the dot literally
df['姓名'] = df['姓名'].str.upper()
df['字段名'] = df['字段名'].str.split('_', expand=True)[0]  # [0] selects the first column of the expanded frame
df['字段名'] = df['字段名'].str.replace('行', '')
df.loc[1, '字段名'] = ord(df.loc[1, '字段名'])  # loc takes [row index, column label]
# type conversion: convert str to int, otherwise arithmetic is impossible
df['字段名'] = df['字段名'].astype('int64')
print(df)
# calculate single statistical values
print("平均值 mean:", df['字段名'].mean())
print("最大值 max:", df['字段名'].max())
print("最小值 min:", df['字段名'].min())
print("中位数 median:", df['字段名'].median())
print("按值计数 value_counts:", df['字段名'].value_counts())

# 1. sort a Series
# sort the 字段名 column in ascending order
print(df['字段名'].sort_values(ascending=True))
# 2. DataFrame single-column sorting
print(df.sort_values(by='字段名', ascending=True))
# 3. DataFrame multi-column sorting
print(df.sort_values(by=['姓名', '字段名'], ascending=[True, True]))
# save the data
df.to_excel('./new.xlsx', index=False)

If you run into problems with your own data, leave a message in the comments and we can discuss it together.

Origin blog.csdn.net/qq_53521409/article/details/126828704