[Python Data Processing - DataFrame Data Cleaning] Duplicate value processing, missing value processing, specific value replacement, deletion of specified condition rows


I have also summarized the related DataFrame material in the notes below; data cleaning is one of its key topics.

[Python Study Notes—Nanny Edition] Chapter 4—About Pandas, data preparation, data processing, data analysis, and data visualization


4.3.1 Data cleaning

1. Processing of duplicate values: drop_duplicates()

drop_duplicates()
Removes rows that are identical to another row in the data structure (one copy of each is kept).

【Example 4-6】Data deduplication.
Here df is the original data, in which rows 7 and 9 are duplicates of each other, as are rows 8 and 10.

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')  # path as used by the author
df

Out[1]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 NaN 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38

newDF=df.drop_duplicates()
newDF

Out[2]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 NaN 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38

In df above, rows 7 and 9 are identical, and so are rows 8 and 10. After deduplication, only one row of each pair is kept (rows 7 and 8).
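As a minimal sketch of the same idea, the snippet below uses a small made-up DataFrame (the values are hypothetical stand-ins, not the contents of rz2.xlsx). The `subset` and `keep` parameters of `drop_duplicates()` control which columns are compared and which copy survives, and `duplicated()` exposes the underlying mask:

```python
import pandas as pd

# Hypothetical sample data standing in for rz2.xlsx
df = pd.DataFrame({
    "YHM": ["S1402048", "S1405011", "S1402048", "S1405011", "S1402048"],
    "IP": ["221.205.98.55", "183.184.230.38", "221.205.98.55",
           "183.184.230.38", "10.0.0.1"],
})

# Drop rows that are identical in every column, keeping the first occurrence
deduped = df.drop_duplicates()

# Compare only the YHM column and keep the last occurrence instead
last_per_user = df.drop_duplicates(subset=["YHM"], keep="last")

# duplicated() marks every row that repeats an earlier row
dup_mask = df.duplicated()

print(deduped)
print(last_per_user)
```

Note that the surviving rows keep their original index labels; call `reset_index(drop=True)` afterwards if a fresh 0..n-1 index is wanted.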

2. Missing value processing:

dropna(), df.fillna(), df.fillna(method='pad'), df.fillna(method='bfill'), df.fillna(df.mean()), df.fillna(df.mean()['math':'physical']), strip()

Missing data can be handled in three ways: filling in the values, deleting the affected rows, or leaving the data as it is.

[Example 4-6] Handling missing values.
Here df is the original data, in which rows 2, 3, and 5 contain missing values.

from pandas import DataFrame   
from pandas import read_excel   
df = read_excel('e://rz2.xlsx')    
df  

Out[1]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 NaN 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38

1. dropna() removes the rows of the data structure that contain missing values

[Example 4-7] Delete the rows that contain missing data

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
newDF=df.dropna()
newDF

Out[3]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
4 S1405010 18922253721 1.225790e+17 120.207.64.3
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38

In this example, rows 2, 3, and 5, which contain NaN, have been deleted.
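By default `dropna()` drops a row as soon as any column is missing. The sketch below, on hypothetical sample data, shows the standard `subset` and `thresh` parameters, which make the rule less aggressive:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in two columns
df = pd.DataFrame({
    "YHM": ["S1402048", "S1411023", "20031509", "20140007", "S1404095"],
    "TCSJ": [18922254812, 13522255003, 18822256753, np.nan, np.nan],
    "YWXT": [1.22579e17, np.nan, np.nan, 1.22579e17, np.nan],
})

any_missing = df.dropna()                # drop rows with at least one NaN
tcsj_only = df.dropna(subset=["TCSJ"])   # only consider NaN in the TCSJ column
at_least_two = df.dropna(thresh=2)       # keep rows with 2 or more non-NaN values

print(any_missing)
print(tcsj_only)
print(at_least_two)
```

`dropna()` returns a new DataFrame, so the result has to be assigned (as in `newDF = df.dropna()` above) for the change to be kept.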



2. df.fillna() replaces NaN with other values. Deleting rows with missing data outright can sometimes distort the analysis, so the gaps can be filled instead.

[Example 4-8] Replace missing values with a number or any string

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df.fillna('?')

Out[4]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 1.89223e+10 1.22579e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 1.35223e+10 1.22579e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 1.34223e+10 ? 221.205.98.55 2014-11-04 08:46:39
3 20031509 1.88223e+10 ? 222.31.51.200 2014-11-04 08:47:41
4 S1405010 1.89223e+10 1.22579e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 ? 1.22579e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 1.38223e+10 1.22579e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 1.33223e+10 1.22579e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 1.89223e+10 1.22579e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 1.33223e+10 1.22579e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 1.89223e+10 1.22579e+17 183.184.230.38 2014-11-04 08:14:55
The missing values in rows 2, 3, and 5 have been replaced with '?'.
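Filling a numeric column with a string such as '?' generally turns that column into a non-numeric one, which may get in the way of later arithmetic. One possible alternative, sketched here on hypothetical sample data, is to pass `fillna()` a dict that gives each column its own fill value:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one numeric and one text column, each with a gap
df = pd.DataFrame({
    "TCSJ": [18922254812, np.nan, 13822254373],
    "IP": ["221.205.98.55", None, "222.31.59.220"],
})

# The dict maps column names to the fill value for that column
filled = df.fillna({"TCSJ": 0, "IP": "unknown"})

print(filled)
print(filled.dtypes)
```

With this approach the TCSJ column stays numeric while the IP column receives a readable placeholder.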

3. df.fillna(method='pad') replaces NaN with the previous value in the same column

[Example 4-9] Replace missing values with the previous value
(rows 2, 3, and 5 contain missing values)

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df.fillna(method='pad')

Out[5]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812 1.225790e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003 1.225790e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938 1.225790e+17 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753 1.225790e+17 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721 1.225790e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 18922253721 1.225790e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
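Recent pandas releases (2.1 and later) deprecate the `method` argument of `fillna()` and recommend the dedicated `ffill()` method instead. A minimal sketch of the forward fill on hypothetical data, including the optional `limit` parameter:

```python
import numpy as np
import pandas as pd

# Hypothetical column with a leading gap and a run of two gaps
df = pd.DataFrame({"YWXT": [np.nan, 1.0, np.nan, np.nan, 2.0]})

forward = df.ffill()          # same effect as fillna(method='pad')
limited = df.ffill(limit=1)   # fill at most one consecutive NaN per gap

print(forward)
print(limited)
```

A forward fill cannot fill a gap in the first row, because there is no earlier value to copy.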

4. df.fillna(method='bfill') replaces NaN with the next value in the same column

[Example 4-10] Replace NaN with the next value
(rows 2, 3, and 5 contain missing values)

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df.fillna(method='bfill')

Out[6]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812 1.225790e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003 1.225790e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938 1.225790e+17 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753 1.225790e+17 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721 1.225790e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 13822254373 1.225790e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
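The backward fill has the mirror-image limitation of the forward fill: a gap in the last row has no later value to copy. The sketch below uses hypothetical data and the `bfill()` method (the replacement pandas recommends for `fillna(method='bfill')`), and shows one possible way to cover the trailing gap by chaining a forward fill:

```python
import numpy as np
import pandas as pd

# Hypothetical column with a gap in the middle and a gap at the end
df = pd.DataFrame({"TCSJ": [1.0, np.nan, 3.0, np.nan]})

backward = df.bfill()          # the last NaN stays, nothing follows it
both = df.bfill().ffill()      # forward fill then covers the trailing gap

print(backward)
print(both)
```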

5. df.fillna(df.mean()) replaces NaN with the column mean (other descriptive statistics can be used the same way).

[Example 4-11] Fill missing data with the mean.

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2_0.xlsx')
df

df.fillna(df.mean())

Out[7]:
No math physical Chinese
0 1 76 85 78
1 2 85 56 NaN
2 3 76 95 85
3 4 NaN 75 58
4 5 87 52 68

Out[8]:
No math physical Chinese
0 1 76 85 78.00
1 2 85 56 72.25
2 3 76 95 85.00
3 4 81 75 58.00
4 5 87 52 68.00
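The result can be reproduced without the Excel file by rebuilding the table shown in Out[7] by hand. This sketch also fills with the median, as one example of another descriptive statistic:

```python
import numpy as np
import pandas as pd

# The score table from Out[7], typed in by hand
df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5],
    "math": [76, 85, 76, np.nan, 87],
    "physical": [85, 56, 95, 75, 52],
    "Chinese": [78, np.nan, 85, 58, 68],
})

# df.mean() is a Series indexed by column name; fillna matches it column by column
mean_filled = df.fillna(df.mean())

# Any other per-column statistic works the same way, for example the median
median_filled = df.fillna(df.median())

print(mean_filled)
print(median_filled)
```

The mean skips NaN by default, so the math mean is (76 + 85 + 76 + 87) / 4 = 81 and the Chinese mean is (78 + 85 + 58 + 68) / 4 = 72.25, matching Out[8].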

6. df.fillna(df.mean()['math':'physical']) fills missing values only in the selected columns

[Example 4-12] Fill only certain columns, each with its own column mean

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2_0.xlsx')
df.fillna(df.mean()['math':'physical'])

Out[26]:
No math physical Chinese
0 1 76.0 85 78.0
1 2 85.0 56 NaN
2 3 76.0 95 85.0
3 4 NaN 75 58.0
4 5 87.0 52 68.0

Out[9]:
No math physical Chinese
0 1 76 85 78
1 2 85 56 NaN
2 3 76 95 85
3 4 81 75 58
4 5 87 52 68
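The label slice `['math':'physical']` is applied to the Series returned by `df.mean()`, so only the means of the columns from math through physical are passed to `fillna()`. A sketch on the same hand-typed table, alongside an explicit dict form that should be equivalent here:

```python
import numpy as np
import pandas as pd

# The score table from the example, typed in by hand
df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5],
    "math": [76, 85, 76, np.nan, 87],
    "physical": [85, 56, 95, 75, 52],
    "Chinese": [78, np.nan, 85, 58, 68],
})

means = df.mean()

# Label slices include both endpoints: this keeps the math and physical means
by_slice = df.fillna(means["math":"physical"])

# Naming the column explicitly gives the same result for this data
by_dict = df.fillna({"math": df["math"].mean()})

print(by_slice)
```

The NaN in the Chinese column is left alone because that column is not part of the slice.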

7. strip(): Removes the specified characters from the start and end of a string. By default it removes whitespace; characters in the middle of the string are left untouched.

[Example 4-13] Remove the specified characters from the start and end of a string.

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
newDF=df['IP'].str.strip()    # IP is an object column, so go through the .str accessor first
newDF

Out[27]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003.0 1.2257903137349373e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938.0 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753.0 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721.0 1.2257903137349373e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 1.2257903137349373e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373.0 1.2257903137349373e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681.0 1.2257903137349373e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681.0 1.2257903137349373e+17 183.184.230.38 2014-11-04 08:14:55

Out[10]:
0 221.205.98.55
1 183.184.226.205
2 221.205.98.55
3 222.31.51.200
4 120.207.64.3
5 222.31.51.200
6 222.31.59.220
7 221.205.98.55
8 183.184.230.38
9 221.205.98.55
10 183.184.230.38
Name: IP, dtype: object
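Because the IP values in the file already look clean, the effect of strip() is hard to see in the output above. The following sketch uses hypothetical strings with visible padding, and also shows the related `lstrip()` method and a custom character set:

```python
import pandas as pd

# Hypothetical strings with padding, a trailing newline, and '*' markers
s = pd.Series(["  221.205.98.55 ", "183.184.226.205\n", "**120.207.64.3**", None])

clean = s.str.strip()        # removes whitespace (including newlines) at both ends
stars = s.str.strip("*")     # removes only '*' characters at both ends
left = s.str.lstrip()        # removes whitespace at the start only

print(clean.tolist())
print(stars.tolist())
print(left.tolist())
```

Missing entries stay missing, so strip() can be applied before or after the missing-value steps above.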

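3. Specific value replacement: replace('absent', 0)

import numpy as np

df11 = df11.replace(np.nan,'[正常]')   # replace NaN with the label '[正常]' ("normal")
df11 = df11.replace('none',np.nan)     # treat the string 'none' as missing
df11 = df11.replace(' ― ',np.nan)      # treat the placeholder ' ― ' as missing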
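`replace()` also accepts a list of values or a nested dict, which can express several of these substitutions in one call. A minimal sketch on hypothetical data (the column names and the -1 sentinel are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: -1.0 is used as a "no score" sentinel
df = pd.DataFrame({
    "status": ["absent", "ok", "none", "ok"],
    "score": [-1.0, 90.0, 75.0, -1.0],
})

# Turn a sentinel number into a real missing value
no_sentinel = df.replace(-1.0, np.nan)

# Replace several values with the same replacement in one call
merged = df.replace(["absent", "none"], "missing")

# Restrict a replacement to one column with a nested dict
per_column = df.replace({"status": {"ok": "present"}})

print(no_sentinel)
print(merged)
print(per_column)
```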

4. Delete the rows that meet a condition: drop()

df = df.drop(df[condition].index)  # condition is a placeholder for a boolean mask, e.g. df['x'] < 0

# Delete mobile phones priced at 1000 or more

df_s_acc = df_s.drop(df_s[df_s['价格']>=1000].index)  # '价格' is the price column

Out[68]:
Unnamed: 0 ID value price … Tag change specification date
5566 6626 1354676 699 … Safe Mobile Phone 2020/11/1 2020-11-01
5565 6625 1354673 799 … Safe Mobile Phone 2020/11/1 2020-11-01
101 102 1346463 2699 … Safe Mobile Phone 2020/11/11 2020-11-11
64 64 1338710 3199 … Safe Mobile Phone 2020/11/11 2020-11-11
2884 3382 1352445 1499 … Safe Mobile Phone 2020/11/26 2020-11-26
2892 3391 1349515 1099 … Safe Mobile Phone 2020/11/26 2020-11-26
2910 3411 1349516 1299 … Safe Mobile Phone 2020/11/26 2020-11-26
2844 3340 1348871 999 … Safe Mobile Phone 2020/11/26 2020-11-26
4046 4845 1350884 799 … Safe Mobile Phone 2020/12/1 2020-12-01
4036 4834 1350882 699 … Safe Mobile Phone 2020/12/1 2020-12-01
3394 4023 1349976 799 … Safe Mobile Phone 2020/12/12 2020-12-12
4740 5656 1353088 2399 … Safe Mobile Phone 2020/12/22 2020-12-22
4737 5653 1353068 1999 … Safe Mobile Phone 2020/12/22 2020-12-22
4048 4847 1357947 1099 … Safe Mobile Phone 2021/1/1 2021-01-01
4038 4836 1357933 999 … Safe Mobile Phone 2021/1/1 2021-01-01
4043 4842 1357949 1199 … Safe Mobile Phone 2021/1/1 2021-01-01

Out[72]:
Unnamed: 0 ID value price … Tag change specification date
5566 6626 1354676 699 … Safe Mobile Phone 2020/11/1 2020-11-01
5565 6625 1354673 799 … Safe Mobile Phone 2020/11/1 2020-11-01
2844 3340 1348871 999 … Safe Mobile Phone 2020/11/26 2020-11-26
4046 4845 1350884 799 … Safe Mobile Phone 2020/12/1 2020-12-01
4036 4834 1350882 699 … Safe Mobile Phone 2020/12/1 2020-12-01
3394 4023 1349976 799 … Safe Mobile Phone 2020/12/12 2020-12-12
4038 4836 1357933 999 … Safe Mobile Phone 2021/1/1 2021-01-01

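Multiple conditions can also be combined:

df_clear = df.drop(df[df['x']<0.01].index)  # delete rows where x is less than 0.01
# Multiple conditions can also be used
df_clear = df.drop(df[(df['x']<0.01) | (df['x']>10)].index)  # delete rows where x is less than 0.01 or greater than 10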
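The `drop(...index)` pattern removes rows by index label, so it behaves as expected when the index labels are unique. The sketch below, on hypothetical data, compares it with plain boolean filtering (`df[~mask]`), which gives the same result for a unique index and differs when labels repeat:

```python
import pandas as pd

# Hypothetical data with a default (unique) index
df = pd.DataFrame({"x": [0.001, 0.5, 12.0, 3.0]})
mask = (df["x"] < 0.01) | (df["x"] > 10)

dropped = df.drop(df[mask].index)   # the pattern used above
kept = df[~mask]                    # keep the rows where the condition is False

# With repeated index labels, drop() removes every row sharing a matched label
dup = pd.DataFrame({"x": [0.001, 0.5]}, index=["a", "a"])
dup_mask = dup["x"] < 0.01
dup_dropped = dup.drop(dup[dup_mask].index)   # both rows are labelled 'a'
dup_kept = dup[~dup_mask]                     # only the matching row is removed

print(dropped)
print(len(dup_dropped), len(dup_kept))
```

If the index may contain repeated labels (for example after concatenating files), boolean filtering or a prior `reset_index(drop=True)` is the safer choice.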

Origin blog.csdn.net/Yedge/article/details/127481705