[Python Data Processing - DataFrame Data Cleaning]
- 4.3.1 Data cleaning
- 1. Processing of duplicate values: drop_duplicates()
- 2. Missing value processing:
- 1. dropna() removes rows that contain missing values
- 2. df.fillna() replaces NaN with another value
- 3. df.fillna(method='pad') replaces NaN with the previous value
- 4. df.fillna(method='bfill') replaces NaN with the next value
- 5. df.fillna(df.mean()) replaces NaN with the mean or another descriptive statistic
- 6. df.fillna(df.mean()['math':'physical']) fills missing values in selected columns only
- 7. strip() removes the specified characters from both ends of a string (whitespace by default; characters in the middle are left untouched)
- 3. Specific value replacement: replace('absent', 0)
- 4. Delete the rows that meet a condition: drop()
4.3.1 Data cleaning
1. Processing of duplicate values: drop_duplicates()
drop_duplicates() removes rows that are identical in the data structure, keeping one of them.
[Example 4-6] Data deduplication.
Here df is the original data, in which rows 7 and 9 are duplicates of each other, as are rows 8 and 10.
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df
Out[1]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 NaN 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38
newDF = df.drop_duplicates()
newDF
Out[2]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 NaN 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
In df above, rows 7 and 9 are identical, and so are rows 8 and 10. After deduplication, only the first row of each pair is kept (rows 7 and 8).
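The example above uses the defaults. As a minimal sketch of the other options, the block below uses a small in-memory DataFrame as a stand-in for rz2.xlsx (the sample rows and variable names are assumptions for illustration) and shows the `keep` and `subset` parameters of `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical stand-in for rz2.xlsx: rows 0/2 and rows 1/3 are identical
df = pd.DataFrame({
    "YHM": ["S1402048", "S1405011", "S1402048", "S1405011"],
    "TCSJ": [13322252452, 18922257681, 13322252452, 18922257681],
    "IP": ["221.205.98.55", "183.184.230.38", "221.205.98.55", "183.184.230.38"],
})

# Default: compare all columns and keep the first occurrence of each row
first_kept = df.drop_duplicates()

# keep="last" keeps the last occurrence instead
last_kept = df.drop_duplicates(keep="last")

# keep=False drops every row that has a duplicate anywhere in the frame
none_kept = df.drop_duplicates(keep=False)

# subset restricts the comparison to the listed columns
by_user = df.drop_duplicates(subset=["YHM"])

print(first_kept)
print(last_kept)
```

Note that the original index labels are preserved, so the result may have gaps in its index; `reset_index(drop=True)` can renumber it if needed.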
2. Missing value processing:
The relevant methods are dropna(), df.fillna(), df.fillna(method='pad'), df.fillna(method='bfill'), df.fillna(df.mean()), df.fillna(df.mean()['math':'physical']), and strip().
There are three ways to handle missing data: fill in the missing values, delete the rows that contain them, or leave them as they are.
[Example 4-6] Handling missing values.
Here df is the original data, in which rows 2, 3, and 5 contain missing values.
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df
Out[1]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 NaN 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38
1. dropna() removes rows that contain missing values
[Example 4-7] Delete the rows that contain missing data
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
newDF = df.dropna()
newDF
Out[3]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
4 S1405010 18922253721 1.225790e+17 120.207.64.3
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38
In this example, rows 2, 3, and 5, which contain NaN values, have been deleted.
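By default dropna() deletes a row as soon as a single value in it is missing, which can discard a lot of data. The sketch below, using a small made-up DataFrame (the sample values are assumptions), shows the `how`, `subset`, and `thresh` parameters that make the rule less strict.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1 and row 2 each miss one value, row 3 misses everything
df = pd.DataFrame({
    "YHM": ["S1402048", "20031509", "20140007", None],
    "TCSJ": [18922254812, 18822256753, np.nan, np.nan],
    "YWXT": [1.225790e17, np.nan, 1.225790e17, np.nan],
})

any_missing = df.dropna()               # drop rows with at least one missing value
all_missing = df.dropna(how="all")      # drop rows only when every value is missing
by_column = df.dropna(subset=["TCSJ"])  # only look at the TCSJ column
at_least_two = df.dropna(thresh=2)      # keep rows with at least 2 non-missing values

print(any_missing)
print(by_column)
```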
2. df.fillna() replaces NaN with another value. Deleting rows with missing data outright can distort the results of an analysis, so the missing values can be filled in instead.
[Example 4-8] Replace missing values with a number or an arbitrary string
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df.fillna('?')
Out[4]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 1.89223e+10 1.22579e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 1.35223e+10 1.22579e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 1.34223e+10 ? 221.205.98.55 2014-11-04 08:46:39
3 20031509 1.88223e+10 ? 222.31.51.200 2014-11-04 08:47:41
4 S1405010 1.89223e+10 1.22579e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 ? 1.22579e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 1.38223e+10 1.22579e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 1.33223e+10 1.22579e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 1.89223e+10 1.22579e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 1.33223e+10 1.22579e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 1.89223e+10 1.22579e+17 183.184.230.38 2014-11-04 08:14:55
The missing values in rows 2, 3, and 5 have been replaced with '?'.
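Filling every column with the same string turns numeric columns into object columns, which may not be what a later calculation expects. As a hedged alternative, fillna() also accepts a dict that maps column names to fill values; the sketch below uses a made-up sample and arbitrary fill values (0 and -1 are assumptions, not values from the original data).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in two numeric columns
df = pd.DataFrame({
    "YHM": ["S1402048", "20031509", "20140007"],
    "TCSJ": [13422259938, 18822256753, np.nan],
    "YWXT": [np.nan, np.nan, 1.225790e17],
})

# One fill value per column; columns not listed are left untouched
filled = df.fillna({"TCSJ": 0, "YWXT": -1})

print(filled)
print(filled.dtypes)  # the filled columns stay numeric
```

fillna() returns a new DataFrame, so df itself still contains the NaN values unless the result is assigned back.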
3. df.fillna(method='pad') replaces NaN with the previous value in the same column
[Example 4-9] Replace missing values with the previous value
(rows 2, 3, and 5 contain missing values)
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df.fillna(method='pad')
Out[5]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812 1.225790e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003 1.225790e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938 1.225790e+17 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753 1.225790e+17 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721 1.225790e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 18922253721 1.225790e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
4. df.fillna(method='bfill') replaces NaN with the next value in the same column
[Example 4-10] Replace NaN with the next value
(rows 2, 3, and 5 contain missing values)
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df.fillna(method='bfill')
Out[6]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812 1.225790e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003 1.225790e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938 1.225790e+17 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753 1.225790e+17 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721 1.225790e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 13822254373 1.225790e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
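In recent pandas releases (2.1 and later, as far as I know) the `method` argument of fillna() is deprecated, and the dedicated ffill() and bfill() methods are the suggested replacement. The sketch below reproduces the two fills above on a small made-up DataFrame (the sample values are assumptions) and also shows the edge case where there is no neighbour to copy from.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: TCSJ misses a value in the middle, YWXT misses the first one
df = pd.DataFrame({
    "TCSJ": [18922253721, np.nan, 13822254373],
    "YWXT": [np.nan, 1.225790e17, 1.225790e17],
})

forward = df.ffill()   # same result as fillna(method='pad'): copy the previous value down
backward = df.bfill()  # same result as fillna(method='bfill'): copy the next value up

print(forward)
print(backward)
```

A forward fill cannot fill a NaN in the first row, and a backward fill cannot fill one in the last row, because there is no value to copy.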
5. df.fillna(df.mean()) replaces NaN with the column mean or another descriptive statistic.
[Example 4-11] Fill missing values with the mean.
from pandas import read_excel
df = read_excel('e://rz2_0.xlsx')
df
df.fillna(df.mean())
Out[7]:
No math physical Chinese
0 1 76 85 78
1 2 85 56 NaN
2 3 76 95 85
3 4 NaN 75 58
4 5 87 52 68
Out[8]:
No math physical Chinese
0 1 76 85 78.00
1 2 85 56 72.25
2 3 76 95 85.00
3 4 81 75 58.00
4 5 87 52 68.00
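To make the arithmetic explicit, the sketch below rebuilds the score table from the output above in memory and fills it with the column means, and then with the medians as one example of another descriptive statistic. Passing `numeric_only=True` is my addition; it is a precaution for tables that also contain text columns.

```python
import numpy as np
import pandas as pd

# The score table from Example 4-11, rebuilt in memory
df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5],
    "math": [76, 85, 76, np.nan, 87],
    "physical": [85, 56, 95, 75, 52],
    "Chinese": [78, np.nan, 85, 58, 68],
})

means = df.mean(numeric_only=True)      # NaN is skipped: math -> 324 / 4, Chinese -> 289 / 4
filled_mean = df.fillna(means)          # each column is filled with its own mean

medians = df.median(numeric_only=True)  # another descriptive statistic
filled_median = df.fillna(medians)

print(filled_mean)
print(filled_median)
```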
6. df.fillna(df.mean()['math':'physical']) fills missing values in selected columns only
[Example 4-12] Fill the missing values of selected columns with the mean of each column
from pandas import read_excel
df = read_excel('e://rz2_0.xlsx')
df.fillna(df.mean()['math':'physical'])
Out[26]:
No math physical Chinese
0 1 76.0 85 78.0
1 2 85.0 56 NaN
2 3 76.0 95 85.0
3 4 NaN 75 58.0
4 5 87.0 52 68.0
Out[9]:
No math physical Chinese
0 1 76 85 78
1 2 85 56 NaN
2 3 76 95 85
3 4 81 75 58
4 5 87 52 68
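The slice `['math':'physical']` works because df.mean() returns a Series indexed by column name, and label slices include both endpoints. The sketch below checks this on the same score table and shows an equivalent form that names the columns in a list; the variable `cols` is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

# The score table from Example 4-12, rebuilt in memory
df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5],
    "math": [76, 85, 76, np.nan, 87],
    "physical": [85, 56, 95, 75, 52],
    "Chinese": [78, np.nan, 85, 58, 68],
})

# Label slice on the Series of means: both 'math' and 'physical' are included
selected_means = df.mean(numeric_only=True)["math":"physical"]
filled_slice = df.fillna(selected_means)  # only columns present in the Series are filled

# Equivalent form with an explicit list of columns
cols = ["math", "physical"]
filled_list = df.copy()
filled_list[cols] = filled_list[cols].fillna(filled_list[cols].mean())

print(filled_slice)
```

In both results the missing math score is filled, while the missing Chinese score stays NaN because that column was not selected.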
7. strip() removes the specified characters from both ends of a string. By default it removes whitespace; characters in the middle are left untouched.
[Example 4-13] Remove the specified characters from the beginning and end of a string.
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
newDF = df['IP'].str.strip()  # IP has dtype object, so go through the .str accessor first
newDF
Out[27]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003.0 1.2257903137349373e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938.0 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753.0 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721.0 1.2257903137349373e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 1.2257903137349373e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373.0 1.2257903137349373e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681.0 1.2257903137349373e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681.0 1.2257903137349373e+17 183.184.230.38 2014-11-04 08:14:55
Out[10]:
0 221.205.98.55
1 183.184.226.205
2 221.205.98.55
3 222.31.51.200
4 120.207.64.3
5 222.31.51.200
6 222.31.59.220
7 221.205.98.55
8 183.184.230.38
9 221.205.98.55
10 183.184.230.38
Name: IP, dtype: object
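The IP values in the output above look the same before and after, because whitespace is invisible in the printed table. The sketch below uses a made-up Series with visible padding (the sample strings are assumptions) to show what str.strip() removes, what it leaves alone, and how a specific character or a single side can be targeted.

```python
import pandas as pd

# Hypothetical IP strings with leading/trailing whitespace, an inner space, and '*' padding
ip = pd.Series(["  221.205.98.55 ", "183.184.226.205\t", " 10.0. 0.1 ", "**120.207.64.3**"])

stripped = ip.str.strip()   # whitespace at both ends is removed, the inner space stays
stars = ip.str.strip("*")   # only '*' at both ends is removed
left = ip.str.lstrip()      # only leading whitespace is removed

print(stripped.tolist())
print(stars.tolist())
```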
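3. Specific value replacement: replace('absent', 0)
import numpy as np
df11 = df11.replace(np.nan, '[正常]')  # replace NaN with the label '[正常]' ("normal")
df11 = df11.replace('none', np.nan)  # treat the string 'none' as a missing value
df11 = df11.replace(' ― ', np.nan)  # treat the placeholder ' ― ' as a missing value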
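The snippets above assume an existing DataFrame df11. As a self-contained sketch, the block below builds a small made-up table (column names and values are assumptions) and shows three forms of replace(): a single value, a list of values, and a nested dict that limits the replacement to one column.

```python
import numpy as np
import pandas as pd

# Hypothetical score table with text placeholders mixed into the data
scores = pd.DataFrame({
    "math": [76, "absent", 87],
    "note": ["none", "ok", " ― "],
})

# Single value: every cell equal to 'absent' becomes 0
zeroed = scores.replace("absent", 0)

# List of values: several placeholders become NaN in one call
cleaned = scores.replace(["none", " ― "], np.nan)

# Nested dict: replace 'absent' with 0 in the math column only
mapped = scores.replace({"math": {"absent": 0}})

print(zeroed)
print(cleaned)
```

By default replace() matches whole cell values, not substrings, so a cell such as 'absent (sick)' would not be changed by the first call.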
4. Delete the row where the element that meets the condition is located: drop()
df = df.drop(df[condition].index)  # general pattern; condition is a boolean expression on df
# Delete mobile phones priced at 1000 or more
df_s_acc = df_s.drop(df_s[df_s['价格'] >= 1000].index)  # '价格' is the price column
Out[68]:
Unnamed: 0 ID value price... Tag change specification date
5566 6626 1354676 699 … Secure mobile phone 2020/11/1 2020-11-01
5565 6625 1354673 799 … Safe Mobile Phone 2020/11/1 2020-11-01
101 102 1346463 2699 … Safe Mobile Phone 2020/11/11 2020-11-11
64 64 1338710 3199 … Secure mobile phone 2020/11/11 2020-11-11
2884 3382 1352445 1499 … Secure mobile phone 2020/11/26 2020-11-26
2892 3391 1349515 1099 … Security mobile phone 2020/11/26 2020-11-26
2910 3411 1349516 1299 … Safe mobile phone 2020/11/26 2020-11-26
2844 3340 1348871 999 … Safe mobile phone 2020/11/26 2020-11-26
4046 4845 1350884 799 … Safe Mobile Phone 2020/12/1 2020-12-01
4036 4834 1350882 699 … Safe Mobile Phone 2020/12/1 2020-12-01
3394 4023 1349976 799 … Safe Mobile Phone 2020/12/12 2020-12-12
4740 5656 1353088 2399 … Safe Mobile Phone 2020/12/22 2020-12-22
4737 5653 1353068 1999 … Safe Mobile Phone 2020/12/22 2020-12-22
4048 4847 1357947 1099 … Safe Mobile Phone 2021/1/1 2021-01-01
4038 4836 1357933 999 … Safe Mobile Phone 2021/1/1 2021-01-01
4043 4842 1357949 1199 … Safe Mobile Phone 2021/1/1 2021-01-01
Out[72]:
Unnamed: 0 ID value price... Tag change specification date
5566 6626 1354676 699 … Secure mobile phone 2020/11/1 2020-11-01
5565 6625 1354673 799 … Safe Mobile Phone 2020/11/1 2020-11-01
2844 3340 1348871 999 … Safe Mobile Phone 2020/11/26 2020-11-26
4046 4845 1350884 799 … Secure mobile phone 2020/12/1 2020-12-01
4036 4834 1350882 699 … Secure mobile phone 2020/12/1 2020-12-01
3394 4023 1349976 799 … Secure mobile phone 2020/12/12 2020-12-12
4038 4836 1357933 999 … Safe mobile phone 2021/1/1 2021-01-01
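The same pattern works with a single condition or with several conditions combined:
df_clear = df.drop(df[df['x'] < 0.01].index)  # delete rows where x is less than 0.01
# Multiple conditions can be combined with | (or) and & (and); each condition needs its own parentheses
df_clear = df.drop(df[(df['x'] < 0.01) | (df['x'] > 10)].index)  # delete rows where x is less than 0.01 or greater than 10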
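The drop() pattern removes rows by index label. A common alternative, sketched below on made-up data (the column x and the index labels are assumptions), is to keep the rows that do not match the condition with a boolean mask. The two give the same result when index labels are unique, but they differ when labels repeat.

```python
import pandas as pd

# Hypothetical data with unique index labels
df = pd.DataFrame({"x": [0.005, 0.5, 5.0, 50.0]}, index=[10, 11, 12, 13])

mask = (df["x"] < 0.01) | (df["x"] > 10)  # rows to delete
dropped = df.drop(df[mask].index)         # delete by index label
kept = df[~mask]                          # keep the rows that do not match

# With duplicate index labels, drop() removes every row that shares the label
dup = pd.DataFrame({"x": [0.005, 0.5]}, index=[0, 0])
dup_dropped = dup.drop(dup[dup["x"] < 0.01].index)  # both rows carry label 0, so both go
dup_kept = dup[~(dup["x"] < 0.01)]                  # only the matching row is removed

print(dropped)
print(len(dup_dropped), len(dup_kept))
```

If the index may contain repeated labels, for example after concatenating several files, the mask form is the safer choice.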