Missing data is a common problem in real-world datasets. In machine learning and data mining, missing values caused by poor data quality can seriously reduce the accuracy of model predictions. Handling missing values properly is therefore a key step in making a model more accurate and effective.
The examples below use reindexing to create a DataFrame with missing values. In the output, NaN stands for "Not a Number".
1. Checking for missing values
To make missing values easier to detect (across different array dtypes), pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
# Reindexing adds rows 'b', 'd' and 'g', which have no data and become NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
print('\n')
print(df['one'].isnull())
Output:
one two three
a 0.036297 -0.615260 -1.341327
b NaN NaN NaN
c -1.908168 -0.779304 0.212467
d NaN NaN NaN
e 0.527409 -2.432343 0.190436
f 1.428975 -0.364970 1.084148
g NaN NaN NaN
h 0.763328 -0.818729 0.240498
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# notnull() is the inverse of isnull(): True where a value is present
print(df['one'].notnull())
Output results:
a True
b False
c True
d False
e True
f True
g False
h True
Name: one, dtype: bool
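Because isnull() and notnull() return boolean objects, they can be combined with sum() and boolean indexing. The following is a minimal sketch of that idea; the small fixed DataFrame and the variable names are illustrative assumptions, not part of the examples above.

```python
import numpy as np
import pandas as pd

# Small fixed frame (illustrative values) so the counts are reproducible.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [np.nan, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# isnull() on a DataFrame returns a boolean frame of the same shape.
mask = df.isnull()

# Summing booleans counts the True entries: missing values per column.
missing_per_column = mask.sum()

# Boolean indexing with notnull() keeps only the rows where 'one' is present.
rows_with_one = df[df["one"].notnull()]

print(missing_per_column)
print(rows_with_one)
```

Here column one has two missing values and column two has one, and only rows a and c survive the filter.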
2. Calculations with missing data
- When summing data, NA is treated as 0
- If the data is all NA, the sum is 0 in pandas 0.22 and later (earlier versions returned NA)
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
print('\n')
# NaN entries are skipped when summing
print(df['one'].sum())
Output:
one two three
a -1.191036 0.945107 -0.806292
b NaN NaN NaN
c 0.127794 -1.812588 -0.466076
d NaN NaN NaN
e 2.358568 0.559081 1.486490
f -0.242589 0.574916 -0.831853
g NaN NaN NaN
h 1.815404 -1.706736 -0.328030
0.7247067964060545
Example 2
import pandas as pd
# No data is supplied, so every cell is NaN
df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
print(df)
print('\n')
print(df['one'].sum())
Output:
one two
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
0
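The two rules above can be checked on a small Series with known values. This sketch also uses the skipna and min_count parameters of sum(), which this section does not otherwise cover; the Series contents are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_na = pd.Series([np.nan, np.nan])

# By default NaN is skipped, so it contributes nothing to the sum.
total = s.sum()

# skipna=False propagates NaN instead of skipping it.
total_strict = s.sum(skipna=False)

# An all-NaN Series sums to 0 (behaviour of pandas 0.22 and later).
empty_total = all_na.sum()

# min_count=1 requires at least one valid value, otherwise the result is NaN.
empty_total_na = all_na.sum(min_count=1)

print(total, total_strict, empty_total, empty_total_na)
```

Passing min_count=1 is one way to keep "no data at all" distinguishable from a genuine sum of zero.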
3. Filling in missing data
pandas provides several ways to clean up missing values. The fillna() function can "fill in" NA values with non-null data using several methods.
Replacing NaN with a scalar value
The following program shows how to replace NaN with 0.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print('\n')
print("NaN replaced with '0':")
# fillna(0) returns a new DataFrame; df itself is unchanged
print(df.fillna(0))
Output:
one two three
a -0.479425 -1.711840 -1.453384
b NaN NaN NaN
c -0.733606 -0.813315 0.476788
NaN replaced with '0':
one two three
a -0.479425 -1.711840 -1.453384
b 0.000000 0.000000 0.000000
c -0.733606 -0.813315 0.476788
Here the missing values are filled with zero; you can of course fill them with any other value instead.
Replacing missing (or) generic values
In many cases a generic value has to be replaced with a specific value. This can be done with the replace() method. Replacing NA with a scalar value this way is equivalent to the behavior of the fillna() function.
Example
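As an illustration of filling with values other than a single constant, fillna() also accepts a dict or a Series keyed by column name. The sketch below is an assumption-labeled example: the DataFrame contents and the chosen fill values are made up for demonstration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one gap per column.
df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [np.nan, 4.0, 6.0]})

# A dict gives each column its own fill value.
filled_by_column = df.fillna({"one": 0, "two": -1})

# df.mean() is a Series indexed by column name, so each gap is filled
# with the mean of its own column (NaN is skipped when computing the mean).
filled_with_mean = df.fillna(df.mean())

print(filled_by_column)
print(filled_with_mean)
```

Whether a constant, a column mean, or something else is appropriate depends on the data; the calls above only show the mechanics.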
import pandas as pd
df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000], 'two': [1000, 0, 30, 40, 50, 60]})
print(df)
print('\n')
# Replace the generic values 1000 and 2000 with 10 and 60 respectively
print(df.replace({1000: 10, 2000: 60}))
Output:
one two
0 10 1000
1 20 0
2 30 30
3 40 40
4 50 50
5 2000 60
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
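A common use of replace() is to turn a placeholder value into NaN so that the missing-data tools apply to it. The sketch below assumes a hypothetical sentinel of -999 (not taken from the example above) and also checks the stated equivalence between replacing NaN with a scalar and calling fillna().

```python
import numpy as np
import pandas as pd

# -999 is a hypothetical sentinel standing in for "no measurement".
df = pd.DataFrame({"one": [10, -999, 30], "two": [-999, 0, 60]})

# Turn the sentinel into a real missing value.
cleaned = df.replace(-999, np.nan)

# Replacing NaN with a scalar gives the same values as fillna().
via_replace = cleaned.replace(np.nan, 0)
via_fillna = cleaned.fillna(0)

print(cleaned)
print((via_replace == via_fillna).all().all())
```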
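Filling NA forward and backward
The fill concepts discussed in the reindexing chapter can also be used to fill in missing values.
Method | Action |
---|---|
pad/ffill | Fill values forward |
bfill/backfill | Fill values backward |
Example 1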
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
print('\n')
# ffill() fills forward; it matches fillna(method='pad'), which recent pandas versions deprecate
print(df.ffill())
Output:
one two three
a -0.023243 1.671621 -1.687063
b NaN NaN NaN
c -0.933355 0.609602 -0.620189
d NaN NaN NaN
e 0.151455 -1.324563 -0.598897
f 0.605670 -0.924828 -1.050643
g NaN NaN NaN
h 0.892414 -0.137194 -1.101791
one two three
a -0.023243 1.671621 -1.687063
b -0.023243 1.671621 -1.687063
c -0.933355 0.609602 -0.620189
d -0.933355 0.609602 -0.620189
e 0.151455 -1.324563 -0.598897
f 0.605670 -0.924828 -1.050643
g 0.605670 -0.924828 -1.050643
h 0.892414 -0.137194 -1.101791
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# bfill() fills backward; it matches fillna(method='backfill'), which recent pandas versions deprecate
print(df.bfill())
Output:
one two three
a 2.278454 1.550483 -2.103731
b -0.779530 0.408493 1.247796
c -0.779530 0.408493 1.247796
d 0.262713 -1.073215 0.129808
e 0.262713 -1.073215 0.129808
f -0.600729 1.310515 -0.877586
g 0.395212 0.219146 -0.175024
h 0.395212 0.219146 -0.175024
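Forward and backward filling are easier to follow on a short Series with known values. The sketch below also shows the limit parameter, which caps how many consecutive gaps are filled; the data is an illustrative assumption.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: each gap takes the last valid value before it.
forward = s.ffill()

# limit=1 fills at most one consecutive NaN; the rest of the gap stays NaN.
forward_limited = s.ffill(limit=1)

# Backward fill: each gap takes the next valid value after it.
backward = s.bfill()

print(forward.tolist())   # [1.0, 1.0, 1.0, 1.0, 5.0]
print(forward_limited.tolist())
print(backward.tolist())  # [1.0, 5.0, 5.0, 5.0, 5.0]
```

Note that a forward fill cannot fill a gap at the very start of the data, and a backward fill cannot fill one at the very end, because there is no neighbouring value to copy.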
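4. Dropping missing values
To simply exclude missing values, use the dropna function together with the axis parameter. By default, axis=0, meaning it is applied along rows: if any value in a row is NA, the whole row is excluded.
Example 1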
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Default axis=0: drop every row that contains at least one NaN
print(df.dropna())
Output:
one two three
a -0.719623 0.028103 -1.093178
c 0.040312 1.729596 0.451805
e -1.029418 1.920933 1.289485
f 1.217967 1.368064 0.527406
h 0.667855 0.147989 -1.035978
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# axis=1: drop every column that contains at least one NaN (here, all of them)
print(df.dropna(axis=1))
Output:
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
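The result above is empty because every column contains at least one NaN. When dropping on "any NA" is too aggressive, dropna() can be made more selective with its how, thresh and subset parameters. The following is a minimal sketch; the DataFrame values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan, 4.0],
        "two": [1.0, np.nan, 3.0, 4.0],
        "three": [1.0, np.nan, np.nan, np.nan],
    },
    index=["a", "b", "c", "d"],
)

# Default (how='any'): drop a row if any value in it is NaN.
any_dropped = df.dropna()

# how='all': drop a row only if every value in it is NaN.
all_dropped = df.dropna(how="all")

# thresh=2: keep only rows with at least two non-NaN values.
at_least_two = df.dropna(thresh=2)

# subset: look for NaN only in the listed columns.
subset_dropped = df.dropna(subset=["two"])

print(list(any_dropped.index))     # ['a']
print(list(all_dropped.index))     # ['a', 'c', 'd']
print(list(at_least_two.index))    # ['a', 'd']
print(list(subset_dropped.index))  # ['a', 'c', 'd']
```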