Pandas | 17 Handling missing data

Missing data is a constant problem in real-world work. In machine learning and data mining, missing values caused by poor data quality can seriously hurt the accuracy of model predictions. In these fields, handling missing values well is a key part of making models more accurate and effective.

 

The examples below use reindexing to create a DataFrame with missing values. In the output, NaN means "Not a Number".

1. Checking for missing values

To make missing values easier to detect (across different array dtypes), pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects.

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f','h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print('\n')

print (df['one'].isnull())

Output:

        one       two     three
a  0.036297 -0.615260 -1.341327
b       NaN       NaN       NaN
c -1.908168 -0.779304  0.212467
d       NaN       NaN       NaN
e  0.527409 -2.432343  0.190436
f  1.428975 -0.364970  1.084148
g       NaN       NaN       NaN
h  0.763328 -0.818729  0.240498


a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
 

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df['one'].notnull())
Output:

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
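
Since isnull() also works on a whole DataFrame, a common follow-up is to count the missing values per column or to pull out the affected rows. The following is only a minimal sketch: the small frame and the names missing_per_column and rows_with_missing are made up for illustration and are not part of the examples above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with known gaps (the values are arbitrary)
df = pd.DataFrame({'one': [1.0, np.nan, 3.0, np.nan],
                   'two': [np.nan, 2.0, 3.0, 4.0]},
                  index=['a', 'b', 'c', 'd'])

# isnull() on a DataFrame returns a boolean frame of the same shape;
# summing it counts the True values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) flags every row that contains at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```

Here column 'one' has two missing values and column 'two' has one, and only row 'c' is complete.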
 

2. Calculations with missing data

  • When summing data, NA is treated as 0
  • If the data is all NA, the sum is 0 by default (pandas 0.22 and later); pass min_count=1 to get NaN instead

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print('\n')

print (df['one'].sum())

Output:

        one       two     three
a -1.191036  0.945107 -0.806292
b       NaN       NaN       NaN
c  0.127794 -1.812588 -0.466076
d       NaN       NaN       NaN
e  2.358568  0.559081  1.486490
f -0.242589  0.574916 -0.831853
g       NaN       NaN       NaN
h  1.815404 -1.706736 -0.328030


0.7247067964060545

 

Example 2

import pandas as pd

df = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two'])

print(df)
print('\n')

print (df['one'].sum())

Output:

   one  two
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN

0
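
How sum() treats NA can be adjusted with the skipna and min_count parameters. The sketch below assumes pandas 0.22 or later (where min_count was introduced); the two Series and the variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_na = pd.Series([np.nan, np.nan])

# Default: NaN is skipped, so this is 1.0 + 2.0
total_skipping_na = s.sum()

# skipna=False lets a single NaN propagate into the result
total_with_na = s.sum(skipna=False)

# With nothing valid to add, the default result is 0
empty_sum = all_na.sum()

# min_count=1 requires at least one valid value, otherwise NaN is returned
strict_sum = all_na.sum(min_count=1)

print(total_skipping_na, total_with_na, empty_sum, strict_sum)
```

Using min_count=1 is one way to tell "the values add up to zero" apart from "there was nothing to add".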
 

3. Filling in missing data

Pandas provides several methods for cleaning up missing values. The fillna() function can "fill in" NA values with non-null data in a few ways.

 

Replacing NaN with a scalar value

The following program shows how to replace NaN with 0.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one','two', 'three'])

df = df.reindex(['a', 'b', 'c'])
print(df)
print('\n')
print("NaN replaced with '0':")
print(df.fillna(0))

Output:

        one       two     three
a -0.479425 -1.711840 -1.453384
b       NaN       NaN       NaN
c -0.733606 -0.813315  0.476788


NaN replaced with '0':
        one       two     three
a -0.479425 -1.711840 -1.453384
b  0.000000  0.000000  0.000000
c -0.733606 -0.813315  0.476788
 

Here we fill with zero; you can of course fill with any other value.
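
The fill value does not have to be a single scalar for the whole frame. As a rough sketch (the frame and the chosen fill values are illustrative assumptions), fillna() also accepts a dict keyed by column name, or a Series such as the column means:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 4.0, 8.0]})

# A dict maps each column name to its own fill value
filled_per_column = df.fillna({'one': 0.0, 'two': -1.0})
print(filled_per_column)

# A Series indexed by column name works the same way;
# df.mean() skips NaN, so each gap receives its column's mean
filled_with_mean = df.fillna(df.mean())
print(filled_with_mean)
```

Whether a constant or the mean is the better choice depends on the data and on what the downstream model expects.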

 

Replacing missing (or) generic values

Often a generic value has to be replaced with a specific one. The replace() method does this. Replacing NA with a scalar value is equivalent to the behavior of the fillna() function.

Example

import pandas as pd

df = pd.DataFrame({'one':[10,20,30,40,50,2000],'two':[1000,0,30,40,50,60]})

print(df)
print('\n')

print (df.replace({1000:10,2000:60}))

Output:

    one   two
0    10  1000
1    20     0
2    30    30
3    40    40
4    50    50
5  2000    60


   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
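
The claim that replacing NA with a scalar behaves like fillna() can be checked directly. This is a small sketch on an invented float frame; via_replace and via_fillna are just illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [10.0, np.nan, 30.0],
                   'two': [np.nan, 20.0, 30.0]})

# Treat NaN as the "generic" value and swap it for 0
via_replace = df.replace(np.nan, 0)

# The same operation expressed with fillna()
via_fillna = df.fillna(0)

# equals() compares both shape and element values
print(via_replace.equals(via_fillna))
```

For plain NaN handling fillna() states the intent more clearly, while replace() is the more general tool when sentinel values such as 1000 or 2000 stand in for missing data.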

 

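Filling NA forward and backward

Using the filling concepts discussed in the reindexing chapter, we can fill in the missing values.

Method          Action
pad/ffill       Fill values forward
bfill/backfill  Fill values backward

Example 1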

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print('\n')

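# Forward fill; same result as the older df.fillna(method='pad'),
# which is deprecated in recent pandas versions
print(df.ffill())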

Output:

        one       two     three
a -0.023243  1.671621 -1.687063
b       NaN       NaN       NaN
c -0.933355  0.609602 -0.620189
d       NaN       NaN       NaN
e  0.151455 -1.324563 -0.598897
f  0.605670 -0.924828 -1.050643
g       NaN       NaN       NaN
h  0.892414 -0.137194 -1.101791


        one       two     three
a -0.023243  1.671621 -1.687063
b -0.023243  1.671621 -1.687063
c -0.933355  0.609602 -0.620189
d -0.933355  0.609602 -0.620189
e  0.151455 -1.324563 -0.598897
f  0.605670 -0.924828 -1.050643
g  0.605670 -0.924828 -1.050643
h  0.892414 -0.137194 -1.101791
 

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

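# Backward fill; same result as the older df.fillna(method='backfill'),
# which is deprecated in recent pandas versions
print(df.bfill())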

Output:

        one       two     three
a  2.278454  1.550483 -2.103731
b -0.779530  0.408493  1.247796
c -0.779530  0.408493  1.247796
d  0.262713 -1.073215  0.129808
e  0.262713 -1.073215  0.129808
f -0.600729  1.310515 -0.877586
g  0.395212  0.219146 -0.175024
h  0.395212  0.219146 -0.175024
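
Forward and backward filling will bridge a gap of any length, which is not always wanted. As a hedged sketch on an invented Series, the limit parameter of ffill() and bfill() caps how many consecutive NaN values are filled:

```python
import numpy as np
import pandas as pd

# One valid value, a gap of three NaN, then another valid value
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Carry the last valid value forward, but at most one step into the gap
forward_limited = s.ffill(limit=1)
print(forward_limited)

# Pull the next valid value backward, again at most one step
backward_limited = s.bfill(limit=1)
print(backward_limited)
```

In both results two of the three NaN values remain, so long gaps stay visible instead of being silently papered over.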
 

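4. Dropping missing values

To simply exclude missing values, use the dropna function together with the axis argument. By default axis=0, i.e. it operates along rows, which means that if any value in a row is NA, the whole row is excluded.

Example 1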

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df.dropna())
Output:
        one       two     three
a -0.719623  0.028103 -1.093178
c  0.040312  1.729596  0.451805
e -1.029418  1.920933  1.289485
f  1.217967  1.368064  0.527406
h  0.667855  0.147989 -1.035978
 

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df.dropna(axis=1))
Output:
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
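
Dropping every row or column that contains any NA can discard a lot of data, as the empty result above shows. The sketch below, on an invented frame, uses the how, thresh and subset parameters of dropna() to make the rule less strict; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, np.nan, 4.0],
                   'two': [1.0, 2.0, np.nan, np.nan],
                   'three': [1.0, 2.0, np.nan, 4.0]},
                  index=['a', 'b', 'c', 'd'])

# how='all' drops a row only when every value in it is NA (row 'c')
only_all_na_rows_dropped = df.dropna(how='all')
print(only_all_na_rows_dropped)

# thresh=2 keeps rows that have at least two non-NA values
at_least_two_values = df.dropna(thresh=2)
print(at_least_two_values)

# subset limits the NA check to the listed columns
complete_in_one = df.dropna(subset=['one'])
print(complete_in_one)
```

The first two calls keep rows 'a', 'b' and 'd', while the subset variant keeps only 'a' and 'd', the rows where column 'one' is present.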
 




Origin www.cnblogs.com/Summer-skr--blog/p/11705887.html