Handling missing data

import numpy as np
from pandas import DataFrame

There are two kinds of missing data:

None
np.nan (NaN)

type(None)

NoneType

type(np.nan)

float

type(1000)

int

type("hello")

str

np.nan + 100

nan

100 + "dsaf"

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-fb724d538b75> in <module>()
----> 1 100 + "dsaf"

TypeError: unsupported operand type(s) for +: 'int' and 'str'

None + 100

1. None

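None is built into Python. Its own type is NoneType, and an array or column that holds it has dtype object. As a result, None cannot take part in any arithmetic.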
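A minimal sketch of what happens when None is used in arithmetic. The try/except wrapper and the name error_message are only there so the example runs to completion; they are not part of the original notebook.

```python
# None has its own type and defines no arithmetic operators.
print(type(None))

error_message = None
try:
    None + 100
except TypeError as exc:
    # Python rejects the addition with a TypeError.
    error_message = str(exc)
    print("TypeError:", error_message)
```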

Operations on the object dtype are much slower than operations on the int dtype.
Time a sum over each dtype:
%timeit np.arange(1e5,dtype=xxx).sum()

1e5

%timeit np.arange(1e6,dtype="int").sum()

%timeit np.arange(1e6,dtype="float").sum()

%timeit np.arange(1e6,dtype="object").sum()
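Outside IPython, the same comparison can be sketched with the standard-library timeit module. The array size and repeat counts below are assumptions chosen so the sketch runs quickly; absolute timings depend on the machine.

```python
import timeit

import numpy as np

n = 100_000  # assumed size, smaller than the 1e6 used above

timings = {}
for dtype in ("int", "float", "object"):
    arr = np.arange(n, dtype=dtype)
    # Best of 3 repeats, each repeat calling arr.sum() 10 times (seconds).
    timings[dtype] = min(timeit.repeat(arr.sum, number=10, repeat=3))

for dtype, seconds in timings.items():
    print(f"{dtype:>6}: {seconds:.5f} s for 10 sums")
```

The object array stores Python objects and adds them one by one, which is why it is typically the slowest of the three.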

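2. np.nan (NaN)

np.nan is a float, so it can take part in arithmetic. However, the result of any such arithmetic is always NaN.

The np.nan*() functions (np.nansum, np.nanmean, and so on) skip NaN values instead. For np.nansum this is equivalent to treating NaN as 0; np.nanmean also leaves NaN out of the count, so it is not the same as averaging with zeros.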

nd = np.array([10,20,30,np.nan])

nd.sum()

np.nansum(nd)

np.nanmean(nd)
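The three calls above can be compared side by side. This sketch reuses the nd array from the notebook; the variable names for the results are only illustrative.

```python
import numpy as np

nd = np.array([10, 20, 30, np.nan])

total = nd.sum()           # NaN propagates through an ordinary sum
nan_total = np.nansum(nd)  # NaN is skipped: 10 + 20 + 30
nan_mean = np.nanmean(nd)  # mean of the 3 valid values, not of 4

print(total, nan_total, nan_mean)
```

Note that nan_mean is 60 / 3 rather than 60 / 4, which shows that NaN is excluded from the count and not replaced by 0.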

np.array([1,2,3,np.nan,None])
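Mixing None into an array changes its dtype, which is worth checking explicitly. A small sketch, with the try/except added only so the example runs to completion:

```python
import numpy as np

mixed = np.array([1, 2, 3, np.nan, None])
print(mixed.dtype)  # the array falls back to the object dtype

sum_error = None
try:
    mixed.sum()
except TypeError as exc:
    # Adding None to a number fails, so the reduction raises.
    sum_error = exc
    print("TypeError:", exc)
```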


3. None and NaN in pandas

1) pandas treats both None and np.nan as np.nan
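A short sketch of this behavior on a Series (the Series below is a made-up example, not data from the notebook). In a numeric column None is converted to NaN; in a non-numeric column the stored value may differ by pandas version, but isnull() reports it as missing either way.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, None, np.nan])
print(s.dtype)     # numeric column: None has become NaN
print(s.isnull())

s_text = pd.Series(["a", None])
print(s_text.isnull())  # the None entry is still reported as missing
```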

Create a DataFrame

df = DataFrame([[10,20,34,None,23,np.nan],

Modify the DataFrame's data using its row and column indexes

df.sum(axis=0) # axis=0: sum down the rows, one total per column

A     40.0
B     96.0
C    266.0
D    312.0
E     23.0
F     20.0
dtype: float64

df.sum(axis=1) # axis=1: sum across the columns, one total per row

a     87.0
b    496.0
c    174.0
dtype: float64

[Note] When pandas computes a sum, NaN is skipped by default (skipna=True), which is equivalent to treating it as 0.
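The skipping behavior can be checked on a small frame. The frame demo below is hypothetical and stands in for df, whose full definition is not shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the df used in the notebook.
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})

skipped = demo.sum(axis=0)                    # NaN skipped by default
propagated = demo.sum(axis=0, skipna=False)   # NaN propagates instead
mean_x = demo["x"].mean()                     # averages the 2 valid values

print(skipped, propagated, mean_x, sep="\n")
```

For mean(), skipping is not the same as treating NaN as 0: the missing entry is also left out of the count.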

2) Operations on None and np.nan in pandas

isnull()
notnull()
dropna(): filter out missing data
fillna(): fill in missing data

(1) Test functions

isnull()
notnull()

df

df.isnull()

df.isnull().all(axis=0) # for each column, True only if every element is missing

A    False
B    False
C    False
D    False
E    False
F    False
dtype: bool

df.isnull().any(axis=0)

A    False
B    False
C    False
D     True
E     True
F     True
dtype: bool

df.isnull().all(axis=1)

a    False
b    False
c    False
dtype: bool

df.isnull().any(axis=1)

a    True
b    True
c    True
dtype: bool

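# Output the rows that have missing values
cond = df.isnull().any(axis=0) # note: this mask is indexed by column labels, not row labels
cond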

A    False
B    False
C    False
D     True
E     True
F     True
dtype: bool

df[cond]

d:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  """Entry point for launching an IPython kernel.

---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-19-023e73c851e3> in <module>()
----> 1 df[cond]

d:\Anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   1956         if isinstance(key, (Series, np.ndarray, Index, list)):
   1957             # either boolean or fancy integer index
-> 1958             return self._getitem_array(key)
   1959         elif isinstance(key, DataFrame):
   1960             return self._getitem_frame(key)

d:\Anaconda\lib\site-packages\pandas\core\frame.py in _getitem_array(self, key)
   1996             # check_bool_indexer will throw exception if Series key cannot
   1997             # be reindexed to match DataFrame rows
-> 1998             key = check_bool_indexer(self.index, key)
   1999             indexer = key.nonzero()[0]
   2000             return self.take(indexer, axis=0, convert=False)

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in check_bool_indexer(ax, key)
   1937         mask = isnull(result._values)
   1938         if mask.any():
-> 1939             raise IndexingError('Unalignable boolean Series provided as '
   1940                                 'indexer (index of the boolean Series and of '
   1941                                 'the indexed object do not match')

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match

df[df.columns[cond]] # output all columns that have missing values
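The IndexingError above arises because a boolean Series passed to df[...] is aligned against the row index, while a mask built with any(axis=0) is labeled by columns. A sketch of both selections on a hypothetical frame (demo and the mask names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the df used in the notebook.
demo = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "D": [np.nan, 5.0, 6.0]},
    index=["a", "b", "c"],
)

col_mask = demo.isnull().any(axis=0)  # labeled by columns: A, D
row_mask = demo.isnull().any(axis=1)  # labeled by rows: a, b, c

cols_with_nan = demo.loc[:, col_mask]  # columns containing a missing value
rows_with_nan = demo[row_mask]         # rows containing a missing value

print(cols_with_nan)
print(rows_with_nan)
```

Using .loc with an explicit row or column position makes it clear which axis the mask applies to.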

Conditions

(df>30).any(axis=0)

A    False
B     True
C     True
D     True
E    False
F    False
dtype: bool

(df<50).all(axis=1)

a    False
b    False
c    False
dtype: bool

# Output the columns whose values are all greater than 100
c = (df>100).all(axis=0)

df[df.columns[c]]
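The same column selection can be sketched end to end on a hypothetical frame. Note that a comparison with NaN is False, so a column containing NaN never satisfies an all() condition.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the df used in the notebook.
demo = pd.DataFrame({"A": [10, 20], "B": [150, 200], "C": [120, np.nan]})

all_above = (demo > 100).all(axis=0)  # one boolean per column
selected = demo.loc[:, all_above]     # keep only the columns that pass

print(all_above)
print(selected)
```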


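(2) Filtering function

dropna()

You can choose whether to drop rows or columns (rows by default).

You can also choose the rule with how: how='any' (the default) drops a row or column that contains any missing value, while how='all' drops it only when every value in it is missing.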

df

df.dropna(axis=1,how="any")
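A sketch of the main dropna() options on a hypothetical frame. The thresh parameter is an addition not covered above: it keeps rows with at least that many non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the df used in the notebook.
demo = pd.DataFrame({
    "A": [1.0, 2.0, np.nan],
    "B": [np.nan, 5.0, np.nan],
    "C": [7.0, 8.0, np.nan],
})

rows_any = demo.dropna()                   # drop rows with any NaN
rows_all = demo.dropna(how="all")          # drop rows that are entirely NaN
cols_any = demo.dropna(axis=1, how="any")  # drop columns with any NaN
rows_thresh = demo.dropna(thresh=2)        # keep rows with >= 2 valid values

print(rows_any, rows_all, cols_any, rows_thresh, sep="\n\n")
```

dropna() returns a new object and leaves demo unchanged.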

(3) Fill function (Series/DataFrame)

fillna()

df

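df.fillna(1000) # fill every missing value with a constant

You can choose between forward fill and backward fill.

df.fillna(method="bfill",axis=0) # fill each NaN with the value from the next row in the same column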

df.fillna(method="ffill",axis=0)

df.fillna(method="ffill",axis=1)

For a DataFrame you also have to choose the fill axis. Remember, for a DataFrame:

axis=0: index/rows
axis=1: columns
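The effect of the fill direction and axis is easiest to see on a small hypothetical frame. This sketch uses the ffill() and bfill() methods, which do the same fills as fillna(method="ffill") and fillna(method="bfill"); recent pandas releases deprecate the method= keyword, so the method forms are assumed to be the safer spelling.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the df used in the notebook.
demo = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]},
    index=["a", "b", "c"],
)

constant = demo.fillna(0)          # replace every NaN with 0
forward_rows = demo.ffill(axis=0)  # copy the value from the row above
backward_rows = demo.bfill(axis=0) # copy the value from the row below
forward_cols = demo.ffill(axis=1)  # copy the value from the column to the left

print(forward_rows, backward_rows, forward_cols, sep="\n\n")
```

A NaN with no valid neighbor in the fill direction stays NaN, for example the first row under a forward fill along axis=0.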

============================================

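Exercise 7:

Briefly describe the difference between None and NaN.
Suppose Zhang San and Li Si take a mock exam, but Zhang San, having suddenly figured out the meaning of life, skips the English test, so his score is recorded as None. Create a DataFrame for this and name it ddd3.
The teacher decides to fill in Zhang San's English score with his math score. How would you do that? What about filling it with Li Si's English score instead?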
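One possible sketch for the last two questions, under stated assumptions: the subjects, the scores, and the row/column layout are all made up for illustration, and only the missing English score comes from the exercise.

```python
import pandas as pd

# Hypothetical scores and layout: one row per student, one column per subject.
ddd3 = pd.DataFrame(
    {"Math": [90, 85], "English": [None, 70]},
    index=["Zhang San", "Li Si"],
)

# Fill along each row: Zhang San's English takes his Math score.
from_math = ddd3.ffill(axis=1)

# Fill along each column from the row below: Zhang San's English takes Li Si's.
from_li_si = ddd3.bfill(axis=0)

print(from_math)
print(from_li_si)
```

Which axis and direction to use depends entirely on how the frame is laid out, so a different layout would need a different call.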