pandas 之数据清洗-缺失值

Abstract

During the course fo doing data analysis and modeling, a significant amount of time is spend on data preparation: loading, cleaning, transforming, and rearrangin.
在整个数据分析建模过程中, 大量的时间(80%)的时间是用在了数据的预处理中, 如数据清洗, 加载, 标准化, 重塑等.

Such tasks are often reported to take 80% or more of an analyst's time. Sometimes the way that data is store in(存储于) files or databases is not in the right format for a particular task (数据存储在文件, 或数据库中, 而且格式, 内容不是你预想的那样能直接用的) Many reseachers choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, R, Java or Unix text-processing tools like sed or awk.(研究者通长使用一些通用的编程语言如Python, Java, R等, 对数据进行格式转换再进行处理) Fortunately(幸运地是), pandas, alone with the built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form. (
pandas是一个第三库, 提供了高效, 灵活的工具去帮助我们进行数据清理, 转换, 重塑等操作)

if you identify a type of data manipulation that isn't anywhere in this book or else-where in the pandas library, feel free to share your use case on one of the Python mailing lists or on the pandas GitHub site(如果有什么新的数据处理需求, pandas没有的话, 欢迎在Github 来留言) Indeed(事实上), much of the design and implementation of pandas has been driven by the needs of real-world applications.(panas的功能都是来自人们真实的生活工作中提出的需求哦)

In this chapter I discuss tools for missing data, duplicate data, string manipulation and some other analytical data transformations. (本章讨论缺失值, 重复值, 字符转换等, 格式等一些操作) In next chapter, I focus on combining and rearranging datasets in various ways.(下章讨论数据和合并, 连接, 重塑等操作)

import numpy as np
import pandas as pd

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as panless as possible.(pandas的一个小目标就是要让处理缺失数据变得尽可能简单) For example, all of the descriptive atatistics on pandas objects exlude missing data by defalult.(默认就过滤掉缺失数据了)

The way that missing data(直接丢弃缺失值) is represented(表示) in pandas objects is somewhat(稍微) imperfect(不友好), but it is functional for a lot of users.(尽管直接忽略缺失值, 有点暴力操作,但确实方便了很多用户) For numeric data, pandas uses the floating-point value NaN to represent missing data. We call this a sentinel value(哨兵值) that can be easily detected(被检测到):

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'cj'])

string_data

0     aardvark
1    artichoke
2          NaN
3           cj
dtype: object

"string_data.notnull()"
"检测缺失值"
string_data.isnull()

'string_data.notnull()'

'检测缺失值'

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we've adopted a convention used in the R programming language by refering to missing data as NA, which stands for not available(在pandas中, 我们采用了R语言中使用的规定, 将不可用的值用NA表示). In statistics (NA在统计学中,表示不存在或者不可观测)applications, NA data may either be data that does not exist or that exist but was not observed(through problems with with data collections(数据没有采集到)) When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data cause by missing data.(数据分析通常需要专门对缺失值进行处理, 不然会影响整个分析的结果哦)

The built-in Python None value is also treated as NA in objects arrays:

string_data[0] = None

"None 是作为缺失值"
string_data.isnull()

'None 是作为缺失值'

0     True
1    False
2     True
3    False
dtype: bool

There is work onging(不间断地) in the pandas project to improve the internal details of how missing data is handled, but the user API functions, like pandas.isnull, abstract away many of the annoying(令人烦恼的) details.

See Table 7-1 for a list of some functions related to missing data handing.

Argument	Description
dropna	根据轴标签来过滤其缺失值
fillna	填充的值, 或者用默认方法, 如"ffill(向前填充)" or "bfill"
isnull	返回布尔索引数组True表示NA
notnull	跟isnull相反

string_data.notnull()

0    False
1     True
2    False
3     True
dtype: bool

Filter Out Missing Data

There are a few ways to filter out missing data. While you always have the options to to dy hand using pandas. isnull and boolean indexing, the dropna can be helpful. On a Series, it retruns the Series with the non-null data and index values: (Series中, dropna直接干掉了NA)

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

"dropna 直接干掉缺失值 - 非原地(深拷贝)"
data.dropna()

'dropna 直接干掉缺失值 - 非原地(深拷贝)'

0    1.0
2    3.5
4    7.0
dtype: float64

data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

"this is equivalent to:"

data[data.notnull()]

'this is equivalent to:'

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that all NA or only those containing any Na. dropna by default drops any row containing a missing value.
(就DF删除缺失值而言, 可能有删除包含NA的整条记录(row), 或整个column, 默认是删除整行(row)该条记录样本作废)

data = pd.DataFrame([
    [1, 6.5, 3],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3]
    
])

data

	0	1	2
0	1.0	6.5	3.0
1	NaN	NaN	NaN
2	NaN	6.5	3.0

"默认axis=0, 删除行, how=any, 只要有NA就干掉整条记录"
data.dropna()


"Passing how='all' will only drop rows that ann NA"
data.dropna(how='all')

'默认axis=0, 删除行, how=any, 只要有NA就干掉整条记录'

	0	1	2
0	1.0	6.5	3.0

"Passing how='all' will only drop rows that ann NA"

	0	1	2
0	1.0	6.5	3.0
2	NaN	6.5	3.0

To drop columns in the same way, pass axis=1

"新增一列4, 值为NA"
data[4] = np.nan

data

'新增一列4, 值为NA'

	0	1	2	4
0	1.0	6.5	3.0	NaN
1	NaN	NaN	NaN	NaN
2	NaN	6.5	3.0	NaN

"dropna(axis=1, how='all') 表示删除全为缺失值的列, 这应该不常用吧, 变量都干掉了"

data.dropna(axis=1, how='all')

"dropna(axis=1, how='all') 表示删除全为缺失值的列, 这应该不常用吧, 变量都干掉了"

	0	1	2
0	1.0	6.5	3.0
1	NaN	NaN	NaN
2	NaN	6.5	3.0

A related ways to filter out(过滤掉行) DataFrame rows tends to(倾向于) concern(涉及) time series data. Suppose you want to keep only containing a certain nuber of observations. You can indicate this with the thresh argument.

df = pd.DataFrame(np.random.randn(7,3))

"前4行, 第2列的区域 赋值为 NA"
df.iloc[:4, 1] = np.nan

"前2行, 第3列的区域, 赋值为NA"
df.iloc[:2, 2] = np.nan

df

'前4行, 第2列的区域 赋值为 NA'

'前2行, 第3列的区域, 赋值为NA'

	0	1	2
0	-0.737352	NaN	NaN
1	0.537057	NaN	NaN
2	-0.972371	NaN	0.763062
3	0.852537	NaN	-0.845470
4	0.194360	-0.646723	-0.244883
5	0.275051	1.139541	-0.080366
6	0.544598	-1.875903	1.006292

"删除含有缺失值的行-注意, 都是非原地哦"
df.dropna()

'删除含有缺失值的行-注意, 都是非原地哦'

	0	1	2
4	0.194360	-0.646723	-0.244883
5	0.275051	1.139541	-0.080366
6	0.544598	-1.875903	1.006292

"thresh=n, 表示保留至少n个非NA值的行, n越小, 保留行越多"
df.dropna(thresh=1)

'thresh=n, 表示保留至少n个非NA值的行, n越小, 保留行越多'

	0	1	2
0	-0.737352	NaN	NaN
1	0.537057	NaN	NaN
2	-0.972371	NaN	0.763062
3	0.852537	NaN	-0.845470
4	0.194360	-0.646723	-0.244883
5	0.275051	1.139541	-0.080366
6	0.544598	-1.875903	1.006292

Filling In Missing Data

Rather than(相比于过滤掉缺失数据的行列) filtering out missing data(and potentially(可能) discarding(丢弃) other data along with it), you may want to fiil in the 'holes' in any number of ways(也许对缺失值进行填充不失为一种好的方案, 针对于样本量少的情况下). For most purpose(目标), the fillna method is the workehorse function to use. Calling fillna with a constant(常量) replaces missing values with that value
(调用fiina, 用常量取替换掉缺失值)

df 

"用0去填充DF中的缺失值 - 非原地的哦"
df.fillna(0)

	0	1	2
0	-0.737352	NaN	NaN
1	0.537057	NaN	NaN
2	-0.972371	NaN	0.763062
3	0.852537	NaN	-0.845470
4	0.194360	-0.646723	-0.244883
5	0.275051	1.139541	-0.080366
6	0.544598	-1.875903	1.006292

'用0去填充DF中的缺失值 - 非原地的哦'

	0	1	2
0	-0.737352	0.000000	0.000000
1	0.537057	0.000000	0.000000
2	-0.972371	0.000000	0.763062
3	0.852537	0.000000	-0.845470
4	0.194360	-0.646723	-0.244883
5	0.275051	1.139541	-0.080366
6	0.544598	-1.875903	1.006292

"除非赋值操作, 几乎绝大部分DF操作都是非原地的-深拷贝"
df

'除非赋值操作, 几乎绝大部分DF操作都是非原地的-深拷贝'

	0	1	2
0	-0.737352	NaN	NaN
1	0.537057	NaN	NaN
2	-0.972371	NaN	0.763062
3	0.852537	NaN	-0.845470
4	0.194360	-0.646723	-0.244883
5	0.275051	1.139541	-0.080366
6	0.544598	-1.875903	1.006292

Calling fillna with a dict, you can use a different fill value for each column

# 这是之前自己爬招聘网站的数据,做练习用
data = pd.read_csv("../job2.csv") 

# data.info()

data.head(10)

	position_name	salary	work_place	experience	degree	people_num,	publish_time	work_honey	job_info	company_addr	company_name	company_link	company_type	company_staff	company_business	collect_time
0	Python开发工程师	0.8-1万/月	北京-朝阳区	5-7年经验	大专	招若干人	03-16发布	[]	['精通Excel宏(VBA/VBS)和Python开发', '职位要求：', '5年经验，...	NaN	北京裕翔创新工程技术有限公司	https://jobs.51job.com/all/co5414592.html	民营公司	50-150人	计算机软件	2019-03-16 17:23:54.793818
1	软件开发工程师(Python/Java)(006307)(职位编号：BGI006307)	20-30万/年	深圳-盐田区	无工作经验	本科	招1人	03-16发布	['五险一金', '交通补贴', '餐饮补贴', '通讯补贴', '年终奖金', '定期体检']	['1.负责基础组件的开发工作，包括算法模块、计算引擎、流程容器、通信接口等。', '2.负...	NaN	深圳华大基因股份有限公司	https://jobs.51job.com/all/co5405940.html	民营公司	1000-5000人	检测，认证,医疗设备/器械	2019-03-16 17:23:55.300749
2	PHP/Python程序员	1-1.5万/月	深圳-南山区	1年经验	大专	招1人	03-16发布	['五险一金', '员工旅游', '年终奖金', '弹性工作']	['']	NaN	深圳齐采科技有限公司	https://jobs.51job.com/all/co3957494.html	创业公司	50-150人	互联网/电子商务	2019-03-16 17:23:56.189117
3	Python开发工程师	0.8-1.5万/月	郑州-高新区	1年经验	大专	招2人	03-16发布	['五险一金', '年终奖金', '股票期权', '通讯补贴', '餐饮补贴', '定期体检...	['1、基于Python进行需求的开发、部署、优化。', '1、拥有较强的Python编程能...	NaN	广州佰聆数据股份有限公司	https://jobs.51job.com/all/co3803005.html	民营公司	150-500人	计算机服务(系统、数据服务、维修),计算机软件	2019-03-16 17:24:10.626950
4	Python开发实习	4-6千/月	南京-雨花台区	无工作经验	大专	招6人	03-16发布	['五险一金', '补充医疗保险', '员工旅游', '定期体检', '弹性工作', '年终...	['岗位职责：', '']	NaN	西安金堆金能源科技有限公司	https://jobs.51job.com/all/co5379100.html	民营公司	50-150人	新能源	2019-03-16 17:24:10.652288
5	运维工程师（8K+提成周末双休）	0.8-1.5万/月	广州-天河区	3-4年经验	大专	招2人	03-12发布	['五险一金', '餐饮补贴', '交通补贴', '年终奖金', '绩效奖金', '带薪年假...	['1.工作时间：周一至周五，9:00-18:00或9:30-18:30', '2.社会保险...	NaN	广州创思云网络科技有限公司	https://jobs.51job.com/all/co5067556.html	创业公司	少于50人	计算机软件	2019-03-16 17:24:12.513511
6	运维主管	1-1.5万/月	上海-黄浦区	5-7年经验	本科	招1人	03-12发布	['五险一金', '绩效奖金', '弹性工作']	[]	NaN	百可录（北京）科技有限公司	https://jobs.51job.com/all/co4545403.html	创业公司	50-150人	计算机软件,金融/投资/证券	2019-03-16 17:24:13.948386
7	数据库开发工程师	1-1.5万/月	广州-番禺区	3-4年经验	本科	招1人	03-12发布	['五险一金', '年终奖金', '周末双休']	['1.参与项目需求分析,并抽象出数据层的需求', '2根据业务需求进行数据库设计(包括数据...	NaN	广州策码信息技术咨询有限公司	https://jobs.51job.com/all/co5274492.html	创业公司	50-150人	计算机软件	2019-03-16 17:24:13.963832
8	SeniorRetailSystemEngineer(Python)	1.5-2万/月	北京-朝阳区	5-7年经验	本科	招1人	03-16发布	['五险一金', '补充医疗保险', '绩效奖金', '年终奖金']	['', '']	NaN	绫致时装（天津）有限公司	https://jobs.51job.com/all/co181210.html	外资（欧美）	10000人以上	服装/纺织/皮革,批发/零售	2019-03-16 17:24:15.067126
9	11215N-Python开发工程师(职位编号：11215N)	30-40万/年	北京-朝阳区	5-7年经验	本科	招3人	03-16发布	['五险一金', '带薪年假', '节日福利', '绩效奖金', '定期体检']	[]	NaN	平安科技（深圳）有限公司	https://jobs.51job.com/all/co2155678.html	民营公司	1000-5000人	互联网/电子商务,计算机软件	2019-03-16 17:24:15.095420

"以字典方式, 缺失值的列进行对应值填充"
"col 1 中缺失值用 0.5 填充, col 2 用 0 填充"

df.fillna({1:0.5, 2:0})

'以字典方式, 缺失值的列进行对应值填充'

'col 1 中缺失值用 0.5 填充, col 2 用 0 填充'

	0	1	2
0	-0.737352	0.500000	0.000000
1	0.537057	0.500000	0.000000
2	-0.972371	0.500000	0.763062
3	0.852537	0.500000	-0.845470
4	0.194360	-0.646723	-0.244883
5	0.275051	1.139541	-0.080366
6	0.544598	-1.875903	1.006292

fillna returns a new object, but you can modify the existing object in-place
(默认是非原地的, 可以指定原地修改 inplace=True)

"inplace=True 原地修改 _ 表示不知用啥变量"

_ = df.fillna(0, inplace=True)

df

'inplace=True 原地修改 _ 表示不知用啥变量'

	0	1	2
0	-0.737352	0.000000	0.000000
1	0.537057	0.000000	0.000000
2	-0.972371	0.000000	0.763062
3	0.852537	0.000000	-0.845470
4	0.194360	-0.646723	-0.244883
5	0.275051	1.139541	-0.080366
6	0.544598	-1.875903	1.006292

The same interpolation methods available for reindexing can be used with fillna:

df = pd.DataFrame(np.random.randn(6,3))

df.iloc[2:, 1] = np.nan  # 不包括第二行
df.iloc[4:, 2] = np.nan

df

	0	1	2
0	0.166483	-1.231651	-0.812065
1	-0.849753	1.941498	-0.782602
2	-1.296790	NaN	0.838213
3	-1.033377	NaN	1.514566
4	-2.002896	NaN	NaN
5	-0.144666	NaN	NaN

"先前填充 method=ffill"

df.fillna(method='ffill') # forward

'先前填充 method=ffill'

	0	1	2
0	0.166483	-1.231651	-0.812065
1	-0.849753	1.941498	-0.782602
2	-1.296790	1.941498	0.838213
3	-1.033377	1.941498	1.514566
4	-2.002896	1.941498	1.514566
5	-0.144666	1.941498	1.514566

"这是啥??"
df.fillna(method='ffill', limit=2)

	0	1	2
0	0.166483	-1.231651	-0.812065
1	-0.849753	1.941498	-0.782602
2	-1.296790	1.941498	0.838213
3	-1.033377	1.941498	1.514566
4	-2.002896	NaN	1.514566
5	-0.144666	NaN	1.514566

With fillna you can do lots of ohter things with a little creativity. For example, you might pass the mean or median value of Series:
(处理缺失值, 其实就两个方案, 要么干掉整条数据, 要么填充, 填充用得比较多, 用ffill方法填充, 用值, 如mean, median...等填充, 花样很多, 而操作很简单)

"用平均值填充(其实中位数会更好)NA"

data = pd.Series([1, np.nan, 3.5, np.nan, 7.5])

data.fillna(data.mean())

"中位数填充, cj推荐"
data.fillna(data.mean())

'用平均值填充(其实中位数会更好)NA'

0    1.0
1    4.0
2    3.5
3    4.0
4    7.5
dtype: float64

'中位数填充, cj推荐'

0    1.0
1    4.0
2    3.5
3    4.0
4    7.5
dtype: float64

See Table 7-2 for a refrence on fillna.

参数	描述
value	标量值or字典(col:value) 填充缺失值
method	默认'ffill' 向前填充
axis	默认axis=0, 按行填充, axis=1则按列
inplace	默认是深拷贝, 非原地, 可 inplace=True
limit	不明白暂时