Python learning | data cleaning and preparation

1 Handling missing data

pandas follows the convention of the R language and refers to missing values as NA, meaning "not available". For floating-point data, missing values are represented by NaN (Not a Number).
1.1 Filtering out missing data

1) dropna method

By default, the dropna method discards any row that contains a missing value. To discard only the rows (or, with axis=1, the columns) in which every value is NA, pass how='all'.

Pass the thresh parameter to keep only the rows that contain at least that many non-NaN values.
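
The dropna options described above can be sketched as follows. The sample DataFrame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0, np.nan],
    "b": [2.0, np.nan, 5.0, 6.0, 8.0],
    "c": [3.0, np.nan, np.nan, 7.0, 9.0],
})

# Default: drop every row that has at least one NaN.
any_na_dropped = df.dropna()

# how="all": drop only rows in which every value is NaN.
all_na_dropped = df.dropna(how="all")

# axis=1 applies the same rule to columns instead of rows.
cols_dropped = df.dropna(axis=1, how="all")

# thresh=2: keep rows that have at least 2 non-NaN values.
at_least_two = df.dropna(thresh=2)

print(any_na_dropped, all_na_dropped, at_least_two, sep="\n\n")
```

All four calls return new objects and leave df unchanged.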

2) pandas.notnull with boolean indexing
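
A minimal sketch of this second approach, assuming a small made-up Series: pandas.notnull builds a boolean mask, and indexing with that mask keeps only the non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

mask = pd.notnull(s)   # True where a value is present
filtered = s[mask]     # boolean indexing keeps the True positions

print(filtered)
```

For a Series this gives the same result as s.dropna().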

1.2 Filling in missing data
fillna method

NOTE: the method parameter accepts two options: ffill fills each missing value with the last valid value before it, and bfill fills each missing value with the next valid value after it.
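
The following sketch shows a constant fill next to forward and backward filling on a made-up Series. It uses the ffill and bfill methods rather than fillna(method=...), because recent pandas releases deprecate the method argument of fillna; the fill direction is the same as described in the note.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

filled_const = s.fillna(0)   # replace every NaN with a constant
filled_forward = s.ffill()   # carry the previous valid value forward
filled_backward = s.bfill()  # pull the next valid value backward

print(filled_forward.tolist())
print(filled_backward.tolist())
```

The last element stays NaN under bfill because no later value exists to fill it with.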

2 Data Conversion

2.1 Remove duplicate data

The duplicated method returns a boolean Series indicating whether each row is a duplicate of an earlier row.

The drop_duplicates method removes duplicate rows.
By default both methods consider all of the columns; you can also specify a subset of columns to check for duplicates. By default the first occurrence of each value combination is kept; pass keep='last' to keep the last one instead.
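
As a small illustration of both methods, assuming an invented two-column DataFrame:

```python
import pandas as pd

# Hypothetical sample data with repeated rows.
df = pd.DataFrame({
    "k1": ["one", "two", "one", "two", "one"],
    "k2": [1, 1, 1, 2, 1],
})

# True for each row that repeats an earlier row (all columns compared).
is_dup = df.duplicated()

# Remove the repeated rows, keeping the first occurrence.
deduped = df.drop_duplicates()

# Compare only column k1 and keep the last occurrence of each value.
by_k1_last = df.drop_duplicates(subset=["k1"], keep="last")

print(is_dup.tolist())
print(deduped)
print(by_k1_last)
```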

2.2 Transforming data using a function or mapping
The map method of a Series accepts a function or a dict-like object containing a mapping. Use the str.lower method first to convert each value to lowercase so that the keys match.

You can also pass a single function that does all of this work.
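
Both variants might look like the sketch below. The food column and the meat_to_animal dictionary are illustrative assumptions, not data from the original post.

```python
import pandas as pd

# Hypothetical sample data with inconsistent capitalization.
df = pd.DataFrame({"food": ["Bacon", "pastrami", "Honey Ham"]})

# Hypothetical mapping whose keys are all lowercase.
meat_to_animal = {"bacon": "pig", "pastrami": "cow", "honey ham": "pig"}

# Variant 1: lowercase the values first, then map through the dict.
df["animal"] = df["food"].str.lower().map(meat_to_animal)

# Variant 2: one function that lowercases and looks up in a single step.
df["animal_fn"] = df["food"].map(lambda x: meat_to_animal[x.lower()])

print(df)
```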

2.3 Replacing values

The replace method substitutes one value for another.

You can replace several values at once by passing a list.

The replacements can also be passed as a dictionary.
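
The three calling styles can be compared side by side. In this sketch, -999 and -1000 are assumed to be sentinel values marking missing data; that convention is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data where -999 and -1000 stand in for missing values.
s = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

single = s.replace(-999, np.nan)                      # one value
many_to_one = s.replace([-999, -1000], np.nan)        # several values, one replacement
many_to_many = s.replace([-999, -1000], [np.nan, 0])  # a replacement per value
with_dict = s.replace({-999: np.nan, -1000: 0})       # same thing as a dict

print(many_to_many.tolist())
```

replace returns a new Series here; s itself is not modified.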

2.4 Renaming axis indexes
Like a Series, an axis index has a map method. You can define a function, map it over the index, and assign the result back to index to modify the DataFrame in place.

If you want to create a transformed copy of the data rather than modify the original, use rename.

rename can be combined with a dict-like object to update only a subset of the axis labels.
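
A short sketch of the three approaches, using an invented DataFrame; the labels and replacement names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=["Ohio", "Colorado", "New York"],
    columns=["one", "two", "three", "four"],
)

# rename returns a modified copy; functions are applied to every label.
renamed = df.rename(index=str.title, columns=str.upper)

# A dict updates only the labels it mentions.
partial = df.rename(index={"Ohio": "INDIANA"}, columns={"three": "peekaboo"})

# Assigning to df.index changes the original DataFrame in place.
df.index = df.index.map(lambda label: label[:4].upper())

print(renamed, partial, df, sep="\n\n")
```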

2.5 Discretization and binning
Continuous data is often discretized or otherwise split into "bins".
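
One common way to do this in pandas is with the cut and qcut functions. The ages, bin edges, and labels below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages and bin edges.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# cut assigns each value to an interval: (18, 25], (25, 35], (35, 60], (60, 100].
cats = pd.cut(ages, bins)
counts = pd.Series(cats).value_counts()

# The intervals can be given readable names.
labeled = pd.cut(ages, bins, labels=["Youth", "YoungAdult", "MiddleAged", "Senior"])

# qcut bins by sample quantiles instead of fixed edges (here: quartiles).
quartiles = pd.qcut(ages, 4)

print(cats.codes)
print(counts)
```

cut uses fixed edges, so bin sizes can be very uneven; qcut chooses the edges so that the bins hold roughly equal numbers of values.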

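2.6 Detecting and filtering outliers
Find the values in a column whose absolute value exceeds 3.

Values can then be set based on these conditions, for example to cap them to the interval -3 to 3.

Note: np.sign(data) produces 1 or -1 according to the sign of each value (and 0 for zero).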
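
These steps might be written as in the sketch below, which uses a tiny hand-made DataFrame so the outliers are easy to spot; the column names x and y are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few values outside [-3, 3].
data = pd.DataFrame({
    "x": [0.5, -4.2, 3.7, 1.0],
    "y": [2.9, 0.1, -3.1, -0.4],
})

# Values in one column whose absolute value exceeds 3.
col = data["x"]
big_in_col = col[col.abs() > 3]

# Rows in which any column holds such a value.
rows_with_outlier = data[(data.abs() > 3).any(axis=1)]

# Cap the values: np.sign gives the sign, so outliers become -3 or 3.
capped = data.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

print(big_in_col, rows_with_outlier, capped, sep="\n\n")
```

The capping is done on a copy so that the original values stay available for comparison.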

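2.7 Permutation and random sampling

The numpy.random.permutation function makes it easy to permute (randomly reorder) a Series or the rows of a DataFrame.

To select a random subset without replacement, use the sample method on a Series or DataFrame.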
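
A minimal sketch of both operations on an invented DataFrame. The random_state values are only there to make the sampling repeatable; the last line additionally shows sampling with replacement, which the text above does not cover.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 5 rows, 4 columns.
df = pd.DataFrame(np.arange(20).reshape(5, 4))

# A random ordering of the row positions 0..4.
sampler = np.random.permutation(len(df))

# take reorders the rows according to that permutation.
shuffled = df.take(sampler)

# Three distinct rows chosen at random (no replacement).
subset = df.sample(n=3, random_state=0)

# With replace=True, more draws than rows are allowed.
draws = pd.Series([5, 7, -1, 6, 4]).sample(n=10, replace=True, random_state=0)

print(shuffled, subset, draws.tolist(), sep="\n\n")
```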

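2.8 Computing indicator/dummy variables

Another transformation commonly used in statistical modeling and machine learning is converting a categorical variable into a "dummy" or "indicator" matrix.
If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing only 1s and 0s. pandas provides the get_dummies function for this.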
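
As an illustration, assuming a made-up key column with three distinct values. dtype=int is passed explicitly because newer pandas versions return boolean columns by default rather than 1s and 0s.

```python
import pandas as pd

# Hypothetical sample data: "key" has k = 3 distinct values.
df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], "data1": range(6)})

# One indicator column per distinct value.
dummies = pd.get_dummies(df["key"], dtype=int)

# prefix adds a common prefix to the new column names.
with_prefix = pd.get_dummies(df["key"], prefix="key", dtype=int)

# The indicator columns can be joined back to the other data.
joined = df[["data1"]].join(with_prefix)

print(dummies)
print(joined)
```

Each row of dummies contains exactly one 1, in the column matching that row's key.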

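3 String manipulation

A comma-separated string can be broken into pieces with split.

split is often combined with strip to trim whitespace (including line breaks).

Passing a list or tuple to the join method of the string "::" concatenates the substrings with a double colon as the delimiter.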
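
The three string methods above fit together as in this short sketch; the sample string is invented.

```python
# Hypothetical comma-separated input with stray whitespace.
val = "a,b,  guido"

# split alone keeps the surrounding whitespace.
pieces_raw = val.split(",")

# Combining split with strip trims each piece.
pieces = [x.strip() for x in val.split(",")]

# join on "::" glues the pieces back together with a double colon.
joined = "::".join(pieces)

print(pieces_raw)
print(pieces)
print(joined)
```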

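3.1 Regular expressions

Regular expressions, often called regex, provide a flexible way to search for or match (often more complex) string patterns in text. Python's built-in re module is responsible for applying regular expressions to strings.
For example, a string can be split on a variable number of whitespace characters (tabs, spaces, newlines, and so on).

If you intend to apply the same regular expression to many strings, create a regex object with re.compile; doing so saves CPU cycles.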
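
A brief sketch of both usages, with an invented input string. The pattern \s+ matches one or more whitespace characters.

```python
import re

# Hypothetical text with mixed runs of spaces and tabs.
text = "foo    bar\t baz  \tqux"

# One-off call: the pattern is compiled on the fly, then used to split.
parts = re.split(r"\s+", text)

# Compile once and reuse the regex object for many strings.
regex = re.compile(r"\s+")
parts_compiled = regex.split(text)

# findall returns every substring that matched the pattern.
separators = regex.findall(text)

print(parts)
print(separators)
```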

3.2 Vectorized string functions in pandas
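
The original notes give no details for this section. As a rough sketch of what vectorized string functions look like, the str accessor of a Series applies a string method to every element and skips missing values. The names and e-mail addresses below are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing a missing value.
data = pd.Series({
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Rob": "rob@gmail.com",
    "Wes": np.nan,
})

# Element-wise substring test; na=False treats missing values as no match.
has_gmail = data.str.contains("gmail", na=False)

# Element-wise upper-casing; the missing value stays missing.
upper = data.str.upper()

# Split each address on "@" and take the second piece.
domains = data.str.split("@").str.get(1)

print(has_gmail)
print(domains)
```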


Origin blog.csdn.net/LivLu24/article/details/94431912