pandas Data Cleaning and Computation (Part 2)

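DataFrame.duplicated(subset=None, keep='first') checks each row for duplication and returns a boolean Series:
subset: a column label or list of labels, default None. Restricts the comparison to the given columns; by default all columns are compared.
keep: {'first', 'last', False}, default 'first'
keep='first': within each group of duplicate rows, the first occurrence is not marked as a duplicate (False)
keep='last': within each group of duplicate rows, the last occurrence is not marked as a duplicate (False)
keep=False: every row that has a duplicate is marked as a duplicate (True)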

import pandas as pd
from pandas import DataFrame
import numpy as np
data=DataFrame({'k1':["one"]*3+["two"]*4,'k2':[1,1,2,3,3,4,4]})
print(data.duplicated(keep='first'))
# Output:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
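
The subset and keep=False options described above are not exercised by the example, so here is a minimal sketch of both on the same data. The variable names dup_k1 and dup_all are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Compare only column k1: every row after the first 'one' or 'two' is flagged.
dup_k1 = data.duplicated(subset='k1')

# keep=False flags every member of a duplicated group, including the first one.
dup_all = data.duplicated(keep=False)

print(dup_k1.tolist())   # [False, True, True, False, True, True, True]
print(dup_all.tolist())  # [True, True, False, True, True, True, True]
```

Only row 2 (one, 2) has no twin, which is why it is the single False under keep=False.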

DataFrame.drop_duplicates(keep='first') removes duplicate rows
keep: {'first', 'last', False}, default 'first'. By default the first occurrence of each duplicated row is kept.

print(data.drop_duplicates())
# Output:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
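
To see how the other keep values and the subset argument change which rows survive, the following sketch reuses the same data. The result names (last_kept, by_k1, no_repeats) are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Keep the last occurrence of each duplicated row instead of the first.
last_kept = data.drop_duplicates(keep='last')

# Treat rows as duplicates when k1 alone matches.
by_k1 = data.drop_duplicates(subset=['k1'])

# Drop every row that has a duplicate, keeping only rows that are unique.
no_repeats = data.drop_duplicates(keep=False)

print(last_kept.index.tolist())   # [1, 2, 4, 6]
print(by_k1.index.tolist())       # [0, 3]
print(no_repeats.index.tolist())  # [2]
```

The original index labels are preserved in every case; call reset_index(drop=True) afterwards if a fresh 0..n-1 index is wanted.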

Series.map(func): mapping function that applies func to each element of a Series

frame = pd.DataFrame({'item':['ball', 'mug', 'pen', 'pencil', 'ashtray'],'price':[1, 2, 3, 4, 5]})
print(frame)
frame.price = frame.price.map(lambda x: x + 1)
print('new_frame',frame)
# Output:
      item  price
0     ball      1
1      mug      2
2      pen      3
3   pencil      4
4  ashtray      5
new_frame       item  price
0     ball      2
1      mug      3
2      pen      4
3   pencil      5
4  ashtray      6
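
Series.map also accepts a dict (or another Series) in place of a function, which is handy for lookups. The sketch below assumes a hypothetical category table that is not part of the original example; items missing from the table come back as NaN.

```python
import pandas as pd

frame = pd.DataFrame({'item': ['ball', 'mug', 'pen', 'pencil', 'ashtray'],
                      'price': [1, 2, 3, 4, 5]})

# Hypothetical lookup table; 'ashtray' is deliberately left out.
category = {'ball': 'toy', 'mug': 'kitchen',
            'pen': 'stationery', 'pencil': 'stationery'}

# Each item is looked up in the dict; keys that are absent map to NaN.
frame['category'] = frame['item'].map(category)
print(frame)
```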

DataFrame.apply(func, axis=0)
Applies a function along an axis of the DataFrame:
axis=0 (the default) passes each column to func as a Series; axis=1 passes each row

np.random.seed(7738744)
data=np.random.randint(0,20,20).reshape(5,4)
df=DataFrame(data,columns=list('mnpq'),index=list('abcde'))
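def multi(x):
    # Square values greater than 3; anything else returns None and shows up as missing
    if x > 3:
        return x**2

# apply hands multi a whole column (a Series), and `if x > 3` on a Series raises
# ValueError, so map the scalar function over each column instead
print(df.apply(lambda col: col.map(multi)))
# Output: df with every value greater than 3 squared and the remaining values missing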
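
The difference between the two axis values is easiest to see with functions that reduce a Series to a single number. This is a small sketch on a made-up three-row frame rather than the random data above.

```python
import pandas as pd

df_small = pd.DataFrame({'m': [1, 2, 3], 'n': [10, 20, 30]},
                        index=list('abc'))

# axis=0 (default): the function receives each column, so the result
# has one entry per column.
col_range = df_small.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row, so the result has one entry per row.
row_sum = df_small.apply(lambda row: row.sum(), axis=1)

print(col_range.to_dict())  # {'m': 2, 'n': 20}
print(row_sum.to_dict())    # {'a': 11, 'b': 22, 'c': 33}
```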

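DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
to_replace: str, regex, list, dict, Series, int, float, or None. The values to be replaced.
to_replace=a string, number, or regular expression: every matching value is replaced with value
to_replace=a list of strings, numbers, or regular expressions: every value matching any item in the list is replaced with value
to_replace=a dict: each key is a value to replace and the corresponding dict value is its replacement

value: scalar, dict, list, str, regex, default None. The replacement value.
inplace: bool, default False. True modifies the original DataFrame; False returns a modified copy.
limit: int, default None. Maximum size of gap to fill forward or backward. Recent pandas releases deprecate limit and method.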

import pandas as pd
from pandas import DataFrame
import numpy as np
np.random.seed(7738744)
data=np.random.randint(0,20,20).reshape(5,4)
df=DataFrame(data,index=list('abcde'))
replaced1=df.replace({11:'NaN'})  # replaces 11 with the string 'NaN', not np.nan
replaced2=df.replace([10,11,12],np.nan)
replaced3=df.replace({0:13,3:17},100)

print(replaced1)
print(replaced2)
print(replaced3)
# Output:
# Dict mapping: each key is replaced by its value (here 11 becomes the string 'NaN')
    0   1    2   3
a  13  19    5   8
b  18   3   12  13
c  19  10  NaN   7
d  12  12   10  17
e   4  13  NaN  10
# Every 10, 11 and 12 in df is replaced with NaN
      0     1    2     3
a  13.0  19.0  5.0   8.0
b  18.0   3.0  NaN  13.0
c  19.0   NaN  NaN   7.0
d   NaN   NaN  NaN  17.0
e   4.0  13.0  NaN   NaN
# 13 in column 0 and 17 in column 3 are replaced with 100
     0   1   2    3
a  100  19   5    8
b   18   3  12   13
c   19  10  11    7
d   12  12  10  100
e    4  13  11   10
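
Two forms listed in the parameter description but not shown above are the nested dict (column, then old value, then new value) and regular-expression replacement. The sketch below uses a small made-up frame; the column names city and score and the sentinel -1 are assumptions for illustration only.

```python
import pandas as pd

raw = pd.DataFrame({'city': ['NYC ', 'nyc', 'Boston'],
                    'score': [-1, 5, -1]})

# Nested dict: in column 'score', replace the sentinel -1 with 0.
cleaned = raw.replace({'score': {-1: 0}})

# regex=True: the string given for 'city' is treated as a regular expression,
# and the matching dict in value supplies the replacement for that column.
cleaned = cleaned.replace({'city': r'^(?:NYC|nyc)\s*$'},
                          {'city': 'New York'}, regex=True)

print(cleaned)
```

Because inplace defaults to False, raw is left untouched and each call returns a new DataFrame.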

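pandas.cut(): discretization and binning
Syntax: pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
x: the one-dimensional array to bin
bins:
  1. an integer: the number of equal-width bins to split the range of x into
  2. a sequence: the bin edges; values of x that fall outside the edges become NaN
right: whether each bin includes its right edge
include_lowest: whether the first bin includes its left edge
labels: labels to use for the returned bins in place of interval notation
precision: the precision at which the bin labels are stored and displayed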

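array=np.random.randint(13,40,10)
cut=pd.cut(array,[13,18,25,35],labels=['minor','young adult','middle-aged'])
print(cut)
print(cut.describe())

# Output:
# array is split into three intervals, each labeled as an age group

[young adult, NaN, minor, NaN, young adult, NaN, young adult, middle-aged, middle-aged, minor]
Categories (3, object): [minor < young adult < middle-aged]

# describe() reports the count and the share of each category
             counts  freqs
categories
minor             2    0.2
young adult       3    0.3
middle-aged       2    0.2
NaN               3    0.3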
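
The NaN entries above come from values that fall on the excluded left edge (13) or beyond the last edge (35). The following sketch, on a fixed array chosen for illustration, shows how right, include_lowest, and an integer bins argument change the assignment. It inspects the integer codes of the resulting Categorical, where -1 stands for NaN.

```python
import numpy as np
import pandas as pd

ages = np.array([13, 17, 18, 24, 25, 30, 35, 39])

# right=False makes bins closed on the left: [13,18), [18,25), [25,35), [35,40)
left_closed = pd.cut(ages, [13, 18, 25, 35, 40], right=False)
print(list(left_closed.codes))   # [0, 0, 1, 1, 2, 2, 3, 3]

# include_lowest=True lets 13 fall into the first bin; 39 is still outside.
with_lowest = pd.cut(ages, [13, 18, 25, 35], include_lowest=True)
print(list(with_lowest.codes))   # [0, 0, 0, 1, 1, 2, 2, -1]

# An integer asks for that many equal-width bins; retbins=True also
# returns the computed edges.
halves, edges = pd.cut(ages, 2, retbins=True)
print(list(halves.codes))        # [0, 0, 0, 0, 0, 1, 1, 1]
print(len(edges))                # 3
```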

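DataFrame.sample(): random sampling
Syntax: DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
n: int, the number of rows to draw
frac: float, typically between 0 and 1, the fraction of rows to draw (for example 0.4 for 40%); cannot be combined with n
replace: bool, default False. Whether to sample with replacement, which allows the same row to be drawn more than once
axis: whether to sample rows or columns; axis=0 samples rows, axis=1 samples columns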

array=np.random.randint(0,100,100).reshape(20,5)
df=DataFrame(array)
sample1=df.sample(frac=0.4)
sample2=df.sample(7)


print(sample1)
print(sample2)
# Output:
     0   1   2   3   4
19  59  97  46  83  21
11  52  19  89  40  43
6    0  40  34   5  60
12  48  74  23  23   2
14  24  11  72   3   2
2   59  36  17  55  90
0   66  24  96  99  90
10  56  36  60  45   2

     0   1   2   3   4
15  38   5  24  16  78
7   46   0  79   5  37
6    0  40  34   5  60
9    0  29  87  86  35
2   59  36  17  55  90
13  92  61  96  15  85
3   42  46  68  54  95
# The samples are drawn at random, so the index labels come out in shuffled order
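
Since every call draws a different sample, it can help to fix random_state when results need to be repeatable. The sketch below also illustrates replace=True and axis=1; the frame and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(100).reshape(20, 5))

# The same random_state gives the same sample on every call.
first = table.sample(n=5, random_state=0)
second = table.sample(n=5, random_state=0)
print(first.equals(second))  # True

# Drawing more rows than the frame holds is only possible with replacement,
# so some index labels must repeat.
boot = table.sample(n=30, replace=True, random_state=0)
print(len(boot), boot.index.duplicated().any())

# axis=1 samples columns instead of rows.
two_cols = table.sample(n=2, axis=1, random_state=0)
print(two_cols.shape)  # (20, 2)
```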

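pd.get_dummies(): computing indicator (dummy) variables
Use case: when a column contains a small set of distinct values that recur often (a categorical variable), it can be expanded into one indicator column per value:
Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

Parameters:
data: array-like, Series, or DataFrame
The input data
prefix: string, list of strings, or dict of strings, default None. The prefix for the column names produced by get_dummies
columns: list-like, default None
The names of the columns to encode; by default every column with object, string, or category dtype is encoded
dummy_na: bool, default False. Whether to add a column that indicates NaN; when False, NaN values are ignored
drop_first: bool, default False. Whether to drop the first level, yielding k-1 dummy columns for k categories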

df=DataFrame({"key":['b','b','a','c','a','b'],"data":range(6)})
print(pd.get_dummies(df))
# Output (pandas 2.0 and later show True/False in the dummy columns unless dtype=int is passed):
   data  key_a  key_b  key_c
0     0      0      1      0
1     1      0      1      0
2     2      1      0      0
3     3      0      0      1
4     4      1      0      0
5     5      0      1      0
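
The remaining parameters from the list above can be tried on the same frame. This is a sketch only: the prefix 'k' is an arbitrary choice, and dtype=int is passed so the columns hold 0/1 regardless of the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data': range(6)})

# Encode only 'key', use a custom prefix, and drop the first level ('a'),
# leaving k-1 = 2 dummy columns for the 3 categories.
dummies = pd.get_dummies(df, columns=['key'], prefix='k',
                         drop_first=True, dtype=int)
print(dummies.columns.tolist())  # ['data', 'k_b', 'k_c']

# dummy_na=True adds one more column that marks missing values.
with_na = pd.get_dummies(pd.Series(['a', None, 'b']), dummy_na=True, dtype=int)
print(with_na.shape)  # (3, 3)
```

With drop_first=True a row of all zeros stands for the dropped level, which avoids perfectly collinear columns in a regression.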
