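Getting Started with Pandas (3)

Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission. https://blog.csdn.net/qq_1290259791/article/details/83150690

Common Pandas Features

This post introduces commonly used Pandas features.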

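1. Iteration in Pandas

The basic iteration behavior of Pandas objects depends on their type.
Iterating over a Series treats it like an array and yields its values.
Iterating over a DataFrame follows the dict-like convention and yields its keys.

  • Series: values
  • DataFrame: column labels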
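
As a small illustration of the difference, the sketch below iterates over a Series and a DataFrame side by side. The objects `s` and `frame` are made-up sample data, not part of the original post.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
frame = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

series_items = [v for v in s]       # a Series yields its values, not its index labels
frame_items = [c for c in frame]    # a DataFrame yields its column labels, not its rows

print(series_items)
print(frame_items)
```

To walk over the index labels of a Series instead, iterate over `s.index`.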

Iterating over a DataFrame

Iterating over a DataFrame directly yields its column names.

import pandas as pd
import numpy as np
N = 10
df = pd.DataFrame({
	'A': pd.date_range('2016-11-11',periods=N),
	'C': np.linspace(0,num=N,stop=N-1),
	'X': np.random.rand(N),
	'W': np.random.choice(['Low','Mid','High'],N).tolist(),
	'D': np.random.normal(100,10,size=(N)).tolist(),
	})
for item in df:
	print(item)
A
C
X
W
D

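Note: to iterate over the contents of a DataFrame, and in particular over its rows, use the following functions.

  • items() (named iteritems() before pandas 2.0): iterates over (column label, Series) pairs, that is, over columns
  • iterrows(): iterates over the rows as (index, Series) pairs
  • itertuples(): iterates over the rows as namedtuples

items() / iteritems() example

Each column label is yielded as the key, together with that column's values as a Series object.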

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,3), columns=['A','B','C'])
for key, value in df.items():  # iteritems() was removed in pandas 2.0; items() is the replacement
	print(key)
	print(value)
A
0    0.957009
1    0.501260
2    0.274135
Name: A, dtype: float64
B
0    0.078463
1    0.987697
2    0.781049
Name: B, dtype: float64
C
0    0.733517
1    0.803489
2    0.074316
Name: C, dtype: float64

iterrows() example

iterrows() returns an iterator that yields each index value together with a Series containing that row's data.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,3), columns=['A','B','C'])
for key, value in df.iterrows():
	print(key)
	print(value)
0
A    0.656146
B    0.214489
C    0.112665
Name: 0, dtype: float64
1
A    0.529889
B    0.261862
C    0.747018
Name: 1, dtype: float64
2
A    0.415430
B    0.525688
C    0.015409
Name: 2, dtype: float64
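
One consequence of each row arriving as a Series is that a row has a single dtype, so mixed columns get upcast. The sketch below shows this on a made-up two-column frame (the column names `i` and `f` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column.
df = pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5]})

for idx, row in df.iterrows():
    # Each row is a Series, so the int and float values share one common dtype.
    print(idx, row.dtype)

first_row = next(df.iterrows())[1]
print(df['i'].dtype, '->', type(first_row['i']).__name__)
```

If the original dtypes matter, itertuples() is usually the safer choice because it does not build a Series per row.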

itertuples() example

itertuples() returns a namedtuple for each row of the DataFrame.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,3), columns=['A','B','C'])
for row in df.itertuples():
	print(row)
Pandas(Index=0, A=0.02256113604091181, B=0.4702768374802535, C=0.0965308087405059)
Pandas(Index=1, A=0.8422016872537603, B=0.17994358605628646, C=0.042277440820879364)
Pandas(Index=2, A=0.7924808877865748, B=0.7236640663801537, C=0.5110374703536472)
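
Because each row is a namedtuple, its fields can be read as attributes. The following sketch, using a made-up frame with row labels `r1` and `r2`, also shows the `index` and `name` parameters of itertuples().

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['r1', 'r2'])

# Fields are available as attributes; Index holds the row label.
sums = {row.Index: row.A + row.B for row in df.itertuples()}
print(sums)

# index=False drops the Index field; name sets the namedtuple's type name.
rows = list(df.itertuples(index=False, name='Row'))
print(rows[0])
```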

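2. Sorting in Pandas

Pandas supports two kinds of sorting:

  • By label
  • By actual value

Sorting by label

The sort_index() method sorts by label. Its ascending parameter defaults to True (ascending order); the example below passes ascending=False to sort the row labels in descending order.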

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,3), index=[1,9,4,5], columns=['A','B','C'])
print(df)
df_sorted = df.sort_index(ascending=False)
print(df_sorted)
          A         B         C
1  0.132607  0.105872  0.875598
9  0.223384  0.362026  0.437898
4  0.638698  0.277726  0.453978
5  0.115070  0.709539  0.835981
          A         B         C
9  0.223384  0.362026  0.437898
5  0.115070  0.709539  0.835981
4  0.638698  0.277726  0.453978
1  0.132607  0.105872  0.875598

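Sorting by column labels

Pass axis=0 or axis=1 to choose which labels are sorted.
The default is axis=0, which sorts by the row labels; axis=1 sorts by the column labels.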

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,3), index=[1,9,4,5], columns=['A','C','B'])
print(df)
df_sorted = df.sort_index(axis=1)
print(df_sorted)
          A         C         B
1  0.988267  0.908142  0.680500
9  0.675936  0.308623  0.249646
4  0.626666  0.162618  0.735269
5  0.490554  0.177270  0.603323
          A         B         C
1  0.988267  0.680500  0.908142
9  0.675936  0.249646  0.308623
4  0.626666  0.735269  0.162618
5  0.490554  0.603323  0.177270

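Sorting by value

sort_values() is the counterpart of sort_index() that sorts by value.
It accepts a by parameter naming the DataFrame column (or list of columns) whose values determine the order.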

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,3), index=[1,9,4,5], columns=['A','C','B'])
df_sorted1 = df.sort_values(by=['C','B'])
df_sorted2 = df.sort_values(by='C')
print(df_sorted1)
print(df_sorted2)
          A         C         B
9  0.902778  0.189340  0.435696
4  0.169302  0.245135  0.647082
5  0.491392  0.372607  0.386437
1  0.294235  0.904643  0.018072
          A         C         B
9  0.902778  0.189340  0.435696
4  0.169302  0.245135  0.647082
5  0.491392  0.372607  0.386437
1  0.294235  0.904643  0.018072
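
With random floats there are no ties, so sorting by ['C', 'B'] and by 'C' alone give the same result above. The sketch below uses made-up data with repeated values in C to show how the second column breaks ties, and how `ascending` can be given per column.

```python
import pandas as pd

# Hypothetical data with repeated values in column C.
df = pd.DataFrame({'C': [2, 1, 2, 1], 'B': [0.4, 0.9, 0.1, 0.3]},
                  index=['w', 'x', 'y', 'z'])

# Ties in C are broken by B.
by_c_then_b = df.sort_values(by=['C', 'B'])

# One ascending flag per column: C descending, B ascending.
c_desc_b_asc = df.sort_values(by=['C', 'B'], ascending=[False, True])

print(by_c_then_b)
print(c_desc_b_asc)
```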

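3. Indexing and Selecting Data in Pandas

The indexing operator [] and the attribute operator . give quick access to Pandas data structures.
Pandas has offered three kinds of multi-axis indexing:

Indexers

Indexer   Description
.loc[]    Label-based
.iloc[]   Integer-position-based
.ix[]     Mixed label and integer (deprecated in pandas 0.20, removed in 1.0)

Note: these are used with square brackets; the first selector refers to rows and the second to columns.

.loc[]

.loc[] accepts several kinds of selector:

  • A single scalar label
  • A list of labels
  • A slice object
  • A boolean array

.loc[] takes two selectors (a single label, a list, or a slice) separated by a comma. The first selects rows and the second selects columns. Label slices such as 'a':'c' include both endpoints.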

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5,3), index=['a','b','c','d','e'], columns=['A','B','C'])
print(df.loc[:,'A'])
print(df.loc[:,['A','C']])
print(df.loc[['a','c','e'],['A','C']])
print(df.loc['a':'c','A':'B'])
a    0.535062
b    0.037609
c    0.190991
d    0.875407
e    0.234947
Name: A, dtype: float64
          A         C
a  0.535062  0.402936
b  0.037609  0.036611
c  0.190991  0.749456
d  0.875407  0.676398
e  0.234947  0.385565
          A         C
a  0.535062  0.402936
c  0.190991  0.749456
e  0.234947  0.385565
          A         B
a  0.535062  0.004707
b  0.037609  0.473187
c  0.190991  0.947285

.iloc[]

Purely integer-position-based indexing. It accepts:

  • An integer
  • A list of integers
  • A slice (a range of positions)

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5,3), index=['a','b','c','d','e'], columns=['A','B','C'])
print(df.iloc[:3,:2])
print(df.iloc[:2,[0,2]])
          A         B
a  0.558476  0.962624
b  0.238883  0.116831
c  0.881508  0.411235
          A         C
a  0.558476  0.717169
b  0.238883  0.830214

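.ix[]

A hybrid indexer for selecting and subsetting objects that accepted both labels and integer positions.
.ix was deprecated in pandas 0.20 and removed in 1.0, so df.ix[:4] only runs on older versions. The example therefore uses the equivalent .iloc call.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5,3), index=['a','b','c','d','e'], columns=['A','B','C'])
print(df.iloc[:4])  # was df.ix[:4]; .ix no longer exists in pandas 1.0 and later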
print(df.loc[:,'A'])
          A         B         C
a  0.031052  0.325817  0.118600
b  0.280782  0.990863  0.873839
c  0.488767  0.051455  0.073738
d  0.161729  0.546026  0.542651
a    0.031052
b    0.280782
c    0.488767
d    0.161729
e    0.409365
Name: A, dtype: float64
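
Where old code mixed positions and labels in one .ix call, one common approach on current pandas is to translate one side so that both selectors are of the same kind. This is a sketch of that idea on made-up data, not the only way to do it.

```python
import pandas as pd
import numpy as np

# Hypothetical sample data, for illustration only.
df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  index=['a', 'b', 'c', 'd', 'e'], columns=['A', 'B', 'C'])

# Goal: first four rows (by position) of column 'A' (by label).
via_loc = df.loc[df.index[:4], 'A']                # positions turned into labels
via_iloc = df.iloc[:4, df.columns.get_loc('A')]    # label turned into a position

print(via_loc)
```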

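Indexing with operators

Data can also be indexed directly with bracket notation. With [], a column label selects one column, a list of labels selects several columns, and a slice selects rows.

Object      Indexer               Returns
Series      s.loc[label]          Scalar value
DataFrame   df.loc[row, column]   Scalar value

Access with [] notation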

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5,3), index=['a','b','c','d','e'], columns=['A','B','C'])
print(df['A'])
print(df[['A','B']])
print(df[2:3])
a    0.410222
b    0.082454
c    0.862867
d    0.010191
e    0.110962
Name: A, dtype: float64
          A         B
a  0.410222  0.402827
b  0.082454  0.508531
c  0.862867  0.747506
d  0.010191  0.357649
e  0.110962  0.118784
          A         B         C
c  0.862867  0.747506  0.292915

Attribute access

Use the attribute operator . to select a column.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5,3), index=['a','b','c','d','e'], columns=['A','B','C'])
print(df.A)
a    0.203211
b    0.946038
c    0.467963
d    0.949120
e    0.528867
Name: A, dtype: float64
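
Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The column names below are made up to show both limits.

```python
import pandas as pd

# Hypothetical column names chosen to show the limits of attribute access.
df = pd.DataFrame({'count': [1, 2], 'my col': [3, 4], 'A': [5, 6]})

print(df.A.tolist())            # fine: 'A' is a valid, non-conflicting name
print(callable(df.count))       # df.count is the DataFrame method, not the column
print(df['count'].tolist())     # [] always reaches the column
print(df['my col'].tolist())    # names containing spaces need [] as well
```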

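4. Statistical Functions in Pandas

Percent change (pct_change())

Both Series and DataFrame have a pct_change() function.
It compares each element with the previous one and computes the relative change. The result is a fraction, so 1.0 means a 100% increase, and the first element is NaN because it has no predecessor.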

import pandas as pd
import numpy as np
s = pd.Series(np.arange(1,6))
print(s.pct_change())
df = pd.DataFrame(np.random.rand(5,2))
print(df.pct_change())
0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
dtype: float64
          0         1
0       NaN       NaN
1 -0.014001 -0.912531
2  0.268638  8.857442
3 -0.527441  0.511108
4  0.280720 -0.566934
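
To make the calculation concrete, the sketch below compares pct_change() with a hand-written version built on shift(). The numbers are made up, and the two results agree only up to floating-point rounding.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
s = pd.Series([100.0, 110.0, 99.0])

manual = (s - s.shift(1)) / s.shift(1)   # (current - previous) / previous
builtin = s.pct_change()

print(builtin)
print((builtin * 100).round(1).tolist())  # the same changes expressed as percentages
```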

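Correlation (corr())

Correlation measures the linear relationship between any two numeric Series.
Older pandas versions excluded non-numeric DataFrame columns automatically; since pandas 2.0, DataFrame.corr() needs numeric_only=True (or a frame containing only numeric columns) for that.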

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5,2))
print(df[0].corr(df[1]))
print(df.corr())
0.6188159891824275
          0         1
0  1.000000  0.618816
1  0.618816  1.000000
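
One version-independent way to handle a frame with non-numeric columns is to select the numeric columns first. The frame below, including the `label` column, is made up for illustration.

```python
import pandas as pd

# Hypothetical frame mixing numeric and string columns.
df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [2, 4, 6, 8],
                   'label': ['a', 'b', 'c', 'd']})

numeric = df.select_dtypes(include='number')  # keep the numeric columns only
corr = numeric.corr()                         # Pearson correlation by default
print(corr)
```

Since y is exactly twice x here, the off-diagonal entries come out as 1 up to rounding.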

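Ranking (rank())

rank() produces a rank for each element of the array, starting at 1 for the smallest value. Tied values receive the average of their ranks by default.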

import pandas as pd
import numpy as np
s = pd.Series([1,6,3,2])
print(s.rank())
0    1.0
1    4.0
2    3.0
3    2.0
dtype: float64
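
The example above has no ties. The following sketch, on made-up data with a repeated value, shows how ties are ranked and how the `method` and `ascending` parameters change the result.

```python
import pandas as pd

# Hypothetical data with a tie (the value 3 appears twice).
s = pd.Series([1, 6, 3, 3])

avg_rank = s.rank()                    # default method='average'
min_rank = s.rank(method='min')        # tied values share the lowest rank
desc_rank = s.rank(ascending=False)    # the largest value gets rank 1

print(avg_rank.tolist(), min_rank.tolist(), desc_rank.tolist())
```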

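5. Handling Missing Data

Pandas represents missing values as NA or NaN.

Missing data

Reindexing is used here to create a DataFrame with missing values: the labels 'b' and 'd' do not exist in the original index, so their rows are filled with NaN.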

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12.).reshape(4,3),
index = list('acef'),
columns = ['one', 'two', 'three'])
df = df.reindex(list('abcd'))
print(df)
   one  two  three
a  0.0  1.0    2.0
b  NaN  NaN    NaN
c  3.0  4.0    5.0
d  NaN  NaN    NaN

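Checking for missing values

To detect missing values, Pandas provides the isnull() and notnull() functions.
They are also available as methods on Series and DataFrame objects.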

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12.).reshape(4,3),
index = list('acef'),
columns = ['one', 'two', 'three'])
df = df.reindex(list('abcd'))
print(df['one'].isnull())
print(df['one'].notnull())
a    False
b     True
c    False
d     True
Name: one, dtype: bool
a     True
b    False
c     True
d    False
Name: one, dtype: bool
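
Because True counts as 1 when summed, the boolean result of isnull() can be used to count missing values per column. A minimal sketch on made-up data:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'one': [0.0, np.nan, 3.0, np.nan],
                   'two': [1.0, 2.0, np.nan, 4.0]})

missing_per_column = df.isnull().sum()   # number of NaN values in each column
any_missing = df.isnull().any().any()    # is anything missing at all?
print(missing_per_column)
print(any_missing)
```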

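Calculations with missing data

When summing, NA values are treated as 0.
If every value is NA, the sum is also 0 in pandas 0.22 and later (older versions returned NaN); pass min_count=1 to get NaN instead.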

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12.).reshape(4,3),
index = list('acef'),
columns = ['one', 'two', 'three'])
df = df.reindex(list('abcd'))
print(df['one'].sum())
3.0
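
The all-NA case is easy to check directly. This sketch assumes pandas 0.22 or later, where sum() accepts the `min_count` parameter; the Series are made up.

```python
import pandas as pd
import numpy as np

# Hypothetical sample data, for illustration only.
all_na = pd.Series([np.nan, np.nan])
some_na = pd.Series([1.0, np.nan, 2.0])

total_some = some_na.sum()               # NaN is skipped
total_all = all_na.sum()                 # default min_count=0
total_strict = all_na.sum(min_count=1)   # require at least one valid value

print(total_some, total_all, total_strict)
```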

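Cleaning / filling missing data

The fillna() function offers several ways to fill NA values with non-null data.

Replacing NaN with a scalar value

Replace NaN with 0: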

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12.).reshape(4,3),
index = list('acef'),
columns = ['one', 'two', 'three'])
df = df.reindex(list('abcd'))
print(df)
print(df.fillna(0))
   one  two  three
a  0.0  1.0    2.0
b  NaN  NaN    NaN
c  3.0  4.0    5.0
d  NaN  NaN    NaN
   one  two  three
a  0.0  1.0    2.0
b  0.0  0.0    0.0
c  3.0  4.0    5.0
d  0.0  0.0    0.0

Filling NaN with the previous / next value

Fill each NaN with the valid value before it (forward fill) or after it (backward fill).

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12.).reshape(4,3),
index = list('acef'),
columns = ['one', 'two', 'three'])
df = df.reindex(list('abcd'))
print(df)
print(df.ffill())  # previous value; replaces fillna(method='pad'), deprecated since pandas 2.1
print(df.bfill())  # next value; replaces fillna(method='bfill')
   one  two  three
a  0.0  1.0    2.0
b  NaN  NaN    NaN
c  3.0  4.0    5.0
d  NaN  NaN    NaN
   one  two  three
a  0.0  1.0    2.0
b  0.0  1.0    2.0
c  3.0  4.0    5.0
d  3.0  4.0    5.0
   one  two  three
a  0.0  1.0    2.0
b  3.0  4.0    5.0
c  3.0  4.0    5.0
d  NaN  NaN    NaN
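
When a gap spans several rows, it may be undesirable to carry one value across all of them. As a sketch, assuming the `limit` parameter of ffill() fits the use case, the number of consecutive NaN values filled can be capped:

```python
import pandas as pd
import numpy as np

# Hypothetical Series with a two-element gap.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_all = s.ffill()          # every NaN takes the last valid value
filled_one = s.ffill(limit=1)   # fill at most one consecutive NaN per gap

print(filled_all.tolist(), filled_one.tolist())
```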

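Dropping missing values

To exclude missing values, use the dropna() function together with its axis parameter.
The default is axis=0, which drops every row that contains an NA; axis=1 drops every column that contains an NA.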

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12.).reshape(4,3),
index = list('acef'),
columns = ['one', 'two', 'three'])
df = df.reindex(list('abcd'))
print(df.dropna())
print(df.dropna(axis=1))
   one  two  three
a  0.0  1.0    2.0
c  3.0  4.0    5.0
Empty DataFrame
Columns: []
Index: [a, b, c, d]
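
Dropping on any NA can be too aggressive, as the empty result for axis=1 shows. The sketch below, on made-up data, uses the `how` and `subset` parameters of dropna() to be more selective.

```python
import pandas as pd
import numpy as np

# Hypothetical sample data: row 'b' is entirely missing, 'c' and 'd' partly.
df = pd.DataFrame({'one': [0.0, np.nan, 3.0, np.nan],
                   'two': [1.0, np.nan, np.nan, 7.0]},
                  index=list('abcd'))

only_empty_rows = df.dropna(how='all')   # drop rows where every value is NA
by_column = df.dropna(subset=['one'])    # only consider NA values in column 'one'

print(only_empty_rows)
print(by_column)
```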

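Replacing missing / generic values

The replace() method substitutes arbitrary values with others; here a dict maps each old value to its new value. Replacing NA with a scalar this way is equivalent to calling fillna().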

import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print(df.replace({1000:10,2000:60}))
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
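
The equivalence with fillna() mentioned above can be checked on a small made-up frame:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'one': [10.0, np.nan, 30.0],
                   'two': [np.nan, 0.0, 60.0]})

via_replace = df.replace(np.nan, 0)   # treat NaN as just another value to replace
via_fillna = df.fillna(0)             # dedicated missing-data method

print(via_replace.equals(via_fillna))
```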
