Learning Python: pandas, a powerful tool for data analysis (Part 1)

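The usefulness of pandas needs little introduction. Books such as "Python for Data Analysis" (《利用python进行数据分析》) and 《python科学计算与数据分析》 devote a large share of their pages to its features. This post is a short summary based on what I have learned and on my experience using it at work.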

1. Data structures

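pandas has two commonly used data structures: the one-dimensional Series and the two-dimensional DataFrame. A Series consists of one index and one set of values, and all of its values share a single dtype. A DataFrame has two indexes (a row index and a column index) and several columns of data, and each column can hold a different type, such as strings or numbers. This is the familiar table layout, which is why the DataFrame is the structure we know best and use most often.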
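
The dtype difference described above can be checked directly. This is a minimal sketch; the names `s` and `df` and the sample values are made up for illustration, and the exact integer width printed depends on the platform.

```python
import pandas as pd

# A Series has one dtype shared by all of its values.
s = pd.Series([1, 2, 3], index=["x", "y", "z"])
print(s.dtype)

# A DataFrame can hold a different dtype in each column.
df = pd.DataFrame({"name": ["ann", "bob"], "score": [90.5, 72.0]})
print(df.dtypes)
```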

1.1 Series

# Build a Series
import pandas as pd
import numpy as np
s1 = pd.Series(np.arange(5),index = ["a","c","e","d","b"])
print(s1)
print("Index of s1: {}".format(s1.index))
a    0
c    1
e    2
d    3
b    4
dtype: int32
Index of s1: Index(['a', 'c', 'e', 'd', 'b'], dtype='object')

# reindex reorders the index; a label that is not found gets the missing value NaN
s1.reindex(["a","b","c","d","e","f"])
a    0.0
b    4.0
c    1.0
d    3.0
e    2.0
f    NaN
dtype: float64

# If NaN is not what you want to see, set a replacement with fill_value
s1.reindex(["a","b","c","d","e","f"],fill_value = 0)
a    0
b    4
c    1
d    3
e    2
f    0
dtype: int32

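# Selecting data
# s1[1] also works on older pandas, but integer keys on a label-indexed Series are deprecated; iloc is explicit
print("Positional index:\n{}".format(s1.iloc[1]))
print("Slice:\n{}".format(s1[:3]))
print("Boolean index:\n{}".format(s1[s1>0]))
print("Label index:\n{}".format(s1["c"]))
Positional index:
1
Slice:
a    0
c    1
e    2
dtype: int32
Boolean index:
c    1
e    2
d    3
b    4
dtype: int32
Label index:
1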

# One of the most important features of a Series is that arithmetic automatically aligns data by index label
s1 = pd.Series(np.arange(5),index = ["a","c","e","d","b"])
s2 = pd.Series(np.arange(2,5),index = ["a","c","e"])
s1 + s2
a    2.0
b    NaN
c    4.0
d    NaN
e    6.0
dtype: float64
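
When the NaN values produced by alignment are not wanted, the `add` method with `fill_value` is one option. The sketch below reuses the same `s1` and `s2`; the name `total` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5), index=["a", "c", "e", "d", "b"])
s2 = pd.Series(np.arange(2, 5), index=["a", "c", "e"])

# A label missing from one side is treated as 0 instead of producing NaN.
total = s1.add(s2, fill_value=0)
print(total)
```

With `fill_value=0`, the labels "b" and "d" keep their values from `s1` rather than becoming NaN.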

1.2 DataFrame

# Build a DataFrame
df1 = pd.DataFrame(np.arange(1,10).reshape(3,3),index = ["a","b","c"],columns = ["A","B","C"])  # from a NumPy array
df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index = ["a","b","c"],columns = ["A","B","C"])     # from a nested list
df3 = pd.DataFrame({"A":[1,4,7],"B":[2,5,8],"C":[3,6,9]},index = ["a","b","c"])                 # from a dict
print("df1 built from a NumPy array:\n{}".format(df1))
print("df2 built from a nested list:\n{}".format(df2))
print("df3 built from a dict:\n{}".format(df3))
df1 built from a NumPy array:
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9
df2 built from a nested list:
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9
df3 built from a dict:
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9

All three approaches build a DataFrame, and the resulting DataFrames are identical. For examples, I prefer to build from a NumPy array.

# Row and column indexes of df1
print("Row index of df1: {}".format(df1.index))
print("Column index of df1: {}".format(df1.columns))
Row index of df1: Index(['a', 'b', 'c'], dtype='object')
Column index of df1: Index(['A', 'B', 'C'], dtype='object')

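# Selecting data from df1: by column label, by label with loc, by position with iloc, or by mixed indexing with ix
print("Select by column label:\n{}".format(df1[["A","C"]]))  # select one or more columns by column label (note: this does not work with row labels)
Select by column label:
   A  C
a  1  3
b  4  6
c  7  9
print("Select by label with loc:\n{}".format(df1.loc["a","B"]))    # select by label
Select by label with loc:
2
print("Select by label with loc:\n{}".format(df1.loc[["a","c"],["B","C"]]))   # select by label
Select by label with loc:
   B  C
a  2  3
c  8  9
print("Select by position with iloc:\n{}".format(df1.iloc[:2,1:]))    # select by position
Select by position with iloc:
   B  C
a  2  3
b  5  6
print("Select by mixed indexing with ix:\n{}".format(df1.ix[["a","c"],1:]))   # mixed label/position indexing (runs only on pandas before 1.0)
Select by mixed indexing with ix:
   B  C
a  2  3
c  8  9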

Warning:
C:\Users\wenjianhua\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.

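When ix is used, pandas warns that ix is deprecated and asks you to use loc or iloc instead, so stick to label indexing with loc and positional indexing with iloc. From pandas 1.0 on, ix has been removed entirely and the call above raises an AttributeError.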
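
The mixed selection that ix performed (row labels together with column positions) can be written with loc or iloc alone. The following is a sketch of two possible rewrites, not the only ones; the names `by_label` and `by_position` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                   index=["a", "b", "c"], columns=["A", "B", "C"])

# Option 1: stay label-based by turning the column positions into labels.
by_label = df1.loc[["a", "c"], df1.columns[1:]]

# Option 2: stay position-based by turning the row labels into positions.
by_position = df1.iloc[df1.index.get_indexer(["a", "c"]), 1:]

print(by_label)
```

Both variants select rows "a" and "c" and the columns from position 1 onward.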

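2. Common operations

For both Series and DataFrame, the most common operations are adding, modifying, deleting, and querying. Querying means selecting through an index, which was covered above, so what follows covers adding, deleting, and modifying.

2.1 Adding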

# Add a column with insert: insert(loc, column, value, allow_duplicates=True/False)
df1 = pd.DataFrame(np.arange(1,10).reshape(3,3),index = ["a","b","c"],columns = ["A","B","C"])
df1.insert(3,"D",[1,2,3])
print(df1)
   A  B  C  D
a  1  2  3  1
b  4  5  6  2
c  7  8  9  3

# Add a column by assigning to a new column label
df1["F"] = [3,5,8]
print(df1)
   A  B  C  D  F
a  1  2  3  1  3
b  4  5  6  2  5
c  7  8  9  3  8

# Combine columns from two frames with pd.merge(), similar to SQL inner, left, right, and outer joins
# pd.merge(df1, df2, on, how = "inner"/"left"/"right"/"outer", left_on, right_on, left_index = True/False, right_index = True/False)
df2 = pd.DataFrame(np.random.randn(4).reshape(2,2),index = ["b","c"],columns = ["X","Y"])
print(pd.merge(df1,df2,left_index = True,right_index = True,how = "inner" ))      # the join key is the row index
   A  B  C  D  F         X         Y
b  4  5  6  2  5 -0.507837 -0.994847
c  7  8  9  3  8 -0.106815  0.780901
df2 = pd.DataFrame([[1,3],[4,2],[7,10]],index = ["d","e","f"],columns = ["A","Y"])
print(pd.merge(df1,df2,how = "left",on = df1["A"]))        # left join df2 onto df1, using df1["A"] as the join key
   key_0  A_x  B  C  D  F  A_y   Y
0      1    1  2  3  1  3    1   3
1      4    4  5  6  2  5    4   2
2      7    7  8  9  3  8    7  10

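Here df1["A"] is the join key. Because both frames have a column labelled A, the two columns appear in the result as A_x and A_y; the suffixes parameter changes these names. Also, because this merge joins on key values rather than on the row indexes, the original row labels are discarded and the result gets a default row index starting from 0.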

print(pd.merge(df1,df2,how = "left",on = df1["A"],suffixes = ["_a","_b"]))
   key_0  A_a  B  C  D  F  A_b   Y
0      1    1  2  3  1  3    1   3
1      4    4  5  6  2  5    4   2
2      7    7  8  9  3  8    7  10

df2 = pd.DataFrame([["a",3],["b",2],["c",10]],index = ["d","e","f"],columns = ["A","Y"])
print(pd.merge(df1,df2,how = "left",left_index = True,right_on = df2["A"]))    # left key is the row index, right key is df2["A"]
  key_0  A_x  B  C  D  F A_y   Y
d     a    1  2  3  1  3   a   3
e     b    4  5  6  2  5   b   2
f     c    7  8  9  3  8   c  10
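
For comparison, here is a small sketch of the more common case where the join key is a shared column name, combined with an outer join. The frames `left` and `right` and the column name `key` are made up for illustration; `indicator=True` adds a `_merge` column that records which side each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 4, 7], "B": [2, 5, 8]})
right = pd.DataFrame({"key": [1, 4, 10], "Y": [3, 2, 9]})

# how="outer" keeps keys from both sides; unmatched cells become NaN.
merged = pd.merge(left, right, on="key", how="outer", indicator=True)
print(merged)
```

Passing the key by name avoids the extra key_0 column, and since only the key column is shared, no suffixes are needed.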

# Add rows with pd.concat(): pd.concat(objs, ignore_index = True/False)
df3 = pd.DataFrame(np.random.randn(10).reshape(2,5),index = ["a","b"],columns = ["A","B","C","D","F"])
print(df3)
          A         B         C         D         F
a -0.813396  0.596950  0.299149  1.289255  1.172996
b -0.672459  0.314604  0.497577 -0.968354 -0.600552

df4 = pd.concat([df1,df3],ignore_index = True)  # after concatenation the row index can be non-consecutive or duplicated; ignore_index resets it
print(df4)
          A         B         C         D         F
0  1.000000  2.000000  3.000000  1.000000  3.000000
1  4.000000  5.000000  6.000000  2.000000  5.000000
2  7.000000  8.000000  9.000000  3.000000  8.000000
3 -0.813396  0.596950  0.299149  1.289255  1.172996
4 -0.672459  0.314604  0.497577 -0.968354 -0.600552
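
To see what ignore_index guards against, the sketch below concatenates two small frames with and without it. The frames `top` and `bottom` are made up for illustration.

```python
import pandas as pd

top = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])
bottom = pd.DataFrame({"A": [3, 4]}, index=["a", "b"])

kept = pd.concat([top, bottom])                      # row labels are kept as-is
reset = pd.concat([top, bottom], ignore_index=True)  # row labels are renumbered
print(kept.index.is_unique, reset.index.is_unique)
```

Without ignore_index the labels "a" and "b" each appear twice, so a lookup such as `kept.loc["a"]` returns two rows.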

2.2 Deleting

# Delete a single column with del
del df1["D"]
print(df1)
   A  B  C  F
a  1  2  3  3
b  4  5  6  5
c  7  8  9  8

# Delete one or more columns/rows with drop(labels, axis = 0/1, inplace = True/False)
# axis = 1 drops columns
# inplace = True modifies the original data
print(df1.drop(columns = ["B","C"]))           # with columns=..., axis is not needed
   A  F
a  1  3
b  4  5
c  7  8
print(df1.drop(["C"],axis = 1))              # with bare column labels, axis = 1 is required
   A  B  F
a  1  2  3
b  4  5  5
c  7  8  8
print(df1.drop(["b"],axis = 0))            # with row labels, axis = 0 is the default and is shown here for clarity
   A  B  C  F
a  1  2  3  3
c  7  8  9  8
print(df1.drop(df1.columns[[1,3]],axis = 1))          # with positions, look up the labels via df1.columns and pass axis = 1
   A  C
a  1  3
b  4  6
c  7  9

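As you can see, df1 is unchanged after the drop calls above. That is because inplace defaults to False, which leaves the original data alone. Set it to True to modify df1 itself.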

df1.drop(df1.columns[[1,3]],axis = 1,inplace = True) 
print(df1)
   A  C
a  1  3
b  4  6
c  7  9
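
An alternative to inplace = True is to keep the default and assign the result of drop to a name. A minimal sketch, with the names `df` and `trimmed` chosen only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  index=["a", "b", "c"], columns=["A", "B", "C"])

# drop returns a new DataFrame; the original is left untouched.
trimmed = df.drop(columns=["B"])
print(list(df.columns))       # the original still has A, B, C
print(list(trimmed.columns))  # the copy has lost B
```

Keeping the original around this way makes it easy to compare the data before and after the deletion.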

# A column can also be removed with pop(), which deletes it and returns the deleted column
print(df1.pop("C"))
a    3
b    6
c    9
Name: C, dtype: int32

2.3 Modifying

df1 = pd.DataFrame(np.arange(1,10).reshape(3,3),index = ["a","b","c"],columns = ["A","B","C"]) 
print(df1)
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9

# Modify values directly through label indexing
df1["A"] = 0
print(df1)
   A  B  C
a  0  2  3
b  0  5  6
c  0  8  9

# Values can also be modified through positional indexing
df1.iloc[:,0] = [1,4,7]
print(df1)
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9

# Values can also be modified with replace()
print(df1.replace(5,np.nan))   # np.nan rather than np.NaN, which was removed in NumPy 2.0
   A    B  C
a  1  2.0  3
b  4  NaN  6
c  7  8.0  9

# Row/column labels can be changed by assigning to index/columns; the only drawback is that the full list must be given even to change a single label
print(df1.index)
Index(['a', 'b', 'c'], dtype='object')
df1.index = ["a","c","d"]
print(df1.index)
Index(['a', 'c', 'd'], dtype='object')

# A more convenient way to change row/column labels is rename
df1.rename(index = {"c":"b","d":"c"},inplace = True)
print(df1.index)
Index(['a', 'b', 'c'], dtype='object')
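
rename works on column labels as well, and it accepts a function in place of a dict. The sketch below shows both; the names `df` and `renamed` and the new label "alpha" are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  index=["a", "b", "c"], columns=["A", "B", "C"])

# A dict renames only the listed labels; a function is applied to every label.
renamed = df.rename(columns={"A": "alpha"}).rename(index=str.upper)
print(renamed)
```

Since inplace is left at its default here, `df` keeps its original labels.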

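3. Descriptive statistics methods

count: number of non-NA values
describe: summary statistics for a Series or for each DataFrame column
min: minimum
max: maximum
quantile: quantile
sum: sum
mean: mean (also in numpy)
median: median (also in numpy)
mad: mean absolute deviation from the mean (removed in pandas 2.0)
var: variance (also in numpy)
std: standard deviation (also in numpy)
skew: skewness (third moment)
kurt: kurtosis (fourth moment)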
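
As a quick illustration of the list above, most of these can be called directly as methods on a Series (or column by column on a DataFrame). The values in `s` are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 10])

print(s.count())                      # number of non-NA values
print(s.sum(), s.mean(), s.median())
print(s.quantile(0.25))               # first quartile
print(s.var(), s.std())               # sample variance and standard deviation (ddof=1)
print(s.skew(), s.kurt())             # the large value 10 gives a positive skew
```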

def summary(x):
    print("Sum:",sum(x))
    print("Min:",min(x))
    print("Max:",max(x))
    print("Mean:",np.mean(x))
    print("Median:",np.median(x))
    print("Variance:",np.var(x))          # population variance (ddof=0)
    print("Standard deviation:",np.std(x))  # population standard deviation (ddof=0)

x = np.random.randn(9)
print(x)
[ 0.3278447   0.35207847  1.49437789 -2.11866331  1.13839775 -1.63416003
  0.3934872   0.85271989  0.36729024]

summary(x)
Sum: 1.173372792257903
Min: -2.1186633130744097
Max: 1.4943778881381498
Mean: 0.13037475469532256
Median: 0.3672902376436664
Variance: 1.3092951819301024
Standard deviation: 1.1442443715964272

We can define our own function to compute each descriptive measure, or let describe do the work, but describe is only available on a Series or a DataFrame.

print(df1.describe())
         A    B    C
count  3.0  3.0  3.0
mean   4.0  5.0  6.0
std    3.0  3.0  3.0
min    1.0  2.0  3.0
25%    2.5  3.5  4.5
50%    4.0  5.0  6.0
75%    5.5  6.5  7.5
max    7.0  8.0  9.0
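
One detail to keep in mind when comparing a hand-written summary based on np.std with the std row of describe: NumPy defaults to the population formula (ddof=0), while pandas defaults to the sample formula (ddof=1). A small sketch using the values of column A; the name `values` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

values = [1, 4, 7]

print(np.std(values))           # population standard deviation, ddof=0
print(pd.Series(values).std())  # sample standard deviation, ddof=1, as reported by describe
print(np.std(values, ddof=1))   # matches pandas once ddof is set
```

The same difference applies to np.var and the pandas var method.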

Xiaowen's Data Journey (小文的数据之旅)
