pandas

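Importing pandas:

# data analysis relies on three workhorse modules (commonly numpy, pandas and matplotlib)

# the first two handle the data analysis itself; the third displays data by plotting (a picture is worth a thousand words)

import numpy as np

import pandas as pd

from pandas import Series, DataFrame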

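1.Series

A Series is an object similar to a one-dimensional array. It is made up of two parts:

values: a one-dimensional array (ndarray type)
index: the labels associated with the data

1) Creating a Series

There are two ways to create one:

(1) From a list or a numpy array

The default index is an integer index running from 0 to N-1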

nd = np.random.randint(0,150,size=10)

nd

s = Series(nd)

s

Specify the index with the index parameter

type(s)

# string values in a Series are displayed with dtype object
l = list('qwertyuiop')

s = Series(l)

s

# MySQL has two kinds of index, and programming languages usually offer two as well: positional (enumeration) indexes and label indexes

l = [1,2,3,4,5]

s = Series(l, index=list('abcde'))

s

The name parameter

# name is roughly analogous to a table name
# a Series holds one-dimensional data
s1 = Series(np.random.randint(0,150,size=8),index=list('abcdefgh'),name='python')
s2 = Series(np.random.randint(0,150,size=8),index=list('abcdefgh'),name='english')
s3 = Series(np.random.randint(0,150,size=8),index=list('abcdefgh'),name='math')
display(s1,s2,s3)
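
The index and name parameters can be seen together in a small deterministic sketch. The scores below are made-up illustration values, not data from these notes.

```python
import pandas as pd

# Hypothetical scores, used only for illustration
scores = pd.Series([90, 85, 70], index=['a', 'b', 'c'], name='python')

print(scores.index.tolist())  # ['a', 'b', 'c']
print(scores.name)            # python
print(scores['b'])            # 85
```

Using fixed values instead of random ones makes it easier to check that each label points at the value you expect.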

(2) From a dictionary

# in practice, a dictionary is a natural fit for building a Series

s = Series({'a':1,'b':2,'c':3})
s
Result:
a    1
b    2
c    3
dtype: int64

s1 = Series({'a':150,'b':100,'c':90})

s1
Result:
a    150
b    100
c     90
dtype: int64

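2) Indexing and slicing a Series

(1) Explicit indexing

Use the elements of index as the index values
Use .loc[] (recommended)

pandas can be thought of as an upgraded ndarray, and a Series can also be seen as an upgraded dict

Note: label slices here are closed intervals (both endpoints are included)

# to fetch the values for two or more labels at once, pass the labels as a list
s1[['a','b']]
Result:
a    150
b    100
dtype: int64

s1.loc['a']

Result:

150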

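(2) Implicit indexing

Use integers as the index values
Use .iloc[] (recommended)

Note: positional slices here are half-open intervals (the end position is excluded)

s2 = s1

# integer positions inside plain [] work here, but newer pandas versions deprecate this on a labelled index; .iloc is the safe choice
s2[[0,1]]

Result:
a    150
b    100
dtype: int64

s2.iloc[0]
Result:
150

s2.iloc[[0,1]]

Result:
a    150
b    100
dtype: int64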

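l = [1,2,3,4,5]
s = Series(l, index=list('你好漂亮啊'))
s

Result:
你    1
好    2
漂    3
亮    4
啊    5
dtype: int64

# labels like these have no natural order, so a label slice relies on their positional order in the index
s['你':'好']
Result:
你    1
好    2
dtype: int64

Slicing

# explicit (label) slices are closed intervals
# a label slice whose end lies beyond the last label does not raise an error (on a sorted index); it runs to the last label
s2['a':'z']
Result:
a    150
b    100
c     90
dtype: int64

(1) Explicit slicing

(2) Implicit slicing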
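
The difference between the two kinds of slice can be sketched as follows, reusing the three values from the dictionary example above. The variable names explicit and implicit are only illustrative.

```python
import pandas as pd

s = pd.Series([150, 100, 90], index=['a', 'b', 'c'])

# Explicit (label) slice: both endpoints are included
explicit = s.loc['a':'b']

# Implicit (positional) slice: the end position is excluded
implicit = s.iloc[0:1]

print(explicit.tolist())  # [150, 100]
print(implicit.tolist())  # [150]
```

Writing .loc or .iloc explicitly removes any doubt about which rule applies to a given slice.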

3) Basic concepts of a Series

A Series can be viewed as a fixed-length ordered dictionary

Attributes such as shape, size, index and values describe a Series (the values of a Series are an ndarray)

s1
Result:
a    150
b    100
c     90
dtype: int64

s1.shape

Result:

(3,)

s1.size

Result:

3

s1.index

Result:

Index(['a', 'b', 'c'], dtype='object')

s1.values

Result:

array([150, 100,  90], dtype=int64)

head() and tail() give a quick look at a Series. Both take a parameter n, which defaults to 5

s1.name='java'

s1.head(n=10)

Result:

a    150
b    100
c     90
Name: java, dtype: int64

s1.tail()

Result:

a    150
b    100
c     90
Name: java, dtype: int64
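
Because s1 has only three elements, head() and tail() return the whole Series above. A longer, made-up Series shows the effect of n more clearly; this is only a sketch for illustration.

```python
import pandas as pd

s = pd.Series(range(10))

first_five = s.head()      # n defaults to 5
last_three = s.tail(n=3)   # n can be set explicitly

print(first_five.tolist())  # [0, 1, 2, 3, 4]
print(last_three.tolist())  # [7, 8, 9]
```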

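Reading a CSV file with pandas

# files are read through the pandas module itself (pd), not through a data type such as Series
%timeit pd.read_csv('path/to/file.csv')
Result:
1.26 ms ± 96.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

h = pd.read_csv('path/to/file.csv')

h.head()  # show the first 5 rows

Result: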

Year	Agriculture	Architecture	Art and Performance	Biology	Business	Communications and Journalism	Computer Science	Education	Engineering	English	Foreign Languages	Health Professions	Math and Statistics	Physical Sciences	Psychology	Public Administration	Social Sciences and History
0	1970	4.229798	11.921005	59.7	29.088363	9.064439	35.3	13.6	74.535328	0.8	65.570923	73.8	77.1	38.0	13.8	44.4	68.4
1	1971	5.452797	12.003106	59.9	29.394403	9.503187	35.5	13.6	74.149204	1.0	64.556485	73.9	75.5	39.0	14.9	46.2	65.5
2	1972	7.420710	13.214594	60.4	29.810221	10.558962	36.6	14.9	73.554520	1.2	63.664263	74.6	76.9	40.2	14.8	47.6	62.6
3	1973	9.653602	14.791613	60.2	31.147915	12.804602	38.4	16.4	73.501814	1.6	62.941502	74.9	77.4	40.9	16.5	50.4	64.3
4	1974	14.074623	17.444688	61.9	32.996183	16.204850	40.5	18.9	73.336811	2.2	62.413412	75.3	77.9

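When an index label has no corresponding value, the missing data may show up as NaN (not a number):

s6 = Series({'a':1,'b':2,'c':np.e,'d':None,'e':np.nan})

s6

Result:

a    1.000000
b    2.000000
c    2.718282
d         NaN
e         NaN
dtype: float64
# MySQL: int -> float -> object (string) -> NULL, default 0
# in MySQL, NULL is the least efficient value to work with; in development, give unimportant fields a default of 0
# NaN is very inefficient in statistics, grouping, computation and (where|having) queries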

Missing data can be detected with pd.isnull() and pd.notnull(), or with the Series' own isnull() and notnull() methods

cond = pd.isnull(s6)

cond

Result:

a    False
b    False
c    False
d     True
e     True
dtype: bool

cond = pd.notnull(s6)

cond

Result:

a     True
b     True
c     True
d    False
e    False
dtype: bool
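
A boolean result like cond is typically used as a mask. The sketch below rebuilds s6 and keeps only the entries that are present; the name present is an illustrative choice.

```python
import numpy as np
import pandas as pd

s6 = pd.Series({'a': 1, 'b': 2, 'c': np.e, 'd': None, 'e': np.nan})

# Keep only the entries that are not missing
present = s6[s6.notnull()]

# Count the missing entries
missing_count = int(s6.isnull().sum())

print(present.index.tolist())  # ['a', 'b', 'c']
print(missing_count)           # 2
```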

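Series arithmetic

(1) Array operations that work on numpy arrays also work on a Series

(2) Operations between Series

Data with different indexes is aligned automatically during the operation
Where a label exists in only one of the two Series, the result is NaN

Note: to keep a value at every index label instead of NaN, use the .add() method with fill_value


# generate random integers in the range [0, 100)

s1 = Series(np.random.randint(0,100,size=8),index=list('qwertyui'))
s1
Result:
q    23
w    59
e    63
r    38
t    39
y    86
u    77
i    54
dtype: int32


# generate random integers in the range [0, 100)

s2 = Series(np.random.randint(0,100,size=8),index=list('ertyuiop'))
s2
Result:
e     6
r    98
t    74
y    73
u    23
i    89
o    32
p    73
dtype: int32

When NaN is not wanted, pass fill_value=0 to .add()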
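
The alignment rule and the effect of fill_value can be sketched with two small Series. The values are fixed, made-up numbers so that the output is reproducible.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# 'a' and 'd' each appear in only one Series, so + yields NaN for them
plain = s1 + s2

# With fill_value=0 the missing partner is treated as 0
filled = s1.add(s2, fill_value=0)

print(plain.isnull().tolist())  # [True, False, False, True]
print(filled.tolist())          # [1.0, 12.0, 23.0, 30.0]
```

Both results carry the union of the two indexes; only the handling of unmatched labels differs.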

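2.DataFrame

A DataFrame is a tabular data structure that can be viewed as a dictionary of Series, all sharing the same index. It consists of several columns of data arranged in a fixed order. It was designed to extend the use of Series from one dimension to several. A DataFrame has both a row index and a column index.

– Row index: index

– Column index: columns

– Values: values (a two-dimensional numpy array)

A DataFrame is like an Excel sheet, comparable to a table in MySQL
A Series is one column
A DataFrame is several columns
All columns of a DataFrame share the same index

1) Creating a DataFrame

The most common way is to pass a dictionary. The DataFrame uses the dictionary keys as the column names and the dictionary values (each an array) as the columns.

In addition, a DataFrame automatically adds an index for the rows (as a Series does).

As with a Series, if a requested column does not match any dictionary key, its values are NaN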

# the key 'Python' does not match the requested column 'python', so that column is all NaN
df4 = DataFrame({'数学':['100','90','80','70','60'],
                 '语文':['101','91','81','71','61'],
                 'Python':['102','92','82','72','62']},
                index=list('abcde'),
                columns=['数学','语文','python'])
df4
Result:

数学	语文	python
a	100	101	NaN
b	90	91	NaN
c	80	81	NaN
d	70	71	NaN
e	60	61	NaN

# 'En' is not a key of the dictionary, so that column is all NaN
df4 = DataFrame({'数学':['100','90','80','70','60'],
                 '语文':['101','91','81','71','61'],
                 'python':['102','92','82','72','62']},
                index=['雷军','罗安装','马化腾','强东','思聪'],
                columns=['数学','语文','python','En'])
df4
Result:

数学	语文	python	En
雷军	100	101	102	NaN
罗安装	90	91	92	NaN
马化腾	80	81	82	NaN
强东	70	71	72	NaN
思聪	60	61	62	NaN

DataFrame attributes: values, columns, index, shape, ndim, dtypes
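
A quick way to get familiar with these attributes is to print them for a tiny table. The column names and numbers below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'math': [100, 90], 'python': [102, 92]}, index=['a', 'b'])

print(df.shape)             # (2, 2)
print(df.ndim)              # 2
print(df.columns.tolist())  # ['math', 'python']
print(df.index.tolist())    # ['a', 'b']
print(df.values.tolist())   # [[100, 102], [90, 92]]
print(df.dtypes)            # one dtype per column
```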

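2) Indexing a DataFrame

(1) Indexing columns

– Dictionary-style access

– Attribute-style access

A DataFrame column can be retrieved as a Series. The returned Series has the same index as the original DataFrame, and its name attribute is the column name.

# dictionary-style indexing
df4['python']

Result:

雷军     102
罗安装     92
马化腾     82
强东      72
思聪      62
Name: python, dtype: object

# how to select two columns at once

df4[['python','En']]
Result:

python	En
雷军	102	NaN
罗安装	92	NaN
马化腾	82	NaN
强东	72	NaN
思聪	62	NaN

df4['python']['思聪']
Result:
'62'

# attribute-style indexing
df4.python

Result:

雷军     102
罗安装     92
马化腾     82
强东      72
思聪      62
Name: python, dtype: object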

(2) Indexing rows

- Use .loc[] with index labels

- Use .iloc[] with integer positions

A single row is also returned as a Series, whose index is the original columns

df4.loc['雷军']  # a Series

Result:
数学        100
语文        101
python    102
En        NaN
Name: 雷军, dtype: object

# selecting several labels (passed as a list) returns a DataFrame
df4.loc[['雷军','马化腾']]['python']
Result:
雷军     102
马化腾     82
Name: python, dtype: object

df4.loc[['雷军','马化腾'],['python','数学']]
Result:

python	数学
雷军	102	100
马化腾	82	80

df4.loc['罗安装','数学']

Result:
'90'

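(3) Ways to index a single element

– Through the column index

– Through the row index

– Through the values attribute (a two-dimensional numpy array)

df4.iloc[0,1]

Result:
'101'

df4.iloc[0:3,0:3]

Result:

数学	语文	python
雷军	100	101	102
罗安装	90	91	92
马化腾	80	81	82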
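
The three routes to a single element can be compared side by side. This is a sketch on a small made-up table with string values, as in df4; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'math': ['100', '90'], 'python': ['102', '92']},
                  index=['a', 'b'])

via_column = df['python']['a']    # column first, then row label
via_row = df.loc['a']['python']   # row first, then column label
via_loc = df.loc['a', 'python']   # row and column in a single call
via_values = df.values[0, 1]      # plain numpy indexing on the 2-D array

print(via_column, via_row, via_loc, via_values)  # 102 102 102 102
```

For reading one cell, the single-call form df.loc[row, column] is usually the clearest of these.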

Note:

When square brackets are used directly on a DataFrame:

a label selects a column
a slice selects rows
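
This asymmetry is easy to trip over, so here is a minimal sketch on a made-up table. The names col, rows_by_position and rows_by_label are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'math': [100, 90, 80], 'python': [102, 92, 82]},
                  index=['a', 'b', 'c'])

col = df['math']                # a label inside [] selects a column
rows_by_position = df[0:2]      # an integer slice inside [] selects rows (end excluded)
rows_by_label = df['a':'b']     # a label slice inside [] selects rows (end included)

print(col.tolist())                     # [100, 90, 80]
print(rows_by_position.index.tolist())  # ['a', 'b']
print(rows_by_label.index.tolist())     # ['a', 'b']
```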

3) DataFrame arithmetic

(1) Operations between DataFrames

As with Series:

Data with different indexes is aligned automatically during the operation
Where the labels do not match, the result is NaN

# generate random integers with 5 rows and 4 columns

df5 = DataFrame(np.random.randint(0,150,size=(5,4)),index=list('abcde'),columns=['数学','语文','python','En'])
df5

Result:

数学	语文	python	En
a	84	28	129
b	136	9	47
c	39	126	5
d	5	51	130
e	22	34	90

df6 = DataFrame(np.random.randint(0,150,size=(5,4)),index=list('cdefg'),columns=['数学','语文','python','En'])
df6

Result:

数学	语文	python	En
c	17	31	72	45
d	96	73	102	94
e	137	60	41	66
f	145	124	112	7
g	22	93	121	66

# df5 and df6 share only rows c, d and e, so the sum is NaN in every other row
df5 + df6

Result:

数学	语文	python	En
a	NaN	NaN	NaN
b	NaN	NaN	NaN
c	56.0	157.0	77.0
d	101.0	124.0	232.0
e	159.0	94.0	131.0
f	NaN	NaN	NaN
g	NaN	NaN	NaN

Passing fill_value=0 to add() treats a value missing from one side as 0 instead of producing NaN

df5.add(df6,fill_value=0)

Result:

数学	语文	python	En
a	84.0	28.0	129.0
b	136.0	9.0	47.0
c	56.0	157.0	77.0
d	101.0	124.0	232.0
e	159.0	94.0	131.0
f	145.0	124.0	112.0
g	22.0	93.0	121.0

The table below maps Python operators to the corresponding pandas methods:

Python Operator	Pandas Method(s)
`+`	`add()`
`-`	`sub()`, `subtract()`
`*`	`mul()`, `multiply()`
`/`	`truediv()`, `div()`, `divide()`
`//`	`floordiv()`
`%`	`mod()`
`**`	`pow()`
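
A few rows of the table can be checked directly. The sketch below uses two small made-up DataFrames and compares each operator with its method form.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'x': [10, 20], 'y': [30, 40]})

# Each operator and its method counterpart give the same result
same_add = (a + b).equals(a.add(b))
same_floordiv = (b // a).equals(b.floordiv(a))
same_pow = (a ** 2).equals(a.pow(2))

print(same_add, same_floordiv, same_pow)  # True True True
```

The method forms become useful when extra arguments such as fill_value or axis are needed, which the operators cannot accept.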

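(2) Operations between a Series and a DataFrame

[Important]

With Python operators: the Series index is matched against the DataFrame's column labels and the operation is applied to every row (so the Series must be shaped like a row). This resembles arithmetic between a two-dimensional and a one-dimensional array in numpy, except that NaN may appear
With pandas methods:

axis = 0: the Series index is matched against the row index (the Series must be shaped like a column), and the operation is applied to every column

axis = 1: the Series index is matched against the column labels (the Series must be shaped like a row), and the operation is applied to every row

Column direction:

s = Series([1,2,3,4,5])
s

Result:

0    1
1    2
2    3
3    4
4    5
dtype: int64

# axis=1 aligns the Series index with the column labels
# s is indexed 0-4, which matches none of the column labels, so every entry is NaN
df5.add(s,axis=1)

Result:

数学	语文	python	En	0	1	2	3	4
a	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
b	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
c	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
d	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
e	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN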

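Row direction:

# axis=0 aligns the Series index with the row index
# s is indexed 0-4, which matches none of the row labels a-e, so every entry is NaN
df5.add(s,axis=0)

Result:

数学	语文	python	En
a	NaN	NaN	NaN	NaN
b	NaN	NaN	NaN	NaN
c	NaN	NaN	NaN	NaN
d	NaN	NaN	NaN	NaN
e	NaN	NaN	NaN	NaN
0	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN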
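
Both results above are entirely NaN because the labels of s (0 to 4) match neither the rows nor the columns of df5. For contrast, here is a sketch in which the labels do match. The table and both Series are made-up illustration data.

```python
import pandas as pd

df = pd.DataFrame({'math': [1, 2], 'python': [3, 4]}, index=['a', 'b'])

by_column = pd.Series({'math': 10, 'python': 20})  # labels match the columns
by_row = pd.Series({'a': 100, 'b': 200})           # labels match the row index

across = df.add(by_column, axis=1)  # same alignment as df + by_column
down = df.add(by_row, axis=0)       # each row label gets its own addend

print(across.values.tolist())  # [[11, 23], [12, 24]]
print(down.values.tolist())    # [[101, 103], [202, 204]]
```

The axis argument only decides which set of labels the Series is aligned with; the values are combined only where those labels match.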

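Note: I met a teacher from Shanghai whose teaching struck me as very disciplined. He prepares every lesson in advance and illustrates each concept with examples as he goes. The article above is his lecture handout. I think it is very good and am studying it carefully. Learning seems much like teaching: preview the material beforehand, learn to generalize from one example to others, and build good habits. Over time these add up to a disciplined way of working.

Learning Python - pandas (1)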