[Python Data Analysis-01] Pandas Data Structures: Series

# Import the pandas package

import pandas as pd

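# Pandas has two main data structures: Series and DataFrame. This post covers how to use Series.
# A Series is a one-dimensional, array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (the index).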

obj = pd.Series([4, 7, -5, 3])

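# How a Series is displayed: the index is on the left and the values are on the right. Since we did not specify an index for the data, a default integer index from 0 to N-1 (where N is the length of the data) is created automatically.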

print(obj)

0    4
1    7
2   -5
3    3
dtype: int64

# Use the index and values attributes of a Series to get its index object and the array representation of its data

obj.index

RangeIndex(start=0, stop=4, step=1)

obj.values

array([ 4,  7, -5,  3], dtype=int64)
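# The same two attributes can be sketched with to_numpy(), which the pandas documentation recommends over .values in pandas 0.24 and later. The names arr and labels below are illustrative only and do not come from the original post.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# to_numpy() returns the data as a NumPy array (available in pandas 0.24+)
arr = obj.to_numpy()
print(arr)      # [ 4  7 -5  3]

# The default index is a RangeIndex; list() materializes its labels
labels = list(obj.index)
print(labels)   # [0, 1, 2, 3]
```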

# Create a Series and specify an index label for each data point

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

obj2

d    4
b    7
a   -5
c    3
dtype: int64

obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

# As with a NumPy array, you can select a single value or a set of values by index; with a Series, the index consists of labels

obj2['a']

-5

obj2['d'] = 6

obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64
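# Plain square brackets work for label lookups, but the explicit .loc (label-based) and .iloc (position-based) accessors make the intent unambiguous. A minimal sketch, rebuilding obj2 with its initial values; by_label, by_position, and subset are illustrative names.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_label = obj2.loc['a']             # select by index label
by_position = obj2.iloc[2]           # select by integer position (third element)
subset = obj2.loc[['c', 'a', 'd']]   # a list of labels returns a Series

print(by_label, by_position)         # -5 -5
print(list(subset.index))            # ['c', 'a', 'd']
```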

# NumPy-style array operations (such as filtering with a boolean array, scalar multiplication, or applying math functions) all preserve the link between index and value:

obj2

d    6
b    7
a   -5
c    3
dtype: int64

obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

obj2*2

d    12
b    14
a   -10
c     6
dtype: int64

import numpy as np
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
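# One way to check that the labels really travel with the values is to inspect the index of each result. This is only a sketch on a fresh Series named s (a hypothetical name), and np.sqrt stands in for any NumPy function.

```python
import numpy as np
import pandas as pd

s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

doubled = s * 2             # scalar multiplication keeps every label
positive = s[s > 0]         # boolean filtering drops only the label 'a'
roots = np.sqrt(positive)   # a NumPy function applied to a Series returns a Series

print(list(doubled.index))    # ['d', 'b', 'a', 'c']
print(list(positive.index))   # ['d', 'b', 'c']
print(type(roots).__name__)   # Series
```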

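# A Series can also be thought of as a fixed-length, ordered dict, since it maps index values to data values. It can be used in many functions that expect a dict argument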

'b' in obj2

True

'e' in obj2

False
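# Note that the in operator tests index labels, not values, just as it tests keys on a dict. The sketch below also shows Series.get and Series.to_dict, two other dict-style helpers; fallback and as_dict are illustrative names.

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print('b' in obj2)         # True: 'b' is an index label
print(7 in obj2)           # False: 7 is a value, not a label
print(7 in obj2.values)    # True: test the values explicitly

# Like dict.get, Series.get returns the default for a missing label
fallback = obj2.get('e', 0)
print(fallback)            # 0

# Convert back to a plain Python dict
as_dict = obj2.to_dict()
print(as_dict)             # {'d': 6, 'b': 7, 'a': -5, 'c': 3}
```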

# If the data is stored in a Python dict, you can create a Series directly from that dict:

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

obj3 = pd.Series(sdata)

obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

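# When only a dict is passed, the index of the resulting Series is the dict's keys. They appear sorted here, as in older pandas versions; pandas 0.23+ on Python 3.6+ keeps the dict's insertion order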

states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)

obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
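# Passing index= together with a dict behaves like building the Series first and then reindexing it: labels missing from the dict become NaN, and dict keys not listed in the index are dropped. A minimal sketch of that equivalence; reindexed is an illustrative name.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)
reindexed = pd.Series(sdata).reindex(states)

print(obj4.equals(reindexed))   # True
print('Utah' in obj4)           # False: 'Utah' is not in states, so it is dropped
print(obj4.dtype)               # float64: NaN forces a float dtype
```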

# In pandas, NaN marks a missing value. The pandas isnull and notnull functions can be used to detect missing data

pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
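# The same checks are also available as Series methods, and the resulting mask is typically used to count, fill, or drop the missing entries. A short sketch that rebuilds obj4 by hand; the fill value 0 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])

mask = obj4.isna()            # method form; same result as pd.isnull(obj4)
n_missing = int(mask.sum())   # True counts as 1
filled = obj4.fillna(0)       # replace NaN with a chosen value
dropped = obj4.dropna()       # or remove the missing entries

print(n_missing)              # 1
print(list(dropped.index))    # ['Ohio', 'Oregon', 'Texas']
```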

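# One of the most important features of Series is that arithmetic operations automatically align data by index label; a label that is missing from either operand produces NaN in the result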

obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

obj3+obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
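# If NaN is not wanted for labels that exist on only one side, Series.add accepts a fill_value that substitutes for the missing operand. A sketch under the assumption that a missing population should count as 0; total is an illustrative name. A label missing on both sides still yields NaN.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Treat a value missing on one side as 0 before adding
total = obj3.add(obj4, fill_value=0)

print(total['Ohio'])                  # 70000.0
print(total['Utah'])                  # 5000.0: only obj3 has 'Utah'
print(pd.isna(total['California']))   # True: missing in both operands
```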

# Both the Series object itself and its index have a name attribute, which is closely tied to other key pandas functionality:

obj4.name = 'population'

obj4.index.name = 'state'

obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
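# One place these names show up is when a Series is converted to a DataFrame: the index name and the Series name become the column names. A minimal sketch using reset_index; frame is an illustrative name.

```python
import numpy as np
import pandas as pd

obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])
obj4.name = 'population'
obj4.index.name = 'state'

# reset_index moves the index into a regular column of a new DataFrame
frame = obj4.reset_index()

print(list(frame.columns))   # ['state', 'population']
print(frame.shape)           # (4, 2)
```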

# The index of a Series can be modified in place by assignment

obj

0    4
1    7
2   -5
3    3
dtype: int64

obj.index = ['zhejiang', 'ningbo', 'caicai', 'nbu']

obj

zhejiang    4
ningbo      7
caicai     -5
nbu         3
dtype: int64
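# Two details about index assignment are worth sketching: the new index must have the same length as the data, and Series.rename offers a non-mutating alternative that can relabel only some entries. The names renamed and mismatch_rejected are illustrative.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# rename returns a new Series; the original index is left untouched
renamed = obj.rename({0: 'zhejiang', 1: 'ningbo'})
print(list(renamed.index))   # ['zhejiang', 'ningbo', 2, 3]
print(list(obj.index))       # [0, 1, 2, 3]

# Assigning an index of the wrong length raises ValueError
try:
    obj.index = ['a', 'b', 'c']
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)     # True
```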