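Table of contents

pandas

1. Series

1) Creating a Series

(1) From a list or NumPy array

Specifying the index with the index parameter

The name parameter
The copy parameter

(2) From a dictionary

2) Series indexing and slicing

(0) Plain indexing
(1) Explicit indexing:
(2) Implicit indexing:
Plain slicing
Explicit slicing
Implicit slicing

3) Basic concepts of Series

A Series can be viewed as a fixed-length ordered dictionary
Series attributes are available through ndim, shape, size, index, values, and so on
head() and tail() give a quick look at a Series object
Reading a CSV file with pandas
When an index label has no corresponding value, missing data may appear as NaN (not a number)
Missing data can be detected with pd.isnull(), pd.notnull(), or the Series methods isnull() and notnull()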

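Excel is mainly used to store data and carry out data analysis.

Pandas is suited to online (programmatic) services: data analysis and data processing (for example, data processing in machine learning).

Data analysis grew out of statistical theory. It helps business departments monitor, locate, analyze, and solve problems, and it helps companies make decisions efficiently and run more efficiently, which in turn raises profit, creates value, and reduces risk.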

pandas

The data analysis trio

numpy: numerical computing
pandas: data analysis
matplotlib + seaborn: data visualization

Tableau + Power BI + Excel

# pandas is built on top of numpy
# pandas has two main data types: Series and DataFrame
import pandas as pd
import numpy as np

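1. Series

A Series is a one-dimensional, array-like object made up of two parts:

values: a set of data (ndarray type)
index (keys): the index labels associated with the data

1) Creating a Series

There are two ways to create one:

(1) From a list or NumPy array

The default index is an integer index from 0 to N-1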

s1 = pd.Series([1,2,3,4])
s1
#0    1
#1    2
#2    3
#3    4
#dtype: int64
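
The same default index appears when the data comes from a NumPy array rather than a list. A minimal sketch, using made-up sample values (the array contents and the name s_arr are assumptions, not from the original notes):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration
arr = np.array([1.5, 2.5, 3.5])
s_arr = pd.Series(arr)

print(s_arr.index.tolist())  # default positional labels 0 .. N-1
print(s_arr.dtype)           # the dtype is taken from the array
```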

Specifying the index with the index parameter

s2 = pd.Series(data=[1,2,3,4],index=list('abcd'))
s2
#a    1
#b    2
#c    3
#d    4
#dtype: int64

The name parameter

s3 = pd.Series(data=[1,2,3,4],index=list('abcd'),name='demo')
s3
#a    1
#b    2
#c    3
#d    4
#Name: demo, dtype: int64

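The copy parameter

When a Series is built from an ndarray, it can reference the array's memory directly instead of copying it (whether it does so by default depends on the pandas version). Passing copy=True forces an independent copy, so later changes to the array do not affect the Series.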

arr = np.array([1,2,3,4,5])
ser = pd.Series(data=arr,copy=True)
ser
#0    1
#1    2
#2    3
#3    4
#4    5
#dtype: int32
arr[0]=100
ser
#0    1
#1    2
#2    3
#3    4
#4    5
#dtype: int32
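
To see the difference the copy parameter makes, the two settings can be compared side by side. This is a sketch under one assumption worth stating: with copy=False the Series may share memory with the array, but the exact behavior depends on the pandas version, so only the copy=True result is guaranteed here. The names ser_copy and ser_view are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4, 5])
ser_copy = pd.Series(arr, copy=True)   # independent copy of the data
ser_view = pd.Series(arr, copy=False)  # may reuse the array's memory

arr[0] = 100  # modify the original array afterwards

print(ser_copy.iloc[0])  # still the original value, the copy is unaffected
print(ser_view.iloc[0])  # may show the new value, depending on the pandas version
print(np.shares_memory(arr, ser_view.to_numpy()))  # reports whether memory is shared
```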

(2) From a dictionary

# set types are not supported
dict_ = dict(a=1,b=2,c=3)
pd.Series(dict_)
#a    1
#b    2
#c    3
#dtype: int64
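
Because a set has no order, pandas refuses to build a Series from it. One possible workaround, sketched here with a made-up set named tags, is to convert the set to a sorted list first:

```python
import pandas as pd

tags = {3, 1, 2}  # hypothetical set for illustration

try:
    pd.Series(tags)
    set_supported = True
except TypeError:
    # pandas rejects sets because they are unordered
    set_supported = False

# Converting to a sorted list gives the data a well-defined order
s_from_set = pd.Series(sorted(tags))
print(set_supported)
print(s_from_set.tolist())
```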

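2) Series indexing and slicing

Use square brackets with a single label to get one element (the result has the element's type), or square brackets containing a list to get several elements (the result is still a Series). Indexing is either explicit or implicit:

(0) Plain indexing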

S = pd.Series(dict(a=1,b=2,c=3,d=4))
S[:2],S[:'c'],S.c
#(a    1
# b    2
# dtype: int64, a    1
# b    2
# c    3
# dtype: int64, 3)

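There are two kinds of index:

Enumerated index: the index labels are consecutive integers
Associative index: the index labels are discrete strings

(1) Explicit indexing:

Use the labels in index as the index values
Use .loc[] (recommended)
You can think of pandas as an upgraded ndarray, but a Series can also be seen as an upgraded dict

Note that slices here are closed intervals (both endpoints are included)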

S.loc['b']
#2

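(2) Implicit indexing:

Use integer positions as the index values
Use .iloc[] (recommended)
Note that slices here are half-open intervals (the end position is excluded)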

S.iloc[2]
#3
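
The distinction between .loc[] and .iloc[] matters most when the labels are themselves integers that do not match the positions. A small sketch with hypothetical data (s_int and its labels are made up for illustration):

```python
import pandas as pd

# Integer labels deliberately stored in reverse order
s_int = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s_int.loc[0]      # explicit: the element whose label is 0
by_position = s_int.iloc[0]  # implicit: the element at position 0

print(by_label, by_position)
```

Using .loc[] and .iloc[] explicitly avoids having to guess which interpretation plain square brackets will pick.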

Plain slicing

S
#a    1
#b    2
#c    3
#d    4
#dtype: int64
S[1:-1]
#b    2
#c    3
#dtype: int64
S['a':'d']
#a    1
#b    2
#c    3
#d    4
#dtype: int64

Explicit slicing

S.loc['a':'d']
#a    1
#b    2
#c    3
#d    4
#dtype: int64

Implicit slicing

S.iloc[0:-1]
#a    1
#b    2
#c    3
#dtype: int64
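
The closed versus half-open behavior can be checked directly by slicing the same Series both ways. A minimal sketch reusing the S defined earlier (the names label_slice and pos_slice are illustrative):

```python
import pandas as pd

S = pd.Series(dict(a=1, b=2, c=3, d=4))

label_slice = S.loc['a':'c']  # label slice: the end label 'c' is included
pos_slice = S.iloc[0:2]       # position slice: the end position 2 is excluded

print(label_slice.index.tolist())
print(pos_slice.index.tolist())
```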

3) Basic concepts of Series

A Series can be viewed as a fixed-length ordered dictionary

Series attributes are available through ndim, shape, size, index, values, and so on

S.ndim,S.shape,S.size,S.dtype
#(1, (4,), 4, dtype('int64'))
S.index
#Index(['a', 'b', 'c', 'd'], dtype='object')
S.keys()
#Index(['a', 'b', 'c', 'd'], dtype='object')
S.values
#array([1, 2, 3, 4], dtype=int64)
S.nbytes
#32
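
The "ordered dictionary" view can be made concrete with a few dict-style operations that a Series supports. A short sketch reusing the same S (the variable names are illustrative):

```python
import pandas as pd

S = pd.Series(dict(a=1, b=2, c=3, d=4))

has_b = 'b' in S         # membership tests the labels, like dict keys
pairs = list(S.items())  # (label, value) pairs in index order
as_dict = S.to_dict()    # convert back to a plain dict
fallback = S.get('z', 0) # a missing label returns the default, like dict.get

print(has_b, pairs, as_dict, fallback)
```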

head() and tail() give a quick look at a Series object

Both take a parameter n, which defaults to 5

S = pd.Series(data=np.random.randint(0,10,10000))
# Similar to `head -n 3 xxx.txt` on Linux, which reads the first few lines (the values below are random)
S.head(n=3)
#0    4
#1    9
#2    8
#dtype: int32
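
tail() works the same way from the other end. A sketch with deterministic data so the result is easy to check (S_demo is a hypothetical Series, not the random one above):

```python
import numpy as np
import pandas as pd

S_demo = pd.Series(np.arange(100))

first_default = S_demo.head()  # n defaults to 5
last_three = S_demo.tail(n=3)  # the last three rows

print(len(first_default))
print(last_three.tolist())
```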

Reading a CSV file with pandas

# filepath_or_buffer: the file path
# sep=',': the CSV delimiter
city = pd.read_csv('500_Cities__Local_Data_for_Better_Health.csv')
city.head()
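
The CSV file above is not included with these notes, so the effect of sep can be tried on an in-memory text buffer instead. In this sketch the column names and values are made up, and a semicolon stands in as a non-default delimiter:

```python
import io
import pandas as pd

# Hypothetical CSV content using ';' as the delimiter
csv_text = "city;population\nA;100\nB;200\n"

# read_csv accepts a file-like buffer as well as a path
df = pd.read_csv(io.StringIO(csv_text), sep=';')

print(df.head())
```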


When an index label has no corresponding value, missing data may appear, displayed as NaN (not a number)

np.array([None,1,2,3])
#array([None, 1, 2, 3], dtype=object)
S=pd.Series([None,1,2,3])

Missing data can be detected with pd.isnull(), pd.notnull(), or the Series methods isnull() and notnull()

# Similar to MySQL: WHERE demo IS NOT NULL
index = S.notnull()
index
#0    False
#1     True
#2     True
#3     True
#dtype: bool
S[index]
#1    1.0
#2    2.0
#3    3.0
#dtype: float64
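
The top-level functions and the Series methods give the same result, and the boolean mask can also be used to count missing values. A minimal sketch on the same data (the names mask_func, mask_method, and n_missing are illustrative):

```python
import pandas as pd

S = pd.Series([None, 1, 2, 3])

mask_func = pd.isnull(S)   # top-level function
mask_method = S.isnull()   # equivalent Series method

same = mask_func.equals(mask_method)
n_missing = int(mask_method.sum())  # True counts as 1 when summed

print(same, n_missing)
```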

汪雯琦


[Key data mining notes, day 11] Getting started with pandas and Series, and basic Series usage