import pandas as pd
import numpy as np
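Head and Tail
- Both methods show 5 rows by default
- Pass a number to change how many rows are shown
# Show the first 5 rows (the default)
series.head()
# Show the last 3 rows
series.tail(3)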
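The snippet above assumes a Series named series already exists. As a minimal sketch, with a made-up ten-element Series standing in for real data, the two calls behave like this:

```python
import pandas as pd

# Hypothetical sample data: the integers 0 through 9.
series = pd.Series(range(10))

first_five = series.head()    # no argument, so n defaults to 5
last_three = series.tail(3)   # explicit row count

print(first_five.tolist())    # [0, 1, 2, 3, 4]
print(last_three.tolist())    # [7, 8, 9]
```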
Properties and underlying data
# Convert the column labels to lowercase
df.columns = [x.lower() for x in df.columns]
The .array attribute returns the data held by an Index or a Series as a pandas ExtensionArray
s.index.array
s.array
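A short sketch of what these attributes return, assuming a small made-up Series. The to_numpy() call is included only for comparison, for the case where a plain NumPy array is what you want:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

values = s.array          # ExtensionArray wrapping the Series data
labels = s.index.array    # ExtensionArray wrapping the index labels
as_numpy = s.to_numpy()   # plain NumPy ndarray

print(type(values).__name__)
print(type(as_numpy).__name__)
```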
Series
- A Series is a one-dimensional labeled array
- The axis labels are collectively called the index
- Call pd.Series to create a Series:
s = pd.Series(data,index=index)
data can be: 1. a Python dictionary, 2. an ndarray, 3. a scalar value (for example, 5)
How index is used depends on the type of data
When data is an ndarray
- index must be the same length as data
- If index is not passed, a numeric index [0, ..., len(data)-1] is created.
# With the index parameter
s=pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
s
a 0.710396
b 1.597084
c 0.341957
d 0.467000
e 0.884691
dtype: float64
s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
# Without the index parameter
pd.Series(np.random.randn(5))
0 0.159028
1 -0.202146
2 -0.320935
3 0.029248
4 0.135631
dtype: float64
Note: pandas index labels do not have to be unique
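To illustrate the note on repeated labels, here is a small sketch with a made-up Series in which the label 'a' appears twice:

```python
import pandas as pd

# Hypothetical data: the label 'a' is used for two different values.
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

print(dup.index.is_unique)   # False
print(dup['a'].tolist())     # a repeated label selects every match: [1, 2]
print(dup['b'])              # a unique label still returns a scalar: 3
```

Operations that require unique labels, such as reindexing, raise an error on an index like this, so it is worth checking is_unique first.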
When data is a Python dictionary
- A Series can be instantiated from a dictionary
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)
b 1
a 0
c 2
dtype: int64
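- Note: when data is a dictionary and index is not set, the Series index follows the dictionary's insertion order, provided Python >= 3.6 and pandas >= 0.23;
if index is set, the values in data are pulled out by the index labels.
d = {'a': 0., 'b': 1., 'c': 2.}
# index not set: the index follows the dictionary's insertion order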
pd.Series(d)
a 0.0
b 1.0
c 2.0
dtype: float64
# index set: values are pulled from data by label; labels missing from data become NaN
pd.Series(d,index=['b','c','d','a'])
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
When data is a scalar
- An index must be provided
- The scalar value is repeated to match the length of the index
pd.Series(5.,index=['a','b','c','d','e'])
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
A Series behaves much like an ndarray: it works with most NumPy functions and supports slicing.
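A brief sketch of the ndarray-like behavior, assuming a small made-up Series. It uses iloc for the positional slice so that the intent is explicit:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

first_two = s.iloc[:2]               # positional slice; labels come along
above_median = s[s > s.median()]     # boolean mask, as with an ndarray
doubled = np.multiply(s, 2)          # NumPy ufuncs return a Series

print(first_two.index.tolist())      # ['a', 'b']
print(above_median.tolist())         # [3.0, 4.0]
print(type(doubled).__name__)        # Series
```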
get() method
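- Referencing a label that is not in the Series directly raises an exception
- get() looks up a label without raising; for a missing label it returns None or a default value you specify: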
s.get('f',np.nan)
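The difference between direct indexing and get() can be sketched as follows, assuming a made-up two-element Series that has no label 'f':

```python
import numpy as np
import pandas as pd

# Hypothetical sample data without the label 'f'.
s = pd.Series([1.0, 2.0], index=['a', 'b'])

try:
    s['f']                      # direct lookup of a missing label
    raised = False
except KeyError:
    raised = True

missing = s.get('f')            # no default given, so None is returned
fallback = s.get('f', np.nan)   # explicit default

print(raised, missing, fallback)
```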
A Series has a name attribute
s=pd.Series(np.random.randn(5),name='something')
s.name
The rename() method returns a renamed Series
s2=s.rename('different')
Note that s2 and s are different objects
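A quick check of this behavior, using a made-up Series: rename() leaves the original name untouched and hands back a new object.

```python
import pandas as pd

# Hypothetical sample data.
s = pd.Series([1.0, 2.0, 3.0], name='something')
s2 = s.rename('different')

print(s.name)      # something
print(s2.name)     # different
print(s2 is s)     # False
```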
DataFrame
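- A two-dimensional labeled data structure whose columns can have different types
Create a DataFrame from a dict of Series or a dict of dicts
- The resulting index is the union of the indexes of the individual Series
- If no columns are specified, the DataFrame's columns are the ordered list of dictionary keys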
d={
'one':pd.Series([1.,2.,3.],index=['a','b','c']),
'two':pd.Series([1.,2.,3.,4.],index=['a','b','c','d'])}
df = pd.DataFrame(d)
df
| | one | two |
|---|---|---|
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | 3.0 |
| d | NaN | 4.0 |
pd.DataFrame(d,index=['d','b','a'])
| | one | two |
|---|---|---|
| d | NaN | 4.0 |
| b | 2.0 | 2.0 |
| a | 1.0 | 1.0 |
pd.DataFrame(d,index=['d','b','a'],columns=['two','three'])
| | two | three |
|---|---|---|
| d | 4.0 | NaN |
| b | 2.0 | NaN |
| a | 1.0 | NaN |
The index and columns attributes are used to access row and column labels respectively
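As a small illustration of these two attributes, the sketch below rebuilds the dict-of-Series DataFrame used earlier in this section and reads its labels back:

```python
import pandas as pd

d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(d)

row_labels = df.index.tolist()       # union of the two Series indexes
column_labels = df.columns.tolist()  # dictionary keys, in order

print(row_labels)      # ['a', 'b', 'c', 'd']
print(column_labels)   # ['one', 'two']
```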
Create a DataFrame from a dict of ndarrays or lists
The arrays must all have the same length. If index is passed, it must be the same length as the arrays; if index is not passed, the result uses range(n) as the index, where n is the array length.
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
pd.DataFrame(d)
| | one | two |
|---|---|---|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |
pd.DataFrame(d,index=['a','b','c','d'])
| | one | two |
|---|---|---|
| a | 1.0 | 4.0 |
| b | 2.0 | 3.0 |
| c | 3.0 | 2.0 |
| d | 4.0 | 1.0 |
Create a DataFrame from a structured or record array
# 'S10' is a 10-byte string field; the older 'a10' spelling is not accepted by NumPy 2.x
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
data
array([(0, 0., b''), (0, 0., b'')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
# Assign values to the data array
data[:]=[(1,2.,'Hello'),(2,3.,"World")]
data
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
pd.DataFrame(data)
| | A | B | C |
|---|---|---|---|
| 0 | 1 | 2.0 | b'Hello' |
| 1 | 2 | 3.0 | b'World' |
pd.DataFrame(data,index=['first','second'])
| | A | B | C |
|---|---|---|---|
| first | 1 | 2.0 | b'Hello' |
| second | 2 | 3.0 | b'World' |
pd.DataFrame(data,columns=['C','A','B'])
| | C | A | B |
|---|---|---|---|
| 0 | b'Hello' | 1 | 2.0 |
| 1 | b'World' | 2 | 3.0 |
Create a DataFrame from a list of dicts
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)
| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20.0 |
Create a DataFrame from a dict of tuples
Passing a dict whose keys are tuples automatically creates a DataFrame with a multi-level index
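The text gives no example for this case, so here is a minimal sketch with made-up data. The outer tuple keys become a two-level column index, and the inner tuple keys become a two-level row index:

```python
import pandas as pd

# Hypothetical data: outer keys are column tuples, inner keys are row tuples.
data3 = {
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
    ('b', 'a'): {('A', 'C'): 5, ('A', 'B'): 6},
}
df_multi = pd.DataFrame(data3)

print(type(df_multi.columns).__name__)   # MultiIndex
print(type(df_multi.index).__name__)     # MultiIndex
print(df_multi.loc[('A', 'B'), ('a', 'b')])
```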
Extracting, adding, and deleting columns
A DataFrame behaves like a dict of Series that share one index. Getting, setting, and deleting (del, pop) columns works much as it does for a dictionary
When you insert a Series whose index differs from the DataFrame's, the Series is aligned to the DataFrame's index
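The dictionary-style operations and the alignment rule can be sketched together, assuming a small made-up DataFrame. The column names 'three' and 'aligned' are illustrative only:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]},
                  index=['a', 'b', 'c'])

df['three'] = df['one'] * df['two']   # set a column, like dict assignment
two = df.pop('two')                   # remove a column and return it
del df['three']                       # delete a column

# The new Series has labels 'a', 'b', 'z'. It is aligned to df's index:
# 'z' is discarded and 'c' has no match, so it becomes NaN.
df['aligned'] = pd.Series([10., 20., 30.], index=['a', 'b', 'z'])

print(df)
```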
By default, a new column is appended at the end of the DataFrame; the insert method lets you choose the position of the new column
# Argument 1 is the position, argument 2 is the column name, argument 3 is the data
df.insert(1,'bar',df['one'])
df
| | one | bar | two |
|---|---|---|---|
| a | 1.0 | 1.0 | 1.0 |
| b | 2.0 | 2.0 | 2.0 |
| c | 3.0 | 3.0 | 3.0 |
| d | NaN | NaN | 4.0 |
DataFrame operations
Boolean operators are supported
# First create two DataFrames, df1 and df2, with dtype=bool
df1 & df2 # and
df1 | df2 # or
df1 ^ df2 # xor
~df1 # not (unary minus is not defined for boolean data)
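The operators above need two boolean DataFrames to act on. A runnable sketch with made-up values, using ~ for negation:

```python
import pandas as pd

# Hypothetical boolean data with matching shape and labels.
df1 = pd.DataFrame({'a': [True, False, True], 'b': [False, True, True]})
df2 = pd.DataFrame({'a': [False, True, True], 'b': [True, False, False]})

both = df1 & df2          # element-wise and
either = df1 | df2        # element-wise or
exactly_one = df1 ^ df2   # element-wise xor
negated = ~df1            # element-wise not

print(both)
print(negated)
```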
Transpose
The T attribute (shorthand for the transpose() method) transposes the DataFrame, swapping rows and columns:
df[:5].T
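A small sketch of what transposing does to the labels, assuming a made-up three-row DataFrame:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]},
                  index=['a', 'b', 'c'])

t = df.T   # row labels become column labels and vice versa

print(df.shape, t.shape)     # (3, 2) (2, 3)
print(t.index.tolist())      # ['one', 'two']
print(t.columns.tolist())    # ['a', 'b', 'c']
```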
Basic properties of DataFrame
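df.shape # (number of rows, number of columns)
df.dtypes # dtype of each column
df.ndim # number of dimensions
df.index # row index
df.columns # column index
df.values # the values as a two-dimensional ndarray
df.info() # overview: row and column counts, column index, non-null count per column, column dtypes, memory usage
df.describe() # quick summary statistics: count, mean, standard deviation, min, quartiles, max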
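A summary of axis
NumPy and pandas follow the same convention: axis=0 runs down the rows, and axis=1 runs across the columns
A reduction with axis=0 therefore collapses the rows and gives one result per column, while axis=1 gives one result per row. In methods that take labels, such as drop, axis=0 refers to row labels and axis=1 to column labels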
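Because axis is easy to mix up, here is a small check, assuming a made-up 2x3 array, that compares a NumPy reduction with the same reduction on a DataFrame built from it:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 2 rows, 3 columns.
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
df = pd.DataFrame(arr, columns=['x', 'y', 'z'])

# axis=0 collapses the rows: one sum per column, in both libraries.
print(arr.sum(axis=0).tolist())   # [5, 7, 9]
print(df.sum(axis=0).tolist())    # [5, 7, 9]

# axis=1 collapses the columns: one sum per row, in both libraries.
print(arr.sum(axis=1).tolist())   # [6, 15]
print(df.sum(axis=1).tolist())    # [6, 15]

# For label-based methods, axis says which labels are meant.
without_x = df.drop('x', axis=1)      # drop a column label
without_row0 = df.drop(0, axis=0)     # drop a row label
```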
Advanced operations: grouping data
The core tools for grouping data:
- the groupby() method
- the groups attribute, which shows how the rows were grouped
df = pd.DataFrame({
'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
'price':[4,3,3,2.5,4,2],
'color':['red','yellow','yellow','green','green','green'],
'weight':[12,20,50,30,20,44]})
df
| | item | price | color | weight |
|---|---|---|---|---|
| 0 | Apple | 4.0 | red | 12 |
| 1 | Banana | 3.0 | yellow | 20 |
| 2 | Orange | 3.0 | yellow | 50 |
| 3 | Banana | 2.5 | green | 30 |
| 4 | Orange | 4.0 | green | 20 |
| 5 | Apple | 2.0 | green | 44 |
# Analyze the data by type of fruit
df.groupby(by='item')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F09329BCD0>
# Inspect how the rows were grouped
df.groupby(by='item').groups
{'Apple': [0, 5], 'Banana': [1, 3], 'Orange': [2, 4]}
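Beyond the groups attribute, a grouped object can also be inspected one group at a time. The sketch below rebuilds the fruit table from this section and uses get_group() and size(), which are standard GroupBy methods not mentioned in the original text:

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
    'price': [4, 3, 3, 2.5, 4, 2],
    'color': ['red', 'yellow', 'yellow', 'green', 'green', 'green'],
    'weight': [12, 20, 50, 30, 20, 44]})

grouped = df.groupby(by='item')

apples = grouped.get_group('Apple')   # the rows that belong to one group
sizes = grouped.size()                # number of rows in each group

print(apples.index.tolist())          # [0, 5]
print(sizes.to_dict())                # {'Apple': 2, 'Banana': 2, 'Orange': 2}
```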
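Grouped aggregation
# Compute the averages for each fruit; numeric_only=True skips the text column 'color',
# which pandas 2.x would otherwise refuse to average
df.groupby(by='item').mean(numeric_only=True)
| item | price | weight |
|---|---|---|
| Apple | 3.00 | 28.0 |
| Banana | 2.75 | 25.0 |
| Orange | 3.50 | 35.0 |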
df.groupby(by='item')['price'].mean()
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
# Compute the average weight of the fruit for each color
df.groupby(by='color')['weight'].mean()
color
green 31.333333
red 12.000000
yellow 35.000000
Name: weight, dtype: float64
dic = df.groupby(by='item')['price'].mean().to_dict()
# Map each item's average price back onto the source data
df['mean_p'] = df['item'].map(dic)
df
| | item | price | color | weight | mean_p |
|---|---|---|---|---|---|
| 0 | Apple | 4.0 | red | 12 | 3.00 |
| 1 | Banana | 3.0 | yellow | 20 | 2.75 |
| 2 | Orange | 3.0 | yellow | 50 | 3.50 |
| 3 | Banana | 2.5 | green | 30 | 2.75 |
| 4 | Orange | 4.0 | green | 20 | 3.50 |
| 5 | Apple | 2.0 | green | 44 | 3.00 |
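Advanced data aggregation
When the aggregation functions that pandas provides out of the box do not meet your needs, you can use advanced aggregation:
after grouping with groupby, pass a custom function to transform or apply to perform other computations.
transform and apply also accept a lambda expression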
def my_mean(s):
    # s holds the values of one group; add them up and divide by the count
    m_sum = 0
    for i in s:
        m_sum += i
    return m_sum / len(s)
df.groupby(by='item')['price'].transform(my_mean)
0 3.00
1 2.75
2 3.50
3 2.75
4 3.50
5 3.00
Name: price, dtype: float64
df.groupby(by='item')['price'].apply(my_mean)
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64