Pandas basic study notes

import pandas as pd
import numpy as np

Head and Tail

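- By default, 5 rows are displayed
- The number of rows to display can also be specified
# Show the first 5 rows (the default)
series.head()
# Show the last 3 rows
series.tail(3)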

Properties and underlying data

# Convert the column labels to lowercase
df.columns = [x.lower() for x in df.columns]

The .array attribute extracts the underlying data of an Index or Series as a pandas ExtensionArray; use to_numpy() when a NumPy ndarray is needed

s.index.array
s.array
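
As a rough illustration of the difference between .array and to_numpy(), the sketch below uses a small made-up Series (the values and labels are assumptions for the example). The concrete class returned by .array depends on the pandas version, so only its base type is checked.

```python
import numpy as np
import pandas as pd

# A small illustrative Series (values and labels are made up for this example)
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

values = s.array          # the Series data wrapped in a pandas ExtensionArray
labels = s.index.array    # the index labels, exposed the same way
as_numpy = s.to_numpy()   # a plain NumPy ndarray

print(isinstance(values, pd.api.extensions.ExtensionArray))
print(isinstance(as_numpy, np.ndarray))
print(list(labels))
```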

Series

- A Series is a labeled one-dimensional array
- The axis labels are collectively called the index
- Call pd.Series to create a Series:
s = pd.Series(data, index=index)

data can be: 1. a Python dictionary, 2. an ndarray, 3. a scalar value (for example, 5)

The index argument is used differently depending on the type of data

When data is an ndarray

- The length of index must match the length of data
- If no index argument is given, a numeric index is created, i.e. [0, ..., len(data)-1].
# With the index argument
s=pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
s
a    0.710396
b    1.597084
c    0.341957
d    0.467000
e    0.884691
dtype: float64
s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
# Without the index argument
pd.Series(np.random.randn(5))
0    0.159028
1   -0.202146
2   -0.320935
3    0.029248
4    0.135631
dtype: float64

Note: pandas index labels do not have to be unique
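
A minimal sketch of what repeated labels look like in practice, using a made-up Series named dup (an assumption for this example). Selecting a repeated label returns every matching row, while a label that occurs once still returns a scalar.

```python
import pandas as pd

# Illustrative Series with the label 'a' used twice
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

print(dup.index.is_unique)   # False, because 'a' appears twice

selected = dup['a']          # a Series holding both rows labeled 'a'
single = dup['b']            # a scalar, because 'b' appears only once

print(selected.tolist())
print(single)
```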

When data is a Python dictionary

- A Series can be instantiated from a dictionary
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)
b    1
a    0
c    2
dtype: int64



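- Note: when data is a dictionary and no index argument is given, the Series index follows the dictionary's insertion order if Python >= 3.6 and pandas >= 0.23;
  if the index argument is given, the values in data are looked up by the index labels, and labels missing from data get NaN.
d = {'a': 0., 'b': 1., 'c': 2.}
# No index argument: the index follows the dictionary's insertion order
pd.Series(d)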
a    0.0
b    1.0
c    2.0
dtype: float64
# With the index argument: values are looked up in data by index label
pd.Series(d, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

When data is a scalar

- An index must be provided
- The Series repeats the scalar value to match the length of the index
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

A Series behaves much like an ndarray: it works with most NumPy functions and also supports slicing.
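
The ndarray-like behavior can be sketched as follows, assuming a small Series with made-up values. The last line also shows one difference from a plain ndarray: arithmetic between two Series aligns on labels, so labels present on only one side produce NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data; the values are made up for this example
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=['a', 'b', 'c', 'd', 'e'])

first_three = s.iloc[:3]             # positional slice; labels a, b, c are kept
above_median = s[s > s.median()]     # boolean mask, as with an ndarray
doubled = s * 2                      # vectorized arithmetic
exp = np.exp(s)                      # NumPy functions return a Series

# Unlike an ndarray, arithmetic aligns on labels: 'a' and 'e' exist on
# only one side, so the result is NaN there
aligned = s.iloc[1:] + s.iloc[:-1]
print(aligned)
```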

get() method

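- Accessing a label that is not in the Series with s[label] raises a KeyError
- The get() method handles missing labels safely, returning None or a specified default value: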
s.get('f',np.nan)
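
To make the contrast concrete, here is a small sketch with a made-up two-element Series. It compares direct label access, which raises, with get(), which falls back to None or to the supplied default.

```python
import numpy as np
import pandas as pd

# Illustrative Series that has no label 'f'
s = pd.Series([1.0, 2.0], index=['a', 'b'])

try:
    s['f']                       # direct access to a missing label
except KeyError as exc:
    print('KeyError:', exc)

missing = s.get('f')             # None: no default was given
fallback = s.get('f', np.nan)    # the supplied default
present = s.get('a')             # existing labels behave like s['a']
print(missing, fallback, present)
```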

Series supports the name attribute

s=pd.Series(np.random.randn(5),name='something')
s.name

rename() renames the Series

s2=s.rename('different')

s2 and s are different objects
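
A quick way to check this, using a made-up Series: rename() with a scalar returns a new Series carrying the new name, and the original keeps its own name.

```python
import numpy as np
import pandas as pd

# Illustrative Series; the data is made up for this example
s = pd.Series(np.arange(3), name='something')
s2 = s.rename('different')

print(s.name)     # the original name is unchanged
print(s2.name)    # the new Series carries the new name
print(s2 is s)    # False: rename returned a new object
```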

DataFrame

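- A two-dimensional labeled data structure whose columns can have different types
- The resulting index is the union of the indexes of the individual Series
- If no columns are specified, the DataFrame columns are the ordered list of the dictionary keys

Generate a DataFrame from a dict of Series or a dict of dicts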

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
pd.DataFrame(d, index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

The index and columns attributes are used to access row and column labels respectively
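
For instance, with a DataFrame built from a dict of Series like the one in this section, the two attributes can be read as shown in this small sketch (tolist() is used only to make the labels easy to print).

```python
import pandas as pd

# A dict of Series with partly overlapping indexes
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

row_labels = df.index.tolist()      # union of the Series indexes
col_labels = df.columns.tolist()    # the dictionary keys
print(row_labels)
print(col_labels)
```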

Generate a DataFrame from a dict of ndarrays or lists

The arrays must all have the same length. If the index argument is passed, its length must match the array length; if it is not passed, the resulting index is range(n), where n is the array length.

d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
pd.DataFrame(d)
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
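
The length rules above can be checked with a short sketch. try_build is a hypothetical helper written only for this example (it is not part of pandas); it returns either the DataFrame or the text of the ValueError that pandas raises.

```python
import pandas as pd

def try_build(data, index=None):
    """Hypothetical helper: return the DataFrame, or the error text on failure."""
    try:
        return pd.DataFrame(data, index=index)
    except ValueError as exc:
        return f'ValueError: {exc}'

# Equal-length lists and no index: the index defaults to range(n)
ok = try_build({'one': [1., 2., 3.], 'two': [3., 2., 1.]})

# Lists of different lengths are rejected
ragged = try_build({'one': [1., 2., 3.], 'two': [3., 2.]})

# An index whose length differs from the list length is rejected as well
bad_index = try_build({'one': [1., 2., 3.]}, index=['a', 'b'])

print(ok.index.tolist())
print(ragged)
print(bad_index)
```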

Generate a DataFrame from a structured or record array

# 'S10' is a 10-byte string field (the older 'a10' alias was removed in NumPy 2.0)
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
data
array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
# Assign values to the data array
data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
data
array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
pd.DataFrame(data)
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'
pd.DataFrame(data, index=['first', 'second'])
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'
pd.DataFrame(data, columns=['C', 'A', 'B'])
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

Generate a DataFrame from a list of dicts

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)
   a   b     c
0  1   2   NaN
1  5  10  20.0

Generate a DataFrame from a dict of tuples

Passing a dict whose keys are tuples automatically creates a DataFrame with a multi-level index
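
A minimal sketch of this, assuming a nested dict named nested whose outer and inner keys are both tuples (the keys and values are made up for the example). The outer keys become multi-level column labels and the inner keys become multi-level row labels.

```python
import pandas as pd

# Outer tuple keys -> column levels, inner tuple keys -> row levels
nested = {
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
    ('b', 'a'): {('A', 'C'): 5, ('A', 'B'): 6},
}
df = pd.DataFrame(nested)

print(df)
print(isinstance(df.columns, pd.MultiIndex))
print(isinstance(df.index, pd.MultiIndex))

# One cell is addressed by a row tuple and a column tuple
cell = df.loc[('A', 'B'), ('a', 'b')]
```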

Extract, add, delete columns

A DataFrame behaves like a dict of Series that share one index. Extracting, setting, and deleting (del, pop) columns work much as they do for a dictionary

When a Series whose index differs from the DataFrame's is inserted, it is aligned to the DataFrame's index
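
The dictionary-style column operations and the alignment rule can be sketched as follows, using a small made-up DataFrame. The column names 'three' and 'other' and the label 'z' are assumptions introduced for the example.

```python
import pandas as pd

# Illustrative DataFrame; the data is made up for this example
df = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]},
                  index=['a', 'b', 'c'])

df['three'] = df['one'] * df['two']   # set a new column, like dict assignment
one = df['one']                       # extract a column as a Series
two = df.pop('two')                   # remove a column and get it back
del df['three']                       # delete a column in place

# The inserted Series has labels b, c, z: values are matched by label,
# 'a' gets NaN, and 'z' is dropped because df has no such row
df['other'] = pd.Series([10., 20., 30.], index=['b', 'c', 'z'])
print(df)
```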

By default, a new column is appended at the end of the DataFrame; the insert method lets you specify the position of the new column

# Argument 1 is the position, argument 2 is the column name, argument 3 is the data
df.insert(1, 'bar', df['one'])
df
   one  bar  two
a  1.0  1.0  1.0
b  2.0  2.0  2.0
c  3.0  3.0  3.0
d  NaN  NaN  4.0

DataFrame operations

Boolean operators are supported

# First create two DataFrames of dtype=bool, df1 and df2
df1 & df2 # and
df1 | df2 # or
df1 ^ df2 # exclusive or
~df1      # not (unary minus is not supported for boolean data)
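
A runnable sketch of the boolean operators, assuming two small made-up boolean DataFrames. Element-wise NOT is written with ~, which is the operator pandas supports for boolean data.

```python
import pandas as pd

# Two illustrative DataFrames of dtype=bool
df1 = pd.DataFrame({'a': [True, False, True], 'b': [False, False, True]})
df2 = pd.DataFrame({'a': [False, True, True], 'b': [True, False, True]})

both = df1 & df2          # element-wise AND
either = df1 | df2        # element-wise OR
exactly_one = df1 ^ df2   # element-wise XOR
negated = ~df1            # element-wise NOT

print(both)
print(negated)
```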

Transpose

The T attribute (equivalent to the transpose() method) transposes the DataFrame:

df[:5].T

Basic properties of DataFrame

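df.shape # number of rows and columns
df.dtypes # data type of each column
df.ndim # number of dimensions
df.index # row index
df.columns # column index
df.values # the values as a two-dimensional ndarray
df.info() # overview: number of rows and columns, column index, non-null count per column, column types, memory usage
df.describe() # quick summary statistics: count, mean, standard deviation, minimum, quartiles, maximum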

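A summary of axis

In both NumPy and pandas, axis=0 is the row axis (it runs down the rows, along the index) and axis=1 is the column axis (it runs across the columns)

A reduction such as sum(axis=0) therefore collapses the rows and returns one value per column, which is why axis=0 is easily misread as "column". In pandas, drop(labels, axis=0) removes rows and drop(labels, axis=1) removes columns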
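
The behavior of the axis argument can be checked directly with a small sketch. The array and the column names x, y, z are made up for the example; the same data is summed along each axis in NumPy and in pandas, and drop() is shown for comparison.

```python
import numpy as np
import pandas as pd

# 2 rows, 3 columns; the values are made up for this example
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
df = pd.DataFrame(arr, columns=['x', 'y', 'z'])

np_axis0 = arr.sum(axis=0)   # collapses the rows: one value per column
np_axis1 = arr.sum(axis=1)   # collapses the columns: one value per row
pd_axis0 = df.sum(axis=0)    # same direction as NumPy: one value per column
pd_axis1 = df.sum(axis=1)    # one value per row

without_row = df.drop(0, axis=0)     # axis=0 targets row labels
without_col = df.drop('x', axis=1)   # axis=1 targets column labels

print(np_axis0, np_axis1)
print(pd_axis0.tolist(), pd_axis1.tolist())
```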

Advanced operations: grouping data

The core of grouping:

- The groupby() function
- The groups attribute shows how the rows were grouped
df = pd.DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                   'price': [4, 3, 3, 2.5, 4, 2],
                   'color': ['red', 'yellow', 'yellow', 'green', 'green', 'green'],
                   'weight': [12, 20, 50, 30, 20, 44]})
df
     item  price   color  weight
0   Apple    4.0     red      12
1  Banana    3.0  yellow      20
2  Orange    3.0  yellow      50
3  Banana    2.5   green      30
4  Orange    4.0   green      20
5   Apple    2.0   green      44
# Analyze the data by kind of fruit
df.groupby(by='item')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F09329BCD0>
# View the detailed grouping
df.groupby(by='item').groups
{'Apple': [0, 5], 'Banana': [1, 3], 'Orange': [2, 4]}
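
Besides the groups attribute, a grouped object can be inspected in a few other ways. The sketch below rebuilds a reduced version of the fruit table (only the item and price columns) and uses size(), get_group(), and plain iteration, which are standard GroupBy features.

```python
import pandas as pd

# Reduced copy of the fruit table, enough to show the grouping
df = pd.DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                   'price': [4, 3, 3, 2.5, 4, 2]})
grouped = df.groupby(by='item')

sizes = grouped.size()               # number of rows in each group
apples = grouped.get_group('Apple')  # the rows that belong to one group

# Iterating yields (group key, sub-DataFrame) pairs
for name, group in grouped:
    print(name, group.index.tolist())
```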

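Grouped aggregation

# Compute the mean of the numeric columns for each kind of fruit
# (numeric_only=True skips the non-numeric color column, which pandas >= 2.0 would otherwise reject)
df.groupby(by='item').mean(numeric_only=True)
        price  weight
item
Apple    3.00    28.0
Banana   2.75    25.0
Orange   3.50    35.0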
df.groupby(by='item')['price'].mean()
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
# Compute the average weight of the fruit for each color
df.groupby(by='color')['weight'].mean()
color
green     31.333333
red       12.000000
yellow    35.000000
Name: weight, dtype: float64
dic = df.groupby(by='item')['price'].mean().to_dict()
# Add the computed average price back to the source data
df['mean_p'] = df['item'].map(dic)
df
     item  price   color  weight  mean_p
0   Apple    4.0     red      12    3.00
1  Banana    3.0  yellow      20    2.75
2  Orange    3.0  yellow      50    3.50
3  Banana    2.5   green      30    2.75
4  Orange    4.0   green      20    3.50
5   Apple    2.0   green      44    3.00

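Advanced data aggregation

When the aggregation functions already built into pandas cannot meet our actual needs, we can use advanced data aggregation: after grouping with groupby, pass a custom function to transform or apply to perform other computations.

A lambda expression can also be passed to transform and apply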

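def my_mean(s):
    # Custom aggregation: the arithmetic mean of the values in one group
    m_sum = 0
    for i in s:
        m_sum += i
    return m_sum / len(s)
# transform returns a result aligned with the original rows
df.groupby(by='item')['price'].transform(my_mean)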
0    3.00
1    2.75
2    3.50
3    2.75
4    3.50
5    3.00
Name: price, dtype: float64
# apply returns one value per group
df.groupby(by='item')['price'].apply(my_mean)
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
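
As noted earlier, transform and apply also accept a lambda expression. The sketch below rebuilds the item and price columns of the fruit table and computes two illustrative quantities chosen for this example: each price's deviation from its group mean (transform, one value per row) and the price range of each group (apply, one value per group).

```python
import pandas as pd

# Reduced copy of the fruit table used in this section
df = pd.DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                   'price': [4, 3, 3, 2.5, 4, 2]})
prices = df.groupby(by='item')['price']

# transform: the result has one value per original row
deviation = prices.transform(lambda s: s - s.mean())

# apply: here the lambda returns a scalar, so the result has one value per group
price_range = prices.apply(lambda s: s.max() - s.min())

print(deviation.tolist())
print(price_range.to_dict())
```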

Origin blog.csdn.net/m0_51210480/article/details/111770405