Article directory
1 Group aggregation
1.1 Data aggregation and grouping
The simple aggregation methods can give us a general feel for a data set, but we often need to aggregate conditionally on some label or index. This is where groupby comes in. Although the name "group by" is borrowed from a command in the SQL database language, it is perhaps more illuminating to think of it in the terms coined by Hadley Wickham, author of many popular R packages: split, apply, and combine.
Split, Apply and Combine
[Figure: a classic split-apply-combine operation, where "apply" is a summation function.]
The figure describes the GroupBy process. The split step breaks the DataFrame into groups according to the specified key. The apply step applies a function to each group, usually an aggregation, transformation, or filter. The combine step merges the per-group results into a single output.
Although we could achieve the same result with the masking, aggregation, and merging operations introduced earlier, the important realization is that the intermediate splits never need to be materialized explicitly. GroupBy can (often) compute the sum, mean, count, minimum, and other aggregates for each group in a single line of code. Its power is that it abstracts away these steps: the user does not need to think about how the computation happens under the hood, only about the operation as a whole.
DataFrame.groupby(by=None, as_index=True, ...)
by : the key(s) to group by; determines how the rows are grouped
as_index : for aggregated output, return an object indexed by the group keys (DataFrame only)
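A quick sketch of the as_index parameter (the small table and variable names here are illustrative only):

```python
import pandas as pd

demo = pd.DataFrame({
    'fruit': ['apple', 'banana', 'apple'],
    'price': [8.5, 6.8, 7.8]})

# Default as_index=True: the group keys become the result's index
g1 = demo.groupby('fruit')['price'].sum()

# as_index=False: the group keys stay as an ordinary column
g2 = demo.groupby('fruit', as_index=False)['price'].sum()

print(g1)
print(g2)
```

With as_index=False the result looks like a flat table, which is often convenient for further merging.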
df = pd.DataFrame(
{
'fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],  # name
'color': ['red', 'yellow', 'yellow', 'cyan', 'cyan'],  # color
'price': [8.5, 6.8, 5.6, 7.8, 6.4],  # unit price
'count': [3, 4, 6, 5, 2]})  # quantity bought, in jin (500 g)
df['total'] = df['price'] * df['count']
print(df)
Output:
fruit color price count total
0 apple red 8.5 3 25.5
1 banana yellow 6.8 4 27.2
2 orange yellow 5.6 6 33.6
3 apple cyan 7.8 5 39.0
4 banana cyan 6.4 2 12.8
Note that df.groupby('fruit') returns not a DataFrame but a DataFrameGroupBy object. The magic of this object is that you can think of it as a special view of the DataFrame, with the groups hidden inside; nothing is computed until an aggregation function is applied. This "lazy evaluation" approach means most common aggregations can be implemented very efficiently, in a way that is almost transparent to the user.
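A minimal sketch of this laziness (data here is a shortened, illustrative version of the example): creating the GroupBy object is cheap, and nothing is aggregated until a method such as mean() is called.

```python
import pandas as pd

frame = pd.DataFrame({
    'fruit': ['apple', 'banana', 'apple'],
    'price': [8.5, 6.8, 7.8]})

grouped = frame.groupby('fruit')     # split only; nothing is computed yet
print(type(grouped).__name__)        # DataFrameGroupBy

result = grouped['price'].mean()     # apply + combine happen here
print(result)
```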
To get this result, you can apply the accumulation function on the DataFrameGroupBy object, which will complete the corresponding apply/combine steps and produce the result:
print(df.groupby('fruit').sum(numeric_only=True))
"""
        price  count  total
fruit
apple    16.3      8   64.5
banana   13.2      6   40.0
orange    5.6      6   33.6
"""
sum() is just one of many available methods. You can use any aggregation function from pandas or NumPy, or any valid DataFrame method.
GroupBy object
The most important operations on a GroupBy object are probably aggregate, filter, transform, and apply (aggregation, filtering, transformation, application). These are covered in more detail below; first, some basic GroupBy operations.
(1) Column indexing. Like a DataFrame, a GroupBy object supports column indexing, returning a modified GroupBy object, for example:
import pandas as pd
df = pd.DataFrame({
'fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],
'color': ['red', 'yellow', 'yellow', 'cyan', 'cyan'],
'price': [8.5, 6.8, 5.6, 7.8, 6.4],
'count': [3, 4, 6, 5, 2]})
df['total'] = df['price'] * df['count']
# inspect the type
print(df.groupby('fruit'))
print(df.groupby('fruit').sum(numeric_only=True))
# iterate over the groups
for name, group in df.groupby('fruit'):
    print('group name:\t', name)    # the group key
    print('group data:\n', group)   # the data block
# aggregate the count and total columns per group
print(df.groupby('fruit')[['count', 'total']].apply(lambda x: x.sum()))
Aggregation (agg)

Function name | Description
---|---
count | Number of non-NA values in the group
sum | Sum of non-NA values
mean | Mean of non-NA values
median | Median of non-NA values
std, var | Standard deviation and variance
min, max | Minimum and maximum of non-NA values
prod | Product of non-NA values
first, last | First and last non-NA values

For example:
import numpy as np
# apply a different aggregation to each column
print(df.groupby('fruit').aggregate({
    'price': 'min', 'count': 'max'}))
"""
Suppose we want the difference between the highest and lowest price of each fruit:
1. None of the aggregation functions in the table above expresses this directly,
   so we need a custom aggregation function.
2. We then pass the custom function to agg on the grouped object.
"""
# define a function that computes the spread (max - min)
def diff_value(arr):
    return arr.max() - arr.min()
print(df.groupby('fruit')['price'].agg(diff_value))
"""
# output of aggregate({'price': 'min', 'count': 'max'}):
        price  count
fruit
apple     7.8      5
banana    6.4      4
orange    5.6      6
# output of agg(diff_value):
fruit
apple     0.7
banana    0.4
orange    0.0
Name: price, dtype: float64
"""
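agg also accepts a list of functions, producing one column per function; a small sketch (the price data is a trimmed-down version of the example above):

```python
import pandas as pd

fruits = pd.DataFrame({
    'fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],
    'price': [8.5, 6.8, 5.6, 7.8, 6.4]})

# One row per group, one column per aggregation function
stats = fruits.groupby('fruit')['price'].agg(['min', 'max', 'mean'])
print(stats)
```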
filter
# after grouping we can also filter out whole groups with a custom rule
def filter_func(x):
    print('price:\t', [x['price']])
    print('price:\t', type(x['price']))
    print('price:\t', x['price'].mean())
    return x['price'].mean() > 6  # keep groups whose mean price exceeds 6

print(df)
print('mean price per group: \t', df.groupby('fruit')['price'].mean())
# filtering after grouping does not modify the original data
print(df.groupby('fruit').filter(filter_func))
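transform, mentioned above alongside filter, returns a result aligned with the original rows rather than one row per group. A minimal sketch (with an illustrative fruit/price table) that attaches each group's mean price to every row:

```python
import pandas as pd

fruits = pd.DataFrame({
    'fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],
    'price': [8.5, 6.8, 5.6, 7.8, 6.4]})

# transform broadcasts the per-group result back to the original shape,
# so the output has the same length as the input
fruits['group_mean'] = fruits.groupby('fruit')['price'].transform('mean')
print(fruits)
```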
1.2 Data merging
1.2.1 pd.merge data merge
- Joins rows of different DataFrames on one or more keys
- Analogous to database join operations
- pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
- left: the DataFrame on the left side of the merge
- right: the DataFrame on the right side of the merge
- how: join method; one of 'inner' (default), 'outer', 'left', 'right'
- on: column name(s) to join on, which must exist in both tables; if omitted, the intersection of the column names of left and right is used as the join key
- left_on: column(s) in the left DataFrame to use as join keys
- right_on: column(s) in the right DataFrame to use as join keys
- Inner join 'inner': uses the intersection of the keys in both tables
- Outer join 'outer': uses the union of the keys in both tables
- Left join 'left': uses all keys from the left table
- Right join 'right': uses all keys from the right table
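left_on/right_on matter when the key columns are named differently in the two tables. A small sketch with hypothetical staff/sales tables (names and figures are made up for illustration):

```python
import pandas as pd

staff = pd.DataFrame({'emp_id': [1, 2, 3], 'name': ['An', 'Bo', 'Ci']})
sales = pd.DataFrame({'employee': [1, 1, 3], 'amount': [100, 200, 50]})

# The keys live in differently named columns, so join with left_on/right_on
merged = pd.merge(staff, sales, left_on='emp_id', right_on='employee')
print(merged)
```

With the default inner join, employee 2 ('Bo') has no sales and drops out of the result.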
import pandas as pd
import numpy as np
left = pd.DataFrame({
'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({
'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on='key')  # join on the key column
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
left = pd.DataFrame({
'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({
'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])  # merge on multiple keys
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
# left join
left = pd.DataFrame({
'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({
'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, how='left', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
# right join
left = pd.DataFrame({
'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({
'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, how='right', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
The how parameter specifies the join method. The default is 'inner', meaning the keys in the result are the intersection of both tables; 'outer' makes them the union:
left = pd.DataFrame({
'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({
'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left,right,how='outer',on=['key1','key2'])
"""
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
"""
Handle duplicate column names
- Parameter suffixes: default is _x, _y
df_obj1 = pd.DataFrame({
'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({
'key': ['a', 'b', 'd'],
'data' : np.random.randint(0,10,3)})
print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
# output
"""
data_left key data_right
0 9 b 1
1 5 b 1
2 1 b 1
3 2 a 8
4 2 a 8
5 5 a 8
"""
Join by index
- Parameter left_index=True or right_index=True
# join on the index
df_obj1 = pd.DataFrame({
'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({
'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])
print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))
# output
"""
data1 key data2
0 3 b 6
1 4 b 6
6 8 b 6
2 6 a 0
4 3 a 0
5 0 a 0
"""
1.2.2 pd.concat
Concatenates multiple objects along an axis.
NumPy's concatenate works the same way:
import numpy as np
import pandas as pd
arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))
print(arr1)
print(arr2)
print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))
# output
"""
# print(arr1)
[[3 3 0 8]
[2 0 3 1]
[4 8 8 2]]
# print(arr2)
[[6 8 7 3]
[1 6 8 7]
[1 4 7 1]]
# print(np.concatenate([arr1, arr2]))
[[3 3 0 8]
[2 0 3 1]
[4 8 8 2]
[6 8 7 3]
[1 6 8 7]
[1 4 7 1]]
# print(np.concatenate([arr1, arr2], axis=1))
[[3 3 0 8 6 8 7 3]
[2 0 3 1 1 6 8 7]
[4 8 8 2 1 4 7 1]]
"""
pd.concat
- Pay attention to the axis direction; the default is axis=0
- join specifies the merge method; the default is 'outer'
- When concatenating Series, check whether the row index contains duplicates
df1 = pd.DataFrame(np.arange(6).reshape(3,2),index=list('abc'),columns=['one','two'])
df2 = pd.DataFrame(np.arange(4).reshape(2,2)+5,index=list('ac'),columns=['three','four'])
print(pd.concat([df1, df2]))  # default: outer join, axis=0
print(pd.concat([df1, df2], axis='columns'))  # concatenate along axis=1
print(pd.concat([df1, df2], axis=1, join='inner'))
"""
one two three four
a 0.0 1.0 NaN NaN
b 2.0 3.0 NaN NaN
c 4.0 5.0 NaN NaN
a NaN NaN 5.0 6.0
c NaN NaN 7.0 8.0
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
one two three four
a 0 1 5 6
c 4 5 7 8
"""
1.3 Reshaping
1.3.1 stack
- Pivots the column index into the row index, producing a hierarchical index
- DataFrame -> Series
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'],
name='number'))
print(data)
result = data.stack()
print(result)
# output:
"""
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
"""
1.3.2 unstack
- Unpacks a level of the hierarchical index
- Series -> DataFrame
- Operates on the innermost index level by default
# by default the innermost index level is unstacked
print(result.unstack())
# level selects which index level to operate on
print(result.unstack(level=0))
"""
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
"""
2 Time series
Time series data is an important form of structured data. Anything observed or measured at many points in time forms a time series. Many time series have a fixed frequency, meaning the data points occur at regular intervals (such as every 15 seconds ...). Time series can also be irregular. How time series data is marked depends on the application. It mainly comes in the following flavors:
- Timestamp, a specific moment.
- Fixed period (period), such as January 2007 or the whole year 2010.
- Time interval (interval), represented by start and end timestamps. Period can be thought of as a special case of interval.
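These three notions map onto pandas objects roughly as follows (a sketch; the dates are arbitrary):

```python
import pandas as pd

ts = pd.Timestamp('2023-11-01 21:25')   # timestamp: a specific moment
p = pd.Period('2023-11', freq='M')      # period: a fixed span (November 2023)
iv = pd.Interval(pd.Timestamp('2023-11-01'),
                 pd.Timestamp('2023-11-30'))  # interval: start/end timestamps

print(ts)
print(p, p.start_time, p.end_time)
print(iv)
```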
2.1 python built-in time module
The Python standard library includes data types for date and time data, as well as calendar-related functionality. We will mainly use the datetime, time, and calendar modules. datetime.datetime (often imported simply as datetime) is the most widely used type:
from datetime import datetime
now = datetime.now()
now
"""
datetime.datetime(2023, 11, 1, 21, 25, 38, 531179)
"""
now.year, now.month, now.day
"""
(2023, 11, 1) # returns a tuple
"""
datetime stores the date and time down to the microsecond. timedelta represents the difference between two datetime objects:
delta = datetime(2023, 1, 7) - datetime(2022, 6, 24, 8, 15)
delta
"""
datetime.timedelta(days=196, seconds=56700)
"""
delta.days
# output:
"""
196
"""
from datetime import timedelta
start = datetime(2022, 1, 7)
start + timedelta(31)
"""
datetime.datetime(2022, 2, 7, 0, 0)
"""
2.1.2 Converting between strings and datetime
stamp = datetime(2023, 10, 1)
stamp
# output
datetime.datetime(2023, 10, 1, 0, 0)
"""convert a datetime object to a string"""
# plain conversion
str(stamp)
# formatted
stamp.strftime("%Y-%m-%d %H:%M:%S")
# output:
'2023-10-01 00:00:00'
"""convert a string to a datetime object"""
datetime.strptime('2023/5/20', '%Y/%m/%d')
# output
datetime.datetime(2023, 5, 20, 0, 0)
d = ['2023/7/20', '2023/7/21', '2023/7/22']
pd.to_datetime(d)  # the result is a DatetimeIndex
index = pd.to_datetime(d + [None])
index  # NaT marks a missing value in a time series
pd.isnull(index)
index.dropna()  # drop the NaT missing values
2.2 Time series
The strftime/strptime format codes:

Symbol | Meaning | Example
---|---|---
%a | Abbreviated weekday name | 'Wed'
%A | Full weekday name | 'Wednesday'
%w | Weekday as a number, 0 (Sunday) to 6 (Saturday) | '3'
%d | Day of the month (zero-padded) | '13'
%b | Abbreviated month name | 'Jan'
%B | Full month name | 'January'
%m | Month as a number | '01'
%y | Year without century | '16'
%Y | Year with century | '2016'
%H | Hour (24-hour clock) | '17'
%I | Hour (12-hour clock) | '05'
%p | AM or PM | 'PM'
%M | Minute | '00'
%S | Second | '00'
%f | Microsecond | '000000'
%z | UTC offset for timezone-aware objects | '-0500'
%Z | Time zone name | 'EST'
%j | Day of the year | '013'
%W | Week of the year | '02'
%c | Locale's date and time representation | 'Wed Jan 13 17:00:00 2016'
%x | Locale's date representation | '01/13/16'
%X | Locale's time representation | '17:00:00'
%% | A literal '%' character | '%'
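The codes in this table compose freely; a round-trip sketch between datetime and string:

```python
from datetime import datetime

dt = datetime(2016, 1, 13, 17, 0, 0)

# datetime -> string with strftime
s = dt.strftime('%Y-%m-%d %H:%M:%S')
print(s)  # 2016-01-13 17:00:00

# string -> datetime with strptime, using the same format
back = datetime.strptime(s, '%Y-%m-%d %H:%M:%S')
print(back == dt)  # True
```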
dates = [datetime(2020, 12, 12), datetime(2020, 12, 13),
datetime(2020, 12, 14), datetime(2020, 12, 15),
datetime(2020, 12, 16)]
ts = pd.Series(np.random.randn(5), index=dates)
ts  # a Series with a time index
# output
2020-12-12 0.376037
2020-12-13 0.426828
2020-12-14 0.050578
2020-12-15 0.302734
2020-12-16 -0.068885
dtype: float64
ts.index  # the index is a DatetimeIndex
Output
DatetimeIndex(['2020-12-12', '2020-12-13', '2020-12-14', '2020-12-15',
'2020-12-16'],
dtype='datetime64[ns]', freq=None)
2.2.1 Selecting values
ts
2020-12-12 0.376037
2020-12-13 0.426828
2020-12-14 0.050578
2020-12-15 0.302734
2020-12-16 -0.068885
dtype: float64
ts[::2]  # every second element (step of 2)
2020-12-12 0.376037
2020-12-14 0.050578
2020-12-16 -0.068885
dtype: float64
2.2.2 Slicing by time
ts1 = pd.Series(
data=np.random.randn(1000),
index=pd.date_range('2020/01/01', periods=1000)
)
ts1
2020-01-01 -0.700438
2020-01-02 -1.961004
2020-01-03 -0.226558
2020-01-04 -0.594778
2020-01-05 -1.394763
...
2022-09-22 0.025743
2022-09-23 0.784406
2022-09-24 -0.930995
2022-09-25 0.974117
2022-09-26 -1.625869
Freq: D, Length: 1000, dtype: float64
ts1['2021-05-18': '2021-05-27']  # slicing with date strings
2021-05-18 -0.487663
2021-05-19 -0.529925
2021-05-20 -0.316952
2021-05-21 0.476325
2021-05-22 -1.006280
2021-05-23 0.438202
2021-05-24 1.505284
2021-05-25 0.523409
2021-05-26 -1.139620
2021-05-27 0.573387
Freq: D, dtype: float64
2.2.3 Specifying a date range
pd.date_range('2020-01-01', '2023-08-28')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
'2020-01-09', '2020-01-10',
...
'2023-08-19', '2023-08-20', '2023-08-21', '2023-08-22',
'2023-08-23', '2023-08-24', '2023-08-25', '2023-08-26',
'2023-08-27', '2023-08-28'],
dtype='datetime64[ns]', length=1336, freq='D')
pd.date_range(start='2020-01-01', periods=100)  # 100 days forward from the start date
pd.date_range(end='2023-08-28', periods=100)  # 100 days backward from the end date
Resampling and frequency conversion
Resampling refers to the process of converting a time series from one frequency to another. Aggregating high-frequency data to a lower frequency is called downsampling, while converting low-frequency data to a higher frequency is called upsampling. Not all resampling falls into these two categories: converting W-WED (weekly on Wednesday) to W-FRI, for example, is neither downsampling nor upsampling.
All pandas objects have a resample method, which is the main function for various frequency conversion tasks. resample has an API similar to groupby. Calling resample can group data, and then call an aggregate function.
resample is a flexible and efficient method that can be used to process very large time series.
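A minimal downsampling sketch (dates and values are arbitrary): ten days of daily data aggregated into weekly sums.

```python
import numpy as np
import pandas as pd

# Ten consecutive days of daily data, starting on a Monday
daily = pd.Series(np.arange(10, dtype=float),
                  index=pd.date_range('2023-01-02', periods=10, freq='D'))

# Downsample: daily -> weekly sums ('W' bins end on Sunday by default);
# resample groups the rows into time bins, then the aggregation combines them
weekly = daily.resample('W').sum()
print(weekly)
```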
pd.date_range(start='2023-08-28', periods=10, freq='BH')  # freq sets the frequency ('BH' = business hours)
Output:
DatetimeIndex(['2023-08-28 09:00:00', '2023-08-28 10:00:00',
'2023-08-28 11:00:00', '2023-08-28 12:00:00',
'2023-08-28 13:00:00', '2023-08-28 14:00:00',
'2023-08-28 15:00:00', '2023-08-28 16:00:00',
'2023-08-29 09:00:00', '2023-08-29 10:00:00'],
dtype='datetime64[ns]', freq='BH')