Day 8: Group aggregation and time series

1 Group aggregation

1.1 Data aggregation and grouping

Simple aggregations can give us a general feel for a data set, but we often need to aggregate conditionally on some label or index value: this is what groupby does. Although the name "group by" is borrowed from a command in the SQL database language, it is perhaps more illuminating to think of it in the terms coined by Hadley Wickham, author of many popular R packages: split, apply, and combine.

Split, apply, and combine

The GroupBy process works as follows: the split step breaks the DataFrame into groups according to the specified key(s); the apply step applies a function to each group, usually an aggregation, transformation, or filtering function (in the classic example, "apply" is simply a summation); the combine step merges the results of all groups into a single output.

Although we could achieve the same thing with the masking, aggregation, and merging operations introduced earlier, the key insight is that the intermediate splits never need to be materialized explicitly: GroupBy can (often) compute the sum, mean, count, minimum, and other aggregates for every group in a single line of code. The purpose of GroupBy is to abstract these steps away: the user does not need to know how the computation happens under the hood, only that the operation is applied as a whole.

groupby(by=None, as_index=True)

by: the key(s) to group by; determines the groups of the GroupBy.

as_index: for aggregated output, return an object indexed by the group labels (DataFrame only).

import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],  # name
    'color': ['red', 'yellow', 'yellow', 'cyan', 'cyan'],       # color
    'price': [8.5, 6.8, 5.6, 7.8, 6.4],                         # unit price
    'count': [3, 4, 6, 5, 2]})                                  # quantity bought, in jin (500 g)

df['total'] = df['price'] * df['count']
print(df)

Output:

    fruit   color  price  count  total
0   apple     red    8.5      3   25.5
1  banana  yellow    6.8      4   27.2
2  orange  yellow    5.6      6   33.6
3   apple    cyan    7.8      5   39.0
4  banana    cyan    6.4      2   12.8

Note that when we call df.groupby('fruit'), the return value is not a DataFrame but a DataFrameGroupBy object. The magic of this object is that you can think of it as a special view of the DataFrame with the groups hidden inside; nothing is computed until an aggregation function is applied. This "lazy evaluation" approach means most common aggregations can be implemented very efficiently, in a way that is almost transparent to the user.
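
For example (the hexadecimal address in the repr will vary from run to run):

print(df.groupby('fruit'))
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B33FFA60D0>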

To get a result, apply an aggregation function to the DataFrameGroupBy object; this performs the corresponding apply/combine steps and produces the output:

print(df.groupby('fruit').sum(numeric_only=True))
"""
        price  count  total
fruit                      
apple    16.3      8   64.5
banana   13.2      6   40.0
orange    5.6      6   33.6
"""

sum() is just one of many available methods. You can use almost any aggregation function from Pandas or NumPy, as well as any valid DataFrame method.
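
The as_index parameter described earlier changes the shape of such results; a minimal sketch that keeps 'fruit' as an ordinary column instead of the index:

# as_index=False returns the group labels as a regular column
print(df.groupby('fruit', as_index=False)['total'].sum())
"""
    fruit  total
0   apple   64.5
1  banana   40.0
2  orange   33.6
"""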

GroupBy objects

The most important operations on a GroupBy object are probably aggregate, filter, transform, and apply. These are covered in detail below; first, some basic ways of working with a GroupBy object.

(1) Column selection. Like a DataFrame, a GroupBy object supports selecting columns, which returns a modified GroupBy object, for example:

import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],
    'color': ['red', 'yellow', 'yellow', 'cyan', 'cyan'],
    'price': [8.5, 6.8, 5.6, 7.8, 6.4],
    'count': [3, 4, 6, 5, 2]})
df['total'] = df['price'] * df['count']

# inspect the type
print(df.groupby('fruit'))
print(df.groupby('fruit').sum(numeric_only=True))

# iterate over the GroupBy object
for name, group in df.groupby('fruit'):
    print('group name:\t', name)    # the group label
    print('group data:\n', group)   # the data block of that group

# sum the count and total columns for each group
print(df.groupby('fruit')[['count', 'total']].apply(lambda x: x.sum()))

Aggregation (agg)

Function      Description
count         number of non-NA values in the group
sum           sum of non-NA values
mean          mean of non-NA values
median        median of non-NA values
std, var      standard deviation and variance
min, max      minimum and maximum of the non-NA values
prod          product of non-NA values
first, last   first and last non-NA values
For example:

import numpy as np

# perform different aggregations on different columns at once
print(df.groupby('fruit').aggregate({'price': 'min', 'count': 'max'}))
"""
        price  count
fruit               
apple     7.8      5
banana    6.4      4
orange    5.6      6
"""

"""
如果我现在有个需求,计算每种水果最高价和最低价的差值,
1.上表中的聚合函数不能满足于我们的需求,我们需要使用自定义的聚合函数
2.在分组对象中,使用我们自定义的聚合函数
"""

# 定义一个计算差值的函数
def diff_value(arr): 
    return arr.max() - arr.min()

print(df.groupby('fruit')['price'].agg(diff_value))
"""
fruit
apple     0.7
banana    0.4
orange    0.0
Name: price, dtype: float64
"""

Filtering (filter)

"""分组后我们也可以对数据进行过滤<自定义规则过滤>"""
def filter_func(x):
    print('price:\t', [x['price']])
    print('price:\t', type(x['price']))
    print('price:\t', x['price'].mean())
    return x['price'].mean() > 6  # 过滤平均价格大于6

print(df)
print('分组价格的平均数: \t', df.groupby('fruit')['price'].mean())

# 分组之后进行过滤,不影响原本的数据
print(df.groupby('fruit').filter(filter_func))  

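Besides aggregate and filter, a GroupBy object also supports transform, which returns a result aligned with the original rows rather than one row per group. A minimal sketch, using the same df as above:

# attach each fruit's mean price to every row of that fruit
df['mean_price'] = df.groupby('fruit')['price'].transform('mean')
print(df[['fruit', 'price', 'mean_price']])
"""
    fruit  price  mean_price
0   apple    8.5        8.15
1  banana    6.8        6.60
2  orange    5.6        5.60
3   apple    7.8        8.15
4  banana    6.4        6.60
"""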

1.2 Data merging

1.2.1 pd.merge data merge

  • Joins rows of different DataFrames on a single key or multiple keys

  • Analogous to database join operations

  • pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)

    left: the DataFrame on the left side of the merge

    right: the DataFrame on the right side of the merge

    how: the join method; default 'inner', also 'outer', 'left', 'right'

    on: column name(s) to join on; must be present in both DataFrames. If not specified, the intersection of the column names of left and right is used as the join key(s).

    left_on: column(s) in the left DataFrame to use as join keys

    right_on: column(s) in the right DataFrame to use as join keys

  • Inner join (inner): uses the intersection of the keys of both tables

  • Full join (outer): uses the union of the keys of both tables

  • Left join (left): uses all keys from the left table

  • Right join (right): uses all keys from the right table

import pandas as pd
import numpy as np

left = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, on='key')  # specify the join key
"""
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
"""


left = pd.DataFrame({
    'key1': ['K0', 'K0', 'K1', 'K2'],
    'key2': ['K0', 'K1', 'K0', 'K1'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({
    'key1': ['K0', 'K1', 'K1', 'K2'],
    'key2': ['K0', 'K0', 'K0', 'K0'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, on=['key1', 'key2'])  # merge on multiple keys
"""
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
"""


# specify a left join

left = pd.DataFrame({
    'key1': ['K0', 'K0', 'K1', 'K2'],
    'key2': ['K0', 'K1', 'K0', 'K1'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({
    'key1': ['K0', 'K1', 'K1', 'K2'],
    'key2': ['K0', 'K0', 'K0', 'K0'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, how='left', on=['key1', 'key2'])
"""
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN
"""


# specify a right join

left = pd.DataFrame({
    'key1': ['K0', 'K0', 'K1', 'K2'],
    'key2': ['K0', 'K1', 'K0', 'K1'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({
    'key1': ['K0', 'K1', 'K1', 'K2'],
    'key2': ['K0', 'K0', 'K0', 'K0'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, how='right', on=['key1', 'key2'])
"""
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
"""

how specifies the join method. The default is 'inner': the keys in the result are the intersection of the keys in both tables. With 'outer', the keys in the result are the union:

left = pd.DataFrame({
    'key1': ['K0', 'K0', 'K1', 'K2'],
    'key2': ['K0', 'K1', 'K0', 'K1'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({
    'key1': ['K0', 'K1', 'K1', 'K2'],
    'key2': ['K0', 'K0', 'K0', 'K0'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, how='outer', on=['key1', 'key2'])
"""
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
"""

Handling duplicate column names

  • Parameter suffixes: defaults to ('_x', '_y')
df_obj1 = pd.DataFrame({
    'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    'data': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({
    'key': ['a', 'b', 'd'],
    'data': np.random.randint(0, 10, 3)})

print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))

# output
"""
   data_left key  data_right
0          9   b           1
1          5   b           1
2          1   b           1
3          2   a           8
4          2   a           8
5          5   a           8
"""

Joining on the index

  • Parameter left_index=True or right_index=True

# join on the index
df_obj1 = pd.DataFrame({
    'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({
    'data2': np.random.randint(0, 10, 3)}, index=['a', 'b', 'd'])

print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))

# output
"""
   data1 key  data2
0      3   b      6
1      4   b      6
6      8   b      6
2      6   a      0
4      3   a      0
5      0   a      0
"""

1.2.2 pd.concat data merging

Merge multiple objects together along an axis

NumPy concatenate

import numpy as np
import pandas as pd

arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))

print(arr1)
print(arr2)

print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))

# output
"""
# print(arr1)
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]]

# print(arr2)
[[6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2]))
 [[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]
 [6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2], axis=1)) 
[[3 3 0 8 6 8 7 3]
 [2 0 3 1 1 6 8 7]
 [4 8 8 2 1 4 7 1]]
"""

pd.concat
- Pay attention to the axis direction; the default is axis=0
- join specifies the merge method; the default is 'outer'
- When concatenating Series, check whether the row indexes are duplicated (see the verify_integrity sketch after the example below)

df1 = pd.DataFrame(np.arange(6).reshape(3,2),index=list('abc'),columns=['one','two'])

df2 = pd.DataFrame(np.arange(4).reshape(2,2)+5,index=list('ac'),columns=['three','four'])

print(pd.concat([df1, df2]))  # default: outer join along axis=0
print(pd.concat([df1, df2], axis='columns'))  # concatenate along axis=1
print(pd.concat([df1, df2], axis=1, join='inner'))  # inner join along axis=1
"""
   one  two  three  four
a  0.0  1.0    NaN   NaN
b  2.0  3.0    NaN   NaN
c  4.0  5.0    NaN   NaN
a  NaN  NaN    5.0   6.0
c  NaN  NaN    7.0   8.0
   one  two  three  four
a    0    1    5.0   6.0
b    2    3    NaN   NaN
c    4    5    7.0   8.0
   one  two  three  four
a    0    1      5     6
c    4    5      7     8
"""

1.3 Reshaping

1.3.1 stack

  • Rotates the column index into the row index, producing a hierarchical index
  • DataFrame -> Series
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                    name='number'))
print(data)

result = data.stack()
print(result)
# output:
"""
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
"""

1.3.2 unstack

  • Expands a level of the hierarchical index into columns
  • Series -> DataFrame
  • By default it operates on the innermost index level, i.e. level=-1 (here, level 1, the number level)

# by default, unstack the innermost index level
print(result.unstack())

# use level to choose which index level to unstack
print(result.unstack(level=0))

"""
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5
"""

2 Time series

Time series data is an important form of structured data. Anything observed or measured at many points in time forms a time series. Many time series have a fixed frequency, meaning the data points occur at regular intervals according to some rule (for example, every 15 seconds). Time series can also be irregular. What time series data means depends on the specific application scenario. It mainly comes in the following forms:

  • Timestamp, a specific moment.
  • Fixed period (period), such as January 2007 or the whole year 2010.
  • Time interval (interval), represented by start and end timestamps. Period can be thought of as a special case of interval.
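
Pandas provides a type for each of these notions; a minimal sketch:

import pandas as pd

t = pd.Timestamp('2023-11-01 21:25:38')   # a timestamp: a specific moment
p = pd.Period('2007-01', freq='M')        # a fixed period: January 2007
iv = pd.Interval(pd.Timestamp('2023-01-01'),
                 pd.Timestamp('2023-12-31'))  # an interval between two timestamps
print(t, p, iv, sep='\n')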

2.1 python built-in time module

The Python standard library contains data types for dates and times, as well as calendar functionality. We will mainly use the datetime, time, and calendar modules. datetime.datetime (often imported simply as datetime) is the most commonly used type:

from datetime import datetime

now = datetime.now()

now

"""
datetime.datetime(2023, 11, 1, 21, 25, 38, 531179)
"""
now.year, now.month, now.day
"""
(2023, 11, 1)   # returns a tuple
"""

datetime stores the date and time down to the microsecond. timedelta represents the difference between two datetime objects:

delta = datetime(2023, 1, 7) - datetime(2022, 6, 24, 8, 15)
delta
"""
datetime.timedelta(days=196, seconds=56700)
"""
delta.days

# output:
"""
196
"""
from datetime import timedelta
start = datetime(2022, 1, 7)
start + timedelta(31)
"""
datetime.datetime(2022, 2, 7, 0, 0)
"""


2.1.2 Converting between strings and datetimes

stamp = datetime(2023, 10, 1)
stamp
# output
datetime.datetime(2023, 10, 1, 0, 0)

"""converting a datetime object to a string"""

# plain cast
str(stamp)

# formatting
stamp.strftime("%Y-%m-%d %H:%M:%S")
# output:
'2023-10-01 00:00:00'

"""converting a string to a datetime object"""

datetime.strptime('2023/5/20', '%Y/%m/%d')
# output
datetime.datetime(2023, 5, 20, 0, 0)

d = ['2023/7/20', '2023/7/21', '2023/7/22']
pd.to_datetime(d)   # the result is a DatetimeIndex object

index = pd.to_datetime(d + [None])
index  # NaT is the missing value for time-series data

pd.isnull(index)

index.dropna()  # drop the NaT missing values


2.2 Time series

The datetime format codes (used by strftime and strptime):

Symbol  Meaning                                       Example
%a      abbreviated weekday name                      'Wed'
%A      full weekday name                             'Wednesday'
%w      weekday number, 0 (Sunday) to 6 (Saturday)    '3'
%d      day of the month (zero-padded)                '13'
%b      abbreviated month name                        'Jan'
%B      full month name                               'January'
%m      month of the year                             '01'
%y      year without century                          '16'
%Y      year with century                             '2016'
%H      hour (24-hour clock)                          '17'
%I      hour (12-hour clock)                          '05'
%p      AM or PM                                      'PM'
%M      minute                                        '00'
%S      second                                        '00'
%f      microsecond                                   '000000'
%z      UTC offset for timezone-aware objects         '-0500'
%Z      time zone name                                'EST'
%j      day of the year                               '013'
%W      week of the year                              '02'
%c      locale's date and time representation         'Wed Jan 13 17:00:00 2016'
%x      locale's date representation                  '01/13/16'
%X      locale's time representation                  '17:00:00'
%%      literal '%' character                         '%'
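
A small illustration of a few of these codes, using the same date as the examples in the table:

from datetime import datetime

dt = datetime(2016, 1, 13, 17, 0)
print(dt.strftime('%A, %B %d, %Y'))  # Wednesday, January 13, 2016
print(dt.strftime('%I:%M %p'))       # 05:00 PM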
dates = [datetime(2020, 12, 12), datetime(2020, 12, 13), 
         datetime(2020, 12, 14), datetime(2020, 12, 15), 
         datetime(2020, 12, 16)]


ts = pd.Series(np.random.randn(5), index=dates)
ts  # a Series indexed by datetimes
# output
2020-12-12    0.376037
2020-12-13    0.426828
2020-12-14    0.050578
2020-12-15    0.302734
2020-12-16   -0.068885
dtype: float64

ts.index  # the row index is a DatetimeIndex
# output
DatetimeIndex(['2020-12-12', '2020-12-13', '2020-12-14', '2020-12-15',
               '2020-12-16'],
              dtype='datetime64[ns]', freq=None)

2.2.1 Selecting values

ts
2020-12-12    0.376037
2020-12-13    0.426828
2020-12-14    0.050578
2020-12-15    0.302734
2020-12-16   -0.068885
dtype: float64
ts[::2]  # every second element (step of 2)
2020-12-12    0.376037
2020-12-14    0.050578
2020-12-16   -0.068885
dtype: float64


2.2.2 Slicing by date

ts1 = pd.Series(
    data=np.random.randn(1000),
    index=pd.date_range('2020/01/01', periods=1000)
)

ts1
2020-01-01   -0.700438
2020-01-02   -1.961004
2020-01-03   -0.226558
2020-01-04   -0.594778
2020-01-05   -1.394763
                ...   
2022-09-22    0.025743
2022-09-23    0.784406
2022-09-24   -0.930995
2022-09-25    0.974117
2022-09-26   -1.625869
Freq: D, Length: 1000, dtype: float64
ts1['2021-05-18': '2021-05-27']  # slice by date strings
2021-05-18   -0.487663
2021-05-19   -0.529925
2021-05-20   -0.316952
2021-05-21    0.476325
2021-05-22   -1.006280
2021-05-23    0.438202
2021-05-24    1.505284
2021-05-25    0.523409
2021-05-26   -1.139620
2021-05-27    0.573387
Freq: D, dtype: float64
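
Partial string indexing also works on a Series with a DatetimeIndex; for example, an entire month can be selected with just the year and month (a quick sketch, using ts1 from above):

print(ts1['2021-05'])  # every entry in May 2021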

2.2.3 Specifying a time range

pd.date_range('2020-01-01', '2023-08-28')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2023-08-19', '2023-08-20', '2023-08-21', '2023-08-22',
               '2023-08-23', '2023-08-24', '2023-08-25', '2023-08-26',
               '2023-08-27', '2023-08-28'],
              dtype='datetime64[ns]', length=1336, freq='D')
pd.date_range(start='2020-01-01', periods=100)  # 100 days forward from the start date
pd.date_range(end='2023-08-28', periods=100)  # 100 days back from the end date

Resampling and frequency conversion

Resampling refers to converting a time series from one frequency to another. Aggregating high-frequency data to a lower frequency is called downsampling; converting low-frequency data to a higher frequency is called upsampling. Not all resampling falls into these two categories: for example, converting W-WED (weekly on Wednesday) to W-FRI is neither downsampling nor upsampling.

All pandas objects have a resample method, which is the main entry point for frequency-conversion tasks. resample has an API similar to groupby: calling resample groups the data, and you then call an aggregation function. It is a flexible, efficient method that can be used on very large time series.
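
A minimal downsampling sketch (note that recent pandas versions prefer the alias 'ME' over 'M' for month-end frequency):

import numpy as np
import pandas as pd

rng = pd.date_range('2020-01-01', periods=100, freq='D')
daily = pd.Series(np.random.randn(100), index=rng)
# downsample the daily series: one aggregated value per month
print(daily.resample('M').mean())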

pd.date_range(start='2023-08-28', periods=10, freq='BH')  # freq specifies the frequency; 'BH' means business hours

Output:

DatetimeIndex(['2023-08-28 09:00:00', '2023-08-28 10:00:00',
               '2023-08-28 11:00:00', '2023-08-28 12:00:00',
               '2023-08-28 13:00:00', '2023-08-28 14:00:00',
               '2023-08-28 15:00:00', '2023-08-28 16:00:00',
               '2023-08-29 09:00:00', '2023-08-29 10:00:00'],
              dtype='datetime64[ns]', freq='BH')
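
The W-WED alias mentioned in the resampling discussion can be used directly with date_range as well; a quick sketch:

pd.date_range('2023-01-04', periods=3, freq='W-WED')  # weekly, anchored on Wednesdays
# DatetimeIndex(['2023-01-04', '2023-01-11', '2023-01-18'], dtype='datetime64[ns]', freq='W-WED')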

Source: blog.csdn.net/m0_73678713/article/details/134153386