Processing date and time
- 1.1 Convert string to date
- 1.2 Processing time zone
- 1.3 Select date and time
- 1.4 Divide the date data into multiple features
- 1.5 Calculate the time difference between two dates
- 1.6 Encode the days of the week
- 1.7 Create a lagging feature
- 1.8 Using a rolling time window
- 1.9 Dealing with missing values in time series
1.1 Convert string to date
Convert a string vector representing date and time into time series data
Solution:
Use pandas' to_datetime function and specify the date and time format of the strings through the format parameter.
import numpy as np
import pandas as pd
# Create the strings
date_strings = np.array(['03-04-2005 10:35 PM',
                         '23-05-2010 12:33 PM',
                         '04-09-2009 09:10 AM'])
# Convert each string to a datetime
for date in date_strings:
    print(pd.to_datetime(date, format='%d-%m-%Y %I:%M %p'))
result:
2005-04-03 22:35:00
2010-05-23 12:33:00
2009-09-04 09:10:00
Common date and time formatting codes:
Code | Description | Example |
---|---|---|
%Y | Four-digit year | 2020 |
%m | Month, zero-padded | 09 |
%d | Day of the month, zero-padded | 11 |
%I | Hour (12-hour clock), zero-padded | 04 |
%p | AM or PM | PM |
%M | Minute, zero-padded | 19 |
%S | Second, zero-padded | 50 |
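The loop above converts one string at a time. As a minimal sketch, to_datetime can also take the whole array in one call, and errors='coerce' turns unparseable entries into NaT instead of raising an error. The malformed third string and the variable name converted are illustrative assumptions, not part of the original data.

```python
import numpy as np
import pandas as pd

# Two strings from the example above plus one deliberately malformed entry
date_strings = np.array(['03-04-2005 10:35 PM',
                         '23-05-2010 12:33 PM',
                         'not a date'])

# Convert the whole array at once; entries that do not match the format become NaT
converted = pd.to_datetime(date_strings,
                           format='%d-%m-%Y %I:%M %p',
                           errors='coerce')
print(converted)
```

Coercing is convenient for messy data, but it hides bad input, so it is worth counting the NaT values afterwards.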
1.2 Processing time zone
Add or change the time zone of a set of time series data.
By default, pandas datetime objects are time zone naive, meaning they carry no time zone. You can specify a time zone through the tz parameter when creating the object:
import pandas as pd
# Create a datetime with a time zone
pd.Timestamp('2020-09-11 16:27:44', tz = 'Europe/London')
output:
Timestamp('2020-09-11 16:27:44+0100', tz='Europe/London')
# Add a time zone to an existing datetime with tz_localize
date = pd.Timestamp('2020-09-11 16:27:44')
date_in_london = date.tz_localize('Europe/London')
date_in_london
output:
Timestamp('2020-09-11 16:27:44+0100', tz='Europe/London')
- Convert to another time zone with tz_convert
date_in_london.tz_convert('Africa/Abidjan')
output:
Timestamp('2020-09-11 15:27:44+0000', tz='Africa/Abidjan')
To see all strings that represent time zones, import all_timezones from the pytz library.
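As a rough sketch of how to look up valid time zone names, the block below uses the standard library's zoneinfo module (Python 3.9+) rather than all_timezones. This is a swapped-in alternative, and it assumes a time zone database is available on the system. The variable names are illustrative.

```python
from zoneinfo import available_timezones

import pandas as pd

# Collect and sort the IANA time zone names known to this Python installation
zone_names = sorted(available_timezones())
print(zone_names[:3])

# Any listed name can be passed as the tz argument
name = 'Europe/London'
if name in zone_names:
    print(pd.Timestamp('2020-09-11 16:27:44', tz=name))
```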
1.3 Select date and time
Select one or more dates from a set of date vectors.
Solution:
1. Use two Boolean conditions to set the start date and the end date respectively
import pandas as pd
dataframe = pd.DataFrame()
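# 'H' is the hourly alias; pandas 2.2 and later prefer the lowercase 'h'
dataframe['date'] = pd.date_range('1/1/2019', periods=100000, freq='H')
# Select the observations between two datetimes
dataframe[(dataframe['date'] > '2020-09-11 16:00:00') &
          (dataframe['date'] <= '2020-09-11 19:00:00')]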
output:
date
14873 2020-09-11 17:00:00
14874 2020-09-11 18:00:00
14875 2020-09-11 19:00:00
2. Set the date column as the index of the data frame, and then use loc to slice by date
dataframe = dataframe.set_index(dataframe['date'])
dataframe.loc['2020-09-11 17:00:00':'2020-09-11 19:00:00']
1.4 Divide the date data into multiple features
Create year, month, day, hour, and minute features from a column of dates and times.
Solution:
Use the time attributes of the pandas Series.dt accessor
import pandas as pd
dataframe = pd.DataFrame()
dataframe['date'] = pd.date_range('11/09/2020',periods=5,freq = 'W')
dataframe['year'] = dataframe['date'].dt.year
dataframe['month'] = dataframe['date'].dt.month
dataframe['day'] = dataframe['date'].dt.day
dataframe['hour'] = dataframe['date'].dt.hour
dataframe['minute'] = dataframe['date'].dt.minute
dataframe
output:
date year month day hour minute
0 2020-11-15 2020 11 15 0 0
1 2020-11-22 2020 11 22 0 0
2 2020-11-29 2020 11 29 0 0
3 2020-12-06 2020 12 6 0 0
4 2020-12-13 2020 12 13 0 0
1.5 Calculate the time difference between two dates
Calculate the time difference between two date features for each observation.
Solution:
Subtract one date feature from the other with pandas; the result is a timedelta for each observation
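A minimal sketch of the subtraction, assuming two hypothetical date columns named Arrived and Left (the names and values are made up for illustration):

```python
import pandas as pd

dataframe = pd.DataFrame()
# Hypothetical arrival and departure dates for two observations
dataframe['Arrived'] = [pd.Timestamp('2020-01-01'), pd.Timestamp('2020-01-04')]
dataframe['Left'] = [pd.Timestamp('2020-01-01'), pd.Timestamp('2020-01-06')]

# Subtracting two datetime columns yields a timedelta column
dataframe['duration'] = dataframe['Left'] - dataframe['Arrived']

# Keep only the number of whole days as a plain integer feature
dataframe['duration_days'] = dataframe['duration'].dt.days
print(dataframe[['duration', 'duration_days']])
```

The integer column is often easier to feed to a model than the raw timedelta values.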
1.6 Encode the days of the week
Find the day of the week for each date in a date vector.
Solution:
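Use the weekday attribute of Series.dt in pandas, which encodes Monday as 0 and Sunday as 6 (older pandas versions also offered a weekday_name property)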
import pandas as pd
dates = pd.Series(pd.date_range("11/09/2002",periods = 3,freq = "M"))
dates.dt.weekday
output:
0 5
1 1
2 4
dtype: int64
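Note: dates.dt.weekday_name raises an AttributeError in pandas 1.0 and later because the property was removed; use dates.dt.day_name() to get the weekday names.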
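For readers who want the weekday names rather than the integers, here is a short sketch using Series.dt.day_name(). The three dates are the same month-end dates as in the example above, written out explicitly.

```python
import pandas as pd

# The same three month-end dates as above, written out explicitly
dates = pd.Series(pd.to_datetime(['2002-11-30', '2002-12-31', '2003-01-31']))

# day_name() returns the weekday as a string (English by default)
names = dates.dt.day_name()
print(names.tolist())   # ['Saturday', 'Tuesday', 'Friday']

# The integer encoding for comparison: Monday is 0, Sunday is 6
print(dates.dt.weekday.tolist())   # [5, 1, 4]
```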
1.7 Create a lagging feature
Create a feature that lags n time periods.
Solution: Use pandas shift
import pandas as pd
dataframe = pd.DataFrame()
dataframe['dates'] = pd.date_range("11/09/2020", periods=5, freq="D")
dataframe['stock_price'] = [1.1,2.2,3.3,4.4,5.5]
# Lag the values by one row
dataframe['previous_days_stock_price'] = dataframe['stock_price'].shift(1)
dataframe
output:
dates stock_price previous_days_stock_price
0 2020-11-09 1.1 NaN
1 2020-11-10 2.2 1.1
2 2020-11-11 3.3 2.2
3 2020-11-12 4.4 3.3
4 2020-11-13 5.5 4.4
1.8 Using a rolling time window
Calculate a statistic over a rolling time window of a time series.
Solution:
Use the rolling method of a pandas data frame together with an aggregation such as mean
import pandas as pd
time_index = pd.date_range("01/01/2020",periods = 5,freq = "M")
# Create the data frame and set the index
dataframe = pd.DataFrame(index = time_index)
# Create the feature
dataframe['Stock_price'] = [1,2,3,4,5]
# Calculate the rolling mean
dataframe.rolling(window = 2).mean()
output:
Stock_price
2020-01-31 NaN
2020-02-29 1.5
2020-03-31 2.5
2020-04-30 3.5
2020-05-31 4.5
Rolling average is often used to smooth time series data, because using the average of the entire time window can weaken the impact of short-term fluctuations.
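To make the smoothing effect concrete, the sketch below applies a three-observation window to a made-up zigzag series and also computes a rolling maximum. The values, the daily index, and the new column names are assumptions chosen for illustration.

```python
import pandas as pd

# Daily index with made-up prices that zigzag from day to day
time_index = pd.date_range('2020-01-01', periods=6, freq='D')
dataframe = pd.DataFrame({'Stock_price': [1, 5, 2, 6, 3, 7]}, index=time_index)

# A rolling mean over 3 observations damps the day-to-day swings
dataframe['rolling_mean'] = dataframe['Stock_price'].rolling(window=3).mean()

# Other statistics work the same way, for example the rolling maximum
dataframe['rolling_max'] = dataframe['Stock_price'].rolling(window=3).max()
print(dataframe)
```

The first two rows of the rolling columns are NaN because a full window of three observations is not yet available.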
1.9 Dealing with missing values in time series
Handle missing values in time series data.
Solution:
For time series data, interpolation can be used to fill in the gaps caused by missing values.
import pandas as pd
import numpy as np
# Create the dates
time_index = pd.date_range("01/01/2020",periods = 5,freq = "M")
# Create the data frame and set the index
dataframe = pd.DataFrame(index = time_index)
# Create a feature with missing values
dataframe['Sales'] = [1.0,2.0,np.nan,np.nan,5.0]
# Interpolate the missing values
dataframe.interpolate()
output:
Sales
2020-01-31 1.0
2020-02-29 2.0
2020-03-31 3.0
2020-04-30 4.0
2020-05-31 5.0
Fill forward:
dataframe.ffill()
output:
Sales
2020-01-31 1.0
2020-02-29 2.0
2020-03-31 2.0
2020-04-30 2.0
2020-05-31 5.0
Fill backward:
dataframe.bfill()
output:
Sales
2020-01-31 1.0
2020-02-29 2.0
2020-03-31 5.0
2020-04-30 5.0
2020-05-31 5.0
If the relationship between the known points is nonlinear, you can use the method parameter of interpolate to specify another interpolation method (methods such as "quadratic" require SciPy to be installed):
dataframe.interpolate(method = "quadratic")
output:
Sales
2020-01-31 1.000000
2020-02-29 2.000000
2020-03-31 3.040158
2020-04-30 4.018418
2020-05-31 5.000000
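When a gap is long, filling all of it may not be desirable. As a hedged sketch, the limit and limit_direction parameters of interpolate restrict how many consecutive missing values are filled. The daily index here is an assumption made to keep the example simple.

```python
import numpy as np
import pandas as pd

# Same Sales values as above, on a simple daily index
time_index = pd.date_range('2020-01-01', periods=5, freq='D')
dataframe = pd.DataFrame({'Sales': [1.0, 2.0, np.nan, np.nan, 5.0]},
                         index=time_index)

# Fill at most one consecutive missing value, working forward from the last known point
limited = dataframe.interpolate(limit=1, limit_direction='forward')
print(limited)
```

Here only the first missing value is filled (with 3.0); the second one stays NaN.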