Processing dates and times for machine learning in Python

1.1 Convert string to date

Convert a vector of strings that represent dates and times into time series data.

Solution:
Use pandas' to_datetime function and specify the date and time format of the strings through the format parameter.

import numpy as np
import pandas as pd

# Create the date strings
date_strings = np.array(['03-04-2005 10:35 PM',
                         '23-05-2010 12:33 PM',
                         '04-09-2009 09:10 AM'])

# Convert each string to a datetime (Timestamp)
for date in date_strings:
    print(pd.to_datetime(date, format='%d-%m-%Y %I:%M %p'))

result:

2005-04-03 22:35:00
2010-05-23 12:33:00
2009-09-04 09:10:00

Common date and time formatting codes:

Code  Description                          Example
%Y    Full year                            2020
%m    Month, zero-padded                   09
%d    Day of the month, zero-padded        11
%I    Hour (12-hour clock), zero-padded    04
%p    AM or PM                             PM
%M    Minute, zero-padded                  19
%S    Second, zero-padded                  50
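
The same format codes also apply when a whole vector is converted in one call instead of in a loop. The following is a minimal sketch rather than part of the original example: the second input string is a made-up invalid value, and errors="coerce" is used so that values that cannot be parsed become NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical input: one valid string and one malformed value
date_strings = ["03-04-2005 11:35 PM", "not a date"]

# Convert the whole vector at once; errors="coerce" turns bad values into NaT
parsed = pd.to_datetime(date_strings, format="%d-%m-%Y %I:%M %p", errors="coerce")

print(parsed[0])           # 2005-04-03 23:35:00
print(pd.isna(parsed[1]))  # True
```

Coercing to NaT keeps the conversion from failing on a single bad record, at the cost of introducing missing values that need to be handled later.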

1.2 Processing time zone

Add or change the time zone for a set of time series data.

If no time zone is specified, pandas datetime objects are time zone naive. You can set the time zone with the tz parameter when creating the object:

import pandas as pd
# Create a datetime with a time zone
pd.Timestamp('2020-09-11 16:27:44', tz = 'Europe/London')

output:

Timestamp('2020-09-11 16:27:44+0100', tz='Europe/London')

  • Add a time zone to an existing datetime with tz_localize
date = pd.Timestamp('2020-09-11 16:27:44')
date_in_london = date.tz_localize('Europe/London')
date_in_london

output:

Timestamp('2020-09-11 16:27:44+0100', tz='Europe/London')

  • Convert to another time zone with tz_convert
date_in_london.tz_convert('Africa/Abidjan')

output:

Timestamp('2020-09-11 15:27:44+0000', tz='Africa/Abidjan')

To see all the strings that represent time zones, import all_timezones from the pytz library.
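
The examples above work on a single Timestamp. For a whole column of dates, the same two operations are available through the Series.dt accessor. A minimal sketch, assuming three illustrative daily dates that are not in the original:

```python
import pandas as pd

# Three time zone naive dates (daily frequency chosen only for illustration)
dates = pd.Series(pd.date_range("2020-09-11", periods=3, freq="D"))

# Attach a time zone to every element, then convert to another one
dates_in_london = dates.dt.tz_localize("Europe/London")
dates_in_abidjan = dates_in_london.dt.tz_convert("Africa/Abidjan")

print(dates_in_abidjan)
```

tz_localize only labels the existing wall-clock times with a zone, while tz_convert shifts them to the equivalent instant in the target zone.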

1.3 Select date and time

Select one or more dates from a vector of dates.
Solution:
1. Use two Boolean conditions, one for the start date and one for the end date

import pandas as pd
dataframe = pd.DataFrame()
# Create hourly datetimes (pandas 2.2 and later prefer the alias 'h' over 'H')
dataframe['date'] = pd.date_range('1/1/2019', periods=100000, freq='H')
# Select the observations between two datetimes
dataframe[(dataframe['date'] > '2020-09-11 16:00:00') &
          (dataframe['date'] <= '2020-09-11 19:00:00')]

output:

	    date
14873	2020-09-11 17:00:00  
14874	2020-09-11 18:00:00  
14875	2020-09-11 19:00:00  

2. Set the date column as the index of the data frame, and then use loc to slice

dataframe = dataframe.set_index(dataframe['date'])
dataframe.loc['2020-09-11 17:00:00':'2020-09-11 19:00:00']
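
The two approaches can be compared side by side on a small frame. This is a sketch under assumptions: the six-row frame, the value column, and the "60min" frequency string are illustrative choices, not taken from the original. It shows that the Boolean mask follows whatever comparison operators you write, while a loc slice on a datetime index includes both endpoints.

```python
import pandas as pd

# Small hourly frame; the "60min" frequency string is an illustrative choice
frame = pd.DataFrame()
frame["date"] = pd.date_range("2020-09-11 15:00:00", periods=6, freq="60min")
frame["value"] = range(6)

# Approach 1: Boolean mask, start excluded (>) and end included (<=)
by_mask = frame[(frame["date"] > "2020-09-11 16:00:00") &
                (frame["date"] <= "2020-09-11 19:00:00")]

# Approach 2: datetime index with loc, both endpoints included
indexed = frame.set_index("date")
by_loc = indexed.loc["2020-09-11 17:00:00":"2020-09-11 19:00:00"]

print(len(by_mask), len(by_loc))  # 3 3
```

If you filter by date often, the indexed approach tends to be more convenient; for a one-off selection the Boolean mask avoids changing the index.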

1.4 Divide the date data into multiple features

From a column of dates and times, create features for the year, month, day, hour, and minute.
Solution:
Use the datetime attributes of the pandas Series.dt accessor

import pandas as pd 
dataframe = pd.DataFrame()
dataframe['date'] = pd.date_range('11/09/2020',periods=5,freq = 'W')

dataframe['year'] = dataframe['date'].dt.year
dataframe['month'] = dataframe['date'].dt.month
dataframe['day'] = dataframe['date'].dt.day
dataframe['hour'] = dataframe['date'].dt.hour
dataframe['minute'] = dataframe['date'].dt.minute
dataframe

output:

    date	    year	month  day hour minute
0	2020-11-15	2020	11	15	0	0
1	2020-11-22	2020	11	22	0	0
2	2020-11-29	2020	11	29	0	0
3	2020-12-06	2020	12	6	0	0
4	2020-12-13	2020	12	13	0	0

1.5 Calculate the time difference between two dates

Calculate the time difference between two date features for each observation.
Solution:
Subtract one date feature from the other with pandas; the result is a time difference (timedelta) for each observation.
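
The subtraction can be sketched as follows. The column names Arrived and Left and their dates are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical arrival and departure dates for two observations
dataframe = pd.DataFrame()
dataframe["Arrived"] = [pd.Timestamp("2017-01-01"), pd.Timestamp("2017-01-04")]
dataframe["Left"] = [pd.Timestamp("2017-01-01"), pd.Timestamp("2017-01-06")]

# Subtracting two datetime columns gives a timedelta column
duration = dataframe["Left"] - dataframe["Arrived"]

# Keep only the number of whole days as an integer feature
duration_days = duration.dt.days
print(duration_days.tolist())  # [0, 2]
```

Extracting the day count with dt.days gives a plain numeric feature, which is usually easier to feed to a model than a timedelta.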

1.6 Encode the days of the week

Find the day of the week for each date in a vector of dates.
Solution:
Use the weekday attribute of Series.dt in pandas, which returns the day of the week as an integer (Monday is 0, Sunday is 6)

import pandas as pd
# Three month-end dates (pandas 2.2 and later prefer the alias "ME" over "M")
dates = pd.Series(pd.date_range("11/09/2002", periods=3, freq="M"))
dates.dt.weekday

output:

0    5
1    1
2    4
dtype: int64

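Note: dates.dt.weekday_name raises an AttributeError in pandas 1.0 and later, because the property was removed. Use dates.dt.day_name() to get the weekday names.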
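
A minimal sketch of getting weekday names with day_name(). The three dates are the month-end dates from the example above, written out explicitly so the snippet does not depend on a frequency alias.

```python
import pandas as pd

# Same three month-end dates as above, written out explicitly
dates = pd.Series(pd.to_datetime(["2002-11-30", "2002-12-31", "2003-01-31"]))

# day_name() returns the weekday name; weekday returns 0 (Monday) to 6 (Sunday)
names = dates.dt.day_name()
print(names.tolist())  # ['Saturday', 'Tuesday', 'Friday']
```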

1.7 Create a lagging feature

Create a feature that lags by n time periods.
Solution:
Use the pandas shift method

import pandas as pd
dataframe = pd.DataFrame()

dataframe['dates'] = pd.date_range("11/09/2020", periods=5, freq="D")
dataframe['stock_price'] = [1.1,2.2,3.3,4.4,5.5]

# Lag the values by one row
dataframe['previous_days_stock_price'] = dataframe['stock_price'].shift(1)

dataframe

output:

    dates	   stock_price	previous_days_stock_price
0	2020-11-09	1.1	        NaN
1	2020-11-10	2.2	        1.1
2	2020-11-11	3.3	        2.2
3	2020-11-12	4.4	        3.3
4	2020-11-13	5.5	        4.4
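
The same idea extends to longer lags and to features derived from them. The sketch below uses made-up integer prices and the hypothetical column names lag_1, lag_2, and change; none of them appear in the original.

```python
import pandas as pd

dataframe = pd.DataFrame()
dataframe["dates"] = pd.date_range("2020-11-09", periods=5, freq="D")
dataframe["stock_price"] = [10, 12, 15, 11, 14]  # made-up prices

# Lag by one and two rows; the first n rows have no earlier value, so they are NaN
dataframe["lag_1"] = dataframe["stock_price"].shift(1)
dataframe["lag_2"] = dataframe["stock_price"].shift(2)

# Day-over-day change built from the one-day lag
dataframe["change"] = dataframe["stock_price"] - dataframe["lag_1"]

print(dataframe)
```

Each extra lag adds one more leading row of missing values, so those rows usually have to be dropped or filled before training.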

1.8 Using a rolling time window

Calculate a statistic of a time series over a rolling time window.
Solution:
Use the pandas rolling method with a window size, then apply a statistic such as the mean

import pandas as pd

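# Create month-end datetimes (pandas 2.2 and later prefer the alias "ME" over "M")
time_index = pd.date_range("01/01/2020", periods=5, freq="M")
# Create the data frame and set the index
dataframe = pd.DataFrame(index=time_index)
# Create the feature
dataframe['Stock_price'] = [1, 2, 3, 4, 5]

# Calculate the rolling mean over a window of two rows
dataframe.rolling(window=2).mean()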

output:


            Stock_price
2020-01-31	NaN
2020-02-29	1.5
2020-03-31	2.5
2020-04-30	3.5
2020-05-31	4.5

A rolling mean is often used to smooth time series data, because averaging over the whole window dampens the effect of short-term fluctuations.
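
The smoothing effect is easier to see on a series that zigzags. The sketch below uses made-up values and a daily index chosen only for illustration.

```python
import pandas as pd

# Made-up series that rises slowly but zigzags from day to day
time_index = pd.date_range("2020-01-01", periods=6, freq="D")
prices = pd.Series([1, 5, 2, 6, 3, 7], index=time_index)

# Two-period rolling mean; the first value is NaN because the window is incomplete
smoothed = prices.rolling(window=2).mean()

print(smoothed.tolist())  # [nan, 3.0, 3.5, 4.0, 4.5, 5.0]
```

The raw values jump up and down by several units, while the rolling mean rises steadily by 0.5 per step, which exposes the underlying trend.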

1.9 Dealing with missing values in time series

Handle missing values in time series data.
Solution:
For time series data, interpolation can be used to fill the gaps caused by missing values.

import pandas as pd
import numpy as np

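# Create month-end datetimes (pandas 2.2 and later prefer the alias "ME" over "M")
time_index = pd.date_range("01/01/2020", periods=5, freq="M")

# Create the data frame and set the index
dataframe = pd.DataFrame(index=time_index)

# Create a feature with missing values
dataframe['Sales'] = [1.0, 2.0, np.nan, np.nan, 5.0]

# Interpolate the missing values
dataframe.interpolate()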

output:

	        Sales
2020-01-31	1.0
2020-02-29	2.0
2020-03-31	3.0
2020-04-30	4.0
2020-05-31	5.0

Fill forward:

dataframe.ffill()

output:

	        Sales
2020-01-31	1.0
2020-02-29	2.0
2020-03-31	2.0
2020-04-30	2.0
2020-05-31	5.0

Fill backward:

dataframe.bfill()

output:

            Sales
2020-01-31	1.0
2020-02-29	2.0
2020-03-31	5.0
2020-04-30	5.0
2020-05-31	5.0

If the relationship between the known points is nonlinear, you can use the method parameter of interpolate to specify the interpolation method (methods such as "quadratic" require SciPy to be installed):

dataframe.interpolate(method = "quadratic")

output:

2020-01-31	1.000000
2020-02-29	2.000000
2020-03-31	3.040158
2020-04-30	4.018418
2020-05-31	5.000000
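
When a gap is long, you may not want to interpolate across all of it. A minimal sketch, assuming the limit and limit_direction parameters of interpolate are used to cap how many consecutive missing values are filled; the daily index is an illustrative choice.

```python
import numpy as np
import pandas as pd

time_index = pd.date_range("2020-01-01", periods=5, freq="D")
dataframe = pd.DataFrame(index=time_index)
dataframe["Sales"] = [1.0, 2.0, np.nan, np.nan, 5.0]

# Fill at most one consecutive missing value, working forward in time
limited = dataframe.interpolate(limit=1, limit_direction="forward")

print(limited["Sales"].tolist())  # [1.0, 2.0, 3.0, nan, 5.0]
```

The first missing value is filled with the linearly interpolated 3.0, and the second is left as NaN for a later, explicit decision.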

Origin blog.csdn.net/weixin_44127327/article/details/108539609