52_Pandas processing date and time columns (string conversion, date extraction, etc.)
This article explains how to manipulate columns of a pandas.DataFrame that represent dates and times: converting between strings and the datetime64[ns] type, extracting parts of a date and time as numbers, and so on.
The following topics are covered.
Convert the string to datetime64[ns] type (timestamp type): to_datetime()
Timestamp type properties/methods
Batch the entire column using the dt accessor
Extract dates, days of the week, etc.
Convert datetime to string in any format
Convert to Python datetime type, NumPy datetime64[ns] type array
For methods not provided in dt
For datetime indexes
Convert string to datetime64[ns] type when reading from file
For how to set a datetime64[ns] column as the index and process it as time series data, refer to the following articles.
- 26_Pandas.DataFrame time series data processing
- 27_Pandas calculate total and average of time series data by day of week, month, quarter and year
The examples use a pandas.DataFrame read from the following CSV file.
import pandas as pd
import datetime
df = pd.read_csv('./data/sample_datetime_multi.csv')
print(df)
# A B
#0 2017-11-01 12:24 2017年11月1日 12时24分
#1 2017-11-18 23:00 2017年11月18日 23时00分
#2 2017-12-05 5:05 2017年12月5日 5时05分
#3 2017-12-22 8:54 2017年12月22日 8时54分
#4 2018-01-08 14:20 2018年1月8日 14时20分
#5 2018-01-19 20:01 2018年1月19日 20时01分
Convert the string to datetime64[ns] type (timestamp type): to_datetime()
The pandas.to_datetime() function converts a column (pandas.Series) of strings representing dates and times to the datetime64[ns] type.
print(pd.to_datetime(df['A']))
# 0 2017-11-01 12:24:00
# 1 2017-11-18 23:00:00
# 2 2017-12-05 05:05:00
# 3 2017-12-22 08:54:00
# 4 2018-01-08 14:20:00
# 5 2018-01-19 20:01:00
# Name: A, dtype: datetime64[ns]
If the strings are not in a standard format, pass a format string to the format parameter.
print(pd.to_datetime(df['B'], format='%Y年%m月%d日 %H时%M分'))
# 0 2017-11-01 12:24:00
# 1 2017-11-18 23:00:00
# 2 2017-12-05 05:05:00
# 3 2017-12-22 08:54:00
# 4 2018-01-08 14:20:00
# 5 2018-01-19 20:01:00
# Name: B, dtype: datetime64[ns]
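Real data often contains values that do not match the format. The following is a minimal sketch, assuming a made-up two-element Series in which the second value is not a date; it uses the errors='coerce' option of pandas.to_datetime() so that unparseable values become NaT instead of raising an exception.

```python
import pandas as pd

# Hypothetical input: the second value is not a valid date string.
s = pd.Series(['2017年11月1日 12时24分', 'unknown'])

# errors='coerce' turns values that do not match the format into NaT
# instead of raising an exception.
converted = pd.to_datetime(s, format='%Y年%m月%d日 %H时%M分', errors='coerce')
print(converted)

# isna() marks the rows that could not be parsed.
print(converted.isna().tolist())
```

Whether coercing or raising is preferable depends on the data; coercing silently hides bad rows, so checking isna() afterwards is usually a good idea.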
Even if the original formats are different, datetime64[ns] type values are equivalent if the indicated date and time are the same.
print(pd.to_datetime(df['A']) == pd.to_datetime(df['B'], format='%Y年%m月%d日 %H时%M分'))
# 0 True
# 1 True
# 2 True
# 3 True
# 4 True
# 5 True
# dtype: bool
If you want to add a column converted to datetime64[ns] type as a new column to a pandas.DataFrame, specify the new column name and assign it. If you specify the original column name, it will be overwritten.
df['X'] = pd.to_datetime(df['A'])
print(df)
# A B X
#0 2017-11-01 12:24 2017年11月1日 12时24分 2017-11-01 12:24:00
#1 2017-11-18 23:00 2017年11月18日 23时00分 2017-11-18 23:00:00
#2 2017-12-05 5:05 2017年12月5日 5时05分 2017-12-05 05:05:00
#3 2017-12-22 8:54 2017年12月22日 8时54分 2017-12-22 08:54:00
#4 2018-01-08 14:20 2018年1月8日 14时20分 2018-01-08 14:20:00
#5 2018-01-19 20:01 2018年1月19日 20时01分 2018-01-19 20:01:00
Timestamp type properties/methods
The dtype of the column converted by the pandas.to_datetime() function is the datetime64[ns] type, and each element is of the Timestamp type.
print(df)
# A B X
# 0 2017-11-01 12:24 2017年11月1日 12时24分 2017-11-01 12:24:00
# 1 2017-11-18 23:00 2017年11月18日 23时00分 2017-11-18 23:00:00
# 2 2017-12-05 5:05 2017年12月5日 5时05分 2017-12-05 05:05:00
# 3 2017-12-22 8:54 2017年12月22日 8时54分 2017-12-22 08:54:00
# 4 2018-01-08 14:20 2018年1月8日 14时20分 2018-01-08 14:20:00
# 5 2018-01-19 20:01 2018年1月19日 20时01分 2018-01-19 20:01:00
print(df.dtypes)
# A object
# B object
# X datetime64[ns]
# dtype: object
print(df['X'][0])
# 2017-11-01 12:24:00
print(type(df['X'][0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>
The Timestamp type inherits from and extends the datetime type of the Python standard library's datetime module.
print(issubclass(pd.Timestamp, datetime.datetime))
# True
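Year, month, and day (year, month, day), hour, minute, and second (hour, minute, second), and the day of the week as a number (dayofweek) can be obtained as attributes. The day of the week as a string is returned by the day_name() method; the weekday_name attribute used in older versions was removed in pandas 1.0.
print(df['X'][0].year)
# 2017
print(df['X'][0].day_name())
# Wednesday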
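As a small illustration of the attributes listed above, the sketch below builds a single Timestamp directly (rather than reading the CSV file) and collects its components into a dict. The variable names ts and parts are only for this example.

```python
import pandas as pd

ts = pd.Timestamp('2017-11-01 12:24:00')

# Collect several components of one Timestamp into a dict.
parts = {
    'year': ts.year,
    'month': ts.month,
    'day': ts.day,
    'hour': ts.hour,
    'minute': ts.minute,
    'second': ts.second,
    'dayofweek': ts.dayofweek,   # 0 = Monday ... 6 = Sunday
    'day_name': ts.day_name(),   # a method, English name by default
}
print(parts)
```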
You can also use to_pydatetime() to convert to the Python standard library datetime type, and to_datetime64() to convert to the NumPy datetime64[ns] type.
py_dt = df['X'][0].to_pydatetime()
print(type(py_dt))
# <class 'datetime.datetime'>
dt64 = df['X'][0].to_datetime64()
print(type(dt64))
# <class 'numpy.datetime64'>
timestamp() is a method that returns the UNIX time (epoch seconds = seconds since January 1, 1970 00:00:00) as a float type. If you need an integer, use int().
print(df['X'][0].timestamp())
# 1509539040.0
print(pd.to_datetime('1970-01-01 00:00:00').timestamp())
# 0.0
print(int(df['X'][0].timestamp()))
# 1509539040
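The reverse conversion, from UNIX time back to a Timestamp, can be sketched with the unit parameter of pandas.to_datetime(). The epoch value is the one printed in the example above; the variable names are only for illustration.

```python
import pandas as pd

epoch_seconds = 1509539040  # value taken from the example above

# unit='s' tells to_datetime() that the number counts seconds since the epoch.
ts = pd.to_datetime(epoch_seconds, unit='s')
print(ts)
# 2017-11-01 12:24:00

# Round trip: timestamp() gives back the same number of seconds.
print(int(ts.timestamp()) == epoch_seconds)
# True
```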
Like the datetime type in the Python standard library, strftime() can be used to convert to a string of any format. See below for how to apply this to all elements of a column.
print(df['X'][0].strftime('%Y/%m/%d'))
# 2017/11/01
Batch the entire column using the dt accessor
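Just as the str accessor applies string processing to an entire pandas.Series, the dt accessor applies date and time processing to an entire pandas.Series of datetime64[ns] type.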
Extract the date, day of the week.
As with the Timestamp type, year, month, and day (year, month, day), hour, minute, and second (hour, minute, second), and the day of the week as a number (dayofweek) can be obtained as attributes; the day of the week as a string is returned by the day_name() method (the weekday_name attribute was removed in pandas 1.0). Write the attribute name after dt. Each element of the pandas.Series is processed and a pandas.Series is returned.
print(df['X'].dt.year)
# 0 2017
# 1 2017
# 2 2017
# 3 2017
# 4 2018
# 5 2018
# Name: X, dtype: int64
print(df['X'].dt.hour)
# 0 12
# 1 23
# 2 5
# 3 8
# 4 14
# 5 20
# Name: X, dtype: int64
You can also use dayofweek (0 for Monday, 6 for Sunday) to select only the rows for specific days of the week.
print(df['X'].dt.dayofweek)
# 0 2
# 1 5
# 2 1
# 3 4
# 4 0
# 5 4
# Name: X, dtype: int64
print(df[df['X'].dt.dayofweek == 4])
# A B X
# 3 2017-12-22 8:54 2017年12月22日 8时54分 2017-12-22 08:54:00
# 5 2018-01-19 20:01 2018年1月19日 20时01分 2018-01-19 20:01:00
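The same idea extends to day names and to ranges of weekdays. The sketch below rebuilds the six datetimes of the sample data inline (so it does not depend on the CSV file) and uses dt.day_name() and a comparison on dt.dayofweek to pick out weekend rows.

```python
import pandas as pd

# The same six datetimes as in the sample data, built inline.
s = pd.to_datetime(pd.Series([
    '2017-11-01 12:24', '2017-11-18 23:00', '2017-12-05 05:05',
    '2017-12-22 08:54', '2018-01-08 14:20', '2018-01-19 20:01',
]))

# Day of the week as a string for every element.
names = s.dt.day_name()
print(names.tolist())

# Saturday is 5 and Sunday is 6, so >= 5 selects weekends.
is_weekend = s.dt.dayofweek >= 5
print(s[is_weekend])
```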
Convert datetime to string in any format
When a column of datetime64[ns] type is converted to the string type str with the astype() method, each value becomes a string in the standard format.
print(df['X'].astype(str))
# 0 2017-11-01 12:24:00
# 1 2017-11-18 23:00:00
# 2 2017-12-05 05:05:00
# 3 2017-12-22 08:54:00
# 4 2018-01-08 14:20:00
# 5 2018-01-19 20:01:00
# Name: X, dtype: object
dt.strftime() can be used to convert a column to a string of any format in one go. It is also possible to make it a string with only date or only time.
print(df['X'].dt.strftime('%A, %B %d, %Y'))
# 0 Wednesday, November 01, 2017
# 1 Saturday, November 18, 2017
# 2 Tuesday, December 05, 2017
# 3 Friday, December 22, 2017
# 4 Monday, January 08, 2018
# 5 Friday, January 19, 2018
# Name: X, dtype: object
print(df['X'].dt.strftime('%Y年%m月%d日'))
# 0 2017年11月01日
# 1 2017年11月18日
# 2 2017年12月05日
# 3 2017年12月22日
# 4 2018年01月08日
# 5 2018年01月19日
# Name: X, dtype: object
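For the date-only and time-only strings mentioned above, a minimal sketch with two inline values looks like this. The format strings are ordinary strftime() directives; the variable names are only for illustration.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2017-11-01 12:24', '2017-12-05 05:05']))

date_str = s.dt.strftime('%Y-%m-%d')  # date part only
time_str = s.dt.strftime('%H:%M')     # time part only

print(date_str.tolist())
# ['2017-11-01', '2017-12-05']
print(time_str.tolist())
# ['12:24', '05:05']
```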
If you want to add a column converted to a string as a new column to a pandas.DataFrame, specify the new column name and assign it. If you specify the original column name, it will be overwritten.
df['en'] = df['X'].dt.strftime('%A, %B %d, %Y')
df['cn'] = df['X'].dt.strftime('%Y年%m月%d日')
print(df)
# A B X \
# 0 2017-11-01 12:24 2017年11月1日 12时24分 2017-11-01 12:24:00
# 1 2017-11-18 23:00 2017年11月18日 23时00分 2017-11-18 23:00:00
# 2 2017-12-05 5:05 2017年12月5日 5时05分 2017-12-05 05:05:00
# 3 2017-12-22 8:54 2017年12月22日 8时54分 2017-12-22 08:54:00
# 4 2018-01-08 14:20 2018年1月8日 14时20分 2018-01-08 14:20:00
# 5 2018-01-19 20:01 2018年1月19日 20时01分 2018-01-19 20:01:00
# en cn
# 0 Wednesday, November 01, 2017 2017年11月01日
# 1 Saturday, November 18, 2017 2017年11月18日
# 2 Tuesday, December 05, 2017 2017年12月05日
# 3 Friday, December 22, 2017 2017年12月22日
# 4 Monday, January 08, 2018 2018年01月08日
# 5 Friday, January 19, 2018 2018年01月19日
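Convert to Python datetime type, NumPy datetime64[ns] type array
A NumPy array (ndarray) whose elements are Python standard library datetime objects can be obtained using dt.to_pydatetime(). Note that recent pandas versions (2.1 and later) emit a FutureWarning for this call, because a future version is planned to return a pandas.Series instead of an ndarray.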
print(df['X'].dt.to_pydatetime())
# [datetime.datetime(2017, 11, 1, 12, 24)
# datetime.datetime(2017, 11, 18, 23, 0)
# datetime.datetime(2017, 12, 5, 5, 5)
# datetime.datetime(2017, 12, 22, 8, 54)
# datetime.datetime(2018, 1, 8, 14, 20)
# datetime.datetime(2018, 1, 19, 20, 1)]
print(type(df['X'].dt.to_pydatetime()))
print(type(df['X'].dt.to_pydatetime()[0]))
# <class 'numpy.ndarray'>
# <class 'datetime.datetime'>
A NumPy array of datetime64[ns] type can be obtained with the values attribute (an attribute, not a method).
print(df['X'].values)
# ['2017-11-01T12:24:00.000000000' '2017-11-18T23:00:00.000000000'
# '2017-12-05T05:05:00.000000000' '2017-12-22T08:54:00.000000000'
# '2018-01-08T14:20:00.000000000' '2018-01-19T20:01:00.000000000']
print(type(df['X'].values))
print(type(df['X'].values[0]))
# <class 'numpy.ndarray'>
# <class 'numpy.datetime64'>
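As an alternative that does not depend on the return type of dt.to_pydatetime(), the following sketch uses Series.to_numpy() for the NumPy array and a list comprehension over the Series for the Python datetime objects. This is one possible approach under the assumption that a plain list is acceptable, not the only way to do it.

```python
import datetime

import numpy as np
import pandas as pd

s = pd.to_datetime(pd.Series(['2017-11-01 12:24', '2017-11-18 23:00']))

# to_numpy() returns a NumPy array of datetime64 values, like the values attribute.
arr = s.to_numpy()
print(type(arr), arr.dtype)

# Iterating over the Series yields Timestamp objects; convert each one
# to a standard library datetime.datetime.
py_list = [ts.to_pydatetime() for ts in s]
print(py_list)
```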
For methods not provided in dt
For example, the Timestamp type has a timestamp() method that returns the UNIX time in seconds, but the dt accessor does not. In such cases, apply the Timestamp method to each element with map().
print(df['X'].map(pd.Timestamp.timestamp))
# 0 1.509539e+09
# 1 1.511046e+09
# 2 1.512450e+09
# 3 1.513933e+09
# 4 1.515421e+09
# 5 1.516392e+09
# Name: X, dtype: float64
If you want to convert to an integer type int, use the astype() method.
print(df['X'].map(pd.Timestamp.timestamp).astype(int))
# 0 1509539040
# 1 1511046000
# 2 1512450300
# 3 1513932840
# 4 1515421200
# 5 1516392060
# Name: X, dtype: int64
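Integer UNIX seconds can also be computed without map(), using datetime arithmetic on the whole column. The sketch below subtracts the epoch and floor-divides by a one-second Timedelta; the inline data and variable names are only for illustration.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2017-11-01 12:24', '2017-11-18 23:00']))

# Subtracting the epoch gives a timedelta column; floor division by
# one second turns it into integer seconds.
epoch = pd.Timestamp('1970-01-01')
unix_seconds = (s - epoch) // pd.Timedelta('1s')
print(unix_seconds.tolist())
# [1509539040, 1511046000]
```

Like timestamp() on a timezone-naive Timestamp, this treats the values as UTC.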
For datetime indexes
An index of datetime64[ns] type (DatetimeIndex) is very useful when working with time series data. See the articles below for more details.
- 26_Pandas.DataFrame time series data processing
- 27_Pandas calculate total and average of time series data by day of week, month, quarter and year
In the example, set_index() is used to designate an existing column as an index, and the drop() method is used to remove redundant columns for convenience.
- 12_Pandas.DataFrame delete specified row and column (drop)
- 22_Pandas.DataFrame, reset the row name of the column (set_index)
df_i = df.set_index('X').drop(['en', 'cn'], axis=1)
print(df_i)
# A B
# X
# 2017-11-01 12:24:00 2017-11-01 12:24 2017年11月1日 12时24分
# 2017-11-18 23:00:00 2017-11-18 23:00 2017年11月18日 23时00分
# 2017-12-05 05:05:00 2017-12-05 5:05 2017年12月5日 5时05分
# 2017-12-22 08:54:00 2017-12-22 8:54 2017年12月22日 8时54分
# 2018-01-08 14:20:00 2018-01-08 14:20 2018年1月8日 14时20分
# 2018-01-19 20:01:00 2018-01-19 20:01 2018年1月19日 20时01分
print(df_i.index)
# DatetimeIndex(['2017-11-01 12:24:00', '2017-11-18 23:00:00',
# '2017-12-05 05:05:00', '2017-12-22 08:54:00',
# '2018-01-08 14:20:00', '2018-01-19 20:01:00'],
# dtype='datetime64[ns]', name='X', freq=None)
A DatetimeIndex has attributes such as year, month, and day (year, month, day), hour, minute, and second (hour, minute, second), and the day of the week as a number (dayofweek), as well as methods such as strftime() and day_name(), so all elements of the index can be processed at once without going through the dt accessor.
The return type varies by attribute and method (an Index or a NumPy array, depending on the pandas version) and is not a pandas.Series, but the result can still be added to the pandas.DataFrame as a new column by assigning it to a new column name.
print(df_i.index.minute)
# Int64Index([24, 0, 5, 54, 20, 1], dtype='int64', name='X')
print(df_i.index.strftime('%y/%m/%d'))
# ['17/11/01' '17/11/18' '17/12/05' '17/12/22' '18/01/08' '18/01/19']
df_i['min'] = df_i.index.minute
df_i['str'] = df_i.index.strftime('%y/%m/%d')
print(df_i)
# A B min str
# X
# 2017-11-01 12:24:00 2017-11-01 12:24 2017年11月1日 12时24分 24 17/11/01
# 2017-11-18 23:00:00 2017-11-18 23:00 2017年11月18日 23时00分 0 17/11/18
# 2017-12-05 05:05:00 2017-12-05 5:05 2017年12月5日 5时05分 5 17/12/05
# 2017-12-22 08:54:00 2017-12-22 8:54 2017年12月22日 8时54分 54 17/12/22
# 2018-01-08 14:20:00 2018-01-08 14:20 2018年1月8日 14时20分 20 18/01/08
# 2018-01-19 20:01:00 2018-01-19 20:01 2018年1月19日 20时01分 1 18/01/19
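The sketch below repeats the idea on a small DataFrame built inline, so it can be run without the CSV file. The column name value and its contents are made up for the example.

```python
import pandas as pd

idx = pd.DatetimeIndex(
    ['2017-11-01 12:24', '2017-11-18 23:00', '2017-12-05 05:05'], name='X')
df_i = pd.DataFrame({'value': [1, 2, 3]}, index=idx)  # 'value' is a made-up column

# Attributes and methods are called directly on the index, without dt.
df_i['hour'] = df_i.index.hour
df_i['weekday'] = df_i.index.day_name()
df_i['date'] = df_i.index.strftime('%Y-%m-%d')
print(df_i)
```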
Convert string to datetime64[ns] type when reading from file
When reading data from a file, you can convert strings to the datetime64[ns] type while reading. For the pandas.read_csv() function, pass a list of the column numbers to convert to the parse_dates parameter. Note that even a single column must be given as a list.
df_csv = pd.read_csv('data/sample_datetime_multi.csv', parse_dates=[0])
print(df_csv)
# A B
# 0 2017-11-01 12:24:00 2017年11月1日 12时24分
# 1 2017-11-18 23:00:00 2017年11月18日 23时00分
# 2 2017-12-05 05:05:00 2017年12月5日 5时05分
# 3 2017-12-22 08:54:00 2017年12月22日 8时54分
# 4 2018-01-08 14:20:00 2018年1月8日 14时20分
# 5 2018-01-19 20:01:00 2018年1月19日 20时01分
print(df_csv.dtypes)
# A datetime64[ns]
# B object
# dtype: object
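If the strings are not in a standard format, pass a function that parses them to the date_parser parameter. Note that date_parser is deprecated as of pandas 2.0 in favor of the date_format parameter.
df_csv_jp = pd.read_csv('./data/sample_datetime_multi.csv',
                        parse_dates=[1],
                        date_parser=lambda date: pd.to_datetime(date, format='%Y年%m月%d日 %H时%M分'))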
print(df_csv_jp)
# A B
# 0 2017-11-01 12:24 2017-11-01 12:24:00
# 1 2017-11-18 23:00 2017-11-18 23:00:00
# 2 2017-12-05 5:05 2017-12-05 05:05:00
# 3 2017-12-22 8:54 2017-12-22 08:54:00
# 4 2018-01-08 14:20 2018-01-08 14:20:00
# 5 2018-01-19 20:01 2018-01-19 20:01:00
print(df_csv_jp.dtypes)
# A object
# B datetime64[ns]
# dtype: object
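An approach that does not rely on date_parser is to read the column as strings and convert it afterwards with pandas.to_datetime(). The sketch below uses io.StringIO with two rows as an inline stand-in for the CSV file; the variable names csv_text and df_raw are only for this example.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file; only two rows are included.
csv_text = (
    'A,B\n'
    '2017-11-01 12:24,2017年11月1日 12时24分\n'
    '2017-11-18 23:00,2017年11月18日 23时00分\n'
)

# Read everything as strings first, then convert column B explicitly.
df_raw = pd.read_csv(io.StringIO(csv_text))
df_raw['B'] = pd.to_datetime(df_raw['B'], format='%Y年%m月%d日 %H时%M分')
print(df_raw.dtypes)
```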
The column to use as the index can be specified with the index_col parameter.
In that case, setting parse_dates=True converts the index column to the datetime64[ns] type.
df_csv_jp_i = pd.read_csv('./data/sample_datetime_multi.csv',
                          index_col=1,
                          parse_dates=True,
                          date_parser=lambda date: pd.to_datetime(date, format='%Y年%m月%d日 %H时%M分'))
print(df_csv_jp_i)
# A
# B
# 2017-11-01 12:24:00 2017-11-01 12:24
# 2017-11-18 23:00:00 2017-11-18 23:00
# 2017-12-05 05:05:00 2017-12-05 5:05
# 2017-12-22 08:54:00 2017-12-22 8:54
# 2018-01-08 14:20:00 2018-01-08 14:20
# 2018-01-19 20:01:00 2018-01-19 20:01
print(df_csv_jp_i.index)
# DatetimeIndex(['2017-11-01 12:24:00', '2017-11-18 23:00:00',
# '2017-12-05 05:05:00', '2017-12-22 08:54:00',
# '2018-01-08 14:20:00', '2018-01-19 20:01:00'],
# dtype='datetime64[ns]', name='B', freq=None)
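For a column in a standard format, index_col and parse_dates=True are enough on their own. The sketch below again uses io.StringIO as an inline stand-in for a file; the value column is made up for the example.

```python
import io

import pandas as pd

# Inline stand-in for a CSV file whose first column is in a standard format.
csv_text = (
    'A,value\n'
    '2017-11-01 12:24,1\n'
    '2017-11-18 23:00,2\n'
)

# index_col=0 makes column A the index; parse_dates=True parses the index.
df_idx = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(df_idx.index)
```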
The pandas.read_excel() function that reads Excel files also has parameters parse_dates, date_parser, and index_col, so similar conversions can be done while reading. See the following article for information on the pandas.read_excel() function.