52_Pandas processing date and time columns (string conversion, date extraction, etc.)

This article explains how to work with date and time columns in a pandas.DataFrame, including conversion between strings and the datetime64[ns] type and extraction of parts of a date or time as numbers.

The following topics are covered.

Convert the string to datetime64[ns] type (timestamp type): to_datetime()

Timestamp type properties/methods

Batch the entire column using the dt accessor

Extract dates, days of the week, etc.

Convert datetime to string in any format

Convert to Python datetime type, NumPy datetime64[ns] type array

For methods not provided in dt

For datetime indexes

Convert string to datetime64[ns] type when reading from file

Setting a datetime64[ns] column as the index and processing the data as a time series is covered in a separate article.

As an example, consider a pandas.DataFrame read from the following CSV file.

import pandas as pd
import datetime

df = pd.read_csv('./data/sample_datetime_multi.csv')

print(df)
#                  A                   B
#0  2017-11-01 12:24   2017年11月1日 12时24分
#1  2017-11-18 23:00  2017年11月18日 23时00分
#2   2017-12-05 5:05    2017年12月5日 5时05分
#3   2017-12-22 8:54   2017年12月22日 8时54分
#4  2018-01-08 14:20    2018年1月8日 14时20分
#5  2018-01-19 20:01   2018年1月19日 20时01分

Convert the string to datetime64[ns] type (timestamp type): to_datetime()

Using the pandas.to_datetime() function, you can convert a pandas.Series of string columns representing dates and times to datetime64[ns] type.

print(pd.to_datetime(df['A']))
# 0   2017-11-01 12:24:00
# 1   2017-11-18 23:00:00
# 2   2017-12-05 05:05:00
# 3   2017-12-22 08:54:00
# 4   2018-01-08 14:20:00
# 5   2018-01-19 20:01:00
# Name: A, dtype: datetime64[ns]

If the format is not standard, specify a format string in the parameter format.

print(pd.to_datetime(df['B'], format='%Y年%m月%d日 %H时%M分'))
# 0   2017-11-01 12:24:00
# 1   2017-11-18 23:00:00
# 2   2017-12-05 05:05:00
# 3   2017-12-22 08:54:00
# 4   2018-01-08 14:20:00
# 5   2018-01-19 20:01:00
# Name: B, dtype: datetime64[ns]
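
When some strings in a column do not match the expected format, the conversion raises an exception by default. A minimal sketch of one way to handle this, assuming a hypothetical Series whose last element is not a date, is to pass errors='coerce' so that unparseable values become NaT:

```python
import pandas as pd

# Hypothetical sample data: the last entry is not a valid date string.
s = pd.Series(['2017-11-01 12:24', '2017-11-18 23:00', 'not a date'])

# errors='coerce' replaces values that cannot be parsed with NaT
# instead of raising an exception.
converted = pd.to_datetime(s, format='%Y-%m-%d %H:%M', errors='coerce')

print(converted.isna().tolist())
# [False, False, True]
```

The rows that failed to convert can then be inspected or dropped using isna().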

Even if the original formats are different, datetime64[ns] type values are equal if they indicate the same date and time.

print(pd.to_datetime(df['A']) == pd.to_datetime(df['B'], format='%Y年%m月%d日 %H时%M分'))
# 0    True
# 1    True
# 2    True
# 3    True
# 4    True
# 5    True
# dtype: bool

If you want to add a column converted to datetime64[ns] type as a new column to a pandas.DataFrame, specify the new column name and assign it. If you specify the original column name, it will be overwritten.

df['X'] = pd.to_datetime(df['A'])

print(df)
#                   A                   B                   X
#0  2017-11-01 12:24   2017年11月1日 12时24分 2017-11-01 12:24:00
#1  2017-11-18 23:00  2017年11月18日 23时00分 2017-11-18 23:00:00
#2   2017-12-05 5:05    2017年12月5日 5时05分 2017-12-05 05:05:00
#3   2017-12-22 8:54   2017年12月22日 8时54分 2017-12-22 08:54:00
#4  2018-01-08 14:20    2018年1月8日 14时20分 2018-01-08 14:20:00
#5  2018-01-19 20:01   2018年1月19日 20时01分 2018-01-19 20:01:00

Timestamp type properties/methods

The dtype of the column converted by the pandas.to_datetime() function is the datetime64[ns] type, and each element is of the Timestamp type.

print(df)
#                   A                   B                   X
# 0  2017-11-01 12:24   2017年11月1日 12时24分 2017-11-01 12:24:00
# 1  2017-11-18 23:00  2017年11月18日 23时00分 2017-11-18 23:00:00
# 2   2017-12-05 5:05    2017年12月5日 5时05分 2017-12-05 05:05:00
# 3   2017-12-22 8:54   2017年12月22日 8时54分 2017-12-22 08:54:00
# 4  2018-01-08 14:20    2018年1月8日 14时20分 2018-01-08 14:20:00
# 5  2018-01-19 20:01   2018年1月19日 20时01分 2018-01-19 20:01:00

print(df.dtypes)
# A            object
# B            object
# X    datetime64[ns]
# dtype: object

print(df['X'][0])
# 2017-11-01 12:24:00

print(type(df['X'][0]))
# <class 'pandas._libs.tslib.Timestamp'>

The Timestamp type inherits from and extends the datetime type of the Python standard library datetime.

print(issubclass(pd.Timestamp, datetime.datetime))
# True
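
Because of this inheritance, a Timestamp can be compared with and subtracted from a standard library datetime object directly. The following is a small sketch with values chosen for illustration:

```python
import datetime
import pandas as pd

ts = pd.Timestamp('2017-11-01 12:24')
py_dt = datetime.datetime(2017, 11, 1, 12, 24)

# A Timestamp is an instance of datetime.datetime.
print(isinstance(ts, datetime.datetime))
# True

# Comparison with a plain datetime works as expected.
print(ts == py_dt)
# True

# Subtraction returns a pandas Timedelta.
diff = ts - py_dt
print(diff)
# 0 days 00:00:00
```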

Year, month, day (year, month, day), hour, minute, second (hour, minute, second), and day of the week as a number (dayofweek) can be obtained as attributes. The day name as a string is returned by the day_name() method; the weekday_name attribute used in older versions was removed in pandas 1.0.

print(df['X'][0].year)
# 2017

print(df['X'][0].day_name())
# Wednesday

You can also use to_pydatetime() to convert to the Python standard library datetime type, and to_datetime64() to convert to the NumPy datetime64[ns] type.

py_dt = df['X'][0].to_pydatetime()
print(type(py_dt))
# <class 'datetime.datetime'>

dt64 = df['X'][0].to_datetime64()
print(type(dt64))
# <class 'numpy.datetime64'>

timestamp() is a method that returns the UNIX time (epoch seconds = seconds since January 1, 1970 00:00:00) as a float type. If you need an integer, use int().

print(df['X'][0].timestamp())
# 1509539040.0

print(pd.to_datetime('1970-01-01 00:00:00').timestamp())
# 0.0

print(int(df['X'][0].timestamp()))
# 1509539040

Like the datetime type in the Python standard library, strftime() can be used to convert to a string of any format. See below for how to apply this to all elements of a column.

print(df['X'][0].strftime('%Y/%m/%d'))
# 2017/11/01

Batch the entire column using the dt accessor

Just as the str accessor applies string processing to an entire pandas.Series, the dt accessor applies date and time processing to an entire datetime64[ns] column.

Extract dates, days of the week, etc.

Like the Timestamp type, year, month, day (year, month, day), hour, minute, second (hour, minute, second), and day of the week as a number (dayofweek) can be obtained as attributes, and the day name as a string with the day_name() method. Write the attribute or method name after dt. Each element of the pandas.Series is processed and a pandas.Series is returned.

print(df['X'].dt.year)
# 0    2017
# 1    2017
# 2    2017
# 3    2017
# 4    2018
# 5    2018
# Name: X, dtype: int64

print(df['X'].dt.hour)
# 0    12
# 1    23
# 2     5
# 3     8
# 4    14
# 5    20
# Name: X, dtype: int64

The dayofweek attribute (0 for Monday, 6 for Sunday) can also be used to select only the rows that fall on specific days of the week.

print(df['X'].dt.dayofweek)
# 0    2
# 1    5
# 2    1
# 3    4
# 4    0
# 5    4
# Name: X, dtype: int64

print(df[df['X'].dt.dayofweek == 4])
#                   A                  B                   X
# 3   2017-12-22 8:54  2017年12月22日 8时54分 2017-12-22 08:54:00
# 5  2018-01-19 20:01  2018年1月19日 20时01分 2018-01-19 20:01:00
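
The same idea extends to several weekdays at once. The sketch below uses a small hypothetical DataFrame, adds the day name with dt.day_name(), and selects weekend rows with isin():

```python
import pandas as pd

# Hypothetical sample data with one Saturday.
df_w = pd.DataFrame({
    'X': pd.to_datetime(pd.Series(['2017-11-01 12:24',
                                   '2017-11-18 23:00',
                                   '2017-12-05 05:05']))
})

# Day name as a string for every element of the column.
df_w['weekday'] = df_w['X'].dt.day_name()
print(df_w['weekday'].tolist())
# ['Wednesday', 'Saturday', 'Tuesday']

# dayofweek: 5 is Saturday, 6 is Sunday.
weekend = df_w[df_w['X'].dt.dayofweek.isin([5, 6])]
print(weekend.index.tolist())
# [1]
```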

Convert datetime to string in any format

When converting a column of type datetime64[ns] to type string str using the astype() method, it is converted to a string in the standard format.

print(df['X'].astype(str))
# 0    2017-11-01 12:24:00
# 1    2017-11-18 23:00:00
# 2    2017-12-05 05:05:00
# 3    2017-12-22 08:54:00
# 4    2018-01-08 14:20:00
# 5    2018-01-19 20:01:00
# Name: X, dtype: object

dt.strftime() can be used to convert a column to a string of any format in one go. It is also possible to make it a string with only date or only time.

print(df['X'].dt.strftime('%A, %B %d, %Y'))
# 0    Wednesday, November 01, 2017
# 1     Saturday, November 18, 2017
# 2      Tuesday, December 05, 2017
# 3       Friday, December 22, 2017
# 4        Monday, January 08, 2018
# 5        Friday, January 19, 2018
# Name: X, dtype: object

print(df['X'].dt.strftime('%Y年%m月%d日'))
# 0    2017年11月01日
# 1    2017年11月18日
# 2    2017年12月05日
# 3    2017年12月22日
# 4    2018年01月08日
# 5    2018年01月19日
# Name: X, dtype: object
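
As a short illustration of the date-only and time-only cases mentioned above, the following sketch uses a hypothetical two-element Series and format strings chosen for the example:

```python
import pandas as pd

s_fmt = pd.to_datetime(pd.Series(['2017-11-01 12:24', '2017-12-05 05:05']))

# Date part only.
date_str = s_fmt.dt.strftime('%Y-%m-%d')
print(date_str.tolist())
# ['2017-11-01', '2017-12-05']

# Time part only.
time_str = s_fmt.dt.strftime('%H:%M')
print(time_str.tolist())
# ['12:24', '05:05']
```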

If you want to add a column converted to a string as a new column to a pandas.DataFrame, specify the new column name and assign it. If you specify the original column name, it will be overwritten.

df['en'] = df['X'].dt.strftime('%A, %B %d, %Y')
df['cn'] = df['X'].dt.strftime('%Y年%m月%d日')

print(df)
#                   A                   B                   X  \
# 0  2017-11-01 12:24   2017年11月1日 12时24分 2017-11-01 12:24:00   
# 1  2017-11-18 23:00  2017年11月18日 23时00分 2017-11-18 23:00:00   
# 2   2017-12-05 5:05    2017年12月5日 5时05分 2017-12-05 05:05:00   
# 3   2017-12-22 8:54   2017年12月22日 8时54分 2017-12-22 08:54:00   
# 4  2018-01-08 14:20    2018年1月8日 14时20分 2018-01-08 14:20:00   
# 5  2018-01-19 20:01   2018年1月19日 20时01分 2018-01-19 20:01:00   
#                              en           cn
# 0  Wednesday, November 01, 2017  2017年11月01日  
# 1   Saturday, November 18, 2017  2017年11月18日  
# 2    Tuesday, December 05, 2017  2017年12月05日  
# 3     Friday, December 22, 2017  2017年12月22日  
# 4      Monday, January 08, 2018  2018年01月08日  
# 5      Friday, January 19, 2018  2018年01月19日  

Convert to Python datetime type, NumPy datetime64[ns] type array

A NumPy array ndarray whose elements are Python standard library datetime type objects can be obtained using dt.to_pydatetime().

print(df['X'].dt.to_pydatetime())
# [datetime.datetime(2017, 11, 1, 12, 24)
#  datetime.datetime(2017, 11, 18, 23, 0)
#  datetime.datetime(2017, 12, 5, 5, 5)
#  datetime.datetime(2017, 12, 22, 8, 54)
#  datetime.datetime(2018, 1, 8, 14, 20)
#  datetime.datetime(2018, 1, 19, 20, 1)]

print(type(df['X'].dt.to_pydatetime()))
print(type(df['X'].dt.to_pydatetime()[0]))
# <class 'numpy.ndarray'>
# <class 'datetime.datetime'>

A NumPy array of type datetime64[ns] can be obtained with the values attribute, which is an attribute rather than a method.

print(df['X'].values)
# ['2017-11-01T12:24:00.000000000' '2017-11-18T23:00:00.000000000'
#  '2017-12-05T05:05:00.000000000' '2017-12-22T08:54:00.000000000'
#  '2018-01-08T14:20:00.000000000' '2018-01-19T20:01:00.000000000']

print(type(df['X'].values))
print(type(df['X'].values[0]))
# <class 'numpy.ndarray'>
# <class 'numpy.datetime64'>
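
The to_numpy() method of pandas.Series can be used for the same purpose. The sketch below, on a hypothetical Series, only checks that the result is a NumPy array of datetime64 values; the exact time unit of the dtype may depend on the pandas version.

```python
import numpy as np
import pandas as pd

s_np = pd.to_datetime(pd.Series(['2017-11-01 12:24', '2017-11-18 23:00']))

# to_numpy() returns the underlying data as a NumPy array.
arr = s_np.to_numpy()

print(type(arr))
# <class 'numpy.ndarray'>

# The element dtype is a NumPy datetime64 type.
print(np.issubdtype(arr.dtype, np.datetime64))
# True
```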

For methods not provided in dt

For example, the Timestamp type has a method (timestamp()) that returns the UNIX time in seconds, but the dt accessor does not. In this case, just use map().

print(df['X'].map(pd.Timestamp.timestamp))
# 0    1.509539e+09
# 1    1.511046e+09
# 2    1.512450e+09
# 3    1.513933e+09
# 4    1.515421e+09
# 5    1.516392e+09
# Name: X, dtype: float64

If you want to convert to an integer type int, use the astype() method.

print(df['X'].map(pd.Timestamp.timestamp).astype(int))
# 0    1509539040
# 1    1511046000
# 2    1512450300
# 3    1513932840
# 4    1515421200
# 5    1516392060
# Name: X, dtype: int64
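
As an alternative to map(), epoch seconds can also be computed with vectorized arithmetic: subtract the epoch and floor-divide by a one-second Timedelta. This is a sketch on a hypothetical Series of timezone-naive values, which are treated as UTC here:

```python
import pandas as pd

s_ep = pd.to_datetime(pd.Series(['2017-11-01 12:24', '2017-11-18 23:00']))

# (datetime - epoch) gives a Timedelta; floor division by one second
# yields the number of whole seconds as integers.
epoch_seconds = (s_ep - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')

print(epoch_seconds.tolist())
# [1509539040, 1511046000]
```

This avoids calling a Python function per element, which may matter for large columns.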

For datetime indexes

Setting a datetime64[ns] column as the index gives a DatetimeIndex, which is very useful when working with time series data. Details are covered in a separate article.

In the example, set_index() is used to designate an existing column as an index, and the drop() method is used to remove redundant columns for convenience.

df_i = df.set_index('X').drop(['en', 'cn'], axis=1)

print(df_i)
#                                     A                   B
# X                                                        
# 2017-11-01 12:24:00  2017-11-01 12:24   2017年11月1日 12时24分
# 2017-11-18 23:00:00  2017-11-18 23:00  2017年11月18日 23时00分
# 2017-12-05 05:05:00   2017-12-05 5:05    2017年12月5日 5时05分
# 2017-12-22 08:54:00   2017-12-22 8:54   2017年12月22日 8时54分
# 2018-01-08 14:20:00  2018-01-08 14:20    2018年1月8日 14时20分
# 2018-01-19 20:01:00  2018-01-19 20:01   2018年1月19日 20时01分

print(df_i.index)
# DatetimeIndex(['2017-11-01 12:24:00', '2017-11-18 23:00:00',
#                '2017-12-05 05:05:00', '2017-12-22 08:54:00',
#                '2018-01-08 14:20:00', '2018-01-19 20:01:00'],
#               dtype='datetime64[ns]', name='X', freq=None)

A DatetimeIndex has attributes such as year, month, day (year, month, day), hour, minute, second (hour, minute, second), and day of the week as a number (dayofweek), as well as methods such as strftime() and day_name(), so all index elements can be processed at once without going through the dt accessor.

The return type depends on the attribute or method and is not a pandas.Series, but the result can still be added as a new column of the pandas.DataFrame by assigning it to a new column name.

print(df_i.index.minute)
# Int64Index([24, 0, 5, 54, 20, 1], dtype='int64', name='X')

print(df_i.index.strftime('%y/%m/%d'))
# ['17/11/01' '17/11/18' '17/12/05' '17/12/22' '18/01/08' '18/01/19']

df_i['min'] = df_i.index.minute
df_i['str'] = df_i.index.strftime('%y/%m/%d')

print(df_i)
#                                     A                   B  min       str
# X
# 2017-11-01 12:24:00  2017-11-01 12:24   2017年11月1日 12时24分   24  17/11/01
# 2017-11-18 23:00:00  2017-11-18 23:00  2017年11月18日 23时00分    0  17/11/18
# 2017-12-05 05:05:00   2017-12-05 5:05    2017年12月5日 5时05分    5  17/12/05
# 2017-12-22 08:54:00   2017-12-22 8:54   2017年12月22日 8时54分   54  17/12/22
# 2018-01-08 14:20:00  2018-01-08 14:20    2018年1月8日 14时20分   20  18/01/08
# 2018-01-19 20:01:00  2018-01-19 20:01   2018年1月19日 20时01分    1  18/01/19
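
The index attributes can also be used directly in a boolean condition. The following sketch builds a small hypothetical DataFrame with a DatetimeIndex, selects the rows from one year, and adds the day name as a column:

```python
import pandas as pd

# Hypothetical data with a DatetimeIndex.
df_idx = pd.DataFrame(
    {'value': [1, 2, 3]},
    index=pd.to_datetime(['2017-11-01 12:24',
                          '2017-12-22 08:54',
                          '2018-01-08 14:20']),
)

# Select rows whose index falls in 2018.
df_2018 = df_idx[df_idx.index.year == 2018]
print(df_2018['value'].tolist())
# [3]

# Add the day name of each index element as a new column.
df_idx['weekday'] = df_idx.index.day_name()
print(df_idx['weekday'].tolist())
# ['Wednesday', 'Friday', 'Monday']
```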

Convert string to datetime64[ns] type when reading from file

When reading data from a file, you can convert the string to datetime64[ns] type while reading. For the pandas.read_csv() function, specify in the parameter parse_dates a list of column numbers to convert to datetime64[ns] type. Note that even if there is only one, it must be listed.

df_csv = pd.read_csv('data/sample_datetime_multi.csv', parse_dates=[0])

print(df_csv)
#                     A                   B
# 0 2017-11-01 12:24:00   2017年11月1日 12时24分
# 1 2017-11-18 23:00:00  2017年11月18日 23时00分
# 2 2017-12-05 05:05:00    2017年12月5日 5时05分
# 3 2017-12-22 08:54:00   2017年12月22日 8时54分
# 4 2018-01-08 14:20:00    2018年1月8日 14时20分
# 5 2018-01-19 20:01:00   2018年1月19日 20时01分

print(df_csv.dtypes)
# A    datetime64[ns]
# B            object
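# dtype: object

By default, strings are parsed assuming a standard format. If the format is not standard, pass a function that performs the conversion to the parameter date_parser.

df_csv_jp = pd.read_csv('./data/sample_datetime_multi.csv',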
                        parse_dates=[1],
                        date_parser=lambda date: pd.to_datetime(date, format='%Y年%m月%d日 %H时%M分'))

print(df_csv_jp)
#                   A                   B
# 0  2017-11-01 12:24 2017-11-01 12:24:00
# 1  2017-11-18 23:00 2017-11-18 23:00:00
# 2   2017-12-05 5:05 2017-12-05 05:05:00
# 3   2017-12-22 8:54 2017-12-22 08:54:00
# 4  2018-01-08 14:20 2018-01-08 14:20:00
# 5  2018-01-19 20:01 2018-01-19 20:01:00

print(df_csv_jp.dtypes)
# A            object
# B    datetime64[ns]
# dtype: object
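
Note that date_parser is deprecated in pandas 2.0 and later. One approach that does not depend on it is to read the column as strings and convert it afterwards with pd.to_datetime(). The sketch below uses an in-memory CSV (io.StringIO) with two rows copied from the sample data, as a stand-in for the file:

```python
import io
import pandas as pd

# In-memory stand-in for the sample CSV file (assumption for this sketch).
csv_text = """A,B
2017-11-01 12:24,2017年11月1日 12时24分
2017-11-18 23:00,2017年11月18日 23时00分
"""

df_conv = pd.read_csv(io.StringIO(csv_text))

# Convert column B after reading, using an explicit format string.
df_conv['B'] = pd.to_datetime(df_conv['B'], format='%Y年%m月%d日 %H时%M分')

print(df_conv['B'][0])
# 2017-11-01 12:24:00
```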

The column to be indexed can be specified with the parameter index_col.

In this case, if the parameter parse_dates=True, the index column will be converted to datetime64[ns] type.

df_csv_jp_i = pd.read_csv('./data/sample_datetime_multi.csv',
                          index_col=1,
                          parse_dates=True,
                          date_parser=lambda date: pd.to_datetime(date, format='%Y年%m月%d日 %H时%M分'))

print(df_csv_jp_i)
#                                     A
# B                                    
# 2017-11-01 12:24:00  2017-11-01 12:24
# 2017-11-18 23:00:00  2017-11-18 23:00
# 2017-12-05 05:05:00   2017-12-05 5:05
# 2017-12-22 08:54:00   2017-12-22 8:54
# 2018-01-08 14:20:00  2018-01-08 14:20
# 2018-01-19 20:01:00  2018-01-19 20:01

print(df_csv_jp_i.index)
# DatetimeIndex(['2017-11-01 12:24:00', '2017-11-18 23:00:00',
#                '2017-12-05 05:05:00', '2017-12-22 08:54:00',
#                '2018-01-08 14:20:00', '2018-01-19 20:01:00'],
#               dtype='datetime64[ns]', name='B', freq=None)

The pandas.read_excel() function, which reads Excel files, also has the parameters parse_dates, date_parser, and index_col, so similar conversions can be done while reading. The pandas.read_excel() function itself is covered in a separate article.

Origin blog.csdn.net/qq_18351157/article/details/127703926