I am trying to create a sort of "functional select" that gives users the flexibility to configure how data is selected from pandas DataFrames. However, I ran into some issues that puzzle me.
The following is a simplified example:
>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range(start='2020-01-01', periods=4), 'val': [1, 2, 3, 4]})
>>> df
        date  val
0 2020-01-01    1
1 2020-01-02    2
2 2020-01-03    3
3 2020-01-04    4
Question 1: Why do I get different results depending on how I apply the function to the column?
>>> import datetime
>>> bydatetime = lambda x : x == datetime.date(2020, 1, 1)
>>> bydatetime(df['date'])
0 False
1 False
2 False
3 False
Name: date, dtype: bool
>>> df['date'].apply(bydatetime) # why does this one work?
0 True
1 False
2 False
3 False
Name: date, dtype: bool
However, if I use numpy's datetime64 or pandas' Timestamp types to create the lambda function, it works either way.
>>> import numpy as np
>>> bynpdatetime = lambda x : x == np.datetime64('2020-01-01')
>>> bynpdatetime(df['date'])
0 True
1 False
2 False
3 False
Name: date, dtype: bool
>>> df['date'].apply(bynpdatetime)
0 True
1 False
2 False
3 False
Name: date, dtype: bool
>>> bypdtimestamp = lambda x : x == pd.Timestamp('2020-01-01')
>>> bypdtimestamp(df['date'])
0 True
1 False
2 False
3 False
Name: date, dtype: bool
>>> df['date'].apply(bypdtimestamp)
0 True
1 False
2 False
3 False
Name: date, dtype: bool
So I reverted to using the following simple selection, and datetime.date didn't work. If datetime.date simply doesn't work, why does df['date'].apply(bydatetime) work?
>>> df[df['date'] == datetime.date(2020, 1, 1)]
Empty DataFrame
Columns: [date, val]
Index: []
>>> df[df['date'] == np.datetime64('2020-01-01')]
        date  val
0 2020-01-01    1
>>> df[df['date'] == pd.Timestamp('2020-01-01')]
        date  val
0 2020-01-01    1
Last but not least, why is the dtype of the date column datetime64 in the DataFrame, but the type Timestamp when I select a single cell? What exactly is the difference between them?
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    4 non-null      datetime64[ns]
 1   val     4 non-null      int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 192.0 bytes
>>>
>>> df['date'][0]
Timestamp('2020-01-01 00:00:00')
I am sure there is something fundamental that I don't understand here. Thank you very much for anything constructive.
Luckily I have an older version of pandas (0.25), and you get a warning when you do bydatetime(df['date']), which explains exactly why you see that behavior. There was a bit of back and forth on how to handle this, so the behavior you see will be highly version specific:
FutureWarning: Comparing Series of datetimes with 'datetime.date'. Currently, the 'datetime.date' is coerced to a datetime. In the future pandas will not coerce, and 'the values will not compare equal to the 'datetime.date'. To retain the current behavior, convert the 'datetime.date' to a datetime with 'pd.Timestamp'.
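In other words, the fix the warning itself suggests is to wrap the datetime.date in pd.Timestamp before comparing. A minimal sketch of the question's setup with that fix applied:

```python
import datetime

import pandas as pd

df = pd.DataFrame({'date': pd.date_range(start='2020-01-01', periods=4),
                   'val': [1, 2, 3, 4]})

# Convert the datetime.date to a Timestamp before comparing, as the
# FutureWarning recommends; this behaves the same across pandas versions.
target = pd.Timestamp(datetime.date(2020, 1, 1))
mask = df['date'] == target
print(df[mask])  # selects only the 2020-01-01 row
```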
Datetime functionality in pandas is built upon the np.datetime64 and np.timedelta64 dtypes. You should avoid the standard-library datetime module here, as pandas has made certain choices that are inconsistent with it. All of the unintended behavior you see stems from this.
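Applied to the "functional select" idea from the question, one way to stay inside pandas' native types is to normalize whatever the user passes through pd.Timestamp, which accepts strings, datetime.date, datetime.datetime and np.datetime64 alike. A sketch; the factory name date_equals is mine, not from the question:

```python
import datetime

import numpy as np
import pandas as pd


def date_equals(value):
    """Return a predicate comparing a datetime64[ns] column to value."""
    target = pd.Timestamp(value)  # normalizes str, date, datetime, datetime64
    return lambda s: s == target


df = pd.DataFrame({'date': pd.date_range(start='2020-01-01', periods=4),
                   'val': [1, 2, 3, 4]})

# All three spellings produce the same boolean mask.
m1 = date_equals('2020-01-01')(df['date'])
m2 = date_equals(datetime.date(2020, 1, 1))(df['date'])
m3 = date_equals(np.datetime64('2020-01-01'))(df['date'])
print(df[m1])
```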
To answer the other, unrelated question: datetime64 is the array dtype, or the concept. That array (in this case a pd.Series) is made up of scalar Timestamp objects. This is explained in the documentation.
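To make the dtype-vs-scalar distinction concrete: the Series container reports the NumPy-style dtype, while indexing out a single element yields a pandas Timestamp scalar. A quick check:

```python
import pandas as pd

s = pd.Series(pd.date_range('2020-01-01', periods=2))

print(s.dtype)     # the array dtype: datetime64[ns]
print(type(s[0]))  # the scalar type: pandas Timestamp
```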