select data based on datetime in pandas dataframe

dhu :

I am trying to build a kind of "functional select" that lets users write configuration to select data in pandas DataFrames. However, I ran into some issues that puzzle me.

The following is a simplified example:

>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range(start='2020-01-01', periods=4), 'val': [1, 2, 3, 4]})
>>> df
        date  val
0 2020-01-01    1
1 2020-01-02    2
2 2020-01-03    3
3 2020-01-04    4

Question 1: Why do I get different results depending on how I apply the function to the column?

>>> import datetime
>>> bydatetime = lambda x : x == datetime.date(2020, 1, 1)
>>> bydatetime(df['date'])
0    False
1    False
2    False
3    False
Name: date, dtype: bool
>>> df['date'].apply(bydatetime) # why does this one work?
0     True
1    False
2    False
3    False
Name: date, dtype: bool

However, if I use numpy's datetime64 or pandas' Timestamp type to create the lambda function, both ways of applying it work.

>>> import numpy as np
>>> bynpdatetime = lambda x : x == np.datetime64('2020-01-01')
>>> bynpdatetime(df['date'])
0     True
1    False
2    False
3    False
Name: date, dtype: bool
>>> df['date'].apply(bynpdatetime)
0     True
1    False
2    False
3    False
Name: date, dtype: bool
>>> bypdtimestamp = lambda x : x == pd.Timestamp('2020-01-01')
>>> bypdtimestamp(df['date'])
0     True
1    False
2    False
3    False
Name: date, dtype: bool
>>> df['date'].apply(bypdtimestamp)
0     True
1    False
2    False
3    False
Name: date, dtype: bool

So I reverted to the following simple selection, and datetime.date didn't work there either. If datetime.date simply doesn't work, why does df['date'].apply(bydatetime) work?

>>> df[df['date'] == datetime.date(2020, 1, 1)]
Empty DataFrame
Columns: [date, val]
Index: []
>>> df[df['date'] == np.datetime64('2020-01-01')]
        date  val
0 2020-01-01    1
>>> df[df['date'] == pd.Timestamp('2020-01-01')]
        date  val
0 2020-01-01    1

Last but not least, why is the type of the date column datetime64 in the DataFrame but Timestamp when I select a single cell? What exactly is the difference between them?

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    4 non-null      datetime64[ns]
 1   val     4 non-null      int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 192.0 bytes
>>>
>>> df['date'][0]
Timestamp('2020-01-01 00:00:00')

I am sure there is something fundamental that I don't understand here. Thank you very much for anything constructive.

ALollz :

Luckily I have an older version of pandas (0.25), where you get a warning when you do bydatetime(df['date']), which explains exactly why you see that behavior. There was a bit of back and forth on how to handle this, so the behavior you see will be highly version specific:

FutureWarning: Comparing Series of datetimes with 'datetime.date'. Currently, the 'datetime.date' is coerced to a datetime. In the future pandas will not coerce, and 'the values will not compare equal to the 'datetime.date'. To retain the current behavior, convert the 'datetime.date' to a datetime with 'pd.Timestamp'.

Datetime functionality in pandas is built upon the np.datetime64 and np.timedelta64 dtypes. You should not use the standard library's datetime module with it, because pandas has made certain choices that are inconsistent with the standard library. All of the unintended behavior comes from this.
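
As a minimal sketch of the advice in the warning, the datetime.date coming from user configuration can be converted to a pd.Timestamp before comparing, so the result should not depend on how a given pandas version treats datetime.date. The names target, mask, selected and by_day are only illustrative, and the .dt.date variant is an alternative I am adding, not something from the warning itself:

```python
import datetime

import pandas as pd

df = pd.DataFrame({'date': pd.date_range(start='2020-01-01', periods=4),
                   'val': [1, 2, 3, 4]})

# Convert the standard-library date to pandas' own scalar type first.
target = pd.Timestamp(datetime.date(2020, 1, 1))

# Vectorized comparison of datetime64 values against a Timestamp.
mask = df['date'] == target
selected = df[mask]
print(selected)

# Alternative: compare calendar dates only by extracting datetime.date
# objects from the column (this yields an object-dtype Series, so it is slower).
by_day = df['date'].dt.date == datetime.date(2020, 1, 1)
```

Both mask and by_day select only the first row here. The second form ignores the time of day, which may or may not be what a user-facing selector wants.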


To answer the other, unrelated question: datetime64 is like the array type, or the concept. That array (in this case a pd.Series) is made up of scalar Timestamp objects, which is what you get back when you select a single cell. This is explained in the documentation.
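
A small sketch of that distinction, using the same kind of column as in the question (the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range(start='2020-01-01', periods=4))

# The Series as a whole is stored with a datetime64 dtype.
is_datetime_column = pd.api.types.is_datetime64_any_dtype(s)

# Selecting one element boxes it into pandas' scalar type, Timestamp.
first = s.iloc[0]
is_timestamp = isinstance(first, pd.Timestamp)

# The underlying NumPy array holds np.datetime64 scalars instead.
raw = s.to_numpy()
is_np_scalar = isinstance(raw[0], np.datetime64)

# Both scalars represent the same instant.
same_instant = pd.Timestamp(raw[0]) == first

print(s.dtype, type(first), type(raw[0]))
```

So the dtype describes how the whole column is stored, while Timestamp is the pandas object handed back for an individual value.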
