Code Sample, a copy-pastable example if possible

```python
import pandas as pd

df = pd.DataFrame({'end_time': [pd.to_datetime('now', utc=True).tz_convert('Asia/Singapore')],
                   'id': [1]})
df['max_end_time'] = df.groupby('id').end_time.transform(max)
df.info()
```
shows

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
end_time        1 non-null datetime64[ns, Asia/Singapore]
id              1 non-null int64
max_end_time    1 non-null datetime64[ns]
dtypes: datetime64[ns, Asia/Singapore](1), datetime64[ns](1), int64(1)
memory usage: 104.0 bytes
```
```python
df.to_dict()
```

shows

```
{'end_time': {0: Timestamp('2018-12-10 17:08:52.630644+0800', tz='Asia/Singapore')},
 'id': {0: 1},
 'max_end_time': {0: Timestamp('2018-12-10 09:08:52.630644')}}
```
Problem description
The timezone is silently dropped and the timestamps are converted to UTC after a groupby-transform operation on a tz-aware datetime column.
Expected Output
```python
assert df['end_time'].equals(df['max_end_time'])
```
Output of pd.show_versions()
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.85-38.58.amzn1.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```
TomAugspurger commented on Dec 10, 2018
This is working correctly on my branch at #24024, but not on master.
I'll add this as a test case.
TomAugspurger commented on Dec 10, 2018
Ahh... but it isn't correct over there either. The value is wrong.
WillAyd commented on Dec 10, 2018
Yeah, I'm not terribly surprised by this, as a lot of the cythonized groupby operations use the float value of timestamps for comparisons and may never cast back.
Can such ops be generalized when working with tz-aware data? Is it ever ambiguous to do such a thing when crossing time zones?
If it makes sense, then I'm wondering if this is something that an EA needs to handle appropriately.
WillAyd commented on Dec 10, 2018
Potentially a duplicate of, or at least related to, #21603.
mroeschke commented on Dec 10, 2018
@WillAyd Since it appears that the groupby routines convert to UTC first before running the aggregates, it should be as straightforward as ensuring that tz-aware columns undergo a

tz_localize('UTC').tz_convert(tz)

after the cython routine. We shouldn't need to worry about ambiguous/nonexistent times.

WillAyd commented on Dec 10, 2018
Makes sense. Do we want the groupby ops to handle this special casing, though, or is this a better fit for the EA implementation? The former would probably be less work at the moment and provide better performance, though the latter would probably fit better into our programming paradigm going forward.
mroeschke commented on Dec 10, 2018
I don't have a clear picture of how generically the groupby ops will interface with the EA in this case. I don't think we'd want to do timezone conversion within EA internals (timezone localization would be okay, though), i.e.

EA(utc --> tz_local)

vs.

EA(utc).tz_convert(tz_local)

I think the most important piece is that it works generically across all groupby ops and methods.
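The repair step under discussion can be sketched outside of the groupby machinery. This is a minimal illustration of the proposed tz_localize('UTC').tz_convert(tz) post-processing, not actual pandas internals; the naive-UTC series is constructed by hand to stand in for what the cython routine effectively returns:

```python
import pandas as pd

tz = 'Asia/Singapore'
aware = pd.Series([pd.Timestamp('2018-12-10 17:08:52', tz=tz)])

# Simulate what the cython routine effectively hands back:
# naive timestamps whose wall time is UTC.
naive_utc = aware.dt.tz_convert('UTC').dt.tz_localize(None)

# Proposed repair: re-attach UTC, then convert back to the column's tz.
restored = naive_utc.dt.tz_localize('UTC').dt.tz_convert(tz)

assert restored.equals(aware)
```

Because the naive values are already UTC wall times, localizing to UTC is exact and the subsequent tz_convert cannot hit ambiguous or nonexistent local times.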
WillAyd commented on Dec 10, 2018
The more I think about this, the less sure I am that it can actually be supported generically, at least not for operations that reduce. If we had two timestamps with the same UTC time but in different timezones, which timezone would max return?
This would assume that the input always has the same timezone, no?
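For what it's worth, the mixed-timezone case already falls out of the datetime64 path at construction time. A small check (assuming the construction behavior current at the time of this discussion):

```python
import pandas as pd

# Two timestamps denoting the same UTC instant in different timezones.
t1 = pd.Timestamp('2018-12-10 09:08:52', tz='UTC')
t2 = t1.tz_convert('Asia/Singapore')
assert t1 == t2  # comparison is by instant, so they are equal

# A column holding both falls back to object dtype, so it never reaches
# the cython datetime64[ns, tz] groupby path in the first place.
mixed = pd.Series([t1, t2])
print(mixed.dtype)  # object
```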
mroeschke commented on Dec 10, 2018
Timestamps with different timezones would exist in an object dtype column, but yeah, I am not sure how the max dispatch with object dtype would handle that scenario since the values would be equivalent. But for more common ops, the timestamps should have the same timezone.
I was making a naive assumption that .values would be called before passing to the groupby cython routines, and that .values would be returning naive timestamps, but in UTC. I was also assuming the tz_convert step needs to happen per column, since the result can potentially have multiple timezones.
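The .values assumption above can be checked directly: on a tz-aware column it returns naive datetime64[ns] values in UTC, which is exactly what the cython routines would see. A quick check, not part of any fix:

```python
import pandas as pd
import numpy as np

s = pd.Series([pd.Timestamp('2018-12-10 17:08:52', tz='Asia/Singapore')])

# .values drops the timezone and hands back datetime64[ns] in UTC.
vals = s.values
print(vals.dtype)  # datetime64[ns]
print(vals[0])     # the 09:08:52 UTC wall time, not 17:08:52 local
```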