
Computing time difference with timezone in Series with one row gives wrong result #12290

Closed
ghost opened this issue Feb 11, 2016 · 6 comments
Labels
Bug Timezones Timezone data dtype

@ghost

ghost commented Feb 11, 2016

When a DataFrame has only one row of timezone-aware timestamps, computing the time difference between columns gives an incorrect result, always '0 days':

In [133]: import pandas as pd

In [134]: df = pd.DataFrame(columns=['s', 'f'])

In [135]: df.s = [pd.Timestamp('2016-02-08 13:43:14.605000', tz='America/Sao_Paulo')]

In [136]: df.f = [pd.Timestamp('2016-02-10 13:43:14.605000', tz='America/Sao_Paulo')]

In [137]: df
Out[137]:
                                 s                                f
0 2016-02-08 13:43:14.605000-02:00 2016-02-10 13:43:14.605000-02:00

In [138]: df.f - df.s    # should give '2 days'
Out[138]:
0   0 days
Name: f, dtype: timedelta64[ns]

In [139]: df.s - df.f    # should give '-2 days'
Out[139]:
0   0 days
Name: s, dtype: timedelta64[ns]

If there is more than one row, the result is correct:

In [140]: import pandas as pd

In [141]: df = pd.DataFrame(columns=['s', 'f'])

In [142]: df.s = [pd.Timestamp('2016-02-08 13:43:14.605000', tz='America/Sao_Paulo'), pd.Timestamp('2016-02-08 13:43:14.605000', tz='America/Sao_Paulo')]

In [143]: df.f = [pd.Timestamp('2016-02-10 13:43:14.605000', tz='America/Sao_Paulo'), pd.Timestamp('2016-02-11 15:43:14.605000', tz='America/Sao_Paulo')]

In [144]: df
Out[144]:
                                 s                                f
0 2016-02-08 13:43:14.605000-02:00 2016-02-10 13:43:14.605000-02:00
1 2016-02-08 13:43:14.605000-02:00 2016-02-11 15:43:14.605000-02:00

In [145]: df.f - df.s
Out[145]:
0   2 days 00:00:00
1   3 days 02:00:00
Name: f, dtype: timedelta64[ns]

A single row of naive (timezone-unaware) timestamps also gives the correct result:

In [147]: import pandas as pd

In [148]: df = pd.DataFrame(columns=['s', 'f'])

In [149]: df.s = [pd.Timestamp('2016-02-08 13:43:14.605000')]

In [150]: df.f = [pd.Timestamp('2016-02-10 13:43:14.605000')]

In [151]: df
Out[151]:
                        s                       f
0 2016-02-08 13:43:14.605 2016-02-10 13:43:14.605

In [152]: df.f - df.s
Out[152]:
0   2 days
dtype: timedelta64[ns]
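
The three cases above can be condensed into one script. This is a minimal sketch, not part of the original report: it builds the Series directly rather than through DataFrame columns, and the helper name `diff` is made up for illustration. On pandas 0.17.1 the first case is the one reported to return '0 days'; on a version with the fix, all three should give the expected timedeltas.

```python
import pandas as pd

TZ = "America/Sao_Paulo"


def diff(start, finish):
    """Subtract two lists of timestamps as Series (hypothetical helper)."""
    return pd.Series(finish) - pd.Series(start)


# Case 1: a single tz-aware row (the failing case in the report).
one_aware = diff(
    [pd.Timestamp("2016-02-08 13:43:14.605000", tz=TZ)],
    [pd.Timestamp("2016-02-10 13:43:14.605000", tz=TZ)],
)

# Case 2: two tz-aware rows (reported as correct).
two_aware = diff(
    [pd.Timestamp("2016-02-08 13:43:14.605000", tz=TZ)] * 2,
    [
        pd.Timestamp("2016-02-10 13:43:14.605000", tz=TZ),
        pd.Timestamp("2016-02-11 15:43:14.605000", tz=TZ),
    ],
)

# Case 3: a single naive row (reported as correct).
one_naive = diff(
    [pd.Timestamp("2016-02-08 13:43:14.605000")],
    [pd.Timestamp("2016-02-10 13:43:14.605000")],
)

print(one_aware)
print(two_aware)
print(one_naive)
```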

Installed versions:

In [162]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-49-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: None
pip: 8.0.2
setuptools: 18.2
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
IPython: 4.0.3
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.4.2
sqlalchemy: None
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None
@jreback
Contributor

jreback commented Feb 11, 2016

hmm that does look like a bug.

@jreback jreback added this to the Next Major Release milestone Feb 11, 2016
@gfyoung
Member

gfyoung commented Feb 11, 2016

Where is "-" implemented for Series? The sub() and subtract() methods work perfectly fine. I suspect it's hidden somewhere here, but if someone with a better understanding could enlighten me, that would be great!
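
For reference, a quick way to compare the different spellings of Series subtraction side by side. This is only a sketch: the `results` dictionary and its labels are illustrative, and on a pandas version with the fix every entry should agree, whereas the report above says the operator form returned '0 days' on 0.17.1.

```python
import operator

import pandas as pd

TZ = "America/Sao_Paulo"
s = pd.Series([pd.Timestamp("2016-02-08 13:43:14.605000", tz=TZ)])
f = pd.Series([pd.Timestamp("2016-02-10 13:43:14.605000", tz=TZ)])

# Python translates `f - s` into f.__sub__(s); .sub()/.subtract() are the
# explicit method forms of the same operation.
results = {
    "f - s": f - s,
    "f.__sub__(s)": f.__sub__(s),
    "operator.sub(f, s)": operator.sub(f, s),
    "f.sub(s)": f.sub(s),
    "f.subtract(s)": f.subtract(s),
}

for label, value in results.items():
    print(label, "->", value.iloc[0])
```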

@jreback
Contributor

jreback commented Feb 11, 2016

@gfyoung that's the right place. you should just step thru.

@gfyoung
Member

gfyoung commented Feb 12, 2016

pandas is trolling me.

I was able to track the bug down to this line here. For some strange reason, the lvalue (and also the rvalue) becomes NaT, which is why the difference becomes zero: the np.int64 view of each value is just np.iinfo(np.int64).min. However, my attempts to replicate that error directly were, strangely, not very successful.
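
To illustrate why two NaT values can subtract to zero, here is a small NumPy-only sketch (the variable names are illustrative). It shows only the integer sentinel behaviour described above, not the pandas code path itself.

```python
import numpy as np

# NaT is stored as the smallest int64 value.
nat = np.array(["NaT"], dtype="datetime64[ns]")
as_int = nat.view("i8")
print(as_int[0] == np.iinfo(np.int64).min)

# Subtracting the raw integer views of two NaT values yields 0, which
# reads as a timedelta of '0 days' if reinterpreted as timedelta64[ns].
raw_diff = as_int - as_int
print(raw_diff.view("timedelta64[ns]"))

# By contrast, subtracting at the datetime64 level keeps the result missing.
proper_diff = nat - nat
print(np.isnat(proper_diff))
```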

What I did was this, using a local installation of the codebase:

  1. Right before this line, I placed a raise Exception(lvalues, rvalues). Afterwards, if you walk through the code presented in the issue, you should be able to confirm that they are both DatetimeIndex objects with the correct times.
  2. I then removed the Exception, and right before this line, I placed a raise Exception(lvalues, rvalues). Afterwards, if you walk through the code presented in the issue, you should be able to confirm that both are converted to NaT.
  3. I then removed the Exception, and right before this line, I assigned the attributes old_lvalues and old_rvalues to the _TimeOp instance.
  4. I then ran this code:
import operator
import pandas as pd
from pandas.core.ops import _TimeOp  # private class in pandas 0.17.x

df = pd.DataFrame(columns=['s', 'f'])

df.s = [pd.Timestamp('2016-02-08 13:43:14.605000', tz='America/Sao_Paulo')]
df.f = [pd.Timestamp('2016-02-10 13:43:14.605000', tz='America/Sao_Paulo')]

t = _TimeOp.maybe_convert_for_time_op(df.s, df.f, '__sub__', operator.sub)

print t.lvalue, t.rvalue
print t.old_lvalues.tz_localize(None), t.old_rvalues.tz_localize(None)

You should be able to see my confusion.
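
For comparison, the conversion can also be tried in isolation through the public API. This is a sketch assuming the internal step amounts to converting to UTC and then dropping the timezone, as the old_lvalues.tz_localize(None) call above suggests; it does not call the private _TimeOp code. Done this way, the values come out as valid UTC wall times, not NaT.

```python
import pandas as pd

idx = pd.DatetimeIndex(
    [pd.Timestamp("2016-02-08 13:43:14.605000", tz="America/Sao_Paulo")]
)

# Convert to UTC, then strip the timezone to get naive UTC wall times.
utc_naive = idx.tz_convert("UTC").tz_localize(None)

print(utc_naive)
print(utc_naive.isna().any())
```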

@jreback
Contributor

jreback commented Feb 12, 2016

that IS odd.

@gfyoung
Member

gfyoung commented Feb 12, 2016

@jreback @polart: Also, the title should be changed, as this is really an issue with Series arithmetic operations and actually has nothing to do with DataFrame.

@jreback jreback modified the milestones: 0.18.0, Next Major Release Feb 12, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Feb 12, 2016
Fixes unusual bug with timezone Series
subtraction in which single element Series
objects containing tz-aware objects would
return a timedelta of zero, even though it
visually could not be the case.

The bug was traced to the conversion of
the contained timezones to UTC, in which
the method call was somehow returning NaT,
even though attempts to replicate that
behaviour were unsuccessful. This new
method call fixes the issue and is in some
ways more intuitive given the comment above
the conversions.

Closes gh-12290.
@ghost ghost changed the title Computing time difference with timezone in DataFrame with one row gives wrong result Computing time difference with timezone in Series with one row gives wrong result Feb 12, 2016