-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: astype with timedelta and datetime string (#22100) #22107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @fjdiod! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on July 31, 2018 at 10:45 Hours UTC |
pandas/tests/dtypes/test_cast.py
Outdated
def test_astype_nansafe(arr, dtype): | ||
# GH #22100 | ||
result = astype_nansafe(arr, dtype) | ||
is_dtype_equal(result.dtype, dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be an assertion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Piggybacking on this I think it would be better to explicitly parametrize the expected value and make assertions around that.
On a somewhat related note how do we want to handle datetime64[ns, tz]
? Off the top of my head I can't think of how to do a conversion to that via astype
so maybe should still raise for that?
Why do we need this? pandas/pandas/core/dtypes/cast.py Lines 688 to 699 in 0b7a08b
With that we have: >>> s = pd.Series([None])
>>> s.astype('m8')
0 NaT
dtype: timedelta64[ns]
>>> s.astype('m8').astype('m8')
0 NaN
dtype: float64 which looks strange. >>> s = pd.Series([None, 1, 2])
>>> s.astype('timedelta64[ns]')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sergey/projects/pandas/pandas/util/_decorators.py", line 177, in wrapper
return func(*args, **kwargs)
File "/home/sergey/projects/pandas/pandas/core/generic.py", line 5182, in astype
**kwargs)
File "/home/sergey/projects/pandas/pandas/core/internals/managers.py", line 554, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/home/sergey/projects/pandas/pandas/core/internals/managers.py", line 421, in apply
applied = getattr(b, f)(**kwargs)
File "/home/sergey/projects/pandas/pandas/core/internals/blocks.py", line 564, in astype
**kwargs)
File "/home/sergey/projects/pandas/pandas/core/internals/blocks.py", line 655, in _astype
values = astype_nansafe(values.ravel(), dtype, copy=True)
File "/home/sergey/projects/pandas/pandas/core/dtypes/cast.py", line 709, in astype_nansafe
raise ValueError('Cannot convert non-finite values (NA or inf) to '
ValueError: Cannot convert non-finite values (NA or inf) to integer Is it expected? |
Codecov Report
@@ Coverage Diff @@
## master #22107 +/- ##
==========================================
- Coverage 92.07% 92.06% -0.02%
==========================================
Files 170 170
Lines 50688 50706 +18
==========================================
+ Hits 46671 46682 +11
- Misses 4017 4024 +7
Continue to review full report at Codecov.
|
Hello, @jbrockmendel, @WillAyd |
pandas/core/dtypes/cast.py
Outdated
@@ -712,7 +712,8 @@ def astype_nansafe(arr, dtype, copy=True): | |||
elif is_object_dtype(arr): | |||
|
|||
# work around NumPy brokenness, #1987 | |||
if np.issubdtype(dtype.type, np.integer): | |||
is_time = is_datetime64_dtype(dtype) or dtype == _TD_DTYPE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_TD_DTYPE is specifically timedelta64[ns]
. Are there paths that get here with e.g. timedelta64[D]
? (I'm assuming those should be excluded too.
IIRC datetimete64 dtype won't trigger np.issubtype(dtype.type, np.integer)
, am I remembering incorrectly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, seems I was wrong about datetime64, they worked fine before.
pandas/tests/dtypes/test_cast.py
Outdated
(np.array(['2000'], dtype='object'), | ||
'datetime64', 'datetime64[ns]'), | ||
]) | ||
def test_astype_nansafe(arr, dtype, expected): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are datetime64[ns, TZ]
dtypes relevant here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI whatever the solution is here would impact the other limitation with orient='table'
around timezone data:
pandas/pandas/io/json/table_schema.py
Line 304 in d010469
# Cannot directly use as_type with timezone data on object; raise for now |
I'd be fine with just raising as part of this PR until its explicitly resolved but just bringing that up for visibility
pandas/tests/dtypes/test_cast.py
Outdated
(np.array(['2000'], dtype='object'), | ||
'datetime64[ns]', 'datetime64[ns]'), | ||
(np.array(['2000'], dtype='object'), | ||
'datetime64', 'datetime64[ns]'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be made less verbose (easier to read if on one line, but not a huge deal) by changing 'object'
--> 'O'
, 'timedelta64[ns]'
--> 'm8[ns]'
, and 'datetime64[ns]''
--> M8[ns]
@@ -712,7 +712,8 @@ def astype_nansafe(arr, dtype, copy=True): | |||
elif is_object_dtype(arr): | |||
|
|||
# work around NumPy brokenness, #1987 | |||
if np.issubdtype(dtype.type, np.integer): | |||
is_time = is_timedelta64_dtype(dtype) | |||
if np.issubdtype(dtype.type, np.integer) and is_time: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use is_integer_dtype and also add a comment here (and can remove the existing work around comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but it breaks:
pandas/pandas/tests/reshape/merge/test_merge.py
Lines 531 to 538 in f76a3f3
def test_other_timedelta_unit(self, unit): | |
# GH 13389 | |
df1 = pd.DataFrame({'entity_id': [101, 102]}) | |
s = pd.Series([None, None], index=[101, 102], name='days') | |
dtype = "m8[{}]".format(unit) | |
df2 = s.astype(dtype).to_frame('days') | |
assert df2['days'].dtype == 'm8[ns]' |
Because of:
pandas/pandas/core/dtypes/cast.py
Lines 688 to 697 in f76a3f3
if ((PY3 and dtype not in [_INT64_DTYPE, _TD_DTYPE]) or | |
(not PY3 and dtype != _TD_DTYPE)): | |
# allow frequency conversions | |
# we return a float here! | |
if dtype.kind == 'm': | |
mask = isna(arr) | |
result = arr.astype(dtype).astype(np.float64) | |
result[mask] = np.nan | |
return result |
Are this necessary? Here timedelta64 are turned to float64.
Should I change the test or this part of
astype_nansafe
?Also, I've notices that conversions like
pd.Series([None, 1, 2]).astype('timedelta64')
do not work now. Should I open a new issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be missing something but shouldn't it not be going down this branch? i.e. isn't the statement dtype not in [_INT64_DTYPE, _TD_DTYPE]
supposed to be false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is IntNADType handled correctly here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd yah at this point I’m not sure if it’s tangential, but it looks like the TDDTYPE check should go above the (PY3 ...) check in line 688, after which that check is always-true.
closing as stale. if you'd like to continue pls ping. |
git diff upstream/master -u -- "*.py" | flake8 --diff
Not sure about test. Should I test
DataFrame
methodastype
instead? If so, where to put test?