BUG GH22858 When creating empty dataframe, only cast int to float if index given #22963

Merged: 5 commits, Oct 4, 2018

Conversation

JustinZhengBC
Contributor

@JustinZhengBC JustinZhengBC commented Oct 3, 2018

Previously, creating a DataFrame with no data and an integer dtype changed the dtype to float. The cast is necessary when a predefined number of rows is supplied through the index parameter, so that those rows can be filled with NaN. When no index is passed, however, the cast is unexpected. This PR changes the constructor so that the dtype is altered only when necessary.
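
For illustration, the behavior described above can be checked with a short script. This is a minimal sketch, assuming a pandas version that includes this change (0.24 or later); the variable names are made up for the example.

```python
import pandas as pd

# No index: there are no rows to fill with NaN, so the requested
# integer dtype should be kept as-is.
empty = pd.DataFrame(columns=["a", "b"], dtype="int64")
print(empty.dtypes)

# Index given: the three rows have no data and must hold NaN, which
# int64 cannot represent, so the columns are upcast to float64.
with_index = pd.DataFrame(index=range(3), columns=["a", "b"], dtype="int64")
print(with_index.dtypes)
```

The second frame is the case where the cast is still needed: every cell is NaN, so a float dtype is the only way to represent the rows.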

@pep8speaks

Hello @JustinZhengBC! Thanks for submitting the PR.

@WillAyd
Member

WillAyd commented Oct 3, 2018

Looks pretty good. Can you add a whatsnew note for v0.24?

@WillAyd added the "Dtype Conversions" and "DataFrame" labels Oct 3, 2018
@WillAyd added this to the 0.24.0 milestone Oct 3, 2018
@JustinZhengBC
Contributor Author

I updated the whatsnew for v0.24. Also, after reverting to master, I found that while passing "int64" as the dtype caused the bug, merely passing int did not. I have updated the test to reflect this.

Member

@WillAyd WillAyd left a comment

FWIW the tests in this module could be improved with parametrization. On the fence but I'm generally OK with not requiring that for this PR (others may have a differing opinion), but would be a good follow up if you are interested.

@@ -806,6 +806,9 @@ def test_constructor_corner(self):
df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=object)
assert df.values.dtype == np.object_

df = DataFrame(columns=['a', 'b'], dtype="int64")
Member

Can you add the GH issue number as a comment to reference?

@@ -194,6 +194,7 @@ Other Enhancements
- :meth:`Index.to_frame` now supports overriding column name(s) (:issue:`22580`).
- New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
- Compatibility with Matplotlib 3.0 (:issue:`22790`).
- A newly constructed empty :class:`DataFrame` of integers will now only be cast to ``float64`` if an index is specified (:issue:`22858`)
Member

Hmm I think the wording DataFrame of integers is somewhat misleading as there aren't actually any integers here; the type is just passed explicitly during construction and not subsequently coerced.

@codecov

codecov bot commented Oct 3, 2018

Codecov Report

Merging #22963 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22963      +/-   ##
==========================================
+ Coverage   92.18%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50833    50823      -10     
==========================================
- Hits        46862    46853       -9     
+ Misses       3971     3970       -1
Flag         Coverage Δ
#multiple    90.6% <100%> (ø) ⬆️
#single      42.36% <100%> (ø) ⬆️

Impacted Files                         Coverage Δ
pandas/core/dtypes/cast.py             88.58% <100%> (ø) ⬆️
pandas/core/internals/blocks.py        93.48% <0%> (-0.37%) ⬇️
pandas/core/frame.py                   97.2% <0%> (-0.01%) ⬇️
pandas/core/indexes/datetimelike.py    98.11% <0%> (ø) ⬆️
pandas/core/resample.py                96.98% <0%> (+0.01%) ⬆️
pandas/core/ops.py                     97.43% <0%> (+0.01%) ⬆️
pandas/core/nanops.py                  95.2% <0%> (+0.05%) ⬆️
pandas/core/indexes/interval.py        94.32% <0%> (+0.16%) ⬆️
pandas/util/testing.py                 86.03% <0%> (+0.2%) ⬆️
pandas/core/arrays/interval.py         92.81% <0%> (+0.25%) ⬆️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0f9a10...e7efa9d. Read the comment docs.

Contributor

@jreback jreback left a comment

minor comments lgtm
ping on green

@@ -194,6 +194,7 @@ Other Enhancements
- :meth:`Index.to_frame` now supports overriding column name(s) (:issue:`22580`).
- New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
- Compatibility with Matplotlib 3.0 (:issue:`22790`).
- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
Contributor

move to api breaking changes

- if is_integer_dtype(dtype) and isna(value):
+ # GH 22858: only cast to float if an index
+ # (passed here as length) is specified
+ if is_integer_dtype(dtype) and isna(value) and length:
Contributor

put the length check first
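
As a rough illustration of the suggested ordering, the condition can be written with the length check leading, so that Python's short-circuit evaluation skips the dtype and NaN checks whenever there are no rows to fill. This is only a sketch: `needs_float_upcast` is a hypothetical helper invented for this example, not a pandas function, and it uses the public `pandas.api.types.is_integer_dtype` and `pd.isna` rather than the internal imports in `cast.py`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype


def needs_float_upcast(dtype, value, length):
    """Hypothetical helper: decide whether an integer dtype must become float.

    The cheap truthiness test on ``length`` comes first, so the other two
    checks are only evaluated when there is at least one row to fill.
    """
    return bool(length and is_integer_dtype(dtype) and pd.isna(value))


print(needs_float_upcast("int64", np.nan, 10))  # rows to fill with NaN
print(needs_float_upcast("int64", np.nan, 0))   # no rows, dtype can stay integer
```

The result is the same whichever order the three conditions are written in; only the amount of work done for the empty case changes.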

dtype=int)
assert df.values.dtype == np.dtype('float64')
@pytest.mark.parametrize("data, index, columns, dtype, ans", [
(None, lrange(10), ['a', 'b'], object, np.object_),
Contributor

can u rename ans -> expected
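
With that rename applied, a parametrized test along these lines would cover the cases discussed in this thread. It is a sketch rather than the test that was merged: the function name is made up, the three parameter rows are assumptions drawn from the snippets quoted above, and `list(range(10))` stands in for the `lrange` compatibility helper used in the pandas test suite at the time.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize("data, index, columns, dtype, expected", [
    # Index given, object dtype requested: object is kept.
    (None, list(range(10)), ["a", "b"], object, np.object_),
    # GH 22858: no index, so the integer dtype is not cast to float.
    (None, None, ["a", "b"], "int64", np.dtype("int64")),
    # Index given: rows are filled with NaN, so int becomes float64.
    (None, list(range(10)), ["a", "b"], int, np.dtype("float64")),
])
def test_empty_constructor_dtype(data, index, columns, dtype, expected):
    # Hypothetical test name; builds the frame and checks the resulting dtype.
    df = pd.DataFrame(data, index=index, columns=columns, dtype=dtype)
    assert df.values.dtype == expected
```

Each tuple becomes its own test case under pytest, so a failure report names the exact combination of index and dtype that broke.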

Contributor

@jreback jreback left a comment

doc comment, otherwise lgtm ping on green


.. _whatsnew_0240.api_breaking:
- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
Contributor

put in the section itself, just below this (it’s a backward incompatible change)

@jreback
Contributor

jreback commented Oct 4, 2018

@WillAyd merge when satisfied

@WillAyd WillAyd merged commit a6c1ff1 into pandas-dev:master Oct 4, 2018
@WillAyd
Member

WillAyd commented Oct 4, 2018

Very nice change - thanks @JustinZhengBC !

@JustinZhengBC JustinZhengBC deleted the BUG-22858 branch October 5, 2018 04:34
Labels: DataFrame, Dtype Conversions
Development

Successfully merging this pull request may close these issues.

BUG/ENH: Bad columns dtype when creating empty DataFrame
4 participants