-
Notifications
You must be signed in to change notification settings - Fork 666
FIX-#3498 Fix groupby to use kwargs for getting the groups #3523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahallermed Great PR! The changes look good to me.
commit lint is failing due to a missing colon (:
) between the issue number and commit description. The first line needs to look like this:
FIX-#3498: Fix groupby to use kwargs for getting the groups
Codecov Report
@@ Coverage Diff @@
## master #3523 +/- ##
==========================================
+ Coverage 85.45% 89.19% +3.73%
==========================================
Files 197 197
Lines 16437 17434 +997
==========================================
+ Hits 14047 15551 +1504
+ Misses 2390 1883 -507
Continue to review full report at Codecov.
|
Co-authored-by: Devin Petersohn <[email protected]> Signed-off-by: Andreas Haller <[email protected]>
cb36c50
to
5dcf83c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good!
@ahallermed please note some tests are failing, and those do not seem like some false positives. Could you please look into those?
Co-authored-by: Vasily Litvinov <[email protected]>
I think @dchigarev is looking into this PR, as it apparently is a little more complicated than that. |
Currently, tests are failing because of the lines that pass kwargs to the There is a bug in pandas (pandas-dev/pandas#35202) that doesn't allow us to take >>> df = pandas.DataFrame({"a": [1, 2, 2, np.nan]})
>>> grp = df.groupby("a", dropna=False)
>>> grp.groups
ValueError: Categorical categories cannot be null
>>> grp.indices # `.indices` works fine though
{1.0: array([0], dtype=int64), 2.0: array([1, 2], dtype=int64), nan: array([3], dtype=int64)} OFC we could use an I also found that the groups comparator works a bit incorrect and doesn't properly verify that the bug (#3498) is actually fixed. Filled an issue as well (#3756). It also must be fixed before this PR. |
Thank you @vnlitvinov and @dchigarev for the deep dive. I wasn't able to have a look at your discussion for a while. Sorry about that. |
Yes, and I'm going to push the solution to the PR's branch in a short time |
Signed-off-by: Dmitry Chigarev <[email protected]>
Signed-off-by: Dmitry Chigarev <[email protected]>
assert g1[0] == g2[0] | ||
value_equals(g1[0], g2[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed so nan
values be compared correctly
PR is ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation looks good to me, thanks @dchigarev for the updates. Should we have a default to pandas warning here?
Signed-off-by: Dmitry Chigarev <[email protected]>
Yes, we should. Added a warning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…ps (modin-project#3523) Co-authored-by: Devin Petersohn <[email protected]> Co-authored-by: Vasily Litvinov <[email protected]> Co-authored-by: Dmitry Chigarev <[email protected]> Signed-off-by: Andreas Haller <[email protected]> Signed-off-by: Doris Lee <[email protected]>
Signed-off-by: Andreas Haller [email protected]
What do these changes do?
Adds
dropna=False
support for getting the group indices (GroupBy.indices/.groups
) by forwarding this parameter to the pandas groupby object that actually computes the thing.Note that this PR doesn't ensure full support of
dropna=False
. See the epic (#3828) for the rest of the tasks.flake8 modin
black --check modin
git commit -s
GroupBy.groups/.indices
ignores dropna=False as kwarg #3498docs/developer/architecture.rst
is up-to-date -> Not needed