Skip to content

Support median in Groupby.aggregate #766

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jan 29, 2024
Merged

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Jan 18, 2024

@hendrikmakait hendrikmakait marked this pull request as ready for review January 19, 2024 09:32
@phofl
Copy link
Collaborator

phofl commented Jan 19, 2024

the dask/dask pr is merged

@hendrikmakait hendrikmakait marked this pull request as ready for review January 19, 2024 14:07
Comment on lines -473 to +480
if not isinstance(self.split_out, bool) and self.split_out == 1 or sort:
if not self.should_shuffle:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the change. I think we can remove sort = getattr(self, "sort", False) a few lines above this.

Comment on lines +450 to +455
@property
def should_shuffle(self):
sort = getattr(self, "sort", False)
return not (
not isinstance(self.split_out, bool) and self.split_out == 1 or sort
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this logic is just copied, but is it correct? We want to shuffle if split_out > 1 or if sort is True (we can't use a tree reduction if sort=True). Maybe this should be something like:

    @property
    def should_shuffle(self):
        return int(self.split_out) > 1 or getattr(self, "sort", False)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn`t cover the bool case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry my suggestion was off, but I was pointing out the sort check is wrong (I think).

Copy link
Member Author

@hendrikmakait hendrikmakait Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember operator precedence correctly,

not (not isinstance(self.split_out, bool) and self.split_out == 1 or sort)

is the same as

not (((not isinstance(self.split_out, bool)) and (self.split_out == 1)) or sort)

i.e., if sort is True, the expression evaluates to True False. Yeah, that seems problematic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XREF #817

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use a tree reduction? The modus is to reduce to one partition and then sort, so we can not shuffle if sort=True, we have to reduce to one partition

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for leaving a "this is a bug" comment, and then getting pulled away.

I think some of my confusion comes from that fact that dask/dask allows the tree reduction path to be used with split_out > 1, but that algorithm will produce incorrect results if sort=True.

In dask-expr, we do not allow the tree reduction to be used with split_out > 1, so the sort check seems unnecessary/confusing. If split_out == , then we don't really care about sort.

@crusaderky crusaderky closed this Jan 25, 2024
@crusaderky crusaderky reopened this Jan 25, 2024
@crusaderky
Copy link
Collaborator

Empty commits with [skip-caching] don't seem to be doing anything. See #804

@hendrikmakait
Copy link
Member Author

CI failure is #806

@crusaderky
Copy link
Collaborator

You won't reach the dask/dask tests until #806 is merged.
#808 works around the issue.

@crusaderky
Copy link
Collaborator

@hendrikmakait I'm confused.

These are still failing in the dask-expr CI:

FAILED ../../../micromamba/envs/dask-expr/lib/python3.9/site-packages/dask/dataframe/tests/test_groupby.py::test_groupby_aggregate_partial_function_unexpected_args[disk-<lambda>0] - AssertionError: Regex pattern did not match.
 Regex: "doesn't support positional arguments|'Series' object cannot be interpreted as an integer"
 Input: "cannot convert the series to <class 'int'>"
FAILED ../../../micromamba/envs/dask-expr/lib/python3.9/site-packages/dask/dataframe/tests/test_groupby.py::test_groupby_aggregate_partial_function_unexpected_args[disk-<lambda>1] - AssertionError: Regex pattern did not match.
 Regex: "doesn't support positional arguments|'Series' object cannot be interpreted as an integer"
 Input: "cannot convert the series to <class 'int'>"
FAILED ../../../micromamba/envs/dask-expr/lib/python3.9/site-packages/dask/dataframe/tests/test_groupby.py::test_groupby_aggregate_partial_function_unexpected_args[tasks-<lambda>0] - AssertionError: Regex pattern did not match.
 Regex: "doesn't support positional arguments|'Series' object cannot be interpreted as an integer"
 Input: "cannot convert the series to <class 'int'>"
FAILED ../../../micromamba/envs/dask-expr/lib/python3.9/site-packages/dask/dataframe/tests/test_groupby.py::test_groupby_aggregate_partial_function_unexpected_args[tasks-<lambda>1] - AssertionError: Regex pattern did not match.
 Regex: "doesn't support positional arguments|'Series' object cannot be interpreted as an integer"
 Input: "cannot convert the series to <class 'int'>"

However, if I run manually in dask/dask, I get everything green. Same pandas version...

@phofl phofl closed this Jan 29, 2024
@phofl phofl reopened this Jan 29, 2024
@phofl phofl merged commit 71e5237 into dask:main Jan 29, 2024
@phofl
Copy link
Collaborator

phofl commented Jan 29, 2024

thx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add median support to GroupBy.aggregate
4 participants