Support median in Groupby.aggregate #766
The dask/dask PR is merged.
if not isinstance(self.split_out, bool) and self.split_out == 1 or sort:
if not self.should_shuffle:
I like the change. I think we can remove sort = getattr(self, "sort", False) a few lines above this.
@property
def should_shuffle(self):
    sort = getattr(self, "sort", False)
    return not (
        not isinstance(self.split_out, bool) and self.split_out == 1 or sort
    )
I know this logic is just copied, but is it correct? We want to shuffle if split_out > 1 or if sort is True (we can't use a tree reduction if sort=True). Maybe this should be something like:
@property
def should_shuffle(self):
    return int(self.split_out) > 1 or getattr(self, "sort", False)
This doesn't cover the bool case.
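To make the bool objection concrete, here is a minimal sketch comparing the two conditions as standalone functions. The function names `suggested` and `original` are hypothetical, and the sketch assumes (as the isinstance check in the PR implies) that split_out=True is a valid value distinct from split_out=1:

```python
def suggested(split_out, sort=False):
    # The proposed replacement from the review comment, as a plain function.
    return int(split_out) > 1 or sort


def original(split_out, sort=False):
    # The condition as written in the PR, as a plain function.
    return not (not isinstance(split_out, bool) and split_out == 1 or sort)


# In Python, bool is a subclass of int, so True compares equal to 1
# and int(True) is 1. Without an isinstance check the two are conflated.
print(True == 1, int(True))

# With split_out=True the two conditions disagree:
print("suggested:", suggested(True))  # treats True like split_out=1
print("original: ", original(True))   # treats True as its own case
```

Because int(True) is 1, the suggested version cannot tell split_out=True apart from split_out=1, which is presumably why the existing code checks isinstance(self.split_out, bool) first.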
Sorry my suggestion was off, but I was pointing out the sort check is wrong (I think).
If I remember operator precedence correctly,
not (not isinstance(self.split_out, bool) and self.split_out == 1 or sort)
is the same as
not (((not isinstance(self.split_out, bool)) and (self.split_out == 1)) or sort)
i.e., if sort is True, the expression evaluates to False. Yeah, that seems problematic.
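The precedence reading above can be checked with a short sketch. The function names `original` and `parenthesized` are hypothetical; each wraps one of the two expressions quoted in this comment, with `self.split_out` and `sort` passed in as plain arguments:

```python
from itertools import product


def original(split_out, sort):
    # Expression as written in the PR (no explicit grouping).
    return not (not isinstance(split_out, bool) and split_out == 1 or sort)


def parenthesized(split_out, sort):
    # Same expression with the grouping implied by Python's precedence:
    # `not` binds tighter than `and`, which binds tighter than `or`.
    return not (((not isinstance(split_out, bool)) and (split_out == 1)) or sort)


for split_out, sort in product([1, 2, True, False], [True, False]):
    result = original(split_out, sort)
    # Both spellings agree for every combination tried here.
    assert result == parenthesized(split_out, sort)
    print(f"split_out={split_out!r:5} sort={sort!r:5} -> {result}")
```

Every row with sort=True comes out False, regardless of split_out, because the trailing `or sort` makes the inner expression true before the outer `not` is applied.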
XREF #817
Why can't we use a tree reduction? The approach is to reduce to one partition and then sort, so we cannot shuffle if sort=True; we have to reduce to one partition.
I'm sorry for leaving a "this is a bug" comment, and then getting pulled away.
I think some of my confusion comes from the fact that dask/dask allows the tree reduction path to be used with split_out > 1, but that algorithm will produce incorrect results if sort=True.
In dask-expr, we do not allow the tree reduction to be used with split_out > 1, so the sort check seems unnecessary/confusing. If split_out == , then we don't really care about sort.
Empty commits with
CI failure is #806
d27c774 to 695c824
@hendrikmakait I'm confused. These are still failing in the dask-expr CI:
However, if I run manually in dask/dask, I get everything green. Same pandas version...
thx
Fix _agg_finalize for dask-expr dask#10835
Adjust tests for median support in groupby-aggregate in dask-expr dask#10840