-
Notifications
You must be signed in to change notification settings - Fork 3.7k
ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels #10476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
34c3f48
to
5b24ca3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ parts look good to me.
Now that the kernel has options, should we add a Python convenience function in pyarrow/compute.py
?
(Oh sorry, I missed that this is a draft.) |
I'm thinking to if boolean checks would be better instead of counts?
Yeah definitely, next commit :) |
I debated whether to comment on this. I would slightly prefer the boolean check over the count but the intent is clear either way. |
8b4d1a2
to
4a864b3
Compare
@lidavidm added the python test and switched from |
Hey @thisisnic, @jonkeane, could you take a look at the R part of this PR? It would be great to have an extra check :). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm not quite sure what to make of the R tests so I'll let someone else comment on those.
@@ -494,30 +494,40 @@ def test_min_max(): | |||
|
|||
def test_any(): | |||
# ARROW-1846 | |||
|
|||
options = pc.ScalarAggregateOptions(skip_nulls=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I had thought that generally, for kernels with options, we add a wrapper in pyarrow/compute.py. But now that I look, this isn't true for many of the kernels. So having Python tests is enough.
data <- c(1:10, NA, NA) | ||
|
||
expect_vector_equal(any(input > 5), data) | ||
expect_vector_equal(any(input > 5, na.rm = TRUE), data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test fail without the , na.rm = TRUE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does indeed:
Failure (test-compute-aggregate.R:388:3): any.Array and any.ChunkedArray
as.vector(x) not equal to `y`.
'is.NA' value mismatch: 0 in current 1 in target
Backtrace:
1. expect_vector_equal(any(input > 5), data) test-compute-aggregate.R:388:2
2. expect_as_vector(via_array, expected, ignore_attr, ...) helper-expectation.R:174:4
3. testthat:::expect_fun(as.vector(x), y, ...) helper-expectation.R:24:2
Failure (test-compute-aggregate.R:388:3): any.Array and any.ChunkedArray
as.vector(x) not equal to `y`.
'is.NA' value mismatch: 0 in current 1 in target
Backtrace:
1. expect_vector_equal(any(input > 5), data) test-compute-aggregate.R:388:2
2. expect_as_vector(via_chunked, expected, ignore_attr, ...) helper-expectation.R:187:4
3. testthat:::expect_fun(as.vector(x), y, ...) helper-expectation.R:24:2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to read what happens in expect_vector_equal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect_vector_equal
takes an expression and a vector and evaluates the expression in base R, then converts the vector converts it into an array and evaluates the expression again (which typically will call arrow compute functions), and finally converts the vector to a chunked array and does the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the errors you're seeing indicate that we've got a mismatch in behavior. I see that Ian has commented below about more robust tests + how to match the behavior (but I'm happy to add to it if that's unclear!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed @jonkeane it was a behavior issue. Should be fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions
f8c7a4e
to
6976376
Compare
@lidavidm - I've introduced some small changes to c++ to enable desired behaviour so perhaps re-review is in order. Changes are all in this commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the and
/or
kernels (the non-reducing variants of this), we have a separate "kleene" version of those kernels. So I am wondering: do we need that here as well?
Apparently a few months ago I though that this was needed, given ARROW-10291 / #8294 (comment), but to be honest I don't fully understand my own comment there any more .. ;)
It seems that the current PR implements the "kleene" version, and I am not fully sure how a non-kleene version would behave otherwise and if this would actually ever be useful.
In any case we should probably describe the behaviour somewhere more explicitly.
Do we want any/all to follow the min_count
parameter of the ScalarAggregateOptions? If not, we should probably document that it is being ignored.
Status Finalize(KernelContext*, Datum* out) override { | ||
out->value = std::make_shared<BooleanScalar>(this->any); | ||
Status Finalize(KernelContext* ctx, Datum* out) override { | ||
if (!options.skip_nulls && !this->any && this->has_nulls) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for the && !this->any
part? That seems to suggest "kleene logic"? But for the non-kleene version, I expect any null in the input to always give null for the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed this is to to make the PR "kleen" to match R behavior. Meanwhile Pandas' any is non-kleen.
>>> import pandas as pd
>>> pd.Series([None, None]).any(skipna=True)
False
>>> pd.Series([None, None]).any(skipna=False)
>>>
We have three options IMO:
- Revert to non-kleene c++ behaviour and add a small fix to R
- Add kleen any/all kernels and route in R depending on flags
- Keep c++ as is and add a fix to Python (that could introduce a lot of new unvanted logic)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meanwhile Pandas' any is non-kleen.
The pandas any
/all
methods were broken for object dtype (and since numpy doesn't support nulls in its boolean dtype, whenever you have missing values, you have object dtype), so best not to use that as a reference (see eg pandas-dev/pandas#27709)
For the new nullable boolean dtype in pandas, the any/all methods also use kleene logic like in R.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So how about then just making this kernel "kleene" and just document that fact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall I:
- Document the behavior as kleen?
- Rather revert the change and first implement
al_kleen/any_kleen
(ARROW-10291) and then map R to those forskip_nulls==False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure there is a good use case for a non-kleene version, so I am fine with just documenting for now that the behaviour follows Kleene logic (so is the reducing version of and/or_kleene)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I documented the new behavior in compute.rst and api_aggregate.h.
1c1b90b
to
97092b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also update the docstrings which are defined in aggregate_basic.cc ?
Can you take a look at my comment from above, which I think is not yet addressed:
|
Oh sorry, I missed that one. Added in last commit. |
Ping :) |
9ce08ca
to
7886d90
Compare
6acdae0
to
a32441c
Compare
@jorisvandenbossche @jonkeane @ianmcook Does this need another round of reviews? |
19cb1fb
to
20a5399
Compare
Co-authored-by: Ian Cook <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Thanks all! |
This is to resolve ARROW-12499.