Skip to content

ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels #10476

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

rok
Copy link
Member

@rok rok commented Jun 8, 2021

This is to resolve ARROW-12499.

@github-actions
Copy link

github-actions bot commented Jun 8, 2021

@rok rok force-pushed the ARROW-12499 branch 3 times, most recently from 34c3f48 to 5b24ca3 Compare June 8, 2021 13:45
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ parts look good to me.

Now that the kernel has options, should we add a Python convenience function in pyarrow/compute.py?

@lidavidm
Copy link
Member

lidavidm commented Jun 8, 2021

(Oh sorry, I missed that this is a draft.)

@rok
Copy link
Member Author

rok commented Jun 8, 2021

C++ parts look good to me.

I'm thinking to if boolean checks would be better instead of counts?

Now that the kernel has options, should we add a Python convenience function in pyarrow/compute.py?

Yeah definitely, next commit :)

@lidavidm
Copy link
Member

lidavidm commented Jun 8, 2021

C++ parts look good to me.

I'm thinking to if boolean checks would be better instead of counts?

I debated whether to comment on this. I would slightly prefer the boolean check over the count but the intent is clear either way.

@rok rok force-pushed the ARROW-12499 branch 2 times, most recently from 8b4d1a2 to 4a864b3 Compare June 8, 2021 15:56
@rok
Copy link
Member Author

rok commented Jun 8, 2021

@lidavidm added the python test and switched from null count to has_nulls boolean. Also added a Python test so this should be good to review now.

@rok rok marked this pull request as ready for review June 8, 2021 16:38
@rok
Copy link
Member Author

rok commented Jun 8, 2021

Hey @thisisnic, @jonkeane, could you take a look at the R part of this PR? It would be great to have an extra check :).

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm not quite sure what to make of the R tests so I'll let someone else comment on those.

@@ -494,30 +494,40 @@ def test_min_max():

def test_any():
# ARROW-1846

options = pc.ScalarAggregateOptions(skip_nulls=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I had thought that generally, for kernels with options, we add a wrapper in pyarrow/compute.py. But now that I look, this isn't true for many of the kernels. So having Python tests is enough.

data <- c(1:10, NA, NA)

expect_vector_equal(any(input > 5), data)
expect_vector_equal(any(input > 5, na.rm = TRUE), data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test fail without the , na.rm = TRUE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does indeed:

Failure (test-compute-aggregate.R:388:3): any.Array and any.ChunkedArray
as.vector(x) not equal to `y`.
'is.NA' value mismatch: 0 in current 1 in target
Backtrace:
 1. expect_vector_equal(any(input > 5), data) test-compute-aggregate.R:388:2
 2. expect_as_vector(via_array, expected, ignore_attr, ...) helper-expectation.R:174:4
 3. testthat:::expect_fun(as.vector(x), y, ...) helper-expectation.R:24:2

Failure (test-compute-aggregate.R:388:3): any.Array and any.ChunkedArray
as.vector(x) not equal to `y`.
'is.NA' value mismatch: 0 in current 1 in target
Backtrace:
 1. expect_vector_equal(any(input > 5), data) test-compute-aggregate.R:388:2
 2. expect_as_vector(via_chunked, expected, ignore_attr, ...) helper-expectation.R:187:4
 3. testthat:::expect_fun(as.vector(x), y, ...) helper-expectation.R:24:2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to read what happens in expect_vector_equal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expect_vector_equal takes an expression and a vector and evaluates the expression in base R, then converts the vector converts it into an array and evaluates the expression again (which typically will call arrow compute functions), and finally converts the vector to a chunked array and does the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the errors you're seeing indicate that we've got a mismatch in behavior. I see that Ian has commented below about more robust tests + how to match the behavior (but I'm happy to add to it if that's unclear!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed @jonkeane it was a behavior issue. Should be fixed now.

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions

@rok rok force-pushed the ARROW-12499 branch 3 times, most recently from f8c7a4e to 6976376 Compare June 9, 2021 00:56
@rok
Copy link
Member Author

rok commented Jun 9, 2021

@lidavidm - I've introduced some small changes to c++ to enable desired behaviour so perhaps re-review is in order. Changes are all in this commit.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the and/or kernels (the non-reducing variants of this), we have a separate "kleene" version of those kernels. So I am wondering: do we need that here as well?

Apparently a few months ago I though that this was needed, given ARROW-10291 / #8294 (comment), but to be honest I don't fully understand my own comment there any more .. ;)

It seems that the current PR implements the "kleene" version, and I am not fully sure how a non-kleene version would behave otherwise and if this would actually ever be useful.

In any case we should probably describe the behaviour somewhere more explicitly.


Do we want any/all to follow the min_count parameter of the ScalarAggregateOptions? If not, we should probably document that it is being ignored.

Status Finalize(KernelContext*, Datum* out) override {
out->value = std::make_shared<BooleanScalar>(this->any);
Status Finalize(KernelContext* ctx, Datum* out) override {
if (!options.skip_nulls && !this->any && this->has_nulls) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for the && !this->any part? That seems to suggest "kleene logic"? But for the non-kleene version, I expect any null in the input to always give null for the output.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this is to to make the PR "kleen" to match R behavior. Meanwhile Pandas' any is non-kleen.

>>> import pandas as pd
>>> pd.Series([None, None]).any(skipna=True)
False
>>> pd.Series([None, None]).any(skipna=False)
>>> 

We have three options IMO:

  • Revert to non-kleene c++ behaviour and add a small fix to R
  • Add kleen any/all kernels and route in R depending on flags
  • Keep c++ as is and add a fix to Python (that could introduce a lot of new unvanted logic)

Copy link
Member

@jorisvandenbossche jorisvandenbossche Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile Pandas' any is non-kleen.

The pandas any/all methods were broken for object dtype (and since numpy doesn't support nulls in its boolean dtype, whenever you have missing values, you have object dtype), so best not to use that as a reference (see eg pandas-dev/pandas#27709)

For the new nullable boolean dtype in pandas, the any/all methods also use kleene logic like in R.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So how about then just making this kernel "kleene" and just document that fact?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I:

  1. Document the behavior as kleen?
  2. Rather revert the change and first implement al_kleen/any_kleen (ARROW-10291) and then map R to those for skip_nulls==False?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure there is a good use case for a non-kleene version, so I am fine with just documenting for now that the behaviour follows Kleene logic (so is the reducing version of and/or_kleene)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I documented the new behavior in compute.rst and api_aggregate.h.

@rok rok force-pushed the ARROW-12499 branch 2 times, most recently from 1c1b90b to 97092b5 Compare June 14, 2021 22:21
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update the docstrings which are defined in aggregate_basic.cc ?

@jorisvandenbossche
Copy link
Member

Can you take a look at my comment from above, which I think is not yet addressed:

Can you also update the docstrings which are defined in aggregate_basic.cc ?

@rok
Copy link
Member Author

rok commented Jun 24, 2021

Can you take a look at my comment from above, which I think is not yet addressed:

Can you also update the docstrings which are defined in aggregate_basic.cc ?

Oh sorry, I missed that one. Added in last commit.

@rok
Copy link
Member Author

rok commented Jun 30, 2021

Ping :)

@rok rok force-pushed the ARROW-12499 branch 2 times, most recently from 9ce08ca to 7886d90 Compare July 5, 2021 17:40
@rok rok force-pushed the ARROW-12499 branch 2 times, most recently from 6acdae0 to a32441c Compare July 9, 2021 22:33
@rok
Copy link
Member Author

rok commented Jul 12, 2021

@jorisvandenbossche @jonkeane @ianmcook Does this need another round of reviews?

@rok rok force-pushed the ARROW-12499 branch 2 times, most recently from 19cb1fb to 20a5399 Compare July 13, 2021 18:29
@jonkeane jonkeane closed this in 1c002fc Jul 14, 2021
@rok
Copy link
Member Author

rok commented Jul 14, 2021

Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants