ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels #10476

rok · 2021-06-08T01:24:35Z

This is to resolve ARROW-12499.

github-actions · 2021-06-08T01:24:54Z

https://issues.apache.org/jira/browse/ARROW-12499

lidavidm

C++ parts look good to me.

Now that the kernel has options, should we add a Python convenience function in pyarrow/compute.py?

lidavidm · 2021-06-08T14:16:11Z

(Oh sorry, I missed that this is a draft.)

rok · 2021-06-08T14:35:16Z

C++ parts look good to me.

I'm wondering whether boolean checks would be better than counts?

Now that the kernel has options, should we add a Python convenience function in pyarrow/compute.py?

Yeah definitely, next commit :)

lidavidm · 2021-06-08T14:38:29Z

C++ parts look good to me.

I'm wondering whether boolean checks would be better than counts?

I debated whether to comment on this. I would slightly prefer the boolean check over the count but the intent is clear either way.

rok · 2021-06-08T16:38:21Z

@lidavidm I switched from a null count to a has_nulls boolean and added a Python test, so this should be good to review now.

rok · 2021-06-08T16:42:38Z

Hey @thisisnic, @jonkeane, could you take a look at the R part of this PR? It would be great to have an extra check :).

lidavidm

LGTM. I'm not quite sure what to make of the R tests so I'll let someone else comment on those.

lidavidm · 2021-06-08T16:42:12Z

python/pyarrow/tests/test_compute.py

@@ -494,30 +494,40 @@ def test_min_max():

 def test_any():
    # ARROW-1846
+
+    options = pc.ScalarAggregateOptions(skip_nulls=False)


Hmm, I had thought that generally, for kernels with options, we add a wrapper in pyarrow/compute.py. But now that I look, this isn't true for many of the kernels. So having Python tests is enough.

jonkeane · 2021-06-08T16:49:42Z

r/tests/testthat/test-compute-aggregate.R

  data <- c(1:10, NA, NA)

-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)


Does this test fail without the , na.rm = TRUE?

It does indeed:

Failure (test-compute-aggregate.R:388:3): any.Array and any.ChunkedArray
as.vector(x) not equal to `y`.
'is.NA' value mismatch: 0 in current 1 in target
Backtrace:
 1. expect_vector_equal(any(input > 5), data) test-compute-aggregate.R:388:2
 2. expect_as_vector(via_array, expected, ignore_attr, ...) helper-expectation.R:174:4
 3. testthat:::expect_fun(as.vector(x), y, ...) helper-expectation.R:24:2

Failure (test-compute-aggregate.R:388:3): any.Array and any.ChunkedArray
as.vector(x) not equal to `y`.
'is.NA' value mismatch: 0 in current 1 in target
Backtrace:
 1. expect_vector_equal(any(input > 5), data) test-compute-aggregate.R:388:2
 2. expect_as_vector(via_chunked, expected, ignore_attr, ...) helper-expectation.R:187:4
 3. testthat:::expect_fun(as.vector(x), y, ...) helper-expectation.R:24:2

I'm not sure how to read what happens in expect_vector_equal.

expect_vector_equal takes an expression and a vector. It first evaluates the expression on the vector in base R, then converts the vector into an array and evaluates the expression again (which typically calls the arrow compute functions), and finally converts the vector to a chunked array and does the same. All three results are expected to match.

So the errors you're seeing indicate that we've got a mismatch in behavior. I see that Ian has commented below about more robust tests + how to match the behavior (but I'm happy to add to it if that's unclear!)

Indeed @jonkeane it was a behavior issue. Should be fixed now.

jonkeane

A few questions

r/tests/testthat/test-compute-aggregate.R

rok · 2021-06-09T01:03:15Z

@lidavidm - I've made some small changes to the C++ code to enable the desired behaviour, so a re-review may be in order. The changes are all in this commit.

jorisvandenbossche

For the and/or kernels (the non-reducing variants of this), we have a separate "kleene" version of those kernels. So I am wondering: do we need that here as well?

Apparently a few months ago I thought that this was needed, given ARROW-10291 / #8294 (comment), but to be honest I don't fully understand my own comment there any more .. ;)

It seems that the current PR implements the "kleene" version, and I am not fully sure how a non-kleene version would behave otherwise and if this would actually ever be useful.

In any case we should probably describe the behaviour somewhere more explicitly.

Do we want any/all to follow the min_count parameter of the ScalarAggregateOptions? If not, we should probably document that it is being ignored.

jorisvandenbossche · 2021-06-09T07:36:34Z

cpp/src/arrow/compute/kernels/aggregate_basic.cc

-  Status Finalize(KernelContext*, Datum* out) override {
-    out->value = std::make_shared<BooleanScalar>(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+    if (!options.skip_nulls && !this->any && this->has_nulls) {


What's the reason for the && !this->any part? That seems to suggest "kleene logic"? But for the non-kleene version, I expect any null in the input to always give null for the output.

Indeed, this is to make the PR follow Kleene logic to match R behavior. Meanwhile, pandas' any is non-Kleene.

>>> import pandas as pd
>>> pd.Series([None, None]).any(skipna=True)
False
>>> pd.Series([None, None]).any(skipna=False)
>>>

We have three options IMO:

Revert to non-Kleene C++ behaviour and add a small fix to R

Add Kleene any/all kernels and route in R depending on flags

Keep C++ as is and add a fix to Python (that could introduce a lot of new unwanted logic)
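To make the Finalize condition under discussion concrete, here is a minimal pure-Python sketch of the two-flag state it implies. The class and function names (AnyState, any_kernel) are hypothetical stand-ins, None plays the role of null, and only the condition itself is taken from the diff above; this is not the Arrow implementation.

```python
from typing import Iterable, Optional


class AnyState:
    """Hypothetical stand-in for the kernel state: two flags, as in the diff."""

    def __init__(self, skip_nulls: bool = True) -> None:
        self.skip_nulls = skip_nulls
        self.any = False        # at least one True was seen
        self.has_nulls = False  # at least one null (None) was seen

    def consume(self, values: Iterable[Optional[bool]]) -> None:
        for value in values:
            if value is None:
                self.has_nulls = True
            elif value:
                self.any = True

    def finalize(self) -> Optional[bool]:
        # Mirrors: !options.skip_nulls && !this->any && this->has_nulls
        if not self.skip_nulls and not self.any and self.has_nulls:
            return None  # unknown: one of the nulls could have been True
        return self.any


def any_kernel(values: Iterable[Optional[bool]],
               skip_nulls: bool = True) -> Optional[bool]:
    """Run the sketch end to end on one sequence of values."""
    state = AnyState(skip_nulls)
    state.consume(values)
    return state.finalize()
```

With skip_nulls=False, a True anywhere in the input still wins over nulls, which is what the `!this->any` term encodes; a non-Kleene variant would drop that term and return null whenever has_nulls is set.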

Meanwhile, pandas' any is non-Kleene.

The pandas any/all methods were broken for object dtype (and since numpy doesn't support nulls in its boolean dtype, whenever you have missing values, you have object dtype), so best not to use that as a reference (see eg pandas-dev/pandas#27709)

For the new nullable boolean dtype in pandas, the any/all methods also use kleene logic like in R.

So how about then just making this kernel "kleene" and just document that fact?

Shall I:

Document the behavior as Kleene?

Or rather revert the change, first implement all_kleene/any_kleene (ARROW-10291), and then map R to those for skip_nulls==False?

I am not sure there is a good use case for a non-kleene version, so I am fine with just documenting for now that the behaviour follows Kleene logic (so is the reducing version of and/or_kleene)
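As a rough illustration of what "the reducing version of and/or_kleene" means, the sketch below folds binary Kleene operators over a sequence. It is plain Python with None standing in for null; the function names are illustrative only and are not Arrow kernels. It corresponds to the skip_nulls=False case.

```python
from functools import reduce
from typing import Iterable, Optional

Tri = Optional[bool]  # True, False, or None (null / unknown)


def or_kleene(a: Tri, b: Tri) -> Tri:
    """Kleene OR: True dominates, otherwise null propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def and_kleene(a: Tri, b: Tri) -> Tri:
    """Kleene AND: False dominates, otherwise null propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def any_kleene(values: Iterable[Tri]) -> Tri:
    # Fold OR over the input, starting from the identity element False.
    return reduce(or_kleene, values, False)


def all_kleene(values: Iterable[Tri]) -> Tri:
    # Fold AND over the input, starting from the identity element True.
    return reduce(and_kleene, values, True)
```

Under this definition a null only makes the result null when the known values cannot decide it: any over [None, True] is True, while any over [None, False] is null.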

I documented the new behavior in compute.rst and api_aggregate.h.

jorisvandenbossche

Can you also update the docstrings which are defined in aggregate_basic.cc ?

cpp/src/arrow/compute/api_aggregate.h

docs/source/cpp/compute.rst

cpp/src/arrow/compute/api_aggregate.h

jorisvandenbossche · 2021-06-24T07:24:24Z

Can you take a look at my comment from above, which I think is not yet addressed:

Can you also update the docstrings which are defined in aggregate_basic.cc ?

rok · 2021-06-24T18:51:06Z

Can you take a look at my comment from above, which I think is not yet addressed:

Can you also update the docstrings which are defined in aggregate_basic.cc ?

Oh sorry, I missed that one. Added in last commit.

rok · 2021-06-30T12:13:32Z

Ping :)

rok · 2021-07-12T09:38:33Z

@jorisvandenbossche @jonkeane @ianmcook Does this need another round of reviews?

Co-authored-by: Ian Cook <[email protected]>

Co-authored-by: Joris Van den Bossche <[email protected]>

rok · 2021-07-14T16:02:37Z

Thanks all!

github-actions bot added Component: C++ Component: R labels Jun 8, 2021

rok force-pushed the ARROW-12499 branch 3 times, most recently from 34c3f48 to 5b24ca3 Compare June 8, 2021 13:45

lidavidm reviewed Jun 8, 2021

rok force-pushed the ARROW-12499 branch 2 times, most recently from 8b4d1a2 to 4a864b3 Compare June 8, 2021 15:56

github-actions bot added the Component: Python label Jun 8, 2021

rok force-pushed the ARROW-12499 branch from 4a864b3 to 9f2f039 Compare June 8, 2021 16:37

rok marked this pull request as ready for review June 8, 2021 16:38

lidavidm approved these changes Jun 8, 2021

jonkeane reviewed Jun 8, 2021

jonkeane requested changes Jun 8, 2021

r/tests/testthat/test-compute-aggregate.R (outdated, resolved)

ianmcook reviewed Jun 8, 2021

r/tests/testthat/test-compute-aggregate.R (outdated, resolved)

ianmcook reviewed Jun 8, 2021

r/tests/testthat/test-compute-aggregate.R (resolved)

rok force-pushed the ARROW-12499 branch 3 times, most recently from f8c7a4e to 6976376 Compare June 9, 2021 00:56

jorisvandenbossche reviewed Jun 9, 2021

rok force-pushed the ARROW-12499 branch 2 times, most recently from 1c1b90b to 97092b5 Compare June 14, 2021 22:21

jorisvandenbossche reviewed Jun 15, 2021

cpp/src/arrow/compute/api_aggregate.h (outdated, resolved)

cpp/src/arrow/compute/api_aggregate.h (outdated, resolved)

docs/source/cpp/compute.rst (outdated, resolved)

rok force-pushed the ARROW-12499 branch from a28ea56 to 46586ca Compare June 15, 2021 13:06

jorisvandenbossche reviewed Jun 15, 2021

cpp/src/arrow/compute/api_aggregate.h (outdated, resolved)

rok force-pushed the ARROW-12499 branch from adcead3 to f9e5f12 Compare June 21, 2021 10:58

rok requested a review from jorisvandenbossche June 23, 2021 15:26

rok force-pushed the ARROW-12499 branch from f9e5f12 to c5976a9 Compare June 24, 2021 18:50

rok force-pushed the ARROW-12499 branch from c5976a9 to 47e12cd Compare June 28, 2021 15:47

rok force-pushed the ARROW-12499 branch 2 times, most recently from 9ce08ca to 7886d90 Compare July 5, 2021 17:40

rok force-pushed the ARROW-12499 branch 2 times, most recently from 6acdae0 to a32441c Compare July 9, 2021 22:33

rok force-pushed the ARROW-12499 branch 2 times, most recently from 19cb1fb to 20a5399 Compare July 13, 2021 18:29

rok and others added 8 commits July 13, 2021 23:25

Adding ScalarAggregateOptions to Any and All kernels.

162c7cb

Apply suggestions from code review

3762482

Co-authored-by: Ian Cook <[email protected]>

Fixing behaviour.

5590491

Documenting any/all as kleen when skip_null=True.

a67de91

Apply suggestions from code review

263fd7a

Co-authored-by: Joris Van den Bossche <[email protected]>

Update cpp/src/arrow/compute/api_aggregate.h

2fa7505

Co-authored-by: Joris Van den Bossche <[email protected]>

Adding comments to aggregate_basic.cc.

5150ec9

Adding lole32 to configure.win.

2743729

rok force-pushed the ARROW-12499 branch from 20a5399 to 2743729 Compare July 13, 2021 21:25

jonkeane approved these changes Jul 14, 2021

jonkeane closed this in 1c002fc Jul 14, 2021

asfimport mentioned this pull request Jul 14, 2021

[C++][Compute][R] Add ScalarAggregateOptions to Any and All kernels #28264

Closed
