GH-45601: [R] R arrow cannot handle labelled data in arrow tables #46431
@amoeba I tried the approach you suggested here but because we use I think if we work only in R, we would want to remove the labels and then restore them later, but I'm still trying to find an uncomplicated way of doing this. I think we definitely want to stop the segfault regardless and error instead. Users technically can use

library(haven)
library(arrow)
library(tibble)
library(dplyr)
d <- tibble(
a = labelled(x = 1:5),
b = labelled(x = 11:15)
)
tf <- tempfile()
write_parquet(d, tf)
# still fails
read_parquet(tf, as_data_frame = FALSE) %>%
filter(a > 3) %>%
collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)

tf <- tempfile()
write_parquet(d, tf)
# works
read_parquet(tf, as_data_frame = FALSE) %>%
mutate(a = as.integer(a)) %>%
filter(a > 3) %>%
collect()
#> # A tibble: 2 × 2
#>       a         b
#>   <int> <int+lbl>
#> 1     4        14
#> 2     5        15

# fails
open_dataset(tf) %>%
mutate(a = as.integer(a)) %>%
filter(a > 3) %>%
collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater_equal' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)

# works but potentially higher resource usage
open_dataset(tf) %>%
mutate(a = as.integer(a)) %>%
compute() %>%
filter(a > 3) %>%
collect()
#> # A tibble: 2 × 2
#>       a         b
#>   <int> <int+lbl>
#> 1     4        14
#> 2     5        15
I've stopped it segfaulting on printing, but I think the actual fix needs to be more layers deep.
I'm also wondering if instead of supporting this we should just stop the segfault and then error appropriately and recommend folks do something like:
Otherwise we're getting into the territory of supporting compute functions on extension types, which we don't actually do and which, if implemented, should be done lower down the stack anyway. More discussion on computing on extension types here: https://lists.apache.org/thread/2j61nrod7x0s5vjhc6q9tlj898drz7rn
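For illustration, the kind of user-side workaround described above could look like the sketch below. It assumes that stripping the labels up front with haven's `zap_labels()` is acceptable for the workflow; this is one possible option, not necessarily the snippet recommended in this thread, and any labels that are still needed would have to be reapplied separately after `collect()`.

```r
library(haven)
library(arrow)
library(dplyr)
library(tibble)

d <- tibble(
  a = labelled(x = 1:5),
  b = labelled(x = 11:15)
)

# Assumption: dropping the haven label metadata is acceptable here,
# so Arrow only ever sees plain integer columns.
d_plain <- zap_labels(d)

tf <- tempfile()
write_parquet(d_plain, tf)

# Filtering now runs on ordinary integer columns.
res <- read_parquet(tf, as_data_frame = FALSE) %>%
  filter(a > 3) %>%
  collect()
```

The trade-off is that the label information is not stored in the Parquet file at all, so this only suits cases where the labels can be recreated from another source.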
Hey @thisisnic, thanks for working on this. I think fixing the segfault and erroring with a helpful message sounds great.
Rationale for this change
There is a bug that causes a crash (segfault) when working with labelled columns in Arrow tables.
What changes are included in this PR?
Remove labels from columns
Are these changes tested?
Yes
Are there any user-facing changes?
Yes
Draft PR - this works for tables but not datasets yet