GH-45601: [R] R arrow cannot handle labelled data in arrow tables #46431

thisisnic · 2025-05-13T20:50:34Z

Rationale for this change

There is a bug where we end up crashing when working with labelled columns in a table.

What changes are included in this PR?

Remove labels from columns

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

Draft PR - this works for tables but not datasets yet

GitHub Issue: [R] R arrow cannot handle labelled data in arrow tables #45601

github-actions · 2025-05-13T20:51:00Z

⚠️ GitHub issue #45601 has been automatically assigned in GitHub to PR creator.

thisisnic · 2025-05-24T08:42:52Z

@amoeba I tried the approach you suggested here but because we use as_arrow_table() internally in a lot more functions, we end up breaking roundtripping with Feather etc.

I think if we work only in R, we would want to remove the labels and then restore them later, but I'm still trying to find an uncomplicated way of doing this.

I think we definitely want to stop the segfault regardless and error instead.

Users technically can use mutate() to change the type to something we can work with, but there'll be resource costs with doing this on a dataset. See my reprex below.

library(haven)
library(arrow)
library(tibble)
library(dplyr)

d <- tibble(
  a = labelled(x = 1:5),
  b = labelled(x = 11:15)
)

tf <- tempfile()
write_parquet(d, tf)

# still fails
read_parquet(tf, as_data_frame = FALSE) %>%
  filter(a > 3) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)

tf <- tempfile()
write_parquet(d, tf)

# works
read_parquet(tf, as_data_frame = FALSE) %>%
  mutate(a = as.integer(a)) %>%
  filter(a > 3) %>%
  collect()
#> # A tibble: 2 × 2
#>       a b        
#>   <int> <int+lbl>
#> 1     4 14       
#> 2     5 15

# fails
open_dataset(tf) %>%
  mutate(a = as.integer(a)) %>%
  filter(a > 3) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater_equal' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)

# works but potentially higher resource usage
open_dataset(tf) %>%
  mutate(a = as.integer(a)) %>%
  compute() %>%
  filter(a > 3) %>%
  collect()
#> # A tibble: 2 × 2
#>       a b        
#>   <int> <int+lbl>
#> 1     4 14       
#> 2     5 15

thisisnic · 2025-05-24T11:55:33Z

I've stopped it segfaulting on printing, but I think the actual fix needs to be more layers deep.

thisisnic · 2025-05-24T19:08:05Z

I'm also wondering if instead of supporting this we should just stop the segfault and then error appropriately and recommend folks do something like:

open_dataset(whatever) %>%
  mutate(col = cast(col, int32())) %>%
  write_dataset(newlocation)

open_dataset(newlocation) %>%
  filter(col > 3) %>%
  collect()

Otherwise we're getting into the territory of supporting compute functions on extension types, which we don't actually do and which, if implemented, should be done lower down the stack anyway.

More discussion on computing on extension types here: https://lists.apache.org/thread/2j61nrod7x0s5vjhc6q9tlj898drz7rn


amoeba · 2025-05-28T00:24:24Z

Hey @thisisnic, thanks for working on this. I think fixing the segfault and erroring with a helpful message sounds great.

Add test for haven labelled datasets

1b07558

github-actions bot added the Component: R and awaiting committer review labels May 13, 2025

Stop it segfaulting when printing

1f3305b

thisisnic added 2 commits May 25, 2025 21:06

Try adding method to make string

f9de3aa

Remove unlabel function and implement creating Arrays from ExtensionTypes so can print query

900e2d5

github-actions bot added the Component: C++ label May 25, 2025

thisisnic added 3 commits May 25, 2025 22:32

Run linter

69044bd

Undo printing of extension type name for scalar expressions

23ad73a

Ditch whitespace added accidentally

f37f908
