GH-39010: [Python] Introduce `maps_as_pydicts` parameter for `to_pylist`, `to_pydict`, `as_py` #45471
Conversation
Commits:
- Fix ExampleUuidScalarType
- Add tests for `maps_as_pydicts`
- Add test for duplicate map keys
- Formatting fixes
- Add docstring for 'maps_as_pydicts'
- Formatting fixes
- Call from_arrays from Table
- Fix last hopefully issues
- Correct MapScalar method "as_py" when there are multiple keys present
While this is not a bad idea in itself, it seems like the roundtripping concern could be solved more efficiently by making
Also:
Please note that
Let me clarify what this is about. Map fields are already creatable with
You can use
Yes, but this is part of a very large distributed machine learning setup, where relatively intricate filters are applied on deeply nested list/struct/map columns. The compute of the actual machine learning outweighs the compute one has to do to deserialize Python objects by many orders of magnitude. For pure data queries, we would not use bare Python objects of course.
I see, thanks. Then, do we want to reuse the same parameter signature as in the Pandas-related PR? I.e., allow either
Sure, I actually also stumbled across that when I revisited that original GitHub issue. Before I do that, I'd like to ask whether you're generally fine with adding this new parameter to every
That sounds ok to me. Ideally,
By the way, we probably want to make the new parameter keyword-only?
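For context, Python enforces keyword-only parameters with a bare `*` in the signature, so adding the parameter later never breaks existing positional callsites. A minimal illustrative sketch (the `convert` function below is hypothetical, not pyarrow's API):

```python
def convert(values, *, maps_as_pydicts=None):
    # Parameters after the bare `*` can only be passed by keyword.
    return (values, maps_as_pydicts)

assert convert([1, 2], maps_as_pydicts="lossy") == ([1, 2], "lossy")
try:
    convert([1, 2], "lossy")  # positional use is rejected
except TypeError:
    print("rejected positional argument")
```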
I addressed the remarks :) There is some weird error in the "Docs" job, I don't know what this is about. |
Hmm, it looks like some of the CI failures will need #45500 to be merged first |
I rebased the branch, now the CI tests seem fine again, I think? Could we get a approval/review of this? :) |
Thanks @jonded94 ! This looks good on the principle, here are some assorted comments.
python/pyarrow/array.pxi
Outdated
This can change the ordering of (key, value) pairs, and will
deduplicate multiple keys, resulting in a possible loss of data.
I think the ordering comment is obsolete, as Python dicts are ordered nowadays. Unless the underlying implementation does something weird, ordering should therefore be preserved.
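The ordering point can be demonstrated in plain Python; insertion order is a language guarantee for `dict` since Python 3.7, so a faithful pair-by-pair conversion preserves map ordering:

```python
# dicts preserve insertion order (guaranteed since Python 3.7), so building
# a dict from (key, value) pairs keeps the original pair order.
pairs = [("b", 2), ("a", 1), ("c", 3)]
d = dict(pairs)
assert list(d.items()) == pairs
```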
Removed the ordering part, added some explanation of which value survives on duplicate keys.
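The documented duplicate-key semantics can be sketched in plain Python. The `map_pairs_to_dict` helper below is a hypothetical stand-in for the Cython conversion loop, not the actual pyarrow implementation:

```python
import warnings

def map_pairs_to_dict(pairs, maps_as_pydicts="lossy"):
    # Hypothetical stand-in for MapScalar.as_py's conversion loop.
    result = {}
    for key, value in pairs:
        if key in result:
            if maps_as_pydicts == "strict":
                raise KeyError(f"Duplicate key {key!r} in map")
            # In 'lossy' mode, warn and let the later value win.
            warnings.warn(f"Duplicate key {key!r}; keeping the last value.")
        result[key] = value
    return result

# The value that survives a duplicate key is the last one seen,
# matching the {'a': 2} expectation in the tests below.
assert map_pairs_to_dict([("a", 1), ("a", 2)]) == {"a": 2}
```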
python/pyarrow/table.pxi
Outdated
Arrow Map, as in [(key1, value1), (key2, value2), ...].
If 'lossy' or 'strict', convert Arrow Map arrays to native Python dicts.
This can change the ordering of (key, value) pairs, and will
Same comment re: ordering
Same as above
python/pyarrow/tests/test_scalars.py
Outdated
with pytest.raises(ValueError):
    assert s.as_py(maps_as_pydicts="strict")

assert s.as_py(maps_as_pydicts="lossy") == {'a': 2}
Can we check that a warning is actually emitted? See pytest.warns
Implemented a check for this warning
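For illustration, `pytest.warns` fails unless a matching warning is actually emitted inside the block. The `lossy_as_py` helper here is a stand-in for calling `as_py(maps_as_pydicts="lossy")` on a map scalar with a duplicate key:

```python
import warnings
import pytest

def lossy_as_py():
    # Stand-in for s.as_py(maps_as_pydicts="lossy") on a duplicate-key map.
    warnings.warn("Duplicate key 'a'; keeping the last value.")
    return {"a": 2}

# The assertion only passes if a UserWarning matching the pattern is raised.
with pytest.warns(UserWarning, match="Duplicate key"):
    assert lossy_as_py() == {"a": 2}
```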
python/pyarrow/scalar.pxi
Outdated
raise ValueError(
    "Invalid value for 'maps_as_pydicts': "
    + "valid values are 'lossy', 'strict' or `None` (default). "
    + f"Received '{maps_as_pydicts}'."
Nit: it may be more idiomatic to use the repr here
Suggested change:
+ f"Received '{maps_as_pydicts}'."
+ f"Received {maps_as_pydicts!r}."
Implemented the suggested change
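For illustration, here is the difference the `!r` conversion makes: with manual quotes, `None` is rendered misleadingly as if it were the string `'None'`, while `repr` shows non-string values as-is and adds quotes to strings automatically:

```python
maps_as_pydicts = None
assert f"Received '{maps_as_pydicts}'." == "Received 'None'."  # looks like a string
assert f"Received {maps_as_pydicts!r}." == "Received None."    # clearly the value None

maps_as_pydicts = "lossyy"  # e.g. a typo passed by the caller
assert f"Received {maps_as_pydicts!r}." == "Received 'lossyy'."  # repr quotes strings
```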
python/pyarrow/scalar.pxi
Outdated
for key, value in self:
    if key in result_dict:
        if maps_as_pydicts == "strict":
            raise ValueError(
I would make this a KeyError. Also, the message should perhaps contain the duplicate key?
Made it a KeyError
@github-actions crossbow submit -g python
Revision: 93045c4 Submitted crossbow builds: ursacomputing/crossbow @ actions-9728f80818
+1, will merge if CI is green.
CI failures are unrelated.
Just FYI, this might cause a backward incompatibility issue, because user-defined extension types are not expecting
…o_pylist`, `to_pydict`, `as_py` (apache#45471)

### Rationale for this change

Currently, `MapScalar`/`Array` types are unfortunately not deserialized into proper Python `dict`s, which breaks "roundtrips" from Python -> Arrow -> Python:

```
import pyarrow as pa
schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
# [{'x': [('a', 1)]}]
```

This is especially bad when storing TiBs of deeply nested data (think of lists in structs in maps...) that were created from Python and serialized into Arrow/Parquet, since they can't be read in again with native `pyarrow` methods without doing extremely ugly and computationally costly workarounds.

### What changes are included in this PR?

A new parameter `maps_as_pydicts` is introduced to `to_pylist`, `to_pydict`, `as_py` which will allow proper roundtrips:

```
import pyarrow as pa
schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist(maps_as_pydicts="strict")
# [{'x': {'a': 1}}]
```

### Are these changes tested?

Yes. There are tests for `to_pylist` and `to_pydict` included for `pyarrow.Table`, whilst low-level `MapScalar` and especially a nesting with `ListScalar` and `StructScalar` is tested. Also, duplicate keys now should throw an error, which is also tested for.

### Are there any user-facing changes?

No callsites should be broken, simply a new keyword-only optional parameter is added.

* GitHub Issue: apache#39010

Authored-by: Jonas Dedden <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
We (Ray Data team) are also running into backward compatibility issues like this in our tests against pyarrow nightly with the same error mentioned here:

```
[2025-02-25T06:27:20Z] =================================== FAILURES ===================================
____________ test_convert_to_pyarrow_array_object_ext_type_fallback ____________

def test_convert_to_pyarrow_array_object_ext_type_fallback():
    column_values = create_ragged_ndarray(
        [
            "hi",
            1,
            None,
            [[[[]]]],
            {"a": [[{"b": 2, "c": UserObj(i=123)}]]},
            UserObj(i=456),
        ]
    )
    column_name = "py_object_column"

    # First, assert that straightforward conversion into Arrow native types fails
    with pytest.raises(ArrowConversionError) as exc_info:
        _convert_to_pyarrow_native_array(column_values, column_name)

    assert (
        str(exc_info.value)
        == "Error converting data to Arrow: ['hi' 1 None list([[[[]]]]) {'a': [[{'b': 2, 'c': UserObj(i=123)}]]}\n UserObj(i=456)]"  # noqa: E501
    )

    # Subsequently, assert that fallback to `ArrowObjectExtensionType` succeeds
    pa_array = convert_to_pyarrow_array(column_values, column_name)

>   assert pa_array.to_pylist() == column_values.tolist()

python/ray/air/tests/test_arrow.py:121:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: as_py() got an unexpected keyword argument 'maps_as_pydicts'
```
@Linchin @omatthew98 I think the way around this would be to take a `**kwargs` argument and forward it. For example, turn this:

```
class JSONArrowScalar(pa.ExtensionScalar):
    def as_py(self):
        return JSONArray._deserialize_json(self.value.as_py() if self.value else None)
```

into this:

```
class JSONArrowScalar(pa.ExtensionScalar):
    def as_py(self, **kwargs):
        return JSONArray._deserialize_json(self.value.as_py(**kwargs) if self.value else None)
```
I've updated the PR description, we should remember to call out this potential incompatibility in the release notes for the next version.
(ray-project#51041)

## Why are these changes needed?

Our tests with pyarrow nightly caught a backwards incompatibility bug with a recent pyarrow change (apache/arrow#45471). To fix this we simply need to pass along kwargs in our `as_py` method as suggested by the pyarrow team (apache/arrow#45471 (comment)).

Signed-off-by: Matthew Owen <[email protected]>
Rationale for this change

Currently, `MapScalar`/`Array` types are unfortunately not deserialized into proper Python `dict`s, which breaks "roundtrips" from Python -> Arrow -> Python.

This is especially bad when storing TiBs of deeply nested data (think of lists in structs in maps...) that were created from Python and serialized into Arrow/Parquet, since they can't be read in again with native `pyarrow` methods without doing extremely ugly and computationally costly workarounds.

What changes are included in this PR?

A new parameter `maps_as_pydicts` is introduced to `to_pylist`, `to_pydict`, `as_py` which will allow proper roundtrips.

Are these changes tested?

Yes. There are tests for `to_pylist` and `to_pydict` included for `pyarrow.Table`, whilst low-level `MapScalar` and especially a nesting with `ListScalar` and `StructScalar` is tested.

Also, duplicate keys now should throw an error, which is also tested for.

Are there any user-facing changes?

Yes. The `as_py()` method on Scalar instances can be called with a new keyword argument `maps_as_pydicts`.

As a consequence, if you implement your own Scalar subclass (for example for an extension type), you should change its signature to accept that new argument. For example this definition:

```
class JSONArrowScalar(pa.ExtensionScalar):
    def as_py(self):
        return JSONArray._deserialize_json(self.value.as_py() if self.value else None)
```

could be changed to:

```
class JSONArrowScalar(pa.ExtensionScalar):
    def as_py(self, **kwargs):
        return JSONArray._deserialize_json(self.value.as_py(**kwargs) if self.value else None)
```