GH-39010: [Python] Introduce maps_as_pydicts parameter for to_pylist, to_pydict, as_py (apache#45471)
### Rationale for this change
Currently, `MapScalar`/`MapArray` values are not deserialized into Python `dict`s but into lists of `(key, value)` tuples, which breaks round trips from Python -> Arrow -> Python:
```
import pyarrow as pa
schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
# [{'x': [('a', 1)]}]
```
This is especially painful when storing TiBs of deeply nested data (for example lists in structs in maps) that was created from Python and serialized into Arrow/Parquet, since it cannot be read back with native `pyarrow` methods without ugly and computationally costly workarounds.
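To illustrate the kind of workaround meant here, the following is a minimal pure-Python sketch that walks the default `to_pylist()` output and turns lists of 2-tuples back into dicts. The helper name `pairs_to_dicts` is hypothetical and not part of `pyarrow`. The approach is a schema-unaware heuristic: it assumes every non-empty list of 2-tuples originated from a map, and it cannot tell an empty map from an empty list.

```
def pairs_to_dicts(value):
    """Recursively convert lists of 2-tuples into dicts.

    Hypothetical helper, not a pyarrow API. It guesses from the shape of
    the data, so an empty map (returned as []) stays an empty list.
    """
    if isinstance(value, dict):
        # Struct-like value: keep the keys, convert the nested values.
        return {k: pairs_to_dicts(v) for k, v in value.items()}
    if isinstance(value, list):
        looks_like_map = bool(value) and all(
            isinstance(item, tuple) and len(item) == 2 for item in value
        )
        if looks_like_map:
            return {k: pairs_to_dicts(v) for k, v in value}
        # Ordinary list: convert each element.
        return [pairs_to_dicts(item) for item in value]
    return value


# Shape of the default to_pylist() output shown in the example above.
rows = [{'x': [('a', 1)]}]
restored = pairs_to_dicts(rows)
```

Every row has to be traversed again in Python after the conversion, which is where the extra cost on large, deeply nested data comes from.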
### What changes are included in this PR?
A new parameter `maps_as_pydicts` is added to `to_pylist`, `to_pydict`, and `as_py`, which allows proper round trips:
```
import pyarrow as pa
schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist(maps_as_pydicts="strict")
# [{'x': {'a': 1}}]
```
### Are these changes tested?
Yes. Tests for `to_pylist` and `to_pydict` are included for `pyarrow.Table`. At the lower level, `MapScalar` is tested, in particular when nested with `ListScalar` and `StructScalar`.
Also, duplicate keys should now raise an error, which is tested as well.
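The duplicate-key rule can be sketched in plain Python as follows. This is only an illustration of the described behavior, not the actual `pyarrow` implementation: the function name `strict_pairs_to_dict` is hypothetical, and raising `KeyError` is an assumption, since the text above only says that an error is raised.

```
def strict_pairs_to_dict(pairs):
    """Build a dict from (key, value) pairs, rejecting duplicate keys.

    Hypothetical sketch; the exception type is an assumption.
    """
    result = {}
    for key, value in pairs:
        if key in result:
            # A dict cannot hold both values, so refuse instead of
            # silently dropping one of them.
            raise KeyError(f"duplicate map key: {key!r}")
        result[key] = value
    return result


unique = strict_pairs_to_dict([('a', 1), ('b', 2)])

try:
    strict_pairs_to_dict([('a', 1), ('a', 2)])
    duplicate_rejected = False
except KeyError:
    duplicate_rejected = True
```

Rejecting duplicates keeps the conversion lossless: a map with repeated keys has no faithful `dict` representation, so failing loudly avoids quietly changing the data.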
### Are there any user-facing changes?
No call sites should break; the change only adds a new optional keyword-only parameter.
* GitHub Issue: apache#39010
Authored-by: Jonas Dedden <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>