Commit eefc70a

jonded94 authored and andishgar committed
apacheGH-39010: [Python] Introduce maps_as_pydicts parameter for to_pylist, to_pydict, as_py (apache#45471)
### Rationale for this change

Currently, `MapScalar`/`Array` types are not deserialized into proper Python `dict`s. This breaks round trips from Python -> Arrow -> Python:

```
import pyarrow as pa

schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
# [{'x': [('a', 1)]}]
```

This is especially bad when storing TiBs of deeply nested data (think of lists in structs in maps...) that were created from Python and serialized into Arrow/Parquet, since they can't be read back with native `pyarrow` methods without extremely ugly and computationally costly workarounds.

### What changes are included in this PR?

A new parameter `maps_as_pydicts` is introduced to `to_pylist`, `to_pydict`, and `as_py`, which allows proper round trips:

```
import pyarrow as pa

schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist(maps_as_pydicts="strict")
# [{'x': {'a': 1}}]
```

### Are these changes tested?

Yes. Tests for `to_pylist` and `to_pydict` are included for `pyarrow.Table`, and the low-level `MapScalar` is tested as well, in particular when nested with `ListScalar` and `StructScalar`. With `maps_as_pydicts="strict"`, duplicate keys should now raise an error, which is also tested.

### Are there any user-facing changes?

No call sites should break; only a new keyword-only optional parameter is added.

* GitHub Issue: apache#39010

Authored-by: Jonas Dedden <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
1 parent 24c8e2b commit eefc70a

6 files changed: 473 additions & 55 deletions

python/pyarrow/array.pxi

Lines changed: 23 additions & 3 deletions
@@ -1651,16 +1651,30 @@ cdef class Array(_PandasConvertible):
             array = array.copy()
         return array

-    def to_pylist(self):
+    def to_pylist(self, *, maps_as_pydicts=None):
         """
         Convert to a list of native Python objects.

+        Parameters
+        ----------
+        maps_as_pydicts : str, optional, default `None`
+            Valid values are `None`, 'lossy', or 'strict'.
+            The default behavior (`None`), is to convert Arrow Map arrays to
+            Python association lists (list-of-tuples) in the same order as the
+            Arrow Map, as in [(key1, value1), (key2, value2), ...].
+
+            If 'lossy' or 'strict', convert Arrow Map arrays to native Python dicts.
+
+            If 'lossy', whenever duplicate keys are detected, a warning will be printed.
+            The last seen value of a duplicate key will be in the Python dictionary.
+            If 'strict', this instead results in an exception being raised when detected.
+
         Returns
         -------
         lst : list
         """
         self._assert_cpu()
-        return [x.as_py() for x in self]
+        return [x.as_py(maps_as_pydicts=maps_as_pydicts) for x in self]

     def tolist(self):
         """
@@ -2286,12 +2300,18 @@ cdef class MonthDayNanoIntervalArray(Array):
     Concrete class for Arrow arrays of interval[MonthDayNano] type.
     """

-    def to_pylist(self):
+    def to_pylist(self, *, maps_as_pydicts=None):
         """
         Convert to a list of native Python objects.

         pyarrow.MonthDayNano is used as the native representation.

+        Parameters
+        ----------
+        maps_as_pydicts : str, optional, default `None`
+            Valid values are `None`, 'lossy', or 'strict'.
+            This parameter is ignored for non-nested Scalars.
+
         Returns
         -------
         lst : list
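
The three modes described in the `maps_as_pydicts` docstring can be illustrated with a small pure-Python sketch. This is not pyarrow's implementation: `pairs_to_py` is a hypothetical helper that works on a plain list of `(key, value)` tuples, and the exception types (`ValueError` for an invalid mode, `KeyError` for a duplicate key in 'strict' mode) are assumptions, since the diff shown here does not name them.

```python
import warnings


def pairs_to_py(pairs, *, maps_as_pydicts=None):
    """Hypothetical stand-in for converting one map value to Python.

    Mirrors the documented modes only; it is not pyarrow code.
    """
    if maps_as_pydicts not in (None, "lossy", "strict"):
        # Assumed behavior: reject unknown modes up front.
        raise ValueError(f"invalid maps_as_pydicts: {maps_as_pydicts!r}")

    if maps_as_pydicts is None:
        # Default: keep the association list, duplicates and order included.
        return list(pairs)

    result = {}
    for key, value in pairs:
        if key in result:
            if maps_as_pydicts == "strict":
                # Assumed exception type; the docstring only says "an exception".
                raise KeyError(f"duplicate key {key!r}")
            # 'lossy': warn, then let the last seen value win below.
            warnings.warn(f"duplicate key {key!r}; keeping the last value")
        result[key] = value
    return result


pairs = [("a", 1), ("b", 2), ("a", 3)]

print(pairs_to_py(pairs))
# [('a', 1), ('b', 2), ('a', 3)]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the duplicate-key warning here
    print(pairs_to_py(pairs, maps_as_pydicts="lossy"))
# {'a': 3, 'b': 2}
```

In 'strict' mode the same input raises instead of silently dropping `('a', 1)`, which is the safer choice when a round trip must not lose data.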
