Dask's Arrow serialization slow & memory intensive #2521


Closed

randerzander opened this issue Feb 11, 2019 · 16 comments

@randerzander

I'm creating a dummy 80MB single-partition Dask distributed DataFrame, and attempting to convert it to a PyArrow Table.

Doing so causes the notebook to emit GC warnings and consistently takes over 20 seconds.

Versions:
PyArrow: 0.12.0
Dask: 1.1.1

Repro:

from dask.distributed import Client, wait, LocalCluster
import pyarrow as pa

# start a local cluster and connect a client to it
ip = '0.0.0.0'
cluster = LocalCluster(ip=ip)
client = Client(cluster)

import dask.array as da
import dask.dataframe as dd

n_rows = 5000000
n_keys = 5000000

# build a single-partition dataframe: a random float column 'x' and an integer column 'id'
ddf = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='x'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1).persist()

def get_arrow(df):
    # convert one pandas partition to a PyArrow Table
    return pa.Table.from_pandas(df)

%time arrow_tables = ddf.map_partitions(get_arrow).compute()

Result:

distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 26% CPU time recently (threshold: 10%)
CPU times: user 20.6 s, sys: 1.17 s, total: 21.7 s
Wall time: 22.5 s
@TomAugspurger
Member

map_partitions is generally expected to return a dask DataFrame. The meta on your ddf.map_partitions(get_arrow) isn't accurate.

The easiest solution is probably to delay pa.Table.from_pandas, giving you a list of delayed Table objects.

tables = [dask.delayed(pa.Table.from_pandas)(x) for x in ddf.to_delayed()]

and you can continue processing from there as needed.
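
For reference, a minimal sketch of carrying that through (names other than ddf are illustrative; dask.compute materializes the delayed objects on the client):

import dask
import pyarrow as pa

# one delayed pa.Table per partition, never wrapped in a pandas container
tables = [dask.delayed(pa.Table.from_pandas)(part) for part in ddf.to_delayed()]

# materialize them as a tuple of concrete pa.Table objects
arrow_tables = dask.compute(*tables)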

@mrocklin
Member

@TomAugspurger does this explain the long delay though? When I run this with prun/snakeviz I find that it's spending a bunch of time in pandas' construct_1d_object_array_from_listlike (under pandas/core/dtypes/).

@mrocklin
Member

Specifying meta=object also doesn't help here.

@mrocklin
Member

Also, here is a more minimal example that doesn't engage the distributed scheduler, but still takes a long time.

import pyarrow as pa

import dask.array as da
import dask.dataframe as dd

n_rows = 5000000
n_keys = 5000000

ddf = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='x'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1).persist()

def get_arrow(df):
    return pa.Table.from_pandas(df)

arrow_tables = ddf.map_partitions(get_arrow, meta=object).compute(scheduler='single-threaded')

@TomAugspurger
Member

The type of arrow_tables is still a Series[object] containing a single pa.Table.

If I had to guess, dask.dataframe is doing the equivalent of

pd.Series(pa.Table.from_pandas(ddf.compute()))

In theory, that shouldn't take long for pandas to do, but this is a really thorny area. We (pandas) essentially end up doing

arr = np.empty((1,), dtype=object)
arr[:] = [table]

where table is pa.Table.from_pandas(ddf.compute()), which takes forever.
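
A quick, illustrative way to check that the object-array assignment itself is the slow step (timings will vary; the exact mechanism is a guess, but numpy appears to inspect the Table's contents during the slice assignment):

import numpy as np, pandas as pd, pyarrow as pa

df = pd.DataFrame({'x': np.arange(1000000)})
table = pa.Table.from_pandas(df)

arr = np.empty((1,), dtype=object)
%time arr[:] = [table]   # the path pandas takes internally; very slow
%time arr[0] = table     # plain reference store; fast by comparison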

@mrocklin
Member

Why pa.Table.from_pandas(ddf.compute()) and not pa.Table.from_pandas(a_partition)?

@TomAugspurger
Member

Right; anywhere in #2521 (comment) where I have ddf.compute, it'll be a partition (which just happens to be the entire dask dataframe in this case, since there's a single partition). So you'd get a Series with one row per original partition, where each row is a Table.

@mrocklin
Member

A more minimal example, this time without Dask:

In [1]: import numpy as np, pandas as pd, pyarrow as pa

In [2]: df = pd.DataFrame({'x': np.arange(1000000)})

In [3]: %time t = pa.Table.from_pandas(df)
CPU times: user 5.36 ms, sys: 3.22 ms, total: 8.58 ms
Wall time: 7.47 ms

In [4]: %time s = pd.Series([t], dtype=object)
CPU times: user 2.7 s, sys: 114 ms, total: 2.82 s
Wall time: 2.81 s

@mrocklin
Member

@TomAugspurger should I raise this upstream at Pandas or leave it here?

@TomAugspurger
Member

We have issues for it :)

Rewriting the constructors is on my medium-term TODO: pandas-dev/pandas#24387 started the work for DataFrame.

@randerzander
Author

It's worth noting that I used from_pandas in the repro snippet to keep out the thornier example, which involves cudf: rapidsai/cudf#899

While there may be an issue with Pandas, I think there's also an issue with Dask.

@mrocklin
Member

> While there may be an issue with Pandas, I think there's also an issue with Dask.

@randerzander can I ask you to expand on this?

@randerzander
Author

cudf.DataFrame.to_arrow is a similar operation to pyarrow.Table.from_pandas.

cudf.DataFrame.to_arrow (for the same dataset) returns in milliseconds. When I pull it from map_partitions with Dask, I get the GC warnings and 20s+ slowdown. I know the worker has to send the data back to the client, but 20s seems excessive for a small, local transfer.

@mrocklin
Member

Yup, totally agree. My example above shows that this might be because we're putting these things into Pandas Series, and apparently putting an Arrow Table into a Pandas Series takes several seconds. This probably isn't a problem with serialization or communication on the Dask end; rather, by using Dask DataFrame you're trying to keep these objects around in Pandas, which is currently choking on them (at least until @TomAugspurger refactors the Pandas constructor logic).

Short term, the solution here is probably to avoid calling map_partitions(pa.Table.from_pandas), which will try to form another dataframe object, and instead use Dask Delayed as @TomAugspurger suggests above:

tables = [dask.delayed(pa.Table.from_pandas)(x) for x in ddf.to_delayed()]

This will avoid trying to put PyArrow table objects in Pandas series, which seems to be the fundamental bug here.
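
If a single Table spanning the whole dataset is ultimately what's wanted, one possible follow-on (assuming every partition produces the same schema) is to compute the delayed objects and concatenate them on the client:

import dask
import pyarrow as pa

# 'tables' is the list of delayed objects from the snippet above
computed = dask.compute(*tables)
combined = pa.concat_tables(computed)  # one Table covering all partitions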

@mrocklin
Member

Closing here as some combination of "not a bug" and "an upstream pandas problem".

I've re-raised this upstream at pandas-dev/pandas#25389

@d6tdev

d6tdev commented Oct 20, 2019

I hit this when calling .to_parquet() without fastparquet installed, so pyarrow was used to write and Dask then had trouble reading the file back. Installing fastparquet fixed the problem.
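
For anyone running into the same thing, a small sketch of pinning the parquet engine explicitly rather than relying on auto-detection (the path is a placeholder; assumes the chosen engine is installed):

import dask.dataframe as dd

# write and read back with the same, explicitly chosen engine
ddf.to_parquet('out_parquet/', engine='pyarrow')          # or engine='fastparquet'
ddf2 = dd.read_parquet('out_parquet/', engine='pyarrow')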
