Dask's Arrow serialization slow & memory intensive #2521
The easiest solution is probably to delay it:

```python
tables = [dask.delayed(pa.Table.from_pandas)(x) for x in ddf.to_delayed()]
```

and you can continue processing from there as needed.
@TomAugspurger does this explain the long delay, though? When I run this with …
Specifying …
Also, here is a more minimal example that doesn't engage the distributed scheduler, but still takes a long time.

```python
import pyarrow as pa
import dask.array as da
import dask.dataframe as dd

n_rows = 5000000
n_keys = 5000000

ddf = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='x'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1).persist()

def get_arrow(df):
    return pa.Table.from_pandas(df)

arrow_tables = ddf.map_partitions(get_arrow, meta=object).compute(scheduler='single-threaded')
```
The type of …

If I had to guess, dask.dataframe is doing the equivalent of

```python
pd.Series(pa.Table.from_pandas(ddf.compute()))
```

In theory, that shouldn't take long for pandas to do, but this is a really thorny area. We (pandas) essentially end up doing

```python
arr = np.empty((1,), dtype=object)
arr[:] = [table]
```

where `table` is the pyarrow Table.
Why …
Right; anywhere in …
A more minimal example, this time without Dask:

```python
In [1]: import numpy as np, pandas as pd, pyarrow as pa

In [2]: df = pd.DataFrame({'x': np.arange(1000000)})

In [3]: %time t = pa.Table.from_pandas(df)
CPU times: user 5.36 ms, sys: 3.22 ms, total: 8.58 ms
Wall time: 7.47 ms

In [4]: %time s = pd.Series([t], dtype=object)
CPU times: user 2.7 s, sys: 114 ms, total: 2.82 s
Wall time: 2.81 s
```
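For comparison, here is a sketch that sidesteps the constructor's list conversion by pre-allocating the object array and storing the Table by reference. This reflects an assumption about where the time goes; it is not something measured in the thread.

```python
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': np.arange(1000000)})
t = pa.Table.from_pandas(df)

# Store the Table by reference rather than handing the constructor a list;
# if the cost is in converting/inferring over the list, this path should be quick.
arr = np.empty(1, dtype=object)
arr[0] = t
s = pd.Series(arr)
```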
@TomAugspurger should I raise this upstream with Pandas, or leave it here?
We have issues for it :) Rewriting the constructors is on my medium-term TODO; pandas-dev/pandas#24387 started it for DataFrame.
It's worth noting that I used … While there may be an issue with Pandas, I think there's also an issue with Dask.
@randerzander can I ask you to expand on this?
`cudf.DataFrame.to_arrow` is a similar operation to `pyarrow.Table.from_pandas`. `cudf.DataFrame.to_arrow` (for the same dataset) returns in milliseconds. When I pull it from `map_partitions` with Dask, I get the GC warnings and a 20s+ slowdown. I know the worker has to send the data back to the client, but 20s seems excessive for a small, local transfer.
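For context, a rough sketch of the local (non-Dask) path described above. It assumes a GPU with cudf installed, and the dummy data is illustrative rather than the original 80MB repro; `cudf.DataFrame.from_pandas` is used here for convenience.

```python
import numpy as np
import pandas as pd
import cudf

# Build a small dummy frame and convert it to Arrow directly on the client,
# with no map_partitions and no pandas object-Series in the middle.
pdf = pd.DataFrame({'x': np.random.random(1000000),
                    'id': np.random.randint(0, 1000000, size=1000000)})
gdf = cudf.DataFrame.from_pandas(pdf)
table = gdf.to_arrow()   # analogous to pa.Table.from_pandas(pdf)
```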
Yup, totally agree. My example above shows that this might be because we're putting these things into Pandas series: apparently putting an Arrow Table into a Pandas Series takes several seconds. Probably this isn't a problem with serialization or communication on the Dask end; rather, by using Dask dataframe you're trying to keep these things around in Pandas, which is currently borking (at least until @TomAugspurger refactors the Pandas constructor logic).

Short term, the solution here is probably to avoid calling … and instead use something like:

```python
tables = [dask.delayed(pa.Table.from_pandas)(x) for x in ddf.to_delayed()]
```

This will avoid trying to put PyArrow Table objects into Pandas series, which seems to be the fundamental bug here.
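A minimal sketch of one way to continue from those delayed objects, assuming the goal is a single Table on the client (`dask.compute` and `pa.concat_tables` are illustrative choices, not something prescribed in the thread):

```python
import dask
import pyarrow as pa

# One delayed pyarrow.Table per partition of the dask dataframe `ddf`
tables = [dask.delayed(pa.Table.from_pandas)(x) for x in ddf.to_delayed()]

# Materialize the Tables on the client, then stitch them into a single Table
(tables,) = dask.compute(tables)
combined = pa.concat_tables(tables)
```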
Closing here as some combination of "not a bug" and "an upstream pandas problem". I've re-raised this upstream at pandas-dev/pandas#25389.
Got this when calling …
I'm creating a dummy 80MB single-partition Dask distributed DataFrame, and attempting to convert it to a PyArrow Table.
Doing so causes a notebook to throw GC warnings, and takes consistently over 20 seconds.
Versions:
PyArrow: 0.12.0
Dask: 1.1.1
Repro:
Result: