
lazily load dask arrays to dask data frames by calling to_dask_dataframe #1489


Merged: 33 commits, Oct 28, 2017

Commits
- 66452cc: Merge remote-tracking branch 'pydata/master' (jmunroe, Apr 15, 2017)
- f4b564e: Merge branch 'master' of https://github.com/pydata/xarray (jmunroe, May 2, 2017)
- aac672d: Merge remote-tracking branch 'upstream/master' (jmunroe, Jul 26, 2017)
- 67a71b6: Merge remote-tracking branch 'upstream/master' (jmunroe, Aug 1, 2017)
- 55417aa: add test for conversion to dask dataframes (jmunroe, Jul 26, 2017)
- bf92c4c: WIP: beginning implementation (jmunroe, Jul 26, 2017)
- 84fe8e4: use dd.from_dask_array to convert dask arrays to dask dataframes (jmunroe, Jul 26, 2017)
- 157613b: initial attempt at creating dask dataframes from dask dataarrays (jmunroe, Jul 27, 2017)
- 414be29: create separate to_dask_dataframe method (jmunroe, Jul 27, 2017)
- bf9ec78: added docstring (jmunroe, Jul 27, 2017)
- c64db76: minor code clean up (jmunroe, Jul 27, 2017)
- 6703a41: default to set_index=False (jmunroe, Jul 28, 2017)
- 138a237: create dask frame directly to support multiple data types in datafram… (jmunroe, Aug 1, 2017)
- 17d7819: fixed error in calculating divisions for dataframe; fixed dataarray a… (jmunroe, Aug 1, 2017)
- 41a8e0d: refactor to use dask dataframe api to construct dataframes (jmunroe, Sep 5, 2017)
- 47833f4: merge from upstream (jmunroe, Oct 10, 2017)
- 3c6dcb6: fix style issues reported by flake8 (jmunroe, Oct 10, 2017)
- 27de6b3: added note describing new to_dask_dataframe method (jmunroe, Oct 10, 2017)
- 2ef7983: add entry in api docs for Dataset.to_dask_dataframe (jmunroe, Oct 10, 2017)
- 1cf80a7: added reference to to_dask_dataframe in the pandas doc discussing to_… (jmunroe, Oct 10, 2017)
- 024a1aa: add missing method identifier in docs (jmunroe, Oct 10, 2017)
- 67dbbe5: retain coordinate variables in dask dataframe even if index is not set (jmunroe, Oct 11, 2017)
- dd3c9c5: add example of to_dask_dataframe() to docs (jmunroe, Oct 11, 2017)
- 2948c86: fix flake8 issues (jmunroe, Oct 11, 2017)
- d4247a9: clarify behaviour in doc string of to_dask_dataframe (jmunroe, Oct 11, 2017)
- 5fd1fc7: Merge branch 'master' of https://github.com/pydata/xarray into dask_d… (jmunroe, Oct 19, 2017)
- 4705fde: resolve merge (jmunroe, Oct 20, 2017)
- c73f5b4: restructure to handle coordinate and data variables more carefully (jmunroe, Oct 20, 2017)
- 64458e2: add test if dataset has a dimension without coordinates (jmunroe, Oct 20, 2017)
- 6f6b48d: break up tests into sub tests (jmunroe, Oct 20, 2017)
- 9dcdbca: fix flake8 issues (jmunroe, Oct 20, 2017)
- a422965: Use assert_frame_equal (shoyer, Oct 27, 2017)
- ab8180b: Use int64 (shoyer, Oct 27, 2017)
1 change: 1 addition & 0 deletions doc/api.rst
@@ -423,6 +423,7 @@ Dataset methods
save_mfdataset
Dataset.to_array
Dataset.to_dataframe
Dataset.to_dask_dataframe
Dataset.to_dict
Dataset.from_dataframe
Dataset.from_dict
11 changes: 10 additions & 1 deletion doc/dask.rst
@@ -100,6 +100,15 @@ Once you've manipulated a dask array, you can still write a dataset too big to
fit into memory back to disk by using :py:meth:`~xarray.Dataset.to_netcdf` in the
usual way.

A dataset can also be converted to a dask DataFrame using :py:meth:`~xarray.Dataset.to_dask_dataframe`.

.. ipython:: python

df = ds.to_dask_dataframe()
df

Dask DataFrames do not support multi-indexes, so the coordinate variables from the dataset are included as columns in the dask DataFrame.
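
When a dataset has exactly one dimension, ``set_index=True`` can be passed to index the resulting DataFrame by that dimension's coordinate instead. An editor's sketch, not part of this diff (``ds1d`` is a hypothetical one-dimensional dataset; assumes ``np`` and ``xr`` are imported as in the surrounding examples):

.. ipython:: python

    ds1d = xr.Dataset({'a': ('t', np.arange(5))}).chunk(2)
    ds1d['t'] = ('t', np.arange(5) * 2)
    ds1d.to_dask_dataframe(set_index=True)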

Using dask with xarray
----------------------

@@ -145,7 +154,7 @@ Explicit conversion by wrapping a DataArray with ``np.asarray`` also works:
...

Alternatively you can load the data into memory but keep the arrays as
-dask arrays using the `~xarray.Dataset.persist` method:
+dask arrays using the :py:meth:`~xarray.Dataset.persist` method:

.. ipython::

3 changes: 3 additions & 0 deletions doc/pandas.rst
@@ -60,6 +60,9 @@ To convert the ``DataFrame`` to any other convenient representation,
use ``DataFrame`` methods like :py:meth:`~pandas.DataFrame.reset_index`,
:py:meth:`~pandas.DataFrame.stack` and :py:meth:`~pandas.DataFrame.unstack`.

For datasets containing dask arrays where the data should be lazily loaded, see the
:py:meth:`Dataset.to_dask_dataframe() <xarray.Dataset.to_dask_dataframe>` method.

To create a ``Dataset`` from a ``DataFrame``, use the
:py:meth:`~xarray.Dataset.from_dataframe` class method or the equivalent
:py:meth:`pandas.DataFrame.to_xarray <DataFrame.to_xarray>` method (pandas
5 changes: 5 additions & 0 deletions doc/whats-new.rst
@@ -186,6 +186,11 @@ Enhancements
functions on data stored as dask arrays (:issue:`1279`).
By `Joe Hamman <https://github.com/jhamman>`_.

- Added new method :py:meth:`~Dataset.to_dask_dataframe` to
``Dataset``, which converts a dataset into a dask dataframe.
This allows lazy loading of data from a dataset containing dask arrays (:issue:`1462`).
By `James Munroe <https://github.com/jmunroe>`_.

- Support reading and writing unlimited dimensions with h5netcdf (:issue:`1636`).
By `Joe Hamman <https://github.com/jhamman>`_.

60 changes: 60 additions & 0 deletions xarray/core/dataset.py
@@ -2416,6 +2416,66 @@ def from_dataframe(cls, dataframe):
obj[name] = (dims, data)
return obj

def to_dask_dataframe(self, set_index=False):
"""
Convert this dataset into a dask.dataframe.DataFrame.

Both the coordinate and data variables in this dataset form
the columns of the DataFrame.

If set_index=True, the dask DataFrame is indexed by this dataset's
coordinate. Since dask DataFrames do not support multi-indexes,
set_index only works if there is one coordinate dimension.
"""

import dask.dataframe as dd

ordered_dims = self.dims
Member commented:

We should probably add that dims_order keyword argument. Then this becomes something like:

if dims_order is None:
    dims_order = self.dims
ordered_dims = OrderedDict((k, self.dims[k]) for k in dims_order)
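
A self-contained illustration of that reordering idea (editor's sketch; ``dims`` and ``dims_order`` are hypothetical stand-ins following the reviewer's suggestion, not merged code):

    from collections import OrderedDict

    dims = {'x': 2, 'y': 3}      # stand-in for self.dims
    dims_order = ['y', 'x']      # caller-requested dimension order

    # rebuild the mapping in the requested order, for use downstream
    ordered_dims = OrderedDict((k, dims[k]) for k in dims_order)
    assert list(ordered_dims) == ['y', 'x']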

chunks = self.chunks

# order columns so that coordinates appear before data
columns = list(self.coords) + list(self.data_vars)

data = []
for k in columns:
v = self._variables[k]

# consider coordinate variables as well as data variables
Member commented:

This is a good place to mention in a comment your discovery that we need to convert to base variables in order for chunk() to work properly.

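# per review: convert to a base Variable so that chunk() below works properly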
if isinstance(v, xr.IndexVariable):
v = v.to_base_variable()

# ensure all variables span the same dimensions
v = v.set_dims(ordered_dims)

# ensure all variables have the same chunking structure
if v.chunks != chunks:
v = v.chunk(chunks)

# reshape variable contents as a 1d array
Member commented:

Nit: some of these comments are probably slightly overboard -- if they simply restate what's in the code it's better to omit them.

d = v.data.reshape(-1)

# convert to dask DataFrames
s = dd.from_array(d, columns=[k])

data.append(s)

df = dd.concat(data, axis=1)

if set_index:

if len(ordered_dims) != 1:
raise ValueError(
'set_index=True is only valid '
'for one-dimensional datasets')
Member commented:

Can you include the list of multiple dimensions in the error message?
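
One possible wording along those lines (an editor's sketch, not the text that was merged):

    raise ValueError(
        'set_index=True is only valid for one-dimensional datasets; '
        'got dimensions %r' % list(ordered_dims))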


# extract out first (and only) coordinate variable
coord_dim = list(ordered_dims)[0]

if coord_dim in df.columns:
df = df.set_index(coord_dim)

return df

def to_dict(self):
"""
Convert this dataset to a dictionary following xarray naming
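For context, a small end-to-end sketch of the new method as implemented in this diff (editor's addition; the dataset and variable names are illustrative, and dask must be installed):

    import numpy as np
    import xarray as xr
    import dask.array as da

    ds = xr.Dataset({'a': ('t', da.arange(6, chunks=3))},
                    coords={'t': np.arange(6)})
    df = ds.to_dask_dataframe()  # lazy; columns are 't' then 'a'
    print(df.compute())          # materializes a pandas DataFrame
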
6 changes: 6 additions & 0 deletions xarray/tests/__init__.py
@@ -16,6 +16,12 @@
from xarray.core.pycompat import PY3
from xarray.testing import assert_equal, assert_identical, assert_allclose

try:
from pandas.testing import assert_frame_equal
except ImportError:
# old location, for pandas < 0.20
from pandas.util.testing import assert_frame_equal

try:
import unittest2 as unittest
except ImportError:
103 changes: 99 additions & 4 deletions xarray/tests/test_dask.py
@@ -13,13 +13,14 @@
import xarray as xr
from xarray import Variable, DataArray, Dataset
import xarray.ufuncs as xu
-from xarray.core.pycompat import suppress
-from . import TestCase
+from xarray.core.pycompat import suppress, OrderedDict
+from . import TestCase, assert_frame_equal

from xarray.tests import mock

dask = pytest.importorskip('dask')
import dask.array as da
import dask.dataframe as dd


class DaskTestCase(TestCase):
@@ -29,9 +30,9 @@ def assertLazyAnd(self, expected, actual, test):
if isinstance(actual, Dataset):
for k, v in actual.variables.items():
if k in actual.dims:
-self.assertIsInstance(var.data, np.ndarray)
+self.assertIsInstance(v.data, np.ndarray)
else:
-self.assertIsInstance(var.data, da.Array)
+self.assertIsInstance(v.data, da.Array)
elif isinstance(actual, DataArray):
self.assertIsInstance(actual.data, da.Array)
for k, v in actual.coords.items():
@@ -546,6 +547,100 @@ def test_from_dask_variable(self):
coords={'x': range(4)}, name='foo')
self.assertLazyAndIdentical(self.lazy_array, a)

def test_to_dask_dataframe(self):
Member commented:

It would be appreciated if you could break this up into a few more sub-methods. We don't always follow this well currently, but smaller tests that each check only one thing are easier to work with.

There's no strict line limit, but aim for less than 10-20 lines if possible. Another good time to break a test into parts is when you have different input data.

Contributor (author) replied:
No problem. Done.

# Test conversion of Datasets to dask DataFrames
x = da.from_array(np.random.randn(10), chunks=4)
y = np.arange(10, dtype='uint8')
t = list('abcdefghij')

ds = Dataset(OrderedDict([('a', ('t', x)),
('b', ('t', y)),
('t', ('t', t))]))

expected_pd = pd.DataFrame({'a': x,
'b': y},
index=pd.Index(t, name='t'))

# test if 1-D index is correctly set up
expected = dd.from_pandas(expected_pd, chunksize=4)
actual = ds.to_dask_dataframe(set_index=True)
# test if we have dask dataframes
self.assertIsInstance(actual, dd.DataFrame)

# use pandas' assert_frame_equal to check that the dataframes are equivalent
assert_frame_equal(expected.compute(), actual.compute())

# test if no index is given
expected = dd.from_pandas(expected_pd.reset_index(drop=False),
chunksize=4)

actual = ds.to_dask_dataframe(set_index=False)

self.assertIsInstance(actual, dd.DataFrame)
assert_frame_equal(expected.compute(), actual.compute())

def test_to_dask_dataframe_2D(self):
# Test if 2-D dataset is supplied
w = da.from_array(np.random.randn(2, 3), chunks=(1, 2))
ds = Dataset({'w': (('x', 'y'), w)})
ds['x'] = ('x', np.array([0, 1], np.int64))
ds['y'] = ('y', list('abc'))

# dask dataframes do not (yet) support multiindex,
# but when it does, this would be the expected index:
exp_index = pd.MultiIndex.from_arrays(
[[0, 0, 0, 1, 1, 1], ['a', 'b', 'c', 'a', 'b', 'c']],
names=['x', 'y'])
expected = pd.DataFrame({'w': w.reshape(-1)},
index=exp_index)
# so for now, reset the index
expected = expected.reset_index(drop=False)

actual = ds.to_dask_dataframe(set_index=False)

self.assertIsInstance(actual, dd.DataFrame)
assert_frame_equal(expected, actual.compute())

def test_to_dask_dataframe_coordinates(self):
# Test if coordinate is also a dask array
x = da.from_array(np.random.randn(10), chunks=4)
t = da.from_array(np.arange(10)*2, chunks=4)

ds = Dataset(OrderedDict([('a', ('t', x)),
('t', ('t', t))]))

expected_pd = pd.DataFrame({'a': x},
index=pd.Index(t, name='t'))
expected = dd.from_pandas(expected_pd, chunksize=4)
actual = ds.to_dask_dataframe(set_index=True)
self.assertIsInstance(actual, dd.DataFrame)
assert_frame_equal(expected.compute(), actual.compute())

def test_to_dask_dataframe_not_daskarray(self):
# Test if DataArray is not a dask array
x = np.random.randn(10)
y = np.arange(10, dtype='uint8')
t = list('abcdefghij')

ds = Dataset(OrderedDict([('a', ('t', x)),
('b', ('t', y)),
('t', ('t', t))]))

expected = pd.DataFrame({'a': x, 'b': y},
index=pd.Index(t, name='t'))

actual = ds.to_dask_dataframe(set_index=True)
self.assertIsInstance(actual, dd.DataFrame)
assert_frame_equal(expected, actual.compute())

def test_to_dask_dataframe_no_coordinate(self):
# Test if Dataset has a dimension without coordinates
x = da.from_array(np.random.randn(10), chunks=4)
ds = Dataset({'x': ('dim_0', x)})
expected = pd.DataFrame({'x': x.compute()})
actual = ds.to_dask_dataframe(set_index=True)
assert_frame_equal(expected, actual.compute())
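
As a usage note (editor's addition), the new tests can be run on their own with pytest's keyword filter:

    pytest xarray/tests/test_dask.py -k to_dask_dataframe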


@pytest.mark.parametrize("method", ['load', 'compute'])
def test_dask_kwargs_variable(method):