What is your issue?
Copied from xarray-contrib/datatree#152
Issue
Hi,
I'm very excited about this package and I'm just familiarising myself with it to see where it fits my use cases. I followed the example in the documentation to apply a groupby to the datatree. However, I used dask because my dataset is too large to fit into memory. I realised that my group_by function is not being applied to lazily loaded dask arrays.
Minimal example
import datatree
import xarray as xr
import pandas as pd
import dask.array
import numpy as np
def group_by(da, groupby_type="time.floor('1D')"):
    gb = da.groupby(groupby_type)
    mean = gb.mean()
    return mean
times = pd.date_range("2022-09-01","2022-09-03", freq="6H")
a=xr.Dataset({'x': ('time', np.random.randint(0,10,len(times)))}, coords={'time':times})
b=xr.Dataset({'x': ('time', dask.array.random.randint(0,10,len(times)))}, coords={'time':times})
dt=datatree.DataTree.from_dict({'first':a, 'second':b})
dt.map_blocks(group_by, kwargs={"groupby_type": "time.day"}, template=dt)
Please compare the results for the eager (a) and lazy (b) loaded datasets below:
DataTree('None', parent=None)
├── DataTree('first')
│       Dimensions:  (day: 3)
│       Coordinates:
│         * day      (day) int64 1 2 3
│       Data variables:
│           x        (day) float64 5.75 7.75 6.0
└── DataTree('second')
        Dimensions:  (time: 9)
        Coordinates:
          * time     (time) datetime64[ns] 2022-09-01 2022-09-01T06:00:00 ... 2022-09-03
        Data variables:
            x        (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>
Any ideas what is going wrong?
This can likely be generalised to any function passed to map_blocks:
def func(da):
    return da.mean('time')
b=xr.Dataset({'x': ('time', dask.array.random.randint(0,10,len(times)))})
dt=datatree.DataTree.from_dict({'second':b})
dt.map_blocks(func, template=dt)
DataTree('None', parent=None)
└── DataTree('second')
        Dimensions:  (time: 9)
        Dimensions without coordinates: time
        Data variables:
            x        (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>
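For comparison, here is a minimal sketch of the same reduction with plain xarray and no datatree. It uses fixed values instead of random data so the result can be checked. It suggests the behaviour may come from the template argument rather than from datatree itself: the lazy result takes its structure from the template, so passing the input as the template keeps the time dimension. That datatree forwards the template to each node in the same way is an assumption, not something confirmed here.

```python
import dask.array
import numpy as np
import xarray as xr


def func(ds):
    # Reduce over the 'time' dimension, as in the example above.
    return ds.mean("time")


# Assumption: deterministic values in a single dask chunk, replacing the
# random data from the report.
values = dask.array.from_array(np.arange(9), chunks=9)
b = xr.Dataset({"x": ("time", values)})

# Template identical to the input: the lazy result is described by the
# template, so it still reports the 'time' dimension.
same_as_input = b.map_blocks(func, template=b)

# No template: xarray infers the output structure by calling func on
# zero-sized stand-in data, so 'time' is gone from the result.
inferred = b.map_blocks(func)

print(dict(same_as_input.sizes))
print(dict(inferred.sizes))
print(float(inferred["x"].compute()))
```

If this carries over to datatree, omitting template, or passing one that already has the reduced shape, might be worth trying for the lazy node.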
Versions
xarray: 2022.6.0
datatree: 0.0.9