This repository was archived by the owner on Oct 24, 2024. It is now read-only.
map_blocks fails with lazy loaded dask array #152
Closed
Description
Hi,
I'm very excited about this package and am just familiarising myself with it to see where it fits my use cases. I followed the example in the documentation to apply a groupby to the datatree. However, I used dask because my dataset is too large to fit into memory, and I realised that my group_by function is not being applied to lazily loaded dask arrays.
Minimal example
import datatree
import xarray as xr
import pandas as pd
import dask
import dask.array  # make sure the dask.array submodule is loaded
import numpy as np

def group_by(da, groupby_type="time.floor('1D')"):
    # group along the requested time component and average each group
    gb = da.groupby(groupby_type)
    mean = gb.mean()
    return mean

times = pd.date_range("2022-09-01", "2022-09-03", freq="6H")
# a: eager (numpy-backed) dataset, b: lazy (dask-backed) dataset
a = xr.Dataset({'x': ('time', np.random.randint(0, 10, len(times)))}, coords={'time': times})
b = xr.Dataset({'x': ('time', dask.array.random.randint(0, 10, len(times)))}, coords={'time': times})
dt = datatree.DataTree.from_dict({'first': a, 'second': b})
dt.map_blocks(group_by, kwargs={"groupby_type": "time.day"}, template=dt)
Please compare the results for the eager (a) and lazy (b) loaded datasets below:
DataTree('None', parent=None)
├── DataTree('first')
│ Dimensions: (day: 3)
│ Coordinates:
│ * day (day) int64 1 2 3
│ Data variables:
│ x (day) float64 5.75 7.75 6.0
└── DataTree('second')
Dimensions: (time: 9)
Coordinates:
* time (time) datetime64[ns] 2022-09-01 2022-09-01T06:00:00 ... 2022-09-03
Data variables:
x (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>
Any ideas what is going wrong?
This can likely be generalised for any map_blocks function:
def func(da):
    return da.mean('time')

b = xr.Dataset({'x': ('time', dask.array.random.randint(0, 10, len(times)))})
dt = datatree.DataTree.from_dict({'second': b})
dt.map_blocks(func, template=dt)
DataTree('None', parent=None)
└── DataTree('second')
Dimensions: (time: 9)
Dimensions without coordinates: time
Data variables:
x (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>
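For comparison, here is a minimal sketch of what I would expect for the lazy node, using plain xarray's Dataset.map_blocks on a dask-backed dataset instead of datatree. It is not a fix or a diagnosis. The names daily_mean, ds and template are my own for illustration, the data is deterministic rather than random so the result can be checked, and building the template by running the function eagerly is an assumption that is only practical for a tiny example like this one. Note that in xarray the template describes the result after the function is applied, so here it has a day dimension rather than time.

```python
import dask.array
import numpy as np
import pandas as pd
import xarray as xr


def daily_mean(ds):
    # Same idea as group_by above: average all samples that share a day of month.
    return ds.groupby("time.day").mean()


# Nine 6-hourly timestamps, matching the (time: 9) shape in the report.
times = pd.date_range("2022-09-01", periods=9, freq="6h")

# Deterministic dask-backed data held in a single chunk along time.
data = dask.array.from_array(np.arange(len(times)), chunks=len(times))
ds = xr.Dataset({"x": ("time", data)}, coords={"time": times})

# Assumption: derive the template by applying the function to the loaded data,
# then re-chunk it so that it contains dask arrays, as map_blocks requires.
template = daily_mean(ds.compute()).chunk()

# Apply the function block-wise and load the result.
result = ds.map_blocks(daily_mean, template=template).compute()
print(result)
```

With plain xarray the result has a day dimension of length 3, which is the shape I expected to see for DataTree('second') as well.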
Versions
xarray: 2022.6.0
datatree: 0.0.9