map_blocks fails with lazy loaded dask array #9504

Open
@eni-awowale

What is your issue?

Copied from xarray-contrib/datatree#152

Issue

Hi,

I'm very excited about this package and am familiarising myself with it to see where it fits my use cases. I followed the example in the documentation to apply a groupby to the datatree. However, I am using dask because my dataset is too large to fit into memory. I realised that my group_by function is not being applied to lazily loaded dask arrays.

Minimal example

import datatree
import xarray as xr
import pandas as pd
import dask.array  # dask.array is a submodule; a bare `import dask` does not guarantee it is loaded
import numpy as np

def group_by(da, groupby_type="time.floor('1D')"):
    gb = da.groupby(groupby_type)
    mean = gb.mean()
    return mean

times = pd.date_range("2022-09-01","2022-09-03", freq="6H")

a=xr.Dataset({'x': ('time', np.random.randint(0,10,len(times)))}, coords={'time':times})
b=xr.Dataset({'x': ('time', dask.array.random.randint(0,10,len(times)))}, coords={'time':times})
dt=datatree.DataTree.from_dict({'first':a, 'second':b})

dt.map_blocks(group_by, kwargs={"groupby_type": "time.day"}, template=dt)

Please compare the results for the eager (a) and lazy (b) loaded datasets below:

DataTree('None', parent=None)
├── DataTree('first')
│       Dimensions:  (day: 3)
│       Coordinates:
│         * day      (day) int64 1 2 3
│       Data variables:
│           x        (day) float64 5.75 7.75 6.0
└── DataTree('second')
        Dimensions:  (time: 9)
        Coordinates:
          * time     (time) datetime64[ns] 2022-09-01 2022-09-01T06:00:00 ... 2022-09-03
        Data variables:
            x        (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>

Any ideas what is going wrong?

This can likely be generalised for any map_blocks function:

def func(da):
    return da.mean('time')

b=xr.Dataset({'x': ('time', dask.array.random.randint(0,10,len(times)))})
dt=datatree.DataTree.from_dict({'second':b})

dt.map_blocks(func, template=dt)

DataTree('None', parent=None)
└── DataTree('second')
        Dimensions:  (time: 9)
        Dimensions without coordinates: time
        Data variables:
            x        (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>
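
To help narrow this down, here is a minimal sketch in plain xarray, without datatree. It relies on the documented behaviour of `xr.map_blocks`: for objects that are not dask-backed the function is simply called directly, while for dask-backed objects the lazy result takes its structure from `template`. Whether datatree forwards `template` to each node in the same way is an assumption I have not verified. The names `daily_mean`, `lazy`, `eager`, `wrong`, and `fixed` are only for illustration, and the data is a deterministic `arange` instead of random integers.

```python
import dask.array
import numpy as np
import pandas as pd
import xarray as xr


def daily_mean(ds):
    # Same reduction as group_by(..., groupby_type="time.day") above.
    return ds.groupby("time.day").mean()


# Nine 6-hourly timestamps, as in the example above.
times = pd.date_range("2022-09-01", "2022-09-03", periods=9)
lazy = xr.Dataset(
    {"x": ("time", dask.array.arange(len(times), chunks=len(times)))},
    coords={"time": times},
)
eager = lazy.compute()

# Not dask-backed: map_blocks just calls the function, so the reduction shows up.
eager_result = xr.map_blocks(daily_mean, eager, template=eager)

# Dask-backed with a template shaped like the INPUT: the lazy result
# looks like the template, which matches what the 'second' node shows.
wrong = xr.map_blocks(daily_mean, lazy, template=lazy)

# Template shaped like the OUTPUT: the new "day" dimension needs to be
# a single chunk, because the input has a single block.
template = daily_mean(lazy).chunk({"day": -1})
fixed = xr.map_blocks(daily_mean, lazy, template=template)

print(dict(eager_result.sizes))
print(dict(wrong.sizes))
print(dict(fixed.sizes), fixed["x"].values)
```

If this holds for datatree as well, passing `template=dt` (the input tree) would explain why the dask-backed node keeps its `time: 9` layout, and a template describing the expected output for each node might be what is needed.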

Versions

xarray: 2022.6.0
datatree: 0.0.9

Labels: topic-DataTree (related to the implementation of a DataTree class), topic-dask