This repository was archived by the owner on Oct 24, 2024. It is now read-only.

map_blocks fails with lazy loaded dask array #152

Closed
@observingClouds

Hi,

I'm very excited about this package and am familiarising myself with it to see where it fits my use cases. I followed the example in the documentation to apply a groupby to a DataTree, but with dask-backed data, because my dataset is too large to fit into memory. I noticed that my group_by function is not applied to lazily loaded dask arrays.

Minimal example

import datatree
import xarray as xr
import pandas as pd
import dask.array  # a bare "import dask" does not guarantee that dask.array is loaded
import numpy as np

def group_by(da, groupby_type="time.day"):
    # Group along the given key (e.g. day of month) and average each group.
    gb = da.groupby(groupby_type)
    mean = gb.mean()
    return mean

times = pd.date_range("2022-09-01","2022-09-03", freq="6H")

# a is eagerly loaded (NumPy-backed), b is lazily loaded (dask-backed)
a = xr.Dataset({'x': ('time', np.random.randint(0, 10, len(times)))}, coords={'time': times})
b = xr.Dataset({'x': ('time', dask.array.random.randint(0, 10, len(times)))}, coords={'time': times})
dt = datatree.DataTree.from_dict({'first': a, 'second': b})

dt.map_blocks(group_by, kwargs={"groupby_type": "time.day"}, template=dt)

Please compare the results for the eagerly loaded (a) and lazily loaded (b) datasets below:

DataTree('None', parent=None)
├── DataTree('first')
│       Dimensions:  (day: 3)
│       Coordinates:
│         * day      (day) int64 1 2 3
│       Data variables:
│           x        (day) float64 5.75 7.75 6.0
└── DataTree('second')
        Dimensions:  (time: 9)
        Coordinates:
          * time     (time) datetime64[ns] 2022-09-01 2022-09-01T06:00:00 ... 2022-09-03
        Data variables:
            x        (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>

Any ideas what is going wrong?

This can likely be generalised to any function passed to map_blocks:

def func(da):
    return da.mean('time')

b=xr.Dataset({'x': ('time', dask.array.random.randint(0,10,len(times)))})
dt=datatree.DataTree.from_dict({'second':b})

dt.map_blocks(func, template=dt)
DataTree('None', parent=None)
└── DataTree('second')
        Dimensions:  (time: 9)
        Dimensions without coordinates: time
        Data variables:
            x        (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>
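
For comparison, here is a minimal sketch that shows the same pattern with plain xarray and no datatree. It assumes, and this is not confirmed in this thread, that DataTree.map_blocks forwards to Dataset.map_blocks on each node, with that node's dataset used as the template. If so, the difference may come from the template argument rather than from the function being skipped. For NumPy-backed input, xarray calls the function immediately. For dask-backed input, it builds the lazy result from the template, which the xarray documentation describes as representing the final result after compute.

```python
import dask.array
import numpy as np
import xarray as xr


def func(ds):
    # Reduce over time, as in the second example above.
    return ds.mean("time")


# Same values, once as a NumPy array and once as a single-chunk dask array.
eager = xr.Dataset({"x": ("time", np.arange(9))})
lazy = xr.Dataset({"x": ("time", dask.array.arange(9, chunks=9))})

# NumPy-backed input: xarray calls func right away; the template is not used.
eager_result = eager.map_blocks(func, template=eager)

# Dask-backed input: the returned lazy object is built from the template,
# so passing the input itself as the template keeps the "time" dimension.
lazy_result = lazy.map_blocks(func, template=lazy)

print(dict(eager_result.sizes))
print(dict(lazy_result.sizes))
```

If this is indeed the cause, a template shaped like the function's output, rather than like its input, might be what map_blocks needs here.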

Versions

xarray: 2022.6.0
datatree: 0.0.9
