
xarray, chunking and rolling operation adds chunking along new dimension (previously worked) #3277


Closed
p-d-moore opened this issue Sep 3, 2019 · 5 comments · Fixed by #4977

p-d-moore commented Sep 3, 2019

I was testing the latest version of xarray (0.12.3) from the conda-forge channel and it broke some code I had. Under the default installation without conda-forge (xarray=0.12.1), the following code works correctly and produces the desired output:

Test code

import pandas as pd
import xarray as xr
import numpy as np

s_date = '1990-01-01'
e_date = '2019-05-01'
days = pd.date_range(start=s_date, end=e_date, freq='B', name='day')
items = pd.Index([str(i) for i in range(300)], name = 'item')
dat = xr.DataArray(np.random.rand(len(days), len(items)), coords=[days, items])
dat_chunk = dat.chunk({'item': 20})
dat_mean = dat_chunk.rolling(day=10).mean()

print(dat_chunk)
print(' ')
print(dat_mean)

dat_std_avg = dat_mean.rolling(day=250).std()

print(' ')
print(dat_std_avg)

Output (correct) with xarray=0.12.1 - note the chunksizes

<xarray.DataArray (day: 7653, item: 300)>
dask.array<shape=(7653, 300), dtype=float64, chunksize=(7653, 20)>
Coordinates:
  * day      (day) datetime64[ns] 1990-01-01 1990-01-02 ... 2019-05-01
  * item     (item) object '0' '1' '2' '3' '4' ... '295' '296' '297' '298' '299'
 
<xarray.DataArray '_trim-8c9287bf114d61cb3ad74780465cd19f' (day: 7653, item: 300)>
dask.array<shape=(7653, 300), dtype=float64, chunksize=(7653, 20)>
Coordinates:
  * day      (day) datetime64[ns] 1990-01-01 1990-01-02 ... 2019-05-01
  * item     (item) object '0' '1' '2' '3' '4' ... '295' '296' '297' '298' '299'
 
<xarray.DataArray '_trim-2ee90b6c2f29f71a7798a204a4ad3305' (day: 7653, item: 300)>
dask.array<shape=(7653, 300), dtype=float64, chunksize=(7653, 20)>
Coordinates:
  * day      (day) datetime64[ns] 1990-01-01 1990-01-02 ... 2019-05-01
  * item     (item) object '0' '1' '2' '3' '4' ... '295' '296' '297' '298' '299'

Output (now failing) with xarray=0.12.3 (note the chunksizes)

<xarray.DataArray (day: 7653, item: 300)>
dask.array<shape=(7653, 300), dtype=float64, chunksize=(7653, 20)>
Coordinates:
  * day      (day) datetime64[ns] 1990-01-01 1990-01-02 ... 2019-05-01
  * item     (item) object '0' '1' '2' '3' '4' ... '295' '296' '297' '298' '299'
 
<xarray.DataArray (day: 7653, item: 300)>
dask.array<shape=(7653, 300), dtype=float64, chunksize=(5, 20)>
Coordinates:
  * day      (day) datetime64[ns] 1990-01-01 1990-01-02 ... 2019-05-01
  * item     (item) object '0' '1' '2' '3' '4' ... '295' '296' '297' '298' '299'

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...

ValueError: For window size 250, every chunk should be larger than 125, but the smallest chunk size is 5. Rechunk your array
with a larger chunk size or a chunk size that
more evenly divides the shape of your array.

Problem Description

Using dask + rolling with xarray=0.12.3 appears to add undesirable chunking along a new dimension, which was not the case with xarray=0.12.1. This additional chunking makes the queuing of a further rolling operation fail with a ValueError. This (at the very least) makes queuing dask-based delayed operations difficult when multiple rolling operations are used.
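
To make the regression concrete, here is a minimal check reusing dat_chunk and dat_mean from the test code above (the chunk tuples are the ones shown in the outputs):

# dat_chunk and dat_mean come from the test code above.
# With xarray=0.12.1 the 'day' dimension stays a single chunk after the
# rolling mean; with 0.12.3 it is split.
assert dat_chunk.chunks[0] == (7653,)   # 'day' chunks before rolling
print(dat_mean.chunks[0])               # expected (7653,), observed (5, 7648) on 0.12.3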

Output of xr.show_versions() for the not working version

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.12.3
pandas: 0.25.1
numpy: 1.16.4
scipy: 1.3.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.3.0
distributed: 2.3.2
matplotlib: 3.1.1
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 41.0.1
pip: 19.2.2
conda: 4.7.11
pytest: None
IPython: 7.8.0
sphinx: None

Apologies if this issue has already been reported; I was unable to find a case that appeared equivalent.

dcherian (Contributor) commented Sep 3, 2019

Yikes. Thanks @p-d-moore.

Looks like the rolling mean changes the chunks:

>>> print(dat_chunk.chunks)
((7653,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20))

>>> print(dat_mean.chunks)
((5, 7648), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20))

A workaround is to do dat_mean = dat_mean.chunk({'day': None}) and then calculate the rolling std, as sketched below.
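
A minimal sketch of that workaround, continuing from the test code above and assuming (as in the suggestion) that {'day': None} gives a single chunk along 'day':

# Rechunk the intermediate result so 'day' is one chunk again before the
# second rolling operation.
dat_mean = dat_chunk.rolling(day=10).mean()
dat_mean = dat_mean.chunk({'day': None})        # collapse 'day' back to one chunk
dat_std_avg = dat_mean.rolling(day=250).std()   # no longer raises ValueError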

p-d-moore (Author) commented Sep 3, 2019

Some additional notes:

The bug also appears with xarray=0.12.2 (so I presume it was introduced between 0.12.1 and 0.12.2). Other rolling operations are similarly affected: replacing .mean() in the sample code with .count(), .sum(), .std(), .max(), etc. results in the same erroneous chunking behaviour, as the sketch below shows.
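
A quick sketch to check this across reducers (reusing dat_chunk from the sample code; the reducer names are the ones listed above):

# Each rolling reduction should leave 'day' as a single chunk, but on the
# affected versions the chunking along 'day' gets split.
for reducer in ("count", "sum", "std", "max"):
    result = getattr(dat_chunk.rolling(day=10), reducer)()
    day_chunks = dict(zip(result.dims, result.chunks))["day"]
    print(reducer, day_chunks)   # expected (7653,), but e.g. (5, 7648) when affected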

Another workaround is to downgrade xarray to 0.12.1.

p-d-moore (Author) commented:
Confirming the bug is still present in 0.13.0.

p-d-moore (Author) commented Oct 9, 2019

Using xarray=0.13.0, the chunking behaviour has changed again (but is still incorrect):

<xarray.DataArray (item: 300, day: 7653)>
dask.array<xarray-<this-array>, shape=(300, 7653), dtype=float64, chunksize=(20, 7653), chunktype=numpy.ndarray>
Coordinates:
  * item     (item) object '0' '1' '2' '3' '4' ... '295' '296' '297' '298' '299'
  * day      (day) datetime64[ns] 1990-01-01 1990-01-02 ... 2019-05-01
 
<xarray.DataArray (item: 300, day: 7653)>
dask.array<where, shape=(300, 7653), dtype=float64, chunksize=(20, 7648), chunktype=numpy.ndarray>
Coordinates:
  * item     (item) object '0' '1' '2' '3' '4' ... '295' '296' '297' '298' '299'
  * day      (day) datetime64[ns] 1990-01-01 1990-01-02 ... 2019-05-01
---------------------------------------------------------------------------
ValueError      
...

Note the chunksize is now (20, 7648) instead of (20, 7653).

I believe it might be related to #2942: I think disabling bottleneck for dask arrays in the rolling operation caused the bug above to appear (so the bug may have been there for a while, but did not surface because bottleneck was being used).

Doing a quick trace in rolling.py, I think it is the line windows = self.construct(rolling_dim) in the reduce function that creates windows with incorrect chunking, possibly as a consequence of some transpose operations and a dimension mix-up (see the sketch below).
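
If that hypothesis is right, the constructed window view should already show the unwanted split along 'day'. A sketch to check it directly ('window' is just an arbitrary name for the new window dimension; dat_chunk comes from the sample code):

# rolling(...).construct() is what reduce() calls internally.
windows = dat_chunk.rolling(day=10).construct("window")
print(dict(zip(windows.dims, windows.chunks))["day"])   # expected a single chunk of 7653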

It seems strange that other applications aren't having problems with this, unless I am doing something different in my code. Note that I am very specifically chunking along a different dimension to the one the rolling operates over.

p-d-moore (Author) commented:
The error is still present in 0.14.0.

I believe the bug occurs in dask_array_ops.py: rolling_window.

My best guess at understanding the code: I believe there is an attempt to "pad" rolling windows to ensure the rolling window doesn't miss data across chunk boundaries. I think the "padding" is supposed to be truncated later, but something is miscalculated and the final array ends up with the wrong chunking.

In the case I presented, the "chunking" happens along a different dimension to the "rolling", so padding is not necessary. Perhaps something goes haywire because the code was written to guard against rolling along a chunked dimension (and missing data across chunk boundaries)? Additionally, since the padding is not necessary in this case, it introduces a performance penalty that could be avoided.

A simple fix for my case is to skip the "padding" whenever the chunk size along the rolling dimension is equal to the array size along that dimension.

e.g. in the function rolling_window in dask_array_ops.py, the line

pad_size = max(start, end) + offset - depth[axis]

becomes

if a.shape[axis] == a.chunksize[axis]:
    # the rolling dimension is a single chunk, so no boundary padding is needed
    pad_size = 0
else:
    pad_size = max(start, end) + offset - depth[axis]

This fixes the code for my use case; perhaps someone could advise whether I have understood the issue correctly?
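
A sketch of the end-to-end check I would use to verify a fix (names from the test code at the top; the expected chunking is the 0.12.1 output shown above):

# After the fix, chaining two rolling operations along the un-chunked 'day'
# dimension should work and keep 'day' as a single chunk.
dat_std_avg = dat_chunk.rolling(day=10).mean().rolling(day=250).std()
assert dict(zip(dat_std_avg.dims, dat_std_avg.chunks))["day"] == (7653,)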
