DataArray.rolling() does not preserve chunksizes in some cases #2531

Closed
cchwala opened this issue Oct 31, 2018 · 2 comments · Fixed by #4977

Comments

@cchwala
Contributor

cchwala commented Oct 31, 2018

This issue was found and discussed in the related issue #2514.

I am opening a separate issue for clarity.

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
bar = np.sin(np.arange(len(t)))
baz = np.cos(np.arange(len(t)))

da_test = xr.DataArray(data=np.stack([bar, baz]),
                       coords={'time': t,
                               'sensor': ['one', 'two']},
                       dims=('sensor', 'time'))

print(da_test.chunk({'time': 100}).rolling(time=60).mean().chunks)

print(da_test.chunk({'time': 100}).rolling(time=60).count().chunks)
Output for `mean`: ((2,), (745,))
Output for `count`: ((2,), (100, 100, 100, 100, 100, 100, 100, 45))
Desired Output: ((2,), (100, 100, 100, 100, 100, 100, 100, 45))

Problem description

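DataArray.rolling() does not always preserve chunk sizes; whether it does apparently depends on the reduction applied. In the example above, mean collapses the time dimension into a single chunk of 745, while count keeps the original chunks.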

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.3
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.4.1
h5netcdf: 0.5.0
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: 1.0.0
dask: 0.19.4
distributed: 1.23.3
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.8.1
setuptools: 38.5.2
pip: 9.0.1
conda: 4.5.11
pytest: 3.4.2
IPython: 5.5.0
sphinx: None

@cchwala
Contributor Author

cchwala commented Oct 31, 2018

The cause has been explained by @fujiisoup here #2514 (comment)

Nice catch!

For historical reasons, mean and some other reduction methods use bottleneck by default, while count does not.

mean goes through this function

xarray/xarray/core/dask_array_ops.py

Line 23 in b622c5e
def dask_rolling_wrapper(moving_func, a, window, min_count=None, axis=-1):

It looks like there is another bug in this function.
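For readers unfamiliar with why a rolling reduction can keep its input chunks at all, here is a minimal plain-Python sketch of the general idea: each chunk borrows a "halo" of up to window - 1 trailing values from the data before it, is reduced on its own, and the halo positions are then trimmed off. This is not the code of dask_rolling_wrapper or of dask itself; the function names rolling_mean and chunked_rolling_mean are hypothetical and exist only for illustration.

```python
import math


def rolling_mean(values, window):
    """Trailing rolling mean; positions with fewer than `window` points are NaN."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def chunked_rolling_mean(chunks, window):
    """Reduce chunk by chunk, so every output chunk keeps its input length.

    Illustrative only: the halo carries the last `window - 1` values seen
    so far, which is all a trailing window needs from earlier chunks.
    """
    halo = []
    result = []
    for chunk in chunks:
        padded = halo + list(chunk)
        rolled = rolling_mean(padded, window)
        # Drop the positions that belong to the halo, keep the chunk itself.
        result.append(rolled[len(halo):])
        halo = padded[-(window - 1):] if window > 1 else []
    return result


# Same layout as the example in this issue: 745 points in chunks of 100.
data = [math.sin(i) for i in range(745)]
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]
rolled = chunked_rolling_mean(chunks, window=60)
print(tuple(len(c) for c in rolled))
# (100, 100, 100, 100, 100, 100, 100, 45)
```

The output chunk lengths match the desired output reported above, which is what one would expect from any overlap-and-trim approach; whether the bottleneck code path does this correctly is exactly what the issue is about.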

@mangecoeur
Contributor

mangecoeur commented Feb 4, 2019

Perhaps related: I was running into MemoryErrors with a large array and also noticed that chunk sizes were not respected (xarray basically tried to process the array in one go). It turned out that I had forgotten to install both bottleneck and numexpr. After installing both (installing bottleneck alone was not enough), everything worked as expected.
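A quick way to rule out the situation described in this comment is to check whether the two packages are importable in the active environment. The snippet below is a small sketch using only the standard library; the helper name missing_optional_packages is hypothetical, and finding both packages only shows that they are installed, not that xarray will use them for a given operation.

```python
import importlib.util


def missing_optional_packages(names=("bottleneck", "numexpr")):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both packages are installed in this environment.
print(missing_optional_packages())
```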
