interpolate_na with limit argument changes size of chunks #2514

Closed

cchwala opened this issue Oct 26, 2018 · 8 comments · Fixed by #4977

Comments

@cchwala
Contributor

cchwala commented Oct 26, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
import xarray as xr
import numpy as np

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
foo = np.sin(np.arange(len(t)))
bar = np.cos(np.arange(len(t)))

foo[1] = np.NaN
bar[2] = np.NaN

ds_test = xr.Dataset(data_vars={'foo': ('time', foo),
                                'bar': ('time', bar)},
                     coords={'time': t}).chunk()

print(ds_test)
print("\n\n### After `.interpolate_na(dim='time')`\n")
print(ds_test.interpolate_na(dim='time'))
print("\n\n### After `.interpolate_na(dim='time', limit=5)`\n")
print(ds_test.interpolate_na(dim='time', limit=5))
print("\n\n### After `.interpolate_na(dim='time', limit=20)`\n")
print(ds_test.interpolate_na(dim='time', limit=20))

Output of the above code. Note the different chunk sizes, depending on the value of limit:

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(745,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(745,)>


### After `.interpolate_na(dim='time')`

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(745,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(745,)>


### After `.interpolate_na(dim='time', limit=5)`

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(3,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(3,)>


### After `.interpolate_na(dim='time', limit=20)`

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(10,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(10,)>

Problem description

Using xarray.DataArray.interpolate_na() with the limit kwarg changes the chunk sizes of the resulting dask arrays.

Expected Output

The chunk size should not change. The very small chunks that result from typical small values of limit are not optimal for dask performance. Also, operations like .rolling() will fail if the chunk size is smaller than the length of the rolling window. One possible workaround until this is fixed is to rechunk after interpolation; see the sketch below.
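A minimal workaround sketch (not a fix; it simply restores a single chunk along 'time' after interpolation so that e.g. .rolling() works as expected again):

# Hypothetical workaround: rechunk back along 'time' after interpolate_na.
filled = ds_test.interpolate_na(dim='time', limit=5).chunk({'time': len(ds_test.time)})
print(filled['foo'].chunks)  # ((745,),) again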

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.3
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.4.1
h5netcdf: 0.5.0
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: 1.0.0
dask: 0.19.4
distributed: 1.23.3
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.8.1
setuptools: 38.5.2
pip: 9.0.1
conda: 4.5.11
pytest: 3.4.2
IPython: 5.5.0
sphinx: None

@cchwala cchwala changed the title from "interpolat_na with limit argument changes size of chunks" to "interpolate_na with limit argument changes size of chunks" Oct 26, 2018
@cchwala
Contributor Author

cchwala commented Oct 26, 2018

The problem seems to occur here

def _get_valid_fill_mask(arr, dim, limit):
    '''helper function to determine values that can be filled when limit is not
    None'''
    kw = {dim: limit + 1}
    # we explicitly use construct method to avoid copy.
    new_dim = rolling._get_new_dimname(arr.dims, '_window')
    return (arr.isnull().rolling(min_periods=1, **kw)
            .construct(new_dim, fill_value=False)
            .sum(new_dim, skipna=False)) <= limit

because of the usage of .construct(). A quick try without it shows that the chunk size is then preserved.

Hence, .construct() might need a fix to correctly deal with the chunks of dask arrays.
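A minimal sketch checking that hypothesis (the 'window' dimension name, array length, and window size are arbitrary here):

import numpy as np
import xarray as xr

# single-chunk dask-backed array, mirroring the isnull() input above
arr = xr.DataArray(np.arange(30.0), dims='x').chunk()
masked = (arr.isnull().rolling(x=6, min_periods=1)
          .construct('window', fill_value=False)
          .sum('window', skipna=False)) <= 5
# on affected versions this prints split chunks, e.g. ((3, 27),), not ((30,),)
print(masked.chunks)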

@fujiisoup
Member

Thanks, @cchwala, for reporting the issue.

It looks like the actual chunk sizes are ((10, 735),), not all 10.

In [16]: ds_test.interpolate_na(dim='time', limit=20)['foo'].chunks
Out[16]: ((10, 735),)

(why does our __repr__ only show the first chunk size?)
But it should be ((745,),), as you suggested.

The problem would be in

if pad_size > 0:
    if pad_size < depth[axis]:
        # overlapping requires each chunk larger than depth. If pad_size is
        # smaller than the depth, we enlarge this and truncate it later.
        drop_size = depth[axis] - pad_size
        pad_size = depth[axis]
    shape = list(a.shape)
    shape[axis] = pad_size
    chunks = list(a.chunks)
    chunks[axis] = (pad_size, )
    fill_array = da.full(shape, fill_value, dtype=a.dtype, chunks=chunks)
    a = da.concatenate([fill_array, a], axis=axis)

This method is designed to be used for arrays with multiple chunks, so I did not take care to avoid adding a small chunk at the head.
Do you mind looking into it?
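A dask-only sketch of the effect (numbers chosen to match the example above, with 10 standing in for the enlarged pad_size):

import dask.array as da

a = da.ones(745, chunks=745)        # one chunk, like the example dataset
pad = da.zeros(10, chunks=10)       # stands in for fill_array above
padded = da.concatenate([pad, a])   # chunks become ((10, 745),)
trimmed = padded[:745]              # truncating later leaves ((10, 735),)
print(padded.chunks, trimmed.chunks)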

@cchwala
Contributor Author

cchwala commented Oct 26, 2018

Thanks @fujiisoup for the quick response and the pointers. I will have a look and report back if a PR is within my capabilities or not.

@cchwala
Contributor Author

cchwala commented Oct 26, 2018

EDIT: The issue described in this post has been split out into #2531

I think I have a fix, but I wanted to write some failing tests before committing the changes. While doing so, I discovered that DataArray.rolling() also does not preserve the chunk sizes, apparently depending on the method that is applied.

import pandas as pd
import numpy as np
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
bar = np.sin(np.arange(len(t)))
baz = np.cos(np.arange(len(t)))

da_test = xr.DataArray(data=np.stack([bar, baz]),
                       coords={'time': t,
                               'sensor': ['one', 'two']},
                       dims=('sensor', 'time'))

print(da_test.chunk({'time': 100}).rolling(time=60).mean().chunks)

print(da_test.chunk({'time': 100}).rolling(time=60).count().chunks)
Output for `mean`: ((2,), (745,))
Output for `count`: ((2,), (100, 100, 100, 100, 100, 100, 100, 45))
Desired Output: ((2,), (100, 100, 100, 100, 100, 100, 100, 45))

My fix solves my initial problem, but, if done correctly, it should also solve this bug.

Any idea why this depends on whether .mean() or .count() is used?

I have already pushed some WIP changes. Should I already open a PR even though most new tests still fail?

@dcherian
Contributor

@cchwala Discussion is a lot easier on a PR so go ahead and do that. You can add WIP in the title.

@fujiisoup
Member

fujiisoup commented Oct 26, 2018

Nice catch!

For historical reasons, mean and some other reduction methods use bottleneck by default, while count does not.

mean goes through this function

def dask_rolling_wrapper(moving_func, a, window, min_count=None, axis=-1):

It looks like there is another bug in this function.
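For context, a plain-numpy sketch of the kind of bottleneck call this wrapper delegates to (no dask involved):

import bottleneck as bn
import numpy as np

x = np.arange(10.0)
# trailing moving mean over a window of 3; min_count=1 emits a value as soon
# as one observation is available instead of NaN-padding the window start
print(bn.move_mean(x, window=3, min_count=1))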

@cchwala
Contributor Author

cchwala commented Oct 29, 2018

@dcherian Okay. A WIP PR will follow, but might take some days.

@stale

stale bot commented Oct 2, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
