interpolate_na with limit argument changes size of chunks #2514

Closed

cchwala opened this issue Oct 26, 2018 · 8 comments · Fixed by #4977

Comments

@cchwala
Contributor

cchwala commented Oct 26, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
import xarray as xr
import numpy as np

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
foo = np.sin(np.arange(len(t)))
bar = np.cos(np.arange(len(t)))

foo[1] = np.NaN
bar[2] = np.NaN

ds_test = xr.Dataset(data_vars={'foo': ('time', foo),
                                'bar': ('time', bar)},
                     coords={'time': t}).chunk()

print(ds_test)
print("\n\n### After `.interpolate_na(dim='time')`\n")
print(ds_test.interpolate_na(dim='time'))
print("\n\n### After `.interpolate_na(dim='time', limit=5)`\n")
print(ds_test.interpolate_na(dim='time', limit=5))
print("\n\n### After `.interpolate_na(dim='time', limit=20)`\n")
print(ds_test.interpolate_na(dim='time', limit=20))

Output of the above code. Note the different chunk sizes, depending on the value of limit:

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(745,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(745,)>


### After `.interpolate_na(dim='time')`

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(745,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(745,)>


### After `.interpolate_na(dim='time', limit=5)`

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(3,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(3,)>


### After `.interpolate_na(dim='time', limit=20)`

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(10,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(10,)>

Problem description

Using xarray.DataArray.interpolate_na() with the limit kwarg changes the chunk sizes of the resulting dask arrays.

Expected Output

The chunk size should not change. The very small chunks that result from typical small values of limit are not optimal for dask performance. Also, operations like .rolling() will fail if the chunk size is smaller than the length of the rolling window. One possible workaround until this is fixed is to rechunk after interpolation; see the sketch below.
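A minimal workaround sketch (not a fix; it simply restores a single chunk along 'time' after interpolation so that e.g. .rolling() works as expected again):

# Hypothetical workaround: rechunk back along 'time' after interpolate_na.
filled = ds_test.interpolate_na(dim='time', limit=5).chunk({'time': len(ds_test.time)})
print(filled['foo'].chunks)  # ((745,),) again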

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.3
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.4.1
h5netcdf: 0.5.0
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: 1.0.0
dask: 0.19.4
distributed: 1.23.3
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.8.1
setuptools: 38.5.2
pip: 9.0.1
conda: 4.5.11
pytest: 3.4.2
IPython: 5.5.0
sphinx: None

@cchwala cchwala changed the title from "interpolat_na with limit argument changes size of chunks" to "interpolate_na with limit argument changes size of chunks" Oct 26, 2018
@cchwala
Contributor Author

cchwala commented Oct 26, 2018

The problem seems to occur here

def _get_valid_fill_mask(arr, dim, limit):
    '''helper function to determine values that can be filled when limit is not
    None'''
    kw = {dim: limit + 1}
    # we explicitly use construct method to avoid copy.
    new_dim = rolling._get_new_dimname(arr.dims, '_window')
    return (arr.isnull().rolling(min_periods=1, **kw)
            .construct(new_dim, fill_value=False)
            .sum(new_dim, skipna=False)) <= limit

because of the usage of .construct(). A quick try without it shows that the chunk size is then preserved.

Hence, .construct() might need a fix to correctly deal with the chunks of dask arrays.
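A minimal sketch checking that hypothesis (the 'window' dimension name, array length, and window size are arbitrary here):

import numpy as np
import xarray as xr

# single-chunk dask-backed array, mirroring the isnull() input above
arr = xr.DataArray(np.arange(30.0), dims='x').chunk()
masked = (arr.isnull().rolling(x=6, min_periods=1)
          .construct('window', fill_value=False)
          .sum('window', skipna=False)) <= 5
# on affected versions this prints split chunks, e.g. ((3, 27),), not ((30,),)
print(masked.chunks)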

@fujiisoup
Member

Thanks, @cchwala, for reporting the issue.

It looks like the actual chunk sizes are ((10, 735),), not all 10.

In [16]: ds_test.interpolate_na(dim='time', limit=20)['foo'].chunks
Out[16]: ((10, 735),)

(why does our __repr__ only show the first chunk size?)
But it should be ((745,),), as you suggested.

The problem would be in

if pad_size > 0:
    if pad_size < depth[axis]:
        # overlapping requires each chunk larger than depth. If pad_size is
        # smaller than the depth, we enlarge this and truncate it later.
        drop_size = depth[axis] - pad_size
        pad_size = depth[axis]
    shape = list(a.shape)
    shape[axis] = pad_size
    chunks = list(a.chunks)
    chunks[axis] = (pad_size, )
    fill_array = da.full(shape, fill_value, dtype=a.dtype, chunks=chunks)
    a = da.concatenate([fill_array, a], axis=axis)

This method is designed to be used for arrays with multiple chunks, so I did not take care to avoid adding a small chunk at the head.
Do you mind looking into it?
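A dask-only sketch of the effect (numbers chosen to match the example above, with 10 standing in for the enlarged pad_size):

import dask.array as da

a = da.ones(745, chunks=745)        # one chunk, like the example dataset
pad = da.zeros(10, chunks=10)       # stands in for fill_array above
padded = da.concatenate([pad, a])   # chunks become ((10, 745),)
trimmed = padded[:745]              # truncating later leaves ((10, 735),)
print(padded.chunks, trimmed.chunks)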

@cchwala
Contributor Author

cchwala commented Oct 26, 2018

Thanks @fujiisoup for the quick response and the pointers. I will have a look and report back if a PR is within my capabilities or not.

@cchwala
Contributor Author

cchwala commented Oct 26, 2018

EDIT: The issue described in this post has been split out into #2531

I think I have a fix, but I wanted to write some failing tests before committing the changes. While doing so, I discovered that DataArray.rolling() also does not preserve the chunk sizes, apparently depending on the method that is applied.

import pandas as pd
import numpy as np
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
bar = np.sin(np.arange(len(t)))
baz = np.cos(np.arange(len(t)))

da_test = xr.DataArray(data=np.stack([bar, baz]),
                       coords={'time': t,
                               'sensor': ['one', 'two']},
                       dims=('sensor', 'time'))

print(da_test.chunk({'time': 100}).rolling(time=60).mean().chunks)

print(da_test.chunk({'time': 100}).rolling(time=60).count().chunks)
Output for `mean`: ((2,), (745,))
Output for `count`: ((2,), (100, 100, 100, 100, 100, 100, 100, 45))
Desired Output: ((2,), (100, 100, 100, 100, 100, 100, 100, 45))

My fix solves my initial problem, but, if done correctly, it should also solve this bug.

Any idea why this depends on whether .mean() or .count() is used?

I have already pushed some WIP changes. Should I already open a PR even though most new tests still fail?

@dcherian
Contributor

@cchwala Discussion is a lot easier on a PR so go ahead and do that. You can add WIP in the title.

@fujiisoup
Member

fujiisoup commented Oct 26, 2018

Nice catch!

For historical reasons, mean and some other reduction methods use bottleneck by default, while count does not.

mean goes through this function

def dask_rolling_wrapper(moving_func, a, window, min_count=None, axis=-1):

It looks like there is another bug in this function.
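For context, a plain-numpy sketch of the kind of bottleneck call this wrapper delegates to (no dask involved):

import bottleneck as bn
import numpy as np

x = np.arange(10.0)
# trailing moving mean over a window of 3; min_count=1 emits a value as soon
# as one observation is available instead of NaN-padding the window start
print(bn.move_mean(x, window=3, min_count=1))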

@cchwala
Contributor Author

cchwala commented Oct 29, 2018

@dcherian Okay. A WIP PR will follow, but might take some days.

@stale

stale bot commented Oct 2, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
