sharing dimensions across dataarrays in a dataset #1471

cchrysostomou · 2017-07-07T14:58:18Z

I have two questions regarding proper implementation of an xarray dataset when defining dimensions. First, I am wondering whether I can share the same dimension across multiple arrays in a dataset without storing NaN values for coordinates not present in each respective array.

As a simple example, I am interested in creating two data arrays that involve the shared dimensions x and y; however, in the first data array, I only care about x-coordinates from (0->5) whereas in the second data array I only care about x-coordinates from (10-> 12)

vals1 = np.random.normal(size=(6,2))
vals2 = np.random.normal(size=(3,3))
x1 = xr.Dataset(
    {'table1': (['x', 'y'], vals1)},
    coords={
        'x': np.arange(6),
        'y': np.arange(2)
    }
)
x2 = xr.Dataset(
    {'table2': (['x', 'y'], vals2)},
    coords={
        'x': np.arange(10, 10+3),
        'y': np.arange(8, 8+3)
    }
)

If I naively merge the two datasets, then the dimensions and coordinates get merged correctly but not each of the data variables within the dataset are much larger than they need to be (store unnecessary nan values)

merged = xr.merge([x1,x2])
merged['table1']

<xarray.DataArray 'table1' (x: 9, y: 5)>
array([[ 0.553098, -1.157813,       nan,       nan,       nan],
       [-0.259999, -0.476526,       nan,       nan,       nan],
       [ 1.650893, -0.364517,       nan,       nan,       nan],
       [ 0.16149 , -0.037587,       nan,       nan,       nan],
       [ 0.799689, -0.128728,       nan,       nan,       nan],
       [-0.613603, -1.410235,       nan,       nan,       nan],
       [      nan,       nan,       nan,       nan,       nan],
       [      nan,       nan,       nan,       nan,       nan],
       [      nan,       nan,       nan,       nan,       nan]])
Coordinates:
  * y        (y) int64 0 1 8 9 10
  * x        (x) int64 0 1 2 3 4 5 10 11 12

In my second question, I want to add an extra layer of complexity to this and add a third variable that uses multi-indexing. Again naively, I would have wanted the multi-index in the third table to share dimensions (x and y) from the previous data variables

# I would have preferred to do this
index=pd.MultiIndex.from_tuples([(0, 0, 1), (1, 1, 1), (2, 2, 1)], names=('x', 'y', 'z'))
vals3 = np.random.normal(size=(3,3))
x3 = xr.Dataset(
        {'table3-multiindex': (['multi-index', 'cols'], vals3)},        
        coords={'multi-index': index}
)

# Except, merging with previous dataset raises an error due to name conflicts
xr.merge([x1, x2, x3])

ValueError: conflicting MultiIndex level name(s):
'y' (multi-index), (y)
'x' (multi-index), (x)

Currently my solution is just to rename each of the dimensions in each respective data array so that they do not overlap. While this is not ideal, I can probably get away with this, but since I would prefer the ability to share dimensions without adding in NaN values, is there another way to achieve this? (Im also assuming that I can still do joins later on using values within different dimension names.)

# current solution, merge data arrays but have each dimension be unique
vals1 = np.random.normal(size=(6,2))
vals2 = np.random.normal(size=(3,3))
x1 = xr.Dataset(
    {'table1': (['x1', 'y1'], vals1)},
    coords={
        'x1': np.arange(6),
        'y1': np.arange(2)
    }
)
x2 = xr.Dataset(
    {'table2': (['x2', 'y2'], vals2)},
    coords={
        'x2': np.arange(10, 10+3),
        'y2': np.arange(8, 8+3)
    }
)

index=pd.MultiIndex.from_tuples([(0, 0, 1), (1, 1, 1), (2, 2, 1)], names=('x3', 'y3', 'z3'))
vals3 = np.random.normal(size=(3,3))
x3 = xr.Dataset(
        {'table3-multiindex': (['multi-index', 'cols'], vals3)},        
        coords={'multi-index': index}
)

xr.merge([x1, x2, x3])

The text was updated successfully, but these errors were encountered:

shoyer · 2017-07-07T15:48:05Z

I'm afraid this isn't possible, by design. Every variable in a Dataset sharing the same coordinate system is enforced as part of the xarray data model. This makes data analysis and comparison with a Dataset quite straightforward, since everything is already on the same grid.

For cases where you need different coordinate values and/or dimension sizes, your options are to either rename dimensions for different variables or use multiple Dataset/DataArray objects (Python has nice built-in data structures).

In theory, we could add something like an "UnalignedDataset" that supports most of the Dataset methods without requiring alignment but I'm not sure it's worth the trouble.

smartass101 · 2018-10-16T17:24:42Z

I've hit this design limitation quite often as well, with several use-cases, both in experiment and simulation. It detracts from xarray's power of conveniently and transparently handling coordinate meta-data. From the Why xarray? page:

with xarray, you don’t need to keep track of the order of arrays dimensions or insert dummy dimensions

Adding effectively dummy dimensions or coordinates is essentially what this alignment design is forcing us to do.

A possible solution would be something like having (some) coordinate arrays in an (Unaligned)Dataset being a "reducible" (it would reduce to Index for each Datarray) MultiIndex. A workaround can be using MultiIndex coordinates directly, but then alignment cannot be done easily as levels do not behave as real dimensions.

Use-cases examples:

1. coordinate "metadata"

I often have measurements on related axes, but also with additional coordinates (different positions, etc.) Consider:

import numpy as np
import xarray as xr
n1 = np.arange(1, 22)
m1 = xr.DataArray(n1*0.5, coords={'num': n1, 'B': 'r', 'ar' :'A'}, dims=['num'], name='MA')
n2 = np.arange(2, 21)
m2 = xr.DataArray(n2*0.5, coords={'num': n2, 'B': 'r', 'ar' :'B'}, dims=['num'], name='MB')
ds = xr.merge([m1, m2])
print(ds)

What I would like to get (pseudocode):

<xarray.Dataset>
Dimensions:  (num: 21, ar:2)   # <-- note that MB is still of dims {'num': 19} only
Coordinates:              # <-- mostly unions as done by concat
  * num      (num) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
    B        <U1 'r'
  * ar       <U1 'A' 'B'    # <-- this is now a dim of the dataset, but not of MA or MB
Data variables:
    MA       (num) float64 0.5 1.0 1.5 2.0 2.5 3.0 ... 8.0 8.5 9.0 9.5 10.0 10.5
    MB       (num) float64 1.0 1.5 2.0 2.5 3.0 3.5 ... 7.5 8.0 8.5 9.0 9.5 10.0

Instead I get

MergeError: conflicting values for variable 'ar' on objects to be combined:
first value: <xarray.Variable ()>
array('A', dtype='<U1')
second value: <xarray.Variable ()>
array('B', dtype='<U1')

While it is possible to concat into something with dimensions (num, ar, B), it often results in huge arrays where most values are nan.
I could also store the "position" metadata as attrs, but that pretty much defeats the point of using xarray to have coordinates transparently part of the coordinate metadata. Also, sometimes I would like to select arrays from the dataset from a given location, e.g. Dataset.sel(ar='B').

2. unaligned time domains

This s a large problem especially when different time-bases are involved. A difference in sampling intervals will blow up the storage by a huge number of nan values. Which of course greatly complicates further calculations, e.g. filtering in the time domain. Or just non-overlaping time intervals will require at least double the storage area.

I often find myself resorting rather to pandas.MultiIndex which gladly manages such non-aligned coordinates while still enabling slicing and selection on various levels. So it can be done and the pandas code and functionality already exists.

shoyer · 2018-10-16T19:00:16Z

You can use a pandas.MultiIndex with xarray. The interface/abstraction could be improved and has some rough edges (e.g., see especially #1603), but I think this is the preferred way to support these use cases. It does already work for indexing.

smartass101 · 2018-10-18T09:48:20Z

I indeed often resort to using a pandas.MultiIndex, but especially the dropping of the selected coordinate value (#1408) makes it quite inconvenient.

shoyer · 2018-10-18T15:21:24Z

I'm marking #1408 as a bug so we won't forget about it. Hopefully it should be fixed automatically as part of the "explicit indexes" refactor.

…

On Thu, Oct 18, 2018 at 2:48 AM Ondrej Grover ***@***.***> wrote: I indeed often resort to using a pandas.MultiIndex, but especially the dropping of the selected coordinate value (#1408 <#1408>) makes it quite inconvenient. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1471 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABKS1pnDztKWoTaWEjzPpP6orveOMNWRks5umE6BgaJpZM4ORDdd> .

tommylees112 · 2018-10-29T15:21:34Z

@smartass101 & @shoyer what would be the code for working with a pandas.MultiIndex object in this use case? Could you show how it would work related to your example above:

<xarray.Dataset>
Dimensions:  (num: 21, ar:2)   # <-- note that MB is still of dims {'num': 19} only
Coordinates:              # <-- mostly unions as done by concat
  * num      (num) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
    B        <U1 'r'
  * ar       <U1 'A' 'B'    # <-- this is now a dim of the dataset, but not of MA or MB
Data variables:
    MA       (num) float64 0.5 1.0 1.5 2.0 2.5 3.0 ... 8.0 8.5 9.0 9.5 10.0 10.5
    MB       (num) float64 1.0 1.5 2.0 2.5 3.0 3.5 ... 7.5 8.0 8.5 9.0 9.5 10.0

I am working with land surface model outputs. I have lots of one-dimensional data for different lat/lon points, at different times. I want to join them all into one dataset to make plotting easier. E.g. plot the evapotranspiration estimates for all the stations at their x,y coordinates.

Thanks very much!

zbarry · 2019-08-26T15:00:35Z

I just wanted to chime in as to the usefulness of being able to do something like this without the extra mental overhead being required by the workaround proposed. My use case parallels @smartass101's very closely. Have there been any updates to xarray since last year that might make streamlining this use case a bit more feasible, by any chance? :)

There is no good way of having single-node TPMs in a single Dataset without blowing up memory usage (due to alignment of non-shared singleton dimensions): pydata/xarray#1471 Instead we will implement our own indexing operation to distribute it across the sequence of DataArray nodes. As a bonus, we can implement positional indexing, which xarray doesn't support at the Dataset level.

marcel-goldschen-ohm · 2023-08-15T20:15:42Z

I also would love this feature. Consider a dataset with many 1-D time series recordings across repeated sweeps and stimulus series all stored in one N-D array (e.g., dimensions ['series', 'sweep', 'time']). For one or a small number of these time series there is an artifactual wobble in the baseline that needs removing. It would be great to be able to store these detrended time series in the same dataset as a new array that shares the same time coords, but has a subset of the series and sweep coords. This would require that each DataArray in a Dataset can have its own coords which would supersede the Dataset's coords when defined.

I imagine there are likely many use cases for working with a subset of a dataset along some, but not all dimensions. In all those cases, it would be convenient to house that data within the same dataset rather than having a bunch of datasets with repeated coord arrays for shared dimensions.

Given how stale this discussion is, I imagine this would probably be too much of a headache. But it would be very nice. I am in favor of the idea of a new UnalignedDataset as suggested by @shoyer.

jhamman added the usage question label Jul 17, 2017

pydata locked and limited conversation to collaborators Sep 12, 2023

jhamman converted this issue into discussion #8177 Sep 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

This issue was moved to a discussion.

sharing dimensions across dataarrays in a dataset #1471

sharing dimensions across dataarrays in a dataset #1471

cchrysostomou commented Jul 7, 2017

shoyer commented Jul 7, 2017

Uh oh!

smartass101 commented Oct 16, 2018 •

edited

Loading

Uh oh!

shoyer commented Oct 16, 2018 •

edited

Loading

Uh oh!

smartass101 commented Oct 18, 2018

Uh oh!

shoyer commented Oct 18, 2018 via email

Uh oh!

tommylees112 commented Oct 29, 2018

Uh oh!

zbarry commented Aug 26, 2019

Uh oh!

marcel-goldschen-ohm commented Aug 15, 2023

Uh oh!

This issue was moved to a discussion.

Uh oh!

This issue was moved to a discussion.

sharing dimensions across dataarrays in a dataset #1471

sharing dimensions across dataarrays in a dataset #1471

Comments

cchrysostomou commented Jul 7, 2017

shoyer commented Jul 7, 2017

Uh oh!

smartass101 commented Oct 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Use-cases examples:

1. coordinate "metadata"

2. unaligned time domains

Uh oh!

shoyer commented Oct 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

smartass101 commented Oct 18, 2018

Uh oh!

shoyer commented Oct 18, 2018 via email

Uh oh!

tommylees112 commented Oct 29, 2018

Uh oh!

zbarry commented Aug 26, 2019

Uh oh!

marcel-goldschen-ohm commented Aug 15, 2023

Uh oh!

This issue was moved to a discussion.

smartass101 commented Oct 16, 2018 •

edited

Loading

shoyer commented Oct 16, 2018 •

edited

Loading