
sharing dimensions across dataarrays in a dataset #1471


Closed

cchrysostomou opened this issue Jul 7, 2017 · 8 comments

@cchrysostomou

I have two questions regarding the proper way to define dimensions in an xarray Dataset. First, I am wondering whether I can share the same dimension across multiple arrays in a dataset without storing NaN values for coordinates that are not present in each respective array.

As a simple example, I am interested in creating two data arrays that involve the shared dimensions x and y; however, in the first data array I only care about x-coordinates 0 through 5, whereas in the second data array I only care about x-coordinates 10 through 12:

import numpy as np
import xarray as xr

vals1 = np.random.normal(size=(6,2))
vals2 = np.random.normal(size=(3,3))
x1 = xr.Dataset(
    {'table1': (['x', 'y'], vals1)},
    coords={
        'x': np.arange(6),
        'y': np.arange(2)
    }
)
x2 = xr.Dataset(
    {'table2': (['x', 'y'], vals2)},
    coords={
        'x': np.arange(10, 10+3),
        'y': np.arange(8, 8+3)
    }
)

If I naively merge the two datasets, the dimensions and coordinates get merged correctly, but each of the data variables within the dataset ends up much larger than it needs to be, storing unnecessary NaN values:

merged = xr.merge([x1,x2])
merged['table1']

<xarray.DataArray 'table1' (x: 9, y: 5)>
array([[ 0.553098, -1.157813,       nan,       nan,       nan],
       [-0.259999, -0.476526,       nan,       nan,       nan],
       [ 1.650893, -0.364517,       nan,       nan,       nan],
       [ 0.16149 , -0.037587,       nan,       nan,       nan],
       [ 0.799689, -0.128728,       nan,       nan,       nan],
       [-0.613603, -1.410235,       nan,       nan,       nan],
       [      nan,       nan,       nan,       nan,       nan],
       [      nan,       nan,       nan,       nan,       nan],
       [      nan,       nan,       nan,       nan,       nan]])
Coordinates:
  * y        (y) int64 0 1 8 9 10
  * x        (x) int64 0 1 2 3 4 5 10 11 12
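(I can drop the padding again afterwards with dropna, but the fully padded intermediate still gets allocated:

recovered = merged['table1'].dropna('x', how='all').dropna('y', how='all')
recovered.shape  # back to (6, 2)

)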

For my second question, I want to add an extra layer of complexity: a third variable that uses multi-indexing. Again, naively, I would have wanted the multi-index in the third table to share the dimensions (x and y) of the previous data variables:

# I would have preferred to do this
import pandas as pd

index = pd.MultiIndex.from_tuples([(0, 0, 1), (1, 1, 1), (2, 2, 1)], names=('x', 'y', 'z'))
vals3 = np.random.normal(size=(3,3))
x3 = xr.Dataset(
        {'table3-multiindex': (['multi-index', 'cols'], vals3)},        
        coords={'multi-index': index}
)

# Except, merging with previous dataset raises an error due to name conflicts
xr.merge([x1, x2, x3])

ValueError: conflicting MultiIndex level name(s):
'y' (multi-index), (y)
'x' (multi-index), (x)

Currently my solution is just to rename the dimensions in each respective data array so that they do not overlap. While this is not ideal, I can probably get away with it, but since I would prefer to share dimensions without adding NaN values, is there another way to achieve this? (I'm also assuming that I can still do joins later on using values within the differently named dimensions, as sketched after the code below.)

# current solution, merge data arrays but have each dimension be unique
vals1 = np.random.normal(size=(6,2))
vals2 = np.random.normal(size=(3,3))
x1 = xr.Dataset(
    {'table1': (['x1', 'y1'], vals1)},
    coords={
        'x1': np.arange(6),
        'y1': np.arange(2)
    }
)
x2 = xr.Dataset(
    {'table2': (['x2', 'y2'], vals2)},
    coords={
        'x2': np.arange(10, 10+3),
        'y2': np.arange(8, 8+3)
    }
)

index = pd.MultiIndex.from_tuples([(0, 0, 1), (1, 1, 1), (2, 2, 1)], names=('x3', 'y3', 'z3'))
vals3 = np.random.normal(size=(3,3))
x3 = xr.Dataset(
        {'table3-multiindex': (['multi-index', 'cols'], vals3)},        
        coords={'multi-index': index}
)

xr.merge([x1, x2, x3])
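When I later need to compare two tables on a common grid, I'm assuming I can just rename back on the fly, something like this sketch:

ds = xr.merge([x1, x2, x3])
t1 = ds['table1'].rename({'x1': 'x', 'y1': 'y'})
t2 = ds['table2'].rename({'x2': 'x', 'y2': 'y'})
# align re-introduces the NaN padding, but only transiently
t1_aligned, t2_aligned = xr.align(t1, t2, join='outer')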
@shoyer (Member) commented Jul 7, 2017

I'm afraid this isn't possible, by design: the xarray data model enforces that every variable in a Dataset shares the same coordinate system. This makes data analysis and comparison within a Dataset quite straightforward, since everything is already on the same grid.

For cases where you need different coordinate values and/or dimension sizes, your options are to either rename dimensions for different variables or use multiple Dataset/DataArray objects (Python has nice built-in data structures).
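For example, a plain dict is a perfectly good container here (just a sketch):

# one Dataset per grid, keyed however you like -- no forced alignment
tables = {'table1': x1, 'table2': x2}
for name, ds in tables.items():
    print(name, dict(ds.sizes))  # each keeps its own dims and coords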

In theory, we could add something like an "UnalignedDataset" that supports most of the Dataset methods without requiring alignment, but I'm not sure it's worth the trouble.

@smartass101 commented Oct 16, 2018

I've hit this design limitation quite often as well, with several use cases in both experiment and simulation. It detracts from xarray's power of conveniently and transparently handling coordinate metadata. From the Why xarray? page:

with xarray, you don’t need to keep track of the order of arrays dimensions or insert dummy dimensions

Adding effectively dummy dimensions or coordinates is essentially what this alignment design forces us to do.

A possible solution would be for (some) coordinate arrays in an (Unaligned)Dataset to be a "reducible" MultiIndex, one that reduces to a plain Index for each DataArray. A workaround is to use MultiIndex coordinates directly, but then alignment cannot be done easily, since levels do not behave as real dimensions.

Example use cases:

1. coordinate "metadata"

I often have measurements on related axes, but also with additional coordinates (different positions, etc.). Consider:

import numpy as np
import xarray as xr
n1 = np.arange(1, 22)
m1 = xr.DataArray(n1*0.5, coords={'num': n1, 'B': 'r', 'ar': 'A'}, dims=['num'], name='MA')
n2 = np.arange(2, 21)
m2 = xr.DataArray(n2*0.5, coords={'num': n2, 'B': 'r', 'ar': 'B'}, dims=['num'], name='MB')
ds = xr.merge([m1, m2])
print(ds)

What I would like to get (pseudocode):

<xarray.Dataset>
Dimensions:  (num: 21, ar: 2)   # <-- note that MB is still of dims {'num': 19} only
Coordinates:              # <-- mostly unions as done by concat
  * num      (num) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
    B        <U1 'r'
  * ar       <U1 'A' 'B'    # <-- this is now a dim of the dataset, but not of MA or MB
Data variables:
    MA       (num) float64 0.5 1.0 1.5 2.0 2.5 3.0 ... 8.0 8.5 9.0 9.5 10.0 10.5
    MB       (num) float64 1.0 1.5 2.0 2.5 3.0 3.5 ... 7.5 8.0 8.5 9.0 9.5 10.0

Instead I get

MergeError: conflicting values for variable 'ar' on objects to be combined:
first value: <xarray.Variable ()>
array('A', dtype='<U1')
second value: <xarray.Variable ()>
array('B', dtype='<U1')

While it is possible to concat into something with dimensions (num, ar, B), it often results in huge arrays where most values are NaN.
I could also store the "position" metadata as attrs, but that pretty much defeats the point of using xarray, namely having positions transparently be part of the coordinate metadata. Also, sometimes I would like to select arrays from the dataset at a given location, e.g. Dataset.sel(ar='B').
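Concretely, the concat workaround I mean is:

# concat pads the data out to (ar: 2, num: 21) with NaN
# (the arrays get a common name so they concatenate into one variable):
both = xr.concat([m1.rename('M'), m2.rename('M')], dim='ar')
both.sel(ar='B').dropna('num')  # recovers MB, minus the padding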

2. unaligned time domains

This is a large problem, especially when different time bases are involved. A difference in sampling intervals will blow up the storage with a huge number of NaN values, which of course greatly complicates further calculations, e.g. filtering in the time domain. Even just non-overlapping time intervals will require at least double the storage.

I often find myself resorting to pandas.MultiIndex instead, which gladly manages such non-aligned coordinates while still enabling slicing and selection on the various levels. So it can be done, and the pandas code and functionality already exist.
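For comparison, the pandas version of use case 1 needs no padding at all:

import pandas as pd
s1 = pd.Series(n1 * 0.5, index=pd.Index(n1, name='num'))
s2 = pd.Series(n2 * 0.5, index=pd.Index(n2, name='num'))
s = pd.concat({'A': s1, 'B': s2}, names=['ar'])  # MultiIndex (ar, num), 40 values total
s.xs('B', level='ar')                            # slice MB back out, no NaNs anywhere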

@shoyer (Member) commented Oct 16, 2018

You can use a pandas.MultiIndex with xarray. The interface/abstraction could be improved and has some rough edges (see especially #1603), but I think this is the preferred way to support these use cases. It does already work for indexing.
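For example, a rough sketch (the construction details have shifted across versions; recent xarray builds the coordinate via xr.Coordinates.from_pandas_multiindex rather than accepting the MultiIndex directly):

import numpy as np
import pandas as pd
import xarray as xr

idx = pd.MultiIndex.from_product([['A', 'B'], [1, 2, 3]], names=('ar', 'num'))
da = xr.DataArray(np.arange(6.0), coords={'sample': idx}, dims='sample')
da.sel(ar='B')         # select on one level (the 'ar' coordinate is dropped from the result)
da.sel(ar='B', num=2)  # or on several levels at once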

@smartass101

I indeed often resort to using a pandas.MultiIndex, but the dropping of the selected coordinate value (#1408) in particular makes it quite inconvenient.

@shoyer (Member) commented Oct 18, 2018 via email

@tommylees112

@smartass101 & @shoyer what would be the code for working with a pandas.MultiIndex object in this use case? Could you show how it would work for your example above:

<xarray.Dataset>
Dimensions:  (num: 21, ar: 2)   # <-- note that MB is still of dims {'num': 19} only
Coordinates:              # <-- mostly unions as done by concat
  * num      (num) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
    B        <U1 'r'
  * ar       <U1 'A' 'B'    # <-- this is now a dim of the dataset, but not of MA or MB
Data variables:
    MA       (num) float64 0.5 1.0 1.5 2.0 2.5 3.0 ... 8.0 8.5 9.0 9.5 10.0 10.5
    MB       (num) float64 1.0 1.5 2.0 2.5 3.0 3.5 ... 7.5 8.0 8.5 9.0 9.5 10.0

I am working with land surface model outputs. I have lots of one-dimensional data for different lat/lon points, at different times. I want to join them all into one dataset to make plotting easier. E.g. plot the evapotranspiration estimates for all the stations at their x,y coordinates.
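The closest I've gotten is concatenating along a new 'station' dimension, e.g. this synthetic sketch (NaN padding and all):

import numpy as np
import pandas as pd
import xarray as xr

def station(lat, lon, start, periods):  # made-up stand-in for one station's data
    times = pd.date_range(start, periods=periods, freq='D')
    return xr.DataArray(np.random.rand(periods), dims='time', name='et',
                        coords={'time': times, 'lat': lat, 'lon': lon})

combined = xr.concat([station(51.5, -0.1, '2000-01-01', 10),
                      station(48.9, 2.3, '2000-01-05', 10)], dim='station')
# dims (station: 2, time: 14), NaN-padded where the times don't overlap;
# 'lat'/'lon' become per-station coordinates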

Thanks very much!

@zbarry commented Aug 26, 2019

I just wanted to chime in on the usefulness of being able to do something like this without the extra mental overhead required by the proposed workaround. My use case parallels @smartass101's very closely. Have there been any updates to xarray since last year that might make streamlining this use case a bit more feasible, by any chance? :)

isacdaavid added a commit to wmayner/pyphi that referenced this issue Feb 20, 2023
There is no good way of having single-node TPMs in a single
Dataset without blowing up memory usage (due to alignment of
non-shared singleton dimensions):

pydata/xarray#1471

Instead we will implement our own indexing operation to
distribute it across the sequence of DataArray nodes.

As a bonus, we can implement positional indexing, which xarray
doesn't support at the Dataset level.
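(Roughly, the distributed indexing described in the commit amounts to something like this sketch; the helper name is invented here:)

from typing import Sequence
import xarray as xr

def isel_nodes(nodes: Sequence[xr.DataArray], **indexers):
    # positional indexing applied per node, skipping dims a node doesn't have
    return [node.isel({d: i for d, i in indexers.items() if d in node.dims})
            for node in nodes]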
@marcel-goldschen-ohm

I also would love this feature. Consider a dataset with many 1-D time series recordings across repeated sweeps and stimulus series, all stored in one N-D array (e.g., dimensions ['series', 'sweep', 'time']). For one or a small number of these time series there is an artifactual wobble in the baseline that needs removing. It would be great to be able to store these detrended time series in the same dataset as a new array that shares the same time coords but has a subset of the series and sweep coords. This would require that each DataArray in a Dataset can have its own coords, which would supersede the Dataset's coords when defined.

I imagine there are likely many use cases for working with a subset of a dataset along some, but not all dimensions. In all those cases, it would be convenient to house that data within the same dataset rather than having a bunch of datasets with repeated coord arrays for shared dimensions.
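For now, I think the closest approximation is the rename workaround from above, e.g. this sketch with invented names:

import numpy as np
import xarray as xr

# stand-in for the recordings described above
ds = xr.Dataset(
    {'recordings': (('series', 'sweep', 'time'), np.random.rand(4, 3, 100))},
    coords={'series': np.arange(4), 'sweep': np.arange(3), 'time': np.arange(100) * 1e-3})

# detrend just the wobbly subset (placeholder detrend: subtract the mean)
subset = ds['recordings'].sel(series=[2], sweep=[0, 1])
detrended = subset - subset.mean('time')

# store alongside the original by renaming the partial dims so only 'time' stays shared
ds['detrended'] = detrended.rename({'series': 'series_fix', 'sweep': 'sweep_fix'})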

Given how stale this discussion is, I imagine this would probably be too much of a headache. But it would be very nice. I am in favor of the idea of a new UnalignedDataset as suggested by @shoyer.

@pydata pydata locked and limited conversation to collaborators Sep 12, 2023
@jhamman jhamman converted this issue into discussion #8177 Sep 12, 2023
