*This issue was moved to a discussion; the conversation continues there.*
sharing dimensions across dataarrays in a dataset #1471
Comments
I'm afraid this isn't possible, by design. That every variable in a Dataset shares the same coordinate system is enforced as part of the xarray data model. This makes data analysis and comparison within a Dataset quite straightforward, since everything is already on the same grid. For cases where you need different coordinate values and/or dimension sizes, your options are to either rename dimensions for different variables or use multiple Dataset/DataArray objects (Python has nice built-in data structures). In theory, we could add something like an "UnalignedDataset" that supports most of the Dataset methods without requiring alignment, but I'm not sure it's worth the trouble.
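The "multiple DataArray objects in built-in data structures" option can be sketched roughly as follows (a minimal, hypothetical example; the array names and coordinate values are made up):

```python
import numpy as np
import xarray as xr

# Two arrays with incompatible 'num' coordinates, kept in a plain dict
# instead of a Dataset, so no alignment is forced on construction.
arrays = {
    "MA": xr.DataArray(np.arange(5.0), dims="num", coords={"num": np.arange(5)}),
    "MB": xr.DataArray(np.arange(3.0), dims="num", coords={"num": np.arange(10, 13)}),
}

# Alignment can still be applied explicitly when it is actually needed:
ma, mb = xr.align(arrays["MA"], arrays["MB"], join="outer")
print(ma.sizes)  # 'num' is now the union of both axes (8 values)
```

The dict avoids the NaN padding entirely until an operation genuinely requires a common grid.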
I've hit this design limitation quite often as well, with several use-cases, both in experiment and simulation. It detracts from xarray's power of conveniently and transparently handling coordinate metadata (cf. the *Why xarray?* page).

Adding effectively dummy dimensions or coordinates is essentially what this alignment design forces us to do. A possible solution would be something like having (some) coordinate arrays in an (Unaligned)Dataset be a "reducible" MultiIndex (it would reduce to an Index for each DataArray). A workaround can be to use MultiIndex coordinates directly, but then alignment cannot be done easily, as levels do not behave as real dimensions.

Use-case examples:

1. coordinate "metadata"

I often have measurements on related axes, but also with additional coordinates (different positions, etc.). Consider:

```python
import numpy as np
import xarray as xr

n1 = np.arange(1, 22)
m1 = xr.DataArray(n1 * 0.5, coords={'num': n1, 'B': 'r', 'ar': 'A'}, dims=['num'], name='MA')
n2 = np.arange(2, 21)
m2 = xr.DataArray(n2 * 0.5, coords={'num': n2, 'B': 'r', 'ar': 'B'}, dims=['num'], name='MB')
ds = xr.merge([m1, m2])
print(ds)
```

What I would like to get (pseudocode):

```
<xarray.Dataset>
Dimensions:  (num: 21, ar: 2)  # <-- note that MB is still of dims {'num': 19} only
Coordinates:                   # <-- mostly unions as done by concat
  * num      (num) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
    B        <U1 'r'
  * ar       (ar) <U1 'A' 'B'  # <-- this is now a dim of the dataset, but not of MA or MB
Data variables:
    MA       (num) float64 0.5 1.0 1.5 2.0 2.5 3.0 ... 8.0 8.5 9.0 9.5 10.0 10.5
    MB       (num) float64 1.0 1.5 2.0 2.5 3.0 3.5 ... 7.5 8.0 8.5 9.0 9.5 10.0
```

Instead I get:

```
MergeError: conflicting values for variable 'ar' on objects to be combined:
first value: <xarray.Variable ()>
array('A', dtype='<U1')
second value: <xarray.Variable ()>
array('B', dtype='<U1')
```

While it is possible to …

2. unaligned time domains

This is a large problem especially when different time-bases are involved. A difference in sampling intervals will blow up the storage with a huge number of NaN values, which of course greatly complicates further calculations, e.g. filtering in the time domain. Even just non-overlapping time intervals will require at least double the storage. I often find myself resorting rather to …
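For what it's worth, the desired layout above can be approximated with `concat` instead of `merge`, at the cost of NaN padding along `num` (a sketch reusing the `m1`/`m2` arrays from the example; this does not give the per-variable dims the pseudocode asks for):

```python
import numpy as np
import xarray as xr

n1 = np.arange(1, 22)
m1 = xr.DataArray(n1 * 0.5, coords={'num': n1, 'B': 'r', 'ar': 'A'},
                  dims=['num'], name='MA')
n2 = np.arange(2, 21)
m2 = xr.DataArray(n2 * 0.5, coords={'num': n2, 'B': 'r', 'ar': 'B'},
                  dims=['num'], name='MB')

# concat promotes the conflicting scalar coord 'ar' to a new dimension
# and outer-joins 'num', padding the shorter array with NaN.
stacked = xr.concat([m1, m2], dim='ar')
print(stacked.sizes)  # ar: 2, num: 21
```

Both measurements end up in one array, but the `ar='B'` row carries NaN at `num=1` and `num=21`, which is exactly the storage blowup discussed above.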
You can use a …
I indeed often resort to using a pandas.MultiIndex, but especially the dropping of the selected coordinate value (#1408) makes it quite inconvenient.
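The MultiIndex workaround mentioned here can be sketched like this (a minimal example with hypothetical coordinate names, mirroring the `ar`/`num` coords from the earlier comment):

```python
import numpy as np
import xarray as xr

# One long 'sample' axis carrying two would-be dimensions as index levels.
da = xr.DataArray(
    np.arange(6.0),
    dims="sample",
    coords={"ar": ("sample", ["A"] * 3 + ["B"] * 3),
            "num": ("sample", [1, 2, 3, 1, 2, 3])},
).set_index(sample=["ar", "num"])

# Selecting on a level works, but the selected level value is dropped
# from the result (the inconvenience tracked in #1408).
sub = da.sel(ar="A")
print(sub.values)
```

The levels are not real dimensions, so operations like broadcasting or filtering along `num` alone remain awkward compared to a true `(ar, num)` grid.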
I'm marking #1408 as a bug so we won't forget about it. Hopefully it should be fixed automatically as part of the "explicit indexes" refactor.
> On Thu, Oct 18, 2018 at 2:48 AM Ondrej Grover ***@***.***> wrote:
> I indeed often resort to using a pandas.MultiIndex, but especially the dropping of the selected coordinate value (#1408) makes it quite inconvenient.
@smartass101 & @shoyer what would be the code for working with a …

I am working with land surface model outputs. I have lots of one-dimensional data for different lat/lon points, at different times. I want to join them all into one dataset to make plotting easier, e.g. to plot the evapotranspiration estimates for all the stations at their x, y coordinates. Thanks very much!
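One possible sketch for this station use-case (hypothetical names and values, and assuming the per-station series share a common time axis): concatenate the 1-D arrays along a new `station` dimension, so the differing scalar lat/lon coords become station-indexed coordinates.

```python
import numpy as np
import xarray as xr

time = np.arange(3)
# Hypothetical evapotranspiration series for two stations.
s1 = xr.DataArray([1.0, 2.0, 3.0], dims="time",
                  coords={"time": time, "lat": 10.0, "lon": 20.0}, name="et")
s2 = xr.DataArray([4.0, 5.0, 6.0], dims="time",
                  coords={"time": time, "lat": 11.0, "lon": 21.0}, name="et")

# The differing scalar lat/lon coords become 1-D coords along 'station'.
et = xr.concat([s1, s2], dim="station")
print(et.sizes)  # station: 2, time: 3
```

Plotting at station locations can then use the promoted coordinates, e.g. `plt.scatter(et.lon, et.lat, c=et.isel(time=0))`. If the time axes differ between stations, the outer join will NaN-pad, which is the storage problem discussed above.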
I just wanted to chime in as to the usefulness of being able to do something like this without the extra mental overhead required by the proposed workaround. My use case parallels @smartass101's very closely. Have there been any updates to xarray since last year that might make streamlining this use case a bit more feasible, by any chance? :)
There is no good way of having single-node TPMs in a single Dataset without blowing up memory usage (due to alignment of non-shared singleton dimensions): pydata/xarray#1471. Instead, we will implement our own indexing operation to distribute it across the sequence of DataArray nodes. As a bonus, we can implement positional indexing, which xarray doesn't support at the Dataset level.
I also would love this feature. Consider a dataset with many 1-D time-series recordings across repeated sweeps and stimulus series, all stored in one N-D array (e.g., dimensions `['series', 'sweep', 'time']`). For one or a small number of these time series there is an artifactual wobble in the baseline that needs removing. It would be great to be able to store these detrended time series in the same dataset as a new array that shares the same time coords but has a subset of the series and sweep coords. This would require that each DataArray in a Dataset can have its own coords, which would supersede the Dataset's coords when defined.

I imagine there are likely many use cases for working with a subset of a dataset along some, but not all, dimensions. In all those cases, it would be convenient to house that data within the same dataset rather than having a bunch of datasets with repeated coord arrays for shared dimensions. Given how stale this discussion is, I imagine this would probably be too much of a headache. But it would be very nice. I am in favor of the idea of a new UnalignedDataset as suggested by @shoyer.
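A quick sketch of why this hurts today (hypothetical shapes and a toy "detrend"): merging a detrended subset back into the full dataset pads it out to the full dimension sizes with NaN.

```python
import numpy as np
import xarray as xr

raw = xr.DataArray(
    np.arange(30.0).reshape(2, 3, 5),
    dims=("series", "sweep", "time"),
    coords={"series": [0, 1], "sweep": [0, 1, 2], "time": np.arange(5)},
    name="raw",
)

# Detrend only series 0, sweep 1 (the hypothetical "wobbly" subset).
subset = raw.sel(series=[0], sweep=[1])
detrended = (subset - subset.mean("time")).rename("detrended")

# Outer-join alignment pads the subset back to (2, 3, 5) with NaN,
# even though only 5 of the 30 values are meaningful.
ds = xr.merge([raw, detrended])
print(ds["detrended"].shape)
```

Only one sweep carries real data, yet the stored variable occupies the full grid, which is exactly the repeated-coord-array overhead described above.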
I have two questions regarding the proper implementation of an xarray Dataset when defining dimensions. First, I am wondering whether I can share the same dimension across multiple arrays in a dataset without storing NaN values for coordinates not present in each respective array.

As a simple example, I am interested in creating two data arrays that involve the shared dimensions x and y; however, in the first data array I only care about x-coordinates from 0 to 5, whereas in the second data array I only care about x-coordinates from 10 to 12.
If I naively merge the two datasets, then the dimensions and coordinates get merged correctly, but each of the data variables within the dataset is much larger than it needs to be (storing unnecessary NaN values).
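The NaN blowup from the naive merge can be reproduced with a small sketch (hypothetical values, reduced to the x dimension only):

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.ones(6), dims="x", coords={"x": np.arange(0, 6)}, name="a")
b = xr.DataArray(np.ones(3), dims="x", coords={"x": np.arange(10, 13)}, name="b")

# Outer-join alignment: x becomes the union (9 values), and each
# variable is NaN-padded over the coordinates it does not cover.
merged = xr.merge([a, b])
print(merged.sizes)  # x: 9
```

Here `a` carries 3 unnecessary NaNs and `b` carries 6; with a shared y dimension as well, the waste multiplies.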
For my second question, I want to add an extra layer of complexity and add a third variable that uses multi-indexing. Again, naively, I would have wanted the multi-index in the third table to share dimensions (x and y) with the previous data variables.

Currently my solution is just to rename each of the dimensions in each respective data array so that they do not overlap. While this is not ideal, I can probably get away with it, but since I would prefer the ability to share dimensions without adding NaN values, is there another way to achieve this? (I'm also assuming that I can still do joins later on using values within the different dimension names.)
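The renaming workaround described here can be sketched as follows (hypothetical names, x dimension only):

```python
import numpy as np
import xarray as xr

# Distinct dimension names per variable: no alignment, no NaN padding.
a = xr.DataArray(np.ones(6), dims="x_a", coords={"x_a": np.arange(0, 6)})
b = xr.DataArray(np.ones(3), dims="x_b", coords={"x_b": np.arange(10, 13)})
ds = xr.Dataset({"a": a, "b": b})
print(ds.sizes)  # x_a: 6, x_b: 3
```

Later joins across the renamed dimensions have to be done explicitly, e.g. by renaming back to a common dimension (`ds["a"].rename({"x_a": "x"})`) before aligning.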