Description
What is your issue?
The `DataTree` structure was not designed with the performance of very large trees in mind. It doesn't do anything obviously wasteful, but the priority has been making decisions about the data model and user API, with performance secondary. Now that the model is more established (or soon should be), we're in a better position to talk about improving performance.
There are two possible performance issues that @shoyer pointed out:
- The internal structure is a lot of linked Python classes, resulting in a lot of method calls to do things like tree traversal. This is good for clarity and evolving a prototype, but will introduce significant overhead per tree operation (see the first sketch after this list).
- There are one or two places which might cause quadratic scaling with tree depth. In particular, inserting a node via the `DataTree.__init__` constructor will cause the entire tree to be checked for consistency, so creating a tree by repeatedly calling this constructor could be quadratically expensive. `DataTree.from_dict` could be optimized to avoid this problem because it builds from the root, so you can just check subtrees as they are added (see the second sketch after this list).
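
To make the first point concrete, here is a minimal sketch (a hypothetical `Node` class, not DataTree's actual internals) comparing traversal of linked node objects against iterating a flat path-to-node mapping; the gap between the two timings is pure Python attribute-lookup and call overhead:

```python
import timeit


class Node:
    """Toy linked-node tree, standing in for DataTree's node classes."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self

    def subtree(self):
        # Depth-first traversal: one method call and one generator
        # resume per node visited.
        yield self
        for child in self.children.values():
            yield from child.subtree()


# Build a two-level tree with ~1000 nodes.
root = Node("root")
for i in range(32):
    child = Node(f"c{i}", parent=root)
    for j in range(32):
        Node(f"c{i}-g{j}", parent=child)

# A flat dict over the same nodes iterates without any per-node
# delegation through the linked structure.
flat = {id(n): n for n in root.subtree()}

print("linked traversal:", timeit.timeit(lambda: list(root.subtree()), number=100))
print("flat mapping:    ", timeit.timeit(lambda: list(flat.values()), number=100))
```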
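And a toy model of the second point (a hypothetical `check_consistency` helper, not real datatree code), under the assumption that each `DataTree.__init__` call re-validates every existing node: building a depth-n tree one constructor call at a time does O(n²) total checks, while the `from_dict`-style approach of checking only each newly attached subtree stays linear:

```python
# Count validation work for two tree-building strategies.
checks = 0


def check_consistency(nodes):
    """Stand-in for whole-tree validation (e.g. name/alignment checks)."""
    global checks
    checks += len(nodes)


def build_via_constructor(n):
    # Mimics inserting nodes one at a time where each __init__
    # re-validates the entire tree built so far: O(n**2) total checks.
    tree = []
    for i in range(n):
        tree.append(f"node{i}")
        check_consistency(tree)
    return tree


def build_from_root(n):
    # Mimics a from_dict-style build: only the subtree being attached
    # is validated at each step, so total work is O(n).
    tree = []
    for i in range(n):
        tree.append(f"node{i}")
        check_consistency(tree[-1:])
    return tree


for build in (build_via_constructor, build_from_root):
    checks = 0
    build(1000)
    print(f"{build.__name__}: {checks} node checks")  # 500500 vs. 1000
```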
I personally think that the primary use case of a `DataTree` is small numbers of nodes, each containing large arrays (rather than large numbers of nodes containing small arrays). But I'm sure someone will immediately be like "well in my use case I need a tree with 10k nodes" 😆
In fact, because it is possible to represent huge amounts of archival data with a single `DataTree`, someone will probably do something like attempt to represent the entire CMIP6 catalog as a `DataTree` and then complain after hitting a performance limit...
If anyone has ideas for how to improve performance without changing the user API, let's use this issue to collate and track them.
(Note that this issue is different from the issue of dask in datatree (xref #9355, #9502, #9504); here I'm talking specifically about optimizations that can be performed even without dask installed.)
cc @Illviljan who I'm sure has thoughts about this