refactor v3 data types #2874
…into feat/fixed-length-strings
copying a comment @nenb made in this zulip discussion:
A feature of the character code is that it provides a way to distinguish parametric types like
Summarising from a zulip discussion:
@nenb: How is the endianness of a dtype handled?
@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.
Proposed solution: Make
Thanks for the summary! I have implemented the proposed solution.
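To make the endianness point from the summary concrete, here is a minimal sketch using only the standard library. The names `Int16` and `BytesCodec` are hypothetical and are not taken from this PR or from the zarr-python API; the sketch only illustrates the idea that a v3-style data type carries no byte order, while a codec decides how the same bytes are interpreted.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Int16:
    """Hypothetical v3-style data type: it carries no byte order."""

    struct_char: str = "h"  # struct format character for a signed 16-bit int
    item_size: int = 2      # size of one element in bytes


@dataclass(frozen=True)
class BytesCodec:
    """Hypothetical codec: the byte order is codec state, not dtype state."""

    endian: str = "little"  # assumed to be either "little" or "big"

    def decode(self, dtype: Int16, raw: bytes) -> tuple:
        # Pick the struct byte-order prefix from the codec configuration.
        prefix = "<" if self.endian == "little" else ">"
        count = len(raw) // dtype.item_size
        return struct.unpack(f"{prefix}{count}{dtype.struct_char}", raw)


dtype = Int16()
raw = b"\x01\x00\x00\x02"

# Same data type, same bytes, different codec state.
little = BytesCodec(endian="little").decode(dtype, raw)  # (1, 512)
big = BytesCodec(endian="big").decode(dtype, raw)        # (256, 2)
```

In a v2-style model the byte order would instead live on the data type itself (for example a NumPy dtype string such as `<i2` versus `>i2`), so the two decodings above would correspond to two distinct dtypes.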
…into feat/fixed-length-strings
… registry load frequency, add object_codec_id for v2 json deserialization
…at/fixed-length-strings
I recently pushed a lot of internal changes. A quick summary:
I have not written enough documentation. I would request that we make this happen in another PR, and focus on getting this PR finally reviewed, merged, and included in a pre-release so people can test it more easily.
Expressing my personal opinion here: I would be very much in favour of getting this into a pre-release so that I can test it with a variety of new data types and see what happens. As you have pointed out, testing in the pre-release will likely be the best way to motivate what parts of the documentation and the (internal) API need to be improved before an actual release. Great work!
…at/fixed-length-strings
Sorry, I'm not going to have time to re-review this in the near future. If you've looked at all my previous comments feel free to dismiss my review.
No worries, and thanks for your patience with this big PR. I'll review your feedback; some things might get spun out into issues (I'm thinking about changes in the config).
As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 with the same data structures, and confine the v2 / v3 differences to the places where they vary.
In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plainly unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.
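As a rough illustration of the class-per-data-type idea described above, the following sketch shows how a parametrized type and a non-parametrized type could each own their default value and JSON scalar conversion. All class and method names here (`Float64`, `FixedLengthString`, `default_scalar`, `to_json_scalar`, `from_json_scalar`) are hypothetical and are not the API introduced by this PR; the string encoding of non-finite floats is likewise an assumption made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Float64:
    """Hypothetical non-parametrized data type."""

    def default_scalar(self) -> float:
        return 0.0

    def to_json_scalar(self, value: float):
        # JSON has no literal for NaN or infinities, so this sketch
        # assumes they are written as strings.
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
        return float(value)

    def from_json_scalar(self, data) -> float:
        # float() accepts "NaN", "Infinity" and "-Infinity" as well as numbers.
        return float(data)


@dataclass(frozen=True)
class FixedLengthString:
    """Hypothetical parametrized data type: the length is instance state."""

    length: int

    def default_scalar(self) -> str:
        return ""

    def to_json_scalar(self, value: str) -> str:
        if len(value) > self.length:
            raise ValueError(f"string longer than {self.length} characters")
        return value

    def from_json_scalar(self, data) -> str:
        return str(data)


# Two lengths are two distinct data types, with no need to enumerate
# every possible length as a separate enum member.
short = FixedLengthString(length=4)
long_ = FixedLengthString(length=10)
```

An enum of strings would need one member per length (or per time unit, for datetimes), whereas here the parameter is simply a field on the instance.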