refactor v3 data types #2874


Open

d-v-b wants to merge 146 commits into main

Conversation

@d-v-b (Contributor) commented Feb 28, 2025

As per #2750, we need a new model of data types if we want to support more of them. Accordingly, this PR refactors the data types for the zarr v3 side of the codebase and makes them extensible. I would also like to handle v2 with the same data structures, and confine the v2 / v3 differences to the places where they actually vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plainly unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different classes also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.
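
Here is a minimal sketch of what that class-per-dtype model could look like; the names (`FixedLengthUnicode`, `default_value`, `to_json_value`, `from_json_value`) are illustrative, not the actual API introduced by this PR:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FixedLengthUnicode:
    """One class covers the whole parametric family: U2, U3, ... are instances."""

    length: int

    def default_value(self) -> str:
        # Each dtype class defines its own default scalar value...
        return ""

    def to_json_value(self, value: str) -> str:
        # ...and how its scalars are converted to a JSON-serializable form...
        return str(value)

    def from_json_value(self, data: str) -> str:
        # ...and back again.
        return str(data)

# Parametrized instances of the same class, instead of one enum variant per dtype.
u2 = FixedLengthUnicode(length=2)
u3 = FixedLengthUnicode(length=3)
```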

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

@github-actions bot added the needs release notes label (automatically applied to PRs which haven't added release notes) on Feb 28, 2025
@d-v-b (Contributor, Author) commented Feb 28, 2025

Copying a comment @nenb made in this Zulip discussion:

> The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of that class can be initialized with a "length" parameter, and then we can make U2, U3, etc. as instances of the same class. If instead we bound a concrete numpy dtype as a class attribute, we would need a separate class for each of U2, U3, etc., which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
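
To illustrate (hypothetical names, not the PR's actual API): binding the character code at class level keeps the class parametric, while each instance pins a concrete length:

```python
import numpy as np

class FixedLengthUnicode:
    # The numpy character code identifies the parametric type "U*"...
    character_code = "U"

    def __init__(self, length: int) -> None:
        # ...while the length parameter selects a parametrized instance like "U3".
        self.length = length

    def to_numpy(self) -> np.dtype:
        return np.dtype(f"{self.character_code}{self.length}")

assert FixedLengthUnicode(3).to_numpy() == np.dtype("U3")
```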

@nenb commented Mar 3, 2025

Summarising from a Zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but it won't be part of the dtype on disk (as required by the spec).

@d-v-b (Contributor, Author) commented Mar 4, 2025

> Summarising from a Zulip discussion:
>
> @nenb: How is the endianness of a dtype handled?
>
> @d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.
>
> Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but it won't be part of the dtype on disk (as required by the spec).

Thanks for the summary! I have implemented the proposed solution.
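
A minimal sketch of what that solution could look like, under assumed names (`Float64`, `to_json_v3` are illustrative): endianness lives on the instance for zarr-python's internal use, but is deliberately absent from the serialized v3 dtype:

```python
from dataclasses import dataclass
from typing import Literal

import numpy as np

@dataclass(frozen=True)
class Float64:
    # Implementation detail: consulted when materializing a numpy dtype...
    endianness: Literal["little", "big"] = "little"

    def to_numpy(self) -> np.dtype:
        order = "<" if self.endianness == "little" else ">"
        return np.dtype(f"{order}f8")

    def to_json_v3(self) -> str:
        # ...but never written to disk: the v3 spec assigns byte order to codecs.
        return "float64"
```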

@d-v-b mentioned this pull request Mar 5, 2025
@d-v-b mentioned this pull request May 26, 2025
@d-v-b (Contributor, Author) commented May 29, 2025

I recently pushed a lot of internal changes. A quick summary:

  • I added explicit type annotations to the JSON-producing functions, so wherever possible these functions are annotated with the actual string / TypedDict representation of the metadata form of the dtype. This means the type-checking routines that perform type-narrowing on JSON input now have a much more explicit TypeGuard annotation (see the first sketch after this list). But it also means there is some impedance mismatch between the JSON type we use in the rest of the codebase and the TypedDict universe of the dtypes. Ultimately I think we need to fix our JSON type to be compatible with TypedDicts, but that's a problem for a later day.

  • I separated zarr v2 JSON data type deserialization from zarr v3 deserialization. The key difference is that zarr v2 deserialization now takes an optional object_codec_id parameter, which is the name of an "object codec" present in the array metadata document. This parameter is necessary to resolve the data type of zarr v2 arrays saved with dtype "|O", which can only be disambiguated by checking for special codecs like vlen-utf8, vlen-bytes, or pickle in filters or compressor (see the first sketch after this list). The process of creating zarr v2 metadata documents from JSON has been updated to include this logic. This means we can add zarr-python 2.18-compatible vlen-bytes and vlen-array dtypes; that's an effort for another PR.

  • I added a consistency check to the data type registry. When resolving a native data type, if two or more zdtype classes match that native data type, an exception is raised and the user is encouraged to unregister one of the ambiguous zdtypes from the registry (see the second sketch after this list).

    To be clear, I do not like that dtype resolution currently involves calling a bunch of class methods and succeeding only when exactly one of those invocations succeeds. Things would be much simpler with a static mapping from numpy data type classes to zarr data type classes, as @nenb suggested here. But this is complicated by the numpy void dtype, which supports two separate zarr data types: fixed-length raw bytes and structured dtypes. One solution would be to add numpy void to the list of data types we don't dynamically resolve; then I think the static {numpy dtype: zarr dtype} mapping can work. But I consider this a refinement of the content in this PR, and suitable for a separate effort.
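
A minimal sketch of the first two points, under assumed names (`VariableLengthString`, `check_json_v2`, and `from_json_v2` are illustrative rather than the PR's actual API): the JSON check carries an explicit TypeGuard annotation, and zarr v2 resolution of "|O" consults the object codec id:

```python
from typing import Literal, TypeGuard

class VariableLengthString:
    @classmethod
    def check_json_v2(cls, data: object) -> TypeGuard[Literal["|O"]]:
        # Type-narrowing on raw JSON input: after this returns True, a type
        # checker knows `data` is exactly the string "|O".
        return data == "|O"

    @classmethod
    def from_json_v2(
        cls, data: object, *, object_codec_id: str | None = None
    ) -> "VariableLengthString":
        # "|O" alone is ambiguous; only an object codec such as vlen-utf8 in
        # the filters / compressor identifies the array as holding strings.
        if cls.check_json_v2(data) and object_codec_id == "vlen-utf8":
            return cls()
        raise ValueError(
            f"Cannot resolve dtype {data!r} with object codec {object_codec_id!r}"
        )
```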
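
And a sketch of the registry consistency check from the last point, again with hypothetical names: resolution tries every registered zdtype and fails loudly when more than one matches:

```python
from typing import Protocol

class ZDType(Protocol):
    # Hypothetical interface: each zdtype class can report whether it handles
    # a given native (e.g. numpy) dtype.
    @classmethod
    def check_native_dtype(cls, native_dtype: object) -> bool: ...

def match_native_dtype(
    registry: dict[str, type[ZDType]], native_dtype: object
) -> type[ZDType]:
    # Try every registered zdtype class; exactly one match is required.
    matched = [z for z in registry.values() if z.check_native_dtype(native_dtype)]
    if len(matched) > 1:
        names = sorted(m.__name__ for m in matched)
        raise ValueError(
            f"Ambiguous resolution of {native_dtype!r}: {names} all match. "
            "Unregister one of these zdtypes to remove the ambiguity."
        )
    if not matched:
        raise ValueError(f"No registered zdtype matches {native_dtype!r}")
    return matched[0]
```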

I have not written enough documentation. I would request that we make that happen in another PR, and focus on getting this PR reviewed, merged, and included in a pre-release so people can test it more easily.

@nenb commented May 29, 2025

Expressing my personal opinion here: I would be very much in favour of getting this into a pre-release so that I can test it with a variety of new data types and see what happens. As you have pointed out, testing in the pre-release will likely be the best way to identify which parts of the documentation and the (internal) API need to be improved before an actual release.

Great work!

@d-v-b (Contributor, Author) commented Jun 3, 2025

@ianhi @dstansby if you have the time, could you check whether the changes you requested have all been addressed?

@dstansby (Contributor) commented Jun 3, 2025

Sorry, I'm not going to have time to re-review this in the near future. If you've looked at all my previous comments, feel free to dismiss my review.

@d-v-b (Contributor, Author) commented Jun 3, 2025

> Sorry, I'm not going to have time to re-review this in the near future. If you've looked at all my previous comments, feel free to dismiss my review.

No worries, and thanks for your patience with this big PR. I'll review your feedback; some things might get spun out into issues (I'm thinking about changes in the config).
