(fix): use typesize on Blosc codec #2962

Merged
merged 22 commits into from
May 8, 2025

ilan-gold
Contributor

@ilan-gold ilan-gold commented Apr 7, 2025

Fixes #2766 and fixes #2171
TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the "needs release notes" label Apr 7, 2025
@d-v-b d-v-b requested a review from normanrz April 7, 2025 14:56
Contributor

@d-v-b d-v-b left a comment

This looks good @ilan-gold, we just need a release note.

@github-actions github-actions bot removed the "needs release notes" label Apr 7, 2025
@ilan-gold
Contributor Author

Apologies, I did not mean for you guys to do an immediate review; I will keep that in mind next time. This was mostly to remind myself to finish up :)

@d-v-b
Contributor

d-v-b commented Apr 7, 2025

no worries, I was trigger-happy here

Contributor

@dstansby dstansby left a comment

Is it worth adding a test for this that asserts a certain compressed size, so we can catch regressions in the future? The doctest is a nice way to catch this, but I worry that it might get removed or changed whereas a test is more likely to stay around.
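Why a size assertion would catch a wrong typesize can be illustrated without Blosc at all. The sketch below is a stand-in, not the test added in this PR: it uses a hand-written byte shuffle plus `zlib` from the standard library, and the helper name `byte_shuffle` and the choice of `typesize=1` as the "wrong" value are assumptions made for illustration.

```python
import struct
import zlib


def byte_shuffle(raw: bytes, typesize: int) -> bytes:
    """Group the i-th byte of every element together (a plain byte shuffle)."""
    if len(raw) % typesize:
        raise ValueError("buffer length must be a multiple of typesize")
    return b"".join(raw[i::typesize] for i in range(typesize))


# 10,000 consecutive integers stored as 8-byte little-endian values.
values = range(10_000)
raw = struct.pack(f"<{len(values)}q", *values)

# typesize=1 makes the shuffle a no-op, as if the codec did not know the dtype.
size_wrong = len(zlib.compress(byte_shuffle(raw, 1)))
# typesize=8 matches the element width, so the mostly-zero high bytes line up.
size_right = len(zlib.compress(byte_shuffle(raw, 8)))

print(size_wrong, size_right)
```

The exact sizes depend on the compressor, which is why a real regression test would pin an expected size for a specific codec and version rather than reuse these numbers.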

@ilan-gold ilan-gold requested a review from dstansby April 9, 2025 14:11
@ilan-gold
Contributor Author

ilan-gold commented May 8, 2025

Does anyone have access to a Windows machine? Or should we just xfail this and move on? I am not sure whether the issue is NumPy or the Python version interacting with numcodecs here, causing the sizes to be off. We can open an issue if someone with access to a Windows machine can create a repro.



async def test_typesize() -> None:
    a = np.arange(1000000)
Contributor

Suggested change
- a = np.arange(1000000)
+ a = np.arange(2**16, dtype=np.uint16)

As a thought, is it worth explicitly specifying the data type (and making the data smaller)? I don't know if it will fix the Windows issue, but I think it's worth doing anyway so there's a concrete byte size. Perhaps using an integer data type will also help across Linux/Windows, since they may have different floating-point implementations (although that's wild speculation on my part...)

Contributor Author

arange defaults to a 64-bit integer, so I'll push something with the dtype made explicit.

@ilan-gold
Contributor Author

The first few dumped bytes in the failing Windows case are:

\x02\x01\x91\x04@

while they should be

\x02\x01\x91\x08

This would indicate to me the typesize is being incorrectly encoded but I (a) don't know why and (b) don't know what the @ means.
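For readers following along, the leading bytes can be decoded with a small sketch like the one below. It assumes the 16-byte Blosc 1 chunk header layout (version, versionlz, flags, typesize, then three little-endian uint32 fields: nbytes, blocksize, cbytes). The helper name `parse_blosc_header` and the size values packed into the sample headers are hypothetical, chosen only for illustration.

```python
import struct
from typing import NamedTuple


class BloscHeader(NamedTuple):
    version: int
    versionlz: int
    flags: int
    typesize: int   # element size in bytes, as recorded by the compressor
    nbytes: int     # uncompressed size
    blocksize: int
    cbytes: int     # compressed size, header included


def parse_blosc_header(chunk: bytes) -> BloscHeader:
    """Decode the first 16 bytes of a Blosc 1 chunk (layout assumed as above)."""
    if len(chunk) < 16:
        raise ValueError("a Blosc header needs at least 16 bytes")
    return BloscHeader(*struct.unpack("<BBBBIII", chunk[:16]))


# Hypothetical headers: only the first four bytes come from the dumps above.
failing = b"\x02\x01\x91\x04" + struct.pack("<III", 40000, 40000, 1000)
expected = b"\x02\x01\x91\x08" + struct.pack("<III", 80000, 80000, 1000)

print(parse_blosc_header(failing))
print(parse_blosc_header(expected))
print(failing[:5])
```

Under this layout the fourth byte is the typesize (4 versus 8 here), and a trailing `@` is simply ASCII for 0x40, which would be the least significant byte of the nbytes field that follows.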

@dstansby
Contributor

dstansby commented May 8, 2025

> This would indicate to me the typesize is being incorrectly encoded but I (a) don't know why and (b) don't know what the @ means.

Weird... I'm guessing that would be an upstream numcodecs issue/fix, so we could probably cut our losses here and just xfail the test on windows for now.

@ilan-gold
Contributor Author

Ok @dstansby great call - it looks like the fix was just being explicit about the dtype; there must be different default behavior on Windows for that version. There is a warning in the documentation https://numpy.org/doc/stable/reference/generated/numpy.arange.html but I figured we hadn't actually hit any of those conditions.
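A minimal sketch of the difference between an implicit and an explicit dtype follows. The statement that the default integer can be 32-bit on Windows with older NumPy releases is an assumption consistent with the typesize of 4 seen in the dump, not something confirmed in this thread.

```python
import numpy as np

# Without dtype, the integer width is whatever the platform default is
# (assumption: 4 bytes on Windows with older NumPy, 8 bytes on most other setups).
implicit = np.arange(2**16)

# With an explicit dtype, the element size is the same on every platform.
explicit = np.arange(2**16, dtype=np.uint16)

print(implicit.dtype, implicit.itemsize)  # platform-dependent
print(explicit.dtype, explicit.itemsize, explicit.nbytes)
```

Because a size-based test depends on the number of bytes per element, pinning the dtype removes one platform-specific variable from the expected compressed size.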

@dstansby dstansby enabled auto-merge (squash) May 8, 2025 14:31
@dstansby dstansby merged commit 5ff3fbe into zarr-developers:main May 8, 2025
30 checks passed
@rabernat
Contributor

rabernat commented May 8, 2025

Thank you @ilan-gold and @dstansby for working on this bug! I really appreciate your efforts. 🙏

@ilan-gold ilan-gold deleted the ig/typesize_for_blosc branch May 8, 2025 14:59
Successfully merging this pull request may close these issues.

Difference in behavior between 2.x and 3.x using identical compressor settings
Poor blosc compression ratios compared to v2
5 participants