GH-46270: [C++][Parquet] Clarify GeoStatistics docstring #46649

Merged
merged 1 commit into from
Jun 2, 2025

Conversation

paleolimbot
Member

@paleolimbot paleolimbot commented May 30, 2025

Rationale for this change

The distinction between "invalid" and "empty" is not clear in the current documentation!

What changes are included in this PR?

The docstring for GeoStatistics was improved.

Are these changes tested?

Just documentation!

Are there any user-facing changes?

No

Member

@pitrou pitrou left a comment

Thank you, this is useful.

Member

pitrou commented Jun 2, 2025

@paleolimbot I'm curious about this in the method docstrings:

  /// For statistics read from a Parquet file, dimension_empty() will always contain
  /// false values because there is no mechanism to communicate an empty interval
  /// in the Thrift metadata.

Why was it done that way, if emptiness is useful information to have? And is there a point in exposing emptiness in our geostats APIs? Usually, people want to filter a Parquet file read from disk, not one that is being constructed in memory...

@paleolimbot
Member Author

Why was it done that way, if emptiness is useful information to have?

The PR where we discussed this is apache/parquet-format#494. The consensus was that checking the null_count for a column chunk against the number of rows in the row group would catch the most common case (the row group is all null). We then discovered that we don't currently write null counts for unsorted logical types, but hopefully we can fix that (#46275).
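To make the null_count check concrete, here is a minimal sketch of the comparison described above. It is not the Arrow or Parquet API: `ChunkInfo`, `AllNull`, and `CheckAllNull` are hypothetical names invented for illustration, and the optional `null_count` models the case where the writer did not record one.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the metadata of one column chunk.
// This is NOT an Arrow/Parquet type; it only models the two numbers
// the check needs.
struct ChunkInfo {
  int64_t row_group_num_rows;         // rows in the enclosing row group
  std::optional<int64_t> null_count;  // absent if the writer did not record it
};

enum class AllNull { kYes, kNo, kUnknown };

// Reports whether every value in the chunk is null. When the null count
// was not written, nothing can be concluded, so the result is kUnknown.
inline AllNull CheckAllNull(const ChunkInfo& chunk) {
  assert(chunk.row_group_num_rows >= 0);
  if (!chunk.null_count.has_value()) {
    return AllNull::kUnknown;
  }
  return *chunk.null_count == chunk.row_group_num_rows ? AllNull::kYes
                                                       : AllNull::kNo;
}
```

A reader could treat `kYes` as "no geometries here, skip this row group" and must treat `kUnknown` like `kNo`, which is the situation for files written without null counts.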

And is there a point in exposing emptiness in our geostats APIs?

We use the same API for producing and consuming GeoStatistics (this was modelled after the regular Statistics). We could move the write path to use only internals, although I am not sure this would be less confusing.

@paleolimbot paleolimbot merged commit 8d44eea into apache:main Jun 2, 2025
28 of 34 checks passed
@paleolimbot paleolimbot removed the "awaiting committer review" label Jun 2, 2025

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 8d44eea.

There were 68 benchmark results with an error.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
