Skip to content

Full crabbake #1855

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 26, 2022
Merged

Full crabbake #1855

merged 4 commits into from
May 26, 2022

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented May 4, 2022

Fixes #1290

@jedel1043
Copy link
Contributor

So, I tried to test this PR with the CLDR full dataset and I discovered that it generates files with 100k lines (pretty much obvious considering the size of the dataset).

However, the massive files causes Rust Analyzer to freeze my whole system when loading the generated module. I don't think that's intentional, right?

Maybe this could be solved with include_bytes somehow?

@robertbastian
Copy link
Member Author

I ran cargo run -p icu_datagen --features bin -- --all-keys --all-locales --cldr-tag 41.0.0 --uprops-tag release-71-1 --format mod and it generated 120MB of source. Running cargo build on that took 2 minutes and didn't even start up the fans on my laptop (very scientific assessment, I know). What analyzer exactly is crashing for you? An IDE? I don't think we can and want to support actually viewing these files.

include_bytes is not a solution as the point of this is having static structured Rust structs. We already have an include_bytes solution, but that needs to parse the bytes at runtime (StaticDataProvider)

@robertbastian
Copy link
Member Author

Ah that analyzer. You can add your generated files to excludeDirs. Does the checked in test data crash it as well?

@jedel1043
Copy link
Contributor

jedel1043 commented May 5, 2022

Ah that analyzer. You can add your generated files to excludeDirs.

Yep, that's my solution for now. Unfortunately it has to be set for every IDE independently, since Rust Analyzer doesn't have a LSP-agnostic way to exclude files from a scan.

Does the checked in test data crash it as well?

Nope, it works perfectly fine with a small subset of language ids! I'm theorizing that the problem appears because RA tries to parse the whole 100k lines of a file at the same time. Maybe this could be alleviated by splitting those files into smaller chunks?
I think that RA is more efficient at analyzing a lot of smaller files than a small amount of massive files.

@sffc
Copy link
Member

sffc commented May 5, 2022

It would be nice if Rust Analyzer had a per-file annotation that we could add to each individual .rs file that gets generated.

@jedel1043
Copy link
Contributor

It would be nice if Rust Analyzer had a per-file annotation that we could add to each individual .rs file that gets generated.

Yeah, I thought RA had some editor-agnostic way of skipping a file or function from being analyzed, but apparently they're still working on it. (rust-lang/rust-analyzer#7449, rust-lang/rust-analyzer#6113)

@jedel1043
Copy link
Contributor

jedel1043 commented May 5, 2022

Ah, apparently you can bypass IDEs using the include! macro:
rust-lang/rust-analyzer#4500 (comment)
@robertbastian Would it be feasible to export the files this way instead?

That could also make it possible to separate the declaration of each VALUES from its definition, which would make the items visible for RA.

@sffc
Copy link
Member

sffc commented May 5, 2022

It's a little bit hacky but not a bad idea since these are really intended to be treated more as data files.

Is there a better extension than .txt? Maybe .inc?

What is an INC file?
Text file containing declarations, headers, functions, or other data referenced by a program's source code; can be used with C/C++, Pascal, Java, PHP (Web pages), and other languages.

https://fileinfo.com/extension/inc

@jedel1043
Copy link
Contributor

jedel1043 commented May 5, 2022

Is there a better extension than .txt? Maybe .inc?

Tested with .inc and it works!

I should point out that only a single expression or item is allowed inside the file, so if VALUES uses any private static, it would have to be inserted in a block along with the definition of VALUES:

//! def.inc
{
    static DATE: length::Date = length::Date::Long;
    length::Bag::from_date_time_style(DATE, length::Time::Medium).into()
}
//! main.rs
use icu::datetime::{mock::parse_gregorian_from_str, options::length, DateTimeFormat};
use icu::locid::locale;

mod icu4x;

fn main() {
    let date = parse_gregorian_from_str("2020-10-14T13:21:00").unwrap();

    let options = include!("def.inc");

    let dtf = DateTimeFormat::try_new(locale!("pl"), &icu4x::BakedDataProvider, &options).unwrap();

    let formatted_date = dtf.format(&date);

    println!("📅: {}", formatted_date)
}

@robertbastian
Copy link
Member Author

I'm hesitant to change the extension, because then we will lose Rust features like syntax highlighting for small files as well. I also suspect that include! could bog down the compiler, as it will see a single massive file.

This sounds like a bug in Rust Analyzer for which there is a workaround, so I don't think it's something we need to fix. That said, I'm happy to include any required attributes; I see that rustfmt has an unstable marker, which I'll add, maybe you can teach that to Rust Analyzer?

@sffc
Copy link
Member

sffc commented May 5, 2022

If we use .inc, it's still possible to configure one's own personal IDE to do the syntax highlighting as Rust. It's just a matter of what the default should be, and I think it is not a bad idea for the default to be treating these files as generic text files.

@robertbastian
Copy link
Member Author

But then you might as well configure your IDE to not rust-analyze them. These are Rust files so they should be .rs.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, done reviewing. There are parts of this we need to discuss.

@jedel1043
Copy link
Contributor

@robertbastian Thank you for the advice! I could successfully integrate the module without causing RA to lag. Can't wait to see this PR merged!

Also, do you have plans on making LocaleCanonicalizer (and possibly any other struct that only needs a provider to work) also crabbakeable? Or would it be impossible to skip runtime checks for them?

@sffc
Copy link
Member

sffc commented May 20, 2022

Does Rust 1.61 address some of the issues that we had from this PR such as a way to const-construct ZeroMap?

@Manishearth
Copy link
Member

Yes

@Manishearth
Copy link
Member

It'll still be somewhat ugly, but crucially the ugliness can be packed away into a method, rather than being splayed out in crabbake output

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the fields of zerovec types private again. Suggestion: make this a separate, standalone PR that adds const constructors for all the types.

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@robertbastian
Copy link
Member Author

^squash

hsivonen
hsivonen previously approved these changes May 24, 2022
Copy link
Member

@hsivonen hsivonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the collator and normalizer parts. My (non-blocking) comment about alignment applies to all ZeroVec backing stores when the ZeroVec item is larger than one byte.

&'static <::icu_collator::provider::CollationDiacriticsV1Marker as DataMarker>::Yokeable;
static UND: DataStruct = &::icu_collator::provider::CollationDiacriticsV1 {
ce32s: unsafe {
static DATA: &[u8] = &[
Copy link
Member

@hsivonen hsivonen May 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: (Can be a follow-up.) While I see that the decision in December means that making this &[u32] without ZeroVec in between is "optional" goal-wise, it seems to me that it would be relatively feasible to align this to a 4-byte boundary. ZeroVec would still have to generate unaligned load instructions the access would be from an actually-aligned address. (Disclaimer: I haven't benchmarked unaligned ALU-type load instructions from actually-aligned addresses relative to unaligned addresses, but at least on the SIMD side it makes a difference, IIRC.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would best be discussed on #1926. Having unaligned loads on crabbaked ZeroVecs is not ideal, but it's the obvious implementation. Maybe in the future we can add a BorrowedStatic enum version that contains a &[T], but we have to consider the trade-off between code size and runtime gains.

Manishearth
Manishearth previously approved these changes May 25, 2022
@sffc sffc removed the request for review from echeran May 25, 2022 22:51
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My remaining open comments can be addressed in a follow-up PR if it's easier.

@@ -2,8 +2,8 @@
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

#[macro_use]
mod macros;
#![allow(clippy::exhaustive_structs)] // Field and FieldULE part of data struct
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (optional): I prefer to have the exceptions explicit on each struct

Comment on lines +320 to +321
// FIXME: This should be replaced with a custom derive.
// See: https://github.com/unicode-org/icu4x/issues/1044
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The FIXME points to an issue that is closed. Please resolve the FIXME or make it point to an open issue.

(I'm aware that this is copied from macros.rs; I wouldn't have noticed this if you hadn't moved the code to the new file)

@@ -84,6 +84,7 @@ mod error;
pub mod maps;
mod props;
pub mod provider;
#[allow(clippy::exhaustive_structs)] // TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Add issues for TODOs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fast path for trusted data
5 participants