Full crabbake #1855

robertbastian · 2022-05-04T15:40:50Z

jedel1043 · 2022-05-05T05:51:15Z

So, I tried to test this PR with the CLDR full dataset and I discovered that it generates files with 100k lines (pretty much obvious considering the size of the dataset).

However, the massive files causes Rust Analyzer to freeze my whole system when loading the generated module. I don't think that's intentional, right?

Maybe this could be solved with include_bytes somehow?

robertbastian · 2022-05-05T08:32:48Z

I ran cargo run -p icu_datagen --features bin -- --all-keys --all-locales --cldr-tag 41.0.0 --uprops-tag release-71-1 --format mod and it generated 120MB of source. Running cargo build on that took 2 minutes and didn't even start up the fans on my laptop (very scientific assessment, I know). What analyzer exactly is crashing for you? An IDE? I don't think we can and want to support actually viewing these files.

include_bytes is not a solution as the point of this is having static structured Rust structs. We already have an include_bytes solution, but that needs to parse the bytes at runtime (StaticDataProvider)

robertbastian · 2022-05-05T09:11:49Z

Ah that analyzer. You can add your generated files to excludeDirs. Does the checked in test data crash it as well?

jedel1043 · 2022-05-05T14:11:21Z

Ah that analyzer. You can add your generated files to excludeDirs.

Yep, that's my solution for now. Unfortunately it has to be set for every IDE independently, since Rust Analyzer doesn't have a LSP-agnostic way to exclude files from a scan.

Does the checked in test data crash it as well?

Nope, it works perfectly fine with a small subset of language ids! I'm theorizing that the problem appears because RA tries to parse the whole 100k lines of a file at the same time. Maybe this could be alleviated by splitting those files into smaller chunks?
I think that RA is more efficient at analyzing a lot of smaller files than a small amount of massive files.

sffc · 2022-05-05T14:13:03Z

It would be nice if Rust Analyzer had a per-file annotation that we could add to each individual .rs file that gets generated.

jedel1043 · 2022-05-05T14:21:27Z

It would be nice if Rust Analyzer had a per-file annotation that we could add to each individual .rs file that gets generated.

Yeah, I thought RA had some editor-agnostic way of skipping a file or function from being analyzed, but apparently they're still working on it. (rust-lang/rust-analyzer#7449, rust-lang/rust-analyzer#6113)

jedel1043 · 2022-05-05T14:24:19Z

Ah, apparently you can bypass IDEs using the include! macro:
rust-lang/rust-analyzer#4500 (comment)
@robertbastian Would it be feasible to export the files this way instead?

That could also make it possible to separate the declaration of each VALUES from its definition, which would make the items visible for RA.

sffc · 2022-05-05T14:29:21Z

It's a little bit hacky but not a bad idea since these are really intended to be treated more as data files.

Is there a better extension than .txt? Maybe .inc?

What is an INC file?
Text file containing declarations, headers, functions, or other data referenced by a program's source code; can be used with C/C++, Pascal, Java, PHP (Web pages), and other languages.

https://fileinfo.com/extension/inc

jedel1043 · 2022-05-05T14:42:17Z

Is there a better extension than .txt? Maybe .inc?

Tested with .inc and it works!

I should point out that only a single expression or item is allowed inside the file, so if VALUES uses any private static, it would have to be inserted in a block along with the definition of VALUES:

//! def.inc
{
    static DATE: length::Date = length::Date::Long;
    length::Bag::from_date_time_style(DATE, length::Time::Medium).into()
}

//! main.rs
use icu::datetime::{mock::parse_gregorian_from_str, options::length, DateTimeFormat};
use icu::locid::locale;

mod icu4x;

fn main() {
    let date = parse_gregorian_from_str("2020-10-14T13:21:00").unwrap();

    let options = include!("def.inc");

    let dtf = DateTimeFormat::try_new(locale!("pl"), &icu4x::BakedDataProvider, &options).unwrap();

    let formatted_date = dtf.format(&date);

    println!("📅: {}", formatted_date)
}

robertbastian · 2022-05-05T14:43:50Z

I'm hesitant to change the extension, because then we will lose Rust features like syntax highlighting for small files as well. I also suspect that include! could bog down the compiler, as it will see a single massive file.

This sounds like a bug in Rust Analyzer for which there is a workaround, so I don't think it's something we need to fix. That said, I'm happy to include any required attributes; I see that rustfmt has an unstable marker, which I'll add, maybe you can teach that to Rust Analyzer?

sffc · 2022-05-05T15:11:46Z

If we use .inc, it's still possible to configure one's own personal IDE to do the syntax highlighting as Rust. It's just a matter of what the default should be, and I think it is not a bad idea for the default to be treating these files as generic text files.

robertbastian · 2022-05-05T15:14:55Z

But then you might as well configure your IDE to not rust-analyze them. These are Rust files so they should be .rs.

sffc

okay, done reviewing. There are parts of this we need to discuss.

components/locid/src/crabbake.rs

components/plurals/src/rules/runtime/ast.rs

components/properties/src/script.rs

experimental/casemapping/src/exceptions.rs

utils/codepointtrie/src/codepointtrie.rs

utils/uniset/src/uniset.rs

utils/zerovec/src/map/map.rs

utils/zerovec/src/varzerovec/crabbake.rs

utils/zerovec/src/zerovec/crabbake.rs

jedel1043 · 2022-05-15T16:09:49Z

@robertbastian Thank you for the advice! I could successfully integrate the module without causing RA to lag. Can't wait to see this PR merged!

Also, do you have plans on making LocaleCanonicalizer (and possibly any other struct that only needs a provider to work) also crabbakeable? Or would it be impossible to skip runtime checks for them?

sffc · 2022-05-20T21:16:01Z

Does Rust 1.61 address some of the issues that we had from this PR such as a way to const-construct ZeroMap?

Manishearth · 2022-05-20T22:32:33Z

Yes

Manishearth · 2022-05-20T22:33:03Z

It'll still be somewhat ugly, but crucially the ugliness can be packed away into a method, rather than being splayed out in crabbake output

sffc

Please make the fields of zerovec types private again. Suggestion: make this a separate, standalone PR that adds const constructors for all the types.

jira-pull-request-webhook · 2022-05-23T20:40:12Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

robertbastian · 2022-05-23T20:41:42Z

^squash

hsivonen

I only reviewed the collator and normalizer parts. My (non-blocking) comment about alignment applies to all ZeroVec backing stores when the ZeroVec item is larger than one byte.

hsivonen · 2022-05-24T06:09:54Z

provider/testdata/data/baked/collator/dia_v1.rs

+    &'static <::icu_collator::provider::CollationDiacriticsV1Marker as DataMarker>::Yokeable;
+static UND: DataStruct = &::icu_collator::provider::CollationDiacriticsV1 {
+    ce32s: unsafe {
+        static DATA: &[u8] = &[


Issue: (Can be a follow-up.) While I see that the decision in December means that making this &[u32] without ZeroVec in between is "optional" goal-wise, it seems to me that it would be relatively feasible to align this to a 4-byte boundary. ZeroVec would still have to generate unaligned load instructions the access would be from an actually-aligned address. (Disclaimer: I haven't benchmarked unaligned ALU-type load instructions from actually-aligned addresses relative to unaligned addresses, but at least on the SIMD side it makes a difference, IIRC.)

I think this would best be discussed on #1926. Having unaligned loads on crabbaked ZeroVecs is not ideal, but it's the obvious implementation. Maybe in the future we can add a BorrowedStatic enum version that contains a &[T], but we have to consider the trade-off between code size and runtime gains.

provider/testdata/data/baked/collator/meta_v1.rs

provider/datagen/src/crabbake.rs

components/datetime/src/provider/calendar/skeletons.rs

sffc

My remaining open comments can be addressed in a follow-up PR if it's easier.

sffc · 2022-05-26T01:34:20Z

components/datetime/src/fields/mod.rs

@@ -2,8 +2,8 @@
 // called LICENSE at the top level of the ICU4X source tree
 // (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

-#[macro_use]
-mod macros;
+#![allow(clippy::exhaustive_structs)] // Field and FieldULE part of data struct


Nit (optional): I prefer to have the exceptions explicit on each struct

sffc · 2022-05-26T01:36:29Z

components/datetime/src/fields/symbols.rs

+        // FIXME: This should be replaced with a custom derive.
+        // See: https://github.com/unicode-org/icu4x/issues/1044


Nit: The FIXME points to an issue that is closed. Please resolve the FIXME or make it point to an open issue.

(I'm aware that this is copied from macros.rs; I wouldn't have noticed this if you hadn't moved the code to the new file)

components/datetime/src/pattern/runtime/plural.rs

components/locid/src/crabbake.rs

sffc · 2022-05-26T01:54:58Z

components/properties/src/lib.rs

@@ -84,6 +84,7 @@ mod error;
 pub mod maps;
 mod props;
 pub mod provider;
+#[allow(clippy::exhaustive_structs)] // TODO


Nit: Add issues for TODOs.

components/properties/src/script.rs

utils/codepointtrie/src/codepointtrie.rs

utils/zerovec/src/map/map.rs

utils/zerovec/src/zerovec/crabbake.rs

utils/uniset/src/uniset.rs

robertbastian requested review from Manishearth, sffc, a team, aethanyc, makotokato, zbraniecki, nciric, dminor, gregtatum and nordzilla as code owners May 4, 2022 15:40

robertbastian removed request for a team, aethanyc, dminor, zbraniecki, makotokato, gregtatum, nciric and nordzilla May 4, 2022 15:41

sffc reviewed May 6, 2022

View reviewed changes

sffc mentioned this pull request May 9, 2022

Add benchmarks for CodePointTrie #1851

Merged

robertbastian requested review from sffc and Manishearth May 9, 2022 10:42

sffc requested changes May 20, 2022

View reviewed changes

robertbastian requested review from hsivonen and echeran as code owners May 23, 2022 20:34

full crabbake

0fc6011

robertbastian force-pushed the crabbake branch from 7d6625b to 0fc6011 Compare May 23, 2022 20:40

hsivonen previously approved these changes May 24, 2022

View reviewed changes

Merge branch 'main' into crabbake

35e1910

robertbastian dismissed hsivonen’s stale review via 35e1910 May 24, 2022 19:05

robertbastian requested a review from sffc May 24, 2022 20:26

robertbastian commented May 25, 2022

View reviewed changes

provider/datagen/src/crabbake.rs Show resolved Hide resolved

Manishearth previously approved these changes May 25, 2022

View reviewed changes

components/datetime/src/provider/calendar/skeletons.rs Outdated Show resolved Hide resolved

components/datetime/src/provider/calendar/skeletons.rs Show resolved Hide resolved

comments

47797ed

robertbastian dismissed Manishearth’s stale review via 47797ed May 25, 2022 21:23

clarify comment

844d0d4

Manishearth approved these changes May 25, 2022

View reviewed changes

sffc removed the request for review from echeran May 25, 2022 22:51

sffc approved these changes May 26, 2022

View reviewed changes

robertbastian merged commit ae73b7e into unicode-org:main May 26, 2022

robertbastian deleted the crabbake branch May 26, 2022 05:31

aethanyc mentioned this pull request Jun 16, 2022

Crate icu_segmenter [0.6.0]: Visibility error[E0603] for icu_segmenter::provider::GraphemeClusterBreakDataV1Marker #2012

Closed

		// FIXME: This should be replaced with a custom derive.
		// See: https://github.com/unicode-org/icu4x/issues/1044

Full crabbake #1855

Full crabbake #1855

Uh oh!

Conversation

robertbastian commented May 4, 2022 • edited by sffc Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jedel1043 commented May 5, 2022

Uh oh!

robertbastian commented May 5, 2022

Uh oh!

robertbastian commented May 5, 2022

Uh oh!

jedel1043 commented May 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sffc commented May 5, 2022

Uh oh!

jedel1043 commented May 5, 2022

Uh oh!

jedel1043 commented May 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sffc commented May 5, 2022

Uh oh!

jedel1043 commented May 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robertbastian commented May 5, 2022

Uh oh!

sffc commented May 5, 2022

Uh oh!

robertbastian commented May 5, 2022

Uh oh!

sffc left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jedel1043 commented May 15, 2022

Uh oh!

sffc commented May 20, 2022

Uh oh!

Manishearth commented May 20, 2022

Uh oh!

Manishearth commented May 20, 2022

Uh oh!

sffc left a comment

Choose a reason for hiding this comment

Uh oh!

jira-pull-request-webhook bot commented May 23, 2022

Uh oh!

robertbastian commented May 23, 2022

Uh oh!

hsivonen left a comment

Choose a reason for hiding this comment

Uh oh!

hsivonen May 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

robertbastian May 24, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sffc left a comment

Choose a reason for hiding this comment

Uh oh!

sffc May 26, 2022

Choose a reason for hiding this comment

robertbastian commented May 4, 2022 •

edited by sffc

Loading

jedel1043 commented May 5, 2022 •

edited

Loading

jedel1043 commented May 5, 2022 •

edited

Loading

jedel1043 commented May 5, 2022 •

edited

Loading

hsivonen May 24, 2022 •

edited

Loading