Rollup merge of rust-lang#70486 - Mark-Simulacrum:unicode-shrink, r=dtolnay
Shrink Unicode tables (even more)
This shrinks the Unicode tables further, building upon the wins in rust-lang#68232 (the previous counts differ due to an interim Unicode version update; see rust-lang#69929).
The new data structure is around 3x slower on a benchmark that looks up every Unicode scalar value sequentially in each included data set. Note that for ASCII, the exposed functions on `char` optimize with direct branches, so ASCII performance stays the same regardless of internal optimizations (or regressions). Also, note that the size reduction due to the skip list (which is where the performance loss comes from) is around 40%. As a result, I believe the performance loss is acceptable, as the routines are still quite fast. Any code where this is hot should probably be using a custom data structure anyway (e.g., a raw bitset) or something optimized for frequently seen values.
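To illustrate why ASCII is unaffected by table changes, here is a minimal sketch of an ASCII fast path in front of a table lookup. The names `is_whitespace_sketch`, `table_lookup`, and `NON_ASCII_WHITE_SPACE` are hypothetical, and the range list stands in for whatever compressed table sits behind the real functions; it is not the generated data from this PR.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a generated table: the non-ASCII White_Space
/// code points as sorted, inclusive ranges (an assumption for illustration).
const NON_ASCII_WHITE_SPACE: &[(u32, u32)] = &[
    (0x0085, 0x0085),
    (0x00A0, 0x00A0),
    (0x1680, 0x1680),
    (0x2000, 0x200A),
    (0x2028, 0x2029),
    (0x202F, 0x202F),
    (0x205F, 0x205F),
    (0x3000, 0x3000),
];

/// Slow path: binary search for a range containing `cp`.
fn table_lookup(cp: u32) -> bool {
    NON_ASCII_WHITE_SPACE
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// ASCII is decided by direct branches; only non-ASCII input reaches the table.
fn is_whitespace_sketch(c: char) -> bool {
    match c {
        ' ' | '\x09'..='\x0d' => true,
        c => !c.is_ascii() && table_lookup(c as u32),
    }
}
```

Because the first match arm and the `is_ascii` check settle every ASCII input, swapping `table_lookup` for a slower or faster structure only changes the cost of non-ASCII lookups.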
This PR both updates the bitset data structure and introduces a new data structure similar to a skip list. For more details, see the [main.rs] of the table generator, which describes both. The commits mostly work individually and document size wins.
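As a rough illustration of the skip-list idea only, the sketch below stores a set as alternating run lengths (not in set, in set, ...) plus a coarse index that lets a lookup skip ahead before scanning. `RunTable`, `build`, `contains`, and the `stride` parameter are hypothetical names, and this layout is a simplification under my own assumptions; the actual encoding is the one described in [main.rs].

```rust
/// Simplified skip-list-like table (hypothetical, for illustration).
struct RunTable {
    /// Alternating run lengths starting at code point 0:
    /// even positions are "not in set", odd positions are "in set".
    runs: Vec<u32>,
    /// For every `stride`-th run: (first code point of that run, run position).
    index: Vec<(u32, usize)>,
}

impl RunTable {
    /// `ranges` must be sorted, inclusive, and non-overlapping.
    fn build(ranges: &[(u32, u32)], stride: usize) -> Self {
        assert!(stride > 0);
        let mut runs = Vec::new();
        let mut next = 0u32; // first code point not yet covered by a run
        for &(lo, hi) in ranges {
            runs.push(lo - next); // gap before the range
            runs.push(hi - lo + 1); // the range itself
            next = hi + 1;
        }
        let mut index = Vec::new();
        let mut start = 0u32;
        for (i, &len) in runs.iter().enumerate() {
            if i % stride == 0 {
                index.push((start, i));
            }
            start += len;
        }
        RunTable { runs, index }
    }

    fn contains(&self, cp: u32) -> bool {
        if self.index.is_empty() {
            return false;
        }
        // Skip: find the last indexed run that starts at or before `cp`.
        // `k >= 1` because the first index entry always starts at 0.
        let k = self.index.partition_point(|&(start, _)| start <= cp);
        let (mut start, mut i) = self.index[k - 1];
        // Scan: walk at most a few runs until one covers `cp`.
        while i < self.runs.len() {
            let end = start + self.runs[i];
            if cp < end {
                return i % 2 == 1; // odd positions are "in set" runs
            }
            start = end;
            i += 1;
        }
        false // past the last range
    }
}
```

A larger `stride` makes the index smaller and the scan longer, which is the same size-versus-speed trade-off the description above weighs.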
As before, this is tested on all valid chars to have the same results as nightly (and the canonical Unicode data sets). Happily, no bugs were found.
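An exhaustive comparison of this kind is cheap to write, since there are only about 1.1 million scalar values. The helper below is a sketch with a hypothetical name (`first_mismatch`); it is not the test harness used for this PR.

```rust
/// Returns the first `char` on which the two predicates disagree, if any.
/// Iterating a `char` range visits every Unicode scalar value and skips
/// the surrogate gap automatically.
fn first_mismatch(
    candidate: impl Fn(char) -> bool,
    reference: impl Fn(char) -> bool,
) -> Option<char> {
    ('\0'..=char::MAX).find(|&c| candidate(c) != reference(c))
}
```

For example, `first_mismatch(|c| c == ' ', char::is_whitespace)` reports `Some('\t')`, the first whitespace character the naive candidate misses, while comparing a predicate against itself yields `None`.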
[main.rs]: https://github.com/rust-lang/rust/blob/fb4a715e18b/src/tools/unicode-table-generator/src/main.rs
Set | Previous | New | % of old | Codepoints | Ranges |
----------------|---------:|------:|-----------:|-----------:|-------:|
Alphabetic | 3055 | 1599 | 52% | 132875 | 695 |
Case Ignorable | 2136 | 949 | 44% | 2413 | 410 |
Cased | 934 | 359 | 38% | 4286 | 141 |
Cc | 43 | 9 | 20% | 65 | 2 |
Grapheme Extend | 1774 | 813 | 46% | 1979 | 344 |
Lowercase | 985 | 867 | 88% | 2344 | 652 |
N | 1266 | 419 | 33% | 1781 | 133 |
Uppercase | 934 | 777 | 83% | 1911 | 643 |
White_Space | 140 | 37 | 26% | 25 | 10 |
----------------|----------|-------|------------|------------|--------|
Total | 11267 | 5829 | 51% | - | - |