Split `Vec::dedup_by` into 2 cycles #118273
The new benchmarks cover more specific cases than the old ones. They also make better use of `black_box`.
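As a rough illustration of what careful `black_box` usage in a dedup benchmark can look like, here is a minimal sketch using only stable standard-library APIs. It is not the benchmark code from this PR: the function name `time_dedup_none` and the manual timing loop are hypothetical, and the real benchmarks in `library/alloc/benches/vec.rs` use the unstable `test::Bencher` harness instead.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: times `Vec::dedup` on `len` unique integers,
/// i.e. the case where nothing is removed.
/// Returns the accumulated time and the length after the last run.
pub fn time_dedup_none(len: u32, iters: u32) -> (Duration, usize) {
    let template: Vec<u32> = (0..len).collect();
    let mut total = Duration::ZERO;
    let mut kept = 0;
    for _ in 0..iters {
        // Pass the input through `black_box` so the optimizer cannot
        // assume anything about the vector's contents.
        let mut v = black_box(template.clone());
        let start = Instant::now();
        v.dedup();
        total += start.elapsed();
        // Pass the result through `black_box` so the dedup call
        // cannot be discarded as dead code.
        kept = black_box(v).len();
    }
    (total, kept)
}
```

Calling `time_dedup_none(100_000, 100)` would roughly correspond to the `bench_dedup_none_100000` case, although the absolute numbers from such a hand-rolled loop are not comparable to the harness results.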
(rustbot has picked a reviewer for you, use r? to override)
@bors try @rust-timer queue
…rsion_77772_2, r=<try>

Split `Vec::dedup_by` into 2 cycles

The first cycle runs until we find 2 equal elements; the second runs afterwards only if the first one found any. This avoids any memory writes until we find an item that we want to remove.

This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.

Results of benchmarks before implementation (including new benchmark where nothing needs to be removed):

* vec::bench_dedup_all_100 74.00ns/iter +/- 13.00ns
* vec::bench_dedup_all_1000 572.00ns/iter +/- 272.00ns
* vec::bench_dedup_all_100000 64.42µs/iter +/- 19.47µs
* __vec::bench_dedup_none_100 67.00ns/iter +/- 17.00ns__
* __vec::bench_dedup_none_1000 662.00ns/iter +/- 86.00ns__
* __vec::bench_dedup_none_10000 9.16µs/iter +/- 2.71µs__
* __vec::bench_dedup_none_100000 91.25µs/iter +/- 1.82µs__
* vec::bench_dedup_random_100 105.00ns/iter +/- 11.00ns
* vec::bench_dedup_random_1000 781.00ns/iter +/- 10.00ns
* vec::bench_dedup_random_10000 9.00µs/iter +/- 5.62µs
* vec::bench_dedup_random_100000 449.81µs/iter +/- 74.99µs
* vec::bench_dedup_slice_truncate_100 105.00ns/iter +/- 16.00ns
* vec::bench_dedup_slice_truncate_1000 2.65µs/iter +/- 481.00ns
* vec::bench_dedup_slice_truncate_10000 18.33µs/iter +/- 5.23µs
* vec::bench_dedup_slice_truncate_100000 501.12µs/iter +/- 46.97µs

Results after implementation:

* vec::bench_dedup_all_100 75.00ns/iter +/- 9.00ns
* vec::bench_dedup_all_1000 494.00ns/iter +/- 117.00ns
* vec::bench_dedup_all_100000 58.13µs/iter +/- 8.78µs
* __vec::bench_dedup_none_100 52.00ns/iter +/- 22.00ns__
* __vec::bench_dedup_none_1000 417.00ns/iter +/- 116.00ns__
* __vec::bench_dedup_none_10000 4.11µs/iter +/- 546.00ns__
* __vec::bench_dedup_none_100000 40.47µs/iter +/- 5.36µs__
* vec::bench_dedup_random_100 77.00ns/iter +/- 15.00ns
* vec::bench_dedup_random_1000 681.00ns/iter +/- 86.00ns
* vec::bench_dedup_random_10000 11.66µs/iter +/- 2.22µs
* vec::bench_dedup_random_100000 469.35µs/iter +/- 20.53µs
* vec::bench_dedup_slice_truncate_100 100.00ns/iter +/- 5.00ns
* vec::bench_dedup_slice_truncate_1000 2.55µs/iter +/- 224.00ns
* vec::bench_dedup_slice_truncate_10000 18.95µs/iter +/- 2.59µs
* vec::bench_dedup_slice_truncate_100000 492.85µs/iter +/- 72.84µs

Resolves rust-lang#77772

P.S. Note that this is the same PR as rust-lang#92104; I missed the review and then forgot about it. Also, I cannot reopen that pull request, so I am creating a new one. I responded to the remaining questions directly by adding comments to my code.
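The two cycles described above can be sketched in safe Rust as follows. This is an illustration of the idea only, not the code merged in this PR: the function name `dedup_by_two_phase` is hypothetical, and the compaction step uses `swap` and `truncate` as a stand-in, whereas the standard library implementation works with raw pointers in `unsafe` code. The variable name `first_duplicate` follows the reviewer's naming suggestion.

```rust
/// Hypothetical safe sketch of a two-cycle `dedup_by`.
/// As with `Vec::dedup_by`, `same_bucket(a, b)` receives the current
/// element `a` and the last kept element `b`; `a` is removed if it
/// returns true.
pub fn dedup_by_two_phase<T, F>(v: &mut Vec<T>, mut same_bucket: F)
where
    F: FnMut(&mut T, &mut T) -> bool,
{
    let len = v.len();
    if len <= 1 {
        return;
    }

    // Cycle 1: scan for the first element that must be removed.
    // Nothing is moved or written by this loop itself.
    let mut first_duplicate = 1;
    while first_duplicate < len {
        let (head, tail) = v.split_at_mut(first_duplicate);
        if same_bucket(&mut tail[0], &mut head[first_duplicate - 1]) {
            break;
        }
        first_duplicate += 1;
    }
    if first_duplicate == len {
        // Every element is kept, so the vector is left untouched.
        return;
    }

    // Cycle 2: compact the remaining elements. `write` is the next free
    // slot; the last kept element is always at `write - 1`.
    let mut write = first_duplicate;
    for read in (first_duplicate + 1)..len {
        let (head, tail) = v.split_at_mut(read);
        if !same_bucket(&mut tail[0], &mut head[write - 1]) {
            // Keep this element: move it into the free slot.
            v.swap(write, read);
            write += 1;
        }
    }
    // Everything from `write` onwards is a removed duplicate.
    v.truncate(write);
}
```

For a vector with no adjacent duplicates, only the first loop runs and the function returns without moving any element, which is the case the `bench_dedup_none_*` benchmarks measure.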
☀️ Try build successful - checks-actions
Finished benchmarking commit (d340b8b): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 674.339s -> 676.429s (0.31%)
@joshtriplett Hi, I just want to let you know that this PR is still unreviewed.
r? the8472
library/alloc/benches/vec.rs (outdated)
// Measures performance of slice dedup impl.
// "Old" implementation of Vec::dedup
I think this is a bit misleading in the context of this PR. It's "old-old" now. Can we put a version on it "from Rust 1.xx" or something like that?
library/alloc/src/vec/mod.rs (outdated)
// Check if we ever want to remove anything.
// This allows to use copy_non_overlapping in next cycle.
// And avoids any memory writes if we don't need to remove anything.
let mut possible_remove_idx = 1;
The name is a bit weird. Maybe `first_duplicate`?
f0c9274 to 964df01
@the8472 I made the changes you requested.
@bors r+
☀️ Test successful - checks-actions
Finished benchmarking commit (e9013ac): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.

Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 675.495s -> 673.642s (-0.27%)