Split `Vec::dedup_by` into 2 cycles #118273

AngelicosPhosphoros · 2023-11-25T02:28:07Z

First cycle runs until we found 2 same elements, second runs after if there any found in the first one. This allows to avoid any memory writes until we found an item which we want to remove.

This leads to significant performance gains if all Vec items are kept: -40% on my benchmark with unique integers.

Results of benchmarks before implementation (including new benchmark where nothing needs to be removed):

vec::bench_dedup_all_100 74.00ns/iter +/- 13.00ns
vec::bench_dedup_all_1000 572.00ns/iter +/- 272.00ns
vec::bench_dedup_all_100000 64.42µs/iter +/- 19.47µs
vec::bench_dedup_none_100 67.00ns/iter +/- 17.00ns
vec::bench_dedup_none_1000 662.00ns/iter +/- 86.00ns
vec::bench_dedup_none_10000 9.16µs/iter +/- 2.71µs
vec::bench_dedup_none_100000 91.25µs/iter +/- 1.82µs
vec::bench_dedup_random_100 105.00ns/iter +/- 11.00ns
vec::bench_dedup_random_1000 781.00ns/iter +/- 10.00ns
vec::bench_dedup_random_10000 9.00µs/iter +/- 5.62µs
vec::bench_dedup_random_100000 449.81µs/iter +/- 74.99µs
vec::bench_dedup_slice_truncate_100 105.00ns/iter +/- 16.00ns
vec::bench_dedup_slice_truncate_1000 2.65µs/iter +/- 481.00ns
vec::bench_dedup_slice_truncate_10000 18.33µs/iter +/- 5.23µs
vec::bench_dedup_slice_truncate_100000 501.12µs/iter +/- 46.97µs

Results after implementation:

vec::bench_dedup_all_100 75.00ns/iter +/- 9.00ns
vec::bench_dedup_all_1000 494.00ns/iter +/- 117.00ns
vec::bench_dedup_all_100000 58.13µs/iter +/- 8.78µs
vec::bench_dedup_none_100 52.00ns/iter +/- 22.00ns
vec::bench_dedup_none_1000 417.00ns/iter +/- 116.00ns
vec::bench_dedup_none_10000 4.11µs/iter +/- 546.00ns
vec::bench_dedup_none_100000 40.47µs/iter +/- 5.36µs
vec::bench_dedup_random_100 77.00ns/iter +/- 15.00ns
vec::bench_dedup_random_1000 681.00ns/iter +/- 86.00ns
vec::bench_dedup_random_10000 11.66µs/iter +/- 2.22µs
vec::bench_dedup_random_100000 469.35µs/iter +/- 20.53µs
vec::bench_dedup_slice_truncate_100 100.00ns/iter +/- 5.00ns
vec::bench_dedup_slice_truncate_1000 2.55µs/iter +/- 224.00ns
vec::bench_dedup_slice_truncate_10000 18.95µs/iter +/- 2.59µs
vec::bench_dedup_slice_truncate_100000 492.85µs/iter +/- 72.84µs

Resolves #77772

P.S. Note that this is same PR as #92104 I just missed review then forgot about it.
Also, I cannot reopen that pull request so I am creating a new one.
I responded to remaining questions directly by adding commentaries to my code.

They are for more specific cases than old benches. Also, better usage of blackbox

rustbot · 2023-11-25T02:28:15Z

r? @joshtriplett

(rustbot has picked a reviewer for you, use r? to override)

the8472 · 2023-11-25T16:31:43Z

@bors try @rust-timer queue

bors · 2023-11-25T16:32:55Z

⌛ Trying commit f0c9274 with merge d340b8b...

…rsion_77772_2, r=<try> Split `Vec::dedup_by` into 2 cycles First cycle runs until we found 2 same elements, second runs after if there any found in the first one. This allows to avoid any memory writes until we found an item which we want to remove. This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers. Results of benchmarks before implementation (including new benchmark where nothing needs to be removed): * vec::bench_dedup_all_100 74.00ns/iter +/- 13.00ns * vec::bench_dedup_all_1000 572.00ns/iter +/- 272.00ns * vec::bench_dedup_all_100000 64.42µs/iter +/- 19.47µs * __vec::bench_dedup_none_100 67.00ns/iter +/- 17.00ns__ * __vec::bench_dedup_none_1000 662.00ns/iter +/- 86.00ns__ * __vec::bench_dedup_none_10000 9.16µs/iter +/- 2.71µs__ * __vec::bench_dedup_none_100000 91.25µs/iter +/- 1.82µs__ * vec::bench_dedup_random_100 105.00ns/iter +/- 11.00ns * vec::bench_dedup_random_1000 781.00ns/iter +/- 10.00ns * vec::bench_dedup_random_10000 9.00µs/iter +/- 5.62µs * vec::bench_dedup_random_100000 449.81µs/iter +/- 74.99µs * vec::bench_dedup_slice_truncate_100 105.00ns/iter +/- 16.00ns * vec::bench_dedup_slice_truncate_1000 2.65µs/iter +/- 481.00ns * vec::bench_dedup_slice_truncate_10000 18.33µs/iter +/- 5.23µs * vec::bench_dedup_slice_truncate_100000 501.12µs/iter +/- 46.97µs Results after implementation: * vec::bench_dedup_all_100 75.00ns/iter +/- 9.00ns * vec::bench_dedup_all_1000 494.00ns/iter +/- 117.00ns * vec::bench_dedup_all_100000 58.13µs/iter +/- 8.78µs * __vec::bench_dedup_none_100 52.00ns/iter +/- 22.00ns__ * __vec::bench_dedup_none_1000 417.00ns/iter +/- 116.00ns__ * __vec::bench_dedup_none_10000 4.11µs/iter +/- 546.00ns__ * __vec::bench_dedup_none_100000 40.47µs/iter +/- 5.36µs__ * vec::bench_dedup_random_100 77.00ns/iter +/- 15.00ns * vec::bench_dedup_random_1000 681.00ns/iter +/- 86.00ns * vec::bench_dedup_random_10000 11.66µs/iter +/- 2.22µs * vec::bench_dedup_random_100000 469.35µs/iter +/- 20.53µs * vec::bench_dedup_slice_truncate_100 100.00ns/iter +/- 5.00ns * vec::bench_dedup_slice_truncate_1000 2.55µs/iter +/- 224.00ns * vec::bench_dedup_slice_truncate_10000 18.95µs/iter +/- 2.59µs * vec::bench_dedup_slice_truncate_100000 492.85µs/iter +/- 72.84µs Resolves rust-lang#77772 P.S. Note that this is same PR as rust-lang#92104 I just missed review then forgot about it. Also, I cannot reopen that pull request so I am creating a new one. I responded to remaining questions directly by adding commentaries to my code.

bors · 2023-11-25T18:01:00Z

☀️ Try build successful - checks-actions
Build commit: d340b8b (d340b8b09f67d765e5ba32ab9c2702994a63b8c3)

rust-timer · 2023-11-25T19:47:59Z

Finished benchmarking commit (d340b8b): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-1.1%	[-1.1%, -1.1%]	1
All ❌✅ (primary)	-	-	0

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.0%	[0.9%, 1.0%]	2
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-1.3%	[-1.3%, -1.3%]	1
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.2%	[-1.3%, 1.0%]	3

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 674.339s -> 676.429s (0.31%)
Artifact size: 313.28 MiB -> 313.27 MiB (-0.00%)

AngelicosPhosphoros · 2023-12-04T10:15:27Z

@joshtriplett Hi, I just want to notify that there is an unreviewed PR.

the8472 · 2023-12-04T10:27:17Z

r? the8472

the8472 · 2023-12-04T20:40:24Z

library/alloc/benches/vec.rs

+// Measures performance of slice dedup impl.
+// "Old" implementation of Vec::dedup


I think this is a bit misleading in the context of this PR. It's "old-old" now. Can we put a version on it "from Rust 1.xx" or something like that?

the8472 · 2023-12-04T20:45:20Z

library/alloc/src/vec/mod.rs

+        // Check if we ever want to remove anything.
+        // This allows to use copy_non_overlapping in next cycle.
+        // And avoids any memory writes if we don't need to remove anything.
+        let mut possible_remove_idx = 1;


The name is a bit weird. Maybe first_duplicate?

library/alloc/src/vec/mod.rs

First cycle runs until we found 2 same elements, second runs after if there any found in the first one. This allows to avoid any memory writes until we found an item which we want to remove. This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers. Results of benchmarks before implementation (including new benchmark where nothing needs to be removed): * vec::bench_dedup_all_100 74.00ns/iter +/- 13.00ns * vec::bench_dedup_all_1000 572.00ns/iter +/- 272.00ns * vec::bench_dedup_all_100000 64.42µs/iter +/- 19.47µs * __vec::bench_dedup_none_100 67.00ns/iter +/- 17.00ns__ * __vec::bench_dedup_none_1000 662.00ns/iter +/- 86.00ns__ * __vec::bench_dedup_none_10000 9.16µs/iter +/- 2.71µs__ * __vec::bench_dedup_none_100000 91.25µs/iter +/- 1.82µs__ * vec::bench_dedup_random_100 105.00ns/iter +/- 11.00ns * vec::bench_dedup_random_1000 781.00ns/iter +/- 10.00ns * vec::bench_dedup_random_10000 9.00µs/iter +/- 5.62µs * vec::bench_dedup_random_100000 449.81µs/iter +/- 74.99µs * vec::bench_dedup_slice_truncate_100 105.00ns/iter +/- 16.00ns * vec::bench_dedup_slice_truncate_1000 2.65µs/iter +/- 481.00ns * vec::bench_dedup_slice_truncate_10000 18.33µs/iter +/- 5.23µs * vec::bench_dedup_slice_truncate_100000 501.12µs/iter +/- 46.97µs Results after implementation: * vec::bench_dedup_all_100 75.00ns/iter +/- 9.00ns * vec::bench_dedup_all_1000 494.00ns/iter +/- 117.00ns * vec::bench_dedup_all_100000 58.13µs/iter +/- 8.78µs * __vec::bench_dedup_none_100 52.00ns/iter +/- 22.00ns__ * __vec::bench_dedup_none_1000 417.00ns/iter +/- 116.00ns__ * __vec::bench_dedup_none_10000 4.11µs/iter +/- 546.00ns__ * __vec::bench_dedup_none_100000 40.47µs/iter +/- 5.36µs__ * vec::bench_dedup_random_100 77.00ns/iter +/- 15.00ns * vec::bench_dedup_random_1000 681.00ns/iter +/- 86.00ns * vec::bench_dedup_random_10000 11.66µs/iter +/- 2.22µs * vec::bench_dedup_random_100000 469.35µs/iter +/- 20.53µs * vec::bench_dedup_slice_truncate_100 100.00ns/iter +/- 5.00ns * vec::bench_dedup_slice_truncate_1000 2.55µs/iter +/- 224.00ns * vec::bench_dedup_slice_truncate_10000 18.95µs/iter +/- 2.59µs * vec::bench_dedup_slice_truncate_100000 492.85µs/iter +/- 72.84µs Resolves rust-lang#77772

AngelicosPhosphoros · 2023-12-05T20:02:48Z

@the8472 I made changes you requested.

the8472 · 2023-12-05T21:04:53Z

@bors r+

bors · 2023-12-05T21:04:55Z

📌 Commit 964df01 has been approved by the8472

It is now in the queue for this repository.

bors · 2023-12-05T21:40:05Z

⌛ Testing commit 964df01 with merge e9013ac...

bors · 2023-12-05T23:36:02Z

☀️ Test successful - checks-actions
Approved by: the8472
Pushing e9013ac to master...

rust-timer · 2023-12-06T01:24:47Z

Finished benchmarking commit (e9013ac): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-0.1%	[-0.1%, -0.1%]	3
All ❌✅ (primary)	-	-	0

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-3.8%	[-3.8%, -3.8%]	1
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-3.8%	[-3.8%, -3.8%]	1

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 675.495s -> 673.642s (-0.27%)
Artifact size: 314.16 MiB -> 314.17 MiB (0.00%)

Add more benchmarks of Vec::dedup

0c6901a

They are for more specific cases than old benches. Also, better usage of blackbox

rustbot assigned joshtriplett Nov 25, 2023

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 25, 2023

This comment has been minimized.

Sign in to view

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Nov 25, 2023

This comment has been minimized.

Sign in to view

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Nov 25, 2023

rustbot assigned the8472 and unassigned joshtriplett Dec 4, 2023

the8472 reviewed Dec 4, 2023

View reviewed changes

AngelicosPhosphoros force-pushed the dedup_2_loops_version_77772_2 branch from f0c9274 to 964df01 Compare December 5, 2023 20:01

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 5, 2023

bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 5, 2023

bors merged commit e9013ac into rust-lang:master Dec 5, 2023

rustbot added this to the 1.76.0 milestone Dec 5, 2023

AngelicosPhosphoros deleted the dedup_2_loops_version_77772_2 branch December 6, 2023 21:57

		// Measures performance of slice dedup impl.
		// "Old" implementation of Vec::dedup

Split Vec::dedup_by into 2 cycles #118273

Split Vec::dedup_by into 2 cycles #118273

Uh oh!

Conversation

AngelicosPhosphoros commented Nov 25, 2023

Uh oh!

rustbot commented Nov 25, 2023

Uh oh!

the8472 commented Nov 25, 2023

Uh oh!

This comment has been minimized.

bors commented Nov 25, 2023

Uh oh!

bors commented Nov 25, 2023

Uh oh!

This comment has been minimized.

rust-timer commented Nov 25, 2023

Overall result: no relevant changes - no action needed

Uh oh!

AngelicosPhosphoros commented Dec 4, 2023

Uh oh!

the8472 commented Dec 4, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

AngelicosPhosphoros commented Dec 5, 2023

Uh oh!

the8472 commented Dec 5, 2023

Uh oh!

bors commented Dec 5, 2023

Uh oh!

bors commented Dec 5, 2023

Uh oh!

bors commented Dec 5, 2023

Uh oh!

rust-timer commented Dec 6, 2023

Overall result: ✅ improvements - no action needed

Uh oh!

Uh oh!

Split `Vec::dedup_by` into 2 cycles #118273

Split `Vec::dedup_by` into 2 cycles #118273