Split `Vec::dedup_by` into 2 cycles #92104
Conversation
r? @dtolnay (rust-highfive has picked a reviewer for you, use r? to override)
I didn't understand how to run benchmarks in the rustc codegen suite, so I wrote my own and tested on my machine (Ryzen 2700X with Turbo Boost disabled). My code:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::prelude::*;

const NUM_ITEMS: usize = 20000;

fn generate_unique_nums() -> Vec<usize> {
    (0..NUM_ITEMS).collect()
}

fn generate_unique_strings() -> Vec<String> {
    (0..NUM_ITEMS).map(|x| x.to_string()).collect()
}

fn generate_duplicates<T: Clone>(v: Vec<T>) -> Vec<T> {
    let mut rng = rand_chacha::ChaChaRng::seed_from_u64(546);
    v.into_iter()
        .flat_map(|s| {
            if rng.gen_bool(0.7) {
                vec![s]
            } else {
                vec![s.clone(), s]
            }
        })
        .take(NUM_ITEMS)
        .collect()
}

pub fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Dedup");
    group.bench_function("Unique num", |b| {
        b.iter_batched(
            || black_box(generate_unique_nums()),
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.bench_function("Duplicate num", |b| {
        b.iter_batched(
            || {
                let uniq = generate_unique_nums();
                let dup = generate_duplicates(uniq);
                black_box(dup)
            },
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.bench_function("Unique string", |b| {
        b.iter_batched(
            || black_box(generate_unique_strings()),
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.bench_function("With duplicate string", |b| {
        b.iter_batched(
            || {
                let uniq = generate_unique_strings();
                let dup = generate_duplicates(uniq);
                black_box(dup)
            },
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.bench_function("Duplicate ZSTs", |b| {
        b.iter_batched(
            || black_box(vec![(); NUM_ITEMS]),
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

My results
Also, this change increases the generated instruction count, but I think the performance gains justify it. Below is an example of the code generated for this function:

pub fn dedup(v: &mut Vec<u32>) {
    v.dedup()
}

I used command
With the old version:
.text
.def @feat.00;
.scl 3;
.type 0;
.endef
.globl @feat.00
.set @feat.00, 0
.file "dedup.f421260b-cgu.0"
.def _ZN5dedup5dedup17h0bbecfdc6851a7d9E;
.scl 2;
.type 32;
.endef
.section .text,"xr",one_only,_ZN5dedup5dedup17h0bbecfdc6851a7d9E
.globl _ZN5dedup5dedup17h0bbecfdc6851a7d9E
.p2align 4, 0x90
_ZN5dedup5dedup17h0bbecfdc6851a7d9E:
movq 16(%rcx), %rax
cmpq $2, %rax
jb .LBB0_7
movq (%rcx), %r9
leaq -1(%rax), %r8
cmpq $2, %rax
jne .LBB0_8
movl $1, %eax
movl $1, %r11d
.LBB0_3:
testb $1, %r8b
je .LBB0_6
movl (%r9,%rax,4), %eax
cmpl -4(%r9,%r11,4), %eax
je .LBB0_6
movl %eax, (%r9,%r11,4)
addq $1, %r11
.LBB0_6:
movq %r11, 16(%rcx)
.LBB0_7:
retq
.LBB0_8:
movq %r8, %r10
andq $-2, %r10
negq %r10
movl $1, %eax
movl $1, %r11d
jmp .LBB0_9
.p2align 4, 0x90
.LBB0_12:
leaq (%r10,%rax), %rdx
addq $2, %rdx
addq $2, %rax
cmpq $1, %rdx
je .LBB0_3
.LBB0_9:
movl (%r9,%rax,4), %edx
cmpl -4(%r9,%r11,4), %edx
je .LBB0_10
movl %edx, (%r9,%r11,4)
addq $1, %r11
.LBB0_10:
movl 4(%r9,%rax,4), %edx
cmpl -4(%r9,%r11,4), %edx
je .LBB0_12
movl %edx, (%r9,%r11,4)
addq $1, %r11
jmp .LBB0_12

With the new version:
.text
.def @feat.00;
.scl 3;
.type 0;
.endef
.globl @feat.00
.set @feat.00, 0
.file "dedup.f421260b-cgu.0"
.def _ZN5dedup5dedup17h0bbecfdc6851a7d9E;
.scl 2;
.type 32;
.endef
.section .text,"xr",one_only,_ZN5dedup5dedup17h0bbecfdc6851a7d9E
.globl _ZN5dedup5dedup17h0bbecfdc6851a7d9E
.p2align 4, 0x90
_ZN5dedup5dedup17h0bbecfdc6851a7d9E:
movq 16(%rcx), %r8
cmpq $2, %r8 ; Check length of input
jb .LBB0_14
movq (%rcx), %r10
movl (%r10), %eax
leaq -1(%r8), %r9
xorl %edx, %edx
.p2align 4, 0x90
.LBB0_2: ; First loop header
movl %eax, %r11d
movl 4(%r10,%rdx,4), %eax ; Read item from vec to register
cmpl %eax, %r11d
je .LBB0_3 ; Jump to code which handle item removes
addq $1, %rdx
cmpq %rdx, %r9
jne .LBB0_2 ; If we finished loop and didn't find duplicate, return
.LBB0_14:
retq
.LBB0_3:
leaq 2(%rdx), %r11
leaq 1(%rdx), %r9
cmpq %r11, %r8
jbe .LBB0_13
movl %r8d, %eax
subl %edx, %eax
addl $-2, %eax
testb $1, %al
je .LBB0_8
movl 8(%r10,%rdx,4), %eax
cmpl (%r10,%rdx,4), %eax
je .LBB0_7
movl %eax, 4(%r10,%rdx,4)
leaq 2(%rdx), %r9
.LBB0_7:
leaq 3(%rdx), %r11
.LBB0_8:
leaq -3(%r8), %rax
cmpq %rdx, %rax
jne .LBB0_9
.LBB0_13:
movq %r9, 16(%rcx)
retq
.p2align 4, 0x90
.LBB0_12:
addq $2, %r11
cmpq %r11, %r8
je .LBB0_13
.LBB0_9:
movl (%r10,%r11,4), %edx
cmpl -4(%r10,%r9,4), %edx
jne .LBB0_16
movl 4(%r10,%r11,4), %edx
cmpl -4(%r10,%r9,4), %edx
je .LBB0_12
jmp .LBB0_11
.p2align 4, 0x90
.LBB0_16:
movl %edx, (%r10,%r9,4)
addq $1, %r9
movl 4(%r10,%r11,4), %edx
cmpl -4(%r10,%r9,4), %edx
je .LBB0_12
.LBB0_11:
movl %edx, (%r10,%r9,4)
addq $1, %r9
jmp .LBB0_12

It can be seen from the ASM above that there are no writes to memory in the first loop.
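The two-phase structure can be sketched in safe Rust. This is a simplified illustration of the idea, not the actual `Vec::dedup_by` implementation (which works with raw pointers and a `same_bucket` closure); `dedup_two_phase` is my own name.

```rust
/// Phase 1 only reads until the first pair of equal neighbours is found,
/// so the common "no duplicates" case performs no memory writes at all.
/// Phase 2 is a classic read/write compaction starting at that point.
fn dedup_two_phase<T: PartialEq>(v: &mut Vec<T>) {
    let len = v.len();
    if len < 2 {
        return;
    }
    // Phase 1: scan for the first duplicate pair, reads only.
    let mut first_dup = None;
    for i in 1..len {
        if v[i] == v[i - 1] {
            first_dup = Some(i);
            break;
        }
    }
    let mut write = match first_dup {
        Some(i) => i,
        None => return, // nothing to remove, vector untouched
    };
    // Phase 2: shift retained elements left; duplicates accumulate in the
    // tail and are dropped by `truncate`.
    for read in write + 1..len {
        if v[read] != v[write - 1] {
            v.swap(write, read);
            write += 1;
        }
    }
    v.truncate(write);
}
```

Swapping instead of overwriting keeps every element alive until `truncate`, which matters for types with non-trivial `Drop`.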
I also added some benchmarks for specific cases to the Vec suite, but I failed to execute them.
For walltime benchmarks of library functions there are the benches directories. E.g.
@the8472 Thank you for your help. Actually, I don't know which case is better to optimize. P.S. There are also criterion benchmark results which show a clear win for the case when there is nothing to remove.
Presumably the benchmark tests you added are designed to show the benefit of the changes? Could you run those new benchmarks with and without the code changes to demonstrate the benefit? (It would also be useful to comment somewhere in the benches to explain what aspect of dedup the benchmarks are testing and what none/all/old/new in the names means.)
They cover more specific cases than the old benches. They also make better use of black_box.
I rewrote the benchmarks and split my commit into 2 parts. Here are the results. The affected benchmarks are:
Old code benchmark
New code benchmark
It seems I managed to significantly improve
JFYI: I won't be able to participate until January 15th. I will return to this PR after that if needed.
@dtolnay Could you please review my PR? :)
library/alloc/src/vec/mod.rs
Outdated
let current = start.add(possible_remove_idx);
same_bucket(&mut *current, &mut *prev)
};
if need_drop { |
I feel like need_drop is somewhat confusing here -- can we call this has_duplicate?
Changed to found_duplicate.
This looks pretty good to me. That said, I think if we want to optimize further it'd be good to enhance the benchmark suite with a non-trivial Drop impl (e.g., a …).

I did find it a little odd that the current impl (and the new impl) both end up copying every single element in the gaps between duplicate pairs (e.g., with …).

It seems like the optimization this PR suggests would be good to move into the "slow" loop, so instead of just optimizing the "no duplicates" case, we never have this sort of 1-element shuffle going on. That does likely hurt the case where there are a lot of small gaps between duplicate elements, but that case is presumably somewhat rare.

I think it should be possible to rewrite the core loop here to essentially: search for a range such that …. This basically turns the current (or after this PR) single-element shuffles into drops + copies of ranges, which seems likely to optimize better than the current strategy, though obviously it would need benchmarking. It seems likely to be a big win for the "rare duplicates" case (or mostly-duplicates case) -- similar to the win seen in this PR with the "no duplicates" case. It might also let us avoid the whole fill-gap-on-drop abstraction, since vec.drain already does that (right?).

I'm happy to r+ this if you want to leave this further optimization to a future PR (seems like an issue might be in order; the design sketch means it's probably relatively 'medium hard' for someone to pick up) -- but I think it would simplify the code pretty nicely, and leave the unsafe bits mostly just to roughly a pick2_mut.
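A rough sketch of my reading of the range-based suggestion, using `Vec::drain` to remove each run of consecutive duplicates (`dedup_ranges` is a hypothetical name, not proposed code). Note this naive per-run `drain` shifts the whole tail every time, so it is O(n²) in the worst case; the single-pass range-copy variant described above would avoid that.

```rust
/// Remove runs of consecutive duplicates range-by-range instead of
/// element-by-element. `drain` both drops the removed elements and
/// closes the gap (even on panic/drop), which is the "fill gap on drop"
/// behaviour mentioned in the comment.
fn dedup_ranges<T: PartialEq>(v: &mut Vec<T>) {
    let mut i = 1;
    while i < v.len() {
        if v[i] == v[i - 1] {
            // Extend the range while elements equal the retained one.
            let start = i;
            let mut end = i + 1;
            while end < v.len() && v[end] == v[start - 1] {
                end += 1;
            }
            v.drain(start..end); // drops duplicates, shifts the tail left
        } else {
            i += 1;
        }
    }
}
```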
The first cycle runs until we find 2 equal elements; the second runs afterwards if any were found in the first one. This lets us avoid any memory writes until we find an item we want to remove. This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.
Your proposed optimization requires more benchmarking and some effort that I can't put in immediately, so I would leave it to a future PR. I added a link to your comment in the linked issue to avoid losing it. P.S. We probably won't be able to use
Ping from triage: FYI, when a PR is ready for review, post a message containing
@AngelicosPhosphoros @rustbot label: +S-inactive
I think this was waiting on review and was not labeled correctly.
black_box(vec.first());
// Unlike other benches of `dedup` | ||
// this doesn't reinitialize vec | ||
// because we measure how effecient dedup is |
// because we measure how effecient dedup is
// because we measure how efficient dedup is |
// SAFETY: possible_remove_idx always in range [1..len)
let prev = start.add(possible_remove_idx - 1);
let current = start.add(possible_remove_idx);
same_bucket(&mut *current, &mut *prev)
If I'm following the old code right, it looks like this swapped the ordering -- we used to pass (0, 1), (1, 2), (2, 3), etc., whereas this now passes them as (1, 0), (2, 1), ... -- it seems like we can probably just swap current and prev here?
It would be great to add a test for this, too.
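A test along these lines could pin down the argument order (a sketch, not the test that was added; `dedup_successors` is a hypothetical name). `Vec::dedup_by` documents that the elements are passed in the opposite order from the slice, i.e. `same_bucket(current, previous)`, and the predicate here is deliberately asymmetric so a swapped order changes the result.

```rust
/// Remove `current` when it is exactly previous + 1. Because the
/// predicate is not symmetric, this distinguishes (current, previous)
/// from (previous, current) argument order.
fn dedup_successors(mut v: Vec<i32>) -> Vec<i32> {
    v.dedup_by(|cur, prev| *cur == *prev + 1);
    v
}
```

With the documented order, `dedup_successors(vec![1, 2, 4, 3])` removes the 2 (it follows 1) and keeps the rest, yielding `[1, 4, 3]`; with swapped arguments the 3 would be removed instead.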
I thought for some reason that this was merged already.
…rsion_77772_2, r=<try> Split `Vec::dedup_by` into 2 cycles

First cycle runs until we find 2 equal elements; the second runs afterwards if any were found in the first one. This lets us avoid any memory writes until we find an item we want to remove. This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.

Results of benchmarks before implementation (including new benchmark where nothing needs to be removed):
* vec::bench_dedup_all_100 74.00ns/iter +/- 13.00ns
* vec::bench_dedup_all_1000 572.00ns/iter +/- 272.00ns
* vec::bench_dedup_all_100000 64.42µs/iter +/- 19.47µs
* __vec::bench_dedup_none_100 67.00ns/iter +/- 17.00ns__
* __vec::bench_dedup_none_1000 662.00ns/iter +/- 86.00ns__
* __vec::bench_dedup_none_10000 9.16µs/iter +/- 2.71µs__
* __vec::bench_dedup_none_100000 91.25µs/iter +/- 1.82µs__
* vec::bench_dedup_random_100 105.00ns/iter +/- 11.00ns
* vec::bench_dedup_random_1000 781.00ns/iter +/- 10.00ns
* vec::bench_dedup_random_10000 9.00µs/iter +/- 5.62µs
* vec::bench_dedup_random_100000 449.81µs/iter +/- 74.99µs
* vec::bench_dedup_slice_truncate_100 105.00ns/iter +/- 16.00ns
* vec::bench_dedup_slice_truncate_1000 2.65µs/iter +/- 481.00ns
* vec::bench_dedup_slice_truncate_10000 18.33µs/iter +/- 5.23µs
* vec::bench_dedup_slice_truncate_100000 501.12µs/iter +/- 46.97µs

Results after implementation:
* vec::bench_dedup_all_100 75.00ns/iter +/- 9.00ns
* vec::bench_dedup_all_1000 494.00ns/iter +/- 117.00ns
* vec::bench_dedup_all_100000 58.13µs/iter +/- 8.78µs
* __vec::bench_dedup_none_100 52.00ns/iter +/- 22.00ns__
* __vec::bench_dedup_none_1000 417.00ns/iter +/- 116.00ns__
* __vec::bench_dedup_none_10000 4.11µs/iter +/- 546.00ns__
* __vec::bench_dedup_none_100000 40.47µs/iter +/- 5.36µs__
* vec::bench_dedup_random_100 77.00ns/iter +/- 15.00ns
* vec::bench_dedup_random_1000 681.00ns/iter +/- 86.00ns
* vec::bench_dedup_random_10000 11.66µs/iter +/- 2.22µs
* vec::bench_dedup_random_100000 469.35µs/iter +/- 20.53µs
* vec::bench_dedup_slice_truncate_100 100.00ns/iter +/- 5.00ns
* vec::bench_dedup_slice_truncate_1000 2.55µs/iter +/- 224.00ns
* vec::bench_dedup_slice_truncate_10000 18.95µs/iter +/- 2.59µs
* vec::bench_dedup_slice_truncate_100000 492.85µs/iter +/- 72.84µs

Resolves rust-lang#77772

P.S. Note that this is the same PR as rust-lang#92104; I just missed review and then forgot about it. Also, I cannot reopen that pull request, so I am creating a new one. I responded to the remaining questions directly by adding commentaries to my code.
…rsion_77772_2, r=the8472 Split `Vec::dedup_by` into 2 cycles
The first cycle runs until we find 2 equal elements; the second runs afterwards if any were found in the first one. This lets us avoid any memory writes until we find an item we want to remove.

This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.
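For reference, the inputs the none/all/random benchmark names refer to behave like this (my reading of the names; `dedup_demo` is a hypothetical helper, and `dedup` removes only *consecutive* duplicates):

```rust
/// Thin wrapper so the three benchmark shapes can be shown side by side.
fn dedup_demo(mut v: Vec<i32>) -> Vec<i32> {
    v.dedup();
    v
}

// "none":   dedup_demo(vec![1, 2, 3, 4]) keeps all four elements;
//           this is the fast path where the first loop never writes.
// "all":    dedup_demo(vec![7, 7, 7, 7, 7]) collapses to [7].
// "random": dedup_demo(vec![1, 1, 2, 1]) gives [1, 2, 1]; the trailing 1
//           survives because it is not adjacent to the first run.
```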