Split Vec::dedup_by into 2 cycles #92104

Closed
wants to merge 2 commits into from

Conversation

Contributor

@AngelicosPhosphoros AngelicosPhosphoros commented Dec 19, 2021

The first cycle runs until two equal adjacent elements are found; the second runs afterwards, only if the first cycle found any. This avoids any memory writes until we find an item which we want to remove.

This leads to significant performance gains if all Vec items are kept: -40% on my benchmark with unique integers.
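In Rust pseudocode, the two-cycle idea looks roughly like this (a safe-code illustration with hypothetical names, not the actual raw-pointer implementation in this PR):

```rust
// Illustrative sketch of the two-cycle dedup: a read-only scan first,
// then a compacting pass only if a duplicate was actually found.
fn dedup_two_cycles<T: PartialEq>(v: &mut Vec<T>) {
    let len = v.len();
    // First cycle: read-only scan until the first adjacent duplicate.
    let mut first_dup = None;
    for i in 1..len {
        if v[i] == v[i - 1] {
            first_dup = Some(i);
            break;
        }
    }
    // Second cycle: only once a duplicate exists do we start writing.
    if let Some(mut write) = first_dup {
        let mut read = write + 1;
        while read < len {
            // Compare against the last kept element, as `dedup` does.
            if v[read] != v[write - 1] {
                v.swap(write, read);
                write += 1;
            }
            read += 1;
        }
        v.truncate(write);
    }
}

fn main() {
    let mut v = vec![1, 1, 2, 2, 3];
    dedup_two_cycles(&mut v);
    assert_eq!(v, [1, 2, 3]);

    // No duplicates: the first cycle finishes without any writes.
    let mut u = vec![1, 2, 3];
    dedup_two_cycles(&mut u);
    assert_eq!(u, [1, 2, 3]);
}
```

When the input has no duplicates, the second cycle never runs, which is where the claimed -40% comes from.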

@rust-highfive
Contributor

r? @dtolnay

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 19, 2021
@AngelicosPhosphoros
Contributor Author

I didn't understand how to run benchmarks in the rustc codegen suite, so I wrote my own and tested on my machine (Ryzen 2700X with Turbo Boost disabled).

My code
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::prelude::*;

const NUM_ITEMS: usize = 20000;

fn generate_unique_nums() -> Vec<usize> {
    (0..NUM_ITEMS).collect()
}

fn generate_unique_strings() -> Vec<String> {
    (0..NUM_ITEMS).map(|x| x.to_string()).collect()
}

fn generate_duplicates<T: Clone>(v: Vec<T>) -> Vec<T> {
    let mut rng = rand_chacha::ChaChaRng::seed_from_u64(546);
    v.into_iter()
        .flat_map(|s| {
            if rng.gen_bool(0.7) {
                vec![s]
            } else {
                vec![s.clone(), s]
            }
        })
        .take(NUM_ITEMS)
        .collect()
}

pub fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Dedup");

    group.bench_function("Unique num", |b| {
        b.iter_batched(
            || black_box(generate_unique_nums()),
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });

    group.bench_function("Duplicate num", |b| {
        b.iter_batched(
            || {
                let uniq = generate_unique_nums();
                let dup = generate_duplicates(uniq);
                black_box(dup)
            },
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });

    group.bench_function("Unique string", |b| {
        b.iter_batched(
            || black_box(generate_unique_strings()),
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });

    group.bench_function("With duplicate string", |b| {
        b.iter_batched(
            || {
                let uniq = generate_unique_strings();
                let dup = generate_duplicates(uniq);
                black_box(dup)
            },
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });

    group.bench_function("Duplicate ZSTs", |b| {
        b.iter_batched(
            || black_box(vec![(); NUM_ITEMS]),
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });

    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
My results
Dedup/Unique num        time:   [5.8258 us 5.8452 us 5.8690 us]
                        change: [-41.080% -40.406% -39.715%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking Dedup/Duplicate num: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.9s, enable flat sampling, or reduce sample count to 60.
Dedup/Duplicate num     time:   [46.327 us 46.383 us 46.439 us]
                        change: [-4.1701% -3.5954% -2.9625%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
Benchmarking Dedup/Unique string: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.6s, enable flat sampling, or reduce sample count to 50.
Dedup/Unique string     time:   [114.63 us 115.15 us 115.74 us]
                        change: [+0.9217% +1.2801% +1.6846%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe
Dedup/With duplicate string
                        time:   [275.14 us 275.25 us 275.38 us]
                        change: [+0.1959% +0.2757% +0.3485%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Dedup/Duplicate ZSTs    time:   [1.0923 ns 1.0948 ns 1.0980 ns]
                        change: [-3.3687% +0.3632% +4.2486%] (p = 0.86 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

Also, this change increases the generated instruction count, but I think the performance gains justify it.

Below is an example of the code generated for this function:

pub fn dedup(v: &mut Vec<u32>){
    v.dedup()
}

I used the command rustc +with_my_fixes .\dedup.rs -Copt-level=3 --emit=asm --crate-type=rlib.

With old version
    .text
    .def     @feat.00;
    .scl    3;
    .type   0;
    .endef
    .globl  @feat.00
.set @feat.00, 0
    .file   "dedup.f421260b-cgu.0"
    .def     _ZN5dedup5dedup17h0bbecfdc6851a7d9E;
    .scl    2;
    .type   32;
    .endef
    .section    .text,"xr",one_only,_ZN5dedup5dedup17h0bbecfdc6851a7d9E
    .globl  _ZN5dedup5dedup17h0bbecfdc6851a7d9E
    .p2align    4, 0x90
_ZN5dedup5dedup17h0bbecfdc6851a7d9E:
    movq    16(%rcx), %rax
    cmpq    $2, %rax
    jb  .LBB0_7
    movq    (%rcx), %r9
    leaq    -1(%rax), %r8
    cmpq    $2, %rax
    jne .LBB0_8
    movl    $1, %eax
    movl    $1, %r11d
.LBB0_3:
    testb   $1, %r8b
    je  .LBB0_6
    movl    (%r9,%rax,4), %eax
    cmpl    -4(%r9,%r11,4), %eax
    je  .LBB0_6
    movl    %eax, (%r9,%r11,4)
    addq    $1, %r11
.LBB0_6:
    movq    %r11, 16(%rcx)
.LBB0_7:
    retq
.LBB0_8:
    movq    %r8, %r10
    andq    $-2, %r10
    negq    %r10
    movl    $1, %eax
    movl    $1, %r11d
    jmp .LBB0_9
    .p2align    4, 0x90
.LBB0_12:
    leaq    (%r10,%rax), %rdx
    addq    $2, %rdx
    addq    $2, %rax
    cmpq    $1, %rdx
    je  .LBB0_3
.LBB0_9:
    movl    (%r9,%rax,4), %edx
    cmpl    -4(%r9,%r11,4), %edx
    je  .LBB0_10
    movl    %edx, (%r9,%r11,4)
    addq    $1, %r11
.LBB0_10:
    movl    4(%r9,%rax,4), %edx
    cmpl    -4(%r9,%r11,4), %edx
    je  .LBB0_12
    movl    %edx, (%r9,%r11,4)
    addq    $1, %r11
    jmp .LBB0_12
With new version
    .text
    .def     @feat.00;
    .scl    3;
    .type   0;
    .endef
    .globl  @feat.00
.set @feat.00, 0
    .file   "dedup.f421260b-cgu.0"
    .def     _ZN5dedup5dedup17h0bbecfdc6851a7d9E;
    .scl    2;
    .type   32;
    .endef
    .section    .text,"xr",one_only,_ZN5dedup5dedup17h0bbecfdc6851a7d9E
    .globl  _ZN5dedup5dedup17h0bbecfdc6851a7d9E
    .p2align    4, 0x90
_ZN5dedup5dedup17h0bbecfdc6851a7d9E:
    movq    16(%rcx), %r8
    cmpq    $2, %r8                ; Check length of input
    jb  .LBB0_14
    movq    (%rcx), %r10
    movl    (%r10), %eax
    leaq    -1(%r8), %r9
    xorl    %edx, %edx
    .p2align    4, 0x90
.LBB0_2:                           ; First loop header
    movl    %eax, %r11d
    movl    4(%r10,%rdx,4), %eax   ; Read item from vec to register
    cmpl    %eax, %r11d
    je  .LBB0_3                    ; Jump to code which handle item removes
    addq    $1, %rdx
    cmpq    %rdx, %r9
    jne .LBB0_2                    ; If we finished loop and didn't find duplicate, return
.LBB0_14:
    retq
.LBB0_3:
    leaq    2(%rdx), %r11
    leaq    1(%rdx), %r9
    cmpq    %r11, %r8
    jbe .LBB0_13
    movl    %r8d, %eax
    subl    %edx, %eax
    addl    $-2, %eax
    testb   $1, %al
    je  .LBB0_8
    movl    8(%r10,%rdx,4), %eax
    cmpl    (%r10,%rdx,4), %eax
    je  .LBB0_7
    movl    %eax, 4(%r10,%rdx,4)
    leaq    2(%rdx), %r9
.LBB0_7:
    leaq    3(%rdx), %r11
.LBB0_8:
    leaq    -3(%r8), %rax
    cmpq    %rdx, %rax
    jne .LBB0_9
.LBB0_13:
    movq    %r9, 16(%rcx)
    retq
    .p2align    4, 0x90
.LBB0_12:
    addq    $2, %r11
    cmpq    %r11, %r8
    je  .LBB0_13
.LBB0_9:
    movl    (%r10,%r11,4), %edx
    cmpl    -4(%r10,%r9,4), %edx
    jne .LBB0_16
    movl    4(%r10,%r11,4), %edx
    cmpl    -4(%r10,%r9,4), %edx
    je  .LBB0_12
    jmp .LBB0_11
    .p2align    4, 0x90
.LBB0_16:
    movl    %edx, (%r10,%r9,4)
    addq    $1, %r9
    movl    4(%r10,%r11,4), %edx
    cmpl    -4(%r10,%r9,4), %edx
    je  .LBB0_12
.LBB0_11:
    movl    %edx, (%r10,%r9,4)
    addq    $1, %r9
    jmp .LBB0_12

It can be seen from the ASM above that there are no memory writes in the first loop.

@AngelicosPhosphoros
Contributor Author

I also added some benchmarks for specific cases to the Vec suite, but I wasn't able to run them.

@the8472
Member

the8472 commented Dec 19, 2021

I didn't understand how to run benchmarks in rustc codegen suite

For walltime benchmarks of library functions there are the benches directories, e.g. library/alloc/benches. You can run them with ./x.py bench library/alloc --stage 0 --test-args <bench name>

@AngelicosPhosphoros
Contributor Author

AngelicosPhosphoros commented Dec 19, 2021

OLD

test vec::bench_dedup_new_100 ... bench: 59 ns/iter (+/- 1) = 6779 MB/s
test vec::bench_dedup_new_1000 ... bench: 781 ns/iter (+/- 30) = 5121 MB/s
test vec::bench_dedup_new_10000 ... bench: 9,974 ns/iter (+/- 38) = 4010 MB/s
test vec::bench_dedup_new_100000 ... bench: 367,460 ns/iter (+/- 1,545) = 1088 MB/s

test vec::bench_dedup_old_100 ... bench: 97 ns/iter (+/- 8) = 4123 MB/s
test vec::bench_dedup_old_1000 ... bench: 945 ns/iter (+/- 10) = 4232 MB/s
test vec::bench_dedup_old_10000 ... bench: 14,292 ns/iter (+/- 145) = 2798 MB/s
test vec::bench_dedup_old_100000 ... bench: 387,502 ns/iter (+/- 17,975) = 1032 MB/s

NEW

test vec::bench_dedup_all_100 ... bench: 47 ns/iter (+/- 2) = 8510 MB/s
test vec::bench_dedup_all_1000 ... bench: 355 ns/iter (+/- 5) = 11267 MB/s
test vec::bench_dedup_all_10000 ... bench: 3,713 ns/iter (+/- 20) = 10772 MB/s
test vec::bench_dedup_all_100000 ... bench: 36,585 ns/iter (+/- 225) = 10933 MB/s
test vec::bench_dedup_new_100 ... bench: 55 ns/iter (+/- 1) = 7272 MB/s
test vec::bench_dedup_new_1000 ... bench: 761 ns/iter (+/- 12) = 5256 MB/s
test vec::bench_dedup_new_10000 ... bench: 10,900 ns/iter (+/- 948) = 3669 MB/s
test vec::bench_dedup_new_100000 ... bench: 374,950 ns/iter (+/- 2,516) = 1066 MB/s

test vec::bench_dedup_none_100 ... bench: 55 ns/iter (+/- 0) = 7272 MB/s
test vec::bench_dedup_none_1000 ... bench: 361 ns/iter (+/- 2) = 11080 MB/s
test vec::bench_dedup_none_10000 ... bench: 3,811 ns/iter (+/- 23) = 10495 MB/s
test vec::bench_dedup_none_100000 ... bench: 37,387 ns/iter (+/- 496) = 10698 MB/s
test vec::bench_dedup_old_100 ... bench: 91 ns/iter (+/- 3) = 4395 MB/s
test vec::bench_dedup_old_1000 ... bench: 919 ns/iter (+/- 31) = 4352 MB/s
test vec::bench_dedup_old_10000 ... bench: 14,568 ns/iter (+/- 178) = 2745 MB/s
test vec::bench_dedup_old_100000 ... bench: 386,705 ns/iter (+/- 5,915) = 1034 MB/s

@the8472 Thank you for your help.
It seems that my optimization is working, but these benches don't catch it because all of them test the situation where there are items to remove.
For smaller inputs the new version is faster because a larger share of the data belongs to the first loop (the first X unique items), while the performance of the second loop should stay nearly the same. On larger inputs, almost all the time is spent in the second loop, and the bigger code size makes it slower.

I don't know which case is better to optimize, actually.

P.S. There are also the criterion benchmark results, which show a clear win for the case where there is nothing to remove.

@nrc
Member

nrc commented Dec 20, 2021

It seems that my optimization is working, but these benches don't catch it because all of them test the situation where there are items to remove.

Presumably the benchmark tests you added are designed to show the benefit of the changes? Could you run those new benchmarks with and without the code changes to demonstrate the benefit of the changes?

(It would also be useful to comment somewhere in the benches to explain what aspect of dedup the benchmarks are testing and what none/all/old/new in the names mean.)

@nrc nrc added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Dec 20, 2021
They are for more specific cases than old benches.
Also, better usage of blackbox
@AngelicosPhosphoros
Contributor Author

AngelicosPhosphoros commented Dec 20, 2021

@the8472 @nrc

I rewrote the benchmarks and split my commit into 2 parts.
My optimization is mostly intended to improve vec::bench_dedup_none.

Here are the results. The affected benchmarks are vec::bench_dedup_all, vec::bench_dedup_none, and vec::bench_dedup_random; vec::bench_dedup_slice_truncate is included to detect possible measurement errors.

Old code benchmark

test vec::bench_dedup_all_100                ... bench:          57 ns/iter (+/- 2) = 7017 MB/s
test vec::bench_dedup_all_1000               ... bench:         394 ns/iter (+/- 5) = 10152 MB/s
test vec::bench_dedup_all_10000              ... bench:       4,019 ns/iter (+/- 6) = 9952 MB/s
test vec::bench_dedup_all_100000             ... bench:      39,567 ns/iter (+/- 183) = 10109 MB/s
test vec::bench_dedup_none_100               ... bench:          56 ns/iter (+/- 0) = 7142 MB/s
test vec::bench_dedup_none_1000              ... bench:         486 ns/iter (+/- 2) = 8230 MB/s
test vec::bench_dedup_none_10000             ... bench:       4,824 ns/iter (+/- 12) = 8291 MB/s
test vec::bench_dedup_none_100000            ... bench:      48,140 ns/iter (+/- 95) = 8309 MB/s
test vec::bench_dedup_random_100             ... bench:          64 ns/iter (+/- 3) = 6250 MB/s
test vec::bench_dedup_random_1000            ... bench:         779 ns/iter (+/- 11) = 5134 MB/s
test vec::bench_dedup_random_10000           ... bench:       9,968 ns/iter (+/- 43) = 4012 MB/s
test vec::bench_dedup_random_100000          ... bench:     365,905 ns/iter (+/- 1,113) = 1093 MB/s
test vec::bench_dedup_slice_truncate_100     ... bench:          91 ns/iter (+/- 4) = 4395 MB/s
test vec::bench_dedup_slice_truncate_1000    ... bench:         766 ns/iter (+/- 39) = 5221 MB/s
test vec::bench_dedup_slice_truncate_10000   ... bench:      14,793 ns/iter (+/- 109) = 2703 MB/s
test vec::bench_dedup_slice_truncate_100000  ... bench:     402,150 ns/iter (+/- 10,161) = 994 MB/s

New code benchmark

test vec::bench_dedup_all_100                ... bench:          51 ns/iter (+/- 1) = 7843 MB/s
test vec::bench_dedup_all_1000               ... bench:         359 ns/iter (+/- 129) = 11142 MB/s
test vec::bench_dedup_all_10000              ... bench:       3,562 ns/iter (+/- 8) = 11229 MB/s
test vec::bench_dedup_all_100000             ... bench:      36,504 ns/iter (+/- 83) = 10957 MB/s
test vec::bench_dedup_none_100               ... bench:          44 ns/iter (+/- 0) = 9090 MB/s
test vec::bench_dedup_none_1000              ... bench:         288 ns/iter (+/- 1) = 13888 MB/s
test vec::bench_dedup_none_10000             ... bench:       2,752 ns/iter (+/- 28) = 14534 MB/s
test vec::bench_dedup_none_100000            ... bench:      29,269 ns/iter (+/- 1,082) = 13666 MB/s
test vec::bench_dedup_random_100             ... bench:          58 ns/iter (+/- 4) = 6896 MB/s
test vec::bench_dedup_random_1000            ... bench:         775 ns/iter (+/- 17) = 5161 MB/s
test vec::bench_dedup_random_10000           ... bench:      10,686 ns/iter (+/- 979) = 3743 MB/s
test vec::bench_dedup_random_100000          ... bench:     374,580 ns/iter (+/- 1,192) = 1067 MB/s
test vec::bench_dedup_slice_truncate_100     ... bench:          93 ns/iter (+/- 0) = 4301 MB/s
test vec::bench_dedup_slice_truncate_1000    ... bench:         970 ns/iter (+/- 5) = 4123 MB/s
test vec::bench_dedup_slice_truncate_10000   ... bench:      15,095 ns/iter (+/- 57) = 2649 MB/s
test vec::bench_dedup_slice_truncate_100000  ... bench:     389,795 ns/iter (+/- 293) = 1026 MB/s

It seems I managed to significantly improve vec::bench_dedup_none_100000 without hurting the others.

@AngelicosPhosphoros
Contributor Author

JFYI: I won't be able to participate until January 15th. I will return to this PR after that if needed.

@AngelicosPhosphoros
Contributor Author

@dtolnay Could you review my PR, please? :)

let current = start.add(possible_remove_idx);
same_bucket(&mut *current, &mut *prev)
};
if need_drop {
Member

I feel like need_drop is somewhat confusing here -- can we call this has_duplicate?

Contributor Author

Changed to found_duplicate.

@Mark-Simulacrum
Member

This looks pretty good to me. That said, I think if we want to optimize further it'd be good to enhance the benchmark suite with a non-trivial Drop impl (e.g., a Vec<String> or so).

I did find it a little odd that both the current impl and the new impl end up copying every single element in the gaps between duplicate pairs (e.g., with 0 1 1 2 3 4 5 5, the middle piece gets shuffled back one element at a time instead of a single larger memcpy moving the whole section).

It seems like the optimization this PR suggests would be good to move into the "slow" loop, so instead of just optimizing the "no duplicates" case, we never have this sort of 1-element shuffle going on. That does likely hurt the case where there's a lot of small gaps between duplicate elements, but that case is presumably somewhat rare. I think it should be possible to rewrite the core loop here to essentially have: search for a range such that same_bucket(idx1 - 1, idx1) and !same_bucket(idx2 - 1, idx2), where the last defaults to the length of the vector if not found, and then drain(idx1..idx2), repeating until the "defaults to the length of the vector" condition is reached.

This basically turns the current (or post-PR) single-element shuffles into drops + copies of ranges, which seems likely to optimize better than the current strategy, though it would obviously need benchmarking. It seems likely to be a big win for the "rare duplicates" case (or the mostly-duplicates case) -- similar to the win seen in this PR for the "no duplicates" case. It might also let us avoid the whole fill-gap-on-drop abstraction, since vec.drain already does that (right?).
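A rough sketch of this range-based strategy, assuming comparisons stay against the last kept element as `dedup_by` specifies (the function name and structure here are hypothetical, not the reviewer's exact design):

```rust
// Hypothetical sketch of the range-based dedup: locate each maximal
// run of duplicates and remove it with `drain`, so dropping and
// shifting happen once per range rather than once per element.
fn dedup_by_drain<T>(v: &mut Vec<T>, mut same_bucket: impl FnMut(&T, &T) -> bool) {
    let mut i = 1;
    while i < v.len() {
        if same_bucket(&v[i], &v[i - 1]) {
            // Extend the duplicate run as far as it goes,
            // always comparing against the last kept element.
            let mut j = i + 1;
            while j < v.len() && same_bucket(&v[j], &v[i - 1]) {
                j += 1;
            }
            // One drop-and-shift for the whole range.
            v.drain(i..j);
        } else {
            i += 1;
        }
    }
}

fn main() {
    let mut v = vec![0, 1, 1, 2, 3, 4, 5, 5];
    dedup_by_drain(&mut v, |a, b| a == b);
    assert_eq!(v, [0, 1, 2, 3, 4, 5]);
}
```

Note that this sketch still memmoves the tail once per `drain` call, so it is a starting point for benchmarking rather than a finished design.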

I'm happy to r+ this if you want to leave this further optimization to a future PR (seems like an issue might be in order; the design sketch means it's probably relatively 'medium hard' for someone to pick up) -- but I think it would simplify the code pretty nicely, and leave the unsafe bits mostly just to roughly a pick2_mut.

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 17, 2022
First cycle runs until two equal adjacent elements are found; the second runs afterwards, only if the first cycle found any. This avoids any memory writes until we find an item which we want to remove.

This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.
@AngelicosPhosphoros
Contributor Author

AngelicosPhosphoros commented Jan 17, 2022

if you want to leave this further optimization to a future PR

The optimization you propose requires more benchmarking and effort than I can put in immediately, so I will leave it to a future PR.

I added a link to your comment in the linked issue to avoid losing it.

P.S. We probably won't be able to use drain directly, because then dedup would move the tail items multiple times (e.g., for 1 1 1 5 6 7 7 7 8 9 10 we would move items 8 9 10 twice), but we can try to use 2 loops (a removal loop and a preserve loop) inside an outer loop, which would probably be friendlier to the branch predictor for runs of removed and runs of preserved items.
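That two-inner-loop shape could look roughly like this (an illustrative safe-code sketch with hypothetical names; a real implementation would use raw pointers as in the PR):

```rust
// Sketch of the "removal loop + preserve loop" idea: inside the outer
// loop, first skip a run of duplicates, then keep a run of distinct
// elements, so each inner loop's branch stays predictable per run.
fn dedup_runs<T: PartialEq>(v: &mut Vec<T>) {
    let len = v.len();
    if len < 2 {
        return;
    }
    let mut write = 1;
    let mut read = 1;
    while read < len {
        // Removal loop: skip elements equal to the last kept one.
        while read < len && v[read] == v[write - 1] {
            read += 1;
        }
        // Preserve loop: keep elements until the next duplicate.
        while read < len && v[read] != v[write - 1] {
            v.swap(write, read);
            write += 1;
            read += 1;
        }
    }
    v.truncate(write);
}

fn main() {
    let mut v = vec![1, 1, 1, 5, 6, 7, 7, 7, 8, 9, 10];
    dedup_runs(&mut v);
    assert_eq!(v, [1, 5, 6, 7, 8, 9, 10]);
}
```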

@JohnCSimon JohnCSimon added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 6, 2022
@JohnCSimon
Member

Ping from triage:
@AngelicosPhosphoros Can you post the status of this PR?

FYI: when a PR is ready for review, post a message containing
@rustbot ready to switch to S-waiting-on-review so the PR appears in the reviewer's backlog.

@JohnCSimon
Member

@AngelicosPhosphoros
Ping from triage: I'm closing this due to inactivity. Please reopen when you are ready to continue with it.
Note: if you do, please reopen the PR BEFORE you force-push to it; otherwise you won't be able to reopen it.
Thanks for your contribution.

@rustbot label: +S-inactive

@JohnCSimon JohnCSimon closed this Apr 23, 2022
@rustbot rustbot added the S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. label Apr 23, 2022
@the8472
Member

the8472 commented Apr 23, 2022

AngelicosPhosphoros requested a review from Mark-Simulacrum 3 months ago

I think this was waiting on review and not labeled correctly.

@the8472 the8472 reopened this Apr 23, 2022
@the8472 the8472 added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Apr 23, 2022
@JohnCSimon JohnCSimon removed the S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. label Apr 23, 2022
black_box(vec.first());
// Unlike other benches of `dedup`
// this doesn't reinitialize vec
// because we measure how effecient dedup is
Member

Suggested change
// because we measure how effecient dedup is
// because we measure how efficient dedup is

// SAFETY: possible_remove_idx always in range [1..len)
let prev = start.add(possible_remove_idx - 1);
let current = start.add(possible_remove_idx);
same_bucket(&mut *current, &mut *prev)
Member

If I'm following the old code right, it looks like this swapped the ordering -- we used to pass (0, 1), (1, 2), (2, 3), etc., whereas this now passes them as (1, 0), (2, 1), ... -- it seems like we can probably just swap current and prev here?

It would be great to add a test for this, too.
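For reference, the stable `Vec::dedup_by` documents that elements are passed in the opposite order from their order in the slice: `same_bucket(a, b)` receives the later element as `a`, and `a` is the one removed when the closure returns true. A small check of that contract (test values here are arbitrary):

```rust
fn main() {
    let mut calls = Vec::new();
    let mut v = vec![10, 20, 20, 30];
    v.dedup_by(|a, b| {
        // Record the (later, last-kept) pair each call receives.
        calls.push((*a, *b));
        a == b
    });
    assert_eq!(v, [10, 20, 30]);
    // `a` is the later element, `b` the last kept one.
    assert_eq!(calls, [(20, 10), (20, 20), (30, 20)]);
}
```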

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 24, 2022
@AngelicosPhosphoros
Contributor Author

I thought for some reason that this had already been merged.
I will reopen it in a few days.

bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 25, 2023
…rsion_77772_2, r=<try>

Split `Vec::dedup_by` into 2 cycles

First cycle runs until two equal adjacent elements are found; the second runs afterwards, only if the first cycle found any. This avoids any memory writes until we find an item which we want to remove.

This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.

Results of benchmarks before implementation (including new benchmark where nothing needs to be removed):
 *   vec::bench_dedup_all_100                 74.00ns/iter  +/- 13.00ns
 *   vec::bench_dedup_all_1000               572.00ns/iter +/- 272.00ns
 *   vec::bench_dedup_all_100000              64.42µs/iter  +/- 19.47µs
 *   __vec::bench_dedup_none_100                67.00ns/iter  +/- 17.00ns__
 *   __vec::bench_dedup_none_1000              662.00ns/iter  +/- 86.00ns__
 *   __vec::bench_dedup_none_10000               9.16µs/iter   +/- 2.71µs__
 *   __vec::bench_dedup_none_100000             91.25µs/iter   +/- 1.82µs__
 *   vec::bench_dedup_random_100             105.00ns/iter  +/- 11.00ns
 *   vec::bench_dedup_random_1000            781.00ns/iter  +/- 10.00ns
 *   vec::bench_dedup_random_10000             9.00µs/iter   +/- 5.62µs
 *   vec::bench_dedup_random_100000          449.81µs/iter  +/- 74.99µs
 *   vec::bench_dedup_slice_truncate_100     105.00ns/iter  +/- 16.00ns
 *   vec::bench_dedup_slice_truncate_1000      2.65µs/iter +/- 481.00ns
 *   vec::bench_dedup_slice_truncate_10000    18.33µs/iter   +/- 5.23µs
 *   vec::bench_dedup_slice_truncate_100000  501.12µs/iter  +/- 46.97µs

Results after implementation:
 *   vec::bench_dedup_all_100                 75.00ns/iter   +/- 9.00ns
 *   vec::bench_dedup_all_1000               494.00ns/iter +/- 117.00ns
 *   vec::bench_dedup_all_100000              58.13µs/iter   +/- 8.78µs
 *   __vec::bench_dedup_none_100                52.00ns/iter  +/- 22.00ns__
 *   __vec::bench_dedup_none_1000              417.00ns/iter +/- 116.00ns__
 *   __vec::bench_dedup_none_10000               4.11µs/iter +/- 546.00ns__
 *   __vec::bench_dedup_none_100000             40.47µs/iter   +/- 5.36µs__
 *   vec::bench_dedup_random_100              77.00ns/iter  +/- 15.00ns
 *   vec::bench_dedup_random_1000            681.00ns/iter  +/- 86.00ns
 *   vec::bench_dedup_random_10000            11.66µs/iter   +/- 2.22µs
 *   vec::bench_dedup_random_100000          469.35µs/iter  +/- 20.53µs
 *   vec::bench_dedup_slice_truncate_100     100.00ns/iter   +/- 5.00ns
 *   vec::bench_dedup_slice_truncate_1000      2.55µs/iter +/- 224.00ns
 *   vec::bench_dedup_slice_truncate_10000    18.95µs/iter   +/- 2.59µs
 *   vec::bench_dedup_slice_truncate_100000  492.85µs/iter  +/- 72.84µs

Resolves rust-lang#77772

P.S. Note that this is the same PR as rust-lang#92104; I just missed the review and then forgot about it.
Also, I cannot reopen that pull request, so I am creating a new one.
I responded to the remaining questions directly by adding commentaries to my code.
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 5, 2023
…rsion_77772_2, r=the8472

Split `Vec::dedup_by` into 2 cycles

(commit message and benchmark results identical to the Nov 25 commit above)
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.
10 participants