Split `Vec::dedup_by` into 2 cycles #92104
Conversation
r? @dtolnay (rust-highfive has picked a reviewer for you, use r? to override)
I didn't understand how to run benchmarks in the rustc codegen suite, so I wrote my own and tested on my machine (Ryzen 2700X with Turbo Boost disabled). My code:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::prelude::*;

const NUM_ITEMS: usize = 20000;

fn generate_unique_nums() -> Vec<usize> {
    (0..NUM_ITEMS).collect()
}

fn generate_unique_strings() -> Vec<String> {
    (0..NUM_ITEMS).map(|x| x.to_string()).collect()
}

fn generate_duplicates<T: Clone>(v: Vec<T>) -> Vec<T> {
    let mut rng = rand_chacha::ChaChaRng::seed_from_u64(546);
    v.into_iter()
        .flat_map(|s| {
            if rng.gen_bool(0.7) {
                vec![s]
            } else {
                vec![s.clone(), s]
            }
        })
        .take(NUM_ITEMS)
        .collect()
}

pub fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Dedup");
    group.bench_function("Unique num", |b| {
        b.iter_batched(
            || black_box(generate_unique_nums()),
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.bench_function("Duplicate num", |b| {
        b.iter_batched(
            || {
                let uniq = generate_unique_nums();
                let dup = generate_duplicates(uniq);
                black_box(dup)
            },
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.bench_function("Unique string", |b| {
        b.iter_batched(
            || black_box(generate_unique_strings()),
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.bench_function("With duplicate string", |b| {
        b.iter_batched(
            || {
                let uniq = generate_unique_strings();
                let dup = generate_duplicates(uniq);
                black_box(dup)
            },
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.bench_function("Duplicate ZSTs", |b| {
        b.iter_batched(
            || black_box(vec![(); NUM_ITEMS]),
            |mut v| {
                v.dedup();
                v
            },
            criterion::BatchSize::LargeInput,
        )
    });
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

My results
Also, this change increases the generated instruction count, but I think the performance gains justify it. Below is an example of the code generated for this function:

pub fn dedup(v: &mut Vec<u32>) {
    v.dedup()
}

I used command
With the old version:
.text
.def @feat.00;
.scl 3;
.type 0;
.endef
.globl @feat.00
.set @feat.00, 0
.file "dedup.f421260b-cgu.0"
.def _ZN5dedup5dedup17h0bbecfdc6851a7d9E;
.scl 2;
.type 32;
.endef
.section .text,"xr",one_only,_ZN5dedup5dedup17h0bbecfdc6851a7d9E
.globl _ZN5dedup5dedup17h0bbecfdc6851a7d9E
.p2align 4, 0x90
_ZN5dedup5dedup17h0bbecfdc6851a7d9E:
movq 16(%rcx), %rax
cmpq $2, %rax
jb .LBB0_7
movq (%rcx), %r9
leaq -1(%rax), %r8
cmpq $2, %rax
jne .LBB0_8
movl $1, %eax
movl $1, %r11d
.LBB0_3:
testb $1, %r8b
je .LBB0_6
movl (%r9,%rax,4), %eax
cmpl -4(%r9,%r11,4), %eax
je .LBB0_6
movl %eax, (%r9,%r11,4)
addq $1, %r11
.LBB0_6:
movq %r11, 16(%rcx)
.LBB0_7:
retq
.LBB0_8:
movq %r8, %r10
andq $-2, %r10
negq %r10
movl $1, %eax
movl $1, %r11d
jmp .LBB0_9
.p2align 4, 0x90
.LBB0_12:
leaq (%r10,%rax), %rdx
addq $2, %rdx
addq $2, %rax
cmpq $1, %rdx
je .LBB0_3
.LBB0_9:
movl (%r9,%rax,4), %edx
cmpl -4(%r9,%r11,4), %edx
je .LBB0_10
movl %edx, (%r9,%r11,4)
addq $1, %r11
.LBB0_10:
movl 4(%r9,%rax,4), %edx
cmpl -4(%r9,%r11,4), %edx
je .LBB0_12
movl %edx, (%r9,%r11,4)
addq $1, %r11
jmp .LBB0_12

With the new version:
.text
.def @feat.00;
.scl 3;
.type 0;
.endef
.globl @feat.00
.set @feat.00, 0
.file "dedup.f421260b-cgu.0"
.def _ZN5dedup5dedup17h0bbecfdc6851a7d9E;
.scl 2;
.type 32;
.endef
.section .text,"xr",one_only,_ZN5dedup5dedup17h0bbecfdc6851a7d9E
.globl _ZN5dedup5dedup17h0bbecfdc6851a7d9E
.p2align 4, 0x90
_ZN5dedup5dedup17h0bbecfdc6851a7d9E:
movq 16(%rcx), %r8
cmpq $2, %r8 ; Check length of input
jb .LBB0_14
movq (%rcx), %r10
movl (%r10), %eax
leaq -1(%r8), %r9
xorl %edx, %edx
.p2align 4, 0x90
.LBB0_2: ; First loop header
movl %eax, %r11d
movl 4(%r10,%rdx,4), %eax ; Read item from vec to register
cmpl %eax, %r11d
je .LBB0_3 ; Jump to code which handle item removes
addq $1, %rdx
cmpq %rdx, %r9
jne .LBB0_2 ; If we finished loop and didn't find duplicate, return
.LBB0_14:
retq
.LBB0_3:
leaq 2(%rdx), %r11
leaq 1(%rdx), %r9
cmpq %r11, %r8
jbe .LBB0_13
movl %r8d, %eax
subl %edx, %eax
addl $-2, %eax
testb $1, %al
je .LBB0_8
movl 8(%r10,%rdx,4), %eax
cmpl (%r10,%rdx,4), %eax
je .LBB0_7
movl %eax, 4(%r10,%rdx,4)
leaq 2(%rdx), %r9
.LBB0_7:
leaq 3(%rdx), %r11
.LBB0_8:
leaq -3(%r8), %rax
cmpq %rdx, %rax
jne .LBB0_9
.LBB0_13:
movq %r9, 16(%rcx)
retq
.p2align 4, 0x90
.LBB0_12:
addq $2, %r11
cmpq %r11, %r8
je .LBB0_13
.LBB0_9:
movl (%r10,%r11,4), %edx
cmpl -4(%r10,%r9,4), %edx
jne .LBB0_16
movl 4(%r10,%r11,4), %edx
cmpl -4(%r10,%r9,4), %edx
je .LBB0_12
jmp .LBB0_11
.p2align 4, 0x90
.LBB0_16:
movl %edx, (%r10,%r9,4)
addq $1, %r9
movl 4(%r10,%r11,4), %edx
cmpl -4(%r10,%r9,4), %edx
je .LBB0_12
.LBB0_11:
movl %edx, (%r10,%r9,4)
addq $1, %r9
jmp .LBB0_12

It can be seen from the ASM above that there are no writes to memory in the first loop.
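The two-phase structure can be sketched in safe Rust. This is a simplified illustration of the idea, not the actual `Vec::dedup_by` implementation (which works with raw pointers and a `same_bucket` closure); `dedup_two_phase` is my own name.

```rust
/// Phase 1 only reads until the first pair of equal neighbours is found,
/// so the common "no duplicates" case performs no memory writes at all.
/// Phase 2 is a classic read/write compaction starting at that point.
fn dedup_two_phase<T: PartialEq>(v: &mut Vec<T>) {
    let len = v.len();
    if len < 2 {
        return;
    }
    // Phase 1: scan for the first duplicate pair, reads only.
    let mut first_dup = None;
    for i in 1..len {
        if v[i] == v[i - 1] {
            first_dup = Some(i);
            break;
        }
    }
    let mut write = match first_dup {
        Some(i) => i,
        None => return, // nothing to remove, vector untouched
    };
    // Phase 2: shift retained elements left; duplicates accumulate in the
    // tail and are dropped by `truncate`.
    for read in write + 1..len {
        if v[read] != v[write - 1] {
            v.swap(write, read);
            write += 1;
        }
    }
    v.truncate(write);
}
```

Swapping instead of overwriting keeps every element alive until `truncate`, which matters for types with non-trivial `Drop`.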
I also added some benchmarks for specific cases to the Vec suite, but I failed to execute them.
For walltime benchmarks of library functions there are the benches directories. E.g.
@the8472 Thank you for your help. Actually, I don't know which case is better to optimize. P.S. There are also criterion benchmark results which show a clear win for the case when there is nothing to remove.
Presumably the benchmark tests you added are designed to show the benefit of the changes? Could you run those new benchmarks with and without the code changes to demonstrate the benefit? (It would also be useful to comment somewhere in the benches to explain what aspect of dedup the benchmarks are testing and what none/all/old/new in the names means.)
They cover more specific cases than the old benches. They also make better use of black_box.
I rewrote the benchmarks and split my commit into 2 parts. Here are the results. The affected benchmarks are:
Old code benchmark
New code benchmark
It seems I managed to significantly improve
JFYI: I won't be able to participate until January 15th. I will return to this PR after that if needed.
@dtolnay Could you please review my PR? :)
library/alloc/src/vec/mod.rs
Outdated
let current = start.add(possible_remove_idx);
same_bucket(&mut *current, &mut *prev)
};
if need_drop { |
I feel like need_drop is somewhat confusing here -- can we call this has_duplicate?
Changed to found_duplicate.
This looks pretty good to me. That said, I think if we want to optimize further it'd be good to enhance the benchmark suite with a non-trivial Drop impl (e.g., a …).

I did find it a little odd that the current impl (and the new impl) both end up copying every single element in the gaps between duplicate pairs (e.g., with …).

It seems like the optimization this PR suggests would be good to move into the "slow" loop, so instead of just optimizing the "no duplicates" case, we never have this sort of 1-element shuffle going on. That does likely hurt the case where there are a lot of small gaps between duplicate elements, but that case is presumably somewhat rare.

I think it should be possible to rewrite the core loop here to essentially: search for a range such that …. This basically turns the current (or after this PR) single-element shuffles into drops + copies of ranges, which seems likely to optimize better than the current strategy, though obviously it would need benchmarking. It seems likely to be a big win for the "rare duplicates" case (or mostly-duplicates case) -- similar to the win seen in this PR with the "no duplicates" case. It might also let us avoid the whole fill-gap-on-drop abstraction, since vec.drain already does that (right?).

I'm happy to r+ this if you want to leave this further optimization to a future PR (seems like an issue might be in order; the design sketch means it's probably relatively 'medium hard' for someone to pick up) -- but I think it would simplify the code pretty nicely, and leave the unsafe bits mostly just to roughly a pick2_mut.
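A rough sketch of my reading of the range-based suggestion, using `Vec::drain` to remove each run of consecutive duplicates (`dedup_ranges` is a hypothetical name, not proposed code). Note this naive per-run `drain` shifts the whole tail every time, so it is O(n²) in the worst case; the single-pass range-copy variant described above would avoid that.

```rust
/// Remove runs of consecutive duplicates range-by-range instead of
/// element-by-element. `drain` both drops the removed elements and
/// closes the gap (even on panic/drop), which is the "fill gap on drop"
/// behaviour mentioned in the comment.
fn dedup_ranges<T: PartialEq>(v: &mut Vec<T>) {
    let mut i = 1;
    while i < v.len() {
        if v[i] == v[i - 1] {
            // Extend the range while elements equal the retained one.
            let start = i;
            let mut end = i + 1;
            while end < v.len() && v[end] == v[start - 1] {
                end += 1;
            }
            v.drain(start..end); // drops duplicates, shifts the tail left
        } else {
            i += 1;
        }
    }
}
```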
The first cycle runs until we find 2 equal elements; the second runs afterwards if any were found in the first one. This lets us avoid any memory writes until we find an item we want to remove. This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.
Your proposed optimization requires more benchmarking and some effort that I can't put in immediately, so I would leave it to a future PR. I added a link to your comment in the linked issue to avoid losing it. P.S. We probably won't be able to use
Ping from triage: FYI, when a PR is ready for review, post a message containing
@AngelicosPhosphoros @rustbot label: +S-inactive
I think this was waiting on review and was not labeled correctly.
black_box(vec.first());
// Unlike other benches of `dedup` | ||
// this doesn't reinitialize vec | ||
// because we measure how effecient dedup is |
// because we measure how effecient dedup is
// because we measure how efficient dedup is |
// SAFETY: possible_remove_idx always in range [1..len)
let prev = start.add(possible_remove_idx - 1);
let current = start.add(possible_remove_idx);
same_bucket(&mut *current, &mut *prev)
If I'm following the old code right, it looks like this swapped the ordering -- we used to pass (0, 1), (1, 2), (2, 3), etc., whereas this now passes them as (1, 0), (2, 1), ... -- it seems like we can probably just swap current and prev here?
It would be great to add a test for this, too.
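A test along these lines could pin down the argument order (a sketch, not the test that was added; `dedup_successors` is a hypothetical name). `Vec::dedup_by` documents that the elements are passed in the opposite order from the slice, i.e. `same_bucket(current, previous)`, and the predicate here is deliberately asymmetric so a swapped order changes the result.

```rust
/// Remove `current` when it is exactly previous + 1. Because the
/// predicate is not symmetric, this distinguishes (current, previous)
/// from (previous, current) argument order.
fn dedup_successors(mut v: Vec<i32>) -> Vec<i32> {
    v.dedup_by(|cur, prev| *cur == *prev + 1);
    v
}
```

With the documented order, `dedup_successors(vec![1, 2, 4, 3])` removes the 2 (it follows 1) and keeps the rest, yielding `[1, 4, 3]`; with swapped arguments the 3 would be removed instead.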
I thought for some reason that this was merged already.
…rsion_77772_2, r=<try> Split `Vec::dedup_by` into 2 cycles

First cycle runs until we find 2 equal elements; the second runs afterwards if any were found in the first one. This lets us avoid any memory writes until we find an item we want to remove. This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.

Results of benchmarks before implementation (including new benchmark where nothing needs to be removed):
* vec::bench_dedup_all_100 74.00ns/iter +/- 13.00ns
* vec::bench_dedup_all_1000 572.00ns/iter +/- 272.00ns
* vec::bench_dedup_all_100000 64.42µs/iter +/- 19.47µs
* __vec::bench_dedup_none_100 67.00ns/iter +/- 17.00ns__
* __vec::bench_dedup_none_1000 662.00ns/iter +/- 86.00ns__
* __vec::bench_dedup_none_10000 9.16µs/iter +/- 2.71µs__
* __vec::bench_dedup_none_100000 91.25µs/iter +/- 1.82µs__
* vec::bench_dedup_random_100 105.00ns/iter +/- 11.00ns
* vec::bench_dedup_random_1000 781.00ns/iter +/- 10.00ns
* vec::bench_dedup_random_10000 9.00µs/iter +/- 5.62µs
* vec::bench_dedup_random_100000 449.81µs/iter +/- 74.99µs
* vec::bench_dedup_slice_truncate_100 105.00ns/iter +/- 16.00ns
* vec::bench_dedup_slice_truncate_1000 2.65µs/iter +/- 481.00ns
* vec::bench_dedup_slice_truncate_10000 18.33µs/iter +/- 5.23µs
* vec::bench_dedup_slice_truncate_100000 501.12µs/iter +/- 46.97µs

Results after implementation:
* vec::bench_dedup_all_100 75.00ns/iter +/- 9.00ns
* vec::bench_dedup_all_1000 494.00ns/iter +/- 117.00ns
* vec::bench_dedup_all_100000 58.13µs/iter +/- 8.78µs
* __vec::bench_dedup_none_100 52.00ns/iter +/- 22.00ns__
* __vec::bench_dedup_none_1000 417.00ns/iter +/- 116.00ns__
* __vec::bench_dedup_none_10000 4.11µs/iter +/- 546.00ns__
* __vec::bench_dedup_none_100000 40.47µs/iter +/- 5.36µs__
* vec::bench_dedup_random_100 77.00ns/iter +/- 15.00ns
* vec::bench_dedup_random_1000 681.00ns/iter +/- 86.00ns
* vec::bench_dedup_random_10000 11.66µs/iter +/- 2.22µs
* vec::bench_dedup_random_100000 469.35µs/iter +/- 20.53µs
* vec::bench_dedup_slice_truncate_100 100.00ns/iter +/- 5.00ns
* vec::bench_dedup_slice_truncate_1000 2.55µs/iter +/- 224.00ns
* vec::bench_dedup_slice_truncate_10000 18.95µs/iter +/- 2.59µs
* vec::bench_dedup_slice_truncate_100000 492.85µs/iter +/- 72.84µs

Resolves rust-lang#77772

P.S. Note that this is the same PR as rust-lang#92104; I just missed review and then forgot about it. Also, I cannot reopen that pull request, so I am creating a new one. I responded to the remaining questions directly by adding commentaries to my code.
…rsion_77772_2, r=the8472 Split `Vec::dedup_by` into 2 cycles
The first cycle runs until we find 2 equal elements; the second runs afterwards if any were found in the first one. This lets us avoid any memory writes until we find an item we want to remove.

This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.
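For reference, the inputs the none/all/random benchmark names refer to behave like this (my reading of the names; `dedup_demo` is a hypothetical helper, and `dedup` removes only *consecutive* duplicates):

```rust
/// Thin wrapper so the three benchmark shapes can be shown side by side.
fn dedup_demo(mut v: Vec<i32>) -> Vec<i32> {
    v.dedup();
    v
}

// "none":   dedup_demo(vec![1, 2, 3, 4]) keeps all four elements;
//           this is the fast path where the first loop never writes.
// "all":    dedup_demo(vec![7, 7, 7, 7, 7]) collapses to [7].
// "random": dedup_demo(vec![1, 1, 2, 1]) gives [1, 2, 1]; the trailing 1
//           survives because it is not adjacent to the first run.
```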