Add missing _excess methods to Alloc and use them in RawVec #50738
Conversation
r? @dtolnay (rust_highfive has picked a reviewer for you, use r? to override)

The ... Also, subsequent PRs should move ...

cc @alexcrichton I'll update the ...
The job failed. Click to expand the log.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact ...
Could you show some benchmark comparisons between the old and new behavior?

@sfackler sure, what benchmarks would you like to see?

The existing vec benchmarks would be a good choice: https://github.com/rust-lang/rust/blob/master/src/liballoc/benches/vec.rs

Yeah I can do those!

While I think something like this is necessary, I also think that's way too much churn to add to the Alloc API. Background: https://internals.rust-lang.org/t/pre-rfc-changing-the-alloc-trait/7487

When this is about ready to land I think we'll also want to do at least a perf.rust-lang.org run; these tend to be very performance sensitive, so just want to try to weed out possible regressions if we can!
After the PR:

name vec_bench_master ns/iter vec_bench_exc ns/iter diff ns/iter diff % speedup
slice::ends_with_same_vector 0 0 0 NaN% x NaN
slice::starts_with_same_vector 0 0 0 NaN% x NaN
vec::bench_clone_0000 14 13 -1 -7.14% x 1.08
vec::bench_clone_0010 51 (196 MB/s) 51 (196 MB/s) 0 0.00% x 1.00
vec::bench_clone_0100 149 (671 MB/s) 139 (719 MB/s) -10 -6.71% x 1.07
vec::bench_clone_1000 1,085 (921 MB/s) 1,057 (946 MB/s) -28 -2.58% x 1.03
vec::bench_clone_from_01_0000_0000 25 24 -1 -4.00% x 1.04
vec::bench_clone_from_01_0000_0010 80 (125 MB/s) 77 (129 MB/s) -3 -3.75% x 1.04
vec::bench_clone_from_01_0000_0100 212 (471 MB/s) 204 (490 MB/s) -8 -3.77% x 1.04
vec::bench_clone_from_01_0000_1000 1,631 (613 MB/s) 1,616 (618 MB/s) -15 -0.92% x 1.01
vec::bench_clone_from_01_0010_0000 24 27 3 12.50% x 0.89
vec::bench_clone_from_01_0010_0010 81 (123 MB/s) 72 (138 MB/s) -9 -11.11% x 1.12
vec::bench_clone_from_01_0010_0100 236 (423 MB/s) 204 (490 MB/s) -32 -13.56% x 1.16
vec::bench_clone_from_01_0100_0010 81 (123 MB/s) 70 (142 MB/s) -11 -13.58% x 1.16
vec::bench_clone_from_01_0100_0100 213 (469 MB/s) 210 (476 MB/s) -3 -1.41% x 1.01
vec::bench_clone_from_01_0100_1000 1,715 (583 MB/s) 1,660 (602 MB/s) -55 -3.21% x 1.03
vec::bench_clone_from_01_1000_0100 230 (434 MB/s) 226 (442 MB/s) -4 -1.74% x 1.02
vec::bench_clone_from_01_1000_1000 1,745 (573 MB/s) 1,658 (603 MB/s) -87 -4.99% x 1.05
vec::bench_clone_from_10_0000_0000 152 146 -6 -3.95% x 1.04
vec::bench_clone_from_10_0000_0010 347 (288 MB/s) 321 (311 MB/s) -26 -7.49% x 1.08
vec::bench_clone_from_10_0000_0100 1,619 (617 MB/s) 1,513 (660 MB/s) -106 -6.55% x 1.07
vec::bench_clone_from_10_0000_1000 14,540 (687 MB/s) 14,150 (706 MB/s) -390 -2.68% x 1.03
vec::bench_clone_from_10_0010_0000 148 143 -5 -3.38% x 1.03
vec::bench_clone_from_10_0010_0010 373 (268 MB/s) 321 (311 MB/s) -52 -13.94% x 1.16
vec::bench_clone_from_10_0010_0100 1,711 (584 MB/s) 1,513 (660 MB/s) -198 -11.57% x 1.13
vec::bench_clone_from_10_0100_0010 356 (280 MB/s) 323 (309 MB/s) -33 -9.27% x 1.10
vec::bench_clone_from_10_0100_0100 1,719 (581 MB/s) 1,510 (662 MB/s) -209 -12.16% x 1.14
vec::bench_clone_from_10_0100_1000 14,240 (702 MB/s) 14,201 (704 MB/s) -39 -0.27% x 1.00
vec::bench_clone_from_10_1000_0100 1,732 (577 MB/s) 1,511 (661 MB/s) -221 -12.76% x 1.15
vec::bench_clone_from_10_1000_1000 15,541 (643 MB/s) 15,025 (665 MB/s) -516 -3.32% x 1.03
vec::bench_extend_0000_0000 40 34 -6 -15.00% x 1.18
vec::bench_extend_0000_0010 119 (84 MB/s) 110 (90 MB/s) -9 -7.56% x 1.08
vec::bench_extend_0000_0100 227 (440 MB/s) 220 (454 MB/s) -7 -3.08% x 1.03
vec::bench_extend_0000_1000 1,373 (728 MB/s) 1,332 (750 MB/s) -41 -2.99% x 1.03
vec::bench_extend_0010_0010 177 (56 MB/s) 180 (55 MB/s) 3 1.69% x 0.98
vec::bench_extend_0100_0100 410 (243 MB/s) 396 (252 MB/s) -14 -3.41% x 1.04
vec::bench_extend_1000_1000 3,133 (319 MB/s) 3,002 (333 MB/s) -131 -4.18% x 1.04
vec::bench_from_elem_0000 10 10 0 0.00% x 1.00
vec::bench_from_elem_0010 52 (192 MB/s) 51 (196 MB/s) -1 -1.92% x 1.02
vec::bench_from_elem_0100 168 (595 MB/s) 192 (520 MB/s) 24 14.29% x 0.88
vec::bench_from_elem_1000 1,284 (778 MB/s) 1,264 (791 MB/s) -20 -1.56% x 1.02
vec::bench_from_fn_0000 11 11 0 0.00% x 1.00
vec::bench_from_fn_0010 54 (185 MB/s) 60 (166 MB/s) 6 11.11% x 0.90
vec::bench_from_fn_0100 207 (483 MB/s) 228 (438 MB/s) 21 10.14% x 0.91
vec::bench_from_fn_1000 1,627 (614 MB/s) 1,600 (625 MB/s) -27 -1.66% x 1.02
vec::bench_from_iter_0000 15 15 0 0.00% x 1.00
vec::bench_from_iter_0010 55 (181 MB/s) 51 (196 MB/s) -4 -7.27% x 1.08
vec::bench_from_iter_0100 142 (704 MB/s) 142 (704 MB/s) 0 0.00% x 1.00
vec::bench_from_iter_1000 1,110 (900 MB/s) 1,057 (946 MB/s) -53 -4.77% x 1.05
vec::bench_from_slice_0000 23 23 0 0.00% x 1.00
vec::bench_from_slice_0010 94 (106 MB/s) 110 (90 MB/s) 16 17.02% x 0.85
vec::bench_from_slice_0100 202 (495 MB/s) 228 (438 MB/s) 26 12.87% x 0.89
vec::bench_from_slice_1000 1,357 (736 MB/s) 1,320 (757 MB/s) -37 -2.73% x 1.03
vec::bench_new 0 0 0 NaN% x NaN
vec::bench_push_all_0000_0000 21 21 0 0.00% x 1.00
vec::bench_push_all_0000_0010 61 (163 MB/s) 58 (172 MB/s) -3 -4.92% x 1.05
vec::bench_push_all_0000_0100 164 (609 MB/s) 147 (680 MB/s) -17 -10.37% x 1.12
vec::bench_push_all_0000_1000 1,089 (918 MB/s) 1,072 (932 MB/s) -17 -1.56% x 1.02
vec::bench_push_all_0010_0010 139 (71 MB/s) 127 (78 MB/s) -12 -8.63% x 1.09
vec::bench_push_all_0100_0100 338 (295 MB/s) 334 (299 MB/s) -4 -1.18% x 1.01
vec::bench_push_all_1000_1000 2,979 (335 MB/s) 3,086 (324 MB/s) 107 3.59% x 0.97
vec::bench_push_all_move_0000_0000 36 37 1 2.78% x 0.97
vec::bench_push_all_move_0000_0010 118 (84 MB/s) 116 (86 MB/s) -2 -1.69% x 1.02
vec::bench_push_all_move_0000_0100 224 (446 MB/s) 234 (427 MB/s) 10 4.46% x 0.96
vec::bench_push_all_move_0000_1000 1,400 (714 MB/s) 1,352 (739 MB/s) -48 -3.43% x 1.04
vec::bench_push_all_move_0010_0010 178 (56 MB/s) 196 (51 MB/s) 18 10.11% x 0.91
vec::bench_push_all_move_0100_0100 399 (250 MB/s) 399 (250 MB/s) 0 0.00% x 1.00
vec::bench_push_all_move_1000_1000 3,218 (310 MB/s) 3,334 (299 MB/s) 116 3.60% x 0.97
vec::bench_with_capacity_0000 2 2 0 0.00% x 1.00
vec::bench_with_capacity_0010 31 (322 MB/s) 31 (322 MB/s) 0 0.00% x 1.00
vec::bench_with_capacity_0100 31 (3225 MB/s) 30 (3333 MB/s) -1 -3.23% x 1.03
vec::bench_with_capacity_1000 31 (32258 MB/s) 34 (29411 MB/s) 3 9.68% x 0.91
vec_deque::bench_grow_1025 5,754 5,546 -208 -3.61% x 1.04
vec_deque::bench_iter_1000 1,002 910 -92 -9.18% x 1.10
vec_deque::bench_mut_iter_1000 983 924 -59 -6.00% x 1.06
vec_deque::bench_new 37 36 -1 -2.70% x 1.03

@alexcrichton For the run-time parts of the benchmarks, whether that's relevant or not will depend on how much rustc uses ...

@Mark-Simulacrum could you launch a perf run? Does ...
I am of the opinion that all `Alloc` methods should return the excess. The only situation in which returning the excess ...
@bors: try

FWIW a perf run isn't a great way to measure this; it's mostly a "hey look, there's no obvious regressions by accident" sort of signal.
Add missing _excess methods to Alloc and use them in RawVec

Adds the following methods to the `Alloc` trait:

* `alloc_zeroed_excess`
* `shrink_in_place_excess`
* `grow_in_place_excess`

Uses them in the appropriate places within the `RawVec` implementation, with the exception of `shrink_in_place_excess`, which should be used in `RawVec::shrink_to_fit` but will be done in a different PR.
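To illustrate what the `_excess` pattern buys a `RawVec`-style growth path: an allocation call that reports its actual usable size lets the vector record that as its capacity, so the size-class slack is usable for free. The `size_class` rounding and the `alloc_excess` helper below are hypothetical stand-ins (mimicking the unstable `Excess(ptr, usable_size)` convention), not real jemalloc behavior:

```rust
// Hypothetical sketch of an excess-reporting allocation path, built on the
// stable std::alloc API. A real allocator's size classes are more involved.
use std::alloc::{alloc, dealloc, Layout};

/// Round a request up to a made-up size class (powers of two here),
/// standing in for an allocator's real size-class table.
fn size_class(requested: usize) -> usize {
    requested.next_power_of_two()
}

/// Allocate `requested` bytes, returning the pointer and the usable size.
fn alloc_excess(requested: usize) -> (*mut u8, usize) {
    let usable = size_class(requested);
    let layout = Layout::from_size_align(usable, 1).unwrap();
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failure");
    (ptr, usable)
}

fn main() {
    // A RawVec-style growth path: ask for 100 bytes, but record the usable
    // size as the capacity so later pushes can use the slack without
    // reallocating.
    let requested = 100;
    let (ptr, capacity) = alloc_excess(requested);
    assert!(capacity >= requested);
    assert_eq!(capacity, 128); // 100 rounds up to the next power of two
    unsafe { dealloc(ptr, Layout::from_size_align(capacity, 1).unwrap()) };
    println!("requested {requested}, capacity {capacity}");
}
```

Without the excess variant, the vector would record a capacity of 100 and reallocate on the 101st byte even though the block can already hold 128.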
cc @SimonSapin this only touches ...

☀️ Test successful - status-travis

I've scheduled a perf run, data will show up here when ready.

Awesome, thanks @alexcrichton!

I think the discussion about ...
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0901r0.html suggests the same. FWIW I implemented that same API on jemalloc (jemalloc/jemalloc#1074 (comment)), and in some cases it is much faster than doing ...
Maybe what's missing is for ...

I also wondered whether Excess should just return a ...

I'll try to chime in there tomorrow.

Ping from triage @sfackler! What's the status of this PR?

Great! That shows there aren't any "obvious" regressions.
```diff
@@ -574,6 +574,10 @@ impl<T> Vec<T> {
     /// It will drop down as close as possible to the length but the allocator
     /// may still inform the vector that there is space for a few more elements.
     ///
+    /// Note: `shrink_to_fit` is a non-binding request. Whether the `Vec` size
```
In what way will this help someone reading the documentation?
```rust
if l <= new_size {
    return Ok(Excess(ptr, u));
} else {
    return Err(CannotReallocInPlace);
}
```
How can we end up in this situation? The allocation can always fit less data than it already contains, right?
There are two issues that interact here:

- the user is not allowed to call `dealloc` with any size; the size used must be in the range `(l, u)` returned by `usable_size`. Sizes in that range must not invoke undefined behavior when passed to `dealloc`.
- the only requirement on the sizes `(l, u)` returned by `usable_size` is that `l <= u`. Therefore, a valid `usable_size` implementation is `usable_size() -> (u, u)`.

Therefore, we can end up here if the lower bound `l` on the allocation size returned by `usable_size -> (l, u)` is larger than `new_size`, which happens if `usable_size() -> (u, u)` and `new_size < u`.

If we were to return `Ok` in this case, the user would be allowed to call `dealloc` with `new_size`, but since that's out of the `[l, u]` range, that would trigger undefined behavior when calling `dealloc`.

Does that make sense?
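The legality check being discussed can be sketched as a pure function over `(l, u, new_size)`. The names `shrink_in_place_ok` and the struct below are illustrative; only the `usable_size`-range reasoning comes from the discussion above:

```rust
// Sketch of the shrink-in-place legality check: shrinking to `new_size` is
// only OK if `new_size` stays within the usable-size range [l, u], since the
// user may later call `dealloc` with `new_size`.

#[derive(Debug, PartialEq)]
struct CannotReallocInPlace;

/// `usable_size` returns a range (l, u); this checks whether a shrink to
/// `new_size` keeps the size valid for a later `dealloc`, and reports the
/// full usable size `u` back to the caller on success.
fn shrink_in_place_ok(
    l: usize,
    u: usize,
    new_size: usize,
) -> Result<usize, CannotReallocInPlace> {
    if l <= new_size && new_size <= u {
        Ok(u)
    } else {
        Err(CannotReallocInPlace)
    }
}

fn main() {
    // A degenerate but valid implementation may report (u, u): then any
    // `new_size < u` must be rejected, exactly the case from the discussion.
    assert_eq!(shrink_in_place_ok(4096, 4096, 100), Err(CannotReallocInPlace));
    // With a looser lower bound, the same shrink is fine.
    assert!(shrink_in_place_ok(0, 4096, 100).is_ok());
}
```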
FWIW I think that the current API of `shrink_in_place` is a bit broken. For example, `malloc` does not care about the allocation size on deallocation, so a valid implementation of `usable_size` for malloc is to return `(0, size)`. This would allow implementing `shrink_in_place` / `shrink_in_place_excess` for `malloc` by just doing nothing and returning `Ok`, but I don't think that's a useful implementation of `shrink_in_place`. Even if it were to return `Ok` only when it was able to shrink the allocation, that might not be useful either: if I have a 1 Gb allocation and want to shrink it in place to 100 Mb, and `shrink_in_place` only shrinks it by 1 byte, then that's not useful.

The problematic wording is that `shrink_in_place` is allowed to return more size than the one requested, which means that it can just return the original size and do nothing. For `grow_in_place` this wording makes sense, but it seems that it was copy-pasted for `shrink_in_place`. So either `shrink_in_place` should always return the allocation size, so that the user can check whether it shrunk the allocation enough, or it should take a requested size and an upper bound, and return an error if it can't shrink the allocation under the upper bound.
Also, while I understand the rationale behind splitting `realloc_in_place` into `{grow,shrink}_in_place`, I am starting to doubt whether that was the right call. A single `realloc_in_place` method would reduce API surface and simplify things, even if it adds a branch for allocators that have completely different execution paths for growing and shrinking.

In any case, all of this must be discussed in the RFC issue, not here, but those are my thoughts on the matter.
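The single-method alternative amounts to one extra branch at the entry point. The `Allocation` type and method names below are hypothetical, not the actual `Alloc` trait; they only model the dispatch being weighed:

```rust
// Sketch of the "single method" alternative: one `realloc_in_place` that
// branches once into separate grow/shrink paths, halving the API surface.

#[derive(Debug, PartialEq)]
enum ReallocError {
    CannotReallocInPlace,
}

/// Toy model of an allocation: a current size inside a block with a fixed
/// usable size.
struct Allocation {
    size: usize,
    usable: usize,
}

impl Allocation {
    fn grow_in_place(&mut self, new_size: usize) -> Result<(), ReallocError> {
        // Growing in place only works within the already-usable block.
        if new_size <= self.usable {
            self.size = new_size;
            Ok(())
        } else {
            Err(ReallocError::CannotReallocInPlace)
        }
    }

    fn shrink_in_place(&mut self, new_size: usize) -> Result<(), ReallocError> {
        // Shrinking within the block always succeeds in this toy model.
        self.size = new_size;
        Ok(())
    }

    /// The single-entry-point variant: one branch, then the two paths.
    fn realloc_in_place(&mut self, new_size: usize) -> Result<(), ReallocError> {
        if new_size >= self.size {
            self.grow_in_place(new_size)
        } else {
            self.shrink_in_place(new_size)
        }
    }
}

fn main() {
    let mut a = Allocation { size: 100, usable: 128 };
    assert!(a.realloc_in_place(120).is_ok()); // grows into the slack
    assert!(a.realloc_in_place(64).is_ok()); // shrinks
    assert_eq!(a.realloc_in_place(200), Err(ReallocError::CannotReallocInPlace));
}
```

Whether the saved branch in callers outweighs the doubled trait surface is exactly the trade-off the comment above questions.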
On the other hand, it also doesn't show any particular advantage to making this change.
Yes, these changes should not have any impact on perf at all, since liballoc_jemalloc does not specialize the ...
Ping from triage! What's the status of this PR?

Ping from triage @sfackler! What's the status of this PR?

I'd like to see a concrete performance improvement before adding more complexity here.
The liballoc_jemalloc specialization of the Alloc functions was removed from rustc, and the plan is to completely remove liballoc_jemalloc and move to jemallocator. Jemallocator can't implement these APIs until they are part of the Alloc trait. Std can't use these APIs until they are part of the Alloc trait. Perf runs without both std using these APIs and the allocator specializing them are meaningless beyond "this change does not introduce regressions". So can you explain how to do a meaningful perf run from this? I can try to remove liballoc_jemalloc from rustc, include jemallocator as a 3rd-party dependency, make rustc use that by default, change all of std to use the new APIs, and then report back. Is that what you mean should be done?
Performance is not the point anyways. The point is aligning the allocated sizes from types like ...

However, I still think this adds too much API surface.
I have a branch that changes the ...

Did you look at the rationale when ...
Yes, the rationale for the split was that two different functions make intent clearer and can save a branch in the implementation of ...

AFAIK there is no rationale anywhere for why the current APIs for ...

Fixing ...

FWIW nowhere was it discussed whether saving a branch in ...

The main reason I went with ...
The fact that it's annoying to determine whether a change is actually beneficial or not doesn't mean we should just YOLO-merge every exotic variant on allocation/reallocation/deallocation/whatever that could possibly exist. I am generally of the opinion that all of the _excess methods are a waste of effort and complexity budget. If this change couldn't have any benefit until it's propagated through the global allocator API, then it seems like this change should do that? You could alternatively yank the relevant code out into a separate project and point the allocation calls to something that actually forwards to the right allocator methods for the purpose of benchmarking.
What's annoying is that the work to determine the effect of this change was already done, but the specialization of Alloc for jemalloc was removed "YOLO-style": no benchmarks required for that change.
Ping from triage! @sfackler, what should we do with this?
@pietroalbini Probably close. This is now in the limbo where the only thing I can do to benchmark this is a synthetic benchmark using jemallocator and a modified version of rustc. For example, creating one vector/string of a particular size, and then pushing one element to it. Depending on whether the vector reallocates or not, the timings vary significantly. So I can always choose a size for which, if there is capacity available, the excess variant takes 1 ns for the push back, and the non-excess one takes at least 20-30 ns because of the reallocation. Depending on how hard I want to make my point, I can make the vectors huge and choose one of the known particularly bad prime numbers for jemalloc, so that I can pump those numbers into the 1000 ns range. But I don't know what any of that would prove. Before ...
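The push-one-element experiment described above can be sketched with stable `Vec` APIs: whether the push reallocates depends entirely on whether the buffer already has slack, which is the information the `_excess` variants would surface. This is only a behavioral sketch, not a timing benchmark:

```rust
// Sketch of the synthetic benchmark scenario: a push into spare capacity is a
// write plus a length bump; a push into an exactly-full Vec must reallocate
// and copy the whole buffer.

fn main() {
    let n = 1000;

    // Slack available: capacity is unchanged by the push.
    let mut with_slack: Vec<u8> = Vec::with_capacity(n + 1);
    with_slack.resize(n, 0);
    let cap_before = with_slack.capacity();
    with_slack.push(1);
    assert_eq!(with_slack.capacity(), cap_before); // no reallocation needed

    // Filled exactly to capacity: the next push must grow the buffer.
    let mut full: Vec<u8> = Vec::with_capacity(n);
    while full.len() < full.capacity() {
        full.push(0);
    }
    let cap = full.capacity();
    full.push(1); // this push reallocates and copies
    assert!(full.capacity() > cap);

    println!("slack cap: {cap_before}, grown cap: {}", full.capacity());
}
```

An excess-aware growth path widens the "slack available" case, since capacity is set to the allocator's usable size rather than the requested size.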
Off-topic, but you totally can do that in a Servo branch; it just won't pass our Android CI at the moment.

Ok, closing this.