Analyzing the Instruments profiles reveals quite a few things we could do better.
Native API calls
- Inefficient Objective-C selector lookups - Objective-C interop is too slow #2145
- Too many Objective-C messages. Roughly, our profile shows 5-7% dedicated to objc_msgSend versus 0.3-0.5% in MoltenVK. The difference in retain/release calls is of a similar scale.
  - Too many retain/release calls - mtl: Use pointers for temporary state #2175
  - Retains & releases could be faster if not done via messaging - Consider using objc_retain and objc_release instead of sending -retain and -release message metal-rs#58
- Attribute access goes through messages - Metal direct property access #2167
  - Class::get calls are super slow - Cache Class::get SSheldon/rust-objc#65. We spend roughly 3% of our library's time doing those calls.
  - NSAutoreleasePool could be created faster - [mtl] new autorelease #2267, Add autoreleasepool functionality. SSheldon/rust-objc#67
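
The class lookup cost mentioned above is the kind of thing a one-time cache can hide. The following is only a minimal sketch of that idea, not the actual rust-objc change: `ClassHandle`, `slow_class_lookup` and `autorelease_pool_class` are hypothetical stand-ins (no Objective-C runtime is involved), and the counter exists purely to show that the slow path runs once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the slow path runs (for illustration only).
static SLOW_LOOKUPS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a runtime class handle.
#[derive(Debug, PartialEq, Eq)]
pub struct ClassHandle {
    pub name: &'static str,
}

/// Hypothetical stand-in for an expensive by-name runtime lookup.
fn slow_class_lookup(name: &'static str) -> ClassHandle {
    SLOW_LOOKUPS.fetch_add(1, Ordering::SeqCst);
    ClassHandle { name }
}

/// Returns the cached handle; the slow lookup happens only on the first call.
pub fn autorelease_pool_class() -> &'static ClassHandle {
    static CACHE: OnceLock<ClassHandle> = OnceLock::new();
    CACHE.get_or_init(|| slow_class_lookup("NSAutoreleasePool"))
}

/// Number of times the slow lookup has actually been executed.
pub fn slow_lookup_count() -> usize {
    SLOW_LOOKUPS.load(Ordering::SeqCst)
}
```

Every call site that previously paid for a by-name lookup would call the cached accessor instead; the per-class `static` keeps the hot path down to an initialization check.
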
Metal backend issues
- Too much heap (re)allocation - Various Metal performance optimizations #2185. We should avoid any allocations at run-time by re-using the storage and/or using the iterators more aggressively.
- Too many command buffers - Command buffer allocation pool in Metal #2180. The instrumented memory profile doesn't show command buffers being re-used. I suppose the queue just creates a new one every time until it reaches the capacity, and only then starts trying to re-use the completed ones. This is undesirable, and we may attempt to address it (temporarily) by playing with the capacity limit.
- Dynamic creation of a render pass - [mtl] minimize creation of render passes #2178. We create one every time a render pass is started by copying one from the framebuffer and filling out all the clear values/operations. Apparently, this is super heavy, and we should probably avoid creating any Metal objects at normal run time.
- Dynamic creation of depth-stencil states - [mtl] cache depth/stencil states #2195.
- Binding descriptor sets is visibly slow - Heap-less descriptor sets in Metal #2183. We should avoid doing work for repeated bindings, and need to explore ways to share heap-allocated data between sets of the same group.
- Non-existent semaphore frame synchronization - Frame synchronization in Metal #2143
- Too much locking going on - [meta] Locking on Metal #2229
- Too many command buffer callbacks - [mtl] dynamic depth bias, callback coalescence, and more stats #2224
- Copying the render pass descriptor (which we do at each start of an RP) takes 7.4% of the time in our library
- Binding descriptor sets still involves a few hot loops that could be faster. In particular, we set resources in batches, and we provide a closure, which checks for the current pre-render status. All of that can be simplified if we are outside of a pass.
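
To illustrate the "avoid creating Metal objects at normal run time" theme from the list above, here is a minimal sketch of a state cache keyed by its description. It is not the code from #2195: `DepthStencilDesc`, `DepthStencilState` and `DepthStencilCache` are hypothetical types, and the "creation" is simulated with a counter rather than a real Metal call.

```rust
use std::collections::HashMap;

/// Hypothetical, hashable description of a depth-stencil configuration.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct DepthStencilDesc {
    pub depth_test: bool,
    pub depth_write: bool,
    pub stencil_mask: u32,
}

/// Hypothetical stand-in for an expensive-to-create backend state object.
#[derive(Debug)]
pub struct DepthStencilState {
    pub id: usize,
}

/// Creates each distinct state once and hands out references afterwards.
#[derive(Default)]
pub struct DepthStencilCache {
    states: HashMap<DepthStencilDesc, DepthStencilState>,
    created: usize,
}

impl DepthStencilCache {
    /// Returns the cached state for `desc`, creating it only on a miss.
    pub fn get_or_create(&mut self, desc: DepthStencilDesc) -> &DepthStencilState {
        let created = &mut self.created;
        self.states.entry(desc).or_insert_with(|| {
            // In a real backend this is where the device call would happen.
            *created += 1;
            DepthStencilState { id: *created }
        })
    }

    /// Number of states that were actually created (cache misses).
    pub fn created(&self) -> usize {
        self.created
    }
}
```

The same shape would apply to any object whose description is cheap to hash; whether it fits render pass descriptors depends on how much of them varies per pass, which this sketch does not address.
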
Portability issues
- Leaking descriptor sets - Leaking descriptor sets portability#99
- Too much data moving and undesired heap allocation (during descriptor updates in particular)
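
One common way to cut heap allocation during repeated updates is to keep a scratch buffer alive and refill it from an iterator. The sketch below assumes that approach; `DescriptorWrite` and `UpdateScratch` are hypothetical names and do not mirror the portability layer's actual types.

```rust
/// Hypothetical descriptor write: a binding slot and a resource handle.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DescriptorWrite {
    pub binding: u32,
    pub resource: u64,
}

/// Scratch storage that is kept alive between descriptor updates.
#[derive(Default)]
pub struct UpdateScratch {
    writes: Vec<DescriptorWrite>,
}

impl UpdateScratch {
    /// Refills the retained buffer from an iterator.
    /// `clear` keeps the existing capacity, so updates that fit in it
    /// do not touch the allocator.
    pub fn collect<I>(&mut self, writes: I) -> &[DescriptorWrite]
    where
        I: IntoIterator<Item = DescriptorWrite>,
    {
        self.writes.clear();
        self.writes.extend(writes);
        &self.writes
    }

    /// Current capacity of the retained buffer.
    pub fn capacity(&self) -> usize {
        self.writes.capacity()
    }
}
```

Taking an iterator rather than a `Vec` also lets the caller avoid building an intermediate collection just to pass it in.
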
Application concerns
- The engine operates under the assumption that command buffer recording is cheap, while submission is expensive. They move the submission onto a separate thread, which doesn't appear to be saturated enough.
  - The chosen threading model (of having 2 threads for job execution and a dedicated submission thread) allows MoltenVK to effectively run on 3 threads, while we are limited to 2. This is something Dota should fix or expose for us to test.
    - Passing "-threads 3" technically solves this
- Submissions are too late - Metal enqueue() advantage #2232, Remote command sink in Metal #2260 (more work is possible)
Things to investigate
- Actual memory types/heaps used by an application. Maybe we could tweak the queries and/or ask Valve to fix those in order to use our exposed memory more efficiently. This is certainly a difference from MoltenVK.