[meta] Dota performance issues on Metal #2161

Closed
@kvark

Analyzing the Instruments profiles reveals quite a few things we could do better.

Native API calls

Metal backend issues

  • Too much heap (re)allocation - Various Metal performance optimizations #2185. We should avoid any allocations at run-time by re-using the storage and/or using the iterators more aggressively.
  • Too many command buffers - Command buffer allocation pool in Metal #2180. The instrumented memory profile doesn't show command buffers being re-used. I suppose the queue just creates a new one every time until it reaches the capacity, and only then starts trying to re-use the completed ones. This is undesirable, and we may attempt to address it (temporarily) by tuning the capacity limit.
  • Dynamic creation of a render pass - [mtl] minimize creation of render passes #2178. We create one every time a render pass is started by copying one from the framebuffer and filling out all the clear values/operations. Apparently, this is super heavy, and we should probably avoid creating any Metal objects at normal run time.
  • Dynamic creation of depth-stencil states - [mtl] cache depth/stencil states #2195.
  • Binding descriptor sets is visibly slow - Heap-less descriptor sets in Metal #2183. We should avoid doing work for repeated bindings, and need to explore ways to share heap-allocated data between sets of the same group.
  • Non-existent semaphore frame synchronization - Frame synchronization in Metal #2143
  • Too much locking going on - [meta] Locking on Metal #2229
  • Too many command buffer callbacks - [mtl] dynamic depth bias, callback coalescence, and more stats #2224
  • Copying the render pass descriptor (which we do at each start of an RP) takes 7.4% of the time in our library
  • Binding descriptor sets still involves a few hot loops that could be faster. In particular, we set resources in batches, and we provide a closure, which checks for the current pre-render status. All of that can be simplified if we are outside of a pass.
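
Several of the items above (render pass descriptors, depth-stencil states) come down to the same idea: create the native object once and look it up afterwards. A minimal sketch of such a cache follows, assuming the state can be described by a small hashable key. `DepthStencilDesc`, `NativeState`, and `StateCache` are hypothetical stand-ins for illustration, not gfx-rs or Metal types, and this is not necessarily how #2195 implements it.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a depth-stencil description (not a gfx-rs or Metal type).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct DepthStencilDesc {
    pub depth_write: bool,
    /// Hypothetical encoding of the compare function.
    pub compare: u8,
}

/// Hypothetical stand-in for an expensive native state object.
#[derive(Debug)]
pub struct NativeState {
    pub id: usize,
}

/// Creates each distinct state once and hands out references afterwards.
#[derive(Default)]
pub struct StateCache {
    states: HashMap<DepthStencilDesc, NativeState>,
    /// Number of times the expensive creation path actually ran.
    pub creations: usize,
}

impl StateCache {
    pub fn get_or_create(&mut self, desc: DepthStencilDesc) -> &NativeState {
        // Borrow the counter separately so the closure can update it
        // while the map entry is held.
        let creations = &mut self.creations;
        self.states.entry(desc).or_insert_with(|| {
            // A real backend would call into the native API here.
            *creations += 1;
            NativeState { id: *creations }
        })
    }
}
```

With this shape, repeated requests for the same description only pay for a hash lookup, and `creations` makes it easy to confirm in a profile that no objects are created during normal run time.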

Portability issues

Application concerns

  • The engine operates under the assumption that command buffer recording is cheap, while submission is expensive. It moves the submission onto a separate thread, which doesn't appear to be saturated enough.
    • The chosen threading model (2 threads for job execution and a dedicated submission thread) allows MoltenVK to effectively run on 3 threads, while we are limited to 2. This is something Dota should fix or expose for us to test.
    • Passing "-threads 3" technically solves this.
  • Submissions are too late - Metal enqueue() advantage #2232, Remote command sink in Metal #2260 (more work is possible)
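
For readers unfamiliar with the threading model described above, here is a rough sketch of it using only the Rust standard library: a few recording threads hand finished command buffers to one dedicated submission thread over a channel. `Recorded` and `run_frame` are hypothetical names for illustration, and the "recording" and "submission" steps are placeholders. This is not the engine's or the backend's actual code.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical recorded command buffer: a worker index and a sequence number.
#[derive(Debug, PartialEq, Eq)]
pub struct Recorded {
    pub worker: usize,
    pub seq: usize,
}

/// Spawns `workers` recording threads plus one submission thread and returns
/// everything the submission thread received, in arrival order.
pub fn run_frame(workers: usize, buffers_per_worker: usize) -> Vec<Recorded> {
    let (tx, rx) = mpsc::channel::<Recorded>();

    // Dedicated submission thread: drains the channel until all senders are dropped.
    let submitter = thread::spawn(move || {
        let mut submitted = Vec::new();
        for cmd in rx {
            // A real backend would commit the buffer to the queue here.
            submitted.push(cmd);
        }
        submitted
    });

    let recorders: Vec<_> = (0..workers)
        .map(|worker| {
            let tx = tx.clone();
            thread::spawn(move || {
                for seq in 0..buffers_per_worker {
                    // Recording is assumed to be cheap, so it is only a value here.
                    tx.send(Recorded { worker, seq })
                        .expect("submission thread exited early");
                }
            })
        })
        .collect();

    // Drop the original sender so the submitter's loop ends once the workers finish.
    drop(tx);
    for recorder in recorders {
        recorder.join().expect("recording thread panicked");
    }
    submitter.join().expect("submission thread panicked")
}
```

In this arrangement the submission thread mostly waits on the channel. If submission is cheap and recording is the expensive part, as the profiles suggest for this backend, the third thread adds little work while the two recording threads carry nearly all of it.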

Things to investigate

  • Actual memory types/heaps used by an application. Maybe we could tweak the queries and/or ask Valve to fix those in order to use our exposed memory more efficiently. This is certainly a difference from MoltenVK.
