Metal enqueue() advantage #2232

Open
@kvark

Looking at Metal System Trace, I realized the real value of enqueuing command buffers well before committing them. It's not documented much, but the change in driver behavior is drastic.

Theory

When a command buffer is enqueued, each pass gets passed down to the driver as soon as we call endEncoding on it, since the driver knows it doesn't need to wait for anything and is expecting the work. Consequently, the GPU starts chewing on the work right away. Thus, commit() becomes a simple message saying "I'm done with this", leaving the submission queue for other things to use.

Basically, I call this paragraph of the documentation BS:

"The enqueue method does not make the command buffer eligible for execution. To submit the command buffer to the device for execution, a subsequent commit method invocation is required."

Now, let's look at what happens if we don't enqueue anything explicitly:

"If the command buffer has not previously been enqueued, it is enqueued implicitly."

Sounds pretty harmless, doesn't it? Well, what really happens is that the driver doesn't want to do anything with our encoded passes until the command buffer gets committed. The passes get stacked on the command buffer internally (much like our software commands) and then dropped like a bomb on the driver upon the commit() call.

Practice

Let's get more concrete. Suppose gfx-portability spends X amount of time recording an application's (one-time) command buffer. The driver does some work, but it's able to propagate it to the GPU gradually and doesn't take longer than the GPU itself, so we only need to take into account its latency L. The GPU takes G amount of time to finish the work on those commands. Let's see how the work flows:

  1. The user records a command buffer, spends X time
  2. The command buffer gets submitted and flows to the driver, which starts propagating the work to the GPU after L time.
  3. The GPU takes G time to execute the work.

Total time: X + L + G
Encoding thread time: X
Submission thread time: ~0

Now, let's look at MoltenVK:

  1. The user "records" a command buffer, spending 0.5X time. Molten just copies over the commands internally, not touching Metal yet.
  2. The command buffer gets submitted. This is when Molten actually starts recording those commands into Metal, spending an extra 0.75X time. The trick is that Molten enqueues the command buffer right before doing any work, so the driver is ready to receive those encoded passes.
  3. The GPU starts working earlier, right at the first pass encoded by the submission. Let's say the GPU is the slowest stage here, so it is kept busy with incoming work and takes G time in total, like in our previous case.

Total time: 0.5X + max(0.75X, L + G)
Encoding thread time: 0.5X
Submission thread time: 0.75X

See what happened here? There is more CPU work in total, but it's spread over threads, and the whole thing actually completes faster because the GPU gets stuff to work on earlier. X here can logically be extended to the total recording time (instead of a single command buffer), given that it's the submission cut-off that matters, and you can see how this can drastically affect performance in the end.
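
To put numbers on the two timelines above, here is a minimal sketch of both formulas. The function names and the sample values of X, L and G are made up for illustration; the 0.5 and 0.75 factors are the rough estimates used in the text, not measurements.

```python
def immediate_recording_total(x, l, g):
    """Record straight into Metal, then commit at submission time.

    The driver only sees the work at commit, so the three stages run
    back to back: X (recording) + L (driver latency) + G (GPU time).
    """
    return x + l + g


def deferred_recording_total(x, l, g, record_factor=0.5, replay_factor=0.75):
    """Record into a software command list, then replay into an
    already-enqueued Metal command buffer at submission time.

    The replay (replay_factor * X) overlaps with driver latency and GPU
    execution (L + G), so only the longer of the two counts.
    """
    return record_factor * x + max(replay_factor * x, l + g)


# Hypothetical numbers, in milliseconds.
x, l, g = 8.0, 1.0, 6.0
print(immediate_recording_total(x, l, g))  # 15.0
print(deferred_recording_total(x, l, g))   # 11.0
```

With these sample values the deferred path spends 10 ms of CPU time instead of 8 ms, yet finishes 4 ms earlier, because the GPU work overlaps with the replay instead of waiting for the full recording.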

Solutions

In an ideal world, Vulkan would have some sort of API to tell the driver (earlier than at submission time) which order the one-time encoded command buffers are going to be submitted in. This isn't going to happen though.

A more practical alternative would be to try forcing deferred command buffer recording on our side, and see how this affects frame scheduling. This would technically zero out one of our major advantages, and it would become a race over whose software command buffers are lighter.

Finally, on the application side, we'd benefit from more granular submissions. Dota makes about a thousand command buffers, but only submits them in 2 chunks per frame. So we are being delayed by roughly a quarter of the frame time here.
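
One way to read the "quarter of the frame" estimate is sketched below. It assumes, purely for illustration, that recording proceeds at a constant rate over the frame and that submissions split it into equal chunks; the helper name is hypothetical. Under those assumptions, a command waits on average half a chunk between being recorded and being submitted.

```python
def mean_submission_delay(record_time, chunks):
    """Average wait between recording a command and submitting its chunk.

    Assumes uniform recording over `record_time` and `chunks` equally
    sized submissions, so each command waits half a chunk on average.
    """
    if chunks < 1:
        raise ValueError("need at least one submission")
    return record_time / (2 * chunks)


# Recording normalized to one frame: 2 submissions vs. 8 submissions.
print(mean_submission_delay(1.0, 2))  # 0.25
print(mean_submission_delay(1.0, 8))  # 0.0625
```

In this simplified model, going from 2 to 8 submissions per frame cuts the average delay from a quarter of the frame to a sixteenth, ignoring any per-submission overhead.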
