
runtime: CPU limit-aware GOMAXPROCS default #73193

@prattmic

Overview

Change the Go runtime on Linux to use CPU cgroup quota limits to set the default value of GOMAXPROCS.

This is a concrete proposal for the ideas discussed in #33803. I've included a lot of background, but you can jump directly to the proposal.

Background

Go

GOMAXPROCS specifies the maximum number of goroutines that may be run in parallel. In the implementation, this corresponds to the maximum number of system threads that will execute a goroutine at a given time. In other words, GOMAXPROCS specifies the maximum parallelism of a Go program.

Note that GOMAXPROCS does not apply to threads created by (and running in) C, or to “blocking” system calls or cgo calls [1]. Thus the actual maximum parallelism of a Go process may exceed GOMAXPROCS.
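For concreteness, both values are observable from Go today; passing a value less than 1 to runtime.GOMAXPROCS reports the current setting without changing it:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// An argument < 1 reports the current setting without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	// NumCPU reports the number of logical CPUs usable by the process.
	fmt.Println("NumCPU:", runtime.NumCPU())
}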

Linux

The Linux kernel has several different mechanisms that impact thread scheduling.

The most fundamental is the actual number of logical CPUs in the machine. That is, all CPUs, counting hyperthreads, if any. This defines the maximum parallelism of the machine itself, as the Linux kernel can only run one thread on a CPU at a time.

Closely related is the CPU affinity mask of a process, set by sched_setaffinity(2). This specifies the set of logical CPUs that a given process is allowed to run on. The Linux kernel will never schedule any threads on CPUs not in the mask, even if they are idle. This provides a mechanism to limit the maximum parallelism of a process by reducing its available CPU set. Unfortunately, this is fairly rudimentary as it requires the user to manually allocate CPU resources, which may result in subpar overall utilization.
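For reference, a Go process can inspect its own affinity mask via the golang.org/x/sys/unix package; a minimal sketch:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var set unix.CPUSet
	// Pid 0 means the calling process.
	if err := unix.SchedGetaffinity(0, &set); err != nil {
		panic(err)
	}
	// Count reports how many CPUs are in the mask; this count is one
	// of the inputs to the GOMAXPROCS default today.
	fmt.Println("CPUs in affinity mask:", set.Count())
}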

Both of these mechanisms directly correspond to maximum parallelism, so they are the basis for the GOMAXPROCS default today. GOMAXPROCS defaults to either the total number of logical CPUs on the machine, or the number of logical CPUs available in the CPU affinity mask, whichever is lower.

Linux’s CPU cgroups provide additional scheduling controls commonly used by container runtimes/orchestration systems such as Docker or Kubernetes. Note that Linux has both v1 and v2 cgroups. Both provide the same CPU controls, just with slightly different names.

The first is cpu.cfs_quota_us / cpu.cfs_period_us (v1) or cpu.max (v2). This defines the maximum CPU time the cgroup may use within some period window. For example, a typical period is 100ms. If the quota is set to 800ms, then the cgroup may use 800ms of CPU time every 100ms of wall time. The simple case here would be that 8 threads can run in parallel, as each uses 100ms of CPU time per 100ms of wall time.

However, note that this is not a limit on maximum parallelism. For instance, it is also allowed for 16 threads to run for 50ms and then do nothing for 50ms. This allows bursts of higher parallelism, provided the cgroup uses less CPU later in the period. If a cgroup exceeds its quota, all threads in the cgroup are descheduled until the end of the period.

The bursting behavior makes this not quite a perfect match for GOMAXPROCS, since GOMAXPROCS does not allow bursts; otherwise, however, the quota is conceptually similar to GOMAXPROCS.

Note that cgroups are hierarchical, so the effective quota for a cgroup is the minimum quota of any cgroup up the hierarchy, assuming the periods are all identical.
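As an illustration, here is a minimal sketch of that hierarchy walk for cgroup v2, assuming the cgroup filesystem is mounted at /sys/fs/cgroup and the leaf cgroup directory is already known (in reality both must be discovered from /proc/self/cgroup and /proc/self/mountinfo):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// cpuLimit walks from the leaf cgroup directory up to the cgroup root,
// reading the v2 cpu.max file ("<quota> <period>" or "max <period>") at
// each level, and returns the minimum quota/period ratio, with ok=false
// if no level sets a limit. Taking the minimum of the ratios is exact
// only if all periods are identical, as noted above.
func cpuLimit(leaf string) (limit float64, ok bool) {
	for dir := leaf; strings.HasPrefix(dir, "/sys/fs/cgroup"); dir = filepath.Dir(dir) {
		data, err := os.ReadFile(filepath.Join(dir, "cpu.max"))
		if err != nil {
			continue // no CPU controller at this level
		}
		fields := strings.Fields(string(data))
		if len(fields) != 2 || fields[0] == "max" {
			continue // "max" means no limit at this level
		}
		quota, err1 := strconv.ParseFloat(fields[0], 64)
		period, err2 := strconv.ParseFloat(fields[1], 64)
		if err1 != nil || err2 != nil || period == 0 {
			continue
		}
		if l := quota / period; !ok || l < limit {
			limit, ok = l, true
		}
	}
	return limit, ok
}

func main() {
	// The leaf path here is illustrative.
	if limit, ok := cpuLimit("/sys/fs/cgroup/mygroup"); ok {
		fmt.Printf("effective CPU limit: %.2f\n", limit)
	}
}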

Second, there is cpu.shares (v1) or cpu.weight (v2). These do not set hard limits at all, but are instead relative priorities given to the kernel scheduler. That is, threads in cgroup A with double the shares of cgroup B are twice as likely to run.

Container runtimes will use this to set relative priorities of different containers. For instance, suppose you have a machine with 8 logical CPUs, and two containers A and B. Both containers want to run 8 threads in parallel, but you would like A to use approximately 1 CPU and B to use approximately 7 CPUs. If you set A with shares of 1 and B with shares of 7, the scheduler will run 1 of A’s threads at a time, and 7 of B’s threads. However, the ratio only matters if the machine is overcommitted. If B is completely idle, A will run all 8 threads, as there is no hard limit on its CPU usage.

So, the relative values of CPU shares make them difficult to interpret without context, and the lack of an actual limit makes them a bad fit for GOMAXPROCS.

Note that .NET and Java both use the CPU quota to determine CPU limits. Java originally also considered CPU shares, but reverted that change after realizing it was a bad fit.

Finally, cpuset.cpus defines the set of CPUs available for scheduling within the cgroup. This is equivalent to using sched_setaffinity(2), except that it applies on the cgroup level. The result of this configuration is visible to applications via sched_getaffinity(2).

Higher level

While the Linux kernel values are the primary things we have to work with, let’s take a look at how users actually configure these values.

Docker

Docker provides --cpu-quota, --cpu-period, and --cpu-shares flags that correspond directly to the cgroup options, as well as the slightly more general --cpus, which simply sets the CPU quota with a predefined period.

Kubernetes

Kubernetes is a bit higher level. Kubernetes containers specify CPU limits and requests. From the summary of how these work:

  • The CPU limit is “a hard ceiling on how much CPU time the container can use”. This corresponds almost directly to the CPU cgroup quota. Kubernetes selects a constant period (100ms, I believe) and scales the CPU limit to determine the quota; e.g., a CPU limit of 8 results in a quota of 800ms (see the sketch after this list).
  • The CPU request is the “minimum required CPU” for a container. The Kubernetes pod scheduler will not overcommit nodes, ensuring there is at least as much as the requested CPU available for each container. In addition, Kubernetes will assign CPU shares to achieve appropriate weighting between different containers (like the shares example above). A container with only a request and no limit has no hard upper bound on CPU usage.
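The limit-to-quota scaling is simple arithmetic; a sketch, where the fixed 100ms period is my assumption based on the note above:

package main

import "fmt"

// quotaForLimit converts a Kubernetes CPU limit (in CPUs) into a cfs
// quota in microseconds, assuming a fixed 100ms period. A limit of 8
// yields an 800ms quota per 100ms period.
func quotaForLimit(limitCPUs float64) (quotaUs, periodUs int64) {
	const period = 100_000 // 100ms in microseconds (assumed constant)
	return int64(limitCPUs * period), period
}

func main() {
	quota, period := quotaForLimit(8)
	fmt.Printf("quota=%dus period=%dus\n", quota, period) // quota=800000us period=100000us
}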

Historically, CPU requests and limits could not be changed after starting the container. However, Kubernetes has alpha support for “in place vertical scaling”, which allows changing the CPU request and limit while the container is running. This is scheduled to be promoted to beta in Kubernetes 1.33.

Similarly, historically CPU requests and limits applied only to individual containers. Kubernetes also has alpha support for “pod-level resource specification”, which allows specifying requests and limits for a pod, which is a group of one or more containers. There is no published target beta or GA schedule for this feature. I believe this results in a multi-level cgroup hierarchy.

A large fraction of Kubernetes users set CPU requests but not limits, with the intention that they want to consume otherwise idle cycles. Doing so should theoretically increase overall fleet CPU utilization, and also help avoid production issues caused by hitting CPU limits when there are sufficient idle cycles to handle the load.

Since CPU limits correspond directly to quotas, that makes them easy to work with.

CPU requests correspond indirectly to shares. If the Go runtime had access to the specific request value rather than the raw shares value, that would resolve the lack-of-context problem described above. However, the lack of an upper limit is still problematic. The CPU request is the minimum CPU. We could set GOMAXPROCS to this value, but that is overly conservative, preventing any use of parallelism beyond the minimum request. This also conflicts with users’ intention to use extra idle cycles over the request. We could set it to a higher value, but it is unclear how much higher we should go.

In addition to CPU requests and limits, Kubernetes also has an optional CPU Manager, which assigns exclusive CPUs to containers when enabled. These scheduling restrictions are applied via CPU cgroup cpuset.cpus configuration, which is visible to applications via sched_getaffinity(2).

Mesos

I am less familiar with Mesos, but their resource summary is similar to Kubernetes. Tasks can have CPU requests and limits, which translate to CPU cgroup shares and limits, respectively.

I do not know which configurations are most common with Mesos.

GOMAXPROCS default today

Since Go 1.5, GOMAXPROCS has had a default of “the number of logical CPUs available”. Concretely, this means either the total number of logical CPUs on the machine, or the number available in the CPU affinity mask, whichever is lower.

This default works well for single-tenant systems (where the Go program is approximately the only thing running), allowing full use of the machine’s parallelism without requiring any configuration from users.

For multi-tenant systems without isolation (e.g., a laptop running multiple active applications), this default also tends to work fairly well. If the application only receives, say, 50% of the machine’s maximum parallelism on average, it may seem better to use a lower value of GOMAXPROCS. On the other hand, the other tenants could reduce their CPU usage at any time, allowing this application to use more; with a lower value of GOMAXPROCS, the application could not utilize this newly available parallelism.

Many of us (@mknyszek, @aclements, @cherrymui, etc) have long brainstormed a mechanism to eliminate the need for GOMAXPROCS entirely and instead dynamically discover the available parallelism and make adjustments on the fly in order to better serve multi-tenant systems in cases like above. Unfortunately, we do not expect to achieve something like this in the near term.

Additionally, the current GOMAXPROCS defaults tend to work OK for multi-tenant systems with no isolation simply because it is uncommon for such systems to run extremely overcommitted, which is when the default performs the worst.

Multi-tenant systems with isolation are typically container orchestration systems like Kubernetes. These will run multiple applications on a single machine with some form of CPU isolation for each application (CPU cgroup quota or shares). For example, a 64 CPU machine may host 8 applications, each of which have a CPU quota of 8 CPUs.

With a default GOMAXPROCS of 64, but only 8 CPUs of (average) parallelism, these applications are quite mismatched from reality. Downsides from this mismatch include:

  • CPU quota throttling. An application with a CPU quota of 8 and GOMAXPROCS=64 can quickly hit its quota and throttle (all threads descheduled) until the end of the period, which causes direct latency impact. Note that if the application has 64 concurrently runnable goroutines, then even with GOMAXPROCS=8 there will be latency impact from goroutines waiting for the Go runtime to schedule them. However, this may be preferred, as it is smoother than the hard cutoff of CPU quota throttling.
  • Latency impact and CPU quota throttling from GC. Like the rest of the runtime, the GC uses GOMAXPROCS as the source of available parallelism. When running, the GC targets using 25% of GOMAXPROCS to perform GC work. Generally, this means that 25% of GOMAXPROCS is used to run GC worker goroutines, and 75% of GOMAXPROCS is used to run standard application goroutines. Additionally, the GC runs “idle workers” on any remaining portion of GOMAXPROCS that otherwise has nothing to do. This causes two primary issues:
    • Major: For applications that generally remain under their CPU quota despite the high GOMAXPROCS (because they simply do not have too many goroutines running concurrently), the 25% GC worker target plus idle workers on remaining GOMAXPROCS will cause a large spike in work that causes the application to exceed its quota and get throttled when it otherwise would not have.
    • Minor: Actual thread scheduling warps the 25% target for applications above their quota. In the GOMAXPROCS=64 example, the Go runtime will target running 16 GC worker goroutines and 48 application goroutines across 64 threads. The Linux kernel only runs 8 threads at a time. On average, 25% of the running threads will be running GC workers, but because thread scheduling is arbitrary, at any time there may be significantly more than 25% GC workers running, up to 100% GC workers running and no application goroutines running, which would effectively be an unintentional “stop the world”.
  • Scalability costs. Running Go at higher GOMAXPROCS has a variety of scaling costs, from increased memory use due to additional caches, to increased coordination costs between threads. These costs are generally worthwhile to achieve additional parallelism, but when there is a big mismatch between GOMAXPROCS and actual parallelism these costs are paid with no benefit.
  • Minor: Increased context switching costs. The Go runtime will run goroutines on 64 threads, which the Linux kernel will need to round-robin between the 8 CPUs available. This is an added cost when the Go runtime could do this scheduling itself. This can also lead to edge cases like the kernel descheduling a thread that is running a goroutine holding a mutex. No other goroutine will be able to acquire that mutex until the kernel runs the thread again so the holder can unlock the mutex.

Proposal

Given this background, my proposal is that:

  1. At startup, if GOMAXPROCS is not set in the environment, the Go runtime will determine:
    1. The total logical CPU count of the machine
    2. The number of available logical CPUs from sched_getaffinity(2)
    3. If the process is in a cgroup, the “adjusted” CPU limit.
      1. For each level of the cgroup hierarchy, compute the CPU limit as cpu.cfs_quota_us / cpu.cfs_period_us (or cgroup v2 equivalent).
      2. Take the minimum CPU limit in the hierarchy as the “effective” CPU limit.
      3. Compute the “adjusted” CPU limit as max(2, ceil(effective_cpu_limit)).
  2. The default value of GOMAXPROCS will be the minimum of (i), (ii), and (iii); a sketch of this computation follows this list.
  3. A new API in package runtime, func SetDefaultGOMAXPROCS, sets GOMAXPROCS based on the default behavior described above.
  4. The Go runtime will automatically update GOMAXPROCS if the CPU affinity or cgroup CPU limit change. This is done with a low frequency scan of the current environment. Users may call SetDefaultGOMAXPROCS to manually trigger updates.
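A minimal sketch of steps 1 and 2, assuming helpers that supply the three inputs (the helper names and signatures are mine, not part of the proposal):

package main

import "math"

// defaultGOMAXPROCS combines the three inputs from step 1. numCPU and
// affinityCPUs are the machine and affinity-mask logical CPU counts;
// cgroupLimit is the effective cgroup CPU limit, with ok=false when the
// process has no quota. Uses the min/max builtins from Go 1.21+.
func defaultGOMAXPROCS(numCPU, affinityCPUs int, cgroupLimit float64, ok bool) int {
	procs := min(numCPU, affinityCPUs)
	if ok {
		// Step 1.3.3: round the fractional limit up, with a floor
		// of 2 (rationale in the Discussion section below).
		procs = min(procs, max(2, int(math.Ceil(cgroupLimit))))
	}
	return procs
}

func main() {
	// E.g., a 64-CPU machine, full affinity mask, and a 7.5-CPU quota.
	println(defaultGOMAXPROCS(64, 64, 7.5, true)) // 8
}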

The Go runtime will support querying the CPU cgroup quota from either cgroups v1 or v2. Note that mixed v1 and v2 controllers are supported by Linux. Go should support those as well.

This change in behavior is controlled by a compatibility GODEBUG, cgroupgomaxprocs=1. This defaults to cgroupgomaxprocs=0 for older language versions. Thus, behavior changes only when upgrading the language version, not when upgrading the toolchain.

The updated and new documentation:

// GOMAXPROCS sets the maximum number of CPUs that can be executing
// simultaneously and returns the previous setting. If n < 1, it does not change
// the current setting.
//
// If the GOMAXPROCS environment variable is set to a positive whole number,
// GOMAXPROCS defaults to that value.
//
// Otherwise, the Go runtime selects an appropriate default value based on the
// number of logical CPUs on the machine, the process’s CPU affinity mask, and,
// on Linux, the process’s average CPU throughput limit based on cgroup CPU
// quota, if any.
//
// The Go runtime periodically updates the default value based on changes to
// the total logical CPU count, the CPU affinity mask, or cgroup quota. Setting
// a custom value with the GOMAXPROCS environment variable or by calling
// GOMAXPROCS disables automatic updates. The default value and automatic
// updates can be restored by calling [SetDefaultGOMAXPROCS].
//
// If GODEBUG=cgroupgomaxprocs=0 is set, GOMAXPROCS defaults to the value of
// [runtime.NumCPU] and does not perform automatic updating.
//
// The default GOMAXPROCS behavior may change as the scheduler improves. 
func GOMAXPROCS(n int) int

// SetDefaultGOMAXPROCS updates the GOMAXPROCS setting to the runtime
// default, as described by [GOMAXPROCS], ignoring the GOMAXPROCS
// environment variable.
//
// SetDefaultGOMAXPROCS can be used to enable the default automatic updating
// GOMAXPROCS behavior if it has been disabled by the GOMAXPROCS
// environment variable or a prior call to [GOMAXPROCS], or to force an immediate
// update if the caller is aware of a change to the total logical CPU count, CPU
// affinity mask or cgroup quota. 
func SetDefaultGOMAXPROCS()
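For example, a program that wants to override the environment variable and restore automatic updates would call the new function at startup; a sketch assuming the proposed API is accepted as written:

package main

import "runtime"

func main() {
	// Proposed API, not in any released Go: discard any GOMAXPROCS
	// environment setting and restore the runtime default, including
	// automatic updates.
	runtime.SetDefaultGOMAXPROCS()
	// ... rest of the application ...
}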

Discussion

Given the details in the background, cgroup CPU shares are a bad fit for GOMAXPROCS and cannot be used.

cgroup CPU quota is a much better fit, but is not perfect. The CPU quota allows bursting to higher parallelism, which GOMAXPROCS does not. By setting GOMAXPROCS to the quota, we potentially increase latency for bursty workloads. See the example below for a more thorough explanation.

While this is disappointing, I still believe this makes a better default than the current default of total CPU count, as it mitigates the numerous downsides of a GOMAXPROCS mismatch, which in my opinion are more extreme than lost burst ability. I do expect some workloads will want to manually increase GOMAXPROCS to allow bursts.

Additionally, the CPU quota and period may average to a fractional number of CPUs (e.g., 0.5 or 2.5). GOMAXPROCS does not allow fractional parallelism. If the quota is less than 1, we must round up. If the quota is greater than 1, we can either round up or down. Arguments can go either way. If we round down, we are very unlikely to exceed the limit, which avoids potential freezes for the remainder of the period. If we round up, we can get better utilization by fully using the quota, and this theoretically might better indicate to monitoring systems that we are starved of CPU. I’ve selected the latter approach of rounding up for consistency with the GOMAXPROCS=2 rationale below, but I don’t feel strongly. In particular, uber-go/automaxprocs#13 and #33803 (comment) make reasonable arguments for rounding down, based on the assumption that fractional requests are intended to support additional small processes outside of the Go application.

The CPU quota limit specifies a minimum of GOMAXPROCS=2. That is, with a quota less than or equal to 1, we will round up GOMAXPROCS all the way to 2. GOMAXPROCS=1 disables all parallelism in the Go scheduler, which can cause surprising effects like GC workers temporarily “pausing” the application while the Go runtime switches back and forth between application goroutines and GC worker goroutines. Additionally, I consider a CPU quota less than 1 to be an indication that a workload is bursty, since it must be to avoid hitting the limit. Thus we can take advantage of the bursty nature to allow the runtime itself to burst and avoid GOMAXPROCS=1 pitfalls. If the number of logical or affinity mask CPUs is 1, we will still set GOMAXPROCS=1, as there is definitely no additional parallelism available.

Currently, if you wish to reset GOMAXPROCS to the default value (such as to override the GOMAXPROCS environment variable), you use runtime.GOMAXPROCS(runtime.NumCPU()). Note that NumCPU already takes CPU affinity into account.

All inputs to runtime.GOMAXPROCS are already well defined (n > 0 sets GOMAXPROCS to n; n <= 0 leaves it unchanged and returns the current value), so I’ve defined a new function (runtime.SetDefaultGOMAXPROCS) which performs the lookup and updates GOMAXPROCS. I’m not particularly attached to this. Some alternatives include:

  1. Change runtime.NumCPU to also consider the CPU quota. This feels like a bad idea to me because creating per-CPU caches would need an actual CPU count, not a quota, but that is extremely niche. Also, this call is currently defined as never updating after process start. It would need to update if users want to use this to discover changes to the quota.
  2. Add a new runtime.CPUQuota API that just returns the quota. This is a bit more flexible, as the result can be used for other purposes. The main downside is that reimplementing the default GOMAXPROCS behavior is complicated. Something like: runtime.GOMAXPROCS(min(runtime.NumCPU(), max(2, int(math.Ceil(runtime.CPUQuota()))))) (expanded in the sketch after this list).
  3. Since extremely large values of GOMAXPROCS are not useful, we could define some constant runtime.DefaultGOMAXPROCS = math.MaxInt32 that could be passed to runtime.GOMAXPROCS(). This is questionably backwards compatible, but setting a huge GOMAXPROCS is likely so slow that I doubt anyone does so.
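To make alternative 2 concrete: the reimplementation would look roughly like the sketch below, where runtime.CPUQuota is the hypothetical API from that alternative (I have also assumed it reports whether a quota is set at all):

package main

import (
	"math"
	"runtime"
)

func main() {
	// CPUQuota is hypothetical (alternative 2 above) and does not exist
	// in any Go release. Reproducing the default GOMAXPROCS behavior
	// requires composing several calls by hand, which is the downside
	// described above.
	if quota, ok := runtime.CPUQuota(); ok {
		runtime.GOMAXPROCS(min(runtime.NumCPU(), max(2, int(math.Ceil(quota)))))
	}
}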

The runtime will automatically update GOMAXPROCS if the CPU quota (or affinity) changes to accommodate container runtimes that change limits online. In particular, it would be unfortunate and confusing if CPU limits on a container are increased but the Go application never used the additional resources.

Automatic updates to GOMAXPROCS consider changes to the CPU affinity mask in addition to cgroup limit. Theoretically affinity changes should be reflected in runtime.NumCPU, but that call is defined as never changing after startup, which is unfortunate.

Implementation note: The CPU cgroup configuration files (cpu.cfs_quota_us, etc) do not support an explicit notification mechanism (such as poll(2)) when their values change. The only usable notification mechanisms are vfs file watches like inotify(7) / fanotify(7). sched_getaffinity(2) also has no notification mechanism. As a result, my current thinking is that we will detect changes via low frequency reread of the files in sysmon. We will scan with a minimum period of 30s, up to the maximum sysmon period (1 minute, due to forced GCs).
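A userspace analogue of that scan, for programs that do not want to wait for runtime support, might look like the following sketch (the 30s interval mirrors the minimum period above; SetDefaultGOMAXPROCS is the proposed API):

package main

import (
	"runtime"
	"time"
)

func main() {
	// Poll for affinity/quota changes at low frequency, mirroring the
	// proposed sysmon scan. With automatic updates built into the
	// runtime, this loop becomes unnecessary.
	go func() {
		for range time.Tick(30 * time.Second) {
			runtime.SetDefaultGOMAXPROCS()
		}
	}()
	select {} // stand-in for the real application work
}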

Note that container runtimes often mount a limited cgroupfs revealing only the leaf cgroup. This would prevent the runtime from walking up the hierarchy to check for more restrictive parent cgroups. I suspect this is a minor issue, as I expect it is rare to have more restrictive parent cgroups, since that doesn't have much utility. It may even be OK to simply ignore parent cgroups at all times.

A major downside of this proposal is that it has no impact on container runtime users that set a CPU request but no limit. This is a very common configuration, and it will see no change from the status quo, which is unfortunate (note that Uber’s automaxprocs also does nothing for these users). Still, this proposal is better for users that do set a limit, and should not impede future changes for users with only a request.

This proposal is primarily limited to Linux. If other OSes have similar CPU limit mechanisms, I think it would make sense to support those as well in future proposals. The automatic updates of GOMAXPROCS based on changes to CPU scheduling affinity will affect all OSes.

Today’s GOMAXPROCS default is conceptually close to static (“number of CPUs”). With this proposal, I see the runtime moving more towards a dynamic “the runtime selects a good value” approach, which I think is a beneficial move to make if we want to make additional changes in the future, such as eliminating a fixed GOMAXPROCS entirely.

Comparison to go.uber.org/automaxprocs

go.uber.org/automaxprocs is a popular package for automatically setting GOMAXPROCS for container workloads. This proposal is very similar to automaxprocs, effectively an upstream version. For completeness, the major differences between this proposal and automaxprocs are:

  • automaxprocs by default has a minimum GOMAXPROCS of 1. This proposal has a minimum of 2.
  • automaxprocs by default rounds fractional limits down. This proposal rounds up.
  • automaxprocs by default logs changes it makes with log.Printf. This proposal does not log.
  • automaxprocs is more configurable (minimum GOMAXPROCS, rounding, logging). None of these are configurable in this proposal.
  • automaxprocs does not automatically update GOMAXPROCS when the quota changes. This proposal does.
  • If the process is in a CPU cgroup, automaxprocs always uses the CPU quota to set GOMAXPROCS. This proposal will use the number of logical CPUs or CPUs available in sched_getaffinity(2) if those are less than the CPU quota.
  • automaxprocs does not appear to support mixed cgroup v1 and v2 controllers. This proposal does.

Open questions

Implementation

I intend to implement this proposal for Go 1.25 if it is accepted.

Appendix

GOMAXPROCS and CPU quota difference example

To illustrate the potential downside of setting GOMAXPROCS to the CPU limit for bursty applications, consider an idealized example application:

  • The machine has 10 logical CPUs.
  • The application runs inside a CPU cgroup with quota = 200ms, period = 100ms.
  • The application receives 1 request every 100ms.
  • Each request requires 50ms of CPU time to complete, and the work is perfectly parallelizable.

Without this proposal, GOMAXPROCS=10. When a request is received, the 50ms of work is spread across 10 goroutines and the request completes with a latency of 5ms. The application never exceeds the cgroup quota because it only uses 50ms of CPU time in each 100ms period.

With this proposal, GOMAXPROCS=2. When a request is received, the 50ms of work is spread across 2 goroutines and the request completes with a latency of 25ms. The application again never exceeds the cgroup quota.
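The latencies in both cases are just CPU time divided by available parallelism; a trivial sketch of the arithmetic:

package main

import (
	"fmt"
	"time"
)

// requestLatency returns the idealized completion time for perfectly
// parallelizable work: cpuTime of work spread across procs workers.
func requestLatency(cpuTime time.Duration, procs int) time.Duration {
	return cpuTime / time.Duration(procs)
}

func main() {
	fmt.Println(requestLatency(50*time.Millisecond, 10)) // 5ms with GOMAXPROCS=10
	fmt.Println(requestLatency(50*time.Millisecond, 2))  // 25ms with GOMAXPROCS=2
}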

This is a fairly extreme example, with the application completely idle most of the time, but that isn't necessary to get subpar behavior.

The "CPU limit" that we compute from quota / period is simply the average parallelism available for an entire period. Any subsection of the period where an applications uses less than the average CPU limit provides headroom to use more CPU later in the same period. e.g., with a CPU limit of 4, if an application uses only 2 CPU for the first half of the period, it could use 6 CPU for the second half and still achieve an average of 4.

With GOMAXPROCS set to the average CPU limit, Go prevents using more than the average parallelism at any point, so any headroom gained from using less is simply wasted.

CPU request based GC CPU

This is out of scope for this proposal, but one partially-baked idea to help users with a CPU request but no limit is to leave GOMAXPROCS alone, but to restrict the GC’s CPU target and idle workers to the CPU request.

Today, while the GC is running, the Go runtime will use 25% of GOMAXPROCS as “dedicated” GC workers; i.e., it will run GC work on 25% of Ps even if it needs to deschedule user goroutines to do so. Beyond that, it will run “idle” workers on every available idle P. This can cause the big CPU spikes described above.

Instead, we could run dedicated workers on 25% of “CPU request” Ps, and only idle workers on Ps up to a total CPU usage of the CPU request. This would resolve the big spikes caused by the GC.

The primary risk of this change is that if the GC can no longer keep up with the allocation rate of the application, it will force GC assists on user goroutines. This is most likely if the application is using more than its CPU request. One possible mitigation would be to adjust the GC target CPU upward if the application is using more than the CPU request.

This depends on having a mechanism to determine what the container CPU request value is.

Note that this is an internal implementation detail, so it does not need to be a proposal.

cc @golang/runtime, @sywhang @chabbimilind (for automaxprocs), @thepudds (for previous prototype experience), @thockin (for Kubernetes)

Footnotes

  [1] Go intends for system calls and cgo calls that remain on CPU to count towards GOMAXPROCS. However, it is difficult to efficiently determine if these calls are blocked (off CPU), so the runtime uses a heuristic that assumes that calls that take a long wall time are blocked.
