
reduce timer object creation to reduce gc overhead #4175


Closed
es-chow wants to merge 1 commit

Conversation

es-chow (Contributor) commented Feb 6, 2016

By starting 3 nodes with GODEBUG=memprofilerate=1 and fetching

http://host:port/debug/pprof/heap?debug=1

you will find heavy timer object creation:

1604: 102656 [4251: 272064] @ 0x9cb15c 0x9cb451 0xa0b681 0x94dcd1
#   0x9cb15c    time.NewTimer+0x6c                              /usr/local/go/src/time/sleep.go:74
#   0x9cb451    time.After+0x21                                 /usr/local/go/src/time/sleep.go:110
#   0xa0b681    github.com/cockroachdb/cockroach/server.(*rpcTransport).processQueue+0x781  /cockroach/src/github.com/cockroachdb/cockroach/server/raft_transport.go:166

This timer creation only needs to happen once, outside the loop.
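
As a rough illustration, here is a minimal sketch of that idea (not the actual raft_transport.go code; the queue type, function name, and timeout are hypothetical): allocate one timer outside the loop and Reset it each iteration, instead of calling time.After, which allocates a fresh timer, on every pass. Note that this bare form still has the small race around Reset discussed below.

package main

import (
    "fmt"
    "time"
)

// processQueue handles requests until it has been idle for idleTimeout.
// The timer is created once and reused via Reset, instead of allocating
// a new timer with time.After on every loop iteration.
func processQueue(queue <-chan string, idleTimeout time.Duration) {
    idleTimer := time.NewTimer(idleTimeout)
    defer idleTimer.Stop()
    for {
        idleTimer.Reset(idleTimeout)
        select {
        case req := <-queue:
            fmt.Println("handling", req)
        case <-idleTimer.C:
            return // idle: let the goroutine shut down
        }
    }
}

func main() {
    queue := make(chan string, 1)
    queue <- "raft message"
    processQueue(queue, 50*time.Millisecond)
}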


for {
  raftIdleTimer.Reset(raftIdleTimeout)
Collaborator

I'm not sure if this is functionally equivalent. The old code would restart the timer on every iteration. But it looks like this new pattern introduces a small race where the timer can fire just as a request is received. The timer internally sends to raftIdleTimer.C and on the next iteration we call Reset, but that doesn't clear the channel. I'm not familiar enough with this code to know if that race is innocuous.

Contributor

Good catch; that does seem like an issue. Now that I look closely, there are several other problems with the shutdown-on-idle behavior. A new request could be written to the channel after the decision has been made to shut down the goroutine, and even without any races the goroutine could stop while there are still outstanding requests that will try to write to the done channel.

We have several instances of this pattern where we want a goroutine that goes away when it is idle (here, gossip.infoStore.runCallbacks, and my upcoming work on parallelizing raft execution). I'm going to try to refactor the infoStore pattern into something reusable.
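
To make the first hazard concrete, here is a minimal sketch with hypothetical names (a transport struct and a workerRunning flag; it assumes the sync and time packages and is not the actual CockroachDB code). The sender's check-and-start is not atomic with the worker's decision to exit on idle, so a request can be queued just after the worker has decided to stop:

// transport is a hypothetical stand-in for the real structure.
type transport struct {
    mu            sync.Mutex
    workerRunning bool
    queue         chan string
}

func (t *transport) send(req string) {
    t.mu.Lock()
    if !t.workerRunning {
        t.workerRunning = true
        go t.runWorker()
    }
    t.mu.Unlock()
    // Race window: the worker may hit its idle timeout and return between
    // the check above and this send, leaving req in the buffer with no
    // consumer (or blocking the sender if the channel is unbuffered).
    t.queue <- req
}

// runWorker processes requests and exits once it has been idle.
func (t *transport) runWorker() {
    idleTimer := time.NewTimer(time.Second)
    defer idleTimer.Stop()
    for {
        idleTimer.Reset(time.Second)
        select {
        case req := <-t.queue:
            _ = req // handle the request, then report on a done channel
        case <-idleTimer.C:
            t.mu.Lock()
            t.workerRunning = false
            t.mu.Unlock()
            return // shutdown-on-idle: a request sent after this point is stranded
        }
    }
}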

Contributor

Reset returns false if the timer had expired (in other words, it returns false if there's a value parked on the channel). Seems easy enough to mitigate:

if !raftIdleTimer.Reset(raftIdleTimeout) {
  <-raftIdleTimer.C
}
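
Folding that drain into the earlier sketch might look like this (still hypothetical names; only the time package is needed). If Reset reports that the timer had already expired, the stale value is consumed before the select, so it cannot trigger a premature idle shutdown on the next iteration:

func processQueueDrained(queue <-chan string, idleTimeout time.Duration) {
    idleTimer := time.NewTimer(idleTimeout)
    defer idleTimer.Stop()
    for {
        if !idleTimer.Reset(idleTimeout) {
            // The timer fired while a request was being handled; its value is
            // still parked in idleTimer.C, because taking the <-idleTimer.C
            // case below would have returned from the function.
            <-idleTimer.C
        }
        select {
        case req := <-queue:
            _ = req // handle the request
        case <-idleTimer.C:
            return // idle: let the goroutine shut down
        }
    }
}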

bdarnell (Contributor) commented Feb 6, 2016

LGTM. While there are some possible races, I don't think this change makes things significantly worse than before, so we might as well make the change.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 13, 2016
There are currently 8 places in CockroachDB non-test code
that create a `time.Timer` using `time.NewTimer` during
every iteration of a loop. cockroachdb#4175 proposed a fix for the worst
instance of this issue, within `*rpcTransport.processQueue`,
which resulted in upwards of **400,000** timers in use on a
single node at a given time. The second biggest offender was
`kv.send`, which resulted in about **30,000** timers in use on a
single node at a given time. I diagnosed that, together, these
two issues were responsible for the memory leak seen in
cockroachdb#4346. After making these fixes, the issue appears to be
gone, as memory no longer grows without bound.

I've gone ahead and fixed all 8 occurrences of this anti-pattern
across our codebase, using the strategy @tamird brought up in
[this comment](https://github.com/cockroachdb/cockroach/pull/4175/files#r52558817)
to avoid a race condition with the timers between iterations.
A few of the changes might be a little over-aggressive, as the
loops are not as "tight" as the ones causing issues, but I still
think it's important to make this change now and avoid these
issues in the future.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 13, 2016
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 14, 2016
es-chow (Contributor, Author) commented Feb 14, 2016

Closing, as #4367 will fix the same issue.

es-chow closed this Feb 14, 2016
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 15, 2016