
reduce timer object creation to reduce gc overhead #4175


Closed
es-chow wants to merge 1 commit

Conversation

es-chow (Contributor) commented Feb 6, 2016

By starting 3 nodes with GODEBUG=memprofilerate=1 and fetching

http://host:port/debug/pprof/heap?debug=1

you will find heavy timer object creation:

1604: 102656 [4251: 272064] @ 0x9cb15c 0x9cb451 0xa0b681 0x94dcd1
#   0x9cb15c    time.NewTimer+0x6c                              /usr/local/go/src/time/sleep.go:74
#   0x9cb451    time.After+0x21                                 /usr/local/go/src/time/sleep.go:110
#   0xa0b681    github.com/cockroachdb/cockroach/server.(*rpcTransport).processQueue+0x781  /cockroach/src/github.com/cockroachdb/cockroach/server/raft_transport.go:166

This timer creation only needs to happen once, outside the loop.
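
As a rough illustration, here is a minimal sketch of that idea (not the actual raft_transport.go code; the queue type, function name, and timeout are hypothetical): allocate one timer outside the loop and Reset it each iteration, instead of calling time.After, which allocates a fresh timer, on every pass. Note that this bare form still has the small race around Reset discussed below.

package main

import (
    "fmt"
    "time"
)

// processQueue handles requests until it has been idle for idleTimeout.
// The timer is created once and reused via Reset, instead of allocating
// a new timer with time.After on every loop iteration.
func processQueue(queue <-chan string, idleTimeout time.Duration) {
    idleTimer := time.NewTimer(idleTimeout)
    defer idleTimer.Stop()
    for {
        idleTimer.Reset(idleTimeout)
        select {
        case req := <-queue:
            fmt.Println("handling", req)
        case <-idleTimer.C:
            return // idle: let the goroutine shut down
        }
    }
}

func main() {
    queue := make(chan string, 1)
    queue <- "raft message"
    processQueue(queue, 50*time.Millisecond)
}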


for {
  raftIdleTimer.Reset(raftIdleTimeout)
Collaborator

I'm not sure if this is functionally equivalent. The old code would restart the timer on every iteration. But it looks like this new pattern introduces a small race where the timer can fire just as a request is received. The timer internally sends to raftIdleTimer.C and on the next iteration we call Reset, but that doesn't clear the channel. I'm not familiar enough with this code to know if that race is innocuous.

Contributor

Good catch; that does seem like an issue. Now that I look closely, there are several other problems with the shutdown-on-idle behavior. A new request could be written to the channel after the decision has been made to shut down the goroutine, and even without any races the goroutine could stop while there are still outstanding requests that will try to write to the done channel.

We have several instances of this pattern where we want a goroutine that goes away when it is idle (here, gossip.infoStore.runCallbacks, and my upcoming work on parallelizing raft execution). I'm going to try to refactor the infoStore pattern into something reusable.
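
To make the first hazard concrete, here is a minimal sketch with hypothetical names (a transport struct and a workerRunning flag; it assumes the sync and time packages and is not the actual CockroachDB code). The sender's check-and-start is not atomic with the worker's decision to exit on idle, so a request can be queued just after the worker has decided to stop:

// transport is a hypothetical stand-in for the real structure.
type transport struct {
    mu            sync.Mutex
    workerRunning bool
    queue         chan string
}

func (t *transport) send(req string) {
    t.mu.Lock()
    if !t.workerRunning {
        t.workerRunning = true
        go t.runWorker()
    }
    t.mu.Unlock()
    // Race window: the worker may hit its idle timeout and return between
    // the check above and this send, leaving req in the buffer with no
    // consumer (or blocking the sender if the channel is unbuffered).
    t.queue <- req
}

// runWorker processes requests and exits once it has been idle.
func (t *transport) runWorker() {
    idleTimer := time.NewTimer(time.Second)
    defer idleTimer.Stop()
    for {
        idleTimer.Reset(time.Second)
        select {
        case req := <-t.queue:
            _ = req // handle the request, then report on a done channel
        case <-idleTimer.C:
            t.mu.Lock()
            t.workerRunning = false
            t.mu.Unlock()
            return // shutdown-on-idle: a request sent after this point is stranded
        }
    }
}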

Contributor

Reset returns false if the timer had expired (in other words, it returns false if there's a value parked on the channel). Seems easy enough to mitigate:

if !raftIdleTimer.Reset(raftIdleTimeout) {
  <-raftIdleTimer.C
}
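
Folding that drain into the earlier sketch might look like this (still hypothetical names; only the time package is needed). If Reset reports that the timer had already expired, the stale value is consumed before the select, so it cannot trigger a premature idle shutdown on the next iteration:

func processQueueDrained(queue <-chan string, idleTimeout time.Duration) {
    idleTimer := time.NewTimer(idleTimeout)
    defer idleTimer.Stop()
    for {
        if !idleTimer.Reset(idleTimeout) {
            // The timer fired while a request was being handled; its value is
            // still parked in idleTimer.C, because taking the <-idleTimer.C
            // case below would have returned from the function.
            <-idleTimer.C
        }
        select {
        case req := <-queue:
            _ = req // handle the request
        case <-idleTimer.C:
            return // idle: let the goroutine shut down
        }
    }
}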

bdarnell (Contributor) commented Feb 6, 2016

LGTM. While there are some possible races, I don't think this change makes things significantly worse than before, so we might as well make the change.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 13, 2016
There are currently 8 places in CockroachDB non-test code
that create a `time.Timer` using `time.NewTimer` during
every iteration of a loop. cockroachdb#4175 proposed a fix for the worst
instance of this issue, within `*rpcTransport.processQueue`,
which resulted in upwards of **400,000** timers in use on a
single node at a given time. The second biggest offender was
`kv.send`, which resulted in about **30,000** timers in use on a
single node at a given time. I diagnosed that, together, these
two issues were responsible for the memory leak seen in
cockroachdb#4346. After making these fixes, the issue appears to be
gone, as memory no longer grows without bound.

I've gone ahead and fixed all 8 occurrences of this anti-pattern
across our codebase, using the strategy @tamird brought up in
[this comment](https://github.com/cockroachdb/cockroach/pull/4175/files#r52558817)
to avoid a race condition with the timers between iterations.
A few of the changes might be a little over-aggressive, as the
loops are not as "tight" as the ones causing issues, but I still
think it's important to make this change now and avoid these
issues in the future.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 13, 2016
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 14, 2016
es-chow (Contributor, Author) commented Feb 14, 2016

Closing, as #4367 will fix the same issue.

es-chow closed this Feb 14, 2016
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 15, 2016