Description
I've been encountering a problem with our nginx (reverse proxy) servers: they have been crashing. They stop responding to requests completely and don't seem to come out of this state by themselves. These servers handle a fairly high volume of traffic (over 5 million requests a day).
For the past few days I've been at a bit of a loss, restarting the Docker instance manually whenever monitoring alerted me, but I decided to put a helper cron script in place that checks whether nginx is still responding and restarts it via supervisord if it isn't.
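For reference, the watchdog is roughly the following. This is a minimal sketch rather than the exact script: the health-check URL, log path, and the supervisord program name `nginx` are assumptions.

```shell
#!/bin/sh
# Watchdog sketch: cron runs this every minute. If nginx stops answering,
# restart just the nginx program via supervisord so the container (and its
# logs) stay up for post-mortem debugging.

# Return 0 if nginx answers within 5 seconds, non-zero otherwise.
# --max-time bounds the check so a hung worker can't hang the cron job too.
check_nginx() {
    curl --silent --fail --max-time 5 --output /dev/null "http://127.0.0.1/"
}

# Restart only the supervisord-managed nginx, not the whole container.
restart_nginx() {
    echo "$(date): nginx unresponsive, restarting via supervisord"
    supervisorctl restart nginx
}

watchdog() {
    if ! check_nginx; then
        restart_nginx
    fi
}

# Example crontab entry (path is hypothetical):
# * * * * * /usr/local/bin/nginx-watchdog.sh
watchdog
```

Restarting nginx inside the container, rather than bouncing the container itself, is what finally let the error log below survive.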
Because I was initially restarting the whole Docker container, I wasn't really getting any debugging information -- the logging would just stop. After changing this to restart nginx inside the container instead, I have the following in the logs:
```
2017/02/01 01:14:16 [alert] 489#0: worker process 501 exited on signal 9
2017/02/01 01:14:16 [alert] 489#0: shared memory zone "auto_ssl" was locked by 501
2017/02/01 01:14:16 [alert] 489#0: worker process 502 exited on signal 9
```
I had a look around on Google and the only reference I can find is 18F/api.data.gov#325 -- however, it looks like expirations were put in place there, and that doesn't seem to be working on our setup, as we (due to bad monitoring) recently ended up with about 7 hours of downtime.
I should mention that I cannot reproduce this bug at all locally, even using the same Docker container.
I'm at a bit of a loss. Our automatic restart script has worked around the issue for now, but it would be nice to hear if anyone has ideas. I'd be happy to turn on extra logging and to try the debug log (I've been a bit hesitant to enable it on our production servers).
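For what it's worth, if enabling the debug log in production turns out to be necessary, nginx can scope it to a single client so it doesn't flood the logs on a server at this volume. This is a generic nginx config sketch (the log path and the client IP are placeholders), and it requires an nginx binary built with `--with-debug` (check with `nginx -V 2>&1 | grep -- --with-debug`):

```nginx
error_log /var/log/nginx/error-debug.log debug;

events {
    # Emit debug-level logging only for connections from this address;
    # everything else stays at the normal log level.
    debug_connection 192.0.2.1;
}
```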