New Relic strikes again! Kudos to New Relic’s RPM for helping us isolate an issue in production.
We noticed an increase in latency (from ~12 ms to ~25 ms) over the past week, and did a deployment this morning (marked by the vertical line) to see if it would fix the issue.
It didn’t. Some additional troubleshooting traced the slowness to calls that enqueue Resque jobs in Redis. It turns out Redis had recently been reconfigured with `save 600 100000`, which triggers a snapshot to disk whenever at least 100,000 write operations happen within a 600-second window. At the volume we push through Resque that threshold is always met, so Redis was effectively persisting to disk every 10 minutes. This is completely the wrong configuration for the level at which we use Resque: our environment handles over 50 million queue jobs a day, and in the 10 minutes it takes to write this blog post it will complete roughly 350,000 of them.
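For reference, the relevant snapshot directive in redis.conf looked like this (a sketch showing just that line, with the semantics as comments):

```
# redis.conf — RDB snapshot trigger
# save <seconds> <changes>: dump the dataset to disk if at least
# <changes> write operations occurred within <seconds> seconds.
save 600 100000
```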
That’s quite a lot of throughput, and it makes no sense to be snapshotting all of these transient queue writes to disk. This led me to make a change to Redis’ configuration, which itself caused additional high load on the systems — reflected by the large spike after the deployment — and that confirmed my belief that Redis persistence was behind the increased latency of enqueuing items via Resque. The fix? Good question. There are multiple ways to handle this scenario: turn off persisting to disk entirely, raise the `save` thresholds so snapshots happen less often, reduce the I/O latency of writing to disk (faster disks), and many others.
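In redis.conf terms, the first two options look roughly like this (the hourly values are illustrative, not our exact settings):

```
# Option 1: disable RDB snapshots entirely
save ""

# Option 2: snapshot less often, e.g. at most once an hour
save 3600 100000
```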
For now, I changed persistence to snapshot once an hour, and as you can see the change was reflected immediately. Going forward, we are planning to implement a non-blocking local buffer (another level of queueing) in front of Resque, so the API can respond to clients before the Redis write happens — reducing latency and giving an overall better user experience.
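The local-buffer idea can be sketched as a small in-process queue drained by a background thread. This is a minimal illustration, not our implementation: the class name and `sink` callback are hypothetical, and in practice the sink would call `Resque.enqueue`.

```ruby
require "thread"

# Hypothetical sketch of a local buffer in front of Resque.
# "Non-blocking" here means the request path never waits on Redis;
# SizedQueue#push can still block if the buffer itself fills up.
class LocalJobBuffer
  def initialize(sink:, max_size: 10_000)
    @queue   = SizedQueue.new(max_size)
    @sink    = sink  # e.g. ->(job) { Resque.enqueue(SomeWorker, job) }
    @drainer = Thread.new { drain }
  end

  # Called on the API request path: an in-memory push, no Redis I/O,
  # so Redis snapshot pauses no longer add latency to client responses.
  def enqueue(job)
    @queue.push(job)
  end

  # Flush any remaining jobs and stop the drain thread.
  def shutdown
    @queue.push(:stop)
    @drainer.join
  end

  private

  # Background thread: pops jobs off the local queue in order and
  # hands them to the sink outside the request/response cycle.
  def drain
    loop do
      job = @queue.pop
      break if job == :stop
      @sink.call(job)
    end
  end
end
```

The trade-off of this design is durability: jobs sitting in the local buffer are lost if the process dies before they reach Redis, which is acceptable for us in exchange for keeping client-facing latency flat.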
Onward we scale!
If you haven’t yet played around with New Relic, I believe they have a free version that could be of great use to you — we at OMGPOP can’t recommend it enough.