To deliver a highly available application, it’s standard practice to set up alarms that get triggered if something goes wrong. Good alarms are actionable; otherwise, important issues may be masked by seemingly unimportant alerts and get swept under the rug.

We tested our upstream servers to figure out why they were responding so slowly, and we found they weren’t actually responding slowly at all.

Establishing a connection is slow and expensive; to avoid opening a new connection for every request, our ELB and upstream servers were using the same connection to handle several requests (known as keep-alive connections).
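To make the keep-alive idea concrete, here is a minimal sketch using only Python's standard library. It is not our ELB setup: the local test server, `CountingHandler`, and `connections_used` are hypothetical names introduced for illustration. The server counts how many TCP connections it accepts while a client sends several requests, either over one reused connection or over a fresh connection per request.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 connections are persistent (keep-alive) by default.
    protocol_version = "HTTP/1.1"

    def setup(self):
        super().setup()
        # setup() runs once per accepted TCP connection, not once per request.
        self.server.connections_opened += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        # Content-Length lets the client know where the response ends,
        # which is what allows the connection to be reused afterwards.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def connections_used(num_requests, keep_alive):
    """Send num_requests GETs and return how many TCP connections the server saw."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    server.connections_opened = 0
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        conn = http.client.HTTPConnection(host, port)
        for _ in range(num_requests):
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body before reusing the socket
            if not keep_alive:
                # Simulate "one connection per request" by reconnecting each time.
                conn.close()
                conn = http.client.HTTPConnection(host, port)
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return server.connections_opened
```

Calling `connections_used(3, keep_alive=True)` sends three requests over a single connection, while `connections_used(3, keep_alive=False)` pays the connection setup cost three times.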

Once an alarm is resolved, there are some metrics we consider when evaluating the significance of an alert. If the alarm was triggered despite no disruption to any of those factors, then you may have a noisy alarm, and in our case, we did.
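The noisy-alarm check can be sketched as a small function. This is only an illustration of the rule, not our actual tooling: `is_noisy_alarm` is a hypothetical helper, and the factor names in the usage line are placeholders rather than the real metrics.

```python
from typing import Mapping


def is_noisy_alarm(alarm_fired: bool, disruptions: Mapping[str, bool]) -> bool:
    """Return True when the alarm fired but none of the factors was disrupted.

    `disruptions` maps a factor name (placeholder names, chosen by the caller)
    to whether that factor was actually disrupted during the incident.
    """
    return alarm_fired and not any(disruptions.values())


# Usage with placeholder factor names (assumptions, not real metrics).
noisy = is_noisy_alarm(True, {"factor_a": False, "factor_b": False})
```

An alarm that fires while at least one factor is disrupted is treated as significant; one that fires with nothing disrupted is a candidate for tuning or removal.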
