Performance Monitoring Tools

The New Standards of Monitoring: Lessons Learned from Our Production Systems

There is always a set of standard metrics that are universally monitored (Disk Usage, Memory Usage, Load, Pings, etc.). Beyond that, there are a lot of lessons that I’ve learned from operating production systems that have influenced the breadth of monitoring that we perform day in and day out.

Below are a few monitoring checks that have been added to our regular check list—all of which claim their spot due to lessons learned (lessons we would’ve liked to avoid in the first place!).

Process Creation Rate (Fork Rate)

We once had a problem where IPv6 was intentionally disabled on a box. This caused a significant and unexpected issue for us: each time a new network connection was created, modprobe would spawn a new process to evaluate IPv6 status. This rapid process creation slowed our servers much as a fork bomb would. We eventually tracked it down by noticing that the "processes" counter in /proc/stat was increasing by several hundred per second. Normally you would expect a fork rate of only 1-10/sec on a production server with steady traffic.
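As a rough illustration of that check, here is a minimal Python sketch that derives a fork rate from two samples of /proc/stat. The function names and the sampling interval are assumptions made for this example, not the tooling we actually ran.

```python
def read_process_count(stat_text: str) -> int:
    """Return the cumulative fork count from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The "processes" line holds the number of forks since boot.
        if len(fields) >= 2 and fields[0] == "processes":
            return int(fields[1])
    raise ValueError("no 'processes' line found")


def fork_rate(first_count: int, second_count: int, interval_seconds: float) -> float:
    """Return forks per second between two samples taken interval_seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (second_count - first_count) / interval_seconds
```

In practice you would read /proc/stat twice, a known number of seconds apart, and alert when the resulting rate is far above the server's usual baseline.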

Flow Control Packets – Controlling Transmission

TL;DR: If your network configuration honors flow control packets and isn’t configured to disable them, they can temporarily cause dropped traffic (if this doesn’t sound like an outage, then I don’t know what does).

$ /usr/sbin/ethtool -S eth0 | grep flow_control
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0

Note: These flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NICs. You should also trend these metrics on your switch gear. While you’re at it, watch your dropped frames.
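To trend these counters rather than eyeball them, one option is to parse the ethtool output and compare it with the previous sample. The sketch below assumes the "name: value" format shown above. Counter names vary by NIC driver, and both function names are hypothetical.

```python
from typing import Dict


def parse_flow_control(ethtool_output: str) -> Dict[str, int]:
    """Extract flow control counters from `ethtool -S` style output."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, separator, value = line.partition(":")
        name = name.strip()
        # Keep only "name: value" lines whose name mentions flow control.
        if separator and "flow_control" in name:
            counters[name] = int(value.strip())
    return counters


def counters_that_increased(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Return the delta for every counter that grew since the previous sample."""
    return {
        name: current[name] - previous.get(name, 0)
        for name in current
        if current[name] > previous.get(name, 0)
    }
```

Any non-empty result from counters_that_increased means flow control frames were sent or received between the two samples, which is worth at least a graph and possibly an alert.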

Swap In/Out Rate: Boosting Memory Efficiency

It’s common to check for swap usage (extra space on your hard drive reserved to supplement your memory) above a threshold. But even if only a small amount of memory is swapped, it’s the rate at which pages are swapped in and out that impacts performance, not the quantity. Monitor that rate directly.
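On Linux, the cumulative pswpin and pswpout counters in /proc/vmstat count pages swapped in and out since boot, so a rate can be derived from two samples. The following is a minimal sketch under that assumption. The function names are hypothetical, and any alert threshold is left to you.

```python
from typing import Dict


def read_swap_counters(vmstat_text: str) -> Dict[str, int]:
    """Return the pswpin/pswpout page counters from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("pswpin", "pswpout"):
            counters[fields[0]] = int(fields[1])
    return counters


def swap_rates(before: Dict[str, int], after: Dict[str, int], interval_seconds: float) -> Dict[str, float]:
    """Return pages swapped in and out per second between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in ("pswpin", "pswpout")
    }
```

A server with some swap in use but both rates at zero is usually fine. Sustained non-zero rates are the state worth alerting on.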

Server Boot Notification

Unexpected reboots are part of life. Do you know when they happen on your servers? Most people don’t. We use a simple init.d script that triggers an email on system boot. This is valuable to communicate provisioning of new servers, and helps capture state change even if services handle the failure gracefully without alerting.
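Our script is not reproduced here, but the idea can be sketched in a few lines of Python. Everything below is an assumption for illustration: the function name, the subject format, and the addresses are hypothetical, and delivery (for example through smtplib and a local mail relay) is left out.

```python
from email.message import EmailMessage


def build_boot_notification(hostname: str, boot_time: str, sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) an email announcing that a host has booted."""
    message = EmailMessage()
    message["Subject"] = f"[boot] {hostname} started at {boot_time}"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        f"Host {hostname} finished booting at {boot_time}.\n"
        "If this reboot was not planned, check the host's logs.\n"
    )
    return message
```

A boot-time hook would call this with the local hostname and the current time, then hand the message to whatever mail transport the server already uses.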

NTP Clock Offset

If you don’t monitor it, then yes, one of your servers’ clocks is probably off. If you’ve never thought about clock skew, you might not even be running ntpd on your servers. Generally, there are three things to check for: 1) that ntpd is running; 2) clock skew inside your datacenter; and 3) clock skew from your master time servers to an external source.

We use check_ntp_time for this check.
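For readers curious about what an offset check measures, the sketch below applies the standard NTP clock offset formula to the four timestamps of one request/response exchange and maps the result to a status. It illustrates the arithmetic only and is not how check_ntp_time is implemented. The threshold values are hypothetical.

```python
def ntp_offset(client_send: float, server_receive: float,
               server_send: float, client_receive: float) -> float:
    """Estimate the local clock offset in seconds from one NTP exchange.

    Uses the standard formula ((T2 - T1) + (T3 - T4)) / 2, which assumes
    the network delay is the same in both directions.
    """
    return ((server_receive - client_send) + (server_send - client_receive)) / 2


def classify_offset(offset_seconds: float, warn: float = 0.5, crit: float = 1.0) -> str:
    """Map an offset to a monitoring status. Thresholds are example values."""
    magnitude = abs(offset_seconds)
    if magnitude >= crit:
        return "CRITICAL"
    if magnitude >= warn:
        return "WARNING"
    return "OK"
```

A positive offset means the server's clock is ahead of the local clock. The same classification applies whether the reference is a peer in your datacenter or an external source.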

DNS Resolutions

Internal DNS: it’s a hidden part of your infrastructure that you rely on more than you realize. The things to check for are 1) local resolutions from each server; 2) external resolution and quantity of queries (if you have DNS servers in your datacenter); and 3) the availability of each upstream DNS resolver you use.

External DNS: it’s good to verify that your external domains resolve correctly against each of your published external nameservers. We also rely on several ccTLDs and we monitor those authoritative servers directly as well (yes, it’s happened that all authoritative nameservers for a TLD have been offline).
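The simplest of these checks, local resolution from each server, can be sketched with the Python standard library alone. The function name is hypothetical, and the resolver is passed in as a parameter so the check can be exercised without a network. Querying one specific nameservers directly, as the external checks require, needs a DNS library or a tool such as dig and is not shown.

```python
import socket
from typing import Callable, Dict, Iterable


def check_resolution(hostnames: Iterable[str],
                     resolve: Callable = socket.getaddrinfo) -> Dict[str, str]:
    """Return a mapping of hostname to error text for names that fail to resolve."""
    failures = {}
    for hostname in hostnames:
        try:
            # By default this uses the system resolver configured on this server.
            resolve(hostname, None)
        except OSError as error:  # socket.gaierror is a subclass of OSError
            failures[hostname] = str(error)
    return failures
```

An empty result means every name resolved. Running the same list of names from every server catches the case where a single host has a broken resolver configuration.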

SSL Certificate Expiration

It’s the thing everyone forgets about because it happens so infrequently. An expired SSL certificate can unexpectedly make a secure website unavailable.

The fix is easy: check SSL expiration dates and alert with enough lead time to renew your certificates.
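As a sketch of the date arithmetic involved, the Python standard library can convert a certificate's notAfter string (the format returned by SSLSocket.getpeercert()) into a number of days remaining. The function name and the idea of passing in the current time are assumptions for this example. Fetching the certificate itself is left out.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now_epoch: Optional[float] = None) -> float:
    """Return days remaining before a certificate's notAfter time.

    not_after uses the format found in getpeercert(), for example
    "Jan  5 09:34:43 2018 GMT". A negative result means it has expired.
    """
    if now_epoch is None:
        now_epoch = time.time()
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now_epoch) / 86400
```

An alert threshold of a few weeks usually leaves enough time to renew, but the right value depends on how long your renewal process takes.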

DELL OpenManage Server Administrator (OMSA)

We run with a split across two data centers: the first is a managed environment with DELL hardware, and the second is Amazon EC2. The key is to proactively monitor data centers and ensure procedures are in place for regular checks. For our DELL hardware, it’s important for us to monitor the outputs from OMSA. This alerts us to RAID status, failed disks (predictive or hard failures), RAM issues, power supply states, and more.

Connection Limits: Managing Database and Memory

You probably run things like memcached (for in-memory caching) and MySQL (for database storage) but you may not have realized these have default connection limits. Do you monitor how close you are to those limits as you scale out application tiers?
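One way to answer that question is to compare the current connection count against the configured limit and warn before the two meet. The sketch below reads curr_connections from the text of a memcached "stats" response. For MySQL, the equivalent inputs would be the Threads_connected status value and the max_connections variable. The function names and the 80% warning ratio are hypothetical.

```python
from typing import Tuple


def memcached_current_connections(stats_text: str) -> int:
    """Return curr_connections from the text of a memcached `stats` response."""
    for line in stats_text.splitlines():
        fields = line.split()
        # Response lines look like: STAT <name> <value>
        if len(fields) == 3 and fields[0] == "STAT" and fields[1] == "curr_connections":
            return int(fields[2])
    raise ValueError("curr_connections not found")


def connection_usage(current: int, limit: int, warn_ratio: float = 0.8) -> Tuple[float, bool]:
    """Return (fraction of the limit in use, whether it crosses warn_ratio)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = current / limit
    return ratio, ratio >= warn_ratio
```

Graphing the ratio over time also shows how much headroom each new application server consumes, which helps when planning the next tier expansion.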

Load Balancer Status

We configure our Load Balancers with a health check, which we can easily force to fail in order to have any given server removed from rotation. We’ve found it important to have visibility into the health check state, so we monitor and alert based on the same health check (if you use EC2 Load Balancers, you can monitor the ELB state from Amazon APIs).

This only scratches the surface of how to keep your company’s environment stable. Keep monitoring; consistency is the name of the game!



Jehiah Czebotar is VP of Engineering at Bitly

