Reliability and Uptime

November 24, 2022

Have you ever seen a tagline claiming that a piece of software has “Five Nines” (99.999%) of uptime? Have you ever thought about what that means?

Downtime

What does it mean for a system to be down? More specifically, what does it mean for your system to be down?

For the purpose of easy comparison, I’ve put all the figures in dollars and cents - this works equally well with whichever currency you prefer.

Let’s imagine that you run a well-performing site, and every minute that the site is down costs you $50 in lost sales.

Incidentally, if you don’t have this figure, then you need it before making any decisions about downtime.

Right - now let’s look at Azure API Management. It offers a 99.99% uptime guarantee on the Premium tier only, and 99.95% on Basic.

The following pricing chart was taken from here:

That’s $0.21 / hour for Basic, and $3.83 / hour for Premium.

To be clear, there may be other reasons to choose premium; I’m simply talking about guaranteed uptime here. Additionally, please don’t take this as an accurate cost estimation for API Management, or anything in Azure - the cost comparison is simply to make a point.

An average month has roughly 730 hours, so:

	Basic	Premium
Monthly cost	$153.30	$2795.90
Guaranteed uptime	99.95%	99.99%
Possible downtime	~ 22m	~ 5m

What we’re saying here is that you’re paying an extra $2642.60 each month for an additional 17m of guaranteed uptime.

Your site makes $50 every minute, so across those 17 minutes of potential downtime you would have lost $850 in sales. That is well short of the extra $2642.60 that Premium costs.

I appreciate there may be other factors than cost: reputation, customer retention, etc - again - this is just a cost comparison.
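If you want to run the same comparison with your own numbers, the arithmetic above can be sketched in a few lines of Python. This is only a rough sketch: the `Tier` class and the `extra_uptime_value` function are names I've made up for illustration, and the rates are the example figures from this post, not real pricing. Because nothing is rounded along the way, the results come out slightly higher than the rounded figures in the text (about 17.5 minutes and $876, rather than 17 minutes and $850).

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average month, as used in the post


@dataclass
class Tier:
    """Hypothetical helper describing one pricing tier."""
    name: str
    hourly_rate: float      # dollars per hour
    uptime_percent: float   # guaranteed uptime, e.g. 99.95

    def monthly_cost(self) -> float:
        return self.hourly_rate * HOURS_PER_MONTH

    def downtime_minutes(self) -> float:
        # Minutes per month that the guarantee still allows the service to be down.
        return (1 - self.uptime_percent / 100) * HOURS_PER_MONTH * 60


def extra_uptime_value(cheap: Tier, pricey: Tier, loss_per_minute: float):
    """Return (extra monthly cost, extra guaranteed minutes, sales those minutes protect)."""
    extra_cost = pricey.monthly_cost() - cheap.monthly_cost()
    minutes_saved = cheap.downtime_minutes() - pricey.downtime_minutes()
    return extra_cost, minutes_saved, minutes_saved * loss_per_minute


# Example figures from the post (illustrative only, not a real quote).
basic = Tier("Basic", 0.21, 99.95)
premium = Tier("Premium", 3.83, 99.99)

extra_cost, minutes_saved, sales_protected = extra_uptime_value(basic, premium, 50.0)
print(f"Extra cost per month:   ${extra_cost:.2f}")
print(f"Extra guaranteed time:  {minutes_saved:.1f} minutes")
print(f"Sales protected:        ${sales_protected:.2f}")
```

If the last figure is smaller than the first, then on cost alone the cheaper tier wins.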

Mitigating Downtime Yourself

Okay, we’ve discussed how much a managed service such as Azure can cost you. Remember that similar figures occur across services and across cloud providers - keeping a system running with no downtime at all is expensive.

Let’s imagine that you want to maintain 99.99% uptime yourself. Firstly, you can’t have any direct dependencies: any API call that’s part of your synchronous flow must be removed, because you can’t guarantee the uptime of the dependency. Even if it also offers 99.99%, you don’t know which 5 minutes of the month it will be down for, and they’re unlikely to be the same 5 minutes as yours.
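To put a rough number on that, here is a small sketch of how a synchronous dependency eats into your own figure. It assumes that failures are independent, so the availabilities simply multiply; that is a simplification, and `serial_availability` and `downtime_minutes` are hypothetical helper names, not part of any library.

```python
import math

MINUTES_PER_MONTH = 730 * 60  # average month, as used in the post


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every link must be up (assumes independent failures)."""
    return math.prod(availabilities)


def downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per month at the given availability."""
    return (1 - availability) * MINUTES_PER_MONTH


own_service = 0.9999   # your own 99.99% target
dependency = 0.9999    # a synchronous API call that also offers 99.99%

combined = serial_availability(own_service, dependency)
print(f"Combined availability: {combined:.4%}")
print(f"Downtime per month:    {downtime_minutes(combined):.1f} minutes")
```

Under that assumption, a single 99.99% dependency takes you from 99.99% down to roughly 99.98%, which is nearly double the downtime you were aiming for.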

Next, you need to at least double your hardware provision: every piece of hardware that you run needs a failover, and that failover needs to happen almost instantaneously, since you only have 5 minutes for the entire month.
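As a rough illustration of why the hardware gets doubled, the sketch below works out the availability of a set of redundant machines. It leans on some generous assumptions: the machines fail independently, failover is instant and never fails itself, and the 99% figure for a single machine is invented purely for the example. `redundant_availability` is a hypothetical helper name.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability when only one of `copies` identical machines needs to be up.

    Assumes independent failures and instant, perfect failover.
    """
    return 1 - (1 - single) ** copies


single_machine = 0.99  # made-up figure for one machine on its own

for copies in (1, 2, 3):
    availability = redundant_availability(single_machine, copies)
    print(f"{copies} machine(s): {availability:.6%}")

pair = redundant_availability(single_machine, 2)
```

With those assumptions, two 99% machines get you to about 99.99% on paper. Any time spent actually failing over comes straight out of the same 5 minute budget, so the real figure will be lower.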

If there are any issues, and you spot them as they happen, and you employ Superman, you may get them sorted within the 5 minutes - but that’s your entire budget for the month.

Summary

I want to be clear here - I’m not saying that you shouldn’t consider paying for reliability, or provisioning for downtime, and I’m not saying that keeping your site running isn’t important. It’s just worth running some quick figures to work out, financially, exactly what that reliability is costing you.