Technology vendors and providers often throw around the term "five nines," but what does it really mean?
First of all, five nines does NOT refer to reliability. It refers to availability. Availability is the probability that a device or service will be working when you go to use it.
Availability is composed of two factors: Mean Time Between Failures (MTBF), or uptime, and Mean Time To Repair (MTTR), or downtime. MTBF is the measure of reliability -- how failure-prone the technology is. Both MTBF and MTTR are commonly measured in hours.
Calculating availability for unified communications
Availability is described by the following equation:
Availability = [MTBF ÷ (MTBF + MTTR)] × 100 = 9X.XXX%
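This formula is easy to compute directly. A minimal Python sketch (the MTBF and MTTR values below are illustrative, not from the article):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a percentage, given MTBF and MTTR in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

# Example: a device that fails once a year (MTBF = 8,760 hours)
# and takes 4 hours to restore delivers roughly three nines.
print(f"{availability(8760, 4):.3f}%")  # → 99.954%
```

Note how sensitive the result is to MTTR: halving the restoration time to 2 hours lifts the same device to 99.977%.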
The R in MTTR stands for repair, but that's not the measurement you should use. The R should refer to the total time to restore the product or service to full operating condition. The restoration number needs to include the time for:
- Failure detection
- Failure notification
- Vendor/provider response
- Repair or replacement
The availability metric does not tell everything you need to know. It doesn't tell you about the severity of an outage or the operational characteristics. Your system could suffer one huge outage or many short outages and still deliver 99+% availability. But the metric is still useful.
The following table translates 99.x% availability into operational terms. As you can see, the total downtime for five nines availability over 24 hours × 365-1/4 days is only five minutes and 15 seconds. This is a hard figure to deliver.
Translating five nines availability into time
| Availability | Downtime in one year |
|--------------|----------------------|
| 99.999% | 5 minutes, 15 seconds |
| 99.99% | 52 minutes, 36 seconds |
| 99.95% | 4 hours, 23 minutes |
| 99.9% | 8 hours, 46 minutes |
| 99.5% | 1 day, 19 hours, 50 minutes |
| 99% | 3 days, 15 hours, 40 minutes |
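The table above can be reproduced with a short Python sketch, assuming the 365-1/4 day (8,766-hour) year used in the text:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 365-1/4 day year

def downtime_per_year(availability_pct: float) -> float:
    """Seconds of downtime per year at a given availability percentage."""
    return (1 - availability_pct / 100) * SECONDS_PER_YEAR

for pct in (99.999, 99.99, 99.95, 99.9, 99.5, 99.0):
    hours, rem = divmod(round(downtime_per_year(pct)), 3600)
    minutes, seconds = divmod(rem, 60)
    print(f"{pct}%: {hours}h {minutes}m {seconds}s")
```

Depending on whether a 365-day or 365-1/4 day year is used, the results can differ from published tables by a second or two per row.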
Applying MTBF and MTTR to UC hardware and software
So what does the availability figure include? The MTBF and MTTR are almost always related to hardware. In a unified communications environment this includes servers, gateways, switches, routers, power supplies and endpoints, such as PCs and IP phones. It is true that most hardware components are highly available and could meet the 99.999% figure.
Availability figures provided by vendors are rarely based on field experience. The MTBF figure is usually a calculated prediction using the Telcordia parts-count method originally developed by Bell Labs for telecommunications systems. It takes two years of operating in the field, without changes, to prove an MTBF figure -- and it is unusual for a system to remain unchanged for two years. Every time the hardware changes, a new prediction must be calculated. So the availability figures are also predictions.
What is not included in the MTBF calculation is very revealing. The vendors do not include:
- Software failures
- Loss of electrical power
- Network loss (LAN and WAN)
- Time to install software changes, bug fixes and upgrades
- Preventive maintenance
- Server shutdown for operating system changes, bug fixes and upgrades
- Software reboot time
There is no formula for predicting the reliability of unified communications software -- or any software. With today's dependence on software, the products and services offered are no better than the software installed. The real reliability figure should be based on the software reliability, which cannot be predicted. Only field experience can be used to determine the software MTBF. Furthermore, unified communications software periodically changes, which does not help to stabilize the reliability or MTBF.
Is five nines availability a worthy pursuit?
Consider an operating unified communications network. The following example covers one year of operation with some modest assumptions about downtime. To be very conservative, this calculation assumes there are no hardware failures in the year.
- Software failure: one failure outage of one hour
- Loss of electrical power: assume there is UPS so there is no power loss in the year
- Network loss: one outage of one hour
- Time to install software changes, bug fixes and upgrades: multiple occasions when some device, probably the server, must be shut down, for a total of 12 hours, or one hour per month
- Preventive maintenance: one outage of one hour
- Server shutdown for operating system changes, bug fixes and upgrades: once per year for an outage of two hours
- Software reboot time: four times in one year of 15 minutes each for a total of one hour
This is a total of 17 hours of outage per year, which produces an availability of 99.8%. That's not bad, but it's not 99.999%. So this raises the question: Does an enterprise unified communications environment ever experience five nines availability? Not likely. But is five nines availability even worth pursuing?
Assume that your enterprise operates 12 hours per day, five days a week, for all 52 weeks in a year. This equates to only about 35.6% of the full year. If anything that fails is fixed outside of working hours, then 99.8% is very acceptable.
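Recomputing that operating fraction is a one-liner (again assuming a 365-1/4 day year):

```python
working_hours = 12 * 5 * 52   # 12-hour day, 5-day week, 52 weeks = 3,120 hours
year_hours = 365.25 * 24      # 8,766 hours in a 365-1/4 day year
print(f"{working_hours / year_hours:.1%} of the full year")  # → 35.6% of the full year
```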
Attaining five nines availability is very costly because you must have redundancy for nearly every hardware component, with near-instantaneous switchover from a failed component to an operating one. The software must also be very stable. Such an investment may not be necessary for an enterprise that is closed for 108 of the 168 hours in a week. If, however, the enterprise never closes, then the design must include some redundancy for the components most likely to fail.
The points to be made are:
- Five nines availability is probably impossible to attain when all the factors are included.
- Software is the Achilles heel of UC availability.
- The budget to attain five nines availability is out of the reach of most enterprises.
- Most enterprises are living with availability closer to 99.8%, and this is probably acceptable.
If you would like a more detailed discussion on this topic that includes the calculations for redundant configurations, email Delphifirstname.lastname@example.org and mention this article.
About the author: Gary Audin has more than 40 years of computer, communications and security experience. He has planned, designed, specified, implemented and operated data, LAN and telephone networks. These have included local area, national and international networks as well as VoIP and IP convergent networks in the U.S., Canada, Europe, Australia and Asia.
This was first published in November 2010