Why Redundancy is Supreme
The IT industry is full of acronyms, myths and legends. We in the industry seem hell-bent on reducing every name or action to a cluster of letters and numbers. We also appear to pay homage to claims and standards that have no real veracity when analysed with objectivity. The Nines, or the 9s, is such a ‘standard’.
The Nines is a way of expressing how available something is. It began life in the Telco sector of the industry as a way of indicating the availability of services.
Within the Telco world there is a standard named ‘Carrier Grade’. Carrier Grade was, and is, a definition of reliability and performance. The Nines, on the other hand, are used to specify or indicate up-time. There is an argument that supports the definition of the Nines as a ‘probability of failure’ and that availability is a product of reliability.
Reliability – a system is performing correctly
Availability – a system is immediately ready for use
Manufacturers of IT equipment used in the data centre apply an estimation of reliability to their products. These estimates are expressed as MTBF and MTTR.
MTBF = Mean Time Between Failures: an estimate of the average time between failures of a module
MTTR = Mean Time To Recovery: an estimate of the average time needed to recover, or repair, a module
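These two estimates combine into a steady-state availability figure. A minimal sketch in Python (the 100,000-hour MTBF and 4-hour MTTR figures are illustrative, not taken from any real product):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a module is
    expected to be working rather than failed or under repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical module: 100,000-hour MTBF, 4-hour MTTR.
print(f"{availability(100_000, 4):.6f}")  # prints 0.999960 (about four nines)
```

Note how the repair side matters: since unavailability is roughly MTTR/MTBF, halving the repair time halves the expected downtime.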
What we need to bear in mind is that these are estimates made by the manufacturer and are not some hard and fast scientific standard. Manufacturers use quality standards and test-lab practices to try and make sure that their products meet a high standard where economics permit. You get what you pay for.
The Nines look like this:
90% (1 nine) = downtime of 36.5 days/year
99% (2 nines) = downtime of 3.65 days/year
99.9% (3 nines) = downtime of 8.76 hours/year
99.99% (4 nines) = downtime of 52.56 minutes/year
99.999% (5 nines) = downtime of 5.26 minutes/year
99.9999% (6 nines) = downtime of 31.5 seconds/year
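The table above is just arithmetic on a 365-day year (525,600 minutes). A quick sketch to reproduce it:

```python
def downtime_per_year(availability_pct: float) -> float:
    """Minutes of downtime per 365-day year implied by an
    availability percentage (one of the Nines)."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for pct in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% -> {downtime_per_year(pct):,.2f} minutes/year")
```

Run it and the five-nines row comes out at about 5.26 minutes, the figure used later in this article.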
The Nines are units of percentage of time that something is available or, at least, that’s the theory.
Today, data centre and Telco networks need to be up and running 24 X 7 X 365. Customers and the market demand this level of availability. Manufacturers and integrators have to make sure that they have suitable levels of reliability designed into their products and services to be able to meet this demand.
We have striven in the preceding years to try to create resources that are always available. In doing this we had to have guides and measures, targets and boundaries to work to. The Nines have fitted that need quite nicely.
Interestingly, the arrival of Cloud has not increased the requirement for availability per se, and neither has it reinforced the pressure on manufacturers for reliability. Amazon EC2 has a Nines level of 99.95%. That is, the service has an SLA (Service Level Agreement) with its customers that guarantees access to the service at a level of 99.95%. Amazon has included some clever wording in its SLA: it “guarantees 99.95% availability of the service within a Region over a trailing 365 period”. This meant that when Amazon suffered a four-day outage it could claim that it still met its SLA, as the failures were isolated to particular technologies and not to the EC2 service itself. Not much use to those who lost the service, but it does show that the Nines are not writ in stone. It could be argued that Amazon is being disingenuous and ‘creepy’.
Companies involved in Cloud services and Cloud technologies are, more and more, using cheaper and lower-quality hardware. There are many people who argue that cheap is best and that the customer doesn’t care what kit is used in any case. So there is a drive to use large-scale processing (Grid springs to mind) on lower-cost equipment, with lots of redundancy to mitigate the cheapness. This is particularly the case when Public Cloud services are delivered. Private Cloud will demand much higher-quality equipment and levels of service.
While the Nines will continue to be used we should look a little deeper into just how they are used and why we should be careful with them when making claims about availability.
An Example of the Nines Confusion
Imagine the inside of a standard data centre. It will be populated with racks of equipment, mostly servers for our purpose. Each server is connected to the other servers by cables and switches. The whole data centre is a network of processing and storage that provides a service. The data centre provider will claim 99.999% availability. This implies that the data centre and its contents will be unavailable only for 5.26 minutes per 365 days.
Is this achievable and is it sensible to claim it is? The answer is no to both questions. It is impossible to keep a data centre running at 99.999%. What actually happens is that down-time is planned and systems are taken off-line in such a way as not to interfere with the service. So, what we know is that it isn’t the data centre or the equipment that is said to be available for 99.999% of the time over a period. It’s the service that it provides. While this may seem like common sense, you would be surprised how many people in the industry think and believe that availability is the concern of the data centre and its equipment.
Companies do make claims about data centre up-time. Usually they claim that they can provide 99.999% up-time. How can they make that claim? They do so by making sure that the failure of modules/components/units, which they know will happen, is mitigated by Redundancy. This must be the case, because to claim 5 X 9s for a data centre implies that each system will meet that availability and that each component in each system will also meet that level of availability. We know this is impossible because no manufacturer can claim, or has ever produced, a component or unit that lasts forever. There will be a failure in the chain. That failure is evened out by Redundancy.
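The reasoning is easy to check with the standard series/parallel availability formulas. A sketch (the five-nines component figure and the chain of ten are illustrative):

```python
from math import prod

def series_availability(parts):
    """A chain of components that must all be up: availabilities multiply."""
    return prod(parts)

def parallel_availability(a, n):
    """n independent redundant copies: the set is down only if all n fail."""
    return 1 - (1 - a) ** n

# Ten five-nines components in series no longer reach five nines:
chain = series_availability([0.99999] * 10)
print(round(chain, 6))  # 0.9999 -- down to four nines
# Duplicating the whole chain (redundancy) more than restores it:
print(round(parallel_availability(chain, 2), 9))  # 0.99999999 -- about eight nines
```

The caveat is that the parallel formula assumes the redundant copies fail independently; the overheating example later in this article shows what happens when they do not.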
To make sense of why redundancy is supreme we need to understand resilience. The reason is that resilience is often stated as the key value of an infrastructure design. This is used to impress a customer by implying that resilience will protect the customer’s service. It won’t. Resilience isn’t about protecting the customer’s service. It’s about how the infrastructure fails.
Let me explain my last statement before you stop reading. The definition of resilience is:
Oxford English Dictionary definition #2: the capacity to recover quickly from difficulties; toughness. Definition #1 isn’t relevant to IT systems and services.
If we imagine a server that is running under load at 90% all the time, eventually a component will fail due to the heat being generated internally. Let’s assume the hard disks hosting the operating system or hypervisor fail. The definition of resilience comes into play. The manufacturer should have provided a guide to the level of utilisation, or to the kind of performance and lifecycle that can be expected from that hard disk under certain conditions. There will be some indication of the maximum temperature that the disk can tolerate over a period of time. This parameter is the nature of the resilience of the component. So, from this example we can say that resilience has a limit, which means we will lose that disk to overheating.
Yes, but there are two hard disks in the RAID1/mirror that hosts the operating system or hypervisor, and when one fails the other takes the load. The problem in our example is that both disks are subject to the same overheating and both have an identical resilience parameter for an overheated running environment. The logical outcome is that the server goes down, as it can no longer provide the service without the operating system or hypervisor. In the case of a hypervisor loss, all the virtual machines hosted on that hypervisor could be lost, or at least need to be restarted on some other hardware if that is possible.
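The two-disk example can be put in numbers. With independent failures, mirroring squares an already small failure probability; with a shared cause such as overheating, the common term dominates and the mirror barely helps. A sketch with illustrative probabilities:

```python
def p_pair_loss(p_disk: float, p_common: float = 0.0) -> float:
    """Probability of losing both mirrored disks in some time window.

    p_disk:   chance each disk fails independently in the window
    p_common: chance a shared cause (e.g. overheating) takes out both
    """
    return p_common + (1 - p_common) * p_disk ** 2

print(round(p_pair_loss(0.01), 6))                 # 0.0001  -- mirroring helps
print(round(p_pair_loss(0.01, p_common=0.05), 6))  # 0.050095 -- shared heat dominates
```

This is why redundancy designs try to break shared failure modes: separate power feeds, separate cooling zones, even disks from different manufacturing batches.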
Resilience is tied to the parameters created and maintained by the manufacturer. A component is only as resilient as it is made to be.
Resilience is not an answer to required availability. It is more connected with reliability. Availability is subject to reliability. If you want to have a high level of availability, you have to take into account the level of reliability that is possible with your chosen components and systems.
So, how can we mitigate the weakness of resilience? The answer is redundancy.
Years ago, only Telcos needed to be concerned with availability. There were no companies or organisations with a requirement for highly available data. Data was still based on hard copy and not on computers. As computing became more sophisticated and available to the market, data started to become a word, and a product, that gained in importance. In the early days data existed on very large but very small-capacity hard disks. Most data was saved to some kind of tape and later to floppy disks of one size or another. Availability as we know it wasn’t an issue. As long as you could get at your data, that was enough. As networking started to become an integral part of business computing infrastructures (a step up from sneaker-net), people began to think more about accessing data, and backing it up, for when that simple workgroup went down. Speed of access wasn’t really important. If it took a week to get it back, that was a relief.
Early networks were linear. Each workstation depended on the others to keep the network running. Cabling was clumsy and needed terminators to be in place to work properly. Lose a workstation, or a terminator, and you could lose the network, and with it access to the data you needed to do your work. Star and mesh network designs attempted to remove single-system reliance. With these topologies it didn’t matter if a workstation or server failed (terminators had disappeared by this time), as the network stayed up. Once network stability improved, users started to demand more availability of data. This was usually provided by backups. A little further along the road came Microsoft’s Distributed File System, which enabled replicas of the data to be created. If a file server failed, another could provide that data. Then came remote working, the Internet and the World Wide Web, and the rest is history. The demand for always-available data, applications and services had arrived with a vengeance.
Today we all demand high availability. We get pretty sore if our data and applications are not always available. Cloud Services bring ubiquitous data that exists everywhere. We can access it 24 X 7 X 365 using a simple web browser. Add to that our mobile devices and you can see that our infrastructures must be highly reliable and, by consequence, highly available.
We already know that resilience is not an answer for high availability. Only redundancy can provide the high levels of availability demanded today by users across the globe.
Redundancy is the solution to resilience. That sounds weird, but it’s true. By designing redundancy into every part of the infrastructure it is possible to overcome the limitations that manufacturers and service suppliers have built into their products. You can’t expect a server, or any component of that server, to last forever and work perfectly under all conditions that might be imposed. We could say that resilience is about failure and redundancy is about success.
Solution Architects spend their time designing IT infrastructures. They work at all levels of the network stack, from the facilities of the data centre through the hardware, operating systems and hypervisors to the applications. Beyond that they are involved with the network, the storage and the servers, plus a wide range of other technologies and systems used to provide the services delivered daily to billions of users everywhere.
Architects design networks. Their responsibility is to create designs appropriate to the business need of a particular customer. This means that a design for a financial institution is going to be very different to that required for a small company making flowerpots. Where there is a similarity is in the value of the data to the users in each scenario. Therefore, the design must have levels of resilience and redundancy specific to each company’s business requirements. Financial institutions are themselves available 24 X 7 in one form of service or another, so the network design has to have high levels of redundancy built into each layer. The flowerpot manufacturer, by contrast, probably only wants to access certain data each day and to be able to get back a lost file within a time commensurate with the importance of that file to the business. In this case we are talking about backup and retrieval systems. In the case of the financial institution we are talking about two of everything everywhere (a simple description but an accurate one).
SPOF – Single Point of Failure
However, the redundant network is itself a SPOF. To make it a high-availability service it needs another level of redundancy. That is, the data centre that houses the network should also be in a redundant topology. These days most data centres are redundant: not just redundancy based on the twin model, but multiple data centres, all connected with high-speed fibre and able to sustain 100% levels of availability, enabling service delivery even if a data centre is hit by a complete outage of power, or is subject to a catastrophic attack that renders it useless.
Redundant routers are connected with a redundant cabling topology so that the loss of two routers, one at each site, will not damage services. There may be a degradation of service in terms of data delivery speed, but the services will remain up.
The arrival of the Internet brought with it an even higher level of redundancy than was ever achievable by any single provider of data centre services. In most cases Internet Service Providers do not offer high levels of availability as a key selling point. This is odd, as the Internet is a highly available environment.
Internet services are provided by multiple geo-located data centres, each with multiple levels of redundant systems and networks able to withstand a global catastrophe. The Internet is the ultimate redundant network: a massive mesh topology with multi-pathing at the ‘n’th level. Data centres around the globe are interconnected using a variety of media including fibre, copper and satellite.
The Nines lost meaning with the advent of the Internet. However, the Nines and the availability they indicate have stayed with us in the data centres and server rooms across the world, and will remain for the foreseeable future.
Oh yes, I forgot to mention that even with all this redundancy there will still be outages. That’s just the way it is.