Summary
How do you design a computing system to provide continuous service and to ensure that any failures interrupting service do not result in customer safety issues or loss of customers due to dissatisfaction?
Historically, system architects have taken two approaches to answer this question: building highly reliable, fail-safe systems with low probability of failure, or building mostly reliable systems with quick automated recovery.
The RAS (Reliability, Availability, Serviceability) concept for system design integrates concepts of design for reliability and for availability along with methods to quickly service systems that can't be recovered automatically. This approach is fundamental to systems where the concern is quality of service, customer retention, and due diligence for customer safety.
This article was compiled from Wikipedia and the IBM article "What is Reliability, Availability, Performance and Serviceability?" (see Resources).
Reliability is evaluated by mean time to failure, and availability is measured by service uptime over a year of continuous operation.
Reliability is a measure of the ability of a system to function
correctly, including avoiding data corruption, whereas availability measures
how often it is available for use, even though it may not be functioning
correctly. For example, a server may run forever and so have ideal
availability, but may be unreliable, with frequent data corruption.
Reliability of a system can be defined as the probability that it will produce correct outputs up to some given time t. Reliability is enhanced by features that help
outputs up to some given time t. Reliability is enhanced by features that help
to avoid, detect and repair hardware faults. A reliable system does not
silently continue and deliver results that include uncorrected corrupted data.
Instead, it detects and, if possible, corrects the corruption, e.g., by
retrying an operation for transient (soft) or intermittent errors, or else, for
uncorrectable errors, isolating the fault and reporting it to higher level
recovery mechanisms (which may failover to redundant replacement hardware,
etc.), or else by halting the affected program or the entire system and
reporting the corruption. Reliability is often characterized in terms of mean
time between failures (MTBF), with reliability = exp(-t/MTBF).
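For a concrete feel of this formula, here is a minimal Python sketch (not from the original article) that evaluates R(t) = exp(-t/MTBF); the 100,000-hour MTBF is an assumed example value.

```python
import math

def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of failure-free operation for t_hours under the constant
    failure rate (exponential) model: R(t) = exp(-t / MTBF)."""
    return math.exp(-t_hours / mtbf_hours)

# Assumed example: a part with a 100,000-hour MTBF, run for one year (8,760 hours)
print(f"{reliability(8_760, 100_000):.4f}")  # ~0.9161, i.e. ~91.6% chance of no failure
```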
Availability is the probability that a system is operational at a given time, i.e., the proportion of time a device is actually operating relative to the total time it should be operating. In high availability applications, availability may be reported
as minutes or hours of downtime per year. Availability features allow the
system to stay operational even when faults do occur. A highly available system
would disable the malfunctioning portion and continue operating at a reduced
capacity. In contrast, a less capable system might crash and become totally
nonoperational. Availability is typically given as a percentage of the time a
system is expected to be available, e.g., 99.999 percent ("five
nines").
Availability %          | Downtime per year | Downtime per month | Downtime per week
90%                     | 36.5 days         | 72 hours           | 16.8 hours
95%                     | 18.25 days        | 36 hours           | 8.4 hours
98%                     | 7.30 days         | 14.4 hours         | 3.36 hours
99%                     | 3.65 days         | 7.20 hours         | 1.68 hours
99.5%                   | 1.83 days         | 3.60 hours         | 50.4 minutes
99.8%                   | 17.52 hours       | 86.23 minutes      | 20.16 minutes
99.9% ("three nines")   | 8.76 hours        | 43.2 minutes       | 10.1 minutes
99.95%                  | 4.38 hours        | 21.56 minutes      | 5.04 minutes
99.99% ("four nines")   | 52.6 minutes      | 4.32 minutes       | 1.01 minutes
99.999% ("five nines")  | 5.26 minutes      | 25.9 seconds       | 6.05 seconds
99.9999% ("six nines")  | 31.5 seconds      | 2.59 seconds       | 0.605 seconds
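The table values follow directly from the availability percentage; the short Python sketch below (illustrative only, not from the article) reproduces the yearly column, assuming a 365-day year. The per-month and per-week columns follow similarly from roughly a 30-day month and a 7-day week.

```python
def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Allowed downtime in a period for a given availability percentage."""
    return period_hours * (1 - availability_pct / 100)

# Reproduce the "downtime per year" column (365-day year assumed)
for a in (90, 95, 98, 99, 99.5, 99.8, 99.9, 99.95, 99.99, 99.999, 99.9999):
    h = downtime_hours(a, 365 * 24)
    print(f"{a:>8}% -> {h:9.3f} hours/year ({h * 60:9.2f} minutes)")
```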
Serviceability or maintainability is the simplicity and speed with which a system can be repaired
or maintained; if the time to repair a failed system increases, then
availability will decrease. It includes various methods of easily diagnosing
the system when problems arise. Early detection of faults can decrease or avoid
system downtime. For example, some enterprise systems can automatically call a
service center without human intervention when a fault occurs. The traditional focus has been on making the correct repairs with as
little disruption to normal operations as possible.
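The link between serviceability and availability can be made explicit with the steady-state formula availability = MTBF / (MTBF + MTTR). The following sketch (my own illustration, with assumed numbers) shows how shortening the mean time to repair raises availability for the same hardware.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as an uptime fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Same 10,000-hour MTBF hardware, three different repair times (assumed values)
for mttr in (24.0, 4.0, 0.5):
    print(f"MTTR {mttr:5.1f} h -> availability {availability(10_000, mttr) * 100:.4f}%")
```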
Performance defines the ability of the system to deliver within its design limits. For example, in ERP software applications, this can be expressed as a measure such as end-user click-to-click time for a transaction being within 10 seconds, against design limits of a maximum of 50,000 order lines per year and 250 named users with a maximum of 20% concurrency (i.e., 50 concurrent users).
Why does this matter?
From the end-user
standpoint, computing systems provide services, and any outage in that service
can mean lost revenue, lost customers due to dissatisfaction, and, in extreme
cases, loss of life and possible legal repercussions. For example, with cell phone
services, if it's only in rare cases that I can't get a signal or make a
connection, I'll stick with my service provider. But if this occurs too often
or in critical locations like my office or home, I'll most likely switch
providers. The end result is loss of revenue and loss of a customer. Embedded
systems not only provide value added services such as communications, but they
also provide critical services for human safety. For example, my anti-lock
braking system is provided by a digital control service activated by my
ignition. My expectation is that this service will work without failure once
ignition is completed. Any system fault that might interrupt service should
prevent me from using the vehicle before I start to drive. A failure during operation
could result in loss of life and product liability issues.
Ideally, service
outages would not be an issue at all, but experienced system architects know
they must analyze, predict, and design systems for handling failure modes in
advance. For safety-critical systems, this is the due diligence required to
avoid product liability nightmares. Historically, system architects have taken
two approaches to this problem: building highly reliable, fail-safe systems
with low probability of failure, or building mostly reliable systems with quick
automated recovery. Both approaches are judged with probability measures.
Reliability is evaluated by mean time to failure, and availability is measured
by service uptime over a year of continuous operation. The RAS (Reliability,
Availability, Serviceability) concept for system design integrates concepts of
design for reliability and for availability along with methods to quickly
service failures that can't be designed for automatic recovery.
Building systems for
very high reliability can be cost prohibitive, so RAS offers an approach to
balance reliability with recovery and servicing features to control cost and
ensure safety and quality of service. This approach is fundamental to systems
where the concern is affordable quality of service, customer retention, and due
diligence for customer safety.
Lessons from the design of IBM mainframes
You can learn a lot
from the experience built into systems like the IBM® mainframes that have
evolved from a rich heritage of design for reliability, availability, and
serviceability. This example explores elements of IBM mainframe architecture to
assist those developing new architectures by examining the design decisions
made in the big iron mainframes. This article gives background on the evolution
of RAS features developed for IBM mainframes and summarizes significant design
decisions.
To best understand
the evolution of RAS in IBM mainframe architecture, it is useful to step back
in time to 1964 and examine RAS features in the IBM System/360™
(see Resources) and consider how architects have balanced the issues of
cost, reliability, safety, availability, and servicing, and improved upon this
over time. Early systems were most often centralized rather than in the hands
of end users, and may have been less cost-sensitive than today's mainframes,
but the concepts of availability and reliability emerged early and have evolved
over time into the well-proven RAS features now found in the z990.
Often, system and
application requirements will determine if availability is stressed over
reliability or vice versa. For example, the concept of availability has also
been fundamental to telecommunication systems, where most often quality of
service is more of an issue than safety. In contrast, reliability has been
fundamental to systems such as commercial flight control systems where failure
means significant loss of life and assets. The balance of availability and
reliability features should fit the system -- building FAA (Federal Aviation
Administration) levels of reliability into cell phones would make the system
much less affordable and therefore not usable by many customers. Likewise,
safety-critical systems can't simply quote uptime to convince customers that
the systems are not too risky to trust -- fail-safe operation, reliable parts,
triple redundancy, and the extra cost that goes along with these design
features is expected and will be paid for to mitigate risk.
The difference between availability and reliability
Availability is
simply defined as the percentage of time over a well-defined period that a
system or service is available for users. So, for example, if a system is said
to have 99.999%, or five nines, availability, this system must not be
unavailable more than five minutes over the course of a year. Quick recovery
and restoration of service after a fault greatly increases availability. The
quicker the recovery, the more often the system or service can go down and
still meet the five nines criteria. Five nines is often called high
availability, or HA.
In contrast, high
reliability (HR) is perhaps best described by the old adage that a chain is
only as strong as its weakest link. Building a system from components that have
very low probability of failure leads to maximal system reliability. The overall expected system reliability is simply the product of all subsystem
reliabilities, and the subsystem reliability is a product of all component
reliabilities. Based upon this mathematical fact, components are required to
have very low probability of failure if the subsystems and system are to also
have reasonably low probability of failure. For example, a system composed of
10 components, each with 99.999% reliability, is (0.99999)^10, or 99.99%,
reliable. Any decrease in the reliability of a single component in this type of
single-string design can greatly reduce overall reliability -- for example
adding just one 95% reliable component would drop the overall reliability to
94.99%.
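The arithmetic in this example is just a product of component reliabilities; the small Python sketch below (illustrative, not from the article) reproduces both figures.

```python
from math import prod

def series_reliability(component_reliabilities):
    """Reliability of a single-string (series) design: the product of all parts."""
    return prod(component_reliabilities)

ten_good_parts = [0.99999] * 10
print(f"{series_reliability(ten_good_parts):.6f}")           # ~0.999900 -> 99.99%
print(f"{series_reliability(ten_good_parts + [0.95]):.6f}")  # ~0.949905 -> ~94.99%
```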
The z990 includes
features that allow it to be serviced and upgraded without service
interruption. The subsystem level of redundancy is the "book," which
is an independently powered, multi-chip module, with cooling, memory cards, and
IO cards. There are also redundant components within a book including CPU
spares (2 per book) and redundant interconnection to level-2 cache.
Furthermore, the z990 provides jumpering to preserve the interconnection of MP
(multi-processor) books while other books are being serviced or replaced in
case of non-recoverable errors.
Service or system
outages can be caused by routine servicing, upgrades, and failures on most
traditional computing systems. Probability of a system or service outage on the
z990 is limited to scenarios where failures are non-recoverable by switching
redundant components or subsystems into operation. In most component or
subsystem failure modes, the z990 is able to isolate the non-recoverable book
or component and continue to operate with some performance degradation.
Redundant books and components allow the z990 to operate without service
interruption until the degraded book is replaced. As you will see later on in
this article, designing interconnection networks for isolation and switching of
modules like the z990 books is complex. While this complexity adds to cost, it
does significantly decrease probability of service interruption. You can find
more detailed information on processor book management in the
"Reliability, availability, and serviceability (RAS) of the IBM eServer
z990" paper (see Resources).
How high reliability helps
It is theoretically
possible to build a system with low-quality, not-so-reliable components and
subsystems, and still achieve HA. This type of system would have to include
massive redundancy and complex switching logic to isolate frequently failing components and to bring spares online very quickly in place of failed components in order to prevent interruption to service. Most often, it is better to
strike a balance and invest in more reliable components to minimize the
interconnection and switching requirements. If you take a very simple example
of a system designed with redundant components that can be isolated or
activated, it becomes clear that the interconnection and switching logic does
not scale well to high levels of redundancy and sparing.
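As a rough illustration of why brute-force sparing does not scale, the sketch below models N identical redundant units with independent failures and a perfect switch (both optimistic assumptions), alongside the roughly quadratic growth in point-to-point interconnect links; the numbers are illustrative only.

```python
def parallel_availability(unit_availability: float, copies: int) -> float:
    """Availability of N redundant copies, assuming independent failures and a
    perfect (instant, always-working) switch -- an optimistic simplification."""
    return 1 - (1 - unit_availability) ** copies

def crossbar_links(copies: int) -> int:
    """Point-to-point links for full any-to-any interconnection of N units."""
    return copies * (copies - 1) // 2

for n in (1, 2, 3, 4, 8):
    print(f"{n} copies of a 99% unit -> availability {parallel_availability(0.99, n):.8f}, "
          f"~{crossbar_links(n)} interconnect links")
```

The availability gain saturates after a few copies while the switching fabric keeps growing, which is one way to see why investing in component reliability is often the better trade.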
A trade-off can be made
between the complexity of interconnecting components and redundancy management
with the cost of including highly reliable components. The cost of hardware
components with high reliability is fairly well known and can be estimated
based upon component testing, expected failure rates, MTBF (mean time between
failures), operational characteristics, packaging, and the overall physical
features of the component.
System architects
should also consider three simple parameters before investing heavily in HA or
HR for a system component or subsystem:
- Likelihood of unit failure
- Impact of failure on the system
- Cost of recovery versus cost of fail-safe isolation
How cost and safety factor in
HA design may not always ensure that a system will be safe. Much depends on how long service
outages will be during recovery scenarios. For safety-critical systems such as
flight control or anti-lock braking, it is possible that even very brief
outages could lead to loss of system stability and total system failure. Thus,
for safety-critical systems, the balance between HA and HR must often favor HR
in order to avoid risky recovery scenarios.
Achieving HA/HR with redundancy as the primary method
One approach to HA is to use redundancy and switching not only to increase availability, but also to allow lower-reliability (and most often lower-cost) components to be used, which for some systems yields the target HA at the lowest overall system cost. The best
example of this approach is RAID (Redundant Array of Inexpensive Disks);
see Resources for examples and links. In fact, numerous RAID
configurations have been designed which make trade-offs between HA/HR and
serviceability by using larger numbers of disk drives of various types
including SCSI, Fibre Channel, Serial-Attached SCSI, and Serial ATA
drives. RAID also can provide improved storage performance by striping
writes/reads over multiple drives as well as using drives for parity or more
sophisticated error encoding methods. Typical RAID systems with volume
protection allow for a drive failure and replacement with automatic recovery of
the volume and no downtime given a single drive failure. Protection from double
faults or failures while a RAID system is recovering has become of interest
more recently and has led to development of RAID 6. One of the more interesting
aspects of the RAID approach is that it not only relies upon specialized HA
hardware, but on fairly complex software.
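To make the single-drive recovery idea concrete, here is a toy Python sketch of XOR parity in the style of RAID 5 volume protection; it is purely illustrative and not a description of any particular product's implementation.

```python
def parity_block(blocks):
    """XOR parity across equal-sized blocks (the core idea behind RAID 5)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def rebuild_lost_block(surviving_blocks, parity):
    """Reconstruct a single failed drive's block from the survivors plus parity."""
    return parity_block(list(surviving_blocks) + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]           # three data drives, toy 4-byte stripe
p = parity_block(data)
recovered = rebuild_lost_block(data[:2], p)  # drive 3 fails; rebuild from 1, 2 + parity
assert recovered == data[2]
```

Double-fault protection of the kind RAID 6 provides layers a second, independent syndrome on top of this single-parity scheme.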
RAS designs should span hardware, firmware, and software layers
Perhaps much harder
to estimate is the cost of highly reliable software. Clearly, reliable hardware
running unreliable software will result in failure modes that are likely to
cause service interruption. It is well accepted that complex software is often
less reliable, and that the best way to increase reliability is with testing.
Testing takes time and ultimately adds to cost and time to market.
Early on, system
architects focused on designing HA and HR hardware with firmware to manage
redundancy and to automate recovery. So, for example, firmware would
reconfigure the components in the example in Table 1 to recover from a
component failure. Traditionally, rigorous testing and verification have been relied upon to ensure that firmware has no flaws, but history has shown that defects can
still wind up in the field and emerge due to subtle differences in timing or
execution of data-driven algorithms.
High reliability software often comes at a high cost
Designing firmware
and software for HR can be costly. The FAA requires rigorous documentation and
testing to ensure that flight software on commercial aircraft is highly
reliable. The DO-178B Level A standard requires software developers to maintain copious design, test, and process documentation. Furthermore, testing must include evidence that code has been thoroughly exercised against criteria such as modified condition/decision coverage (MC/DC). This criterion ensures that every statement and decision outcome in the code has been exercised and that each condition independently affects its decision. It is very laborious and therefore greatly increases the cost of software components.
Cost trade-offs in the hardware layer
Designing for HR
alone can be cost prohibitive, so most often a balance of design for HA and HR
is better. HA at the hardware level is most often achieved through redundancy
(sparing) and switching, which is the case for the z990 and has been
fundamental to IBM mainframe design since the System/360. A trade-off is made between the cost of duplication and simply engineering higher reliability into components to increase the MTBF. Over time, hardware designers have found a balance
between HR and HA features to optimize cost and availability. Fundamental to
duplication schemes is the recovery latency. For example, the z990 has a
dynamic CPU sparing (DCS) feature that can cover a failure so that firmware is
unaffected by reconfiguration to isolate the errant CPU and to switch in the
spare.
When considering
component or subsystem duplication for HA, architects must carefully consider
the complexity and latency of the recovery scheme and how this will affect
firmware and software layers. Trade-offs between working to simply increase
reliability instead of increasing availability through sparing should be
analyzed. A simple well-proven methodology that is often employed by systems
engineers is to consider the trade-off of probability of failure, impact of
failure, and cost to mitigate impact or reduce likelihood of failure. This
method is most often referred to as FMEA (Failure Modes and Effects Analysis);
see Resources. Another, less formal process that system engineers often use is
referred to as the "low hanging fruit" process. This process simply
involves ranking system design features under consideration by cost to
implement, reliability improvement, availability improvement, and complexity.
The point of low hanging fruit analysis is to pick features that improve HA/HR
the most for least cost and with least risk. Without existing products and
field testing, the hardest part of FMEA or low hanging fruit analysis is
estimating the probability of failure for components and estimating improvement
to HA/HR for specific features. For hardware, the tried and true method for
estimating reliability is based upon component testing, system testing,
environmental testing, accelerated testing, and field testing. The trade-offs
between engineering reliability and availability into hardware are fairly
obvious, but how does this work with firmware and software?
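A minimal sketch of the kind of ranking these processes produce is shown below; the failure modes, probabilities, impact scores, and costs are made-up examples, and the scoring is a simplification of a formal FMEA rather than the standard procedure.

```python
# Rank candidate RAS features by risk reduced per unit cost ("low hanging fruit").
# All numbers below are assumed, illustrative values.
failure_modes = [
    # (name, probability of occurrence per year, impact score 1-10, mitigation cost $k)
    ("memory multi-bit error", 0.30, 6, 15),
    ("CPU hard failure",       0.02, 9, 120),
    ("cooling fan failure",    0.10, 4, 5),
]

def risk(probability, impact):
    return probability * impact

ranked = sorted(failure_modes, key=lambda m: risk(m[1], m[2]) / m[3], reverse=True)
for name, probability, impact, cost in ranked:
    print(f"{name:22s} risk={risk(probability, impact):.2f} "
          f"cost=${cost}k risk-reduced-per-$k={risk(probability, impact) / cost:.3f}")
```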
Cost trade-offs in the firmware and software layers
Designing and
implementing HR firmware and software can be very costly. The main approach for
ensuring that firmware/software is highly reliable is verification with formal
coverage criteria along with unit tests, integration tests, system tests, and
regression testing. Test coverage criteria include feature points, but for HR systems much more rigorous criteria are necessary, including statement, path, and, in extreme cases, modified condition/decision coverage (MC/DC). The IBM
z/OS testing has followed this rigorous approach to ensure HR. (This testing is described in the article "Testing z/OS"; see Resources.) The FEDC
system in z/OS provides support for HA, recognizing that despite rigorous
testing, software/firmware testing cost and time spent must be limited at some
point, and the design should include support for quick operator assisted
recovery. Finally, the FEDC is also very useful for application developers who
most likely would also like to strike a balance between rigorous testing of
their application and provision for recovery.
Component-level error detection and correction
Ideally, all system,
subsystem, and component errors can be detected and corrected in a hierarchy so
that component errors are detected and corrected without any action required by
the containing subsystem. This hierarchical approach for fault detection and
fault protection/correction can greatly simplify verification of a RAS design.
The z990 ECC memory component provides for single-bit error detection and
automatic correction. The incorporation of ECC memory provides a component
level of RAS, which can increase RAS performance and reduce the complexity of
supporting RAS at higher levels.
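The principle behind ECC can be shown with a toy Hamming(7,4) code: three parity bits over four data bits locate and correct any single flipped bit. Real ECC memory uses wider SEC-DED codes implemented in hardware, so the Python sketch below is only an illustration of the idea.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(code):
    """Locate and flip a single-bit error, then return the corrected data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3   # 0 means no single-bit error detected
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                               # a single bit flips in memory
assert hamming74_correct(stored) == word     # the flip is detected and corrected
```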
Redundancy management at the subsystem level
The z990 includes a
number of redundancy management features that provide online
replacement/upgrade, automatic recovery with spares, and continuous operation
in degraded modes for double and triple faults that require servicing. The z990
organizes processing, memory, and I/O resources into logical units
called books, which are interconnected so that they can be switched
into or out of operation without interruption to the overall system. This
scheme includes the ability to add and activate processors, memory, and I/O
connections while the system continues to run. The fault-tolerant book interconnection and cross-book CPU sparing also provide excellent automatic recovery for most fault scenarios.
Finally, support
subsystems such as cooling also include redundancy so that the system is not
endangered by thermal control system faults. Redundancy and management of that
redundancy with automated fault detection, isolation, and automatic recovery is
fundamental to the z990 RAS design.
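A highly simplified model of this recovery policy is sketched below in Python; the class names, fields, and scenario are illustrative assumptions (the text above notes two CPU spares per book), not IBM's actual firmware interfaces.

```python
from dataclasses import dataclass

@dataclass
class Book:
    """Coarse stand-in for a z990-style replaceable unit with built-in spares."""
    name: str
    spare_cpus: int = 2     # the article notes two spare CPUs per book
    isolated: bool = False

def handle_cpu_fault(book: Book) -> str:
    # Preferred path: activate a spare within the book, no service impact.
    if book.spare_cpus > 0:
        book.spare_cpus -= 1
        return f"{book.name}: spare CPU switched in, full capacity preserved"
    # Otherwise isolate the book and run degraded until it is replaced.
    book.isolated = True
    return f"{book.name}: isolated, system continues degraded pending service"

book0 = Book("book0")
for _ in range(3):      # first two faults are absorbed by spares, the third degrades
    print(handle_cpu_fault(book0))
```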
Fail-safe design
HR systems often
include design elements that ensure that non-recoverable failures result in the
system going out of service, along with safing to reduce risk of losing the
asset, damaging property, or causing loss of life. The z990 may or may not be
used in applications that are considered safety critical -- arguably, even a
database error could result in significant loss of assets (for stock market
applications) or even loss of life (in certain health care applications). The
z990 does incorporate fail-safe modes when recovery is impossible or too costly
to incorporate (for example, double processor faults in a single book) and the
likelihood of a failure is low. In the case of double or triple faults, the z990 isolates the failed subsystem (a book) and requires operator assistance. For most users, this is likely a good cost-versus-HA/HR trade-off, given the built-in support for serviceability in the z990 (on-line replacement).
Serviceability concepts
The z990 strikes a
nice balance between HA, HR, system safety (permanent loss of data could have
high related risk), and simplicity of operation and servicing. In most cases, error tracking, data logging and upload to IBM with RETAIN, and configuration tracking of FRUs (field-replaceable units) simplify service calls, often allowing a technician to handle a non-recoverable failure or upgrade
without causing service interruption. The z990 organization of books as
field-replaceable units with support to assist the operator in recovery greatly
increases the z990 serviceability for faults that occur despite the HR/HA
design.
Recovery concepts
Conceptually,
architects should consider how levels of recovery are handled with varying
degrees of automation, as depicted in Figure 2.
Figure 2. Supporting multiple levels of recovery autonomy
The z990 approach includes all levels of recovery automation and management, from fully automatic through operator-assisted to fully manual. Manual recovery requirements
are minimized, and manual recovery still includes isolation features so that
FRUs can be replaced easily to restore full performance without service
interruption and without impact to other processor books.
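As a sketch of these escalation levels, the snippet below (an illustration of the concept, not the z990's actual logic) classifies a fault into one of the three recovery modes.

```python
from enum import Enum, auto

class Recovery(Enum):
    AUTOMATIC = auto()          # handled by hardware/firmware, no human involved
    OPERATOR_ASSISTED = auto()  # system isolates the fault, operator confirms recovery
    MANUAL = auto()             # service call and FRU replacement required

def classify(recoverable: bool, spare_available: bool) -> Recovery:
    """Toy escalation policy mirroring the levels of recovery autonomy above."""
    if recoverable and spare_available:
        return Recovery.AUTOMATIC
    if recoverable:
        return Recovery.OPERATOR_ASSISTED
    return Recovery.MANUAL

print(classify(True, True), classify(True, False), classify(False, False))
```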
Putting it all together
The z990 is an
excellent example of good RAS design because RAS is considered at all levels of
hardware, from components and subsystems to the system level, and because
firmware/software design for RAS is considered in addition to hardware. Perhaps
most important, though, is the fact that the z990 RAS design spans all of these
levels and layers so that overall RAS is a system feature. The well-integrated
RAS features of the z990 no doubt increase the cost of the system considerably.
However, the z990 provides customers with an HA/HR computing platform with low
risk of losses due to downtime. The cost trade-off will vary for each system
design based upon the risk and cost associated with downtime compared to the
cost of decreasing the overall probability of downtime. The system architect has to find the right balance of cost, complexity, recovery automation, reliability, and time to market for each design, given the system's intended usage.
There is no simple formula for HA/HR analysis, but one place to start is with
careful consideration of what the risk and cost of being out of service are -- if this can be measured in dollars lost per hour or day, potential loss of life, or loss of customers, this is a good starting point. If my business will potentially lose thousands of dollars per hour while my enterprise system is out of service, then I will be willing to pay much more for HA/HR.
Is autonomic architecture the future for RAS design?
Strategies for RAS
like the z990 have been refined and improved significantly since the concepts
of continuous availability were introduced in early mainframes like the
System/360. Clearly, RAS requires balance between safety and uninterrupted
service on the one hand and the cost to provide these features on the other.
The cost of additional RAS performance can be high, and this cost must be
balanced with the risk and cost of occasional service failures and safety. How
much is enough? How can the additional cost of better RAS be financed?
One concept is that
systems that require little to no monitoring are not only more cost efficient,
but necessary, if automation is to scale up significantly beyond present day
systems. Enterprise systems include processing, storage, and I/O, with thousands
of interconnections, hundreds of processors, and terabytes of data. Quick
access to widely distributed information for decision support, global
operations, and commerce is required, and the volume and speed of information
flow is scaling beyond the point where traditional monitoring and service
methods can be applied. Autonomic architecture is an alternative to traditional
systems administration that uses RAS as a starting point to further reduce the
human attention required to maintain enterprise systems. A full description of
autonomic architecture is beyond the scope of this review of RAS strategies;
however, you can find more information in the Resources section below.
Resources
This article was extracted from:
- Wikipedia
- IBM: What is Reliability, Availability, Performance and Serviceability?