Real Time and Fault Tolerant Systems
The Quest for Zero Downtime:
Since the dawn of the Internet, the need for application availability and reliability has steadily increased. This need is especially strong in the military, aerospace, and aircraft control industries, where any amount of downtime can have fatal consequences. The 1998 case study "NCAPS: Application High Availability in UNIX Computer Clusters," by Luiz A. Laranjeira, describes a specialized software system developed at Tandem Computers that runs on Unix computer clusters while providing a superior level of application availability. This essay offers a critique of the case study as well as of the software architecture and fault tolerance strategies used.
Design Goals
By the time of the case study, application availability had already improved immensely. However, there was still a need to improve the recovery times of existing high availability solutions, especially for real-time critical applications. Recovery was slow, expensive, and unreliable, taking anywhere between one minute and an hour. The key design goal of the NCAPS system was therefore to provide continuous availability of real-time critical systems in the event of hardware, software, or operating system faults. By significantly shrinking the recovery times of large-scale applications, the NCAPS design could not only keep these vital systems up and running, but also help reduce the hefty costs associated with downtime.
System Architecture
NCAPS provides specialized system software that runs on a Unix computer cluster with two or more nodes. Two nodes is commonly regarded as the minimum for a high availability cluster, since at least one node must remain to take over when another fails. The system can also fail over more rapidly because it is based on a primary/backup scheme, in which two instances of an application run at the same time.
As described in the case study, the NCAPS software architecture includes the Node Status Monitor (NSM), the Keepalive (KpA), the Process Pairs Manager (PPM), the Open Fault Tolerance Library (OftLib), and the Command Line Interface (CLI). The NSM, KpA, and PPM are replicated in both nodes and interact through continuous monitoring and message communication. The state of the two nodes is monitored by the NSM. The KpA keeps an eye on registered processes and uses a script to restart them in the case of failure.
Most importantly, the PPM is the core of the NCAPS system: it starts, monitors, and manages application processes using a process pairs paradigm. The PPM state model can also be configured by the user, which the case study presents as a key competitive advantage over other high availability software vendors.
Fault Tolerance Strategies Used
Redundancy
Redundancy has been defined as the duplication of critical components of a system with the intention of increasing reliability of the system, usually in the case of a backup or fail-safe (Answers.com, 2010). The two nodes of the NCAPS system offer redundancy because they mirror each other, always providing one node in primary status and the other in backup status. If one fails, the other is available to take over.
Redundancy in the NCAPS system can also be found in the NSM, where "heartbeats" are exchanged between the two NSMs. "When one NSM does not receive a configurable number of heartbeats from the other within a configurable period of time, it sends a node-down message to its subscribers (the PPM only). When the other node is restarted and the two NSMs resume exchanging heartbeats, the NSM sends a node-up message to its subscribers," (Laranjeira, 1998 p. 442).
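The heartbeat rule quoted above can be sketched in a few lines of Python. This is only an illustration of the idea, not the NCAPS implementation: the class name, the one-second interval, and the threshold of three missed heartbeats are all assumptions made for the example, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class NodeStatusMonitor:
    """Toy model of one NSM watching its peer node (all names are hypothetical)."""
    heartbeat_interval: float = 1.0   # assumed seconds between expected heartbeats
    max_missed: int = 3               # assumed configurable number of missed heartbeats
    last_heartbeat: float = 0.0
    peer_up: bool = True
    subscribers: List[Callable[[str], None]] = field(default_factory=list)

    def _notify(self, message: str) -> None:
        for callback in self.subscribers:
            callback(message)

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat; announce node-up if the peer had been down."""
        self.last_heartbeat = now
        if not self.peer_up:
            self.peer_up = True
            self._notify("node-up")

    def check(self, now: float) -> None:
        """Announce node-down once the allowed number of heartbeats is missed."""
        missed = int((now - self.last_heartbeat) / self.heartbeat_interval)
        if self.peer_up and missed >= self.max_missed:
            self.peer_up = False
            self._notify("node-down")


events: List[str] = []
nsm = NodeStatusMonitor(subscribers=[events.append])
nsm.on_heartbeat(now=0.0)
nsm.check(now=2.0)          # two intervals missed: peer still considered up
nsm.check(now=3.5)          # three intervals missed: node-down is sent
nsm.check(now=5.0)          # already down: no duplicate message
nsm.on_heartbeat(now=6.0)   # heartbeats resume: node-up is sent
print(events)
```

In this sketch the subscriber list plays the role the case study assigns to the PPM, which is the only component that receives the node-down and node-up messages.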
Always ready to switch from a backup to a primary state, the PPM provides redundancy as well: "One instance of the PPM and of the watched application run in each of two nodes of a cluster. In one node an instance of the application is in a primary state and is providing service. In the other node another instance of the application is in a backup state. A backup application is not providing service, but it is initialized and ready to take over in case of a failure of the primary application or of its node," (Laranjeira, 1998 p. 442).
Described as a highly available service implemented as a primary and shadow instance, the Keepalive component offers further redundancy, "These two instances send heartbeats to each other and share information through a memory mapped file. If the shadow instance dies, the primary restarts it. If the primary instance dies, the shadow instance becomes primary, takes control of the memory mapped file, and spawns another shadow instance," (Laranjeira, 1998 p. 442).
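The primary/shadow behavior described in that quotation can be modeled as follows. This is a hedged sketch rather than the actual Keepalive code: the instance names, the `KeepalivePair` class, and the use of a plain dictionary in place of the memory mapped file are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Instance:
    name: str
    role: str   # "primary" or "shadow"


class KeepalivePair:
    """Toy model of the primary/shadow Keepalive pair (hypothetical names)."""

    def __init__(self) -> None:
        # A plain dict stands in for the memory mapped file both instances share.
        self.shared_state: Dict[str, str] = {"owner": "kpa-1"}
        self.primary = Instance("kpa-1", "primary")
        self.shadow = Instance("kpa-2", "shadow")
        self._next_id = 3

    def _spawn_shadow(self) -> Instance:
        instance = Instance(f"kpa-{self._next_id}", "shadow")
        self._next_id += 1
        return instance

    def on_missed_heartbeat(self, dead_name: str) -> None:
        """React to one instance no longer answering heartbeats."""
        if dead_name == self.shadow.name:
            # Shadow died: the primary simply restarts it.
            self.shadow = self._spawn_shadow()
        elif dead_name == self.primary.name:
            # Primary died: the shadow is promoted, takes over the shared
            # state, and spawns a replacement shadow.
            self.shadow.role = "primary"
            self.primary = self.shadow
            self.shared_state["owner"] = self.primary.name
            self.shadow = self._spawn_shadow()


pair = KeepalivePair()
pair.on_missed_heartbeat("kpa-2")   # shadow dies and is replaced by kpa-3
pair.on_missed_heartbeat("kpa-1")   # primary dies and kpa-3 is promoted
print(pair.primary, pair.shadow, pair.shared_state)
```

The point of the design is that the pair never drops to zero instances as long as only one of them fails at a time.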
Fault/Error Isolation/Containment
Information on fault/error isolation and containment in the NCAPS system was not clearly disclosed in the case study. Regarding the PPM and application processes specifically, however, the following was stated: "In failure situations, the PPM executes a cleanup script and restarts the application to a maximum configurable number of times. The cleanup script ensures that all application processes have exited before the application is restarted," (Laranjeira, 1998 p. 443). In addition, as discussed under Fault/Error Detection, the Hang Detection Service will "kill" an offending process if a hang is detected. Otherwise, little information was provided on how faults or errors are contained.
Fault/Error Detection
In the case study, there is little to no description of the types of faults the system detects, whether transient, permanent, or intermittent. Based on background course material, this may be a shortcoming: "In an ultra-reliable system, it is essential to have error detection and recovery mechanisms designed to handle transient faults. These mechanisms must be able to distinguish transient faults from permanent or intermittent faults, so that when a transient fault is detected in a unit the unit is not discarded," (Course Objectives). This does not mean that the NCAPS system is unable to distinguish between the different types of faults; it simply raises questions because this essential information was not provided in the case study.
Fault detection functionality in the NCAPS system can be found in the PPM, "When an application process fails, the PPM detects it and restarts it up to a maximum configurable number of times. After this threshold is exceeded, the next failure of that process will imply in a failure of the application," (Laranjeira, 1998 p. 443). Also, the Application Administration (AAD), a key component of the PPM, provides fault detection by mediating the interactions between the Application State Model (ASM) and the application. The AAD detects an application event, such as a failure, and directs it into the ASM. After a state change takes place, an ASM action triggers the AAD to send a state change command message to the application processes (Laranjeira, 1998 p. 446).
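The restart threshold in the quotation above amounts to a simple counter. The sketch below is illustrative only: `ProcessSupervisor`, its method name, and the limit of two restarts are assumptions, and the returned strings stand in for whatever actions the PPM actually takes.

```python
class ProcessSupervisor:
    """Toy model of the PPM restart threshold (hypothetical names)."""

    def __init__(self, max_restarts: int) -> None:
        self.max_restarts = max_restarts   # assumed configurable limit
        self.restarts = 0
        self.application_failed = False

    def on_process_failure(self) -> str:
        """Restart the process until the limit is used up, then fail the application."""
        if self.restarts < self.max_restarts:
            self.restarts += 1
            return "restart"
        self.application_failed = True
        return "application-failure"


supervisor = ProcessSupervisor(max_restarts=2)
outcomes = [supervisor.on_process_failure() for _ in range(3)]
print(outcomes)
```

With a limit of two, the first two failures lead to restarts and the third is escalated to an application failure, which is the point at which failover to the backup would be considered.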
Part of the functionality provided by the Open Fault Tolerance Library (OftLib) is the Hang Detection Service (HDS), which offers the capability to detect faults that cause a process to "hang." By using heartbeats with specified time intervals, the Hang Detection Service detects when a heartbeat is not received when expected and thus responds with the appropriate action. At this point, HDS simply ends the offending process. Keepalive then detects that the process no longer exists and the appropriate recovery mechanisms are triggered (Laranjeira, 1998 p. 444).
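The two-step chain described above, in which the HDS ends a hung process and Keepalive then notices that it is gone, can be sketched as follows. The process names, the two-second timeout, and both function and class names are assumptions; killing and restarting are simulated by removing from and adding to a set.

```python
from typing import Dict, List, Set


class HangDetector:
    """Ends processes whose heartbeat is overdue (hypothetical model of the HDS)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout                 # assumed heartbeat deadline in seconds
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, name: str, now: float) -> None:
        self.last_seen[name] = now

    def sweep(self, now: float, running: Set[str]) -> List[str]:
        """Remove every overdue process from `running` and return their names."""
        hung = [n for n, seen in self.last_seen.items() if now - seen > self.timeout]
        for name in hung:
            running.discard(name)              # stands in for killing the process
            del self.last_seen[name]
        return hung


def keepalive_scan(registered: Set[str], running: Set[str]) -> List[str]:
    """Find registered processes that no longer exist and restart them."""
    missing = sorted(registered - running)
    running.update(missing)                    # stands in for running a restart script
    return missing


registered = {"billing", "router"}
running = set(registered)
hds = HangDetector(timeout=2.0)
hds.heartbeat("billing", now=0.0)
hds.heartbeat("router", now=0.0)
hds.heartbeat("router", now=2.0)               # "billing" has stopped sending heartbeats

killed = hds.sweep(now=3.0, running=running)
restarted = keepalive_scan(registered, running)
print(killed, restarted)
```

Separating detection (HDS) from recovery (Keepalive) means a hung process is handled by the same path as a crashed one.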
System Reconfiguration
The case study provided relatively little information on system reconfiguration. Within the PPM, some functionality is configurable: users can define their own scripts to be executed during a state change. For specific state changes, the user can therefore determine which actions are applied, such as transferring resources when an application fails over or triggering an alarm on a specified state change (Laranjeira, 1998 p. 443).
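One way to picture user-defined actions on state changes is a table that maps a transition to a list of callbacks. The sketch below is an assumption about how such a mechanism could look, not a description of the PPM's configuration format; the state names and actions are invented, and Python functions stand in for the user scripts.

```python
from typing import Callable, Dict, List, Tuple

Transition = Tuple[str, str]   # (old state, new state)


class ConfigurableStateModel:
    """Toy state model with user-supplied actions (all names hypothetical)."""

    def __init__(self, initial_state: str) -> None:
        self.state = initial_state
        self.actions: Dict[Transition, List[Callable[[], None]]] = {}

    def register(self, old: str, new: str, action: Callable[[], None]) -> None:
        """Attach a user-defined action to one specific state change."""
        self.actions.setdefault((old, new), []).append(action)

    def change_state(self, new_state: str) -> None:
        """Move to the new state and run the actions registered for that change."""
        old_state, self.state = self.state, new_state
        for action in self.actions.get((old_state, new_state), []):
            action()


log: List[str] = []
model = ConfigurableStateModel("backup")
model.register("backup", "primary", lambda: log.append("transfer resources"))
model.register("backup", "primary", lambda: log.append("raise alarm"))
model.register("primary", "down", lambda: log.append("page operator"))

model.change_state("primary")
print(log)
```

Only the actions registered for the backup-to-primary change run here; the action tied to a different transition is left untouched.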
System Recovery
System recovery within the NCAPS system is typically handled by the PPM or the Command Line Interface (CLI). With regard to the PPM, "In failure situations, the PPM executes a cleanup script and restarts the application up to a maximum configurable number of times. The cleanup script ensures that all application processes have exited before the application is restarted," (Laranjeira, 1998 p. 443). In addition, the CLI gives system administrators the control they need to perform a range of operations, including the ability "to query the application's state or to manually cause the application to failover, become primary, reinitialize, inhibit the failover function, (when the application is in the backup state), un-inhibit the failover function, startup or shutdown," (Laranjeira, 1998 p. 444).
Conclusion
Overall, it appears that the NCAPS system is a highly effective solution built on a solid, logical architecture. From the primary/backup design to the PPM, NSM, and Keepalive components, redundancy is prevalent throughout the system and helps to provide high availability and resiliency. Multiple components monitor the system, applications, and application processes and allow communication between the various components. In addition, the PPM, AAD, and HDS can detect faults by monitoring system heartbeats, errors, and potential failures. System reconfiguration can be defined by the user, and system recovery can be handled by the PPM and the CLI. Because so much of the system's behavior is user-defined, users gain the flexibility to configure it to meet their specific needs.
While there was a lot of detailed information in the case study, there were some information gaps. A definition of the types of faults the system detects, such as transient, permanent, or intermittent, and how the system handled the different faults would have been helpful. Also, knowing how the faults and errors were then isolated and contained would have been useful.
Since the case study was written in 1998, it would be interesting to see where the product and its functionality stand today. The need for highly available systems has only increased over time, and it appears that the NCAPS system would have had a strong lead over the competition.
Real Time and Fault Tolerant Systems
Part II
It's no secret that the Internet has grown into an abundant, international resource that many people use -- and rely on -- daily. "Approximately 1.5 billion people worldwide use Internet today, and Internet usage continues to increase exponentially. A recent survey revealed that approximately 78% to 80% of the people in the age group of 18-50, use Internet," (Arunnima, B.S. -- no date).
From e-mail communications to online shopping, people can use the Internet to access the information or service they need whenever they want, from wherever they want, 24 hours a day, 7 days a week. This convenience and accessibility have led to an expectation that the services and information will be delivered no matter what. These expectations can be especially high for banking companies that offer online access to customer accounts and private information. Customers have come to expect that the services they need will not only be available, but reliable and secure. "Gone are those days where a customer would walk into a bank and wait for a representative to help do a fund transfer or to request for a demand draft. Expectations of customers have changed with the technological advancements in Internet and telecommunications. Today's tech savvy customer would even want to deposit a cheque being at home at his/her convenience," (Arunnima, B.S. -- no date).
Internet banking
Because of its popularity, Internet banking was the Web service chosen for this essay. More and more, people are embracing the convenience of online banking: "About 75% of American banking customers surveyed during an October 2008 study reported using online banking to keep track of their expenses. Not surprisingly, a similar number confirmed that they were watching finances more closely during the current economic downturn. Online banking reported the strongest growth among all channels -- customers wanted to watch their finances more closely, at least cost, and only banking served both ends," (Jaymalya Palit, no date).
Banking customers now expect access to their money at all times, whether to simply check their financial status or to pay bills and transfer funds. This requires a Web service that remains available around the clock and that failures and errors won't bring down. While it may not be a "life or death" situation if a customer can't get into his/her account at a critical time, it could cause distress or damage a person's credit if a payment is missed or needed funds cannot be transferred in time.
The user experience: Web 2.0
For the hypothetical online banking service presented here, Web 2.0 will serve as the front-end software foundation. A proven and effective technology, Web 2.0 can support a customer-centric model, which is particularly helpful in the banking industry. "Technology can now enable banks to provide personalised interaction on party assisted or even unassisted channels. Powered by Web 2.0 technology, Internet banking is moving towards greater personalization and interactivity," (Jaymalya Palit, no date). These capabilities not only provide the appropriate next-generation technology, but also enable banks to establish better relationships with their customers: "Those banks that successfully deliver a memorable and unique customer experience, consistently across their offline and online channels, can hope to steal a march over their competitors," (Jaymalya Palit, no date).
For this design, Web 2.0 will be implemented as the user interface of the Web service. The banking interface will be customized according to user demands and feedback, with products and services displayed accordingly. The user access screen will be password-protected and will contain several sections that provide account information for the specific user, such as the different types of banking accounts, bill pay, transfers, banking statements, messages, and information on additional banking products. The information provided will have to be up to the minute, allowing customers to see exactly what their financial status is at any given time. To fully enrich the customer's experience, this may include integration with third-party services such as financial news, stocks, and weather forecasts, with information displayed in a multi-service window. The purpose of the multi-service window is to allow the user to open several service windows at one time without encountering an Internet "traffic jam," thus improving the customer experience. "This kind of development model enables banks' IT engineers to pay more attention to individual service development, respond quickly to financial innovation demand from business staff, and improve the service constantly," (Chen, Hong & Yu, 2009).
A "Channel Handler" on the server side supports communication with the browser through the XML or JSON data formats. The server application must also manage the components of the Web 2.0 graphical user interface, or GUI. In addition, the Web 2.0 framework is responsible for loading the required resources and managing the data models, as well as presenting and organizing the GUI (Chen, Hong & Yu, 2009).
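To make the JSON exchange concrete, the following sketch shows what a server-side handler of this kind might do with one browser request. It is not taken from Chen, Hong & Yu: the function name, the request fields `action` and `user`, and the sample account data are all invented for illustration, and a real service would add authentication and transport handling.

```python
import json

# Made-up sample data standing in for a real account store.
ACCOUNTS = {"alice": {"checking": 1250.75, "savings": 8000.0}}


def handle_channel_request(raw_request: str) -> str:
    """Parse a JSON request from the browser and return a JSON reply."""
    try:
        request = json.loads(raw_request)
    except json.JSONDecodeError:
        return json.dumps({"status": "error", "reason": "malformed JSON"})

    if not isinstance(request, dict):
        return json.dumps({"status": "error", "reason": "unknown request"})

    user = request.get("user")
    if request.get("action") != "balances" or user not in ACCOUNTS:
        return json.dumps({"status": "error", "reason": "unknown request"})

    return json.dumps({"status": "ok", "balances": ACCOUNTS[user]})


ok_reply = json.loads(handle_channel_request('{"action": "balances", "user": "alice"}'))
bad_reply = json.loads(handle_channel_request("not json"))
print(ok_reply)
print(bad_reply)
```

Returning a structured error instead of raising keeps a malformed request from one browser window from affecting the others, which fits the availability goals discussed earlier in this essay.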