IEEE P1003.0 Draft 13 - September 1991 Copyright (c) 1991 by the Institute of Electrical and Electronics Engineers, Inc. 345 East 47th Street New York, NY 10017, USA All rights reserved as an unpublished work. This is an unapproved and unpublished IEEE Standards Draft, subject to change. The publication, distribution, or copying of this draft, as well as all derivative works based on this draft, is expressly prohibited except as set forth below. Permission is hereby granted for IEEE Standards Committee participants to reproduce this document for purposes of IEEE standardization activities only, and subject to the restrictions contained herein. Permission is hereby also granted for member bodies and technical committees of ISO and IEC to reproduce this document for purposes of developing a national position, subject to the restrictions contained herein. Permission is hereby also granted to the preceding entities to make limited copies of this document in an electronic form only for the stated activities. The following restrictions apply to reproducing or transmitting the document in any form: 1) all copies or portions thereof must identify the document's IEEE project number and draft number, and must be accompanied by this entire notice in a prominent location; 2) no portion of this document may be redistributed in any modified or abridged form without the prior approval of the IEEE Standards Department. Other entities seeking permission to reproduce this document, or any portion thereof, for standardization or other activities, must contact the IEEE Standards Department for the appropriate license. Use of information contained in this unapproved draft is at your own risk. IEEE Standards Department Copyright and Permissions 445 Hoes Lane, P.O. 
Box 1331 Piscataway, NJ 08855-1331, USA +1 (908) 562-3800 +1 (908) 562-1571 [FAX]

ENVIRONMENT INTERIM DOCUMENT P1003.0/D13

5.4 Fault Management

Responsibility: Rich Bergman

The trade-offs in this subclause involve:

- Testability and verifiability vs. simplicity and time
- Confidence and reliability vs. cost vs. risk
- Downtime vs. availability
- Maintainability vs. higher system cost

These services allow the system to react to the loss or incorrect operation of system components at various levels (hardware, logical, services, etc.).

The classical model of fault tolerance has three steps: fault detection, fault isolation, and fault recovery. Implementations typically subdivide these steps further or combine them into one or two steps. Additionally, fault diagnosis services support the other steps in the treatment of a fault.

Various fault tolerance strategies, such as checkpointing and voting, are implemented as a collection of services comprising one or more of the steps in the classical fault tolerance model. For example, the services involved in implementing a three-node voting scheme include a vote comparator service (fault detection), a vote analyzer service (fault isolation/fault diagnosis), a service to pass the majority "answer" through (fault recovery), and a service to disable the faulty resource and reconfigure the voters (fault recovery/reconfiguration).

Fault Detection

Fault detection services are concerned with determining when a fault has occurred in the system. They may be active or passive. Active services attempt to determine the status of system components by testing those components. Passive services, on the other hand, try to ascertain the status of system components by gathering information and watching the behavior of the system.
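The three-node voting scheme mentioned earlier in this subclause can be sketched in a few lines. This is only an illustration under stated assumptions, not an interface defined by this guide: the function names `compare_votes`, `analyze_votes`, and `vote_round`, the node names, and the use of a strict-majority rule are all hypothetical choices made for the example.

```python
from collections import Counter


def compare_votes(votes):
    """Vote comparator (fault detection): True when the voters disagree."""
    return len(set(votes.values())) > 1


def analyze_votes(votes):
    """Vote analyzer (fault isolation/diagnosis).

    Returns (majority_answer, faulty_nodes). Raises ValueError when no
    strict majority exists, because voting alone cannot then isolate
    the faulty voter.
    """
    answer, count = Counter(votes.values()).most_common(1)[0]
    if 2 * count <= len(votes):
        raise ValueError("no majority; fault cannot be isolated by voting")
    faulty = sorted(node for node, value in votes.items() if value != answer)
    return answer, faulty


def vote_round(active, results):
    """Run one voting round over the currently enabled voters.

    `active` is a non-empty set of node names (hypothetical), and
    `results` maps each node name to the answer it produced.
    Returns (answer, new_active): the majority answer is passed through
    (fault recovery) and dissenting nodes are dropped from the voter
    set (reconfiguration).
    """
    votes = {node: results[node] for node in active}
    if not compare_votes(votes):
        # All voters agree: nothing to isolate or reconfigure.
        return next(iter(votes.values())), set(active)
    answer, faulty = analyze_votes(votes)
    return answer, set(active) - set(faulty)
```

For instance, calling `vote_round({"n1", "n2", "n3"}, {"n1": 42, "n2": 42, "n3": 41})` passes the answer 42 through and returns a voter set containing only `n1` and `n2`. Once only two voters remain, a disagreement can still be detected but no longer isolated, which is why `analyze_votes` raises an error in that case.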

Fault Isolation

Fault isolation services attempt to determine the component at fault and segregate the faulty component from the rest of the system. A service may belong to both the fault detection and the fault isolation service libraries if it performs both functions.

Fault Recovery

Fault recovery services attempt to bring the system into a consistent state. Depending on the recovery scheme used, these services may be closely interrelated with the scheduling services, network services, and data base services. Redundancy of resources is often needed to support fault recovery. Such resources may include data, processes, processors, disk drives, etc.

As parts of the system fail, it may no longer be possible to satisfy all the requirements of the application. Services to support graceful degradation may be used to ensure that critical activities do not fail.

Fault Diagnosis

These services deal with the system's ability to analyze the attributes of a system fault and determine its cause. They tend to be closely interrelated with fault detection and fault isolation services.

Fault Avoidance

These services involve avoiding faults before a failure of a system component occurs. If a system can detect that the operation of a component is approaching the edge of its operational range, a standby or backup component could be phased in to replace it. Another form of fault avoidance is the logging of shocks, temperature extremes, etc., so that it can be predicted that a component will not meet its original expected service life.

Software Safety

These services involve the system's ability to keep application software from causing harm to the system's software, hardware, or user.
For instance, a process may attempt to write into another process's memory space without permission. A good example of a reliability method that may provide software safety is a bounds checker. The checker compares a supplied answer against the expected bounds. If the answer is not within the bounds, the bounds checker does not allow it to propagate, since it could damage the system's integrity. Additionally, the checker may send a fault message (or security violation information, depending on the type of answers expected) to the proper service. To enhance software safety, other services and processes should be given only the resources necessary to complete their job.

Status of System Components

These services involve the obtrusive and nonobtrusive diagnosis of the state of system components. For further explanation of these services, see Fault Detection and Fault Diagnosis services. These services may additionally need to record and/or display performance, configuration, and general system information.

Reconfiguration

These services allow the system to reconfigure its view of the world. They allow the system to substitute different resources to perform system functions, such as substituting a new physical I/O channel to support a logical channel. These services are part of the API, but their use may be restricted to specially authorized programs such as those used by the target system operator.

Maintainability

Maintainability services provide support for the maintenance of a system. A major component of that support is the collection and logging of information about the operation of the system.
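As a rough illustration of such collection and logging, the sketch below uses Python's standard `logging` module to write one timestamped record per event. The logger name `maintenance`, the `component` field, and the helper names `make_maintenance_logger` and `log_event` are assumptions made for this example; the guide does not define a logging interface.

```python
import io
import logging


def make_maintenance_logger(stream):
    """Return a logger that writes one timestamped line per event to `stream`."""
    logger = logging.getLogger("maintenance")  # hypothetical logger name
    logger.setLevel(logging.INFO)              # drop anything below INFO
    logger.handlers.clear()                    # avoid duplicate handlers on reuse
    logger.propagate = False                   # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(component)s: %(message)s"))
    logger.addHandler(handler)
    return logger


def log_event(logger, level, component, message):
    """Record one event; `component` (hypothetical field) names its source."""
    logger.log(level, message, extra={"component": component})


if __name__ == "__main__":
    buffer = io.StringIO()
    log = make_maintenance_logger(buffer)
    log_event(log, logging.ERROR, "disk0", "read error during operation")
    log_event(log, logging.WARNING, "sched", "task nearly missed its deadline")
    print(buffer.getvalue(), end="")
```

Writing to a caller-supplied stream keeps the sketch independent of where records are ultimately stored; a real system would more likely direct them to persistent storage so they survive a failure.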
Typical information to be logged is:

- Software and hardware errors during operation
- Processes that failed or almost failed to meet scheduled deadlines
- Performance metrics for system tuning
- Times when the system operated in extreme environmental conditions
- Errors reported during startup self-testing
- Attempts to violate rules of the system's security policy

Section 6: Profiles

Responsibility: Fritz Schulz

This section is intended for those who want to know more about what profiles are and for those who are developing their own profiles. The latter group consists of those developing formal "Standardized Profiles" and those developing less formal profiles for their industry group (e.g., a banking trade association) or for their own company or enterprise, for procurement or strategic planning purposes.

Those not involved in the development of profiles should read 6.2. Parts of 6.3 also may be useful, especially the earlier subclauses that give definitions of terms and explain concepts more precisely.

Developers of profiles that are not formal POSIX Standardized Profiles (POSIX SPs) should read all of Section 6.

Developers of profiles that are formal POSIX SPs should read all of Section 6 and Annex A.

6.1 Scope

The information about profiles presented here is limited to what is needed to understand profile concepts as they apply to the POSIX Open System Environment. It covers profiles constructed from the standards (and profiles) listed within this guide, which by design are consistent with POSIX.1. The goal is to create a common approach and a common documentation scope and style for POSIX-oriented profiles. Annex A goes further by giving specific guidance to developers of formal POSIX SPs.