------------------------------------------------------------------------------
Copyright 1998 IEEE. Published in the Proceedings of the ISW'98, 28-30 October 1998 in Orlando, Florida, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 732-562-3966
------------------------------------------------------------------------------

On Survivable Multi-Networks for Information Systems Survivability##

A Position Paper for the 1998 Information Survivability Workshop (ISW'98)

David Tipper*   Deepankar Medhi**   William Yurcik*++   Robert Cotter**

*Department of Info. Science and Telecommunications
University of Pittsburgh
Pittsburgh, PA 15260

**Department of Computer Networking
University of Missouri-Kansas City
Kansas City, MO 64110

Abstract:

A major attack can significantly reduce the capability to deliver services in large-scale information systems. We address the survivability of large-scale heterogeneous information systems, which consist of various services provided over multiple interconnected networks with different topologies and multi-vendor equipment with both wireline and wireless infrastructure -- the communications network part of such systems is referred to as multi-networks. We are developing a comprehensive set of solutions for the network design and management aspects of providing adequate service continuity in the event of a major attack on multi-networks. The end goal is to support critical services in the event of a major attack by making optimal use of network resources while minimizing network congestion.
We expect many of our results will extend from conventional attacks, which physically destroy links and nodes, to virtual attacks, which destroy or corrupt network control information and databases.

Introduction

Due to the rapidly growing demand for information transfer such as voice, data, and video across communication networks, the need for reliable communication service has become increasingly important. The potentially drastic effects of communication network failures have been demonstrated by several highly publicized network failures, showing the need for survivable networks that provide service that is robust to failures.

The final report of the President's Commission on Critical Infrastructure Protection (PCCIP) alludes to serious vulnerabilities and threats in eight critical national infrastructures. Perhaps equally important, if not more so, is that all of these critical infrastructures are closely interdependent; a failure in one sector can easily affect other sectors. Furthermore, all of the national infrastructures depend on the underlying telecommunications (computer-communication) infrastructure such as computing resources, databases, private networks, and the Internet.{Peter Neumann}

The impact of previous outages due to reliability failures raises the question of how fragile the U.S. telecommunications infrastructure would be under an intelligent malicious attack. Speaking publicly about infrastructure threats for the first time, the Director of the CIA, George Tenet, testified before Congress that several foreign governments have "information warfare" programs targeting the USA.{testimony before the Senate Committee on Governmental Affairs 6/24/98}

Scope of Research

For brevity, the communication network portion of large-scale heterogeneous information systems is referred to here as multi-networks. We focus on the development of multi-network design models/algorithms that provide a specified quality of service (QoS) under single and multiple attack/failure conditions.
We address the problem of intelligently designing and evolving a network architecture to improve survivability, starting from existing architectures and legacy networks. Given the multi-networks environment already deployed, we are developing network management algorithms (e.g., provisioning of backup routes, virtual circuit rerouting algorithms, etc.) which make optimum use of network resources after an attack/failure (of both large and small types) in support of critical services.

We concentrate on the design and analysis of multiple-priority traffic restoration techniques to provide service continuity while minimizing network congestion. We are developing a multi-layer restoration approach involving a coordinated strategy between different layers (transmission, traffic, and application layers). Since fault recovery is possible at various layers, one aspect of our work is determining what combination of traffic restoration techniques should be used at each layer and how this is related to the network topological design. The restoration algorithms will be suitable for automatic invocation by network components, resulting in a self-configuring system that adapts to the changing fault environment.

We specifically address the issue of survivable multicasting services since these emerging services (audio/video conferencing, sensor data distribution, etc.) will be critical under an attack. The use of multicasting to reduce redundant traffic flows has the potential for an orders-of-magnitude decrease in traffic congestion, but multicasting also introduces vulnerabilities that are not present in unicast transmissions, such as the involvement of potentially more links and nodes, control complexity due to group dynamics, and all-or-nothing restoration requirements.

Emphasis is given to studying the transient network congestion that occurs after a failure and incorporating its effect into the design of the network and the traffic restoration algorithms.
A major factor in network performance after a failure is the transient congestion period that results from restored circuits attempting to send out a backlog of traffic accumulated for retransmission during the failure. Thus not only must a critical network user be provided service continuity, but service quality must also be maintained to the highest degree possible.

Selected Research Results

We have made progress in several directions on understanding the network dynamics that must be addressed after a major failure in a multi-networks environment. Selected research results are described below.

Our work on network design for survivable multi-networks has thus far focused on the development of procedures for deploying survivable virtual networks on top of existing physical infrastructure. ATM and circuit-switched network architectures currently allow the establishment of virtual network overlays on a physical network; for example, the provisioning of Virtual Paths (VPs) in an ATM network. One technique that can be adopted to provide survivability in virtual networks is to provision both a working path and a link/node-disjoint backup path to which traffic can be switched in the event of a failure in the working path. We have developed a generic integer optimization model formulation for the layout of ATM working and backup VPs which results in the minimum bandwidth requirements. This model allows for the incorporation of priorities (i.e., whether or not a VP is provisioned a backup) and the specification of how many links and nodes may be shared between the working and backup paths if disjoint paths are not possible. A second optimization model was developed for the network dimensioning problem, in which network reconfigurability is taken into consideration along with the QoS requirement acceptable under a failure.
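
As an illustration of the working/backup provisioning idea described above, the following is a minimal sketch in Python. It is not the integer optimization model mentioned in the text: it uses a simple two-step heuristic (shortest working path first, then a shortest path that avoids the working path's links and intermediate nodes), which does not minimize total bandwidth and can miss a disjoint pair that a joint method such as Suurballe's algorithm would find. The topology, link costs, and the function names `shortest_path` and `working_and_backup` are hypothetical and chosen only for this example.

```python
import heapq


def shortest_path(graph, src, dst, banned_nodes=frozenset(), banned_links=frozenset()):
    """Dijkstra on an undirected graph given as {node: {neighbor: cost}}.

    Nodes in banned_nodes and links in banned_links (frozensets of the two
    endpoints) are skipped. Returns a node list, or None if dst is unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, cost in graph[u].items():
            if v in banned_nodes or frozenset((u, v)) in banned_links:
                continue
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    # Walk the predecessor chain back from dst to src.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def working_and_backup(graph, src, dst):
    """Two-step heuristic: pick a working path, then a link/node-disjoint backup."""
    working = shortest_path(graph, src, dst)
    if working is None:
        return None, None
    # The backup may share only the endpoints with the working path.
    banned_nodes = frozenset(working[1:-1])
    banned_links = frozenset(frozenset(e) for e in zip(working, working[1:]))
    backup = shortest_path(graph, src, dst, banned_nodes, banned_links)
    return working, backup


# Hypothetical four-node topology; costs stand in for bandwidth or hop cost.
TOPOLOGY = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "D": 1},
    "C": {"A": 2, "D": 2},
    "D": {"B": 1, "C": 2},
}

working, backup = working_and_backup(TOPOLOGY, "A", "D")
print("working:", working)
print("backup: ", backup)
```

On this four-node example the sketch selects A-B-D as the working path and A-C-D as the backup. When no fully disjoint backup exists the function returns None for the backup, which is the situation where a shared-link/shared-node allowance of the kind described in the text would be needed.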

In a parallel effort, we have developed a new technique for the layout of survivable multipoint (also known as multicast) group communications in connection-oriented (i.e., ATM) networks. The technique is termed the Self-Healing Virtual Ring (SHVR) multicast and consists of two counter-rotating rings made up of Virtual Circuits (VCs). One ring is normally used for communication, with the second ring serving as a hot standby to which traffic can be rerouted in the event of an attack. A performance analysis comparing the SHVR approach with a similar hot-standby approach using shared multicast trees or VC Mesh groups shows that the SHVR approach requires less bandwidth and simpler signaling to provide survivability. We have also developed a network dimensioning model for providing multicasting services, based on a k-shortest tree concept. We have implemented this model to determine survivable network designs for networks of up to 100 nodes.

The restoration time is critical in determining whether a user is provided service continuity. A major part of restoration time is the detection and notification time. We have initiated a measurement-based benchmarking study of alarm detection and notification times in an ATM testbed laboratory. This effort focuses on quantifying the delay incurred when lower layers of the protocol stack notify higher layers of a failure.

Selected Publications:

K. Balakrishnan, D. Tipper, and D. Medhi, "Routing Strategies for Fault Recovery in Wide Area Networks," Proceedings of IEEE Military Communications Conference (Milcom '95), San Diego, CA, November 1995.

K. Balakrishnan, D. Tipper, and J. Hammond, "An Analysis of the Timing of Traffic Restoration in Wide Area Communication Networks," Proceedings of 14th International Teletraffic Congress, Antibes, France, June 1994.

K. Balakrishnan, S. Menon, and D. Tipper, "A Study of Issues Relating to Traffic Restoration in Wide Area Communication Networks," Proceedings of IEEE Southeastcon 94, Miami, FL, April 1994.

R. Cotter, D. Medhi, and D. Tipper, "Traffic Backlog and Impact on Network Dimensioning for Survivability for Wide-Area VP-based ATM Networks," Proceedings of 15th International Teletraffic Congress, Washington, DC, June 1997.

T. Dahlberg, S. Ramaswamy, and D. Tipper, "Survivability Issues in Wireless Mobile Networks," Proceedings of First International Workshop on Mobile and Wireless Communication Networks, Paris, France, May 1997.

B. Jager and D. Tipper, "On Fault Recovery Priority in ATM Networks," Proceedings of IEEE ICC '98, Atlanta, GA, June 1998.

D. Medhi, "A Unified Approach to Network Survivability for Teletraffic Networks: Models, Algorithms and Analysis," IEEE Trans. on Communications, Vol. 42, pp. 534-548, 1994.

D. Medhi and R. Khurana, "Optimization and Performance of Network Restoration Schemes for Wide-Area Teletraffic Networks," Journal of Network and Systems Management, Vol. 3, No. 3, pp. 265-294, September 1995.

D. Medhi and D. Tipper, "Towards Fault Recovery and Management in Communication Networks," Guest Editorial, Journal of Network and Systems Management, Vol. 5, No. 2, June 1997.

A. Pitsillides, S. Nikolopoulos, and D. Tipper, "Addressing Network Survivability Issues by Finding the K-best Paths Through a Trellis Graph," Proceedings of IEEE INFOCOM '97, Kobe, Japan, April 1997.

S. A. Shah and D. Medhi, "Performance under a Failure of Wide-Area Datagram Networks with Unicast and Multicast Traffic Routing," Proc. of IEEE Military Communications Conference (MILCOM'98), Bradford, Mass, October 1998.

D. Tipper, J. Hammond, S. Sharma, A. Khetan, K. Balakrishnan, and S. Menon, "An Analysis of the Congestion Effects of Link Failures in Wide Area Networks," IEEE Journal on Selected Areas in Communications, Vol. 12, pp. 179-192, Jan 1994.

W.-P. Wang, D. Tipper, B. Jaeger, and D. Medhi, "Fault Recovery Routing in Wide Area Packet Networks," Proceedings of 15th International Teletraffic Congress, Washington, DC, June 1997.

-------

## work supported in part by Defense Advanced Research Projects Agency grant F30602-97-1-0257 and National Science Foundation grant NCR-95-06652.

++ author for correspondence, additional contact information: Email yurcik@tele.pitt.edu, telephone (412) 624-9411, FAX (412) 624-2788; supported in part by NASA Earth Systems Science grant # NGT-30019, Defense Advanced Research Projects Agency grant #F30602-97-1-0257, and SAE International - The Engineering Society For Advancing Mobility Land Sea Air and Space