MIT Department of Electrical Engineering & Computer Science

Hive: Fault Containment For Shared-Memory Multiprocessors

John Chapin
Stanford University

Monday, March 11, 1996
3:00 PM (2:45 refreshments)
Room NE43-518
EECS Special Seminar

Abstract

Hive is a scalable shared-memory operating system capable of surviving common failures caused by hardware faults and system software bugs. Previous shared-memory operating systems such as Unix, Mach, and Windows NT must reboot when a serious failure occurs; Hive instead limits the effects of a fault to the applications that were using the failed component. With Hive, large shared-memory multiprocessors can be built without the reliability problems previously associated with such complex systems.

I present the architecture of Hive and the key implementation features that provide fault containment without adding significant performance overhead. These include an internal distributed system of multiple kernels, novel virtual memory system mechanisms, an extremely low-latency interprocessor RPC implementation, and hardware mechanisms for memory protection and recovery. The hardware mechanisms are implemented in the Stanford FLASH multiprocessor, which is not yet available, so I present the results of performance and fault injection experiments using the SimOS hardware simulation environment.

HOST: Prof. Frans Kaashoek


URL of this page: http://www-eecs.mit.edu/AY95-96/events/30.html
Created: Mar 7, 1996  | Modified: Jun 25, 1997
This announcement is from the MIT EECS 1995-96 archive.