MIT Department of Electrical Engineering & Computer Science
Hive: Fault Containment for Shared-Memory Multiprocessors
John Chapin
Stanford University
Monday, March 11, 1996
3:00 PM (2:45 refreshments)
Room NE43-518
EECS Special Seminar
Abstract
Hive is a scalable shared-memory operating system capable of surviving
common failures caused by hardware faults and system software
bugs. Unlike previous shared-memory operating systems such as Unix,
Mach, and Windows NT, which must reboot when a serious failure occurs,
Hive limits the effects of a fault to applications that were using the
failed component. With Hive, large shared-memory multiprocessors can
be built without the reliability problems previously associated with
such complex systems.
I present the architecture of Hive and the key implementation features
that provide fault containment without adding significant performance
overhead. These include an internal distributed system of multiple
kernels, novel virtual memory system mechanisms, an extremely
low-latency interprocessor RPC implementation, and hardware mechanisms
for memory protection and recovery. The hardware mechanisms are
implemented in the Stanford FLASH multiprocessor, which is not yet
available, so I present the results of performance and fault-injection
experiments using the SimOS hardware simulation environment.
HOST: Prof. Frans Kaashoek
URL of this page: http://www-eecs.mit.edu/AY95-96/events/30.html
Created: Mar 7, 1996 | Modified: Jun 25, 1997
This announcement is from the MIT EECS 1995-96 archive.