
Introduction
Future microprocessors will contain billions of transistors, many of which will be dead on arrival. Those that survive will be subject to numerous wearout mechanisms, such as time-dependent dielectric breakdown (TDDB), hot carrier injection (HCI), electromigration (EM), and negative bias temperature instability (NBTI), which cause performance degradation and eventual device failure. The goal of this reliability research is to develop resilient architectures robust enough to operate in such hostile environments.
Wearout Detection and Failure Prediction
To mitigate reliability concerns, architects and circuit designers typically employ either error detection or failure prediction mechanisms. Error detection is used to diagnose failed or failing components by identifying (potentially transient) pieces of incorrect state within the system. Once an error is detected, the problem is diagnosed and corrective action may be taken. The second technique, failure prediction, supplies the system with a failure forecast, allowing it to take preventative measures to avoid, or at least mitigate, the effects of device failures.
Historically, high-end server systems have relied on error detection to provide a high degree of system reliability. Error detection is typically implemented through coarse-grained replication. This replication can be conducted either in space, through the use of replicated hardware, or in time, by way of redundant computation. Redundant hardware is costly in terms of both power and area, and it does not significantly increase the lifetime of the processor unless additional cold-spare devices are provided, which further increases the cost of such techniques. Redundancy in time is potentially less expensive but may only provide transient error detection unless redundant hardware is readily available.
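The difference between the two forms of replication can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the adder units, the stuck-at fault model, and the function names are assumptions, not part of any system described here. It shows why re-executing on the same hardware catches a transient fault but misses a permanent one, whereas comparing against a second unit catches both.

```python
from typing import Callable, Iterable, Sequence


def detect_in_space(units: Sequence[Callable[[int], int]], x: int) -> bool:
    """Replication in space: run the same input on several hardware copies.

    Returns True when the copies disagree, i.e. an error is detected.
    """
    results = [unit(x) for unit in units]
    return len(set(results)) > 1


def detect_in_time(unit: Callable[[int], int], x: int, runs: int = 2) -> bool:
    """Replication in time: re-execute the same input on a single unit.

    Returns True when the repeated executions disagree.
    """
    results = [unit(x) for _ in range(runs)]
    return len(set(results)) > 1


def healthy_adder(x: int) -> int:
    """Fault-free reference unit (hypothetical)."""
    return x + 1


def stuck_adder(x: int) -> int:
    """Hypothetical permanent fault: output bit 0 is stuck at 1."""
    return (x + 1) | 1


class TransientAdder:
    """Hypothetical unit that flips output bit 0 on selected calls only."""

    def __init__(self, faulty_calls: Iterable[int]) -> None:
        self.calls = 0
        self.faulty_calls = set(faulty_calls)

    def __call__(self, x: int) -> int:
        self.calls += 1
        result = x + 1
        if self.calls in self.faulty_calls:
            result ^= 1  # single-bit upset on this call only
        return result


# A permanent fault produces the same wrong answer on every re-execution,
# so temporal redundancy alone does not expose it.
permanent_seen_in_time = detect_in_time(stuck_adder, 1)
permanent_seen_in_space = detect_in_space([healthy_adder, stuck_adder], 1)
transient_seen_in_time = detect_in_time(TransientAdder({1}), 1)
```

In this toy model the permanent fault is visible only to the spatial check, which mirrors the trade-off described above: temporal redundancy is cheaper but needs spare hardware to go beyond transient errors.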
Failure prediction techniques are typically less costly to implement; however, they may suffer from inaccuracy. In our reliability research, we propose leveraging a variety of symptoms that aging circuits exhibit as they wear out to predict the failure of structures within a microprocessor core. One such technique uses online circuit timing information to predict failure. Wearout mechanisms such as electromigration, time-dependent dielectric breakdown, hot carrier injection, and negative bias temperature instability all reveal their progression by affecting circuit timing characteristics. By conducting statistical analysis of circuit-level timing characteristics, we are able to accurately detect the onset of wearout within microprocessor structures.
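As a rough illustration of timing-based prediction, the sketch below watches a stream of measured path delays and raises an alarm when the recent trend drifts above the long-term baseline. The statistic used here (a fast and a slow exponential moving average compared against a fixed margin) is an assumption chosen for simplicity, not necessarily the analysis used in this research, and the parameter values and the picosecond unit are likewise hypothetical.

```python
from typing import Iterable, Optional


def ema_update(prev: float, sample: float, alpha: float) -> float:
    """One step of an exponential moving average with smoothing factor alpha."""
    return alpha * sample + (1.0 - alpha) * prev


def first_wearout_alarm(
    delays_ps: Iterable[float],
    fast_alpha: float = 0.3,    # assumed: tracks recent samples closely
    slow_alpha: float = 0.02,   # assumed: approximates the long-term baseline
    margin: float = 0.03,       # assumed: alarm at 3% drift above baseline
) -> Optional[int]:
    """Return the index of the first sample at which drift is flagged.

    Returns None if the delay stream never drifts beyond the margin.
    """
    fast: Optional[float] = None
    slow: Optional[float] = None
    for index, delay in enumerate(delays_ps):
        if fast is None or slow is None:
            # Seed both averages with the first observation.
            fast = slow = delay
            continue
        fast = ema_update(fast, delay, fast_alpha)
        slow = ema_update(slow, delay, slow_alpha)
        # Sustained slowdown pulls the fast average above the slow one.
        if fast > slow * (1.0 + margin):
            return index
    return None


# Synthetic data: 200 stable samples, then delay creeping up by 0.5 ps each.
stable = [100.0] * 200
aging = stable + [100.0 + 0.5 * k for k in range(1, 101)]
alarm_index = first_wearout_alarm(aging)
```

On the synthetic stream, no alarm is raised during the stable phase, and one is raised once the delay begins to creep upward. A real detector would also have to separate wearout from temperature and voltage fluctuations, which this sketch ignores.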
Architectural Exploration
StageNet
The goal of this work is to design a robust and scalable multithreaded system. To ensure reliable operation, the system is designed to be both introspective and reconfigurable. Introspection enables the system to accurately detect and diagnose transient faults, which may be caused by particle strikes, excessive temperature densities, or electrical noise. Component failures due to wearout phenomena, such as electromigration and dielectric breakdown, can also be anticipated by an effective introspection scheme. Fine-grained reconfigurability maximizes the lifetime of the system and allows it to adapt to changing conditions on the chip. The adaptive capabilities of the system allow it to maintain service in the face of failing components. Further, the system is designed with a large number of redundant structures, which gives it the flexibility to support multiple threads when appropriate to maximize performance or, as structures fail, to reduce throughput and degrade gracefully over time.

Processor cores within the proposed system are designed as part of a network-on-chip, where each stage in a coarse-grained processor pipeline corresponds to a node in the network. Pipeline stages are then replicated and grouped together to create multiple logical processors. The interconnect between these pipeline stages is designed to be flexible so that the system can react to local phenomena. Multiple nodes may be allocated temporarily to a single thread to exploit instruction-level parallelism, while at other times the system may evenly distribute resources among all of the logical processors in order to maximize throughput. Similarly, the system can temporarily stop allocating nodes that are excessively hot, as well as retire nodes that are deemed defective. As nodes wear out and eventually fail, throughput gradually decreases and performance gracefully degrades.
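The benefit of grouping replicated stages through a flexible interconnect can be sketched with a small model. The stage names, the node numbering, and the comparison against a fixed-core baseline are all assumptions made for illustration, and the model ignores interconnect latency and temporary (thermal) deallocation. It only counts how many complete logical pipelines can still be formed after some nodes are retired.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical coarse-grained pipeline stages; each (stage, index) is a node.
STAGES: Tuple[str, ...] = ("fetch", "decode", "issue", "execute")

Node = Tuple[str, int]


def form_pipelines(healthy: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Group one healthy node of every stage type into each logical pipeline.

    The number of pipelines is limited by the scarcest stage type.
    """
    count = min(len(healthy[stage]) for stage in STAGES)
    return [{stage: healthy[stage][i] for stage in STAGES} for i in range(count)]


def flexible_pipeline_count(num_cores: int, failed: Set[Node]) -> int:
    """Logical pipelines available when any healthy nodes may be grouped."""
    healthy = {
        stage: [n for n in range(num_cores) if (stage, n) not in failed]
        for stage in STAGES
    }
    return len(form_pipelines(healthy))


def fixed_core_count(num_cores: int, failed: Set[Node]) -> int:
    """Baseline without sharing: one failed stage disables its whole core."""
    dead_cores = {core for _, core in failed}
    return num_cores - len(dead_cores)


# Three defective nodes, each in a different stage of a different core.
failed_nodes: Set[Node] = {("fetch", 0), ("decode", 1), ("execute", 2)}
fixed = fixed_core_count(4, failed_nodes)
flexible = flexible_pipeline_count(4, failed_nodes)
```

With four replicated pipelines and three retired nodes spread across different stages, the fixed-core baseline is left with one working core, while the flexible grouping still forms three logical pipelines. This is the sense in which throughput decreases gradually rather than one whole core at a time.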
Relevant Publications
- The StageNet Fabric for Constructing Resilient Multicore Systems
(paper: pdf)
Shantanu Gupta, Shuguang Feng, Amin Ansari, Jason Blome, and Scott Mahlke
Proc. 41st Intl. Symposium on Microarchitecture (MICRO)
Nov. 2008.
- A Reconfigurable Microarchitecture Building Block for Resilient CMP Systems
(paper: pdf)
Shantanu Gupta, Shuguang Feng, Amin Ansari, Jason Blome, and Scott Mahlke
Proc. 2008 Intl. Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES)
Oct. 2008.
- Reliable Systems on Unreliable Fabrics
(paper: pdf)
Todd Austin, Valeria Bertacco, Scott Mahlke, and Yu Cao
IEEE Design and Test of Computers
Vol. 25, No. 4, Jul. 2008, pp. 322-332.
- Olay: Combat the Signs of Aging with Introspective Reliability Management
(paper: pdf; slides: ppt)
Shuguang Feng, Shantanu Gupta, and Scott Mahlke
The Workshop on Quality-Aware Design (W-QUAD)
Jun. 2008.
- StageNet: A Reconfigurable CMP Fabric for Resilient Systems
(paper: pdf; slides: ppt)
Shantanu Gupta, Shuguang Feng, Jason Blome, and Scott Mahlke.
2nd Reconfigurable and Adaptive Architecture Workshop (RAAW)
Dec. 2007.
- Self-calibrating Online Wearout Detection
(paper: pdf; slides: ppt)
Jason Blome, Shuguang Feng, Shantanu Gupta, and Scott Mahlke.
Proc. 40th Intl. Symposium on Microarchitecture (MICRO)
Dec. 2007, pp. 109-120.
- Architecting a Reliable CMP Switch Architecture
(paper: pdf)
Kypros Constantinides, Stephen Plaza, Jason Blome, Valeria Bertacco, Scott Mahlke, Todd Austin, Bin Zhang, and Michael Orshansky
ACM Transactions on Architecture and Code Optimization
Vol. 4, No. 1, Mar. 2007, pp. 1-37.
- Online Timing Analysis for Wearout Detection
(paper: pdf; slides: ppt)
Jason Blome, Shuguang Feng, Shantanu Gupta, and Scott Mahlke.
2nd Workshop on Architectural Reliability (WAR)
Dec. 2006.
- Cost-Efficient Soft Error Protection for Embedded Microprocessors
(paper: pdf; slides: ppt)
Jason A. Blome, Shantanu Gupta, Shuguang Feng, Scott Mahlke, and Daryl Bradley.
Proc. 2006 Intl. Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES)
Oct. 2006, pp. 421-431.
- BulletProof: A Defect-Tolerant CMP Switch Architecture
(paper: pdf; slides: ppt)
Kypros Constantinides, Stephen Plaza, Jason Blome, Bin Zhang, Valeria Bertacco, Scott Mahlke, Todd Austin, and Michael Orshansky
Proc. 12th Intl. Symposium on High-Performance Computer Architecture (HPCA)
Feb. 2006, pp. 3-14.
- A Microarchitectural Analysis of Soft Error Propagation in a Production-level Embedded Microprocessor
(paper: pdf; slides: ppt)
Jason Blome, Scott Mahlke, Daryl Bradley, and Krisztian Flautner.
1st Workshop on Architectural Reliability (WAR)
Nov. 2005.
- Assessing SEU Vulnerability via Circuit-level Timing Analysis
(paper: pdf; slides: ppt)
Kypros Constantinides, Stephen Plaza, Jason Blome, Bin Zhang, Valeria Bertacco, Scott Mahlke, Todd Austin, and Michael Orshansky.
1st Workshop on Architectural Reliability (WAR)
Nov. 2005.
Page last modified February 21, 2007.