Reliability

Introduction

Future microprocessors will contain billions of transistors, many of which will be dead on arrival. Those that survive will be subject to numerous wearout mechanisms, such as time-dependent dielectric breakdown (TDDB), hot carrier injection (HCI), electromigration (EM), and negative bias temperature instability (NBTI), which result in performance degradation and eventual device failure. The goal of this reliability research is to develop resilient architectures robust enough to operate in such hostile environments.

Wearout Detection and Failure Prediction

To mitigate reliability concerns, architects and circuit designers typically employ either error detection or failure prediction mechanisms. Error detection is used to diagnose failed or failing components by identifying (potentially transient) pieces of incorrect state within the system. Once an error is detected, the problem is diagnosed and corrective action may be taken. The second technique, failure prediction, supplies the system with a failure forecast, allowing it to take preventative measures to avoid, or at least mitigate, the effects of device failures.

Historically, high-end server systems have relied on error detection to provide a high degree of system reliability. Error detection is typically implemented through coarse-grained replication. This replication can be conducted either in space, through the use of replicated hardware, or in time, by way of redundant computation. Redundant hardware is costly in terms of both power and area, and it does not significantly increase the lifetime of the processor without additional cold-spare devices, which further increase the cost of such techniques. Redundancy in time is potentially less expensive but may provide only transient error detection unless redundant hardware is readily available.
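The two forms of replication described above can be illustrated with a minimal sketch. The function names (detect_in_time, detect_in_space) and the simple equality comparison are assumptions made for illustration, not a description of any particular server design. Note that the time-redundant check can only flag an error if the repeated runs disagree, which is why it tends to catch transient faults rather than permanent ones.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def detect_in_time(compute: Callable[[], T], runs: int = 2) -> Tuple[bool, List[T]]:
    """Redundancy in time: repeat one computation on the same unit and compare.

    Hypothetical helper. A permanent fault that corrupts every run the same
    way will NOT be flagged, since all results still agree.
    """
    results = [compute() for _ in range(runs)]
    error_detected = any(r != results[0] for r in results[1:])
    return error_detected, results


def detect_in_space(outputs: Sequence[T]) -> bool:
    """Redundancy in space: compare the outputs of replicated hardware units.

    Hypothetical helper. Returns True if any replica disagrees with the first.
    """
    return any(o != outputs[0] for o in outputs[1:])
```

For example, detect_in_space([42, 42]) reports no error, while detect_in_space([42, 43]) flags a mismatch between two replicas.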

Failure prediction techniques are typically less costly to implement; however, they may suffer from inaccuracy. In our reliability research, we propose leveraging a variety of wearout symptoms exhibited by aging circuits to predict the failure of structures within a microprocessor core. One such technique is to use online circuit timing information to predict failure. Wearout mechanisms such as electromigration, time-dependent dielectric breakdown, hot carrier injection, and negative bias temperature instability all exhibit signs of their progression by affecting circuit timing characteristics. By conducting statistical analysis of circuit-level timing characteristics, we are able to accurately detect the onset of wearout within microprocessor structures.
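As a rough illustration of timing-based prediction, the sketch below fits a straight line to sampled path delays and extrapolates to the point where the delay would exceed the clock period. The linear trend, the function names, and the sample format are all assumptions chosen for simplicity; the statistical analysis actually used in this research is not specified here and may differ.

```python
from statistics import mean
from typing import Optional, Sequence


def delay_slope(times: Sequence[float], delays: Sequence[float]) -> float:
    """Least-squares slope of measured path delay versus time (assumed model)."""
    t_bar, d_bar = mean(times), mean(delays)
    den = sum((t - t_bar) ** 2 for t in times)
    if den == 0:
        raise ValueError("need samples taken at two or more distinct times")
    num = sum((t - t_bar) * (d - d_bar) for t, d in zip(times, delays))
    return num / den


def predict_failure_time(
    times: Sequence[float], delays: Sequence[float], clock_period: float
) -> Optional[float]:
    """Extrapolate when the fitted delay reaches the clock period.

    Returns None when the delay is not increasing, i.e. no wearout trend
    is visible in the samples.
    """
    slope = delay_slope(times, delays)
    if slope <= 0:
        return None
    intercept = mean(delays) - slope * mean(times)
    return (clock_period - intercept) / slope
```

A system could call predict_failure_time periodically for each monitored structure and schedule preventative action once the forecast falls below some margin.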

Architectural Exploration

StageNet

The goal of this work is to design a robust and scalable multithreaded system. To ensure reliable operation, the system is designed to be both introspective and reconfigurable. Introspection enables the system to accurately detect and diagnose transient faults, which may be caused by particle strikes, excessive temperature densities, or electrical noise. Component failures due to wearout phenomena, such as electromigration and dielectric breakdown, can also be anticipated by an effective introspection scheme. Fine-grained reconfigurability maximizes the lifetime of the system and allows it to adapt to changing conditions on the chip. The adaptive capabilities of the system allow it to maintain service in the face of failing components. Further, the system is designed with a large number of redundant structures, which gives it the flexibility to support multiple threads when appropriate to maximize performance or, as structures fail, to reduce throughput and gracefully degrade over time.


Processor cores within the proposed system are designed as part of a network-on-chip, where each stage in a coarse-grained processor pipeline corresponds to a node in the network. Pipeline stages are then replicated and grouped together to create multiple logical processors. The interconnect between these pipeline stages is designed to be flexible so that the system can react to local phenomena. Multiple nodes may be allocated temporarily to a single thread to exploit instruction-level parallelism, while at other times the system may distribute resources evenly among all of the logical processors to maximize throughput. Similarly, the system can temporarily stop allocating nodes that are excessively hot, and it can retire nodes that are deemed defective. As nodes wear out and eventually fail, throughput gradually decreases and performance degrades gracefully.
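The grouping of replicated stages into logical processors can be sketched as follows. This is a simplified model under stated assumptions: the four stage names, the Node fields, and the one-node-per-stage grouping policy are all hypothetical, and the sketch does not model the interconnect or the allocation of extra nodes to a single thread.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed stage names for illustration; the actual pipeline may differ.
STAGES = ("fetch", "decode", "execute", "writeback")


@dataclass
class Node:
    stage: str
    node_id: int
    retired: bool = False  # deemed defective, permanently removed
    too_hot: bool = False  # temporarily withheld from allocation

    @property
    def available(self) -> bool:
        return not (self.retired or self.too_hot)


def form_logical_processors(nodes: List[Node]) -> List[Dict[str, Node]]:
    """Group available nodes into logical processors, one node per stage.

    The number of processors is limited by the scarcest stage, so
    throughput drops gradually as individual nodes are retired.
    """
    pools = {s: [n for n in nodes if n.stage == s and n.available] for s in STAGES}
    count = min(len(pool) for pool in pools.values())
    return [{s: pools[s][i] for s in STAGES} for i in range(count)]
```

With three replicas of every stage, this model yields three logical processors. Retiring one fetch node leaves two, which mirrors the graceful degradation described above, and marking a node as too hot removes it only until the flag is cleared.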

Relevant Publications


Page last modified February 21, 2007.