Maestro

As CMOS feature sizes venture deep into the nanometer regime, wearout mechanisms including negative-bias temperature instability and timedependent dielectric breakdown can severely reduce processor operating lifetimes and performance. This paper presents an introspective reliability management system, Maestro, to tackle reliability challenges in future chip multiprocessors (CMPs) head-on. Unlike traditional approaches, Maestro relies on low-level sensors to monitor the CMP as it ages (introspection). Leveraging this real-time assessment of CMP health, runtime heuristics identify wearout-centric job assignments (management). By exploiting the complementary effects of the natural heterogeneity (due to process variation and wearout) that exists in CMPs and the diversity found in system workloads, Maestro composes job schedules that intelligently control the aging process. Monte Carlo experiments show that Maestro significantly enhances lifetime reliability through intelligent wear-leveling, increasing the expected service life of a population of 16-core CMPs by as much as 38% compared to a naive, round-robin scheduler. Furthermore, in the presence of process variation, Maestro's wearout-centric scheduling outperformed both performance counter and temperature sensor based schedulers, achieving an order of magnitude more improvement in lifetime throughput -- the amount of useful work done by a system prior to failure.

Read the Maestro paper here: pdf

StageNet

The goal of this work is to design a robust and scalable multithreaded system. To ensure reliable operation, the system is designed to be both introspective and reconfigurable. Introspection enables the system to accurately detect and diagnose transient faults, which may be caused by particle strikes, excessive temperature densities, or electrical noise. Component failures due to wearout phenomenon, such as electromigration and dielectric breakdown, can also be anticipated by an effective introspection scheme. Fine-grained reconfigurability maximizes the lifetime of the system and allows it to adapt to changing conditions on the chip. The adaptive capabilities of the system allow it to maintain service in the face of failing components. Further, this system is designed with a large number of redundant structures which gives it the flexibility of supporting multiple threads when appropriate to maximize performance or as structures fail, to reduce throughput and gracefully degrade over time.

Processor cores within the proposed system are designed as part of a network-on-chip, where each stage in a coarse-grained processor pipeline corresponds to a node in the network. Pipeline stages are then replicated and grouped together to create multiple logical processors. The interconnect between these pipeline stages is designed to be flexible so that the system can react to local phenomenon. Multiple nodes may be allocated temporarily to a single thread to exploit instruction level parallelism, while at other times the system may evenly distribute resources between all of the logical processors in order to maximize throughput. Similarly, the system can temporarily arrest allocation of nodes that are excessively hot, as well as retire nodes that are deemed defective. As nodes wear out and eventually fail, throughput gradually decreases and performance gracefully degrades.

Read the StageNet paper here: pdf

Page last modified January 22, 2016.