
To meet insatiable consumer demand for greater performance at lower power, silicon technology has scaled to unprecedented dimensions. However, the pursuit of faster processors and longer battery life has come at the cost of device reliability. With processor (un)reliability now a first-order design constraint, there has been growing interest in low-cost, non-intrusive techniques for transient fault detection.
Many of these recent proposals have relied on the availability of hardware recovery mechanisms. Although common in aggressive out-of-order machines, hardware support for speculative rollback and recovery is less common in lower-end commodity and embedded processors. We have recently developed Encore, a software-based fault recovery mechanism tailored for these lower-cost systems that lack native hardware support for speculative rollback.
We developed new compiler analyses and algorithms that enable Encore to provide this fault recovery at very low cost. By exploiting fine-grained idempotence analysis and cost-sensitive heuristics that target only statistically relevant code regions, Encore achieves high recoverability coverage without the costs associated with traditional software-based checkpointing solutions. Experimental results show that Encore can recover from up to 95% of detected faults for certain applications while imposing only 6% runtime performance overhead on average.
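To make the idea of idempotence analysis concrete, the following is a minimal sketch and not Encore's actual compiler pass. It assumes a simplified model in which each instruction of a straight-line region is a pair of read and written locations, and it uses the common definition that a region is safe to re-execute if it never overwrites an input value it has already read. The names `Instr`, `clobbered_live_ins`, and `is_idempotent` are hypothetical.

```python
from typing import List, Set, Tuple

# Assumed model: an instruction is (locations_read, locations_written).
Instr = Tuple[Set[str], Set[str]]


def clobbered_live_ins(region: List[Instr]) -> Set[str]:
    """Return locations whose incoming value is read and later overwritten.

    Re-executing the region after a fault would see the overwritten value
    instead of the original input, so these locations need a checkpoint.
    """
    written: Set[str] = set()        # locations defined so far in the region
    live_in_reads: Set[str] = set()  # locations read before any write to them
    clobbered: Set[str] = set()
    for reads, writes in region:
        # A read of a location not yet written here observes a live-in value.
        live_in_reads |= reads - written
        # Overwriting an already-read live-in breaks idempotence.
        clobbered |= writes & live_in_reads
        written |= writes
    return clobbered


def is_idempotent(region: List[Instr]) -> bool:
    """A region is idempotent in this model if it clobbers no live-in."""
    return not clobbered_live_ins(region)


# y = x + 1; z = y * 2   (inputs are never overwritten)
pure_region = [({"x"}, {"y"}), ({"y"}, {"z"})]
# x = x + 1              (reads its input, then overwrites it)
increment_region = [({"x"}, {"x"})]
```

In this model only the locations returned by `clobbered_live_ins` would need to be saved before a region runs, which is one way to see why fine-grained analysis can be cheaper than checkpointing all state.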
 
The high-level Encore vision: at compile time, application code is partitioned into single-entry, multiple-exit regions that are subsequently analyzed and instrumented to enable low-cost rollback recovery from transient faults. Flexible heuristics enable Encore to refine the partitioning and instrumentation passes, customizing their behavior to achieve the desired tradeoff between reliability and performance overheads.
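The rollback step described in the caption can be illustrated with a small runtime sketch. This is an assumption-laden model, not Encore's generated instrumentation: a region is a Python function over a dictionary of state, fault detection is represented by a hypothetical `DetectedFault` exception, and `run_with_rollback` and `max_retries` are illustrative names.

```python
from typing import Callable, Dict, Iterable


class DetectedFault(Exception):
    """Hypothetical signal raised when a fault detector fires."""


def run_with_rollback(
    region: Callable[[Dict[str, int]], None],
    state: Dict[str, int],
    checkpoint_keys: Iterable[str],
    max_retries: int = 3,
) -> Dict[str, int]:
    """Run a region, restoring checkpointed inputs and retrying on a fault."""
    # Save only the locations the region may overwrite after reading them.
    saved = {key: state[key] for key in checkpoint_keys}
    for _ in range(max_retries + 1):
        try:
            region(state)
            return state
        except DetectedFault:
            # Roll back to the region entry: restore the saved inputs,
            # then loop to re-execute from the single entry point.
            state.update(saved)
    raise RuntimeError("fault persisted after all retries")
```

A region that overwrites no inputs would be called with an empty `checkpoint_keys`, so re-execution costs nothing beyond the retry itself.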
Additional Work: We are currently extending Encore from a backend implementation to an IR-level implementation so that it can target multiple processor architectures. We also hope to improve the overall process through better memory dependence analyses and loop processing.
Page last modified January 22, 2016.
