As silicon technologies enter deep sub-micron realms, circuit-level techniques are increasingly being employed in addition to architectural techniques to achieve the stringent performance and power goals of embedded applications. Dynamic frequency/voltage scaling (DFS and DVS) is a widely used technique to reduce the overall energy consumption of a computer system, particularly for workloads with high variation in processing requirements. DFS/DVS can be used either to push the operating frequency of a circuit beyond the nominal frequency assumed at design time, improving performance, or to lower voltage and frequency, reducing energy consumption when the application does not require the full capabilities of the hardware. A critical issue for a DFS/DVS-enabled computer system is determining the safe operating voltage at which maximum execution efficiency is achieved while guaranteeing correct operation of all components.
Razor is a cost-effective technique to perform in-situ error detection and recovery from timing-related circuit errors. Razor uses a delay-error tolerant flip-flop on critical paths to scale the supply voltage to the point of first failure for a given frequency. A Razor flip-flop detects timing errors by sampling values in a time-staggered manner and flags an error in case of a discrepancy. This allows voltage margins to be completely eliminated to increase execution efficiency. Razor also allows scaling below the first failure point into the sub-critical region, deliberately sustaining a low error rate that enables more energy-efficient operation.
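The time-staggered sampling described above can be pictured with a small behavioral model. This is only a sketch under simplifying assumptions, not the actual Razor circuit: the function name razor_sample, its parameters, and the abstract time units are all hypothetical, and the shadow latch is assumed to always capture the settled value.

```python
from dataclasses import dataclass


@dataclass
class RazorSample:
    main: int     # value captured by the main flip-flop at the clock edge
    shadow: int   # value captured by the delayed shadow latch
    error: bool   # True when the two samples disagree


def razor_sample(old_value, new_value, path_delay, clock_period, shadow_delay):
    """Model one clock cycle of a single Razor-protected register (hypothetical model).

    The logic output switches from old_value to new_value path_delay time
    units after the launching edge. The main flip-flop samples at
    clock_period; the shadow latch samples shadow_delay later.
    """
    if path_delay > clock_period + shadow_delay:
        # Assumption: the design guarantees the shadow latch is never late.
        raise ValueError("path too slow even for the shadow latch")
    # The main flip-flop sees the new value only if the path settled in time.
    main = new_value if path_delay <= clock_period else old_value
    shadow = new_value
    return RazorSample(main, shadow, main != shadow)


on_time = razor_sample(0, 1, path_delay=0.9, clock_period=1.0, shadow_delay=0.5)
late = razor_sample(0, 1, path_delay=1.2, clock_period=1.0, shadow_delay=0.5)
```

In this model a late transition is flagged only when it actually changes the stored value, which is why the observed error rate depends on the data flowing through the path as well as on its delay.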
One of the difficulties of Razor is that techniques specific to the circuit under consideration must be employed to recover from the error. For example, in the case of a processor, the error signal can be used to squash and replay previous instructions. In the case of a loop accelerator, the finite state machine should revert to a prior known-good state. Even though Razor promises improved performance by going beyond design time limitations, Razor-enabling a circuit is a difficult problem. The circuit has to be analyzed to determine locations to insert Razor latches and design-specific techniques have to be created to recover from timing errors.
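As a rough illustration of reverting a loop accelerator to a prior known-good state, the sketch below keeps a short history of state snapshots and holds stores in a queue until they are old enough to be trusted. Everything here is an assumption made for illustration (the class RollbackAccelerator, the window parameter, and the toy running-sum loop); it is not the recovery hardware of any particular design.

```python
from collections import deque


class RollbackAccelerator:
    """Toy loop accelerator: keeps a running sum and stores it each iteration."""

    def __init__(self, window):
        self.window = window        # iterations after which a result is treated as safe
        self.acc = 0                # loop-carried state
        self.iteration = 0
        self.memory = {}            # committed stores only
        self.history = deque()      # (iteration, acc) snapshots, oldest first
        self.store_queue = deque()  # (iteration, value) stores not yet committed

    def step(self, value):
        # Snapshot the state before executing the iteration.
        self.history.append((self.iteration, self.acc))
        self.acc += value
        self.store_queue.append((self.iteration, self.acc))
        self.iteration += 1
        self._retire()

    def _retire(self):
        # Commit stores and drop snapshots older than the detection window.
        while self.history and self.history[0][0] < self.iteration - self.window:
            self.history.popleft()
            it, val = self.store_queue.popleft()
            self.memory[it] = val

    def rollback(self):
        # Restore the oldest retained snapshot and discard uncommitted stores.
        if not self.history:
            return self.iteration
        self.iteration, self.acc = self.history[0]
        self.history.clear()
        self.store_queue.clear()
        return self.iteration


inputs = [1, 2, 3, 4]
accel = RollbackAccelerator(window=2)
for v in inputs[:3]:
    accel.step(v)
resume_at = accel.rollback()      # pretend a timing error was flagged here
for v in inputs[resume_at:]:
    accel.step(v)
```

Because uncommitted stores never reach memory in this model, re-executing from the restored iteration produces the same final state as an error-free run.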
The focus of this research is an automated system to synthesize Razor-enabled loop accelerators from high-level specifications. The inputs to the system are the target application expressed in C and the desired performance. Compiler analyses and scheduling are used to synthesize a minimum-cost loop accelerator that meets the given performance for the application. The synthesis system exploits the regular structure of the accelerator datapath template to automatically determine where to place Razor latches. An application-specific error recovery mechanism is also automatically derived by the synthesis system. The loop accelerator is augmented with extended shift registers and store queues to enable rollback and re-execution when a timing violation is detected. This is illustrated in the figure below:
The compiler can provide information about the widths of the different values used in the accelerator at different times. This information can be used to predict error rates and, therefore, to estimate the potential power reduction, which aids the decision of whether to use Razor. The figure below shows the number of timing errors incurred by the applications per 100 billion cycles with increasing frequency. Only benchmarks with a non-zero number of errors within the 200% frequency range are shown. The number of errors depends on the number of critical operations (e.g., multiply) the application performs in a cycle and on the mix of data flowing through the datapath that implements those critical operations. For example, idct performs about 5 multiply operations per cycle, and the data profile shows that idct operates primarily on 32-bit data. Thus it incurs about 35 errors in 100 billion cycles at 200% frequency. In contrast, dequant has only 2 multiply operations per cycle, and about 90% of its data operands are 8-bit numbers. Thus it incurs only 7 errors in 100 billion cycles.
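The dependence on operation count and operand width can be sketched as a simple expected-value estimate. The function expected_errors and the per-operation error probabilities below are hypothetical placeholders chosen only to show the shape of the calculation; they are not measured values and are not meant to reproduce the numbers above.

```python
def expected_errors(cycles, critical_ops_per_cycle, width_mix, error_prob_by_width):
    """Estimate timing errors from operand-width statistics (hypothetical model).

    width_mix maps operand width in bits to the fraction of operands of that
    width; error_prob_by_width maps width to an assumed per-operation error
    probability at the chosen frequency.
    """
    if abs(sum(width_mix.values()) - 1.0) > 1e-9:
        raise ValueError("width_mix fractions must sum to 1")
    per_op = sum(frac * error_prob_by_width[w] for w, frac in width_mix.items())
    return cycles * critical_ops_per_cycle * per_op


# Placeholder probabilities: wider operands are assumed to exercise longer paths.
ASSUMED_ERROR_PROB = {8: 0.0, 32: 1e-10}

wide_kernel = expected_errors(1e11, 5, {32: 1.0}, ASSUMED_ERROR_PROB)
narrow_kernel = expected_errors(1e11, 2, {8: 0.9, 32: 0.1}, ASSUMED_ERROR_PROB)
```

Under these assumptions, a kernel with more critical operations per cycle and mostly wide operands yields a higher estimate than one dominated by narrow operands, which matches the qualitative trend described above.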
Relevant Publications
- None.
Page last modified January 22, 2016.