
Customizing a processor's instruction set exploits the observation that the dataflow graphs (DFGs) of applications frequently contain recurring subgraphs. Implementing these frequently occurring subgraphs in hardware as instructions improves performance, decreases code size, and reduces energy consumption.
We believe the process of selecting instruction set extensions should be automated, since the number of potential subgraphs in an application grows exponentially with the size of the DFG and is too large for a person to process. For example, the DFG in the figure below is one quarter of a single basic block from the encryption application Blowfish. The black squares in this figure represent operations, and the red lines are dataflow edges. There are many subgraphs, some of which are highlighted.
Dataflow graph from the application Blowfish
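To give a feel for how quickly the candidate space grows, the following is a minimal brute-force sketch that counts the connected subgraphs of a small DFG. It is not the selection algorithm used by our tools; the function names and the two toy graph shapes (`chain_dfg`, `fanout_dfg`) are assumptions made purely for illustration.

```python
from itertools import combinations


def build_adjacency(num_nodes, edges):
    """Undirected adjacency sets; edge direction does not matter for connectivity."""
    adj = {node: set() for node in range(num_nodes)}
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst].add(src)
    return adj


def is_connected(nodes, adj):
    """Return True if the node set induces a connected subgraph."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for neighbor in adj[current]:
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def count_connected_subgraphs(num_nodes, edges):
    """Brute-force count of non-empty connected node subsets of a DFG."""
    adj = build_adjacency(num_nodes, edges)
    return sum(
        1
        for size in range(1, num_nodes + 1)
        for subset in combinations(range(num_nodes), size)
        if is_connected(subset, adj)
    )


def chain_dfg(num_nodes):
    """Hypothetical DFG: each operation feeds the next one."""
    return [(i, i + 1) for i in range(num_nodes - 1)]


def fanout_dfg(num_nodes):
    """Hypothetical DFG: node 0 produces a value consumed by every other node."""
    return [(0, i) for i in range(1, num_nodes)]


if __name__ == "__main__":
    # Print node count, chain subgraph count, fan-out subgraph count.
    for n in (4, 8, 12):
        print(n,
              count_connected_subgraphs(n, chain_dfg(n)),
              count_connected_subgraphs(n, fanout_dfg(n)))
```

A pure chain of n operations has only n(n+1)/2 connected subgraphs, but a single value consumed by n-1 operations already yields 2^(n-1) + n - 1 of them. Real DFGs such as the Blowfish block above mix both shapes, which is why exhaustive manual inspection is impractical.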
The CCCP group has developed tools to automatically select appropriate instruction set extensions for a given application, or set of applications. We have also developed compiler algorithms to utilize the instruction set extensions in applications that were not examined by the selection tools.
Perhaps more importantly, we've developed techniques to cost-effectively generalize these instructions to make them more applicable across a range of applications. Generalization is key, since we want the custom instructions to remain usable even if the application changes because of bug fixes or slight algorithmic changes.
We term the first technique for subgraph generalization wildcards. Wildcards take advantage of the fact that some operations are similar, or can be added to nodes for very little additional cost. For example, add and subtract are very similar, so subtract could be cheaply added to the bottom node in the example below. The new unit could then execute shift-or-subtract in addition to shift-or-add.
We term the second technique for subgraph generalization subsumed subgraphs. This arose from the observation that each node has an identity input (for example, OR with zero), and values can pass through those nodes unchanged using this identity. This allows one to execute a shift-add on the unit below, for example, by passing the shifted value through the or node unchanged.
Generalization techniques for instruction set extensions.
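Both generalization techniques can be illustrated with a small software model of the shift-or-add unit. This is only a sketch under stated assumptions: the function name `custom_unit`, its operand order, the linear shift, or, add chain, and the 32-bit datapath are all hypothetical and are not taken from the actual hardware design.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit datapath


def custom_unit(a, shamt, b, c, bottom_op="add"):
    """Hypothetical 3-node unit computing ((a << shamt) | b) <bottom_op> c.

    The bottom node is a wildcard: it accepts "add" or "sub".
    """
    shifted = (a << shamt) & MASK
    ored = shifted | b
    if bottom_op == "add":
        return (ored + c) & MASK
    if bottom_op == "sub":
        return (ored - c) & MASK
    raise ValueError(f"unsupported bottom operation: {bottom_op}")


# Wildcard: the same unit executes shift-or-subtract.
def shift_or_sub(a, shamt, b, c):
    return custom_unit(a, shamt, b, c, bottom_op="sub")


# Subsumed subgraphs: feed a node its identity input so the value
# passes through unchanged.
def shift_add(a, shamt, c):
    return custom_unit(a, shamt, 0, c)   # OR with 0 is the identity


def or_add(a, b, c):
    return custom_unit(a, 0, b, c)       # shift by 0 is the identity


def shift_or(a, shamt, b):
    return custom_unit(a, shamt, b, 0)   # add 0 is the identity
```

In this model a single piece of hardware covers shift-or-add, shift-or-subtract, and every subsumed pattern, so a compiler can still map code onto the unit after small changes to the application.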
The CCCP group has taken these techniques and implemented them in ARM's OptimoDE design framework. Some results are presented below. Each point along the X-axis lists two applications: the application being run and the application the instruction set extensions were designed for, respectively. Speedup is relative to an ARM 10 processor.
Clearly, instruction set extensions are powerful, and the generalization techniques improve their utilization across an entire domain of applications.
Performance improvement in the encryption domain.
Relevant Publications
- A Customized Processor for Energy Efficient Scientific Computing
(paper: pdf)
Ankit Sethia, Ganesh Dasika, Trevor Mudge, and Scott Mahlke
IEEE Transactions on Computers
Vol. 61, No. 12, Dec. 2012, pp. 1711-1723.
- PEPSC: A Power Efficient Processor for Scientific Computing
(paper: pdf ; slides: ppt)
Ganesh Dasika, Ankit Sethia, Trevor Mudge, and Scott Mahlke
Proc. 20th Intl. Conference on Parallel Architectures and Compilation Techniques (PACT)
Oct. 2011.
- MEDICS: Ultra-Portable Processing for Medical Image Reconstruction
Ganesh Dasika, Ankit Sethia, Vincentius Robby, Trevor Mudge, and Scott Mahlke
Proc. 19th Intl. Conference on Parallel Architectures and Compilation Techniques (PACT)
Sep. 2010.
- DVFS in Loop Accelerators using BLADES
(paper: pdf ; slides: ppt)
Ganesh Dasika, Shidhartha Das, Kevin Fan, Scott Mahlke, and David Bull
Proc. 45th Design Automation Conference (DAC)
Jun. 2008, pp. 894-897.
- Scalable Subgraph Mapping for Acyclic Computation Accelerators
(paper: pdf ; slides: ppt)
Nathan Clark, Amir Hormati, Scott Mahlke, and Sami Yehia
Proc. 2006 Intl. Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES)
Oct. 2006, pp. 147-157.
- Automated Custom Instruction Generation for Domain-Specific Processor Acceleration
(paper: pdf)
Nathan Clark, Hongtao Zhong, and Scott Mahlke
IEEE Transactions on Computers
Vol. 54, No. 10, Oct. 2005, pp. 1258-1270.
- Application Specific Processing on a General Purpose Core via Transparent Instruction Set Customization
(paper: pdf ; slides: ppt)
Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, and Krisztian Flautner
Proc. 37th Intl. Symposium on Microarchitecture (MICRO)
Dec. 2004, pp. 30-40.
- OptimoDE: Programmable Accelerator Engines Through Retargetable Customization
(slides: ppt)
Nathan Clark, Hongtao Zhong, Kevin Fan, Scott Mahlke, Krisztian Flautner, and Koen Van Nieuwenhove
Proc. Hot Chips 16
Aug. 2004.
- Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration
Nathan Clark, Hongtao Zhong, Wilkin Tang, and Scott Mahlke
Intl. Journal of Parallel Programming
Vol. 31, No. 6, Dec. 2003, pp. 429-449.
- Processor Acceleration Through Automated Instruction Set Customization
(paper: pdf ; slides: ppt)
Nathan Clark, Hongtao Zhong, and Scott Mahlke
Proc. 36th Intl. Symposium on Microarchitecture (MICRO)
Dec. 2003, pp. 129-140.
- Automatically Generating Custom Instruction Set Extensions
(paper: pdf ; slides: ppt)
Nathan Clark, Wilkin Tang, and Scott Mahlke
Proc. 1st Workshop on Application Specific Processors (WASP)
Nov. 2002, pp. 94-101.
Page last modified May 11, 2012.