Voltron Processor

Chip multiprocessors with multiple simpler cores are gaining popularity because they have the potential to drive future performance gains without exacerbating the problems of power dissipation and complexity. Current chip multiprocessors increase throughput by utilizing multiple cores to perform computation in parallel. These designs provide real benefits for server-class applications that are explicitly multi-threaded. However, for desktop and other systems where single-thread applications dominate, multicore systems have yet to offer much benefit. Chip multiprocessors are most efficient at executing coarse-grain threads that have little communication. However, general-purpose applications do not provide many opportunities for identifying such threads, due to frequent use of pointers, recursive data structures, if-then-else branches, small function bodies, and loops with small trip counts.

To attack this mismatch, we propose a multicore architecture, referred to as Voltron, that extends traditional multicore systems to enable efficient execution of single-thread applications across multiple cores. The Voltron architecture exploits instruction-level parallelism (ILP), fine-grain thread-level parallelism (TLP), and statistical loop-level parallelism (LLP) in single-thread applications.

Figure 1. Voltron architecture


Figure 1(a) shows an overall diagram of a four-core Voltron system. The four cores are organized in a two-dimensional mesh. Each core is a VLIW processor with extensions to communicate with neighboring cores. The cores access a unified memory space and share a banked L2 cache.

Figure 1(b) shows the datapath architecture of each core, which is very similar to that of a conventional VLIW processor. Each core has a complete pipeline, including an L1 instruction and data cache, an instruction fetch and decode unit, register files, and function units (FUs). A Voltron core differs from a conventional VLIW processor in that, in addition to normal functional units such as an integer ALU, a floating-point ALU, and memory units, each core has a communication unit (CU). The CU communicates with other cores in the processor through the operand network by executing special communication instructions. A low-cost transactional memory is used to support speculative execution of loops that exhibit statistical loop-level parallelism.

Figure 1(c) shows the details of the dual-mode operand network. The Voltron scalar operand network supports two modes to meet the requirements of ILP and fine-grain TLP execution: a direct mode that has a very low latency (1 cycle per hop) but requires both parties of the communication to be synchronized, and a queue mode that has a higher latency (2 cycles + 1 cycle per hop) but allows asynchronous communication.
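As a rough illustration, the per-transfer latencies of the two network modes can be modeled directly from the figures above as a function of hop count on the mesh (a sketch only; `hops` is assumed to be the routing distance between the communicating cores):

```python
def direct_mode_latency(hops: int) -> int:
    # Direct mode: 1 cycle per hop, but sender and receiver
    # must be synchronized (lock-step execution).
    return hops

def queue_mode_latency(hops: int) -> int:
    # Queue mode: fixed 2-cycle overhead plus 1 cycle per hop;
    # communication is asynchronous through operand queues.
    return 2 + hops

# Neighboring cores on the 2D mesh are 1 hop apart:
print(direct_mode_latency(1))  # 1 cycle
print(queue_mode_latency(1))   # 3 cycles
```

The crossover is purely a synchronization trade-off: direct mode is always cheaper per transfer, but only usable when the cores run in lock-step.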

Voltron supports two execution modes that are customized for the form of parallelism being exploited: coupled and decoupled. In coupled mode, all cores execute in lock-step, collectively behaving like a wide-issue multicluster VLIW machine. In decoupled mode, each core independently executes its own thread. Coupled mode efficiently exploits ILP using the direct-mode operand network, while decoupled mode exploits LLP and fine-grain TLP using the queue-mode operand network.
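A software analogy may help clarify decoupled execution (an illustrative sketch only; it mirrors the communication pattern, not the hardware mechanism): each "core" runs its own thread, and operands flow between them through an asynchronous queue, so the producer never stalls waiting for the consumer to reach a matching instruction.

```python
import queue
import threading

# Asynchronous operand queue connecting two "cores" (hypothetical
# software stand-in for the queue-mode operand network).
operand_queue: "queue.Queue[int]" = queue.Queue()
results = []

def producer_core():
    # Computes values and sends them without synchronizing
    # with the consumer, as in decoupled mode.
    for value in range(4):
        operand_queue.put(value * value)

def consumer_core():
    # Receives operands whenever they arrive; no lock-step timing.
    for _ in range(4):
        results.append(operand_queue.get())

t1 = threading.Thread(target=producer_core)
t2 = threading.Thread(target=consumer_core)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 4, 9]
```

Coupled mode, by contrast, would correspond to both threads advancing in strict lock-step, which is what lets direct-mode transfers avoid queue overhead.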

The compiler is responsible for extracting parallelism in the program and orchestrating the execution across multiple cores. See Voltron compiler techniques for details.

The two execution modes in Voltron allow the processor to adjust itself to the parallelism in applications. Coupled mode offers the advantage of fast inter-core communication and the ability to efficiently exploit ILP across the cores. Decoupled mode offers the advantage of fast synchronization and the ability to overlap execution across loop iterations and in code regions with frequent cache misses. Experimental results show that dual-mode execution is significantly more effective than either mode alone.

Relevant Publications


Page last modified January 22, 2016.