Distributed VLIW

A major problem with traditional VLIW processors is that they do not scale efficiently due to bottlenecks that result from centralized resources. On the datapath side, the centralized register file quickly becomes the bottleneck as it is scaled to support more function units (FUs) in the system. The cost and access time of the register file increase quadratically with the number of ports. To support scalable datapath design, the multicluster architecture was proposed, in which the centralized register file is broken into several small register files; each smaller register file supplies operands to a subset of the function units, thereby forming a cluster.

A VLIW design faces a similar scaling problem with the control path where conventional designs utilize a centralized instruction memory or cache to store instructions. A centralized instruction fetch, decode, and distribution system issues control signals on every clock cycle to all the FUs and storage elements in the datapath to direct their operation. This centralized control system does not scale well due to complexity, latency, and energy consumption. As processor issue width is scaled, the number of instruction bits grows accordingly, increasing hardware cost for instruction fetch, decode, and distribution. The distance separating FUs and storage elements from the instruction memory also grows as the design is scaled, thereby increasing the latency to transmit control signals as well as the energy required to transmit values.

We introduce an architecture model called distributed VLIW, or DVLIW, to support scalable control path design. The contribution of DVLIW is distributing the VLIW control path while still supporting compressed instruction encoding. The central idea is to distribute the instruction fetch, decode, and distribution logic in the same manner that the register file is distributed in a multicluster datapath. Each cluster contains an instruction memory or cache combined with fetch, decode, and distribution units that provide control within the cluster. Each cluster also has its own program counter (PC) and next-PC generation hardware to facilitate distributed instruction sequencing. Figure 1 shows the block diagram of a DVLIW processor.

Figure 1. Distributed VLIW


The datapath in the DVLIW is that of a conventional multicluster VLIW: each cluster has its own FUs and register files. In addition, the control path is partitioned, as each cluster has its own PC, instruction cache, shift/align network, and instruction register (IR). Every cycle, each cluster fetches operations from its I-cache according to its own PC. All clusters execute synchronously, and the execution order of operations is the same as on a traditional VLIW architecture. All operations in a logical instruction word are fetched and executed in the same cycle across the different clusters. If any cluster incurs a cache miss, all clusters must stall.
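The lockstep fetch behavior described above can be sketched as a small simulation. This is a hypothetical model written for illustration (the class and function names are made up, not from the DVLIW design): each cluster keeps its own PC over its own slice of the program, one logical instruction word is assembled from the per-cluster fetches each cycle, and a miss in any cluster stalls all of them.

```python
# Minimal sketch of lockstep fetch in a DVLIW (illustrative model only).

class Cluster:
    def __init__(self, imem):
        self.pc = 0
        self.imem = imem          # this cluster's slice of the program

    def fetch(self):
        op = self.imem[self.pc]   # one operation of the logical word
        self.pc += 1              # sequential next PC; branches adjust it
        return op

def step(clusters, icache_hit):
    """Advance one cycle: fetch one logical instruction word, or stall."""
    if not all(icache_hit(c) for c in clusters):
        return None               # a miss in any cluster stalls every cluster
    # operations at the same logical position form one VLIW word
    return [c.fetch() for c in clusters]

# Two clusters executing a three-word program in lockstep.
c0 = Cluster(["A0", "B0", "C0"])
c1 = Cluster(["A1", "B1", "C1"])
clusters = [c0, c1]
words = [step(clusters, lambda c: True) for _ in range(3)]
# words == [["A0", "A1"], ["B0", "B1"], ["C0", "C1"]]
```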

The code organization in all levels of the memory hierarchy is changed for DVLIW. In conventional architectures with a centralized PC, all operations within the same instruction word are placed sequentially in memory, as shown in Figure 2(a). Thus, operations for different clusters, e.g. A0 and A1, are placed next to each other. With such a code organization, distributing the I-cache is difficult because the actual distribution of instruction bits must occur during program execution by the hardware. This generally precludes complex instruction compression schemes, as the run-time distribution algorithm must be very simple. In the DVLIW architecture, the operations for each cluster are grouped together, and the code for different clusters is placed separately in memory (or in separate memories), as shown in Figure 2(b). This organization allows a cluster to compute its own next PC without knowing the size of operations in other clusters, thus allowing all clusters to fetch and execute independently.
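The two layouts can be illustrated with a toy example. This is an assumed rendering of Figure 2, with made-up operation names: the conventional layout interleaves the operations of one word, while the DVLIW layout groups each cluster's operations contiguously so that every cluster can sequence through its own slice alone.

```python
# Illustrative comparison of the two code layouts (hypothetical example).
# A logical program is a list of instruction words; each word holds one
# operation per cluster.

program = [["A0", "A1"], ["B0", "B1"], ["C0", "C1"]]

# Conventional layout: operations of one word sit next to each other in
# memory, so cluster 1 cannot find its next operation without knowing
# the size of cluster 0's operations.
centralized = [op for word in program for op in word]

# DVLIW layout: each cluster's operations are grouped contiguously, so
# every cluster steps through its own slice independently.
per_cluster = [[word[c] for word in program] for c in range(2)]

# centralized == ["A0", "A1", "B0", "B1", "C0", "C1"]
# per_cluster == [["A0", "B0", "C0"], ["A1", "B1", "C1"]]
```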

Figure 2. Code layout in DVLIW

A major challenge that arises for the DVLIW architecture is branch execution. A valid single thread of execution must be maintained using multiple instruction streams. Therefore, each instruction stream must execute instructions from the same logical instruction word each cycle and branch to the same logical target at the same time. As shown in Figure2(b), operations in a logical instruction word are stored in separate locations. Thus every cluster has a different branch target for each branch operation. Special architectural and compiler support is proposed to solve this problem.

The proposed branch mechanism is based on the unbundled branch architecture in HPL-PD. The unbundled branch architecture separately specifies each portion of the branch: target address, condition, and control transfer point. In a DVLIW architecture with unbundled branches, branch targets must be computed separately for each cluster. The branch condition is computed in one cluster and broadcast to all the clusters. If the branch is taken, each cluster transfers control to its individual branch target. All of these branch targets correspond to the same logical block. The compiler inserts new branch operations so that all the clusters branch at the same time.
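The branch semantics described above can be sketched as follows. This is an assumed model based on the description in the text (function and variable names are illustrative, not from HPL-PD): one cluster evaluates the condition and broadcasts it, and each cluster then redirects its own PC to its private address for the shared logical target block.

```python
# Sketch of a DVLIW branch under the unbundled-branch model
# (illustrative semantics, not the actual hardware interface).

def execute_branch(pcs, targets, condition):
    """pcs[i] is cluster i's PC; targets[i] is cluster i's address for
    the shared logical target block. All clusters branch together."""
    if condition:                      # broadcast from the computing cluster
        return list(targets)           # every cluster takes its own target
    return [pc + 1 for pc in pcs]      # fall through in every stream

# The same logical block starts at address 40 in cluster 0's memory but
# at address 72 in cluster 1's, because the per-cluster slices differ
# in size.
taken = execute_branch([10, 25], [40, 72], condition=True)
fallthrough = execute_branch([10, 25], [40, 72], condition=False)
# taken == [40, 72]; fallthrough == [11, 26]
```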

With DVLIW, multiple instruction streams are executed at the same time. These streams collectively function as a single logical stream on a conventional VLIW processor. Although clusters fetch independently, they execute operations in the same logical position each cycle and branch to the same logical location. Procedures, basic blocks, and instruction words are vertically sliced and stored in different locations in the instruction memory hierarchy. However, the logical organization is maintained by the compiler to ensure proper execution. The DVLIW architecture can be viewed as a special chip multiprocessor system that, through compiler orchestration, collectively executes a single program to exploit instruction-level parallelism. A DVLIW architecture can also dynamically repartition itself to support concurrent execution of multiple instruction streams. Thus, both applications with large amounts of instruction-level parallelism and those with limited ILP but available thread-level parallelism can be executed efficiently on the DVLIW.

Relevant Publications


Page last modified January 22, 2016.