Streaming

Support for parallelism in hardware has evolved greatly in the past decade in response to the ever-increasing demand for higher performance and better power efficiency across application domains. Various companies have introduced vastly different solutions to bridge the performance and power gap that many applications face. These solutions include shared-memory multicore systems~(Intel Core i7~\cite{intel08:3}), distributed-memory multicore processors~(IBM Cell~\cite{ibm06}), tiled architectures~(Tilera~\cite{tilera08}), and in some cases a combination of these~(Intel Larrabee~\cite{seiler08} and Intel Stellarton~\cite{intel10}). Among these solutions, heterogeneous architectures, as shown in Figure~\ref{heterogeneous-template-fig}, not only achieve higher performance and efficiency by combining multiple cores on one die, but are also equipped with acceleration engines that support parallelism more efficiently for certain application domains. For example, SIMD engines~(e.g., AltiVec~\cite{semicond09}, NEON~\cite{ltd09}, SSE4~\cite{intel06}) integrated into multicore systems provide more efficient data-level parallelism for several important application domains such as multimedia, graphics, and encryption. Although acceleration engines, such as SIMD units or FPGAs, are not suitable for all applications, when an application can be tailored to exploit them efficiently, the performance and power benefits often exceed the gains from other general-purpose architecture solutions.

Programming heterogeneous architectures is an important problem that is impeding the wider adoption of such systems. Traditional sequential programming languages are ill-suited for heterogeneous architectures because they assume a single instruction stream and a monolithic memory. Extracting task-, pipeline-, or data-level parallelism from these languages requires extensive and often intractable compiler analysis. Using a different programming model and compilation framework for each component of the system is also undesirable because it limits the portability and retargetability of the program, requiring \textit{each program to be rewritten and optimized for a specific architecture}. Architecture-specific programming models and languages, such as Verilog and CUDA~\cite{nvidia07}, that target specific components, such as FPGAs and GPUs, expose parallelism to the compiler, but in their current form they fail to provide portable code and do not present a unified model to the programmer. The main problem with these languages is twofold: the explicitly programmed parallelism in each application has to be tuned to each target based on the parameters of the underlying hardware, and interfacing between parts of an application written in different architecture-specific languages is non-trivial.

A higher level of programming abstraction, combined with intelligent static and dynamic compiler optimizations, can address the issues of programming heterogeneous systems while maintaining portability and retargetability. One such abstraction is offered by the streaming paradigm. This paradigm enables an extensive set of compiler optimizations for mapping and scheduling applications onto various parallel architectures~\cite{gordon06, gordon02}. The retargetability of streaming languages, such as StreamIt~\cite{thies02}, has made them a good choice for parallel system programmers. This retargetability, and the resulting performance benefits on multicore systems, stem mainly from well-encapsulated constructs that expose parallelism and communication without depending on the topology or granularity of the underlying architecture. Compilers for these languages take advantage of the high-level information available at the program level to efficiently map the exposed parallelism onto the target architecture.
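As an illustration of these constructs, the following StreamIt-style sketch builds a small pipeline in which every filter declares its input and output rates. The filter names and rates here are a minimal example of our own rather than code from any particular benchmark. Because the graph structure and the push/pop/peek rates are explicit, a compiler can see the producer--consumer channels and the data parallelism of the stateless filter directly, without whole-program analysis.

\begin{verbatim}
// Top-level pipeline: three filters connected by FIFO channels.
void->void pipeline Example {
  add IntSource();       // produces integers
  add MovingAverage();   // stateless filter: candidate for data parallelism
  add IntPrinter();      // consumes the results
}

// Produces one integer per firing.
void->int filter IntSource {
  int x;
  init { x = 0; }
  work push 1 { push(x++); }
}

// Sliding-window average: peeks at three items, pops one,
// and pushes one per firing (rates are declared statically).
int->int filter MovingAverage {
  work push 1 pop 1 peek 3 {
    push((peek(0) + peek(1) + peek(2)) / 3);
    pop();
  }
}

// Prints one integer per firing.
int->void filter IntPrinter {
  work pop 1 { println(pop()); }
}
\end{verbatim}

The statically declared rates also make each such program an instance of the synchronous data flow model described below.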

Most of the work on stream compilation has so far focused on compiling streaming applications to homogeneous multicore systems. However, compiling stream programs to other important components of heterogeneous architectures, such as FPGAs, SIMD engines, and GPUs, remains an open question. In this thesis, we propose new techniques and compilation frameworks for static and dynamic compilation of programs in the streaming domain, specifically those implemented in the synchronous data flow~(SDF, see Chapter~\ref{ch:back}) model, to various components of heterogeneous systems. Our techniques further extend the retargetability and portability of streaming applications by enabling programmers to write a streaming application once and run it efficiently on various parts of a system. An overview of our system is shown in Figure~\ref{system-fig}. The following sections briefly explain the four parts of our compiler and runtime system: Optimus, Macross, Sponge and Flextream.
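To make the SDF model concrete, consider an illustrative two-actor graph (our own example, not drawn from any benchmark) in which actor $A$ pushes two tokens per firing onto a channel from which actor $B$ pops three tokens per firing. Because the rates are known statically, the balance equation
\[
2\,r_A = 3\,r_B \;\;\Rightarrow\;\; (r_A, r_B) = (3, 2)
\]
yields the minimal repetition vector, so a steady-state schedule fires $A$ three times and $B$ twice. The compiler can therefore bound buffer sizes and construct the schedule entirely at compile time, and it is this static schedulability that the four systems below exploit when targeting different hardware components.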


