2013 IEEE High Performance
Extreme Computing Conference
(HPEC ’13)
Seventeenth Annual HPEC Conference
10 - 12 September 2013
Westin Hotel, Waltham, MA USA
Instructor: Deshanand Singh, Director of High-Level Design at Altera
In recent years, Field-Programmable Gate Arrays (FPGAs) have become extremely powerful computational platforms that can efficiently solve many
complex problems. Modern FPGAs contain millions of programmable elements, along with signal processing blocks and high-speed
interfaces, all of which are needed to deliver a complete solution. Traditionally, the power of FPGAs has been unlocked through low-level
hardware description languages such as VHDL and Verilog, which allow designers to explicitly specify the behavior of each programmable
element. While these languages make it possible to create highly efficient logic circuits, they are akin to “assembly language” programming for
modern processors. This is a serious limiting factor for both designer productivity and the wider adoption of FPGAs.
In this tutorial, we use the OpenCL language to explore techniques for programming FPGAs at a level of abstraction closer to
traditional software-centric approaches. OpenCL is an industry-standard parallel programming language based on C. It lets designers take
full advantage of the capabilities of FPGAs while providing a high-level design entry language that is familiar to a wide range of
programmers.
Because this is a tutorial, we will walk you through progressively more complex examples. We will start with Lincoln Lab’s own HPEC
Challenge TDFIR (time-domain finite impulse response filter) benchmark, which we selected because it is simple and familiar to the HPEC
community. Lincoln’s benchmark includes data and a test harness that verifies correctness. The optimal OpenCL implementation requires
slightly different coding styles to exploit the underlying architectures of a CPU, a GPU and an FPGA.
Next we will examine an implementation of single-precision general matrix multiplication (SGEMM). It is an example of a highly parallel
algorithm for which efficient circuit structures are well known. We show how this application can be implemented in OpenCL and
how the high-level description can be optimized to generate a highly efficient circuit in hardware.
Finally, we will present a detailed case study of a video compression algorithm written in OpenCL. We will walk through the optimization
steps required to transform the code from an initial working implementation into one that delivers significantly higher performance
than a modern GPU. Beyond the performance benefits of the FPGA implementation, we will show board-level power measurements
that demonstrate compelling power efficiency.
The tutorial will feature live demonstrations and is supported by Altera and BittWare. It will include an overview of OpenCL, but attendees
will gain more from this tutorial if they arrive with some basic knowledge of OpenCL.