2013 IEEE High Performance
Extreme Computing Conference
(HPEC ’13)
Seventeenth Annual HPEC Conference
10 - 12 September 2013
Westin Hotel, Waltham, MA USA
High-performance Dynamic Programming on FPGAs with OpenCL
Sean Settle, Altera Corporation
Abstract: Field programmable gate arrays (FPGAs) provide reconfigurable computing fabrics that can be tailored to a wide range of time
and power sensitive applications. Traditionally, programming FPGAs required an expertise in complex hardware description languages
(HDLs) or proprietary high-level synthesis (HLS) tools. Recently, Altera released the world’s first OpenCL conformant SDK for FPGAs.
OpenCL is an open, royalty-free standard for cross-platform, parallel programming of heterogeneous systems that together with Altera
extensions significantly reduces FPGA development time and costs in high-performance computing environments. In this paper, we
demonstrate dynamic programming on FPGAs with OpenCL by implementing the Smith-Waterman algorithm for DNA, RNA, or protein
sequencing in bioinformatics in a manner readily familiar to both hardware and software developers. Results show that Altera FPGAs
significantly outperform leading CPU and GPU parallel implementations by over an order of magnitude in both absolute performance
and relative power efficiency.
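The Smith-Waterman recurrence at the heart of the paper is compact enough to sketch. The following minimal C reference, assuming a simple linear gap model with illustrative scoring constants (MATCH, MISMATCH, and GAP are not taken from the paper), computes the local alignment score that an FPGA kernel would accelerate:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MATCH     2   /* assumed match reward     */
    #define MISMATCH -1   /* assumed mismatch penalty */
    #define GAP       1   /* assumed linear gap cost  */

    static int max4(int a, int b, int c, int d) {
        int m = a > b ? a : b;
        m = m > c ? m : c;
        return m > d ? m : d;
    }

    /* Best local alignment score of sequences a and b (two-row DP). */
    int smith_waterman(const char *a, const char *b) {
        size_t n = strlen(a), m = strlen(b);
        int *prev = calloc(m + 1, sizeof(int));
        int *curr = calloc(m + 1, sizeof(int));
        int best = 0;
        for (size_t i = 1; i <= n; i++) {
            for (size_t j = 1; j <= m; j++) {
                int s = (a[i-1] == b[j-1]) ? MATCH : MISMATCH;
                curr[j] = max4(0,                  /* local alignment floor */
                               prev[j-1] + s,      /* diagonal: (mis)match  */
                               prev[j]   - GAP,    /* gap in sequence b     */
                               curr[j-1] - GAP);   /* gap in sequence a     */
                if (curr[j] > best) best = curr[j];
            }
            int *tmp = prev; prev = curr; curr = tmp;
            memset(curr, 0, (m + 1) * sizeof(int));
        }
        free(prev); free(curr);
        return best;
    }

    int main(void) {
        printf("score = %d\n", smith_waterman("ACACACTA", "AGCACACA"));
        return 0;
    }

Because each anti-diagonal of the score matrix depends only on the previous two, all of its cells can be computed simultaneously, which is the parallelism a deeply pipelined FPGA implementation exploits.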
High Throughput Energy Efficient Parallel FFT Architecture on FPGAs
Ren Chen, USC; Neungsoo Park, Konkuk University; Viktor Prasanna, University of Southern California
Abstract: To process streaming data, throughput is a key performance metric for FFT designs. However, high throughput FFT
architectures consume large amounts of power due to complex routing and memory access. In this paper, we propose a high throughput energy efficient multi-FFT architecture based on the radix-x Cooley-Tukey algorithm. In the proposed architecture, multiple time-multiplexed pipeline FFT processors are used to achieve the same throughput as a fully spatial parallel FFT architecture. This design
avoids complex routing, thus reducing the interconnection power. Furthermore, a dynamic memory activation scheme is developed to
reduce the memory power. Post place-and-route results show that, for N-point FFT (64<=N<=4096), our designs improve the energy
efficiency (defined as GOPS/Joule) by 17% to 26%, compared with a state-of-the-art design. For various throughput requirements, the
proposed design achieves 50-63 GOPS/Joule, i.e., up to 78% of the peak energy efficiency of FFT designs on FPGAs.
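For readers unfamiliar with the butterfly structure that pipeline FFT stages implement, here is a scalar radix-2 Cooley-Tukey reference in C (the paper's architecture is radix-x and hardware-pipelined; this sketch only illustrates the log2(N) stages of N/2 butterflies each that such pipelines unroll):

    #include <complex.h>
    #include <math.h>

    /* In-place iterative radix-2 DIT FFT; n must be a power of two. */
    void fft(double complex *x, unsigned n) {
        const double PI = acos(-1.0);
        /* Bit-reversal permutation of the input ordering. */
        for (unsigned i = 1, j = 0; i < n; i++) {
            unsigned bit = n >> 1;
            for (; j & bit; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
        }
        /* log2(n) butterfly stages; each maps to one pipeline stage in hardware. */
        for (unsigned len = 2; len <= n; len <<= 1) {
            double complex w = cexp(-2.0 * PI * I / len);
            for (unsigned i = 0; i < n; i += len) {
                double complex wk = 1.0;
                for (unsigned k = 0; k < len / 2; k++) {
                    double complex u = x[i + k];
                    double complex v = x[i + k + len / 2] * wk;
                    x[i + k]           = u + v;   /* butterfly top    */
                    x[i + k + len / 2] = u - v;   /* butterfly bottom */
                    wk *= w;
                }
            }
        }
    }

Time-multiplexing several such pipelines over shared memory banks, rather than laying the whole dataflow graph out spatially, is what shortens the routing that otherwise dominates interconnect power.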
Evaluating Energy Efficiency of Floating Point Matrix Multiplication on FPGAs
Kiran Matam, USC; Hoang Le, USC; Viktor Prasanna, University of Southern California
Abstract: Energy efficiency is emerging as one of the key metrics in scientific computing. In this work, we evaluate the energy efficiency of floating point matrix multiplication on state-of-the-art FPGAs. First, we implement a design parameterized with the problem
size and the type of storage memory. Next, to understand the efficiency of our implementations, we propose and implement a minimal
architecture to measure an upper bound on the energy efficiency of any matrix multiplication implementation. Lastly, we model and estimate the energy efficiency of large-scale matrix multiplication using external DRAM. Our implementations achieve energy efficiency of up to 7.07 and 2.28 GFlops/Joule at the worst-case signal rate for single and double precision, respectively. When compared to the measured upper bound, our implementations sustain up to 72% and 84% of it for single and double precision, respectively. For large-scale matrix multiplication, we estimate an energy efficiency of 5.21 and 1.60 GFlops/Joule for single and double
precision, respectively.
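As a worked example of the GFlops/Joule metric itself: a dense N x N matrix multiply performs 2N^3 floating point operations, and the metric divides that count by the energy (power times runtime) drawn during the run. The numbers below are hypothetical, chosen only to show the arithmetic, not measurements from the paper:

    #include <stdio.h>

    int main(void) {
        double N       = 1024.0;            /* problem size (assumed)          */
        double flops   = 2.0 * N * N * N;   /* multiplies + adds of dense GEMM */
        double power_w = 18.0;              /* hypothetical average power, W   */
        double time_s  = 0.021;             /* hypothetical runtime, s         */
        double energy_j = power_w * time_s;
        printf("%.2f GFlops/Joule\n", flops / energy_j / 1e9);  /* ~5.68 */
        return 0;
    }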
Image Search System
Peter Cho, MIT Lincoln Lab; Michael Yee, MIT Lincoln Lab
Abstract: Digital images are currently shot and stored in vast numbers. Billions of photos and video clips may now be accessed via the public internet and private offline archives. Yet navigating through image repositories generally requires clicking through seas of thumbnails.
Aside from occasional human-tagged keywords, little connection typically exists between archived images to help users find stills or
frames of interest. New search capabilities are consequently needed to mine huge imagery volumes. In this talk, we present a prototype
system which enables user exploration of global structure as well as individual picture drill-down for O(10^4) images. Our search engine
builds upon advances made over the past decade in SIFT matching, image feature clustering and large data handling. Using our software
tools, image analysts can investigate disparate data sets such as 30K photos shot semi-cooperatively around MIT, 6K Grand Canyon JPEG
files downloaded from Flickr, and 2K video frames extracted from a news broadcast. All these data sets were processed on Lincoln
Laboratory's grid cluster. Our system's netcentric design enables multi-user collaboration on different sets of archived pictures. The
search engine is based upon a PostgreSQL database which stores topological relationships, space-time metadata and derived image
attributes. Its front-end includes a web browser thin client and graph viewer thick client whose states remain synchronized. The browser
displays a currently selected image as well as sibling thumbnails. The graph viewer represents digital pictures as nodes whose coloring
indicates SIFT feature overlap. Hierarchical node clustering generates family pyramid structures for a priori unorganized sets of input
images. Thin and thick client perusing of higher pyramid levels provides a practical means for gaining comprehensive insight into large
data collections. Users can assign captions and annotate image regions of interest with our system. Moreover, various known or
automatically derived attributes are searchable. For example, air/ground views and EO/IR sensing modalities are readily identified via
graph node coloring. Color histograms for individual pictures and dominant color content for entire collections may similarly be retrieved
from the database. Human face detection results based upon work by Kalal, Matas and Mikolajczyk (2008) are also exploitable in the thin
and thick clients. We close by discussing relations automatically uncovered among YouTube video clips of the April 2013 Boston
bombings. The results compellingly demonstrate our search system's potential for exploiting massive, social media imagery archives.
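While the abstract does not spell out how SIFT feature overlap is scored, a plausible minimal sketch is Lowe's ratio test over brute-force descriptor matching. In the C function below, overlap_score, the ratio parameter, and the 0.8 threshold in the usage comment are all assumptions rather than the system's actual scoring; the returned match fraction could serve as a graph edge weight for node coloring:

    #include <float.h>

    #define DIM 128  /* SIFT descriptor dimensionality */

    static double dist2(const float *a, const float *b) {
        double d = 0.0;
        for (int k = 0; k < DIM; k++) {
            double t = (double)a[k] - (double)b[k];
            d += t * t;
        }
        return d;
    }

    /* Fraction of descriptors in A whose nearest neighbor in B passes
     * Lowe's ratio test (squared distances, hence ratio squared).
     * Typical usage: overlap_score(A, na, B, nb, 0.8). */
    double overlap_score(const float (*A)[DIM], int na,
                         const float (*B)[DIM], int nb, double ratio) {
        int matches = 0;
        for (int i = 0; i < na; i++) {
            double best = DBL_MAX, second = DBL_MAX;
            for (int j = 0; j < nb; j++) {
                double d = dist2(A[i], B[j]);
                if (d < best)        { second = best; best = d; }
                else if (d < second) { second = d; }
            }
            if (nb >= 2 && best < ratio * ratio * second) matches++;
        }
        return na > 0 ? (double)matches / na : 0.0;
    }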
Optimizing Performance of HPC Storage Systems: Optimizing Performance for Reads and Writes
Torben Kling Petersen, Xyratex; John Fragalla, Xyratex
Abstract: The performance of HPC storage systems depends upon a variety of factors. The results of using any of the standard benchmark suites for storage depend not only on the storage architecture, but also on the type of disk drives, the type and design of the interconnect, and the type and number of clients. In addition, each of the benchmark suites has a number of different parameters and
test methodologies that require careful analysis to determine the optimal settings for a successful benchmark run. To reliably
benchmark a storage solution, every stage of the solution needs to be analyzed including block and file performance of the RAID,
network and client throughput to the entire filesystem and metadata servers. For a filesystem to deliver peak performance, there
needs to be a balance between the actual performance of the disk drives, the SAS chain supporting the RAID sets, the RAID code used
(whether hardware RAID controllers or software MD-RAID), the interconnect and finally the clients. This paper describes these issues
with respect to the Lustre filesystem. The dependence of benchmark results on various parameters is shown. Using a single storage enclosure consisting of 8 RAID sets (8+2 drives each), it is possible to achieve both read and write performance in excess of 6 GB/s, which translates to more than 36 GB/s per rack of measured client-based throughput. This paper will focus on using the Linux performance tool obdfilter-survey and IOR to measure performance at different levels of the Lustre filesystem.
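As a minimal illustration of the kind of client-side measurement an IOR run performs, the C probe below streams 1 GiB to a file and reports sequential write bandwidth (the /mnt/lustre path is a placeholder, and a real benchmark would add many clients, file-per-process layouts, and page-cache control):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    #define BUF_SIZE (1 << 20)  /* 1 MiB transfer size */
    #define BLOCKS   1024       /* 1 GiB total         */

    int main(void) {
        char *buf = malloc(BUF_SIZE);
        memset(buf, 0xA5, BUF_SIZE);
        int fd = open("/mnt/lustre/bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < BLOCKS; i++)
            if (write(fd, buf, BUF_SIZE) != BUF_SIZE) { perror("write"); return 1; }
        fsync(fd);  /* flush dirty pages so the timer covers real device I/O */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);

        double s  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double gb = (double)BLOCKS * BUF_SIZE / 1e9;
        printf("%.2f GB in %.2f s -> %.2f GB/s\n", gb, s, gb / s);
        free(buf);
        return 0;
    }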
A Clustered Manycore Processor Architecture for Embedded and Accelerated Applications
Benoit Dupont de Dinechin, Kalray
Abstract: The Kalray MPPA-256 processor integrates 256 user cores and 32 system cores on a single chip in 28nm CMOS technology. Each
core implements a 32-bit 5-issue VLIW architecture. These cores are distributed across 16 compute clusters of 16+1 cores, and 4 quad-
core I/O subsystems. Each compute cluster and I/O subsystem owns a private address space, while communication and synchronization
between them is ensured by data and control Networks-On-Chip (NoC). The MPPA-256 processor is also fitted with a variety of I/O
controllers, in particular DDR, PCI, Ethernet, Interlaken and GPIO. We demonstrate that the MPPA-256 processor clustered manycore
architecture is effective on two different classes of applications: embedded computing, with the implementation of a professional H.264
video encoder that runs in real-time at low power; and high-performance computing, with the acceleration of a financial option pricing
application. In the first case, a cyclostatic dataflow programming environment is exploited that automates application distribution over the execution resources. In the second case, an explicit parallel programming model based on POSIX processes, threads, and IPC is used.
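To make the second programming model concrete, here is a minimal sketch of explicitly parallel option pricing with POSIX threads; the Black-Scholes Monte Carlo payoff, the parameters, and the thread count are illustrative assumptions rather than Kalray's actual application:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <pthread.h>

    #define THREADS 16           /* e.g., one thread per core in a compute cluster */
    #define PATHS_PER_THREAD 100000

    static const double S0 = 100.0, K = 105.0, r = 0.02, sigma = 0.2, T = 1.0;

    typedef struct { unsigned seed; double sum; } worker_t;

    /* Box-Muller standard normal from rand_r (illustrative, not a production RNG). */
    static double gauss(unsigned *seed) {
        double u1 = (rand_r(seed) + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand_r(seed) + 1.0) / ((double)RAND_MAX + 2.0);
        return sqrt(-2.0 * log(u1)) * cos(2.0 * acos(-1.0) * u2);
    }

    /* Each worker accumulates European call payoffs over its own paths. */
    static void *price(void *arg) {
        worker_t *w = arg;
        double drift = (r - 0.5 * sigma * sigma) * T, vol = sigma * sqrt(T);
        for (int i = 0; i < PATHS_PER_THREAD; i++) {
            double ST = S0 * exp(drift + vol * gauss(&w->seed));
            w->sum += (ST > K) ? ST - K : 0.0;
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid[THREADS];
        worker_t  w[THREADS];
        for (int t = 0; t < THREADS; t++) {
            w[t].seed = 1234u + (unsigned)t;  /* distinct stream per thread */
            w[t].sum  = 0.0;
            pthread_create(&tid[t], NULL, price, &w[t]);
        }
        double total = 0.0;
        for (int t = 0; t < THREADS; t++) {
            pthread_join(tid[t], NULL);
            total += w[t].sum;
        }
        double mean = total / ((double)THREADS * PATHS_PER_THREAD);
        printf("call price ~= %.4f\n", exp(-r * T) * mean);
        return 0;
    }

On a clustered manycore like the MPPA-256, the same decomposition would place one such worker group per compute cluster, with the NoC carrying the final reduction.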