2013 IEEE High Performance Extreme Computing Conference (HPEC '13)
Seventeenth Annual HPEC Conference
10-12 September 2013
Westin Hotel, Waltham, MA USA
LLSuperCloud: Sharing HPC Systems for Diverse Rapid Prototyping
Albert Reuther, MIT Lincoln Laboratory; Jeremy Kepner, MIT
Abstract: The supercomputing and enterprise computing arenas come from very different lineages. However, the advent of
commodity computing servers has brought the two arenas closer than they have ever been. Within enterprise computing,
commodity computing servers have resulted in the development of a wide range of new cloud capabilities: elastic computing,
virtualization, and data hosting. Similarly, the supercomputing community has developed new capabilities in heterogeneous,
massively parallel hardware and software. Merging the benefits of enterprise clouds and supercomputing has been a challenging
goal. Significant effort has been expended in trying to deploy supercomputing capabilities on cloud computing systems. These efforts
have resulted in unreliable, low-performance solutions that require enormous expertise to maintain. LLSuperCloud provides a
novel solution to the problem of merging enterprise cloud and supercomputing technology. More specifically, LLSuperCloud reverses
the traditional paradigm of attempting to deploy supercomputing capabilities on a cloud and instead deploys cloud capabilities on a
supercomputer. The result is a system that can handle heterogeneous, massively parallel workloads while also providing
high-performance elastic computing, virtualization, and databases. The benefits of LLSuperCloud are highlighted using a mixed workload
of C MPI, parallel Matlab, Java, databases, and virtualized web services.
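In practice, the reversed paradigm amounts to submitting cloud services as ordinary scheduler jobs alongside HPC jobs. A minimal sketch of that idea follows, assuming a SLURM-style scheduler; the commands are illustrative stand-ins, not LLSuperCloud's actual tooling:

    import subprocess

    # Illustrative only: submit a traditional HPC job and a long-running
    # service job through the same scheduler (assumes SLURM's sbatch).
    jobs = [
        "mpirun ./simulate",            # C MPI compute job
        "python -m http.server 8080",   # stand-in for a virtualized web service
    ]
    for cmd in jobs:
        subprocess.run(["sbatch", "--wrap", cmd], check=True)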
An Improved Eigensolver for Quantum-dot Cellular Automata Simulations
Aaron Baldwin, Valparaiso University; Jeffrey Will, Valparaiso University; Douglas Tougaw, Valparaiso University
Abstract: The work in this paper describes the application of an optimized eigensolver algorithm to produce the kernel calculations
for simulating quantum-dot cellular automata (QCA) circuits, an emerging implementation of quantum computing. The application
of the locally optimal block preconditioned conjugate gradient (LOBPCG) method to calculate the eigenvalues and eigenvectors for
this simulation was shown to exhibit a 15.6× speedup over the commonly used QR method for a representative simulation and has
specific advantages for the Hermitian, positive-definite, sparse matrices commonly encountered in simulating the time-independent
Schrödinger equation. We present the computational savings for a simulation analyzing the effect of stray charges near a four-cell
line of QCA cells with a single driver cell, and we discuss implications for wider application. We further discuss issues of problem
preconditioning that are specific to QCA simulation when utilizing the LOBPCG method.
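A minimal sketch of the kernel calculation using SciPy's LOBPCG implementation; the tridiagonal matrix below is only a stand-in for the paper's QCA Hamiltonians, and the block size, tolerance, and iteration limit are assumptions:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import lobpcg

    # Stand-in sparse Hermitian matrix; the paper's matrices arise from
    # the time-independent Schrodinger equation for lines of QCA cells.
    n = 2000
    main = np.linspace(1.0, 2.0, n)
    off = -0.1 * np.ones(n - 1)
    H = sp.diags([off, main, off], [-1, 0, 1], format="csr")

    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, 4))   # block of 4 starting vectors

    # largest=False requests the smallest eigenpairs (ground states);
    # a problem-specific preconditioner would be passed via M=.
    vals, vecs = lobpcg(H, X, tol=1e-8, maxiter=500, largest=False)
    print(vals)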
PAKCK: Performance and Power Analysis of Key Computational Kernels on CPUs and GPUs
Julia Mullen, MIT Lincoln Laboratory; Michael Wolf, MIT Lincoln Laboratory; Anna Klein, MIT Lincoln Laboratory
Abstract: Recent projections suggest that applications and architectures will need to attain 75 GFLOPS/W in order to support future
DoD missions. Meeting this goal requires deeper understanding of kernel and application performance as a function of power and
architecture. As part of the PAKCK study, a set of DoD application areas, including signal and image processing and big data/graph
computation, were surveyed to identify performance critical kernels relevant to DoD missions. From that survey, we present the
characterization of dense matrix-vector product, two-dimensional FFTs, and sparse matrix-dense vector multiplication on the NVIDIA
Fermi and Intel Sandy Bridge architectures. We describe the methodology that was developed for characterizing power usage and
performance on these architectures and present power usage and performance per Watt for all three kernels. Our results indicate
that 75 GFLOPS/W is a very challenging target for these kernels, especially the sparse kernels, whose performance was orders of
magnitude lower than that of the dense kernels.
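The performance-per-Watt metric itself is simple arithmetic; a minimal sketch, where the kernel size, runtime, and power draw are illustrative assumptions rather than measurements from the study:

    # GFLOPS/W = (floating-point operations / runtime) / average power
    def gflops_per_watt(flop_count, runtime_s, avg_power_w):
        gflops = flop_count / runtime_s / 1e9
        return gflops / avg_power_w

    # A dense n x n matrix-vector product performs roughly 2*n^2 flops.
    n = 4096
    print(gflops_per_watt(2 * n * n, runtime_s=1.0e-3, avg_power_w=120.0))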
CrowdCL: Web-Based Volunteer Computing with WebCL
Tommy MacWilliam, Harvard University; Cris Cecka, Harvard University
Abstract: We present CrowdCL, an open-source framework for the rapid development of volunteer computing and OpenCL
applications on the web. Drawing inspiration from existing GPU libraries like PyCUDA, CrowdCL provides an abstraction layer for
WebCL aimed at reducing boilerplate and improving code readability. CrowdCL also provides developers with a framework to easily
run computations in the background of a web page, which allows developers to distribute computations across a network of clients
and aggregate results on a centralized server. We compare the performance of CrowdCL against serial implementations in JavaScript
and Java across a variety of platforms. Our benchmark results show strong promise for the web browser as a high-performance
distributed computing platform.
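The distribute-and-aggregate pattern behind the framework can be sketched independently of the browser; a minimal single-process illustration with hypothetical names (this is not CrowdCL's actual API, and the summation stands in for a WebCL kernel):

    from queue import Queue

    # A server shards a job into work units; volunteer clients pull
    # units, compute, and return partial results for aggregation.
    work_units = Queue()
    for lo in range(0, 1_000_000, 100_000):
        work_units.put((lo, lo + 100_000))

    partials = []
    while not work_units.empty():
        lo, hi = work_units.get()            # a client fetches a unit
        partials.append(sum(range(lo, hi)))  # stand-in for a kernel run

    print(sum(partials))                     # server-side aggregation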
GPU-Based Space-Time Adaptive Processing (STAP) for Radar
Thomas Benson, Georgia Tech Research Institute; Ryan Hersey, Georgia Tech Research Institute; Edwin Culpepper, AFRL
Abstract: Space-time adaptive processing (STAP) utilizes a two-dimensional adaptive filter to detect targets within a radar data set
that move at speeds similar to the background clutter. While adaptively optimal solutions exist, they are prohibitively computationally
intensive. Thus, researchers have developed alternative algorithms with nearly optimal filtering performance and greatly reduced
computational intensity. While such alternatives reduce the computational requirements, the computational burden remains
significant, and efficient implementations of such algorithms remain an area of active research. This paper focuses on an efficient
graphics processor unit (GPU) based implementation of the extended factored algorithm (EFA) using the common unified device
architecture (CUDA) framework provided by NVIDIA.
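The per-Doppler-bin kernel at the heart of such implementations is the adaptive weight solve; a NumPy sketch of the generic textbook formulation with made-up problem sizes (this is not the paper's EFA/CUDA code):

    import numpy as np

    def stap_weights(snapshots, steering):
        # snapshots: (channels, training) complex clutter training data
        R = snapshots @ snapshots.conj().T / snapshots.shape[1]
        # Diagonal loading for numerical robustness.
        R += 1e-3 * (np.trace(R).real / R.shape[0]) * np.eye(R.shape[0])
        w = np.linalg.solve(R, steering)   # w proportional to R^{-1} s
        return w / (steering.conj() @ w)   # unit gain in the steering direction

    rng = np.random.default_rng(1)
    N = 16                                 # degrees of freedom after factoring
    X = (rng.standard_normal((N, 4 * N)) +
         1j * rng.standard_normal((N, 4 * N))) / np.sqrt(2)
    s = np.exp(2j * np.pi * 0.1 * np.arange(N)) / np.sqrt(N)
    w = stap_weights(X, s)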