2013 IEEE High Performance Extreme Computing Conference (HPEC '13)
Seventeenth Annual HPEC Conference
10-12 September 2013
Westin Hotel, Waltham, MA USA
Clouds & Grids I
LLSuperCloud: Sharing HPC Systems for Diverse Rapid Prototyping
Albert Reuther, MIT Lincoln Laboratory; Jeremy Kepner, MIT
Abstract: The supercomputing and enterprise computing arenas come from very different lineages. However, the advent of commodity computing servers has brought the two arenas closer than they have ever been. Within enterprise computing, commodity computing servers have resulted in the development of a wide range of new cloud capabilities: elastic computing, virtualization, and data hosting. Similarly, the supercomputing community has developed new capabilities in heterogeneous, massively parallel hardware and software. Merging the benefits of enterprise clouds and supercomputing has been a challenging goal. Significant effort has been expended in trying to deploy supercomputing capabilities on cloud computing systems. These efforts have resulted in unreliable, low-performance solutions that require enormous expertise to maintain. LLSuperCloud provides a novel solution to the problem of merging enterprise cloud and supercomputing technology. More specifically, LLSuperCloud reverses the traditional paradigm of attempting to deploy supercomputing capabilities on a cloud and instead deploys cloud capabilities on a supercomputer. The result is a system that can handle heterogeneous, massively parallel workloads while also providing high-performance elastic computing, virtualization, and databases. The benefits of LLSuperCloud are highlighted using a mixed workload of C MPI, parallel Matlab, Java, databases, and virtualized web services.

An Improved Eigensolver for Quantum-dot Cellular Automata Simulations
Aaron Baldwin, Valparaiso University; Jeffrey Will, Valparaiso University; Douglas Tougaw, Valparaiso University
Abstract: This paper describes the application of an optimized eigensolver algorithm to the kernel calculations for simulating quantum-dot cellular automata (QCA) circuits, an emerging implementation of quantum computing. Using the locally optimal block preconditioned conjugate gradient (LOBPCG) method to calculate the eigenvalues and eigenvectors for this simulation yielded a 15.6x speedup over the commonly used QR method for a representative simulation. The method has specific advantages for the Hermitian, positive-definite, sparse matrices commonly encountered in simulating the time-independent Schrödinger equation. We present the computational savings for a simulation analyzing the effect of stray charges near a four-cell line of QCA cells with a single driver cell, and we discuss implications for wider application. We further discuss issues of problem preconditioning that are specific to QCA simulation when utilizing the LOBPCG method.

PAKCK: Performance and Power Analysis of Key Computational Kernels on CPUs and GPUs
Julia Mullen, MIT Lincoln Laboratory; Michael Wolf, MIT Lincoln Laboratory; Anna Klein, MIT Lincoln Laboratory
Abstract: Recent projections suggest that applications and architectures will need to attain 75 GFLOPS/W in order to support future DoD missions. Meeting this goal requires a deeper understanding of kernel and application performance as a function of power and architecture. As part of the PAKCK study, a set of DoD application areas, including signal and image processing and big data/graph computation, was surveyed to identify performance-critical kernels relevant to DoD missions.
From that survey, we present the characterization of dense matrix-vector product, two-dimensional FFTs, and sparse matrix-dense vector multiplication on the NVIDIA Fermi and Intel Sandy Bridge architectures. We describe the methodology that was developed for characterizing power usage and performance on these architectures and present power usage and performance per watt for all three kernels. Our results indicate that 75 GFLOPS/W is a very challenging target for these kernels, especially for the sparse kernels, whose performance was orders of magnitude lower than that of the dense kernels.

CrowdCL: Web-Based Volunteer Computing with WebCL
Tommy MacWilliam, Cris Cecka, Harvard University
Abstract: We present CrowdCL, an open-source framework for the rapid development of volunteer computing and OpenCL applications on the web. Drawing inspiration from existing GPU libraries like PyCUDA, CrowdCL provides an abstraction layer for WebCL aimed at reducing boilerplate and improving code readability. CrowdCL also provides developers with a framework to easily run computations in the background of a web page, which allows developers to distribute computations across a network of clients and aggregate results on a centralized server. We compare the performance of CrowdCL against serial implementations in JavaScript and Java across a variety of platforms. Our benchmark results show strong promise for the web browser as a high-performance distributed computing platform.

GPU-Based Space-Time Adaptive Processing (STAP) for Radar
Thomas Benson, Georgia Tech Research Institute; Ryan Hersey; Edwin Culpepper, AFRL
Abstract: Space-time adaptive processing (STAP) utilizes a two-dimensional adaptive filter to detect targets in a radar data set whose speeds are similar to that of the background clutter. While adaptively optimal solutions exist, they are prohibitively computationally intensive.
Thus, researchers have developed alternative algorithms with nearly optimal filtering performance and greatly reduced computational intensity. While such alternatives reduce the computational requirements, the computational burden remains significant, and efficient implementation of such algorithms remains an area of active research. This paper focuses on an efficient graphics processing unit (GPU) based implementation of the extended factored algorithm (EFA) using the Compute Unified Device Architecture (CUDA) framework provided by NVIDIA.
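
For orientation, the adaptively optimal filter that the STAP abstract describes as prohibitively expensive at realistic sizes can be sketched on a toy problem. The sketch below is not the paper's EFA or its CUDA implementation; it assumes the textbook sample-matrix-inversion weights w = R^{-1} v / (v^H R^{-1} v), and the array size, frequencies, power levels, and all function names are hypothetical choices made only for illustration.

```python
import cmath
import random


def steering_vector(n_elements, n_pulses, spatial_freq, doppler_freq):
    """Space-time steering vector: Kronecker product of two phase ramps."""
    spatial = [cmath.exp(2j * cmath.pi * spatial_freq * n) for n in range(n_elements)]
    temporal = [cmath.exp(2j * cmath.pi * doppler_freq * m) for m in range(n_pulses)]
    return [t * s for t in temporal for s in spatial]


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def sample_covariance(snapshots, loading=1e-3):
    """Average of x x^H over training snapshots, plus diagonal loading."""
    dim = len(snapshots[0])
    cov = [[0j] * dim for _ in range(dim)]
    for x in snapshots:
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += x[i] * x[j].conjugate() / len(snapshots)
    for i in range(dim):
        cov[i][i] += loading
    return cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        acc = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / aug[i][i]
    return x


def adaptive_weights(cov, target):
    """w = R^{-1} v / (v^H R^{-1} v): unit gain on the target steering vector."""
    r_inv_v = solve(cov, target)
    scale = inner(target, r_inv_v)
    return [w / scale for w in r_inv_v]


# Hypothetical scenario: 4 elements x 4 pulses, one strong clutter patch,
# and a target at the same angle but a different Doppler frequency.
rng = random.Random(0)
clutter = steering_vector(4, 4, 0.1, 0.1)
target = steering_vector(4, 4, 0.1, 0.3)


def gaussian(scale):
    """Complex Gaussian sample with the given per-component standard deviation."""
    return complex(rng.gauss(0, scale), rng.gauss(0, scale))


snapshots = []
for _ in range(64):
    amplitude = gaussian(10.0)  # strong clutter return
    snapshots.append([amplitude * c + gaussian(0.1) for c in clutter])

cov = sample_covariance(snapshots)
weights = adaptive_weights(cov, target)
target_gain = abs(inner(weights, target))
clutter_gain = abs(inner(weights, clutter))
print(f"target gain:  {target_gain:.6f}")
print(f"clutter gain: {clutter_gain:.6f}")
```

The linear solve scales cubically with the number of elements times the number of pulses, which is the cost that motivates the reduced-complexity alternatives mentioned in the abstract.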