2012 IEEE High Performance
Extreme Computing Conference
(HPEC ‘12)
Sixteenth Annual HPEC Conference
10 - 12 September 2012
Westin Hotel, Waltham, MA USA
Multithreaded FPGA Acceleration of DNA Sequence Mapping
Walid Najjar*, UC Riverside; Edward Fernandez, University of California, Riverside; Stefano Lonardi, University of California, Riverside; Jason
Villarreal, Jacquard Computing Inc.
Abstract:
In bioinformatics, short read alignment is a computationally intensive operation that involves matching millions or billions of short strings (called
reads) against a reference genome. A representative run would require matching tens of millions of reads, each about 100 symbols long, against a genome that can consist of a few billion characters. Existing short read aligners are expected to report all the occurrences of each read as well
as allow users to control the number of allowed mismatches between reads and reference genome. Popular software implementations such as
Bowtie [8] or BWA [10] can take many hours or days to execute, making the problem an ideal candidate for hardware acceleration. In this paper,
we describe FHAST (FPGA Hardware Accelerated Sequencing-matching Tool), a hardware accelerator that acts as a drop-in replacement for
short read alignment software. Our architecture masks memory latency by executing many concurrent hardware threads accessing memory
simultaneously and consists of multiple parallel engines to exploit the parallelism available to us on an FPGA. We have implemented and tested
FHAST on the Convey HC-1 [9], taking advantage of the large amount of memory bandwidth available to the system and the shared memory
image between hardware and software. We compare the runtime of FHAST against that of Bowtie on the Convey HC-1, on human chromosome 14 as well as the entire human genome, and show up to ~70X improvement in total end-to-end execution time, reducing runs that take several hours to a few minutes. We also show that FHAST scales more favorably when extended to multiple FPGAs than Bowtie does when extended to multiple CPUs.
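To make the kernel concrete, here is a minimal Python sketch (illustrative only; the function and parameter names are invented, and FHAST itself is an FPGA design, not software) of the operation being accelerated: reporting every position where a read matches the reference with at most a given number of mismatches.

    # Toy model of the short read alignment task: report every position where
    # `read` matches `reference` with at most `max_mm` mismatches. FHAST runs
    # many such searches as concurrent hardware threads to mask memory latency.
    def occurrences(reference, read, max_mm):
        m = len(read)
        for pos in range(len(reference) - m + 1):
            mm = 0
            for a, b in zip(reference[pos:pos + m], read):
                if a != b:
                    mm += 1
                    if mm > max_mm:
                        break
            else:
                yield pos, mm   # loop finished without exceeding the budget

    ref = "ACGTACGTTACGT"
    print(list(occurrences(ref, "ACGT", max_mm=1)))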
:::::::
Floating Point Vector Processing using 28nm FPGAs
Michael Parker*, Altera Corporation; Dan Pritsker, Altera Corporation
Abstract:
Vector processing is useful for implementing the linear algebra algorithms used in many commercial, government, and military applications. Typically, this is done in software on specialized multi-core CPU or GPU architectures. A compelling alternative is an FPGA-based implementation using single-precision floating point. This paper examines the implementation of one such algorithm, QR decomposition with back substitution, a common method for solving non-square, over-determined systems of equations. This has been implemented
using a mid-sized 28nm FPGA. Performance (GFLOPs), throughput, Fmax and power consumption are measured. The algorithm is implemented
as a parameterizable core, which can be easily configured for all the matrix sizes benchmarked herein.
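For reference, a minimal NumPy sketch of the algorithm the core implements, QR decomposition followed by back substitution for an over-determined least-squares problem; the matrix sizes here are arbitrary, not the benchmarked ones.

    import numpy as np

    def qr_solve(A, b):
        """Least-squares solve of A x ~= b via QR and back substitution."""
        Q, R = np.linalg.qr(A)               # A is m x n, m >= n; R is upper triangular
        y = Q.T @ b                          # project b onto the orthonormal basis
        x = np.zeros_like(y)
        for i in range(len(y) - 1, -1, -1):  # back substitution on R x = y
            x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
        return x

    A = np.random.randn(200, 50)
    b = np.random.randn(200)
    print(np.allclose(qr_solve(A, b), np.linalg.lstsq(A, b, rcond=None)[0]))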
:::::::
Reconfigurable Advanced Rapid-prototyping Environment (RARE)
Michael Bonato*, Colorado Engineering, Inc.
Abstract:
The size, weight, and power (SWaP) budgets available to support embedded computing in airborne platforms, especially unmanned vehicles, are
typically very constrained. Legacy backplane-centric approaches limit a designer’s flexibility to implement high performance computing in these
tight spaces. This paper describes an out-of-the-box architectural approach, developed under a program sponsored by the Missile Defense
Agency (MDA), enabling modular, scalable, high performance embedded computing within challenging SWaP footprints.
:::::::
Large Scale Network Situational Awareness via 3D Gaming Technology
Matthew Hubbell*, MIT Lincoln Laboratory
Abstract:
Obtaining situational awareness of network activity across an enterprise presents unique visualization challenges. IT analysts are required to
quickly gather and correlate large volumes of disparate data to identify the existence of anomalous behavior. This paper will show how the MIT
Lincoln Laboratory LLGrid Team has approached obtaining network situational awareness utilizing the Unity 3D video game engine. We have
developed a 3D environment of the physical plant in the format of a networked multiplayer First Person Shooter (FPS) to demonstrate a virtual
depiction of the current state of the network and the machines operating on the network. Within the game or virtual world an analyst or player can
gather critical information on all network assets as well as perform physical system actions on machines in question. 3D gaming technology
provides tools to create an environment that is both visually familiar to the player and able to display immense amounts of system data in a meaningful, easy-to-absorb format. Our prototype system was able to monitor and display 5000 assets in ~10% of our network time window.
:::::::
Scalable Cyber-Security for Terabit Cloud Computing
Jordi Ros-Giralt*, Reservoir Labs
Abstract:
This paper addresses the problem of scalable cyber-security using a cloud computing architecture. Scalability is treated in two contexts: (1)
performance and power efficiency and (2) degree of cyber security-relevant information detected by the cyber-security cloud (CSC). We provide a
framework to construct CSCs, which derives from a set of fundamental building blocks (forwarders, analyzers and grounds) and the identification
of the smallest functional units (atomic CSC cells or simply aCSC cells) capable of embedding the full functionality of the cyber-security cloud.
aCSC cells are then studied and several high-performance algorithms are presented to optimize the system’s performance and power efficiency.
Among these, a new queuing policy, called tail early detection (TED), is introduced to proactively drop packets in a way that maximizes the degree of detected information while saving power by avoiding spending cycles on less relevant traffic components. We also show that it is
possible to use aCSC cells as core building blocks to construct arbitrarily large cyber-security clouds by structuring the cells using a hierarchical
architecture. To demonstrate the utility of our framework, we implement one cyber-security “mini-cloud” on a single-chip prototype based on Tilera’s TILEPro64 processor, demonstrating performance of up to 10 Gbps.
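As a rough illustration of the TED idea (the thresholds and relevance scoring below are invented; the paper's policy is more refined), a queue can begin dropping low-relevance arrivals before it is full, so no cycles are spent on traffic unlikely to contribute detected information:

    from collections import deque

    class TEDQueue:
        """Toy early-drop queue: a stand-in for the tail early detection policy."""
        def __init__(self, capacity, early_threshold):
            self.q = deque()
            self.capacity = capacity
            self.early_threshold = early_threshold  # occupancy where early drops begin

        def enqueue(self, packet, relevance):
            if len(self.q) >= self.capacity:
                return False        # hard tail drop: queue is full
            if len(self.q) >= self.early_threshold and relevance < 0.5:
                return False        # early drop: not worth spending cycles on
            self.q.append(packet)
            return True

    q = TEDQueue(capacity=8, early_threshold=4)
    print([q.enqueue(i, relevance=0.3 if i % 2 else 0.9) for i in range(12)])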
:::::::
Scalable Cryptographic Authentication for High Performance Computing
Andrew Prout*, MIT Lincoln Laboratory
Abstract:
High performance computing (HPC) uses supercomputers and computing clusters to solve large computational problems. Frequently HPC
resources are shared systems and access to restricted data sets or resources must be authenticated. These authentication needs can take
multiple forms, both internal and external to the HPC cluster. A computational stack that uses web services among nodes in the HPC cluster may need to perform authentication between nodes of the same job, or a job may need to reach out to data sources outside the HPC cluster. Traditional authentication mechanisms such as passwords or digital certificates encounter issues with the distributed and potentially disconnected nature of HPC systems. Distributing and storing plain-text passwords or cryptographic keys among nodes in an HPC system without special protection is a poor security
practice. Systems to reach back to the user’s terminal for access to the authenticator are possible, but only in fully-interactive supercomputing
where connectivity to the user’s terminal can be guaranteed. Point solutions, such as software-based roles or self-signed certificates, can be enabled for these use cases; however, they require significant expertise in digital certificates to configure. A more general solution is called for that is
both secure and easy to use. This paper presents an overview of a solution implemented on the interactive, on-demand LLGrid computing system
[3,4,5] at MIT Lincoln Laboratory and its use to solve one such authentication problem.
:::::::
On an MPI Rank/Node Layout Utility for Improving Performance of Communications Intensive Heterogeneous MPI Applications on
SGI Altix ICE 8200 Systems
Bracy Elton*, DRC
Abstract:
:::::::
An update on Scalable Implementation of Primitives for Homomorphic EncRyption – FPGA implementation using Simulink
David Cousins*, Raytheon BBN Technologies
Abstract:
Accelerating the development of a practical Fully Homomorphic Encryption (FHE) scheme is the goal of the DARPA PROCEED program. For the
past year, this program has had as its focus the acceleration of various aspects of the FHE concept toward practical implementation and use. FHE
would be a game-changing technology to enable secure, general computation on encrypted data, e.g., on untrusted off-site hardware. However,
FHE will still require several orders of magnitude improvement in computation before it will be practical for widespread use. Recent theoretical
breakthroughs demonstrated the existence of FHE schemes [1, 2], and to date much progress has been made in both algorithmic and
implementation improvements. Specifically, our contribution to the PROCEED program has been the development of FPGA-based hardware primitives to accelerate computation on encrypted data using FHE based on lattice techniques [3]. Our project, SIPHER, has been using a state-of-the-art toolchain developed by MathWorks to implement VHDL code for FPGA circuits directly from Simulink models. Our baseline homomorphic encryption prototypes are developed directly in MATLAB, using the fixed-point toolbox to perform the required integer arithmetic. Constant improvements in algorithms require us to be able to quickly implement them in a high-level language such as MATLAB. We reported on
our initial results at HPEC 2011 [4]. In the past year, increases in algorithm complexity have introduced several new design requirements for our
FPGA implementation. This report presents new Simulink primitives that had to be developed to deal with these new requirements.
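For context, the class of integer arithmetic such primitives accelerate can be sketched in a few lines of Python: multiplication in the polynomial ring Z_q[x]/(x^n + 1), which underlies many lattice-based FHE schemes. The parameters below are toy values; the SIPHER primitives realize this kind of operation in fixed point on FPGAs.

    def ring_mul(a, b, n, q):
        """Multiply length-n coefficient lists a, b modulo x^n + 1 and modulo q."""
        c = [0] * n
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                k = i + j
                if k < n:
                    c[k] = (c[k] + ai * bj) % q          # ordinary term
                else:
                    c[k - n] = (c[k - n] - ai * bj) % q  # wrap-around: x^n = -1
        return c

    n, q = 8, 97
    print(ring_mul([1, 2, 0, 0, 5, 0, 0, 3], [4, 0, 1, 0, 0, 0, 2, 0], n, q))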
:::::::
Modeling Optical Scatter from Complex Aerosol Samples
Adam Milstein, MIT Lincoln Laboratory
Abstract:
MIT Lincoln Laboratory has developed an advanced modeling capability to investigate elastic scattering properties of biological and inert aerosols
at near- and mid-wave infrared wavelengths. The aerosol sample’s optical cross section and Mueller matrix across multiple scattering angles are
calculated and validated against optical measurements. The modeling approach uses the discrete dipole approximation to compute elastic optical
scatter from an ensemble of randomly-oriented particles of arbitrary shape, index of refraction, and size distribution. The calculation over large
ensembles of particles requires significant computational resources, and is thus implemented on LLGRID, a large parallel grid computer. This
presentation shows results for several types of particles, including bacterial spore clusters and dust particles, with comparison to microscopy and
optical measurements from the Standoff Aerosol Active Signature Testbed.
This work is sponsored under Air Force contract FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the
authors and are not necessarily endorsed by the United States Government.
:::::::
Scrubbing Optimization via Availability Prediction (SOAP) for Reconfigurable Space Computing
Quinn Martin*, University of Florida, NSF CHREC; Alan George, CHREC, ECE Dept. University of Florida
Abstract:
Reconfigurable computing with FPGAs can be highly effective in terms of performance, adaptability, and power for accelerating space
applications, but their configuration memory must be scrubbed to prevent the accumulation of single-event upsets. Many scrubbing techniques
currently exist, each with different advantages, making it difficult for the system designer to choose the optimal scrubbing strategy for a given
mission. This paper surveys the currently available scrubbing techniques and introduces the SOAP method for predicting system availability for
various scrubbing strategies using Markov models. We then apply the method to compare hypothetical Virtex-5 and Virtex-6 systems for blind,
CRC-32, and Frame ECC scrubbing strategies in LEO and HEO. We show that availability in excess of 5 nines can be obtained with modern,
FPGA-based systems using scrubbing. Furthermore, we show the value of the SOAP method by observing that different scrubbing strategies are
optimal for different types of missions.
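A minimal sketch of the prediction step, with invented states and rates (the paper builds mission- and strategy-specific models): availability is read off the steady-state distribution of a small Markov chain.

    import numpy as np

    # States: 0 = operational, 1 = upset (SEU latched), 2 = under scrub/repair.
    # P[i, j] is a per-step transition probability shaped by upset and scrub rates.
    P = np.array([[0.999, 0.001, 0.000],    # an SEU moves the system to "upset"
                  [0.000, 0.900, 0.100],    # the scrubber detects the upset
                  [0.500, 0.000, 0.500]])   # repair completes, back in service

    # Steady state: the eigenvector of P^T for eigenvalue 1, normalized to sum 1.
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi /= pi.sum()
    print("predicted availability:", pi[0])  # fraction of time operational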
:::::::
CUDA and OpenCL implementations of 3D CT reconstruction for Biomedical Imaging
Saoni Mukherjee*, Northeastern University
Abstract:
Biomedical image reconstruction applications with large datasets can benefit from acceleration. Graphics Processing Units (GPUs) are particularly useful in this context, as they can produce high-fidelity images rapidly. An algorithm to reconstruct cone-beam computed tomography (CT) images from 2D projections is implemented using GPUs. The implementation takes slices of the target, weights the projection data, filters the weighted data, and then backprojects it to create the final 3D reconstruction. This is implemented on two types of hardware: a CPU and a heterogeneous system combining CPU and GPU. The CPU code, in C and MATLAB, is compared with heterogeneous versions written in CUDA-C and OpenCL. The relative performance is tested and evaluated on a mathematical phantom as well as mouse data.
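A compact NumPy sketch of the weight/filter/backproject pipeline for a single 2D slice (the cone-beam version adds per-row geometric weighting; shapes and the ramp filter here are illustrative):

    import numpy as np

    def reconstruct(projections, angles):
        """projections: (n_angles, n_det) sinogram -> (n_det, n_det) slice."""
        n_angles, n_det = projections.shape
        ramp = np.abs(np.fft.fftfreq(n_det))              # ramp filter weights
        filtered = np.real(np.fft.ifft(np.fft.fft(projections, axis=1) * ramp, axis=1))
        grid = np.arange(n_det) - n_det / 2
        X, Y = np.meshgrid(grid, grid)
        image = np.zeros((n_det, n_det))
        for view, theta in zip(filtered, angles):         # backproject each view
            t = X * np.cos(theta) + Y * np.sin(theta) + n_det / 2
            image += np.interp(t, np.arange(n_det), view)
        return image * np.pi / n_angles

    sino = np.random.rand(180, 64)
    print(reconstruct(sino, np.deg2rad(np.arange(180))).shape)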
:::::::
Optimized Parallel Distribution Load Flow Solver on Commodity Multi-core CPU
Tao Cui*, Department of ECE, Carnegie Mellon University, Pittsburgh
Abstract:
Solving a large number of load flow problems quickly is required for various power system applications, especially in the smart grid, including Monte Carlo analysis, long-term steady-state simulation, and system benchmarking, among others. However, due to the computational burden, such applications are considered time consuming and infeasible for online or real-time use. In this work we developed a high-performance framework for high-throughput distribution load flow computation, taking advantage of performance-enhancing features of multi-core CPUs and various code optimization techniques. We optimized the data structure to better fit the memory hierarchy. We use SPIRAL as the code generator to exploit the inherent patterns of the load flow model. Multiple levels of parallelism, including SIMD and multithreading, are also used for the particular problem setup. The optimized solver is able to achieve more than 50% of a Core i7 CPU’s machine peak, which translates to solving millions of load flow problems within a second for the IEEE 37 test feeder. Finally, a scheduling threading structure is also designed to enable real-time application.
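The throughput idea can be sketched in NumPy by vectorizing over the scenario axis, which is the role SIMD lanes and threads play in the paper; the toy radial feeder and forward/backward sweep below are illustrative, not the paper's solver.

    import numpy as np

    def sweep_many(loads, z, v0=1.0, iters=20):
        """loads: (n_scenarios, n_bus) complex powers; z: (n_bus,) line impedances."""
        v = np.full(loads.shape, v0, dtype=complex)
        for _ in range(iters):
            i_inj = np.conj(loads / v)                            # bus injection currents
            i_line = np.cumsum(i_inj[:, ::-1], axis=1)[:, ::-1]   # backward sweep
            v = v0 - np.cumsum(i_line * z, axis=1)                # forward sweep
        return v

    rng = np.random.default_rng(0)
    loads = 0.01 * (1 + 0.1 * rng.standard_normal((100000, 3))) + 0.005j
    z = np.full(3, 0.01 + 0.02j)
    print(sweep_many(loads, z).shape)   # 100000 scenarios solved in one pass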
:::::::
Efficient and Scalable Computations with Sparse Tensors
Muthu Baskaran*, Reservoir Labs Inc.; Benoit Meister, Reservoir Labs Inc.; Nicolas Vasilache, Reservoir Labs Inc.; Richard Lethin, Reservoir
Labs Inc.
Abstract:
For applications that deal with large amounts of high dimensional multi-aspect data, it becomes natural to represent such data as tensors or multi-
way arrays. Multi-linear algebraic computations such as tensor decompositions are performed for summarization and analysis of such data. Their
use in real-world applications can span across domains such as signal processing, data mining, computer vision, and graph analysis. The major
challenges with applying tensor decompositions in real-world applications are (1) dealing with large-scale high dimensional data and (2) dealing
with sparse data. Recently, algorithms for tensor decompositions that account for the sparsity of data have been proposed. In this paper, we
describe new sparse tensor storage formats that are flexible for performing tensor computations and for handling storage requirements. Further,
we propose an optimization that improves data reuse and reduces redundant or unnecessary computations in Tucker decomposition algorithms.
Furthermore, we couple our data reuse optimization and the benefits of our sparse tensor storage formats to provide a memory-efficient scalable
solution for handling large-scale sparse tensor computations. We demonstrate improved performance and address memory scalability using our
techniques on both synthetic small data sets and large-scale sparse real data sets.
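As a point of reference (the paper's formats are more sophisticated), a coordinate-format sparse tensor and the mode-n tensor-times-matrix product at the heart of Tucker decompositions can be sketched as:

    import numpy as np

    def ttm_coo(coords, vals, shape, U, mode):
        """Y = X x_mode U for sparse X in COO form (U: r x shape[mode]); Y is dense."""
        out_shape = list(shape)
        out_shape[mode] = U.shape[0]
        Y = np.zeros(out_shape)
        for idx, v in zip(coords, vals):      # touch only the stored nonzeros
            sl = list(idx)
            sl[mode] = slice(None)            # scatter along the contracted mode
            Y[tuple(sl)] += v * U[:, idx[mode]]
        return Y

    coords = [(0, 1, 2), (1, 0, 0), (2, 2, 1)]   # nonzero positions
    vals = [1.0, 2.0, -1.0]
    U = np.random.randn(2, 3)                    # factor matrix for mode 0
    print(ttm_coo(coords, vals, (3, 3, 3), U, mode=0).shape)   # (2, 3, 3)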
:::::::
Benchmarking Parallel Eigen Decomposition for Residuals Analysis of Very Large Graphs
Edward Rutledge*, MIT Lincoln Laboratory; Benjamin Miller, MIT Lincoln Laboratory; Michelle Beard, MIT Lincoln Laboratory
Abstract:
Graph analysis is used in many domains, from the social sciences to physics and engineering. The computational driver for one important class of
graph analysis algorithms is the computation of leading eigenvectors of a graph’s modularity matrix. This paper explores the computational
implications of performing an eigen decomposition of a graph’s modularity matrix using commodity cluster hardware and freely available
eigensolver software, for graphs with 1 million to 1 billion vertices, and 8 million to 8 billion edges. Working with graphs of these sizes, parallel
eigensolvers are of particular interest. Our results suggest that graph analysis approaches based on eigen space analysis of graph residuals are
feasible even for graphs of these sizes.
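The core computation can be sketched as follows; SciPy's ARPACK wrapper stands in for the parallel eigensolvers benchmarked in the paper, and the key point is that the modularity matrix B = A - k k^T / 2m is applied implicitly, so the dense rank-one term is never formed.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator, eigsh

    A = sp.random(10000, 10000, density=1e-3, format="csr")
    A = A + A.T                               # symmetrize the random test graph
    k = np.asarray(A.sum(axis=1)).ravel()     # degree vector
    two_m = k.sum()

    def matvec(x):
        return A @ x - k * (k @ x) / two_m    # B x without materializing B

    B = LinearOperator(A.shape, matvec=matvec, dtype=float)
    vals, vecs = eigsh(B, k=4, which="LA")    # leading eigenpairs
    print(vals)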
:::::::
Driving Big Data With Big Compute
Chansup Byun*, MIT Lincoln Laboratory
Abstract:
Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and
processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community and MPI clusters
provide high parallel efficiency for compute intensive workloads. Bringing the big data and big compute communities together is an active area of
research. The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds. LLGrid
MapReduce allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster. D4M
(Dynamic Distributed Dimensional Data Model) provides a high-level distributed array interface to the Apache Accumulo database. The accessibility of these technologies is assessed by measuring the effort required to use them, which is typically a few lines of code. Performance is assessed by measuring the insert rate into the Accumulo database. Using these tools, a database insert rate of 3M inserts/second has been achieved on an 8-node cluster.
:::::::
Benchmarking LAMMPS on Utility Server GPUs
Antoinette Silas*, USACE-ERDC-ITL
Abstract:
The High Performance Computing Modernization Program’s Technology Insertion (TI) uses a variety of application benchmark codes to assist in
determining which HPC vendors are capable of providing the best computing machinery. LAMMPS is a key molecular dynamics code that has
been included in the TI benchmark suite for years. With recent technological advances made using graphics processing units (GPUs), we are
now exploring the benefit of implementing the aforementioned code within the GPU environment. By way of the standard LAMMPS benchmarking
package given for a typical TI process, we will examine its performance on a utility server that houses GPUs. The resulting times given within a
CPU-only environment versus a heterogeneous environment will help determine if the effort for a successful GPU build is worthwhile. One present
benefit is that the LAMMPS code already possesses the capability to be run on a GPU with minimal coding effort. Therefore, the primary focus lies
within the test cases, computing environments, and results.
:::::::
General Purpose Computing on Graphics Processing Units: Decomposition Strategy
Henry Au, Gregory Lum
Space and Naval Warfare Systems Center Pacific (SSC Pacific), Pearl City, HI
Abstract:
This paper describes optimization strategies for porting traditional C/C++ algorithms that run on CPUs to the parallel processing
architectures found on Graphics Processing Units (GPUs). The CUDA parallel programming architecture is also explored through the use of
NVIDIA's Visual Profiler for performance analysis. Real time video feeds, such as from onshore surveillance cameras, offer limited visibility when
fog, haze, smoke, or dust clouds are present. In order to enhance the video, image processing algorithms such as the Adaptive Linear Filter
(ALF) are applied. However, algorithms such as the ALF require significant computation time, limiting the picture quality, video size, or number of video feeds that can be processed concurrently in real time. The GPU's parallel processing power is exploited to attain speedups so that image processing can be performed on the fly, in real time. Surveillance is thus enhanced by providing visual improvement for detection and classification of objects in low-visibility conditions using the ALF. The ALF was selected to provide an image processing context for algorithm optimization on GPUs. The optimization strategies explored are CUDA host memory allocations, streams, and asynchronous memory transfers. Performance results of the ALF running on the GPU before and after optimization are also reported. GPU limitations are briefly discussed as well, since not every algorithm will benefit from execution on parallel processing architectures.
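A hedged Python sketch of these strategies using CuPy as a stand-in for CUDA C (the pinned-memory recipe follows CuPy's documented pattern; the sqrt kernel is a placeholder for the ALF): pinned host buffers plus multiple streams let the transfer of one chunk overlap computation on another.

    import numpy as np
    import cupy as cp

    def pinned_like(a):
        """Copy a NumPy array into page-locked (pinned) host memory."""
        mem = cp.cuda.alloc_pinned_memory(a.nbytes)
        buf = np.frombuffer(mem, a.dtype, a.size).reshape(a.shape)
        buf[...] = a
        return buf

    frame = np.random.rand(4, 1 << 20).astype(np.float32)
    chunks = [pinned_like(c) for c in frame]           # one pinned buffer per chunk
    streams = [cp.cuda.Stream(non_blocking=True) for _ in chunks]
    outputs = []
    for chunk, stream in zip(chunks, streams):
        with stream:                                    # queue copy + kernel per stream
            d = cp.asarray(chunk)                       # asynchronous host-to-device copy
            outputs.append(cp.sqrt(d))                  # placeholder "filter" kernel
    for stream in streams:
        stream.synchronize()
    print(cp.asnumpy(outputs[0]).shape)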
:::::::
Anatomy of a Globally Recursive Embedded LINPACK Benchmark
Piotr Luszczek*, University of Tennessee
Abstract:
We present a complete bottom-up implementation of an embedded LINPACK benchmark on the iPad 2. We use a novel formulation of LU factorization that is recursive and parallel at the global scope. We believe our new algorithm presents an alternative to existing linear algebra parallelization techniques such as master-worker and DAG-based approaches. We show an assembly API that affords a much higher level of abstraction and provides rapid code development within the confines of a mobile device SDK. We use performance modeling to help with the limitations of the device and the limited access to the device from a development environment not geared toward HPC application tuning.
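For orientation, here is a minimal NumPy sketch of a globally recursive LU factorization (without the pivoting, blocking, and assembly-level kernels of the actual benchmark): the leading block is factored, the off-diagonal panels are solved against its triangular factors, and recursion continues on the Schur complement.

    import numpy as np

    def lu_recursive(A, cutoff=64):
        """Factor A in place so A = L @ U (unit lower L and upper U packed in A)."""
        n = A.shape[0]
        if n <= cutoff:                       # base case: unblocked elimination
            for k in range(n - 1):
                A[k + 1:, k] /= A[k, k]
                A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
            return A
        h = n // 2
        lu_recursive(A[:h, :h], cutoff)       # factor the leading block
        A[:h, h:] = np.linalg.solve(np.tril(A[:h, :h], -1) + np.eye(h), A[:h, h:])
        A[h:, :h] = np.linalg.solve(np.triu(A[:h, :h]).T, A[h:, :h].T).T
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]    # Schur complement update
        lu_recursive(A[h:, h:], cutoff)
        return A

    A = np.random.randn(256, 256) + 256 * np.eye(256)  # dominant diagonal: no pivoting
    A0 = A.copy()
    lu_recursive(A)
    L, U = np.tril(A, -1) + np.eye(256), np.triu(A)
    print(np.allclose(L @ U, A0))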
:::::::
STINGER: High Performance Data Structure for Streaming Graphs
David Ediger*, Georgia Institute of Technology
Abstract:
The current research focus on “big data” problems highlights the scale and complexity of analytics required and the high rate at which data may
be changing. Future applications in this space will need to cope with high rates of change at scale. STINGER is a scalable, high performance
graph data structure that enables these applications. Key attributes of STINGER are fast insertions, deletions, and updates on graphs with
semantic information and skewed degree distributions. We demonstrate a process of algorithmic and architectural optimizations that enable high
performance on the Cray XMT family and Intel multicore servers. Our implementation of STINGER processes over 3 million updates per second
on a scale-free graph with 537 million edges.
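A toy Python sketch of the operations in question, with a dict-of-dicts standing in for STINGER's blocked, concurrency-friendly layout (the field names are invented): timestamped, typed edge insertions that update in place, plus deletions.

    class StreamingGraph:
        """Toy stand-in for a streaming graph structure; not STINGER's layout."""
        def __init__(self):
            self.adj = {}                       # src -> {dst: (etype, weight, time)}

        def insert(self, src, dst, etype, weight, time):
            nbrs = self.adj.setdefault(src, {})
            if dst in nbrs:                     # edge exists: update in place
                _, w, _ = nbrs[dst]
                nbrs[dst] = (etype, w + weight, time)
            else:
                nbrs[dst] = (etype, weight, time)

        def delete(self, src, dst):
            self.adj.get(src, {}).pop(dst, None)

    g = StreamingGraph()
    g.insert(0, 1, "follows", 1.0, time=42)
    g.insert(0, 1, "follows", 2.0, time=43)     # takes the update path
    g.delete(0, 1)
    print(g.adj)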
:::::::
Cluster-based 3D Reconstruction of Aerial Video
Scott Sawyer*, MIT Lincoln Laboratory
Abstract:
Large-scale 3D scene reconstruction using Structure from Motion (SfM) continues to be very computationally challenging despite much active
research in the area. We propose an efficient, scalable processing chain designed for cluster computing and suitable for use on aerial video. The
sparse bundle adjustment step, which is iterative and difficult to parallelize, is accomplished by partitioning the input image set, generating
independent point clouds in parallel, and then fusing the clouds and combining duplicate points. We compare this processing chain to a leading
parallel SfM implementation, which exploits fine-grained parallelism in various matrix operations and is not designed to scale beyond a multi-core
workstation with a GPU. We show that our cluster-based approach offers significant improvement in scalability and runtime while producing comparable
point cloud density and more accurate point location estimates.
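The fusion step can be sketched as follows (the merge radius and the KD-tree/union-find rule are illustrative choices, not the paper's): concatenate the per-partition clouds, find point pairs closer than a radius, and average each connected group of duplicates.

    import numpy as np
    from scipy.spatial import cKDTree

    def fuse(clouds, radius=0.05):
        pts = np.vstack(clouds)
        pairs = cKDTree(pts).query_pairs(radius)   # candidate duplicate pairs
        parent = list(range(len(pts)))             # union-find over duplicates
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]      # path halving
                i = parent[i]
            return i
        for i, j in pairs:
            parent[find(i)] = find(j)
        groups = {}
        for i in range(len(pts)):
            groups.setdefault(find(i), []).append(i)
        return np.array([pts[g].mean(axis=0) for g in groups.values()])

    a = np.random.rand(500, 3)
    b = a + np.random.normal(0, 0.01, a.shape)     # overlapping second cloud
    print(fuse([a, b]).shape)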
:::::::
A MATLAB-to-Target Development Workflow using Sourcery VSIPL++
Stefan Seefeld*, Mentor Graphics, Inc.
Abstract:
A hybrid MATLAB/C++ programming model for high performance embedded computing is presented. It is shown how the use of a common data
model and API can help not only to speed up the development process, but also to keep the original MATLAB model in sync with the evolving C++
code, thus allowing it to remain a gold standard for the project as it evolves.
Keywords: multi-language, scripting, VSIPL++, prototyping, signal- and image-processing
:::::::
Big Ocean Data: Numerically Conservative Parallel Query Processing for CFD Applications
Prof. Bill Howe, University of Washington
Abstract:
The ocean sciences are rapidly evolving from an expeditionary science to an observatory-based science. As a result, data integration and analysis are replacing data acquisition as the bottleneck to new discoveries: the observatories coming online are capable of collecting data faster than we can understand it. The applications and datasets in this domain have several properties that complicate the design of efficient abstractions. Unlike atmospheric models, these complex domains can only be modeled precisely using polyhedral meshes ("unstructured grids") rather than simpler arrays or relations. Many common tasks are extremely sensitive to numerical errors, forcing researchers to access raw data at full resolution and implement their own data integration techniques by hand. File-oriented standards and systems popular in this domain offer no support for parallel processing.
In this talk, I'll describe an algorithm for efficient and numerically conservative regridding of arbitrary unstructured grids, and show how this
primitive can be made to commute with a mesh subsetting operator. Together, these two operations appear sufficient to efficiently express a
variety of tasks and idioms in ocean data analysis, including model coupling, visualization, distributed/parallel processing, and adaptive mesh refinement. Finally, I'll describe our current release of the code and how we have integrated it with the Hyrax OPeNDAP server, a popular
platform for serving ocean and atmospheric data over the Internet.
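A minimal sketch of what "numerically conservative" means here, in one dimension (the real algorithm handles arbitrary polyhedral meshes): each target cell receives source values weighted by fractional overlap, so the total integral is preserved exactly.

    import numpy as np

    def overlap_weights(src_edges, dst_edges):
        """W[j, i] = |overlap of src cell i with dst cell j| / |dst cell j|."""
        W = np.zeros((len(dst_edges) - 1, len(src_edges) - 1))
        for j in range(W.shape[0]):
            for i in range(W.shape[1]):
                lo = max(src_edges[i], dst_edges[j])
                hi = min(src_edges[i + 1], dst_edges[j + 1])
                W[j, i] = max(0.0, hi - lo) / (dst_edges[j + 1] - dst_edges[j])
        return W

    src = np.linspace(0.0, 1.0, 11)                # 10 source cells
    dst = np.linspace(0.0, 1.0, 7)                 # 6 target cells
    field = np.random.rand(10)
    regridded = overlap_weights(src, dst) @ field
    # Conservation check: the integral over the domain is unchanged.
    print(np.isclose((np.diff(dst) * regridded).sum(), (np.diff(src) * field).sum()))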
:::::::