2012 IEEE High Performance
Extreme Computing Conference
(HPEC ‘12)
Sixteenth Annual HPEC Conference
10 - 12 September 2012
Westin Hotel, Waltham, MA USA
Multithreaded FPGA Acceleration of DNA Sequence Mapping
Walid Najjar*, UC Riverside; Edward Fernandez, University of California, Riverside; Stefano Lonardi, University of California, Riverside; Jason
Villarreal, Jacquard Computing Inc.
Abstract:
In bioinformatics, short read alignment is a computationally intensive operation that involves matching millions or billions of short strings (called
reads) against a reference genome. A representative run would require matching tens of millions of reads, each about 100 symbols long, against a genome that can consist of a few billion characters. Existing short read aligners are expected to report all the occurrences of each read as well
as allow users to control the number of allowed mismatches between reads and reference genome. Popular software implementations such as
Bowtie [8] or BWA [10] can take many hours or days to execute, making the problem an ideal candidate for hardware acceleration. In this paper,
we describe FHAST (FPGA Hardware Accelerated Sequencing-matching Tool), a hardware accelerator that acts as a drop-in replacement for
short read alignment software. Our architecture masks memory latency by executing many concurrent hardware threads accessing memory
simultaneously and consists of multiple parallel engines to exploit the parallelism available to us on an FPGA. We have implemented and tested
FHAST on the Convey HC-1 [9], taking advantage of the large amount of memory bandwidth available to the system and the shared memory
image between hardware and software. We compare the runtime of FHAST against that of Bowtie on the Convey HC-1, on human chromosome 14 as well as the entire human genome, and show up to ~70X improvement in total end-to-end execution time, reducing runs that take several hours to a few minutes. We also show that FHAST scales more favorably when extended to multiple FPGAs than Bowtie does when extended to multiple CPUs.
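To make the kernel concrete, here is a minimal Python sketch (illustrative only; the function and parameter names are invented, and FHAST itself is an FPGA design, not software) of the operation being accelerated: reporting every position where a read matches the reference with at most a given number of mismatches.

    # Toy model of the short read alignment task: report every position where
    # `read` matches `reference` with at most `max_mm` mismatches. FHAST runs
    # many such searches as concurrent hardware threads to mask memory latency.
    def occurrences(reference, read, max_mm):
        m = len(read)
        for pos in range(len(reference) - m + 1):
            mm = 0
            for a, b in zip(reference[pos:pos + m], read):
                if a != b:
                    mm += 1
                    if mm > max_mm:
                        break
            else:
                yield pos, mm   # loop finished without exceeding the budget

    ref = "ACGTACGTTACGT"
    print(list(occurrences(ref, "ACGT", max_mm=1)))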
:::::::
Floating Point Vector Processing using 28nm FPGAs
Michael Parker*, Altera Corporation; Dan Pritsker, Altera Corporation
Abstract:
Vector processing is useful for implementing the linear algebra algorithms used in many commercial, government, and military applications. Typically, this is done in software on specialized multi-core CPU or GPU architectures. A compelling alternative is an FPGA-based implementation using single-precision floating point. This paper examines the implementation of one such algorithm, QR decomposition with back substitution, a common method for solving non-square, over-determined systems of equations. This has been implemented
using a mid-sized 28nm FPGA. Performance (GFLOPs), throughput, Fmax and power consumption are measured. The algorithm is implemented
as a parameterizable core, which can be easily configured for all the matrix sizes benchmarked herein.
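For reference, a minimal NumPy sketch of the algorithm the core implements, QR decomposition followed by back substitution for an over-determined least-squares problem; the matrix sizes here are arbitrary, not the benchmarked ones.

    import numpy as np

    def qr_solve(A, b):
        """Least-squares solve of A x ~= b via QR and back substitution."""
        Q, R = np.linalg.qr(A)               # A is m x n, m >= n; R is upper triangular
        y = Q.T @ b                          # project b onto the orthonormal basis
        x = np.zeros_like(y)
        for i in range(len(y) - 1, -1, -1):  # back substitution on R x = y
            x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
        return x

    A = np.random.randn(200, 50)
    b = np.random.randn(200)
    print(np.allclose(qr_solve(A, b), np.linalg.lstsq(A, b, rcond=None)[0]))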
:::::::
Reconfigurable Advanced Rapid-prototyping Environment (RARE)
Michael Bonato*, Colorado Engineering, Inc.
Abstract:
The size, weight, and power (SWaP) budgets available to support embedded computing in airborne platforms, especially unmanned vehicles, are
typically very constrained. Legacy backplane-centric approaches limit a designer’s flexibility to implement high performance computing in these
tight spaces. This paper describes an out-of-the-box architectural approach, developed under a program sponsored by the Missile Defense
Agency (MDA), enabling modular, scalable, high performance embedded computing within challenging SWaP footprints.
:::::::
Large Scale Network Situational Awareness via 3D Gaming Technology
Matthew Hubbell*, MIT Lincoln Laboratory
Abstract:
Obtaining situational awareness of network activity across an enterprise presents unique visualization challenges. IT analysts are required to
quickly gather and correlate large volumes of disparate data to identify the existence of anomalous behavior. This paper will show how the MIT
Lincoln Laboratory LLGrid Team has approached obtaining network situational awareness utilizing the Unity 3D video game engine. We have
developed a 3D environment of the physical plant in the format of a networked multiplayer First Person Shooter (FPS) to demonstrate a virtual
depiction of the current state of the network and the machines operating on the network. Within the game or virtual world an analyst or player can
gather critical information on all network assets as well as perform physical system actions on machines in question. 3D gaming technology
provides tools to create an environment that is both visually familiar to the player and able to display immense amounts of system data in a meaningful, easy-to-absorb format. Our prototype system was able to monitor and display 5000 assets in ~10% of our network time window.
:::::::
Scalable Cyber-Security for Terabit Cloud Computing
Jordi Ros-Giralt*, Reservoir Labs
Abstract:
This paper addresses the problem of scalable cyber-security using a cloud computing architecture. Scalability is treated in two contexts: (1)
performance and power efficiency and (2) degree of cyber security-relevant information detected by the cyber-security cloud (CSC). We provide a
framework to construct CSCs, which derives from a set of fundamental building blocks (forwarders, analyzers and grounds) and the identification
of the smallest functional units (atomic CSC cells or simply aCSC cells) capable of embedding the full functionality of the cyber-security cloud.
aCSC cells are then studied and several high-performance algorithms are presented to optimize the system’s performance and power efficiency.
Among these, a new queuing policy, called tail early detection (TED), is introduced to proactively drop packets in a way that maximizes the degree of detected information while saving power by avoiding spending cycles on less relevant traffic components. We also show that it is
possible to use aCSC cells as core building blocks to construct arbitrarily large cyber-security clouds by structuring the cells using a hierarchical
architecture. To demonstrate the utility of our framework, we implement one cyber-security “mini-cloud” on a single-chip prototype based on Tilera’s TILEPro64 processor, demonstrating performance of up to 10 Gbps.
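As a rough illustration of the TED idea (the thresholds and relevance scoring below are invented; the paper's policy is more refined), a queue can begin dropping low-relevance arrivals before it is full, so no cycles are spent on traffic unlikely to contribute detected information:

    from collections import deque

    class TEDQueue:
        """Toy early-drop queue: a stand-in for the tail early detection policy."""
        def __init__(self, capacity, early_threshold):
            self.q = deque()
            self.capacity = capacity
            self.early_threshold = early_threshold  # occupancy where early drops begin

        def enqueue(self, packet, relevance):
            if len(self.q) >= self.capacity:
                return False        # hard tail drop: queue is full
            if len(self.q) >= self.early_threshold and relevance < 0.5:
                return False        # early drop: not worth spending cycles on
            self.q.append(packet)
            return True

    q = TEDQueue(capacity=8, early_threshold=4)
    print([q.enqueue(i, relevance=0.3 if i % 2 else 0.9) for i in range(12)])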
:::::::
Scalable Cryptographic Authentication for High Performance Computing
Andrew Prout*, MIT Lincoln Laboratory
Abstract:
High performance computing (HPC) uses supercomputers and computing clusters to solve large computational problems. Frequently HPC
resources are shared systems and access to restricted data sets or resources must be authenticated. These authentication needs can take
multiple forms, both internal and external to the HPC cluster. A computational stack that uses web services among nodes in the HPC cluster may need to perform authentication between nodes of the same job, or a job may need to reach out to data sources outside the HPC cluster. Traditional authentication mechanisms such as passwords or digital certificates encounter issues with the distributed and potentially disconnected nature of HPC systems. Distributing and storing plain-text passwords or cryptographic keys among nodes in an HPC system without special protection is a poor security
practice. Systems to reach back to the user’s terminal for access to the authenticator are possible, but only in fully-interactive supercomputing
where connectivity to the user’s terminal can be guaranteed. Point solutions, such as software-based roles or self-signed certificates, can be enabled for these use cases; however, they require significant expertise in digital certificates to configure. A more general solution is called for that is
both secure and easy to use. This paper presents an overview of a solution implemented on the interactive, on-demand LLGrid computing system
[3,4,5] at MIT Lincoln Laboratory and its use to solve one such authentication problem.
:::::::
On an MPI Rank/Node Layout Utility for Improving Performance of Communications Intensive Heterogeneous MPI Applications on
SGI Altix ICE 8200 Systems
Bracy Elton*, DRC
Abstract:
:::::::
An update on Scalable Implementation of Primitives for Homomorphic EncRyption – FPGA implementation using Simulink
David Cousins*, Raytheon BBN Technologies
Abstract:
Accelerating the development of a practical Fully Homomorphic Encryption (FHE) scheme is the goal of the DARPA PROCEED program. For the
past year, this program has had as its focus the acceleration of various aspects of the FHE concept toward practical implementation and use. FHE
would be a game-changing technology to enable secure, general computation on encrypted data, e.g., on untrusted off-site hardware. However,
FHE will still require several orders of magnitude improvement in computation before it will be practical for widespread use. Recent theoretical
breakthroughs demonstrated the existence of FHE schemes [1, 2], and to date much progress has been made in both algorithmic and
implementation improvements. Specifically, our contribution to the PROCEED program has been the development of FPGA-based hardware primitives to accelerate computation on encrypted data using FHE based on lattice techniques [3]. Our project, SIPHER, has been using a state-of-the-art toolchain developed by MathWorks to implement VHDL code for FPGA circuits directly from Simulink models. Our baseline homomorphic encryption prototypes are developed directly in MATLAB, using the fixed-point toolbox to perform the required integer arithmetic. Constant improvements in algorithms require us to be able to quickly implement them in a high-level language such as MATLAB. We reported on
our initial results at HPEC 2011 [4]. In the past year, increases in algorithm complexity have introduced several new design requirements for our
FPGA implementation. This report presents new Simulink primitives that had to be developed to deal with these new requirements.
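For context, the class of integer arithmetic such primitives accelerate can be sketched in a few lines of Python: multiplication in the polynomial ring Z_q[x]/(x^n + 1), which underlies many lattice-based FHE schemes. The parameters below are toy values; the SIPHER primitives realize this kind of operation in fixed point on FPGAs.

    def ring_mul(a, b, n, q):
        """Multiply length-n coefficient lists a, b modulo x^n + 1 and modulo q."""
        c = [0] * n
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                k = i + j
                if k < n:
                    c[k] = (c[k] + ai * bj) % q          # ordinary term
                else:
                    c[k - n] = (c[k - n] - ai * bj) % q  # wrap-around: x^n = -1
        return c

    n, q = 8, 97
    print(ring_mul([1, 2, 0, 0, 5, 0, 0, 3], [4, 0, 1, 0, 0, 0, 2, 0], n, q))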
:::::::
Modeling Optical Scatter from Complex Aerosol Samples
Adam Milstein, MIT Lincoln Laboratory
Abstract:
MIT Lincoln Laboratory has developed an advanced modeling capability to investigate elastic scattering properties of biological and inert aerosols
at near- and mid-wave infrared wavelengths. The aerosol sample’s optical cross section and Mueller matrix across multiple scattering angles are
calculated and validated against optical measurements. The modeling approach uses the discrete dipole approximation to compute elastic optical
scatter from an ensemble of randomly-oriented particles of arbitrary shape, index of refraction, and size distribution. The calculation over large
ensembles of particles requires significant computational resources, and is thus implemented on LLGRID, a large parallel grid computer. This
presentation shows results for several types of particles, including bacterial spore clusters and dust particles, with comparison to microscopy and
optical measurements from the Standoff Aerosol Active Signature Testbed.
This work is sponsored under Air Force contract FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the
authors and are not necessarily endorsed by the United States Government.
:::::::
Scrubbing Optimization via Availability Prediction (SOAP) for Reconfigurable Space Computing
Quinn Martin*, University of Florida, NSF CHREC; Alan George, CHREC, ECE Dept. University of Florida
Abstract:
Reconfigurable computing with FPGAs can be highly effective in terms of performance, adaptability, and power for accelerating space
applications, but their configuration memory must be scrubbed to prevent the accumulation of single-event upsets. Many scrubbing techniques
currently exist, each with different advantages, making it difficult for the system designer to choose the optimal scrubbing strategy for a given
mission. This paper surveys the currently available scrubbing techniques and introduces the SOAP method for predicting system availability for
various scrubbing strategies using Markov models. We then apply the method to compare hypothetical Virtex-5 and Virtex-6 systems for blind,
CRC-32, and Frame ECC scrubbing strategies in LEO and HEO. We show that availability in excess of 5 nines can be obtained with modern,
FPGA-based systems using scrubbing. Furthermore, we show the value of the SOAP method by observing that different scrubbing strategies are
optimal for different types of missions.
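A minimal sketch of the prediction step, with invented states and rates (the paper builds mission- and strategy-specific models): availability is read off the steady-state distribution of a small Markov chain.

    import numpy as np

    # States: 0 = operational, 1 = upset (SEU latched), 2 = under scrub/repair.
    # P[i, j] is a per-step transition probability shaped by upset and scrub rates.
    P = np.array([[0.999, 0.001, 0.000],    # an SEU moves the system to "upset"
                  [0.000, 0.900, 0.100],    # the scrubber detects the upset
                  [0.500, 0.000, 0.500]])   # repair completes, back in service

    # Steady state: the eigenvector of P^T for eigenvalue 1, normalized to sum 1.
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi /= pi.sum()
    print("predicted availability:", pi[0])  # fraction of time operational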
:::::::
CUDA and OpenCL implementations of 3D CT reconstruction for Biomedical Imaging
Saoni Mukherjee*, Northeastern University
Abstract:
Biomedical image reconstruction applications with large datasets can benefit from acceleration. Graphics Processing Units (GPUs) are particularly useful in this context, as they can produce high-fidelity images rapidly. An algorithm to reconstruct cone-beam computed tomography (CT) images from 2D projections is implemented using GPUs. The implementation takes slices of the target, weights the projection data, filters the weighted data, and then backprojects it to create the final 3D reconstruction. This is implemented on two types of hardware: a CPU and a heterogeneous system combining CPU and GPU. The CPU code, in C and MATLAB, is compared with heterogeneous versions written in CUDA-C and OpenCL. The relative performance is tested and evaluated on a mathematical phantom as well as mouse data.
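A compact NumPy sketch of the weight/filter/backproject pipeline for a single 2D slice (the cone-beam version adds per-row geometric weighting; shapes and the ramp filter here are illustrative):

    import numpy as np

    def reconstruct(projections, angles):
        """projections: (n_angles, n_det) sinogram -> (n_det, n_det) slice."""
        n_angles, n_det = projections.shape
        ramp = np.abs(np.fft.fftfreq(n_det))              # ramp filter weights
        filtered = np.real(np.fft.ifft(np.fft.fft(projections, axis=1) * ramp, axis=1))
        grid = np.arange(n_det) - n_det / 2
        X, Y = np.meshgrid(grid, grid)
        image = np.zeros((n_det, n_det))
        for view, theta in zip(filtered, angles):         # backproject each view
            t = X * np.cos(theta) + Y * np.sin(theta) + n_det / 2
            image += np.interp(t, np.arange(n_det), view)
        return image * np.pi / n_angles

    sino = np.random.rand(180, 64)
    print(reconstruct(sino, np.deg2rad(np.arange(180))).shape)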
:::::::
Optimized Parallel Distribution Load Flow Solver on Commodity Multi-core CPU
Tao Cui*, Department of ECE, Carnegie Mellon University, Pittsburgh
Abstract:
Solving a large number of load flow problems quickly is required for various power system applications, especially in the smart grid, including Monte Carlo analysis, long-term steady-state simulation, and system benchmarking, among others. However, due to the computational burden, such applications are considered time consuming and infeasible for online or real-time use. In this work we developed a high-performance framework for high-throughput distribution load flow computation, taking advantage of performance-enhancing features of multi-core CPUs and various code optimization techniques. We optimized the data structure to better fit the memory hierarchy. We use SPIRAL as the code generator to exploit the inherent patterns of the load flow model. Multiple levels of parallelism, including SIMD and multithreading, are also used for the particular problem setup. The optimized solver is able to achieve more than 50% of a Core i7 CPU’s machine peak, which translates to solving millions of load flow problems within a second for the IEEE 37 test feeder. Finally, a scheduling threading structure is also designed to enable real-time application.
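The throughput idea can be sketched in NumPy by vectorizing over the scenario axis, which is the role SIMD lanes and threads play in the paper; the toy radial feeder and forward/backward sweep below are illustrative, not the paper's solver.

    import numpy as np

    def sweep_many(loads, z, v0=1.0, iters=20):
        """loads: (n_scenarios, n_bus) complex powers; z: (n_bus,) line impedances."""
        v = np.full(loads.shape, v0, dtype=complex)
        for _ in range(iters):
            i_inj = np.conj(loads / v)                            # bus injection currents
            i_line = np.cumsum(i_inj[:, ::-1], axis=1)[:, ::-1]   # backward sweep
            v = v0 - np.cumsum(i_line * z, axis=1)                # forward sweep
        return v

    rng = np.random.default_rng(0)
    loads = 0.01 * (1 + 0.1 * rng.standard_normal((100000, 3))) + 0.005j
    z = np.full(3, 0.01 + 0.02j)
    print(sweep_many(loads, z).shape)   # 100000 scenarios solved in one pass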
:::::::
Efficient and Scalable Computations with Sparse Tensors
Muthu Baskaran*, Reservoir Labs Inc.; Benoit Meister, Reservoir Labs Inc.; Nicolas Vasilache, Reservoir Labs Inc.; Richard Lethin, Reservoir
Labs Inc.
Abstract:
For applications that deal with large amounts of high dimensional multi-aspect data, it becomes natural to represent such data as tensors or multi-
way arrays. Multi-linear algebraic computations such as tensor decompositions are performed for summarization and analysis of such data. Their
use in real-world applications can span across domains such as signal processing, data mining, computer vision, and graph analysis. The major
challenges with applying tensor decompositions in real-world applications are (1) dealing with large-scale high dimensional data and (2) dealing
with sparse data. Recently, algorithms for tensor decompositions that account for the sparsity of data have been proposed. In this paper, we
describe new sparse tensor storage formats that are flexible for performing tensor computations and for handling storage requirements. Further,
we propose an optimization that improves data reuse and reduces redundant or unnecessary computations in Tucker decomposition algorithms.
Furthermore, we couple our data reuse optimization and the benefits of our sparse tensor storage formats to provide a memory-efficient scalable
solution for handling large-scale sparse tensor computations. We demonstrate improved performance and address memory scalability using our
techniques on both synthetic small data sets and large-scale sparse real data sets.
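As a point of reference (the paper's formats are more sophisticated), a coordinate-format sparse tensor and the mode-n tensor-times-matrix product at the heart of Tucker decompositions can be sketched as:

    import numpy as np

    def ttm_coo(coords, vals, shape, U, mode):
        """Y = X x_mode U for sparse X in COO form (U: r x shape[mode]); Y is dense."""
        out_shape = list(shape)
        out_shape[mode] = U.shape[0]
        Y = np.zeros(out_shape)
        for idx, v in zip(coords, vals):      # touch only the stored nonzeros
            sl = list(idx)
            sl[mode] = slice(None)            # scatter along the contracted mode
            Y[tuple(sl)] += v * U[:, idx[mode]]
        return Y

    coords = [(0, 1, 2), (1, 0, 0), (2, 2, 1)]   # nonzero positions
    vals = [1.0, 2.0, -1.0]
    U = np.random.randn(2, 3)                    # factor matrix for mode 0
    print(ttm_coo(coords, vals, (3, 3, 3), U, mode=0).shape)   # (2, 3, 3)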
:::::::
Benchmarking Parallel Eigen Decomposition for Residuals Analysis of Very Large Graphs
Edward Rutledge*, MIT Lincoln Laboratory; Benjamin Miller, MIT Lincoln Laboratory; Michelle Beard, MIT Lincoln Laboratory
Abstract:
Graph analysis is used in many domains, from the social sciences to physics and engineering. The computational driver for one important class of
graph analysis algorithms is the computation of leading eigenvectors of a graph’s modularity matrix. This paper explores the computational
implications of performing an eigen decomposition of a graph’s modularity matrix using commodity cluster hardware and freely available
eigensolver software, for graphs with 1 million to 1 billion vertices, and 8 million to 8 billion edges. Working with graphs of these sizes, parallel
eigensolvers are of particular interest. Our results suggest that graph analysis approaches based on eigen space analysis of graph residuals are
feasible even for graphs of these sizes.
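The core computation can be sketched as follows; SciPy's ARPACK wrapper stands in for the parallel eigensolvers benchmarked in the paper, and the key point is that the modularity matrix B = A - k k^T / 2m is applied implicitly, so the dense rank-one term is never formed.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator, eigsh

    A = sp.random(10000, 10000, density=1e-3, format="csr")
    A = A + A.T                               # symmetrize the random test graph
    k = np.asarray(A.sum(axis=1)).ravel()     # degree vector
    two_m = k.sum()

    def matvec(x):
        return A @ x - k * (k @ x) / two_m    # B x without materializing B

    B = LinearOperator(A.shape, matvec=matvec, dtype=float)
    vals, vecs = eigsh(B, k=4, which="LA")    # leading eigenpairs
    print(vals)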
:::::::
Driving Big Data With Big Compute
Chansup Byun*, MIT Lincoln Laboratory
Abstract:
Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and
processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community and MPI clusters
provide high parallel efficiency for compute intensive workloads. Bringing the big data and big compute communities together is an active area of
research. The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds. LLGrid
MapReduce allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster. D4M
(Dynamic Distributed Dimensional Data Model) provides a high-level distributed array interface to the Apache Accumulo database. The accessibility of these technologies is assessed by measuring the effort required to use them, which is typically a few lines of code. Performance is assessed by measuring the insert rate into the Accumulo database. Using these tools, a database insert rate of 3M inserts/second has been achieved on an 8-node cluster.
:::::::
Benchmarking LAMMPS on Utility Server GPUs
Antoinette Silas*, USACE-ERDC-ITL
Abstract:
The High Performance Computing Modernization Program’s Technology Insertion (TI) uses a variety of application benchmark codes to assist in
determining which HPC vendors are capable of providing the best computing machinery. LAMMPS is a key molecular dynamics code that has
been included in the TI benchmark suite for years. With recent technological advances made using graphics processing units (GPUs), we are
now exploring the benefit of implementing the aforementioned code within the GPU environment. By way of the standard LAMMPS benchmarking
package given for a typical TI process, we will examine its performance on a utility server that houses GPUs. The resulting times given within a
CPU-only environment versus a heterogeneous environment will help determine if the effort for a successful GPU build is worthwhile. One present
benefit is that the LAMMPS code already possesses the capability to be run on a GPU with minimal coding effort. Therefore, the primary focus lies
within the test cases, computing environments, and results.
:::::::
General Purpose Computing on Graphics Processing Units: Decomposition Strategy
Henry Au, Gregory Lum
Space and Naval Warfare Systems Center Pacific (SSC Pacific), Pearl City, HI
Abstract:
This paper describes optimization strategies for porting traditional C/C++ algorithms that run on CPUs to the parallel processing
architectures found on Graphics Processing Units (GPUs). The CUDA parallel programming architecture is also explored through the use of
NVIDIA's Visual Profiler for performance analysis. Real time video feeds, such as from onshore surveillance cameras, offer limited visibility when
fog, haze, smoke, or dust clouds are present. In order to enhance the video, image processing algorithms such as the Adaptive Linear Filter
(ALF) are applied. However, algorithms such as the ALF require significant computation time, limiting the picture quality, video size, or number of video feeds that can be processed concurrently in real time. The GPU's parallel processing power is exploited to attain speedups so that image processing can be performed on the fly, in real time. Surveillance is thus enhanced by providing visual improvement for detection and classification of objects in low-visibility conditions using the ALF. The ALF was selected to provide an image processing context for algorithm optimization on GPUs. The optimization strategies explored are CUDA host memory allocations, streams, and asynchronous memory transfers. Performance results of the ALF running on the GPU before and after optimization are also reported. GPU limitations are briefly discussed as well, since not every algorithm will benefit from execution on parallel processing architectures.
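A hedged Python sketch of these strategies using CuPy as a stand-in for CUDA C (the pinned-memory recipe follows CuPy's documented pattern; the sqrt kernel is a placeholder for the ALF): pinned host buffers plus multiple streams let the transfer of one chunk overlap computation on another.

    import numpy as np
    import cupy as cp

    def pinned_like(a):
        """Copy a NumPy array into page-locked (pinned) host memory."""
        mem = cp.cuda.alloc_pinned_memory(a.nbytes)
        buf = np.frombuffer(mem, a.dtype, a.size).reshape(a.shape)
        buf[...] = a
        return buf

    frame = np.random.rand(4, 1 << 20).astype(np.float32)
    chunks = [pinned_like(c) for c in frame]           # one pinned buffer per chunk
    streams = [cp.cuda.Stream(non_blocking=True) for _ in chunks]
    outputs = []
    for chunk, stream in zip(chunks, streams):
        with stream:                                    # queue copy + kernel per stream
            d = cp.asarray(chunk)                       # asynchronous host-to-device copy
            outputs.append(cp.sqrt(d))                  # placeholder "filter" kernel
    for stream in streams:
        stream.synchronize()
    print(cp.asnumpy(outputs[0]).shape)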
:::::::
Anatomy of a Globally Recursive Embedded LINPACK Benchmark
Piotr Luszczek*, University of Tennessee
Abstract:
We present a complete bottom-up implementation of an embedded LINPACK benchmark on the iPad 2. We use a novel formulation of LU factorization that is recursive and parallel at the global scope. We believe our new algorithm presents an alternative to existing linear algebra parallelization techniques such as master-worker and DAG-based approaches. We show an assembly API that affords a much higher level of abstraction and provides rapid code development within the confines of a mobile device SDK. We use performance modeling to help with the limitations of the device and the limited access to the device from a development environment not geared toward HPC application tuning.
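For orientation, here is a minimal NumPy sketch of a globally recursive LU factorization (without the pivoting, blocking, and assembly-level kernels of the actual benchmark): the leading block is factored, the off-diagonal panels are solved against its triangular factors, and recursion continues on the Schur complement.

    import numpy as np

    def lu_recursive(A, cutoff=64):
        """Factor A in place so A = L @ U (unit lower L and upper U packed in A)."""
        n = A.shape[0]
        if n <= cutoff:                       # base case: unblocked elimination
            for k in range(n - 1):
                A[k + 1:, k] /= A[k, k]
                A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
            return A
        h = n // 2
        lu_recursive(A[:h, :h], cutoff)       # factor the leading block
        A[:h, h:] = np.linalg.solve(np.tril(A[:h, :h], -1) + np.eye(h), A[:h, h:])
        A[h:, :h] = np.linalg.solve(np.triu(A[:h, :h]).T, A[h:, :h].T).T
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]    # Schur complement update
        lu_recursive(A[h:, h:], cutoff)
        return A

    A = np.random.randn(256, 256) + 256 * np.eye(256)  # dominant diagonal: no pivoting
    A0 = A.copy()
    lu_recursive(A)
    L, U = np.tril(A, -1) + np.eye(256), np.triu(A)
    print(np.allclose(L @ U, A0))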
:::::::
STINGER: High Performance Data Structure for Streaming Graphs
David Ediger*, Georgia Institute of Technology
Abstract:
The current research focus on “big data” problems highlights the scale and complexity of analytics required and the high rate at which data may
be changing. Future applications in this space will need to cope with high rates of change at scale. STINGER is a scalable, high performance
graph data structure that enables these applications. Key attributes of STINGER are fast insertions, deletions, and updates on graphs with
semantic information and skewed degree distributions. We demonstrate a process of algorithmic and architectural optimizations that enable high
performance on the Cray XMT family and Intel multicore servers. Our implementation of STINGER processes over 3 million updates per second
on a scale-free graph with 537 million edges.
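A toy Python sketch of the operations in question, with a dict-of-dicts standing in for STINGER's blocked, concurrency-friendly layout (the field names are invented): timestamped, typed edge insertions that update in place, plus deletions.

    class StreamingGraph:
        """Toy stand-in for a streaming graph structure; not STINGER's layout."""
        def __init__(self):
            self.adj = {}                       # src -> {dst: (etype, weight, time)}

        def insert(self, src, dst, etype, weight, time):
            nbrs = self.adj.setdefault(src, {})
            if dst in nbrs:                     # edge exists: update in place
                _, w, _ = nbrs[dst]
                nbrs[dst] = (etype, w + weight, time)
            else:
                nbrs[dst] = (etype, weight, time)

        def delete(self, src, dst):
            self.adj.get(src, {}).pop(dst, None)

    g = StreamingGraph()
    g.insert(0, 1, "follows", 1.0, time=42)
    g.insert(0, 1, "follows", 2.0, time=43)     # takes the update path
    g.delete(0, 1)
    print(g.adj)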
:::::::
Cluster-based 3D Reconstruction of Aerial Video
Scott Sawyer*, MIT Lincoln Laboratory
Abstract:
Large-scale 3D scene reconstruction using Structure from Motion (SfM) continues to be very computationally challenging despite much active
research in the area. We propose an efficient, scalable processing chain designed for cluster computing and suitable for use on aerial video. The
sparse bundle adjustment step, which is iterative and difficult to parallelize, is accomplished by partitioning the input image set, generating
independent point clouds in parallel, and then fusing the clouds and combining duplicate points. We compare this processing chain to a leading
parallel SfM implementation, which exploits fine-grained parallelism in various matrix operations and is not designed to scale beyond a multi-core
workstation with a GPU. We show that our cluster-based approach offers significant improvement in scalability and runtime while producing comparable
point cloud density and more accurate point location estimates.
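The fusion step can be sketched as follows (the merge radius and the KD-tree/union-find rule are illustrative choices, not the paper's): concatenate the per-partition clouds, find point pairs closer than a radius, and average each connected group of duplicates.

    import numpy as np
    from scipy.spatial import cKDTree

    def fuse(clouds, radius=0.05):
        pts = np.vstack(clouds)
        pairs = cKDTree(pts).query_pairs(radius)   # candidate duplicate pairs
        parent = list(range(len(pts)))             # union-find over duplicates
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]      # path halving
                i = parent[i]
            return i
        for i, j in pairs:
            parent[find(i)] = find(j)
        groups = {}
        for i in range(len(pts)):
            groups.setdefault(find(i), []).append(i)
        return np.array([pts[g].mean(axis=0) for g in groups.values()])

    a = np.random.rand(500, 3)
    b = a + np.random.normal(0, 0.01, a.shape)     # overlapping second cloud
    print(fuse([a, b]).shape)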
:::::::
A MATLAB-to-Target Development Workflow using Sourcery VSIPL++
Stefan Seefeld*, Mentor Graphics, Inc.
Abstract:
A hybrid MATLAB/C++ programming model for high performance embedded computing is presented. It is shown how the use of a common data
model and API can help not only to speed up the development process, but also to keep the original MATLAB model in sync with the evolving C++
code, thus allowing it to remain a gold standard for the project as it evolves.
Keywords: multi-language, scripting, VSIPL++, prototyping, signal- and image-processing
:::::::
Big Ocean Data: Numerically Conservative Parallel Query Processing for CFD Applications
Prof. Bill Howe, University of Washington
Abstract:
The ocean sciences are rapidly evolving from an expeditionary science to an observatory-based science. As a result, data integration and analysis are replacing data acquisition as the bottleneck to new discoveries: the observatories coming online are capable of collecting data faster than we can understand it. The applications and datasets in this domain have several properties that complicate the design of efficient abstractions. Unlike atmospheric models, these complex domains can only be modeled precisely using polyhedral meshes ("unstructured grids") rather than simpler arrays or relations. Many common tasks are extremely sensitive to numerical errors, forcing researchers to access raw data at full resolution and implement their own data integration techniques by hand. File-oriented standards and systems popular in this domain offer no support for parallel processing.
In this talk, I'll describe an algorithm for efficient and numerically conservative regridding of arbitrary unstructured grids, and show how this
primitive can be made to commute with a mesh subsetting operator. Together, these two operations appear sufficient to efficiently express a
variety of tasks and idioms in ocean data analysis, including model coupling, visualization, distributed/parallel processing, and adaptive mesh refinement. Finally, I'll describe our current release of the code and how we have integrated it with the Hyrax OPeNDAP server, a popular
platform for serving ocean and atmospheric data over the Internet.
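A minimal sketch of what "numerically conservative" means here, in one dimension (the real algorithm handles arbitrary polyhedral meshes): each target cell receives source values weighted by fractional overlap, so the total integral is preserved exactly.

    import numpy as np

    def overlap_weights(src_edges, dst_edges):
        """W[j, i] = |overlap of src cell i with dst cell j| / |dst cell j|."""
        W = np.zeros((len(dst_edges) - 1, len(src_edges) - 1))
        for j in range(W.shape[0]):
            for i in range(W.shape[1]):
                lo = max(src_edges[i], dst_edges[j])
                hi = min(src_edges[i + 1], dst_edges[j + 1])
                W[j, i] = max(0.0, hi - lo) / (dst_edges[j + 1] - dst_edges[j])
        return W

    src = np.linspace(0.0, 1.0, 11)                # 10 source cells
    dst = np.linspace(0.0, 1.0, 7)                 # 6 target cells
    field = np.random.rand(10)
    regridded = overlap_weights(src, dst) @ field
    # Conservation check: the integral over the domain is unchanged.
    print(np.isclose((np.diff(dst) * regridded).sum(), (np.diff(src) * field).sum()))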
:::::::