2012 IEEE High Performance Extreme Computing Conference (HPEC ’12)
Sixteenth Annual HPEC Conference
10-12 September 2012, Westin Hotel, Waltham, MA USA
Multithreaded FPGA Acceleration of DNA Sequence Mapping
Walid Najjar*, UC Riverside; Edward Fernandez, University of California, Riverside; Stefano Lonardi, University of California, Riverside; Jason Villarreal, Jacquard Computing Inc.
Abstract: In bioinformatics, short read alignment is a computationally intensive operation that involves matching millions or billions of short strings (called reads) against a reference genome. A representative run requires matching tens of millions of reads, each about 100 symbols long, against a genome that can consist of a few billion characters. Short read aligners are expected to report all the occurrences of each read and to let users control the number of allowed mismatches between reads and the reference genome. Popular software implementations such as Bowtie [8] or BWA [10] can take many hours or days to execute, making the problem an ideal candidate for hardware acceleration. In this paper, we describe FHAST (FPGA Hardware Accelerated Sequencing-matching Tool), a hardware accelerator that acts as a drop-in replacement for short read alignment software. Our architecture masks memory latency by executing many concurrent hardware threads accessing memory simultaneously, and consists of multiple parallel engines to exploit the parallelism available on an FPGA. We have implemented and tested FHAST on the Convey HC-1 [9], taking advantage of the system's large memory bandwidth and the shared memory image between hardware and software. We compare FHAST against Bowtie on the Convey HC-1, on human chromosome 14 as well as on the entire human genome, and show up to ~70X improvement in total end-to-end execution time, reducing runs that take several hours to a few minutes. We also show that FHAST scales more favorably to multiple FPGAs than Bowtie does to multiple CPUs.
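Bowtie and BWA, the tools FHAST is positioned against, are built on FM-index backward search, and per-read lookups of this kind are what the concurrent hardware threads above parallelize. As an illustrative software sketch only (the abstract does not spell out FHAST's internal algorithm, and the FPGA threading and Convey memory system are not modeled), exact-match backward search looks like this:

```python
# Illustrative FM-index exact-match backward search: the core per-read loop
# that Bowtie-style aligners accelerate. Small-scale sketch only; a real
# aligner builds the BWT via a suffix array, compresses Occ, and handles
# mismatches.

def bwt(text):
    text += "$"                                    # unique terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_index(text):
    b = bwt(text)
    alphabet = sorted(set(b))
    # C[c] = number of characters in the text strictly smaller than c
    c_arr, total = {}, 0
    for ch in alphabet:
        c_arr[ch] = total
        total += b.count(ch)
    # occ[c][i] = occurrences of c in b[:i]
    occ = {ch: [0] * (len(b) + 1) for ch in alphabet}
    for i, ch in enumerate(b):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return c_arr, occ, len(b)

def count_matches(read, c_arr, occ, n):
    lo, hi = 0, n                                  # current BWT interval
    for ch in reversed(read):                      # backward search
        if ch not in c_arr:
            return 0
        lo = c_arr[ch] + occ[ch][lo]
        hi = c_arr[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo                                 # number of occurrences

c_arr, occ, n = fm_index("ACGTACGTAC")
print(count_matches("GTA", c_arr, occ, n))         # -> 2
```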
:::::::

Floating Point Vector Processing using 28nm FPGAs
Michael Parker*, Altera Corporation; Dan Pritsker, Altera Corporation
Abstract: Vector processing is useful for implementing linear algebra algorithms used in many commercial, government, and military applications. Typically, these are implemented in software on specialized multi-core CPU or GPU architectures. A compelling alternative is an FPGA-based implementation using single-precision floating point. This paper examines the implementation of one such algorithm, QR decomposition with back substitution, a common method for solving non-square, over-determined systems of equations. It has been implemented on a mid-sized 28nm FPGA. Performance (GFLOPs), throughput, Fmax, and power consumption are measured. The algorithm is implemented as a parameterizable core, which can be easily configured for all the matrix sizes benchmarked herein.
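For readers unfamiliar with the algorithm being benchmarked: an over-determined system Ax ≈ b is solved by factoring A = QR and then back-substituting through the triangular system Rx = Qᵀb. A brief numerical sketch of the same math in software (toy sizes; the paper's parameterizable FPGA core and datapath details are not represented):

```python
import numpy as np

# Over-determined system: more equations (rows) than unknowns (columns).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3)).astype(np.float32)   # single precision
b = rng.standard_normal(8).astype(np.float32)

# QR decomposition: A = Q R with Q orthonormal, R upper-triangular.
Q, R = np.linalg.qr(A)

# Back substitution on R x = Q^T b, written out explicitly.
y = Q.T @ b
x = np.zeros(3, dtype=np.float32)
for i in reversed(range(3)):
    x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]

# Same answer as the library least-squares solver, up to float32 rounding.
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-4))
```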
:::::::

Reconfigurable Advanced Rapid-prototyping Environment (RARE)
Michael Bonato*, Colorado Engineering, Inc.
Abstract: The size, weight, and power (SWaP) budgets available to support embedded computing in airborne platforms, especially unmanned vehicles, are typically very constrained. Legacy backplane-centric approaches limit a designer's flexibility to implement high performance computing in these tight spaces. This paper describes an out-of-the-box architectural approach, developed under a program sponsored by the Missile Defense Agency (MDA), enabling modular, scalable, high performance embedded computing within challenging SWaP footprints.

:::::::

Large Scale Network Situational Awareness via 3D Gaming Technology
Matthew Hubbell*, MIT Lincoln Laboratory
Abstract: Obtaining situational awareness of network activity across an enterprise presents unique visualization challenges. IT analysts are required to quickly gather and correlate large volumes of disparate data to identify the existence of anomalous behavior. This paper shows how the MIT Lincoln Laboratory LLGrid team has approached network situational awareness using the Unity 3D video game engine. We have developed a 3D environment of the physical plant in the format of a networked multi-player First Person Shooter (FPS) to present a virtual depiction of the current state of the network and the machines operating on it. Within the game, an analyst or player can gather critical information on all network assets as well as perform physical system actions on machines in question. 3D gaming technology provides tools to create an environment that is visually familiar to the player and can display immense amounts of system data in a meaningful, easy-to-absorb format. Our prototype system was able to monitor and display 5000 assets in roughly 10% of our network time window.

:::::::

Scalable Cyber-Security for Terabit Cloud Computing
Jordi Ros-Giralt*, Reservoir Labs
Abstract: This paper addresses the problem of scalable cyber-security using a cloud computing architecture. Scalability is treated in two contexts: (1) performance and power efficiency and (2) the degree of security-relevant information detected by the cyber-security cloud (CSC). We provide a framework for constructing CSCs that derives from a set of fundamental building blocks (forwarders, analyzers, and grounds) and from the identification of the smallest functional units (atomic CSC cells, or simply aCSC cells) capable of embedding the full functionality of the cyber-security cloud. aCSC cells are then studied, and several high-performance algorithms are presented to optimize the system's performance and power efficiency. Among these, a new queuing policy, called tail early detection (TED), is introduced to proactively drop packets so that the degree of detected information is maximized while power is saved by avoiding cycles spent on less relevant traffic components. We also show that it is possible to use aCSC cells as core building blocks to construct arbitrarily large cyber-security clouds by structuring the cells in a hierarchical architecture. To demonstrate the utility of our framework, we implement one cyber-security "mini-cloud" on a single-chip prototype based on Tilera's TILEPro64 processor, demonstrating performance of up to 10 Gbps.

:::::::

Scalable Cryptographic Authentication for High Performance Computing
Andrew Prout*, MIT Lincoln Laboratory
Abstract: High performance computing (HPC) uses supercomputers and computing clusters to solve large computational problems. HPC resources are frequently shared systems, and access to restricted data sets or resources must be authenticated. These authentication needs can take multiple forms, both internal and external to the HPC cluster. A computational stack that uses web services among nodes in the HPC system may need to perform authentication between nodes of the same job, or a job may need to reach out to data sources outside the HPC system. Traditional authentication mechanisms such as passwords or digital certificates encounter issues with the distributed and potentially disconnected nature of HPC systems. Distributing and storing plain-text passwords or cryptographic keys among nodes in an HPC system without special protection is a poor security practice. Systems that reach back to the user's terminal for access to the authenticator are possible, but only in fully interactive supercomputing, where connectivity to the user's terminal can be guaranteed. Point solutions can be enabled for these use cases, such as software-based roles or self-signed certificates, but they require significant expertise in digital certificates to configure. A more general solution is called for that is both secure and easy to use. This paper presents an overview of a solution implemented on the interactive, on-demand LLGrid computing system [3,4,5] at MIT Lincoln Laboratory and its use to solve one such authentication problem.

:::::::

On an MPI Rank/Node Layout Utility for Improving Performance of Communications Intensive Heterogeneous MPI Applications on SGI Altix ICE 8200 Systems
Bracy Elton*, DRC
Abstract:

:::::::

An update on Scalable Implementation of Primitives for Homomorphic EncRyption – FPGA implementation using Simulink
David Cousins*, Raytheon BBN Technologies
Abstract: Accelerating the development of a practical Fully Homomorphic Encryption (FHE) scheme is the goal of the DARPA PROCEED program. For the past year, this program has focused on accelerating various aspects of the FHE concept toward practical implementation and use. FHE would be a game-changing technology, enabling secure, general computation on encrypted data, e.g., on untrusted off-site hardware. However, FHE still requires several orders of magnitude improvement in computation before it will be practical for widespread use. Recent theoretical breakthroughs demonstrated the existence of FHE schemes [1, 2], and to date much progress has been made in both algorithmic and implementation improvements. Specifically, our contribution to the PROCEED program has been the development of FPGA-based hardware primitives to accelerate computation on encrypted data using FHE based on lattice techniques [3]. Our project, SIPHER, has been using a state-of-the-art tool chain developed by MathWorks to generate VHDL code for FPGA circuits directly from Simulink models. Our baseline homomorphic encryption prototypes are developed directly in Matlab, using the fixed-point toolbox to perform the required integer arithmetic. Constant improvements in algorithms require us to be able to implement them quickly in a high-level language such as Matlab. We reported our initial results at HPEC 2011 [4]. In the past year, increases in algorithm complexity have introduced several new design requirements for our FPGA implementation. This report presents new Simulink primitives that had to be developed to deal with these new requirements.
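The lattice-based FHE schemes referenced above spend most of their time on modular polynomial arithmetic, which is the kind of integer workload hardware primitives like SIPHER's target. As a hedged illustration (toy parameters far below any secure setting, and not the project's actual primitives), multiplication in the ring Z_q[x]/(x^n + 1) can be sketched as a negacyclic convolution:

```python
import numpy as np

# Toy parameters; real FHE uses n in the thousands and a much larger modulus q.
n, q = 8, 97

def ring_mul(a, b):
    """Multiply two polynomials in Z_q[x]/(x^n + 1) (negacyclic convolution)."""
    full = np.outer(a, b)                  # all pairwise coefficient products
    res = np.zeros(n, dtype=np.int64)
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] += full[i, j]       # x^k stays in range
            else:
                res[k - n] -= full[i, j]   # x^n = -1 wraps with a sign flip
    return res % q

a = np.arange(n, dtype=np.int64) % q
b = (np.arange(n, dtype=np.int64) * 3 + 1) % q
print(ring_mul(a, b))
```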
:::::::

Modeling Optical Scatter from Complex Aerosol Samples
Adam Milstein, MIT Lincoln Laboratory
Abstract: MIT Lincoln Laboratory has developed an advanced modeling capability to investigate the elastic scattering properties of biological and inert aerosols at near- and mid-wave infrared wavelengths. The aerosol sample's optical cross section and Mueller matrix across multiple scattering angles are calculated and validated against optical measurements. The modeling approach uses the discrete dipole approximation to compute elastic optical scatter from an ensemble of randomly oriented particles of arbitrary shape, index of refraction, and size distribution. The calculation over large ensembles of particles requires significant computational resources and is thus implemented on LLGrid, a large parallel grid computer. This presentation shows results for several types of particles, including bacterial spore clusters and dust particles, with comparison to microscopy and optical measurements from the Standoff Aerosol Active Signature Testbed.

This work is sponsored under Air Force contract FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

:::::::

Scrubbing Optimization via Availability Prediction (SOAP) for Reconfigurable Space Computing
Quinn Martin*, University of Florida, NSF CHREC; Alan George, CHREC, ECE Dept., University of Florida
Abstract: Reconfigurable computing with FPGAs can be highly effective in terms of performance, adaptability, and power for accelerating space applications, but FPGA configuration memory must be scrubbed to prevent the accumulation of single-event upsets. Many scrubbing techniques currently exist, each with different advantages, making it difficult for the system designer to choose the optimal scrubbing strategy for a given mission. This paper surveys the currently available scrubbing techniques and introduces the SOAP method for predicting system availability under various scrubbing strategies using Markov models. We then apply the method to compare hypothetical Virtex-5 and Virtex-6 systems for blind, CRC-32, and Frame ECC scrubbing strategies in LEO and HEO. We show that availability in excess of five nines can be obtained with modern FPGA-based systems using scrubbing. Furthermore, we show the value of the SOAP method by observing that different scrubbing strategies are optimal for different types of missions.
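The SOAP idea of predicting availability from Markov models can be illustrated with the smallest possible example: a two-state chain with an upset rate and a scrub/recovery rate, whose steady-state distribution gives availability. The rates below are invented for illustration; the paper's actual models distinguish scrub strategies, orbits, and device generations:

```python
import numpy as np

# Two-state availability model: state 0 = operational, state 1 = failed
# (accumulated upset awaiting scrub/reconfiguration). Rates are made up;
# SOAP derives them from the orbit, device, and scrubbing strategy.
lam = 1e-4   # upsets per second that take the design down
mu = 1e-2    # recoveries per second (scrub + repair)

# Continuous-time generator matrix Q; each row sums to zero.
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])

# Steady-state distribution pi solves pi Q = 0 with pi summing to 1.
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(f"steady-state availability = {pi[0]:.6f}")   # equals mu / (lam + mu)
```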
:::::::

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging
Saoni Mukherjee*, Northeastern University
Abstract: Biomedical image reconstruction applications with large datasets can benefit from acceleration. Graphics Processing Units (GPUs) are particularly useful in this context, as they can produce high fidelity images rapidly. An image reconstruction algorithm for cone-beam computed tomography (CT) using 2D projections is implemented on GPUs. The implementation takes slices of the target, weights the projection data, and then filters the weighted data to backproject it and create the final 3D reconstruction. This is implemented on two types of hardware: a CPU, and a heterogeneous system combining CPU and GPU. The CPU implementations in C and MATLAB are compared with the heterogeneous versions written in CUDA-C and OpenCL. The relative performance is tested and evaluated on a mathematical phantom as well as on mouse data.

:::::::

Optimized Parallel Distribution Load Flow Solver on Commodity Multi-core CPU
Tao Cui*, Department of ECE, Carnegie Mellon University, Pittsburgh
Abstract: Solving a large number of load flow problems quickly is required for various power system applications, especially smart grid applications, including Monte Carlo analysis, long-term steady-state simulation, and system benchmarking, among others. However, due to the computational burden, such applications are considered time consuming and infeasible for online or real-time use. In this work we developed a high performance framework for high throughput distribution load flow computation, taking advantage of the performance-enhancing features of multi-core CPUs and various code optimization techniques. We optimized the data structure to better fit the memory hierarchy. We use SPIRAL as the code generator to exploit the inherent patterns of the load flow model. Multiple levels of parallelism, including SIMD and multithreading, are also used for the particular problem setup. The optimized solver is able to achieve more than 50% of a Core i7 CPU's machine peak, which translates to solving millions of load flow problems within a second for the IEEE 37 test feeder. Finally, a scheduling threading structure is also designed to enable real-time applications.
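Radial distribution feeders such as the IEEE 37 test feeder are commonly solved with a forward/backward sweep rather than a general Newton method. A single-phase toy version of that iteration is sketched below; the paper's solver is three-phase and SIMD-vectorized via SPIRAL-generated code, none of which is shown, and the feeder data here is invented:

```python
import numpy as np

# Radial toy feeder: bus 0 is the substation (slack); each other bus hangs
# off a parent bus via a series impedance. Single-phase equivalent only;
# real feeders like the IEEE 37 are unbalanced three-phase.
parent = [-1, 0, 1, 1]                                         # tree structure
z = np.array([0, 0.01 + 0.02j, 0.02 + 0.04j, 0.015 + 0.03j])  # line impedances (pu)
s_load = np.array([0, 0.5 + 0.2j, 0.3 + 0.1j, 0.4 + 0.15j])   # complex loads (pu)

v = np.ones(4, dtype=complex)                 # flat start
for _ in range(50):                           # sweep until converged
    # Backward sweep: accumulate branch currents from leaves toward the root
    # (children are processed before their parents).
    i_branch = np.conj(s_load / v)            # constant-power load currents
    for k in range(3, 0, -1):
        i_branch[parent[k]] += i_branch[k]
    # Forward sweep: update voltages from the root outward.
    v_new = v.copy()
    for k in range(1, 4):
        v_new[k] = v_new[parent[k]] - z[k] * i_branch[k]
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

print(np.abs(v))                              # bus voltage magnitudes
```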
:::::::

Efficient and Scalable Computations with Sparse Tensors
Muthu Baskaran*, Reservoir Labs Inc.; Benoit Meister, Reservoir Labs Inc.; Nicolas Vasilache, Reservoir Labs Inc.; Richard Lethin, Reservoir Labs Inc.
Abstract: For applications that deal with large amounts of high-dimensional, multi-aspect data, it becomes natural to represent such data as tensors or multi-way arrays. Multi-linear algebraic computations such as tensor decompositions are performed for summarization and analysis of such data. Their use in real-world applications spans domains such as signal processing, data mining, computer vision, and graph analysis. The major challenges in applying tensor decompositions to real-world applications are (1) dealing with large-scale, high-dimensional data and (2) dealing with sparse data. Recently, algorithms for tensor decompositions that account for the sparsity of data have been proposed. In this paper, we describe new sparse tensor storage formats that are flexible both for performing tensor computations and for handling storage requirements. Further, we propose an optimization that improves data reuse and reduces redundant or unnecessary computations in Tucker decomposition algorithms. Furthermore, we couple our data reuse optimization with the benefits of our sparse tensor storage formats to provide a memory-efficient, scalable solution for handling large-scale sparse tensor computations. We demonstrate improved performance and address memory scalability using our techniques on both small synthetic data sets and large-scale sparse real data sets.

:::::::

Benchmarking Parallel Eigen Decomposition for Residuals Analysis of Very Large Graphs
Edward Rutledge*, MIT Lincoln Laboratory; Benjamin Miller, MIT Lincoln Laboratory; Michelle Beard, MIT Lincoln Laboratory
Abstract: Graph analysis is used in many domains, from the social sciences to physics and engineering. The computational driver for one important class of graph analysis algorithms is the computation of leading eigenvectors of a graph's modularity matrix. This paper explores the computational implications of performing an eigen decomposition of a graph's modularity matrix using commodity cluster hardware and freely available eigensolver software, for graphs with 1 million to 1 billion vertices and 8 million to 8 billion edges. When working with graphs of these sizes, parallel eigensolvers are of particular interest. Our results suggest that graph analysis approaches based on eigenspace analysis of graph residuals are feasible even for graphs of these sizes.
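At these graph sizes the modularity matrix B = A − d dᵀ/(2m) (with d the degree vector and m the edge count) can never be formed densely; eigensolvers only need its matrix-vector product, which costs one sparse matvec plus a rank-one correction. A small sketch of that implicit-operator idea, using simple power iteration in place of the production parallel eigensolvers benchmarked in the paper:

```python
import numpy as np
import scipy.sparse as sp

# Small random sparse graph; the point is that B = A - d d^T / (2m) is never
# materialized: the matvec uses sparse A plus a rank-one correction.
rng = np.random.default_rng(1)
n = 200
A = sp.random(n, n, density=0.05, random_state=1)
A = ((A + A.T) > 0).astype(float)            # symmetric 0/1 adjacency
d = np.asarray(A.sum(axis=1)).ravel()        # degree vector
two_m = d.sum()

def modularity_matvec(x):
    return A @ x - d * (d @ x) / two_m       # B x without forming B

# Power iteration for the dominant (largest-magnitude) eigenpair of B.
# Real codes use Lanczos/ARPACK-style parallel solvers instead of this loop.
x = rng.standard_normal(n)
for _ in range(500):
    y = modularity_matvec(x)
    x = y / np.linalg.norm(y)
print("dominant eigenvalue estimate:", x @ modularity_matvec(x))
```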
:::::::

Driving Big Data With Big Compute
Chansup Byun*, MIT Lincoln Laboratory
Abstract: Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community, and MPI clusters provide high parallel efficiency for compute-intensive workloads. Bringing the big data and big compute communities together is an active area of research. The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds. LLGrid MapReduce allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster. D4M (Dynamic Distributed Dimensional Data Model) provides a high-level distributed array interface to the Apache Accumulo database. The accessibility of these technologies is assessed by measuring the effort required to use them, typically a few lines of code. The performance is assessed by measuring the insert rate into the Accumulo database. Using these tools, a database insert rate of 3M inserts/second has been achieved on an 8-node cluster.

:::::::

Benchmarking LAMMPS on Utility Server GPUs
Antoinette Silas*, USACE-ERDC-ITL
Abstract: The High Performance Computing Modernization Program's Technology Insertion (TI) process uses a variety of application benchmark codes to assist in determining which HPC vendors are capable of providing the best computing machinery. LAMMPS is a key molecular dynamics code that has been included in the TI benchmark suite for years. With recent technological advances made using graphics processing units (GPUs), we are now exploring the benefit of implementing the aforementioned code within the GPU environment. Using the standard LAMMPS benchmarking package from a typical TI process, we examine its performance on a utility server that houses GPUs. The resulting times in a CPU-only environment versus a heterogeneous environment will help determine whether the effort for a successful GPU build is worthwhile. One present benefit is that the LAMMPS code already possesses the capability to run on a GPU with minimal coding effort. Therefore, the primary focus lies within the test cases, computing environments, and results.

:::::::

General Purpose Computing on Graphics Processing Units: Decomposition Strategy
Henry Au, Gregory Lum, Space and Naval Warfare Systems Center Pacific (SSC Pacific), Pearl City, HI
Abstract: This paper describes optimization strategies for porting traditional C/C++ algorithms that run on CPUs to the parallel processing architectures found on Graphics Processing Units (GPUs). The CUDA parallel programming architecture is also explored through the use of NVIDIA's Visual Profiler for performance analysis. Real-time video feeds, such as from onshore surveillance cameras, offer limited visibility when fog, haze, smoke, or dust clouds are present. In order to enhance the video, image processing algorithms such as the Adaptive Linear Filter (ALF) are applied. However, algorithms such as the ALF require significant computation time, limiting the picture quality, the size of the video, or the number of video feeds processed concurrently in real time. The GPU's parallel processing power is exploited to attain speedups so that image processing can be performed on the fly in real time. Thus, surveillance is enhanced by providing visual improvement for detection and classification of objects in low-visibility conditions using the ALF. The ALF was selected to provide an image processing context for algorithm optimization on GPUs. The optimization strategies explored are CUDA host memory allocations, streams, and asynchronous memory transfers. Performance results of the ALF running on the GPU, before and after optimization, are also reported. GPU limitations are briefly discussed as well, since not every algorithm benefits from execution on parallel processing architectures.

:::::::

Anatomy of a Globally Recursive Embedded LINPACK Benchmark
Piotr Luszczek*, University of Tennessee
Abstract: We present a complete bottom-up implementation of an embedded LINPACK benchmark on the iPad 2. We use a novel formulation of LU factorization that is recursive and parallel at the global scope. We believe our new algorithm presents an alternative to existing linear algebra parallelization techniques such as master-worker and DAG-based approaches. We show an assembly API that affords a much higher level of abstraction and provides rapid code development within the confines of the mobile device SDK. We use performance modeling to cope with the limitations of the device and the limited access to it from a development environment not geared toward HPC application tuning.

:::::::

STINGER: High Performance Data Structure for Streaming Graphs
David Ediger*, Georgia Institute of Technology
Abstract: The current research focus on "big data" problems highlights the scale and complexity of analytics required and the high rate at which data may be changing. Future applications in this space will need to cope with high rates of change at scale. STINGER is a scalable, high performance graph data structure that enables these applications. Key attributes of STINGER are fast insertions, deletions, and updates on graphs with semantic information and skewed degree distributions. We demonstrate a process of algorithmic and architectural optimizations that enable high performance on the Cray XMT family and on Intel multicore servers. Our implementation of STINGER processes over 3 million updates per second on a scale-free graph with 537 million edges.
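STINGER's published design gives each vertex a linked chain of fixed-size edge blocks, so insertions typically touch a single block and deletions leave holes that later insertions reuse. The following is a schematic, sequential rendering of that layout; the real structure also stores edge weights and timestamps and supports lock-free parallel updates, all omitted here:

```python
# Schematic of a STINGER-like structure: per-vertex chains of fixed-size
# edge blocks. Greatly simplified relative to the real data structure.
BLOCK_SIZE = 4
EMPTY = -1

class EdgeBlock:
    def __init__(self):
        self.dst = [EMPTY] * BLOCK_SIZE   # destination vertices (EMPTY = hole)
        self.next = None                  # next block in this vertex's chain

class Stinger:
    def __init__(self, nv):
        self.heads = [None] * nv

    def insert_edge(self, src, dst):
        blk, hole = self.heads[src], None
        while blk is not None:
            for i, d in enumerate(blk.dst):
                if d == dst:
                    return                # already present; real code updates a timestamp
                if d == EMPTY and hole is None:
                    hole = (blk, i)       # remember the first reusable slot
            blk = blk.next
        if hole is None:                  # no free slot: prepend a fresh block
            nb = EdgeBlock()
            nb.next = self.heads[src]
            self.heads[src] = nb
            hole = (nb, 0)
        hole[0].dst[hole[1]] = dst

    def delete_edge(self, src, dst):
        blk = self.heads[src]
        while blk is not None:
            for i, d in enumerate(blk.dst):
                if d == dst:
                    blk.dst[i] = EMPTY    # leave a hole for later inserts
                    return
            blk = blk.next

g = Stinger(10)
for v in (1, 2, 3, 4, 5):
    g.insert_edge(0, v)                   # spills into a second block
g.delete_edge(0, 3)
g.insert_edge(0, 6)                       # reuses the hole left by vertex 3
```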
:::::::

Cluster-based 3D Reconstruction of Aerial Video
Scott Sawyer*, MIT Lincoln Laboratory
Abstract: Large-scale 3D scene reconstruction using Structure from Motion (SfM) continues to be very computationally challenging despite much active research in the area. We propose an efficient, scalable processing chain designed for cluster computing and suitable for use on aerial video. The sparse bundle adjustment step, which is iterative and difficult to parallelize, is accomplished by partitioning the input image set, generating independent point clouds in parallel, and then fusing the clouds and combining duplicate points. We compare this processing chain to a leading parallel SfM implementation, which exploits fine-grained parallelism in various matrix operations and is not designed to scale beyond a multi-core workstation with a GPU. We show that our cluster-based approach offers significant improvement in scalability and runtime while producing comparable point cloud density and more accurate point location estimates.

:::::::

A MATLAB-to-Target Development Workflow using Sourcery VSIPL++
Stefan Seefeld*, Mentor Graphics, Inc.
Abstract: A hybrid MATLAB/C++ programming model for high performance embedded computing is presented. It is shown how the use of a common data model and API can help not only to speed up the development process, but also to keep the original MATLAB model in sync with the evolving C++ code, allowing it to remain a gold standard for the project as it evolves. Keywords: multi-language, scripting, VSIPL++, prototyping, signal- and image-processing

:::::::

Big Ocean Data: Numerically Conservative Parallel Query Processing for CFD Applications
Prof. Bill Howe, University of Washington
Abstract: The ocean sciences are rapidly evolving from an expeditionary science to an observatory-based science. As a result, data integration and analysis are replacing data acquisition as the bottleneck to new discoveries: the observatories coming online are capable of collecting data faster than we can understand it. The applications and datasets in this domain have several properties that complicate the design of efficient abstractions. Unlike atmospheric models, complex domains can only be modeled precisely using polyhedral meshes ("unstructured grids") rather than simpler arrays or relations. Many common tasks are extremely sensitive to numerical errors, forcing researchers to access raw data at full resolution and implement their own data integration techniques by hand. File-oriented standards and systems popular in this domain offer no support for parallel processing. In this talk, I'll describe an algorithm for efficient and numerically conservative regridding of arbitrary unstructured grids, and show how this primitive can be made to commute with a mesh subsetting operator. Together, these two operations appear sufficient to efficiently express a variety of tasks and idioms in ocean data analysis, including model coupling, visualization, distributed/parallel processing, and adaptive mesh refinement. Finally, I'll describe our current release of the code and how we have integrated it with the Hyrax OPeNDAP server, a popular platform for serving ocean and atmospheric data over the Internet.
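Numerically conservative regridding means the integral of a field is preserved exactly when moving between meshes: each target cell takes a sum of source-cell values weighted by geometric overlap. A one-dimensional, first-order sketch of the idea follows; the talk's contribution handles arbitrary polyhedral meshes and commutes with subsetting, which this toy does not attempt:

```python
import numpy as np

def conservative_regrid_1d(src_edges, src_vals, dst_edges):
    """First-order conservative remap: cell averages weighted by overlap length."""
    dst_vals = np.zeros(len(dst_edges) - 1)
    for j in range(len(dst_edges) - 1):
        total = 0.0
        for i in range(len(src_edges) - 1):
            overlap = min(src_edges[i + 1], dst_edges[j + 1]) - \
                      max(src_edges[i], dst_edges[j])
            if overlap > 0:
                total += src_vals[i] * overlap
        dst_vals[j] = total / (dst_edges[j + 1] - dst_edges[j])
    return dst_vals

src_edges = np.linspace(0.0, 1.0, 6)         # 5 uniform source cells
src_vals = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
dst_edges = np.array([0.0, 0.35, 0.6, 1.0])  # 3 non-aligned target cells

dst_vals = conservative_regrid_1d(src_edges, src_vals, dst_edges)
# The integral (sum of value * cell width) is conserved across the remap:
print(np.dot(src_vals, np.diff(src_edges)), np.dot(dst_vals, np.diff(dst_edges)))
```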