2013 IEEE High Performance Extreme Computing Conference (HPEC '13)
Seventeenth Annual HPEC Conference
10-12 September 2013
Westin Hotel, Waltham, MA USA
A Nested Dissection Partitioning Method for Parallel Sparse Matrix-Vector Multiplication
Erik Boman, Sandia National Laboratories; Michael Wolf, MIT Lincoln Laboratory
Abstract: We consider how to map sparse matrices across processes to reduce communication costs in parallel sparse matrix-vector
multiplication, a ubiquitous kernel in high performance computing. Our main contributions are: (i) an exact graph model for
communication with general (two-dimensional) matrix distribution, and (ii) a recursive partitioning algorithm based on nested dissection
that approximately solves this model. We have implemented our algorithm using hypergraph partitioning software to enable a fair
comparison with existing methods. We present partitioning results for sparse structurally symmetric matrices from several application
areas. Our new method is competitive with the best 2D algorithm (fine-grain hypergraph model) in terms of communication volume, but
requires fewer messages. The nested dissection method is almost as fast to compute as 1D methods and the communication volume is
significantly reduced (up to 97%) compared to 1D layout. Further improvements in quality may be possible by small modifications to
existing nested dissection ordering software.
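The kernel being partitioned can be stated compactly. The following is a minimal sequential CSR sparse matrix-vector multiply sketch; the function name and CSR field names are illustrative, not from the paper:

```python
def csr_spmv(indptr, indices, data, x):
    """Compute y = A*x for a sparse matrix A stored in CSR form.

    indptr[i]:indptr[i+1] delimits the nonzeros of row i;
    indices holds their column ids, data their values.
    """
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for jj in range(indptr[i], indptr[i + 1]):
            acc += data[jj] * x[indices[jj]]
        y[i] = acc
    return y

# 2x2 example: A = [[2, 0], [1, 3]]
# csr_spmv([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 1.0]) -> [2.0, 4.0]
```

In a two-dimensional distribution each process owns an arbitrary subset of the nonzeros, so entries of x must be communicated to the owners of nonzeros in the matching columns and partial sums of y sent back; the partitioning methods discussed in the abstract aim to minimize that communication volume.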
Novel Algebras for Advanced Analytics in Julia
Alan Edelman, MIT; Jeremy Kepner, MIT; Jeff Bezanson; Stefan Karpinski; Viral Shah
Abstract: A linear algebraic approach to graph algorithms that exploits the sparse adjacency matrix representation of graphs can provide a
variety of benefits. These benefits include syntactic simplicity, easier implementation, and higher performance. One way to employ linear
algebra techniques for graph algorithms is to use a broader definition of matrix and vector multiplication. We demonstrate through the
use of the Julia language system how easy it is to explore semirings using linear algebraic methodologies.
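The broader definition of matrix multiplication referred to above can be sketched by parameterizing the matrix-vector product over a semiring's add and multiply operations. This illustrative Python version (the names are my own; the paper's examples use Julia) applies the (min, +) tropical semiring to perform one step of shortest-path relaxation:

```python
import math

def semiring_matvec(A, x, add, mul, zero):
    """y[i] = add-reduction over j of mul(A[i][j], x[j]),
    with `zero` as the identity element of `add`."""
    y = []
    for row in A:
        acc = zero
        for a_ij, x_j in zip(row, x):
            acc = add(acc, mul(a_ij, x_j))
        y.append(acc)
    return y

INF = math.inf
# Edge-weight matrix: A[i][j] = weight of edge j -> i (INF if absent, 0 on the diagonal).
A = [[0, 1, INF],
     [1, 0, 2],
     [INF, 2, 0]]
dist = [0, INF, INF]  # distances from vertex 0
step = semiring_matvec(A, dist, min, lambda a, b: a + b, INF)
# step -> [0, 1, inf]: one (min, +) matvec relaxes edges one hop outward
```

Swapping in (or, and) over Booleans turns the same routine into a breadth-first-search frontier expansion, which is the kind of reuse the abstract highlights.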
Standards for Graph Algorithm Primitives
Jeremy Kepner, MIT; Timothy Mattson, Intel
Abstract: It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations
is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the
problem and announcing our intention to launch an open effort to define this standard.
High Performance Embedded Computing Module Enabled by GPU, CPU and FPGA SOCs
Jason Fritz, Colorado Engineering, Inc.; Joshua Pearson, Colorado Engineering, Inc.; Michael Bonato, Colorado Engineering, Inc.
Abstract: Applications that require high performance embedded computing platforms with low Size, Weight and Power (SWaP) are
continually searching for the balance between the number of floating point operations (FLOPs) and power consumption on small form
factors. As the computer gaming industry has developed more capable Graphics Processing Units (GPUs) at lower cost, other industries have
pushed them to provide general-purpose processing as an alternative to Field Programmable Gate Arrays (FPGAs). System designers then
choose between FPGAs, GPUs and multi-core processors to provide the desired performance and SWaP tradeoff. Technology has now
advanced to provide all of these processing components on a small form factor modular, scalable embedded circuit card assembly (CCA)
enabled by System-On-Chip (SOC) products. In addition, the CCA does not require a backplane to interface with other modules.
Block Processor: A Resource-distributed Architecture
Zeke Wang, Zhejiang University
Abstract: We present the architecture of the Block Processor, a task-level coprocessor that executes vectorizable computing tasks migrated
from the main processor via a command bus. The Block Processor is designed around 32 high-MVL block registers, which serve both as direct
operands of vector instructions and as the local cache of the Block Processor. The corresponding conflict-solving mechanism is unique in that
it scales across different implementations and supports chaining simply by adding extra execution states. The architecture distributes the
block registers, ALUs, and control logic. We implement the Block Processor on an FPGA, onto which it maps efficiently since the FPGA likewise
distributes its internal resources. Each block register requires two FPGA Block RAMs to provide 2-read-1-write-port access at a depth of 1024
and a width of 32 bits. With the enhanced chaining and decoupling, the architecture can hide the latency of vector memory instructions and
thus sustain its computing throughput. With little resource occupied, a 1024-point radix-2 DIF FFT costs 11348 cycles on one Block Processor.
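For reference, the benchmark workload cited above is a standard transform. A minimal scalar (non-vectorized) radix-2 decimation-in-frequency FFT can be sketched as follows; this illustrates only the algorithm's structure, not the Block Processor implementation:

```python
import cmath

def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT.
    len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterfly stage: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    top = [x[k] + x[k + half] for k in range(half)]
    bot = [(x[k] - x[k + half]) * cmath.exp(-2j * cmath.pi * k / n)
           for k in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif(top)   # X[0], X[2], ...
    out[1::2] = fft_dif(bot)   # X[1], X[3], ...
    return out

# fft_dif([1, 1, 1, 1]) -> [4, 0, 0, 0] (up to rounding)
```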
Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA
Da Tong, EE USC; Viktor Prasanna, University of Southern California
Abstract: Statistical information of the network traffic flows is essential for many network management and security applications in the
Internet and data centers. In this work, we propose an architecture for a dynamically configurable online statistical flow feature extractor
on FPGA. The proposed architecture computes a set of widely used statistical features of the network traffic flows on-the-fly. We design an
application specific data forwarding mechanism to handle data hazards without stalling the pipeline. We prove that our architecture can
correctly process any sequence of packets. Users can dynamically configure the feature extractor through run-time parameters. Post
place-and-route results on a state-of-the-art FPGA device show that the feature extractor achieves a throughput of 96 Gbps while
supporting 64K concurrent flows.
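The abstract does not enumerate its exact feature set, but a software analogue of such an extractor can be sketched as per-flow streaming statistics updated once per packet. The flow key, the chosen features, and the Welford variance update below are my illustrative assumptions, not the paper's design:

```python
class FlowStats:
    """Online per-flow statistics: packet count, byte total,
    min/max packet size, and running mean/variance of sizes
    (Welford's one-pass update)."""
    def __init__(self):
        self.pkts = 0
        self.bytes = 0
        self.min_len = None
        self.max_len = None
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, pkt_len):
        self.pkts += 1
        self.bytes += pkt_len
        self.min_len = pkt_len if self.min_len is None else min(self.min_len, pkt_len)
        self.max_len = pkt_len if self.max_len is None else max(self.max_len, pkt_len)
        delta = pkt_len - self.mean
        self.mean += delta / self.pkts
        self.m2 += delta * (pkt_len - self.mean)

    def variance(self):
        return self.m2 / self.pkts if self.pkts else 0.0

# One table entry per flow, keyed e.g. by the 5-tuple.
flows = {}
for key, length in [("a", 100), ("a", 300), ("b", 64)]:
    flows.setdefault(key, FlowStats()).update(length)
# flows["a"]: pkts=2, bytes=400, mean=200.0, variance=10000.0
```

A hardware pipeline computing these features must update a flow's state on every cycle even when back-to-back packets hit the same flow, which is the data hazard the paper's forwarding mechanism resolves without stalls.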