A Nested Dissection Partitioning Method for Parallel Sparse Matrix-Vector Multiplication
Erik Boman, Sandia National Laboratories; Michael Wolf, MIT Lincoln Laboratory

Abstract: We consider how to map sparse matrices across processes to reduce communication costs in parallel sparse matrix-vector multiplication, a ubiquitous kernel in high performance computing. Our main contributions are: (i) an exact graph model for communication with a general (two-dimensional) matrix distribution, and (ii) a recursive partitioning algorithm based on nested dissection that approximately solves this model. We have implemented our algorithm using hypergraph partitioning software to enable a fair comparison with existing methods. We present partitioning results for structurally symmetric sparse matrices from several application areas. Our new method is competitive with the best 2D algorithm (the fine-grain hypergraph model) in terms of communication volume, but requires fewer messages. The nested dissection method is almost as fast to compute as 1D methods, and it reduces communication volume significantly (by up to 97%) compared to a 1D layout. Further improvements in quality may be possible through small modifications to existing nested dissection ordering software.

Novel Algebras for Advanced Analytics in Julia
Alan Edelman, MIT; Jeremy Kepner, MIT; Jeff Bezanson; Stefan Karpinski; Viral Shah

Abstract: A linear algebraic approach to graph algorithms that exploits the sparse adjacency matrix representation of graphs can provide a variety of benefits. These benefits include syntactic simplicity, easier implementation, and higher performance. One way to employ linear algebra techniques for graph algorithms is to use a broader definition of matrix and vector multiplication.
We demonstrate, using the Julia language system, how easy it is to explore semirings with linear algebraic methodologies.

Standards for Graph Algorithm Primitives
Jeremy Kepner, MIT; Timothy Mattson, Intel

Abstract: It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This position paper defines the problem and announces our intention to launch an open effort to define this standard.
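As a concrete illustration of the semiring idea in the two abstracts above, the sketch below (in Python rather than Julia, with an illustrative toy graph not drawn from either paper) replaces the usual (+, x) of matrix-vector multiplication with (min, +); repeated semiring "multiplication" by the adjacency matrix then computes single-source shortest paths:

```python
# Hedged sketch: Bellman-Ford-style shortest paths expressed as
# matrix-vector products over the (min, +) semiring, where "add"
# is min and "multiply" is +. The graph below is illustrative.
INF = float("inf")

def minplus_matvec(A, x):
    """Semiring 'Ax': y[i] = min over j of (A[i][j] + x[j])."""
    n = len(A)
    return [min(A[i][j] + x[j] for j in range(n)) for i in range(n)]

# A[i][j] = weight of edge j -> i (INF = no edge, 0 on diagonal).
# Edges: 0 -> 1 (weight 3), 1 -> 2 (weight 1), 2 -> 0 (weight 2).
A = [
    [0,   INF, 2],
    [3,   0,   INF],
    [INF, 1,   0],
]

# Distance vector from vertex 0; n-1 relaxations suffice.
x = [0, INF, INF]
for _ in range(len(A) - 1):
    x = minplus_matvec(A, x)

print(x)  # prints [0, 3, 4]: shortest distances from vertex 0
```

Swapping in other (add, multiply) pairs — e.g. (or, and) for reachability — reuses the same multiplication skeleton, which is the syntactic economy the abstract highlights.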
High Performance Embedded Computing Module Enabled by GPU, CPU and FPGA SOCs
Jason Fritz, Colorado Engineering, Inc.; Joshua Pearson, Colorado Engineering, Inc.; Michael Bonato, Colorado Engineering, Inc.

Abstract: Applications that require high performance embedded computing platforms with low Size, Weight and Power (SWaP) are continually searching for the balance between the number of floating point operations (FLOPs) and power consumption on small form factors. As the computer gaming industry developed more capable Graphics Processing Units (GPUs) at lower cost, other industries have pushed them to provide general purpose processing as an alternative to Field Programmable Gate Arrays (FPGAs). System designers then choose among FPGAs, GPUs, and multi-core processors to achieve the desired performance and SWaP tradeoff. Technology has now advanced to provide all of these processing components on a small form factor, modular, scalable embedded circuit card assembly (CCA) enabled by System-On-Chip (SOC) products. In addition, the CCA does not require a backplane to interface with other modules.

Block Processor: A Resource-distributed Architecture
Zeke Wang, Zhejiang University

Abstract: We present the architecture of the Block Processor, a task-level coprocessor that executes vectorizable computing tasks migrated from the main processor via a command bus. The Block Processor is designed around 32 high-MVL block registers, which can serve both as direct operands of vector instructions and as the local cache of the Block Processor. Its unique conflict-solving mechanism scales across implementations and easily supports chaining through the addition of extra execution states. The architecture distributes the block registers, ALUs, and control logic. Our implementation of the Block Processor maps efficiently onto the FPGA, since the FPGA likewise distributes its internal resources. Each block register requires two FPGA Block RAMs to provide a 2-read-1-write-port, 1024-deep, 32-bit-wide store.
With the enhanced chaining and decoupling, the design can hide the latency of vector memory instructions and thus sustain its computing throughput. Despite the small resource footprint, a 1024-point radix-2 DIF FFT costs 11348 cycles on one Block Processor.

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA
Da Tong, EE USC; Viktor Prasanna, University of Southern California

Abstract: Statistical information about network traffic flows is essential for many network management and security applications in the Internet and in data centers. In this work, we propose an architecture for a dynamically configurable online statistical flow feature extractor on FPGA. The proposed architecture computes a set of widely used statistical features of network traffic flows on the fly. We design an application-specific data forwarding mechanism to handle data hazards without stalling the pipeline. We prove that our architecture can correctly process any sequence of packets. Users can dynamically configure the feature extractor through run-time parameters. Post place-and-route results on a state-of-the-art FPGA device show that the feature extractor achieves a throughput of 96 Gbps while supporting 64K concurrent flows.
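The abstract above does not enumerate its exact feature set, so the following is only a hedged software sketch of the general idea of online flow feature extraction: it maintains common per-flow statistics (packet count, byte total, running mean and variance of packet size) in a single streaming pass using Welford's online algorithm. The flow keys and packet trace are illustrative, not from the paper.

```python
# Hedged sketch: single-pass (online) per-flow statistics, the kind
# of feature a streaming extractor computes without buffering packets.
# Variance uses Welford's algorithm for numerical stability.
class FlowStats:
    def __init__(self):
        self.count = 0        # packets seen in this flow
        self.total_bytes = 0  # cumulative payload size
        self.mean = 0.0       # running mean packet size
        self.m2 = 0.0         # running sum of squared deviations

    def update(self, pkt_len):
        self.count += 1
        self.total_bytes += pkt_len
        delta = pkt_len - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (pkt_len - self.mean)

    @property
    def variance(self):
        # Population variance of packet sizes seen so far.
        return self.m2 / self.count if self.count else 0.0

# Flow table keyed by (src, dst) -- a stand-in for the usual 5-tuple.
flows = {}
packets = [(("a", "b"), 100), (("a", "b"), 300), (("c", "d"), 64)]
for key, length in packets:
    flows.setdefault(key, FlowStats()).update(length)

s = flows[("a", "b")]
print(s.count, s.total_bytes, s.mean, s.variance)  # 2 400 200.0 10000.0
```

A hardware pipeline computes the same update rule, but the single-pass structure — constant state per flow, one update per packet — is what makes such features computable at line rate.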