2021 IEEE High Performance Extreme Computing Virtual Conference 20 - 24 September 2021
Thursday, September 23 4-V: Sponsor Spotlight Session (10:30-11:00) Session Chair(s): Albert Reuther Invited Talk: HPE Sponsor Spotlight Bill Mannel (VP/GM HPC)
HLS Portability from Intel to Xilinx: A Case Study Zhili Xiao (Washington University in St. Louis)*; Roger Chamberlain (Washington University in St. Louis); Anthony M Cabrera (Oak Ridge National Laboratory) Field-programmable gate arrays (FPGAs) are a hardware accelerator option that is growing in popularity. However, FPGAs are notoriously hard to program. To this end, high-level synthesis (HLS) tools have been developed to allow programmers to design hardware accelerators for FPGAs using familiar software languages. The two largest FPGA vendors, Intel and Xilinx, support both C/C++ and OpenCL C for constructing kernels. However, little is known about the portability of designs between these two platforms. In this work, we evaluate the portability and performance of Intel and Xilinx kernels. We conduct a case study, porting the Needleman-Wunsch application from the Rodinia benchmark suite, written in Intel OpenCL C, to Xilinx platforms. We use OpenCL C kernels optimized for Intel FPGA platforms as a starting point and first perform a minimum-effort port to a Xilinx FPGA, also using OpenCL C. We find that simply porting optimizations one-to-one is not enough to enable portable performance. We then seek to improve the performance of those kernels using Xilinx C/C++. By rewriting the kernel for burst transfers and applying other optimizations, we are able to reduce the execution time from an initial 294 s to 2.2 s. Software-Hardware Co-Optimization on Partial-Sum Problem for PIM-based Neural Network Accelerator Qizhe Wu (USTC)*; Xi Jin (University of Science and Technology of China) The crossbar architecture, built from novel memristor devices, enables high-speed and energy-efficient processing-in-memory (PIM) for neural network computing. However, because of limitations of the manufacturing process, it is difficult to fabricate large arrays. As a consequence, the neural network's vector-matrix multiplication (VMM) must split the operands across several arrays to compute partial sums and then add up the partial results. The neural network (NN) training process, which is often influenced by device variations and ADC quantization noise in the PIM system, does not perceive this partial-sum process. As a consequence, when inferring NN models directly on the PIM platform without taking partial sums into account, accuracy suffers significantly. This makes it difficult to apply PIM computing to large-scale neural networks. In particular, our work makes the following contributions: (i) We studied the partial-sum issue for the crossbar architecture when computing high-channel convolution (Conv) and distilled three lessons from it. (ii) To address this issue, we offer techniques for avoiding or minimizing partial sums at the software and hardware levels, respectively. At the software level, we utilized group Conv rather than conventional Conv; at the hardware level, we presented a new architecture adapted to depthwise separable Conv. Experiments were conducted using the Cifar10 dataset and the VGG8 network on an RRAM crossbar architecture. Results show improvements of 15.53% and 14.55% in accuracy, and 0.28× and 0.94× in energy efficiency, at the software and hardware levels, respectively, compared to the conventional PIM scheme.
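To make the partial-sum issue above concrete, the following sketch (plain C++, not the authors' code; the tile width and the uniform ADC step are illustrative assumptions) splits a vector-matrix multiplication across crossbar-sized tiles and quantizes each tile's partial result before accumulating, which is where the accuracy loss described in the abstract originates.

#include <algorithm>
#include <cmath>
#include <vector>

// Quantize one analog partial sum with a uniform ADC step (illustrative model).
static float adc_quantize(float x, float step) {
    return std::round(x / step) * step;
}

// Tiled VMM: an n x n weight matrix is split into crossbar-sized column tiles.
// Each tile yields a partial sum that passes through its ADC before the
// partial results are added up digitally, mirroring the PIM dataflow above.
std::vector<float> vmm_tiled(const std::vector<float>& w,   // row-major n x n
                             const std::vector<float>& x,   // length n
                             int n, int tile, float adc_step) {
    std::vector<float> y(n, 0.0f);
    for (int row = 0; row < n; ++row) {
        for (int c0 = 0; c0 < n; c0 += tile) {
            float partial = 0.0f;
            for (int c = c0; c < std::min(c0 + tile, n); ++c)
                partial += w[row * n + c] * x[c];          // analog dot product
            y[row] += adc_quantize(partial, adc_step);     // per-tile ADC quantization
        }
    }
    return y;
}

Each additional tile contributes its own quantization error, which is one way to see why group or depthwise separable convolutions that keep an output channel within a single array reduce the number of partial sums that must cross an ADC.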
GCN Inference Acceleration using High-Level Synthesis Yi-Chien Lin (University of Southern California)*; Bingyi Zhang (University of Southern California); Viktor K Prasanna (University of Southern California) GCN (Graph Convolutional Network) has become a promising solution for many applications, such as recommendation systems, social data mining, etc. Many of these applications require low-latency GCN inference. In this paper, we provide a case study of GCN inference acceleration on FPGA. We explore the high-level synthesis programming model to achieve low-latency inference. First, we propose a partition-centric mapping strategy to map the execution tasks of GCN onto the FPGA to exploit data reuse, which reduces external memory access overhead. Second, we provide an HLS-based kernel design with improved memory performance that achieves massive data parallelism. Third, we perform design space exploration to facilitate feasible pre-placement, which avoids potential Place-and-Route (PnR) failures. We evaluate our design on a state-of-the-art FPGA platform using three commonly used datasets: Reddit, Yelp, and Amazon-2M. We compare our design with two state-of-the-art libraries, PyTorch-Geometric (PyG) and Deep Graph Library (DGL), running on a high-end CPU and GPU by evaluating their latency and energy efficiency for full-batch GCN inference on a two-layer Vanilla-GCN model. Compared with the PyG CPU version, our design reduces latency by 59.95x and is 96.22x more energy efficient on average. Compared with the DGL CPU version, our design achieves a 2.9x-6.4x speedup and is 5.87x more energy efficient. Compared with the DGL GPU version, although the latency of our design is 1.67x-2.5x that of DGL on the GPU, our design is 1.8x more energy efficient. AI Accelerator Survey and Trends Albert Reuther (MIT Lincoln Laboratory)*; Siddharth Samsi (MIT Lincoln Laboratory); Jeremy Kepner (MIT Lincoln Laboratory); Vijay Gadepally (MIT Lincoln Laboratory); Peter Michaleas (MIT Lincoln Laboratory); Michael S Jones (MIT Lincoln Laboratory) Over the past several years, new machine learning accelerators have been announced and released every month for a variety of applications, from speech recognition and video object detection to assisted driving and many data center applications. This paper updates the survey of AI accelerators and processors from the past two years. It collects and summarizes the current commercial accelerators that have been publicly announced with peak performance and power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are again discussed and analyzed. This year, we also compile a list of benchmarking performance results and compute the computational efficiency with respect to peak performance. Survey and Future Trends for FPGA Cloud Architectures Hafsah Shahzad (Boston University); Ahmed Sanaullah (Red Hat Inc.); Martin Herbordt (Boston University)* In the last five years, FPGA presence in the cloud has gone from near zero (except for deeply embedded devices) to a large fraction of all high-end FPGAs sold. This is because FPGAs uniquely offer the performance, power, and flexibility needed to support the diversity and dynamicity of cloud workloads. We begin by observing that, although FPGAs are widespread, they cannot be deployed arbitrarily as part of cloud infrastructure. Any FPGA cloud architecture must satisfy a number of constraints placed by the cloud provider.
As a result, FPGA use in the cloud is non-uniformly distributed and motivated by the specific advantages and limitations that each unique architecture offers. In this survey, we provide an exploration and analysis of the trends in existing cloud FPGA architectures that highlight this complex relationship between architectures and system requirements. This allows us to identify novel architectures that are likely to offer substantial benefits for cloud workloads.
Tutorial Session: 4-T (12:15-15:45): Python GraphBLAS Tutorial Chair/Host: Tim Mattson & Scott McMillan
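As a rough illustration of the partition-centric mapping described in the GCN inference abstract above, the sketch below (generic C++ over a CSR-style graph; the partition size and buffer are hypothetical stand-ins for on-chip memory, not the HLS kernel itself) aggregates neighbor features one source-vertex partition at a time so that each partition's feature block is staged once and reused across all destinations.

#include <algorithm>
#include <vector>

// Graph stored by destination: in_ptr[v]..in_ptr[v+1] indexes the in-neighbors
// of vertex v in in_idx. Features are row-major, num_vertices x f.
struct Graph {
    std::vector<int> in_ptr;   // size num_vertices + 1
    std::vector<int> in_idx;   // source vertex of each incoming edge
};

// Partition-centric neighbor aggregation: source vertices are processed in
// blocks of `part`, so each block of feature rows is loaded once into a small
// buffer (modeling on-chip memory) and reused for every destination vertex.
std::vector<float> aggregate(const Graph& g, const std::vector<float>& feat,
                             int num_vertices, int f, int part) {
    std::vector<float> out(static_cast<size_t>(num_vertices) * f, 0.0f);
    for (int p0 = 0; p0 < num_vertices; p0 += part) {
        const int p1 = std::min(p0 + part, num_vertices);
        // Stage this source partition's feature rows (one external-memory pass).
        std::vector<float> buf(feat.begin() + static_cast<size_t>(p0) * f,
                               feat.begin() + static_cast<size_t>(p1) * f);
        for (int v = 0; v < num_vertices; ++v)
            for (int e = g.in_ptr[v]; e < g.in_ptr[v + 1]; ++e) {
                const int u = g.in_idx[e];
                if (u < p0 || u >= p1) continue;   // neighbor owned by another partition
                for (int k = 0; k < f; ++k)        // accumulate the staged feature row
                    out[static_cast<size_t>(v) * f + k] +=
                        buf[static_cast<size_t>(u - p0) * f + k];
            }
    }
    return out;
}

Staging each partition once bounds random external-memory accesses to a single pass over the feature matrix per partition, which is the data-reuse effect the mapping strategy targets.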
4-2: Advanced Processor Architectures 2 Session (12:30-13:45) Session Co-Chairs: Mark Barnell & Wei Zhang
System-Level Modeling of GPU/FPGA Clusters for Molecular Dynamics Simulations Chunshu Wu (Boston University)*; Sahan Bandara (Boston University); Tong Geng (Pacific Northwest National Laboratory); Vipin Sachdeva (Silicon Therapeutics); Woody Sherman (Silicon Therapeutics); Martin Herbordt (Boston University) FPGA-accelerated molecular dynamics (MD) research dates back almost two decades and is still being actively studied. MD on FPGA clusters, however, is still in its initial phase, with only small systems built and limited performance studies. Given the cost of building accelerator clusters, and (as we show) the number of plausible architectures, a thorough study is needed. In particular, we investigate both FPGA and GPU/FPGA hybrid clusters. The latter are potentially attractive given the broad availability of GPU clusters and the established use of GPUs for MD, together with the current inability of GPUs to scale for certain critical domains. In this work, we model four promising MD accelerator platforms, including FPGA-only systems with homogeneous and heterogeneous nodes, an existing FPGA-GPU hybrid system (the Cygnus supercomputer), and a synthesis of the commercially available Nvidia DGX-1/DGX-2 products with an FPGA cluster. The models are compared and evaluated, and we find that each of the platforms is suitable for some circumstances. Rapid Configuration of Asynchronous Recurrent Neural Networks for ASIC Implementations Spencer Nelson (University of Arkansas); Wassim Khalil (University of Arkansas); Sangyun Kim (University of Arkansas); Jia Di (University of Arkansas)*; Zhe Zhou (Peking University, CECA); Zhihang Yuan (Peking University); Guangyu Sun (Peking University) Shortened time-to-market has always been a critical goal for semiconductor IC design companies. With the surge of artificial intelligence, ASIC implementations of various machine learning algorithms are still desired in many applications (e.g., edge computing) over their FPGA counterparts. To speed up development, this paper analyzes the architecture of gated Recurrent Neural Networks (RNNs) and creates and utilizes generic, flexible asynchronous Multi-Threshold NULL Convention Logic (MTNCL) components to facilitate the rapid design and implementation of a variety of asynchronous ASIC RNNs. The design methodology has been demonstrated with the TSMC 65nm bulk CMOS process. The design is simulated at the transistor level for power and speed, and the development time is reported to demonstrate the design effort needed for similar implementations. Deluge: Achieving Superior Efficiency, Throughput, and Scalability with Actor Based Streaming on Migrating Threads Brian A Page (University of Notre Dame)*; Peter Kogge (University of Notre Dame) Applications where streams of data are passed through large data structures are becoming increasingly important. For instance, network intrusion detection and cybersecurity as a whole rely on real-time analysis of network traffic. Unfortunately, when implemented on conventional architectures, such applications become horribly inefficient, especially when attempts are made to scale up performance via some sort of parallelism. An earlier paper discussed streaming anomaly detection within a stream having an unbounded range of keys on the Lucata migrating thread architecture.
In this paper we introduce Deluge, a new implementation that addresses several inadequacies of previous designs and seeks to more directly target the hardware efficiencies inherent to migratory execution within a PGAS address space. Deluge achieves major improvements in hardware efficiency, throughput, and scalability over previous implementations.
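As a conceptual, shared-memory analogue of the actor-based streaming idea above (this is not Lucata code; the shard count, hash routing, and counter state are illustrative assumptions), the sketch below routes each streamed key to the worker that owns its slice of the data structure, so state is updated only by its owner rather than being fetched across the machine.

#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Each "actor" owns one shard of a key -> count table; records migrate to the
// owner of their key instead of the owner's data being pulled remotely.
class ShardedCounter {
public:
    explicit ShardedCounter(std::size_t shards) : tables_(shards) {}

    // Route a record to the shard that owns its key and update state there.
    void observe(std::uint64_t key) {
        const std::size_t owner = std::hash<std::uint64_t>{}(key) % tables_.size();
        ++tables_[owner][key];   // on migrating-thread hardware, the thread would move here
    }

    std::uint64_t count(std::uint64_t key) const {
        const std::size_t owner = std::hash<std::uint64_t>{}(key) % tables_.size();
        const auto it = tables_[owner].find(key);
        return it == tables_[owner].end() ? 0 : it->second;
    }

private:
    std::vector<std::unordered_map<std::uint64_t, std::uint64_t>> tables_;
};

The design choice is owner-computes: partitioning the key space keeps each update local to one shard, which is the property a PGAS migrating-thread machine exploits in hardware.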
4-3: Case Studies & Benchmarking 2 Session (14:15-15:30) Session Co-Chairs: Chansup Byun & Xiaobai Sun
Invited Talk: The Importance of Computing Power and Algorithms Dr. Neil Thompson (MIT CSAIL) Benchmarking the Processing of Aircraft Tracks with Triples Mode and Self-Scheduling Andrew Weinert (MIT Lincoln Laboratory)*; Marc Brittain (MIT Lincoln Laboratory); Ngaire Underhill (MIT Lincoln Laboratory); Christine Serres (MIT Lincoln Laboratory) As unmanned aircraft systems (UASs) continue to integrate into the U.S. National Airspace System (NAS), there is a need to quantify the risk of airborne collisions between unmanned and manned aircraft to support regulation and standards development. Developing and certifying collision avoidance systems often relies on extensive Monte Carlo collision risk analysis simulations using probabilistic models of aircraft flight. To train these models, high performance computing resources are required. We've prototyped a high performance computing workflow, designed and deployed on the Lincoln Laboratory Supercomputing Center, to process billions of observations of aircraft. However, the prototype has various computational and storage bottlenecks that limit rapid or more comprehensive analyses and models. In response, we've developed a novel workflow that takes advantage of various job launch and task distribution technologies to improve performance. The workflow was benchmarked using two datasets of aircraft observations, including a new dataset focused on the environment around aerodromes. Optimizing how the workflow was parallelized drastically reduced the execution time from weeks to days. Solving sparse linear systems with approximate inverse preconditioners on analog devices Vasileios Kalantzis (IBM Research); Anshul Gupta (IBM Research)*; Lior Horesh (IBM Research); Tomasz Nowicki (IBM Research AI); Mark S Squillante (IBM Research); Chai Wah Wu (IBM); Tayfun Gokmen (IBM Research AI); Haim Avron (Tel Aviv University) Sparse linear system solvers are computationally expensive kernels that lie at the heart of numerous applications. This paper proposes a preconditioning framework that combines approximate inverses with stationary iterations to substantially reduce the time and energy requirements of this task by utilizing a hybrid architecture that combines conventional digital microprocessors with analog crossbar array accelerators. Our analysis and experiments with a simulator for analog hardware demonstrate that an order-of-magnitude speedup is readily attainable despite the noise in analog computations. Implications of Reduced Communication Precision in a Collocated Discontinuous Galerkin Finite Element Framework Marcin Rogowski (King Abdullah University of Science and Technology)*; Lisandro Dalcin (King Abdullah University of Science and Technology); Matteo Parsani (King Abdullah University of Science and Technology); David Keyes (KAUST) The compute capability of high-performance hardware has been growing at immense rates, increasing by over 130x in the last decade. Communication bandwidth, however, grew by only a factor of 6x over the same period, leading to a significant decrease in the byte-to-flop ratio. This trend leads us to a situation where, in many cases, computation is virtually free and the dominant cost of a parallel application comes from its communication. We expect this trend to continue and, hence, the parallel application wall-clock time to be increasingly correlated with the amount of data transferred between the nodes involved.
In order to alleviate this communication bottleneck, we test several communication-reducing schemes based on the idea of using higher precision for the interior cells and lower precision for communication. For every approach, we report the resulting network traffic and weigh it against the decreased accuracy. We perform our experiments in a collocated Discontinuous Galerkin finite element method (DG-FEM) framework applied in Computational Fluid Dynamics (CFD). First, we present a parametric study using the method of manufactured solutions on a 3D compressible Navier-Stokes supersonic cube. Using this method allows us to quantify the communication-reducing schemes' impact on the error in test cases representing a range of solution polynomial degrees and problem sizes. Finally, we verify the findings on a full-scale CFD problem, flow around a delta wing, and report on the methods' consistency as the number of processes and the number of halo elements change. Performance-Portable Sparse Tensor Decomposition Kernels on Emerging Parallel Architectures S. Isaac Geronimo Anderson (University of Oregon)*; Keita Teranishi (Sandia National Laboratories); Daniel Dunlavy (Sandia National Laboratories); Jee Choi (University of Oregon) We leverage the Kokkos library to study the performance portability of parallel sparse tensor decompositions on CPU and GPU architectures. Our results show that, with a single implementation, Kokkos can deliver performance comparable to hand-tuned code for the simple array operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems, and superior performance for the MTTKRP kernel on CPUs.
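The kind of simple array operation the Kokkos study refers to can be written once and dispatched to whichever backend Kokkos was built for; the sketch below (an element-wise scale-and-add on Kokkos Views, with illustrative names and sizes, not the authors' kernel) shows the single-source pattern.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // Views live in the default execution space's memory (host or device).
        Kokkos::View<double*> x("x", n), y("y", n);
        Kokkos::deep_copy(x, 1.0);
        Kokkos::deep_copy(y, 2.0);

        // One implementation, compiled for a CPU (e.g., OpenMP) or GPU (e.g., CUDA) backend.
        Kokkos::parallel_for("scale_add", n, KOKKOS_LAMBDA(const int i) {
            y(i) = 0.5 * x(i) + y(i);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}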
4-4: Case Studies & Benchmarking 3 Session (15:45-17:00) Session Co-Chairs: Darrell Ricke & Xiaobai Sun
Invited Talk: Redefining Disease Definition with Machine Learning Dr. John Reynders A High-Performance Heterogeneous Critical Path Analysis Framework Yasin Zamani (University of Utah)*; Tsung-Wei Huang (University of Utah) Emphasis in static timing analysis (STA) has shifted from graph-based analysis (GBA) to path-based analysis (PBA) to reduce unwanted slack pessimism. However, it is extremely time-consuming for a PBA engine to analyze a large set of critical paths. Recent years have seen many parallel PBA applications, but most of them are limited to CPU parallelism and do not scale beyond a few threads. To overcome this challenge, we propose in this paper a high-performance graphics processing unit (GPU)-accelerated PBA framework that efficiently analyzes the timing of a generated critical path set. We represent the path set in three dimensions (timing test, critical path, and pin) to structure the granularity of parallelism so that it scales to arbitrary problem sizes. In addition, we leverage task-based parallelism to decompose the PBA workload into CPU-GPU dependent tasks in which kernel computation and data processing overlap efficiently. Experimental results show that our framework, applied to an important PBA application, can speed up the state-of-the-art baseline by up to 10× on a million-gate design. Boundary Integral Solver Approaches for Particle Accelerator Simulation Problems and Deployment on NERSC Hardware M. Harper Langston (Reservoir Labs, Inc.)*; Julia Wei (Reservoir Labs, Inc.); Pierre-David Letourneau (Reservoir Labs, Inc.); Matthew J. Morse (Reservoir Labs, Inc.); Larry Weintraub (Reservoir Labs, Inc.); Aimee Nogoy (Reservoir Labs, Inc.); Noah Amsel (Reservoir Labs, Inc.); Richard Lethin (Reservoir Labs, Inc.) The MACH-B (Multipole Accelerator Codes for Hadron Beams) project is developing a Fast Multipole Method (FMM)-based tool for higher-fidelity modeling of particle accelerators for high-energy physics within the next generation of Fermilab's Synergia simulation package. MACH-B incorporates (1) highly scalable, high-performance, and generally applicable FMM-based algorithms to accurately model space-charge effects in high-intensity hadron beams and (2) boundary integral approaches that handle singular effects near the beam pipe using advanced quadratures. MACH-B will allow for more complex beam dynamics simulations that more accurately capture bunch effects and predict beam loss. Further, by introducing an abstraction layer to hide FMM implementation and parallelization complexities, MACH-B removes one of the key impediments to the adoption of FMMs by the accelerator physics community. In this work, we focus on the following results for the boundary integral solver components of the MACH-B project: (1) a study of the relative accuracies of the hedgehog boundary integral solver when evaluating potential and gradient solutions to Laplace's equation with Dirichlet boundary conditions; and (2) a study of a single-bunch, Gaussian-distributed set of charges within a conducting pipe, using an embedded boundary solver. Results show the ability to simulate charge densities inside a pipe-shaped object, running simulations on a collection of NERSC's Cori Cray XC40 Intel Xeon "Haswell" processor nodes and Reservoir's internal computational resources. This work was supported by the U.S. Department of Energy as part of the SBIR Phase I Project DE-SC0020934. This research further used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S.
Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231. Pragmatic Benchmarking for Research Computing Dennis V Milechin (Boston University)*; Ahmed Aly (Boston University); Josh Bevan (Boston University); Charlie Jahnke (Boston University); Yun Shen (Boston University); Brian Gregor (Boston University) The Research Computing Services (RCS) group at Boston University (BU) developed a benchmark suite to evaluate the performance of newer hardware under consideration for purchase for the BU Shared Computing Cluster (SCC). The custom suite of benchmarks is used to generate performance metrics that are representative of the highly diverse types of jobs run on the cluster. The results of the benchmarks are used to make informed decisions about hardware upgrades in order to provide the best balance of performance and value for cluster users. In this paper we discuss the reasons for creating a custom benchmark suite, describe the general architecture of the suite, and provide sample results from selected benchmarks that we ran on our cluster. Performance Study of GPU Applications using SYCL and CUDA on Tesla V100 GPU Goutham Kalikrishna Reddy Kuncham (Tata Consultancy Services)*; Rahul Vaidya (TCS); Mahesh Barve (TCS) The SYCL standard enables single-source programs to run on heterogeneous platforms consisting of CPUs, GPUs, and FPGAs from different hardware vendors. SYCL combines modern C++ features with OpenCL's portability. The SYCL runtime is also capable of targeting NVIDIA GPUs directly through a CUDA backend, an approach that can potentially improve the performance of SYCL on NVIDIA devices. Although NVIDIA GPUs can be targeted via the OpenCL backend, that path's features and capabilities are limited, and its performance is inadequate. In this study, we compare the performance of the NVIDIA V100 GPU using SYCL and CUDA. For performance evaluation, we selected three GPU applications: BabelStream, Mixbench, and Tiled Matrix-Multiplication. We conducted extensive tests to understand the performance in terms of DRAM bandwidth, kernel execution time, compilation time, and throughput. In our study, the performance of SYCL and CUDA was found to be similar, although in some cases CUDA outperformed SYCL.
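A minimal single-source SYCL sketch of the kind of kernel such comparisons exercise (a STREAM-style triad using SYCL 2020 unified shared memory; the array size and scalar are illustrative, and this is not the benchmark code itself) looks like the following.

#include <sycl/sycl.hpp>
#include <cstddef>

int main() {
    constexpr std::size_t n = 1 << 24;
    const float scalar = 0.4f;

    sycl::queue q;  // selects a default device (e.g., a CUDA-backed GPU when available)
    float* a = sycl::malloc_shared<float>(n, q);
    float* b = sycl::malloc_shared<float>(n, q);
    float* c = sycl::malloc_shared<float>(n, q);
    for (std::size_t i = 0; i < n; ++i) { b[i] = 1.0f; c[i] = 2.0f; }

    // STREAM triad: a = b + scalar * c, written once and run on any SYCL backend.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
        const std::size_t i = idx[0];
        a[i] = b[i] + scalar * c[i];
    }).wait();

    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}

The equivalent CUDA version requires a separately launched __global__ kernel; the single-source form above is what makes cross-vendor comparisons like the one in this paper straightforward to set up.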
4-S1: Graph Challenge Special (17:30-19:30) Organizer(s): Jeremy Kepner
4-1: Advanced Processor Architectures 1 Session (11:00-12:15) Session Co-Chairs: Mark Barnell & Manoj Kumar
Invited Talk: Cloud-Scaling and HPC-Enabled Next-Gen ASIC Verification Serge Leef (DARPA)