2021 IEEE High Performance Extreme Computing Virtual Conference 20 - 24 September 2021
Wednesday, September 22
3-1: AI / Machine Learning 1 Session (11:00-12:15) Session Co-Chairs: Siddharth Samsi & Sanmukh Rao Kuppannagari
Efficient Neighbor-Sampling-based GNN Training on CPU-FPGA Heterogeneous Platform
Bingyi Zhang (University of Southern California)*; Sanmukh Rao Kuppannagari (University of Southern California); Rajgopal Kannan (Army Research Lab-West); Viktor K Prasanna (University of Southern California)
Graph neural networks (GNNs) have become increasingly important in many real-world applications. However, training GNNs on large-scale real-world graphs is still challenging. Many sampling-based GNN training algorithms have been proposed to facilitate a mini-batch-style training process. The well-known Neighbor-Sampling-based (NS) GNN training algorithms, such as GraphSAGE, have shown great advantages in terms of accuracy, generalization, and scalability on large-scale graphs. Nevertheless, efficient hardware acceleration for such algorithms has not been systematically studied. In this paper, we perform an experimental study to understand the computational characteristics of NS GNN training. The evaluation results show that \emph{neighbor sampling} and \emph{feature aggregation} take the majority of the execution time due to irregular memory accesses and extensive memory traffic. We then propose a system design for NS GNN training that exploits a CPU-FPGA heterogeneous platform. We develop an optimized parallel neighbor sampling implementation and an efficient FPGA accelerator to enable high-throughput GNN training, and we propose neighbor sharing and task pipelining techniques to improve training throughput. We implement a prototype system on an FPGA-equipped server. The evaluation results demonstrate that our CPU-FPGA design achieves a $12-21\times$ speedup over a CPU-only platform and a $0.4-3.2\times$ speedup over a CPU-GPU platform. Moreover, our FPGA accelerator is $2.3\times$ more energy efficient than the GPU board.

Serving Machine Learning Inference Using Heterogeneous Hardware
Baolin Li (Northeastern University)*; Vijay Gadepally (MIT Lincoln Laboratory); Siddharth Samsi (MIT Lincoln Laboratory); Mark Veillette (MIT Lincoln Laboratory); Devesh Tiwari (Northeastern University)
The growing popularity of machine learning algorithms and the wide availability of hardware accelerators have brought new challenges to inference serving. This paper explores the opportunity to serve inference queries with a heterogeneous system. The system has a central optimizer that allocates heterogeneous hardware resources to cooperatively serve queries. The optimizer supports both energy minimization and throughput maximization while satisfying a latency target. The optimized heterogeneous serving system is evaluated against a homogeneous system on two representative real-world applications: radar nowcasting and object detection. Our evaluation results show that the power-optimized heterogeneous system can achieve up to 36% power savings, and the throughput-optimized heterogeneous system can increase query throughput by up to 53%.

Improved Compression for Word Embeddings by Scaling Principal Components
Joseph P McDonald (MIT Lincoln Laboratory)*; Siddharth Samsi (MIT Lincoln Laboratory); Daniel Edelman (Massachusetts Institute of Technology); Jeremy Kepner (MIT Lincoln Laboratory); Chansup Byun (MIT Lincoln Laboratory); Vijay Gadepally (MIT Lincoln Laboratory)
Word embeddings have been adopted as a fundamental component of many natural language processing applications for their ability to capture meaningful semantic relationships. However, they often present a significant computational bottleneck due to memory requirements. In this article we present a postprocessing technique for embeddings, based on modifying their principal components, that enables compression while maintaining comparable if not better performance relative to the original embedding. Specifically, our technique can reduce the overall memory footprint of popular embeddings such as GloVe and word2vec by 50% while maintaining the same performance on different metrics, including commonly used similarity and analogy tasks as well as end-to-end tasks such as text classification. Compared to the original embeddings and previous postprocessing methods, this approach improves accuracy on these tasks and leads to better semantic vector representations, particularly when using compressed versions of these vectors for memory and performance savings. While directly compressing the original vectors is possible, this approach outperforms such compression as well as other postprocessing methods across a range of compressed sizes.

Instance Segmentation of Neuronal Nuclei Leveraging Domain Adaptation
Kevin Brady (MIT Lincoln Laboratory)*; Pooya Khorrami (MIT Lincoln Laboratory); Lars Gjesteby (MIT Lincoln Laboratory); Laura Brattain (MIT Lincoln Laboratory)
The detection and localization of individual cell nuclei in dense neural scenes collected by microscopy traditionally depends on human-expert-intensive manual markup for training and evaluating automatic algorithms. These approaches are expensive, time-intensive, and require domain expertise. To develop automatic approaches, the annotated content needs to match the collection conditions (e.g., stain, cell type), and small changes to these conditions often require additional matching annotated content. Our approach leverages supervised domain adaptation with an application to the instance segmentation of nuclei in the brain. The efficacy of this approach is demonstrated experimentally by characterizing the performance of adapting models learned on content not well matched to the target domain. Quantitative results demonstrate performance improvements relative to previous related work.

Even Faster SNN Simulation with Lazy+Event-driven Plasticity and Shared Atomics
Dennis Bautembach (FORTH)*; Iason Oikonomidis (FORTH); Antonis A Argyros (CSD-UOC and ICS-FORTH)
We present two novel optimizations that accelerate clock-based spiking neural network (SNN) simulators. The first targets spike timing dependent plasticity (STDP): it combines lazy- with event-driven plasticity and efficiently facilitates the computation of pre- and post-synaptic spikes using bitfields and integer intrinsics. It offers higher bandwidth than event-driven plasticity alone and achieves a 1.5x-2x speedup over our closest competitor. The second optimization targets spike delivery. We partition our graph representation in a way that bounds the number of neurons that need to be updated at any given time, which allows us to perform said update in shared memory instead of global memory. This is 2x-2.5x faster than our closest competitor. Both optimizations represent the final evolutionary stages of years of iteration on STDP and spike delivery inside "Spice" (/spaIk/), our state-of-the-art SNN simulator. The proposed optimizations are not exclusive to our graph representation or pipeline but are applicable to a multitude of simulator designs. We evaluate our performance on three well-established models and compare ourselves against three other state-of-the-art simulators.
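The "bitfields and integer intrinsics" idea in the SNN abstract above can be illustrated with a minimal sketch. The 64-step window, function names, and pure-Python realization are assumptions for illustration only, not the authors' GPU implementation.

    # Illustrative sketch: track a neuron's recent spike history in a 64-bit bitfield
    # and count spikes in a window with a popcount, a CPU-side analogue of the integer
    # intrinsics mentioned above. All names and the window size are assumptions.
    WINDOW = 64  # one bit per simulation step

    def record_spike(history: int, spiked: bool) -> int:
        """Shift the history by one step and record the newest spike in the low bit."""
        return ((history << 1) | int(spiked)) & ((1 << WINDOW) - 1)

    def spikes_since(history: int, steps: int) -> int:
        """Count spikes over the most recent `steps` steps (popcount of a masked word)."""
        return bin(history & ((1 << steps) - 1)).count("1")

    # Example: a neuron that spiked on 3 of the last 5 steps.
    h = 0
    for s in (True, False, True, True, False):
        h = record_spike(h, s)
    print(spikes_since(h, 5))  # -> 3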
3-2: AI / Machine Learning 2 Session (12:30-13:45) Session Co-Chairs: Siddharth Samsi & Julie Mullen
Model Quantization and Synthetic Aperture Data Analyses Increasing Throughput and Energy Efficiency
Mark Barnell (Air Force Research Laboratory)*; Darrek Isereau (-); Courtney Raymond (-); Anthony Salmin (SRC); Daniel Brown (SRC, Inc.)
New model quantization techniques have been researched to inform future information/data processing approaches. This hardware evaluation and the associated data exploitation methods that were developed support systems that require high performance computing (HPC) and machine learning (ML) models where significant throughput is desired. Specifically, these applications include, but are not limited to, low-cost compute that supports the detection and classification of objects in a scene, where it is not feasible to spend valuable resources on HPC capabilities. Additional applications include data exploitation upstream, near sensors, where size, weight and power (SWAP) are constrained. This research included analyses of representative data: synthetic aperture radar (SAR) imagery from the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset. The NVIDIA Tesla, Xavier, and Titan compute architectures were used and analyzed as part of this research. These graphics processing units (GPUs) represent architectures that span a wide range of operating power (approximately 10 to a few hundred Watts). Additionally, the energy utilization per frame was determined and analyzed; e.g., the energy use of the Tesla went from 104.4 to 50.68 micro-Joules/frame when precision was reduced to 8-bit integers. The Tesla architecture also improved processing throughput from 448 frames per second (FPS) to 1085 FPS when quantized to 8-bit integers. An important part of this new research showed that the compute systems retained SAR detection/classification performance of over 97% mean average precision (mAP) on the MSTAR imagery data after quantization, thereby retaining their capability to detect and classify objects.

Non-Volatile Memory Accelerated Posterior Estimation
Andrew E Wood (Boston University)*; Moshik Hershcovitch (IBM Research); Daniel G Waddington (IBM Research); Sarel Cohen (Hasso Plattner Institute); Sang Chin (Boston University)
Bayesian inference allows machine learning models to express uncertainty. Current machine learning models use only a single learnable parameter combination when making predictions, and as a result are highly overconfident when their predictions are wrong. To use more learnable parameter combinations efficiently, these samples must be drawn from the posterior distribution. Unfortunately, computing the posterior directly is infeasible, so researchers often approximate it with a well-known distribution such as a Gaussian. In this extended abstract, we show that through the use of high-capacity persistent storage, models whose posterior distributions were previously too large to approximate become feasible, leading to improved predictions in downstream tasks.
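The core idea of the posterior-estimation abstract above, averaging predictions over many sampled parameter combinations rather than using a single point estimate, can be sketched as follows. The Gaussian posterior approximation, the toy linear model, and all names are assumptions for illustration.

    # Illustrative sketch (assumed: a Gaussian posterior over two weights and a toy
    # linear model): averaging predictions over sampled parameter combinations instead
    # of a single point estimate, which also exposes predictive uncertainty.
    import numpy as np

    rng = np.random.default_rng(0)
    mean = np.array([1.0, -0.5])    # approximate posterior mean over 2 weights
    cov = np.diag([0.05, 0.02])     # approximate posterior covariance
    x = np.array([0.8, 1.3])        # a single input example

    samples = rng.multivariate_normal(mean, cov, size=1000)  # parameter combinations
    preds = samples @ x                                       # one prediction per sample

    point_estimate = mean @ x
    print(f"point prediction: {point_estimate:.3f}")
    print(f"posterior mean prediction: {preds.mean():.3f} +/- {preds.std():.3f}")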
A Survey: Handling Irregularities in Neural Network Acceleration with FPGAs
Tong Geng (Pacific Northwest National Laboratory)*; Chunshu Wu (Boston University); Cheng Tan (Pacific Northwest National Laboratory); Chenhao Xie (Pacific Northwest National Laboratory); Anqi Guo (Boston University); Pouya Haghi (Boston University); Sarah Yuan He (Boston College); Jiajia Li (Pacific Northwest National Laboratory); Martin Herbordt (Boston University); Ang Li (Pacific Northwest National Laboratory)
In the last decade, Artificial Intelligence (AI) through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Many types of DNNs have been and continue to be developed, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs). The overall problem for all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput, while also having strict accuracy requirements. There have been many previous efforts in creating hardware to accelerate NNs. The problem designers face is that optimal NN models typically have significant irregularities, making them hardware-unfriendly. In this paper, we first define the problems in NN acceleration by characterizing common irregularities in NN processing into four types; then we summarize the existing works that handle the four types of irregularities efficiently using hardware, especially FPGAs; finally, we provide a new vision of next-generation FPGA-based NN acceleration: that the emerging heterogeneity in next-generation FPGAs is the key to achieving higher performance.

Classification frameworks comparison on 3D point clouds
Francis P Medina (Yeshiva University)*; Randy Paffenroth (Worcester Polytechnic Institute)
We present a preliminary comparison study for various classification frameworks that includes several types of feature engineering. The presentation in this paper is based on our work in [10]. We demonstrate that providing context by augmenting each point in the data with information about its neighboring points can improve the performance of downstream learning algorithms. We also experiment with several dimension reduction strategies, ranging from Principal Component Analysis (PCA) to neural network based auto-encoders, and demonstrate how they affect classification performance in LiDAR point clouds. For example, we observe that by combining feature engineering with a dimension reduction method such as PCA, there is an improvement in the accuracy of the classification with respect to doing a straightforward classification with the raw data. We test such classification strategies on 3D LiDAR point clouds. Index Terms—Classification, 3D point clouds, dimension reduction, PCA, auto-encoders, neural networks, machine learning.
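A minimal sketch of the kind of pipeline described in the point-cloud abstract above: augment each 3D point with neighborhood context, reduce dimension with PCA, and classify. The synthetic data, k=8 neighbors, and random-forest classifier are all assumptions for illustration, not the authors' configuration.

    # Illustrative sketch: neighborhood-augmented features + PCA + a classifier.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    points = rng.normal(size=(500, 3))        # toy point cloud
    labels = (points[:, 2] > 0).astype(int)   # toy per-point labels

    # Feature engineering: append mean and standard deviation of each point's neighbors.
    nbrs = NearestNeighbors(n_neighbors=8).fit(points)
    _, idx = nbrs.kneighbors(points)
    context = np.hstack([points[idx].mean(axis=1), points[idx].std(axis=1)])
    features = np.hstack([points, context])   # 3 raw + 6 context features

    # Dimension reduction followed by classification.
    reduced = PCA(n_components=4).fit_transform(features)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(reduced, labels)
    print("training accuracy:", clf.score(reduced, labels))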
3-3: AI / Machine Learning 3 Session (14:15-15:30) Session Co-Chairs: Julie Mullen & Sanmukh Rao Kuppannagari
Toward HDL Extensions for Rapid AI/ML Accelerator Generation
Ryan Kabrick (Tactical Computing Labs)*; David Donofrio (Tactical Computing Labs); John Leidel (Tactical Computing Labs)
StoneCutter, a language construct and compiler embedded in the OpenSoC System Architect family of tools, is designed to provide software architects with the ability to rapidly prototype instruction set extensions and hardware accelerators. The StoneCutter compilation flow ingests high-level syntax and outputs optimized and pipelined Chisel HDL for further compilation to platform-specific RTL. However, unlike other HDL approaches, StoneCutter is rooted in the notion that users define syntactic blocks that map directly to individual instruction definitions, as opposed to classic finite state machines. When integrated with the adjacent System Architect design flow, StoneCutter provides a familiar, C-like language construct by which to develop the implementation for individual, programmable instructions. The LLVM-based StoneCutter compiler performs individual-instruction and whole-ISA optimizations in order to generate a high-performance Chisel HDL representation of the target design. Utilizing the existing Chisel tools, users can also generate C++ cycle-accurate simulation models as well as Verilog representations of the target design. As a result, the StoneCutter language and associated tooling provide an instruction-set-centric design environment for rapid development and experimentation. This work describes initial efforts to extend the StoneCutter infrastructure to encapsulate linear algebraic constructs for direct compilation into optimized AI/ML instructions. This functionality gives users and architects the ability to use the StoneCutter high-level language constructs to develop target- and domain-specific AI/ML instructions built from optimized linear algebraic constructs compiled directly to target-specific RTL. This enables users to create highly optimized AI/ML hardware implementations with minimal effort relative to traditional hardware development flows.

Filtered Tensor Construction and Decomposition for Drug Repositioning
Dimitri Leggas (Reservoir Labs, Inc.)*; Muthu M Baskaran (Reservoir Labs); James Ezick (Reservoir Labs); Brendan von Hofe (Reservoir Labs, Inc.)
Drug repositioning (also called "drug repurposing") is a drug development strategy that saves time and money by finding new uses for existing drugs. While a variety of computational approaches to drug repositioning exist, recent work has shown that tensor decomposition, an unsupervised learning technique for finding latent structure in multidimensional data, is a useful tool for drug repositioning. The known relationships between drugs, targets, and diseases can easily be encoded as a tensor, and by learning a low-rank representation of this tensor, decompositions can complete missing entries and therefore predict novel drug-disease associations. Multiple recent works, in the context of cancer and COVID-19 drug discovery, have used joint tensor decompositions to suggest drug repositioning candidates. While these methods make high-quality predictions, they rely on specialized decompositions formulated for specific problems. In this work, we use ENSIGN, a suite of tensor decomposition tools, to show that CP tensor decompositions of a single tensor encoding drug-target-disease associations are capable of predicting verifiable drug repositioning candidates. Because the tensors generated by drug repositioning problems are sparse, we introduce a filtered tensor construction to limit the span of the tensor without losing information needed to learn the relevant associations. We show that our method predicts verifiable novel drug-disease associations in cancer and COVID-19 data. The simplicity of our approach makes it an attractive tool for biomedical researchers looking for out-of-the-box solutions, and ENSIGN brings an added level of usability and scalability.

Machine Learning Fairness is Computationally Difficult and Algorithmically Unsatisfactorily Solved
Mike H.M. Teodorescu (Boston College)*; Xinyu Yao (Rice University)
The main purpose of this paper is to analyze the computational difficulties of selecting suitable classification algorithms that satisfy specific ethical criteria when real data are used in training. Employing an imbalanced credit decision dataset widely used for credit scoring, and applying a set of algorithms and several fairness criteria, we show that many typical classification algorithms do not reasonably satisfy more than one fairness criterion when considering more than one protected attribute. This adds a layer of difficulty to those posed by the need for large databases and data- and computationally-intensive decision-making systems in domains such as credit scoring and hiring. A novel contribution of this study is directly relating ML/AI fairness criteria to computational complexity. We reframe the problem of complexity by connecting it to the search for an ethically acceptable solution instead of just an accurate solution. The results suggest the continued need for human input in fairness decisions, especially when deciding tradeoffs between fairness criteria.

Using Computation Effectively for Scalable Poisson Tensor Factorization: Comparing Methods Beyond Computational Efficiency
Jeremy M Myers (College of William & Mary, Sandia National Laboratories)*; Daniel Dunlavy (Sandia National Laboratories)
Poisson Tensor Factorization (PTF) is an important data analysis method for analyzing patterns and relationships in multiway count data. In this work, we consider several algorithms for computing a low-rank PTF of tensors with sparse count data values via maximum likelihood estimation. Such an approach reduces to solving a nonlinear, non-convex optimization problem, which can leverage considerable parallel computation due to the structure of the problem. However, since the maximum likelihood estimator corresponds to the global minimizer of this optimization problem, it is important to consider how effective methods are at both leveraging this inherent parallelism and computing a good approximation to the global minimizer. In this work we present comparisons of multiple methods for PTF that illustrate the tradeoffs between computational efficiency and accurately computing the maximum likelihood estimator. We present results using synthetic and real-world data tensors to demonstrate some of the challenges when choosing a method for a given tensor.
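For reference, the standard Poisson CP maximum-likelihood objective for a three-way count tensor (a generic textbook formulation, not necessarily the exact variant used by the authors above) is $\min_{\lambda, A, B, C} \sum_{i,j,k} \left( m_{ijk} - x_{ijk} \log m_{ijk} \right)$, where the low-rank model is $m_{ijk} = \sum_{r=1}^{R} \lambda_r a_{ir} b_{jr} c_{kr}$ with nonnegativity constraints on $\lambda$, $A$, $B$, and $C$. Because the data tensor is sparse, the $x_{ijk} \log m_{ijk}$ term need only be evaluated at the nonzero entries, which is one source of the parallelism the abstract refers to.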
3-4: Case Studies & Benchmarking 1 Session (15:45-17:00) Session Co-Chairs: Chansup Byun & Kurt Keville
StressBench: A Configurable Full System Network and I/O Benchmark Framework
Dean G Chester (University of Warwick)*; Taylor Groves (NERSC); Simon Hammond (Sandia National Laboratories); Tim Law (AWE); Steven Wright (University of York); Richard Smedley-Stevenson (AWE); Suhaib A. Fahmy (University of Warwick); Gihan Mudalige (University of Warwick); Stephen Jarvis (University of Birmingham)
We present StressBench, a network benchmarking framework written for testing MPI operations and file I/O concurrently. It is designed specifically to execute MPI communication and file access patterns that are representative of real-world scientific applications. Existing tools consider either worst-case congestion with small abstract patterns or peak performance with simplistic patterns. StressBench allows for a richer study of congestion by allowing orchestration of network load scenarios that are representative of those typically seen at HPC centres, something that is difficult to achieve with existing tools. We demonstrate the versatility of the framework from microbenchmarks through to finely controlled congested runs across a cluster. Validation of the results using four proxy application communication schemes within StressBench against parent applications shows a maximum difference of 15%. Using the I/O modeling capabilities of StressBench, we are able to quantify the impact of file I/O on application traffic, showing how it can be used in procurement and performance studies.

Performance Evaluation of Mixed-Precision Runge-Kutta Methods
Ben Burnett (University of Massachusetts Dartmouth)*; Sigal Gottlieb (UMass Dartmouth); Alfa Heryudono (UMass Dartmouth); Zachary Grant (Oak Ridge National Labs)
Additive Runge-Kutta methods designed for preserving highly accurate solutions in mixed-precision computation were proposed and analyzed in [8]. These specially designed methods use reduced precision for the implicit computations and full precision for the explicit computations. We develop a FORTRAN code to solve a nonlinear system of ordinary differential equations using the mixed-precision additive Runge-Kutta (MP-ARK) methods on IBM POWER9 and Intel x86_64 chips. The convergence, accuracy, runtime, and energy consumption of these methods are explored. We show that these MP-ARK methods efficiently produce accurate solutions with significant reductions in runtime (and, by extension, energy consumption).

A More Portable HeFFTe: Implementing a Fallback Algorithm for Scalable Fourier Transforms
Daniel Sharp (Massachusetts Institute of Technology)*; Stanimire Tomov (University of Tennessee); Miroslav K Stoyanov (Oak Ridge National Laboratory); Jack Dongarra (University of Tennessee)
The Highly Efficient Fast Fourier Transform for Exascale (heFFTe) numerical library is a C++ implementation of distributed multidimensional FFTs targeting heterogeneous and scalable systems. To date, the library has relied on users to provide at least one installation from a selection of well-known libraries for the single-node/MPI-rank one-dimensional FFT calculations that heFFTe is built on. In this paper, we describe the development of a CPU-based backend to heFFTe as a reference, or "stock", implementation. This allows the user to install and run heFFTe without any external dependencies that may include restrictive licensing or mandate specific hardware. Furthermore, this stock backend was implemented to take advantage of SIMD capabilities on modern CPUs, and includes both a custom vectorized complex data type and a run-time generated call graph for selecting which specific FFT algorithm to call. The performance of this backend greatly increases when vectorized instructions are available and, when vectorized, it provides reasonable scalability in both performance and accuracy compared to an alternative CPU-based FFT backend. In particular, we illustrate a highly performant O(N log N) code that is about 10 times faster than non-vectorized code for the complex arithmetic, and a scalability that matches heFFTe's scalability when used with vendor or other highly optimized 1D FFT backends. The same technology can be used to derive other Fourier-related transformations that may not even be available in vendor libraries, e.g., the discrete sine (DST) or cosine (DCT) transforms, as well as their extension to multiple dimensions with O(N log N) timing.

A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions
Yingbo Ma (Julia Computing); Vaibhav Dixit (Julia Computing); Mike J Innes (Julia Computing); Xingjian Guo (New York University); Christopher V Rackauckas (Massachusetts Institute of Technology)*
Derivatives of differential equation solutions are commonly used for parameter estimation, fitting neural differential equations, and as model diagnostics. However, with a litany of choices and a Cartesian product of potential methods, it can be difficult for practitioners to understand which method is likely to be the most effective on their particular application. In this manuscript we investigate the performance characteristics of Discrete Local Sensitivity Analysis implemented via Automatic Differentiation (DSAAD) against continuous adjoint sensitivity analysis. Non-stiff and stiff biological and pharmacometric models, including a PDE discretization, are used to quantify the performance of sensitivity analysis methods. Our benchmarks show that on small stiff and non-stiff systems of ODEs (approximately $<100$ parameters+ODEs), forward-mode DSAAD is more efficient than both reverse-mode and continuous forward/adjoint sensitivity analysis. The scalability of continuous adjoint methods is shown to be better than that of discrete adjoints and forward methods after crossing this size range. These comparative studies demonstrate a trade-off between memory usage and performance in the continuous adjoint methods that should be considered when choosing the technique, while numerically unstable backsolve techniques from the machine learning literature are demonstrated as unsuitable for most scientific models. The performance of adjoint methods is shown to be heavily tied to the reverse-mode AD method used for the vector-Jacobian product calculations, with tape-based AD methods shown to be two orders of magnitude slower on nonlinear partial differential equations than static AD techniques. In addition, these results demonstrate the out-of-the-box applicability of DSAAD to differential-algebraic equations, delay differential equations, and hybrid differential equation systems where the event timing and effects are dependent on model parameters, showcasing an ease-of-implementation advantage for DSAAD approaches. Together, these benchmarks provide a guide to help practitioners quickly identify the best mixture of continuous sensitivities and automatic differentiation for their applications.
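For context, the continuous forward sensitivity approach referenced in the abstract above augments an ODE $u' = f(u, p, t)$ with the sensitivity equations $\frac{d}{dt}\left(\frac{\partial u}{\partial p_j}\right) = \frac{\partial f}{\partial u}\frac{\partial u}{\partial p_j} + \frac{\partial f}{\partial p_j}$, solved jointly with the original system; this is the standard textbook formulation rather than anything specific to the paper. Forward-mode DSAAD obtains the same derivatives by differentiating the numerical solver itself (e.g., with dual numbers), which avoids forming these equations explicitly and helps explain its efficiency on the small systems reported above.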
3-S1: AI Challenges Special (17:30-19:30) Organizer(s): Vijay Gadepally
The MIT Supercloud Dataset
Siddharth Samsi (MIT Lincoln Laboratory)*; Matthew Weiss (MIT Lincoln Laboratory); Vijay Gadepally (MIT Lincoln Laboratory)
Artificial intelligence (AI) and machine learning (ML) workloads account for an increasingly large share of the compute workloads in traditional High-Performance Computing (HPC) centers and commercial cloud systems. This has led to changes in deployment approaches of HPC clusters and the commercial cloud, as well as a new focus on approaches to optimizing resource usage, allocations, and deployment of new AI frameworks, and on capabilities such as Jupyter notebooks to enable rapid prototyping and deployment. With these changes, there is a need to better understand cluster/datacenter operations with the goal of developing improved scheduling policies, identifying inefficiencies in resource utilization and energy/power consumption, predicting failures, and identifying policy violations. In this paper we introduce the MIT Supercloud Dataset, which aims to foster innovative AI/ML approaches to the analysis of large-scale HPC and datacenter/cloud operations. We provide detailed monitoring logs from the MIT Supercloud system, which include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data. This paper discusses the details of the dataset, the collection methodology, and data availability, and describes potential challenge problems being developed using this data. Datasets and future challenge announcements will be available via https://dcc.mit.edu.

Maneuver Identification Challenge
Kaira M Samuel (MIT)*; Jeremy Kepner (MIT Lincoln Laboratory)
AI algorithms that identify maneuvers from trajectory data could play an important role in improving flight safety and pilot training. AI challenges allow diverse teams to work together to solve hard problems and are an effective tool for developing AI solutions. AI challenges are also a key driver of AI computational requirements. The Maneuver Identification Challenge hosted at maneuver-id.mit.edu provides thousands of trajectories collected from pilots practicing in flight simulators, descriptions of maneuvers, and examples of these maneuvers performed by experienced pilots. Each trajectory consists of positions, velocities, and aircraft orientations normalized to a common coordinate system. Construction of the data set required significant data architecture to transform flight simulator logs into AI-ready data, which included using a supercomputer for deduplication and data conditioning. There are three proposed challenges. The first challenge is separating physically plausible (good) trajectories from infeasible (bad) trajectories. Human-labeled good and bad trajectories are provided to aid in this task. Subsequent challenges are to label trajectories with their intended maneuvers and to assess the quality of those maneuvers.
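One way to picture the first challenge above is a consistency check between recorded positions and velocities. The check below is purely a sketch under assumed data conventions (uniformly sampled arrays, a hand-picked tolerance); the challenge's actual good/bad criteria are defined by its labeled data, not by this heuristic.

    # Illustrative sketch: recorded velocities should roughly match finite differences
    # of the recorded positions. Names, shapes, and the tolerance are assumptions.
    import numpy as np

    def plausible(t: np.ndarray, pos: np.ndarray, vel: np.ndarray, tol: float = 5.0) -> bool:
        """t: (N,) seconds, pos: (N, 3) positions, vel: (N, 3) recorded velocities."""
        dt = np.diff(t)[:, None]
        fd_vel = np.diff(pos, axis=0) / dt                 # finite-difference velocity
        err = np.linalg.norm(fd_vel - vel[:-1], axis=1)    # mismatch per step
        return bool(np.all(err < tol))

    # Example: a straight-line trajectory with consistent velocity passes the check.
    t = np.linspace(0.0, 10.0, 101)
    pos = np.outer(t, [1.0, 2.0, 0.0])
    vel = np.tile([1.0, 2.0, 0.0], (101, 1))
    print(plausible(t, pos, vel))  # -> True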
Invited Talk: Robust Neural Differential Models for Navigation and Beyond Maj Andrew Bowne (USAF AI Accelerator)
Invited Talk: Objective Performance Prediction & Optimization Using Physiological and Cognitive Metrics Capt Kyle “Gouge” McAlpin (USAF AI Accelerator)
Invited Talk: Large-Scale Simulation of Mechanical Instabilities in Soft Materials Prof. Raul Radovitzky (MIT AeroAstro)
3-S2: OpenSuperComputing BoF Special (17:30-19:30) Organizer(s): Kurt Keville
Invited Talk: Introduction to the Open-Source FPGA Foundation Pierre-Emmanuel Gaillardon (University of Utah)
Invited Talk: An HPEC Retrospective Gaurav Mitra (Texas Instruments)