2021 IEEE High Performance Extreme Computing Virtual Conference 20 - 24 September 2021
1-V: Sponsor Spotlight Session (10:30-11:00) Session Chair(s): Albert Reuther Invited Talk: OneAPI Bhavesh Patel (Dell / EMC) 1-1: Multicore Software Technologies Session (11:00-12:15) Session Co-Chairs: Dan Campbell & Tze Meng Low
Node-Based Job Scheduling for Large Scale Simulations of Short Running Jobs Chansup Byun (MIT Lincoln Laboratory)*; Vijay Gadepally (MIT Lincoln Laboratory); Albert Reuther (MIT Lincoln Laboratory); Peter Michaleas (MIT Lincoln Laboratory); Jeremy Kepner (MIT Lincoln Laboratory) Diverse workloads such as interactive supercomputing, big data analysis, and large-scale AI algorithm development require a high-performance scheduler. This paper presents a novel node-based scheduling approach for large scale simulations of short running jobs on MIT SuperCloud systems that allows resources to be fully utilized by long-running batch jobs while simultaneously providing fast launch and release of large-scale short running jobs. The node-based scheduling approach has demonstrated up to 100 times faster scheduler performance than other state-of-the-art systems. Modeling Data Movement Performance on Heterogeneous Architectures Amanda J Bienz (University of New Mexico)*; Luke Olson (University of Illinois at Urbana-Champaign); William Gropp (University of Illinois at Urbana-Champaign); Shelby Lockhart (University of Illinois at Urbana-Champaign) The cost of data movement on parallel systems varies greatly with machine architecture, job partition, and nearby jobs. Performance models that accurately capture the cost of data movement provide a tool for analysis, allowing communication bottlenecks to be pinpointed. Modern heterogeneous architectures yield increased variance in data movement because there are a number of viable paths for inter-GPU communication. In this paper, we present performance models for the various paths of inter-node communication on modern heterogeneous architectures, including the trade-off between GPUDirect communication and copying to CPUs. Furthermore, we present a novel optimization for inter-node communication based on these models, utilizing all available CPU cores per node. Finally, we show associated performance improvements for MPI collective operations. Toward Performance Portable Programming for Heterogeneous Systems on a Chip: A Case Study with Qualcomm Snapdragon SoC Anthony M Cabrera (Oak Ridge National Laboratory); Seth Hitefield (Oak Ridge National Laboratory); Jungwon Kim (Oak Ridge National Laboratory); Seyong Lee (Oak Ridge National Laboratory)*; Narasinga Miniskar (Oak Ridge National Laboratory); Jeffrey Vetter (Oak Ridge National Laboratory) Future heterogeneous domain-specific systems on a chip (DSSoCs) will be extraordinarily complex in terms of processors, memory hierarchies, and interconnection networks. To manage this complexity, architects, system software designers, and application developers need programming technologies that are flexible, accurate, efficient, and productive. These technologies must be as independent of any one specific architecture as is practical, because the sheer dimensionality and scale of the complexity will not allow porting and optimizing applications for each given DSSoC. To address these issues, the authors are developing Cosmic Castle, a performance portable programming toolchain for streaming applications on heterogeneous architectures. The primary focus of Cosmic Castle is on enabling efficient and performant code generation through its smart compiler and intelligent runtime system. This paper presents a preliminary evaluation of the authors' ongoing work toward Cosmic Castle.
Specifically, this paper details the code-porting efforts and evaluates various benchmarks on the Qualcomm Snapdragon SoC using tools developed through Cosmic Castle. Efficient Scheduling of Dependent Tasks in Many-Core Real-Time System Using a Hardware Scheduler Amin Norollah (Iran University of Science and Technology)*; Zahra Kazemi (Université Grenoble Alpes); Niloufar Sayadi (Shahid Beheshti University); Hakem Beitollahi (Iran University of Science and Technology); Mahdi Fazeli (Bogazici University); David Hely (Université Grenoble Alpes) This paper proposes an efficient hardware scheduler for scheduling dependent tasks in real-time many-core systems. The main idea behind the proposed scheduler is that the operating system selects tasks that can be scheduled with the Earliest Deadline First (EDF) algorithm and groups related tasks according to their dependencies. It then transfers the group information and scheduling specifications of each task to the scheduling hardware. The operating system uses the software EDF algorithm and manages the scheduling and assignment of tasks to each processing core according to the dependencies of each task in the many-core system. Simulation results compared against previous work confirm that the proposed hardware scheduler increases the number of feasible tasks by 2.1 times and decreases the number of missed tasks by 3.43 times, while also accounting for dependencies between tasks. Runtime Dependency Inversion For Scalable Programming Areg Melik-Adamyan (Intel Corporation)*; Kshitij Doshi (Intel Labs) For decades, programs have been written with the implicit notion that the hardware on which they execute is fixed in quantity and capabilities, and that applications must adapt to that fixed quantity and those capabilities. Virtualization and cloud computing made elasticity of demand possible, and this fostered the emergence of cloud-native programming models. Developers now focus on APIs, resiliency, extensibility, service-level functionality, and so on. Applications run mainly as functions and containers on top of a largely standardized runtime, and the runtime's capabilities represent the underlying computational resources in a predetermined manner to which the application software adapts. This limits proper use of programming models: abstractions from the underlying layer leak into the upper layer, forcing applications to adapt to the underlying layer just as they once adapted to bare hardware. In this paper, we introduce the Runtime Dependency Inversion Principle, an idea in which the infrastructure-as-code paradigm also covers runtimes, thus allowing an application to compose, on demand, the runtime best matched to its execution requirements, so that an application uses the programming model that best fits its intent instead of being constrained by the runtime's supported programming model.
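Several abstracts in this session concern scheduling; in particular, the Norollah et al. abstract above describes dispatching dependency-grouped tasks with Earliest Deadline First. The following is a minimal, purely illustrative Python sketch of dependency-aware EDF selection, not the authors' hardware design; the Task fields and the greedy ready-set scan are assumptions for exposition.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    deadline: int                    # only field used for EDF ordering
    name: str = field(compare=False)
    deps: frozenset = field(default_factory=frozenset, compare=False)

def edf_schedule(tasks):
    """Dispatch tasks earliest-deadline-first, releasing a task only
    after all of its dependencies have completed."""
    done, order, pending = set(), [], list(tasks)
    while pending:
        ready = [t for t in pending if t.deps <= done]
        if not ready:
            raise ValueError("unsatisfiable or cyclic dependencies")
        nxt = min(ready)                       # earliest deadline wins
        order.append(nxt.name)
        done.add(nxt.name)
        pending = [t for t in pending if t is not nxt]
    return order

# B depends on A; C is independent and has the tightest deadline.
print(edf_schedule([Task(10, "A"), Task(5, "B", frozenset({"A"})),
                    Task(7, "C")]))            # ['C', 'A', 'B']
```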
1-2: General Purpose GPU Computing Session (12:30-13:45) Session Co-Chairs: Dan Campbell & Darrell Ricke
IRIS: A Portable Runtime System Exploiting Multiple Heterogeneous Programming Systems Jungwon Kim (Oak Ridge National Laboratory)*; Seyong Lee (Oak Ridge National Laboratory); Beau Johnston (Oak Ridge National Laboratory); Jeffrey Vetter (Oak Ridge National Laboratory) Across embedded, mobile, enterprise, and high performance computing systems, computer architectures are becoming more heterogeneous and complex. This complexity is causing a crisis in programming systems and performance portability. Several programming systems are working to address these challenges, but the increasing architectural diversity is forcing software stacks and applications to be specialized for each architecture. As we show, all of these approaches critically depend on their runtime system for discovery, execution, scheduling, and data orchestration. To address this challenge, we believe that a more agile and proactive runtime system is essential to increase performance portability and improve user productivity. To this end, we have designed and implemented IRIS: a portable runtime system exploiting multiple heterogeneous programming systems. IRIS can discover available resources, manage multiple diverse programming systems (e.g., CUDA, Hexagon, HIP, Level Zero, OpenCL, OpenMP) simultaneously in the same execution, respect data dependencies, orchestrate data movement proactively, and provide user-configurable scheduling. Our evaluation on three architectures, ranging from a Qualcomm Snapdragon to a Summit supercomputer node, shows that IRIS improves portability across a wide range of diverse heterogeneous architectures with negligible overhead. Performance Portability of an SpMV Kernel Across Scientific Computing and Data Science Applications Stephen Olivier (Sandia National Laboratories)*; Nathan Ellingwood (Sandia National Laboratories); Jonathan Berry (Sandia National Laboratories); Daniel Dunlavy (Sandia National Laboratories) Both the data science and scientific computing communities are embracing GPU acceleration for their most demanding workloads. For scientific computing applications, the massive volume of code and the diversity of hardware platforms at supercomputing centers have motivated a strong effort toward performance portability. This property of a program, denoting its ability to perform well on multiple architectures and varied datasets, is heavily dependent on the choice of parallel programming model and on which features of the programming model are used. In this paper, we evaluate performance portability in the context of a data science workload in contrast to a scientific computing workload, evaluating the same sparse matrix kernel on both. Among our implementations of the kernel in different performance-portable programming models, we find that many struggle to consistently achieve performance improvements on the GPU compared to simple one-line OpenMP parallelization on high-end multicore CPUs. We show one that does, and its performance approaches and sometimes even matches that of vendor-provided GPU math libraries. Are van Emde Boas trees viable on the GPU? Benedikt Mayr (Graz University of Technology); Alexander Weinrauch (Graz University of Technology)*; Mathias Parger (Graz University of Technology); Markus Steinberger (Graz University of Technology) Van Emde Boas trees show an asymptotic query complexity surpassing that of traditional data structures for search queries on large data sets.
However, their implementation and large constant overheads prohibit their widespread use. In this work, we ask whether van Emde Boas trees are viable on the GPU. We present a novel algorithm to construct a van Emde Boas tree utilizing the parallel compute power and memory bandwidth of modern GPUs. By analyzing the structure of a sorted data set, our method is able to build a van Emde Boas tree efficiently in parallel with little thread synchronization. We compare querying data using a van Emde Boas tree and binary search on the GPU and show that, for very large data sets, the van Emde Boas tree outperforms a binary search by up to 1.2x while similarly increasing space requirements. Overall, we confirm that van Emde Boas trees are viable for certain scenarios on the GPU.
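For readers unfamiliar with the data structure, here is a minimal, didactic CPU-side Python sketch of a van Emde Boas tree supporting insert and membership queries in O(log log u) time. It illustrates the recursive cluster/summary layout the abstract refers to; it is not the paper's parallel GPU construction, and the lazy dictionary of clusters is a simplification for exposition.

```python
class VEB:
    """Van Emde Boas tree over the universe {0, ..., u-1}."""
    def __init__(self, u):
        self.u = u
        self.min = self.max = None
        if u > 2:
            bits = (u - 1).bit_length()
            self.lo_bits = bits // 2          # low half of each key
            self.hi = 1 << (bits - self.lo_bits)
            self.clusters = {}                # lazily created subtrees
            self.summary = None

    def _split(self, x):
        return x >> self.lo_bits, x & ((1 << self.lo_bits) - 1)

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x < self.min:                      # new min; push old min down
            x, self.min = self.min, x
        if self.u > 2:
            h, l = self._split(x)
            if h not in self.clusters:
                self.clusters[h] = VEB(1 << self.lo_bits)
                if self.summary is None:
                    self.summary = VEB(self.hi)
                self.summary.insert(h)
            self.clusters[h].insert(l)
        self.max = max(self.max, x)

    def contains(self, x):
        if x == self.min or x == self.max:
            return True
        if self.u <= 2:
            return False
        h, l = self._split(x)
        return h in self.clusters and self.clusters[h].contains(l)

v = VEB(1 << 16)
for k in (3, 42, 9001):
    v.insert(k)
print(v.contains(42), v.contains(7))          # True False
```

Each query descends at most log log u levels because the key is halved in bit-width at every step, which is the asymptotic advantage the abstract weighs against binary search on the GPU.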
1-3: Quantum & Novel Computing Session (14:15-15:30) Session Co-Chairs: Patrick Dreher & Steve Reinhardt
Invited Talk: How Does Nature Compute Reality Dr. Sadas Shankar (Stanford Univ.) Optimized Quantum Circuit Generation with SPIRAL Scott A Mionis (Carnegie Mellon University)*; Franz Franchetti (CMU); Jason Larkin (Carnegie Mellon University Software Engineering Institute) Quantum computers have been at the bleeding edge of computing technology for nearly 40 years and promise to make several classically difficult problems tractable. Specifically, these systems have been theorized to violate the security of RSA encryption and possibly even the extended Church-Turing thesis. In an effort to harness the revolutionary potential of these systems, research has largely focused on the physical construction of at-scale quantum devices. However, despite many hardware and algorithm breakthroughs, the software infrastructure that compiles programs for these devices requires further development in order to produce sufficiently scalable code. In the near term, Noisy Intermediate-Scale Quantum (NISQ) devices maintain only sparse connectivity between qubits, meaning that quantum programs assuming dense connectivity must be efficiently routed onto the hardware. If done naively, this process often requires the compiler to insert an overwhelming number of data movement operations; these alone can render the program impractical, since quantum states are fragile and decohere rapidly if the critical path is too long. Existing approaches have made great strides in making this process more efficient, but they have not yet fully capitalized on relevant advancements in the classical domain because their input representations often limit the scope of program transformations that can be applied. In this work we present a novel approach to compiling more efficient quantum programs. Rather than mapping individual programs onto quantum hardware, we capture the input as an abstract mathematical transform and generate a multitude of architecture-compliant programs directly from that specification, allowing us to search over a much larger space of implementations to select the best output. Specifically, this is achieved by casting circuit generation as a sparse matrix factorization task and recursively searching over a host of divide-and-conquer decomposition rules. By generating programs from the top down, we retain high-level information about the program we are compiling and can explore algorithm-specific rewrites that are nearly impossible to recognize given only a program stream. We implement the proposed framework with SPIRAL, a code generation platform founded on the GAP computer algebra system; we ultimately demonstrate that SPIRAL is a promising supplemental tool for future quantum frameworks. Interrogating the performance of quantum annealing for the solution of steady-state subsurface flow Jessie M Henderson (Los Alamos National Laboratory)*; Daniel O'Malley (Los Alamos National Laboratory); Hari S Viswanathan (Los Alamos National Laboratory) In the 1990s, the principles of quantum mechanics were leveraged to describe how discrete combinatorial optimization problems could be solved using a method called quantum annealing. D-Wave Systems, Inc. has since implemented commercial embodiments. We focus on the use of that hardware to solve discrete hydrologic inverse problems, which can become computationally intense at even moderate sizes. In particular, we interrogate whether contemporary quantum annealers with more qubits, lower noise, and reverse annealing features can improve performance.
We find that modest improvements in speed and accuracy are possible; however, those benefits are currently constrained to a regime of very rapid, approximate solutions on small meshes. As quantum annealers continue to improve, we expect further encroachment on the performance of classical numerical solvers.
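Quantum annealers of the kind discussed above minimize a quadratic unconstrained binary optimization (QUBO) objective. The Python sketch below is a hedged illustration only, not the authors' hydrologic formulation; the toy coefficients are invented, and the exhaustive enumeration is exactly the small-problem classical baseline an annealer must ultimately outperform.

```python
import itertools

def qubo_energy(x, Q):
    """E(x) = sum over (i, j) of Q[i, j] * x_i * x_j for binary x."""
    return sum(c * x[i] * x[j] for (i, j), c in Q.items())

def brute_force_qubo(Q, n):
    """Exact minimum by enumerating all 2^n assignments; feasible only
    for small n, the regime where annealers face stiff competition."""
    return min(itertools.product((0, 1), repeat=n),
               key=lambda x: qubo_energy(x, Q))

# Toy 3-variable QUBO: linear terms on the diagonal plus a
# pairwise penalty that discourages selecting x0 together with x1.
Q = {(0, 0): -1.0, (1, 1): -0.5, (2, 2): -1.5, (0, 1): 2.0}
best = brute_force_qubo(Q, 3)
print(best, qubo_energy(best, Q))              # (1, 0, 1) -2.5
```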
1-4: High Performance Data Analysis Session (15:45-17:00) Session Co-Chairs: Franz Franchetti & Muthu Baskaran
An interface for multidimensional arrays in Arkouda Mitesh Kothari (Verizon Media); Richard Vuduc (Georgia Institute of Technology)* This paper describes an interface and prototype implementation that would extend Arkouda to support multidimensional arrays. It is implemented as a prototype layer that sits independently on top of Arkouda. The technical focus of the prototype is on the representation and manipulation of multidimensional index sets, and the current implementation relies on a naïve linearization of multi-index representations into 1-D indices, which are then easy to map into Arkouda's existing infrastructure. The aim of the prototype is to serve as a starting point for an end-user and developer community discussion about what a multidimensional array interface ought to look like for Arkouda. Knowledge-guided Tensor Decomposition for Baselining and Anomaly Detection Dimitri Leggas (Reservoir Labs, Inc.); Christopher Coley (US Air Force Academy)*; Teresa M Ranadive (Laboratory for Physical Sciences) We introduce a flexible knowledge-guided penalty for incorporating known or expected patterns of activity into tensor decomposition. Our modified tensor decomposition both enables efficient identification of semantically related patterns across data sets and provides a means for identifying anomalous patterns. Specifically, we modify the loss function for a CP tensor decomposition such that a subset of the columns of the factor matrices are guided to align with the provided knowledge. We validate the effectiveness of the method for separating baseline patterns from anomalous patterns in the context of cyber network traffic logs. After performing daily decompositions of tensors formed from network logs, we derive a set of expected components describing baseline behavior. We then decompose a new tensor, created from network logs and containing a known anomaly, providing the baseline as knowledge guidance. Notably, the anomalous behavior appears among the unguided components, resulting in a drastic reduction in the search space for anomalies. Additionally, we show that our knowledge-guided decomposition is robust to incorrect knowledge in that any enforced components that are not found in the original data have low weight. Our method, implemented in the tensor software package ENSIGN, is an efficient routine that reduces the post-processing needed for an anomaly detection workflow. Privateer: Multi-versioned Memory-mapped Data Stores for High-Performance Data Science Karim Youssef (Virginia Tech)*; Keita Iwabuchi (Lawrence Livermore National Laboratory); Wu-chun Feng (Virginia Tech); Roger Pearce (Lawrence Livermore National Laboratory) The exponential growth in dataset sizes necessitates the use of high-performance computing (HPC) for large-scale data science. Furthermore, the sizes of these datasets shift the performance bottleneck from the compute subsystem toward the memory and I/O subsystems. To address this shift, modern HPC clusters are equipped with low-latency, high-bandwidth storage devices, such as non-volatile memory, and, in turn, redesigned I/O subsystems to improve performance. However, an overlooked bottleneck arises because the size of these storage devices is dwarfed by the storage footprint of large-scale data science applications. For instance, applications that process and store consistent snapshots of incrementally growing data streams require a storage footprint that far outstrips the size of the storage devices.
To address this bottleneck, we present Privateer, a general-purpose data store that optimizes the tradeoff between storage space utilization and I/O performance. Privateer uses memory-mapped I/O with private mapping and an optimized writeback mechanism to maximize write parallelism and eliminate redundant writes; it also uses content-addressable storage to optimize storage space via de-duplication. We evaluate the effectiveness of Privateer by using it as the data-store management layer of Metall, a persistent C++ data structure allocator. Using a microbenchmark that incrementally constructs and stores snapshots of an incremental graph data structure, Privateer reduces the storage space used by approximately 30% while delivering performance comparable to the baseline Metall implementation. An All-at-Once CP Decomposition Method for Count Tensors Teresa M Ranadive (Laboratory for Physical Sciences)*; Muthu M Baskaran (Reservoir Labs) CANDECOMP/PARAFAC (CP) tensor decomposition is a popular method for detecting latent behaviors in real-world data sets. As data sets grow larger and more elaborate, more sophisticated CP decomposition algorithms are required to enable these discoveries. Data sets from many applications can be represented as count tensors. To decompose count tensors, one should minimize the sum of the generalized Kullback-Leibler divergences from each tensor entry to each corresponding decomposition entry. Most often, this is done using the algorithm CP-APR (CP-Alternating Poisson Regression). Given that all-at-once optimization algorithms for related CP decomposition problems often achieve better decomposition accuracy than alternating algorithms like CP-APR, we develop CP-POPT-GDGN, an all-at-once algorithm for count tensor decomposition that utilizes a generalized damped Gauss-Newton method. We then implement a highly efficient version of CP-POPT-GDGN in the tensor package ENSIGN. After decomposing several tensors formed from real-world data sets, many of which are related to network traffic, we see that CP-POPT-GDGN typically outperforms CP-APR, both in terms of decomposition accuracy and latent behavior detection. Embedded Compute Matrix Processing and FFTs using Floating Point FPGAs Michael A Parker (Raytheon)* The techniques in this paper are applicable to next-generation antenna array processing, in such applications as 5G wireless and radar. Advances in data converter technology are allowing direct RF sampling at very high rates and are allowing large arrays to be fully digitized, removing the need for analog combining and data compression methods. This does require much higher levels of digital signal processing to support the aggregate data rates from these large digitized arrays, which can be several orders of magnitude higher than in existing systems. In addition, due to the high data rates from such arrays, it is very advantageous to implement the digital signal processing as close to the array as possible, in devices well suited for embedded computing such as the FPGA. Two representative algorithms often used in array processing are described: the QR decomposition and the FFT, both implemented in single-precision floating point to support the dynamic range requirements of array processing.
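To make the QR discussion above concrete, here is a hedged NumPy sketch of QR decomposition via Givens rotations in single precision, the rotation-based formulation commonly mapped onto systolic FPGA arrays. It is a software illustration under assumed matrix sizes, not the paper's FPGA implementation.

```python
import numpy as np

def givens_qr(A):
    """QR decomposition by Givens rotations in float32. Each rotation
    zeroes one subdiagonal entry of R; Q accumulates the transposed
    rotations so that Q @ R == A (up to float32 rounding)."""
    R = A.astype(np.float32).copy()
    m, n = R.shape
    Q = np.eye(m, dtype=np.float32)
    for j in range(n):
        for i in range(m - 1, j, -1):         # eliminate column j bottom-up
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]], dtype=np.float32)
            R[i - 1:i + 1, :] = G @ R[i - 1:i + 1, :]   # zeroes R[i, j]
            Q[:, i - 1:i + 1] = Q[:, i - 1:i + 1] @ G.T
    return Q, R

A = np.random.rand(6, 4).astype(np.float32)
Q, R = givens_qr(A)
print(np.allclose(Q @ R, A, atol=1e-4))       # True within float32 tolerance
```

Givens rotations suit FPGAs because each rotation touches only two rows and uses a fixed multiply-add pattern, which pipelines naturally; the single-precision arithmetic here mirrors the dynamic range argument made in the abstract.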
1-S1: Scaling HPC Education Special (17:30-19:30) Organizer(s): Julie Mullen & Lauren Milechin Theme 1: Training a Broader HPC Audience (Panel) Theme 2: Evaluating Workshops and Informal Learning (Talks) 1-S2: BRAIDS Special (17:30-19:30) Organizer(s): Courtland VanDam Invited Talk: Pushing the Cyber Horizon Past Protected Enclaves Dr. Sandeep Pisharody (MIT Lincoln Laboratory) Invited Talk: Applying AI to Improve Cyber Situational Awareness Dr. Steven McElwee (PJM Interconnection) Invited Talk: Entity Resolution for the Cyber Domain Trang Nguyen (MIT Lincoln Laboratory) Invited Talk: A First Step Toward Relative Network Baselining Raul Harnasch (MIT Lincoln Laboratory)