2021
IEEE High Performance Extreme Computing
Virtual Conference
20 - 24 September 2021
1-V: Sponsor Spotlight Session (10:30-11:00)
Session Chair(s): Albert Reuther
Invited Talk: OneAPI
Bhavesh Patel (Dell / EMC)
1-1: Multicore Software Technologies Session (11:00-12:15)
Session Co-Chairs: Dan Campbell & Tze Meng Low
Node-Based Job Scheduling for Large Scale Simulations of Short Running Jobs
Chansup Byun (MIT Lincoln Laboratory)*; Vijay Gadepally (MIT Lincoln Laboratory); Albert Reuther (MIT Lincoln Laboratory); Peter
Michaleas (MIT Lincoln Laboratory); Jeremy Kepner (MIT Lincoln Laboratory)
Diverse workloads such as interactive supercomputing, big data analysis, and large-scale AI algorithm development require a
high-performance scheduler. This paper presents a novel node-based scheduling approach for large-scale simulations of short-
running jobs on MIT SuperCloud systems that allows the resources to be fully utilized for long-running batch jobs while
simultaneously providing fast launch and release of large-scale short-running jobs. The node-based scheduling approach has
demonstrated up to 100 times faster scheduler performance than other state-of-the-art systems.
Modeling Data Movement Performance on Heterogeneous Architectures
Amanda J Bienz (University of New Mexico)*; Luke Olson (University of Illinois at Urbana-Champaign); William Gropp (University of
Illinois at Urbana-Champaign); Shelby Lockhart (University of Illinois at Urbana-Champaign)
The cost of data movement on parallel systems varies greatly with machine architecture, job partition, and nearby jobs.
Performance models that accurately capture the cost of data movement provide a tool for analysis, allowing for communication
bottlenecks to be pinpointed. Modern heterogeneous architectures yield increased variance in data movement as there are a
number of viable paths for inter-GPU communication. In this paper, we present performance models for the various paths of inter-
node communication on modern heterogeneous architectures, including the trade-off between GPUDirect communication and
copying to CPUs. Furthermore, we present a novel optimization for inter-node communication based on these models, utilizing all
available CPU cores per node. Finally, we show associated performance improvements for MPI collective operations.
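The trade-off the abstract describes can be illustrated with the classic latency-bandwidth ("alpha-beta") performance model, in which moving n bytes costs T(n) = alpha + beta*n. The sketch below is illustrative only; the path names and parameter values are invented, not the paper's measured models.

```python
# Minimal latency-bandwidth ("alpha-beta") performance model of data
# movement: moving n bytes over a path costs T(n) = alpha + beta * n,
# where alpha is per-message latency and beta is per-byte transfer time.
# Path names and constants below are invented for illustration; they are
# not the paper's measured models.

def transfer_time(n_bytes, alpha, beta):
    """Modeled time (seconds) to move n_bytes over one path."""
    return alpha + beta * n_bytes

def cheapest_path(n_bytes, paths):
    """Pick the (name, alpha, beta) path with the lowest modeled cost."""
    return min(paths, key=lambda p: transfer_time(n_bytes, p[1], p[2]))[0]

# Two hypothetical inter-node paths: direct GPU-to-NIC transfer with a
# higher setup cost vs. staging the message through the CPU.
paths = [
    ("gpudirect", 10e-6, 1.0e-10),    # alpha = 10 us, beta ~ 10 GB/s
    ("cpu_staged", 2e-6, 4.0e-10),    # alpha = 2 us,  beta ~ 2.5 GB/s
]

small = cheapest_path(1_000, paths)          # latency-dominated message
large = cheapest_path(100_000_000, paths)    # bandwidth-dominated message
```

Under these invented parameters, small messages favor the low-latency CPU-staged path while large messages favor the high-bandwidth direct path, which is exactly the crossover such models are used to locate.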
Toward Performance Portable Programming for Heterogeneous Systems on a Chip: A Case Study with Qualcomm
Snapdragon SoC
Anthony M Cabrera (Oak Ridge National Laboratory); Seth Hitefield (Oak Ridge National Laboratory); Jungwon Kim (Oak Ridge
National Laboratory); Seyong Lee (Oak Ridge National Laboratory)*; Narasinga Miniskar (Oak Ridge National Laboratory); Jeffrey
Vetter (Oak Ridge National Laboratory)
Future heterogeneous domain-specific systems on a chip (DSSoCs) will be extraordinarily complex in terms of processors, memory
hierarchies, and interconnection networks. To manage this complexity, architects, system software designers, and application
developers need programming technologies that are flexible, accurate, efficient, and productive. These technologies must be as
independent of any one specific architecture as is practical because the sheer dimensionality and scale of the complexity will not
allow porting and optimizing applications for each given DSSoC. To address these issues, the authors are developing Cosmic
Castle, a performance portable programming toolchain for streaming applications on heterogeneous architectures. The primary
focus of Cosmic Castle is on enabling efficient and performant code generation through the smart compiler and intelligent runtime
system. This paper presents the preliminary evaluation of the authors’ ongoing work toward Cosmic Castle. Specifically, this paper
details the code-porting efforts and evaluates various benchmarks on the Qualcomm Snapdragon SoC using tools developed
through Cosmic Castle.
Efficient Scheduling of Dependent Tasks in Many-Core Real-Time System Using a Hardware Scheduler
Amin Norollah (Iran University of Science and Technology)*; Zahra Kazemi (Université Grenoble Alpes); Niloufar Sayadi (Shahid
Beheshti University); Hakem Beitollahi (Iran University of Science and Technology); Mahdi Fazeli (Bogazici University); David Hely
(Université Grenoble Alpes)
This paper proposes an efficient hardware scheduler for scheduling dependent tasks in real-time many-core systems. The main
idea behind the proposed scheduler is that the operating system selects tasks that can be scheduled with the Earliest Deadline
First (EDF) algorithm and groups related tasks according to their dependencies. It then transfers the group information and
scheduling specifications of each task to the scheduling hardware. The operating system uses the software EDF algorithm and
manages the scheduling and assignment of tasks to each processing core according to the dependencies of each task in the
many-core system. The simulation results, compared against previous work, confirm that the proposed hardware scheduler
increases the number of feasible tasks by 2.1 times and decreases missed tasks by 3.43 times, while also accounting for
dependencies between tasks.
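For readers unfamiliar with EDF, a minimal software sketch of dependency-aware Earliest-Deadline-First dispatch (the policy the proposed hardware accelerates) might look like the following; task names, deadlines, and dependencies are invented for illustration.

```python
import heapq

# Sketch of Earliest-Deadline-First (EDF) dispatch over dependent tasks:
# a task becomes ready once all of its prerequisites have run, and among
# ready tasks the one with the earliest deadline runs next. This is a
# software sketch of the policy, not the paper's hardware design.

def edf_order(tasks, deps):
    """tasks: {name: deadline}; deps: {name: set of prerequisites}.
    Returns the order in which tasks are dispatched."""
    waiting_on = {t: set(deps.get(t, ())) for t in tasks}
    ready = [(d, t) for t, d in tasks.items() if not waiting_on[t]]
    heapq.heapify(ready)
    order = []
    while ready:
        _, t = heapq.heappop(ready)          # earliest deadline first
        order.append(t)
        for u, pre in waiting_on.items():    # release dependents of t
            if t in pre:
                pre.remove(t)
                if not pre:
                    heapq.heappush(ready, (tasks[u], u))
    return order

tasks = {"a": 5, "b": 2, "c": 9, "d": 3}     # name -> deadline
deps = {"c": {"a"}, "d": {"b"}}              # c after a, d after b
```

With these toy inputs the dispatch order is b, d, a, c: "b" has the earliest deadline, releasing "d", which still beats "a".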
Runtime Dependency Inversion For Scalable Programming
Areg Melik-Adamyan (Intel Corporation)*; Kshitij Doshi (Intel Labs)
For decades, programs have been written with the implicit notion that the hardware on which they are to be executed is fixed in
quantity and capabilities, and that programs must adapt to that fixed hardware. Virtualization and cloud computing made elasticity
of demand possible, and this fostered the emergence of cloud-native programming models. Developers now focus on APIs,
resiliency, extensibility, service-level functionality, and so on. Applications run mainly as functions and containers on top of a
largely standardized runtime, and the runtime capabilities represent the underlying computational resources in a predetermined
manner, to which the application software adapts. This limits proper use of programming models, as abstractions from the
underlying layer leak into the upper layer and force applications to adapt to the underlying layer in the same way as with bare
hardware. In this paper, we introduce the Runtime Dependency Inversion Principle, an idea in which the infrastructure-as-code
paradigm also covers runtimes, thus allowing an application to compose, on demand, the runtime best matched to its execution
requirements, so that an application uses the programming model that best fits its intent instead of being constrained by the
runtime's supported programming model.
1-2: General Purpose GPU Computing Session (12:30-13:45)
Session Co-Chairs: Dan Campbell & Darrell Ricke
IRIS: A Portable Runtime System Exploiting Multiple Heterogeneous Programming Systems
Jungwon Kim (Oak Ridge National Laboratory)*; Seyong Lee (Oak Ridge National Laboratory); Beau Johnston (Oak Ridge
National Laboratory); Jeffrey Vetter (Oak Ridge National Laboratory)
Across embedded, mobile, enterprise, and high performance computing systems, computer architectures are becoming more
heterogeneous and complex. This complexity is causing a crisis in programming systems and performance portability. Several
programming systems are working to address these challenges, but the increasing architectural diversity is forcing software
stacks and applications to be specialized for each architecture. As we show, all of these approaches critically depend on their
runtime system for discovery, execution, scheduling, and data orchestration. To address this challenge, we believe that a more
agile and proactive runtime system is essential to increase performance portability and improve user productivity. To this end,
we have designed and implemented IRIS: a portable runtime system exploiting multiple heterogeneous programming
systems. IRIS can discover available resources, manage multiple diverse programming systems (e.g., CUDA, Hexagon, HIP,
Level Zero, OpenCL, OpenMP) simultaneously in the same execution, respect data dependencies, orchestrate data
movement proactively, and provide for user-configurable scheduling. Our evaluation on three architectures, ranging from
Qualcomm Snapdragon to a Summit supercomputer node, shows that IRIS improves portability across a wide range of
diverse heterogeneous architectures with negligible overhead.
Performance Portability of an SpMV Kernel Across Scientific Computing and Data Science Applications
Stephen Olivier (Sandia National Laboratories)*; Nathan Ellingwood (Sandia National Laboratories); Jonathan Berry (Sandia
National Laboratories); Daniel Dunlavy (Sandia National Laboratories)
Both the data science and scientific computing communities are embracing GPU acceleration for their most demanding
workloads. For scientific computing applications, the massive volume of code and the diversity of hardware platforms at
supercomputing centers have motivated a strong effort toward performance portability. This property of a program, denoting its
ability to perform well on multiple architectures and varied datasets, is heavily dependent on the choice of parallel
programming model and which features of the programming model are used. In this paper, we evaluate performance
portability in the context of a data science workload in contrast to a scientific computing workload, evaluating the same sparse
matrix kernel on both. Among our implementations of the kernel in different performance-portable programming models, we
find that many struggle to consistently achieve performance improvements using the GPU compared to simple one-line
OpenMP parallelization on high-end multicore CPUs. We show one that does, and its performance approaches and
sometimes even matches that of vendor-provided GPU math libraries.
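As context for the kernel under study, a minimal CSR sparse matrix-vector multiply can be sketched as follows; in the C/OpenMP setting the abstract refers to, the outer row loop is the one that receives the single parallelization pragma. This is a generic textbook sketch, not the paper's implementation.

```python
# Minimal CSR (compressed sparse row) sparse matrix-vector multiply,
# y = A @ x, the kernel class the abstract studies. Rows are independent,
# which is why a single "parallel for" over rows suffices in OpenMP.
# Generic textbook sketch, not the paper's implementation.

def spmv_csr(row_ptr, col_idx, vals, x):
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):                       # rows are independent
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y[i] = s
    return y

# The 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0])  # [3.0, 3.0, 9.0]
```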
Are van Emde Boas trees viable on the GPU?
Benedikt Mayr (Graz University of Technology); Alexander Weinrauch (Graz University of Technology)*; Mathias Parger (Graz
University of Technology); Markus Steinberger (Graz University of Technology)
Van Emde Boas trees show an asymptotic query complexity surpassing the performance of traditional data structures for
performing search queries in large data sets. However, their implementation complexity and large constant overheads prohibit
their widespread use. In this work, we ask whether van Emde Boas trees are viable on the GPU. We present a novel algorithm to
construct a van Emde Boas tree utilizing the parallel compute power and memory bandwidth of modern GPUs. By analyzing
the structure of a sorted data set, our method is able to build a van Emde Boas tree efficiently in parallel with little thread
synchronization. We compare querying data using a van Emde Boas tree and binary search on the GPU and show that for
very large data sets, the van Emde Boas tree outperforms a binary search by up to 1.2x while similarly increasing space
requirements. Overall, we confirm that van Emde Boas trees are viable for certain scenarios on the GPU.
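For reference, a textbook sequential van Emde Boas tree supporting insert and successor in O(log log u) can be sketched as follows; this illustrates the recursive sqrt-universe structure the paper builds in parallel, not the authors' GPU algorithm.

```python
# Textbook (sequential, CPU-side) van Emde Boas tree over a universe
# {0..u-1} with u a power of two, supporting insert and successor in
# O(log log u). This shows the recursive sqrt-universe structure the
# paper parallelizes; it is not the authors' GPU construction.

class VEB:
    def __init__(self, u):
        self.u = u
        self.min = self.max = None
        if u > 2:
            k = (u - 1).bit_length()              # u = 2**k
            self.lo = 1 << (k // 2)               # cluster size
            hi = 1 << (k - k // 2)                # number of clusters
            self.summary = VEB(hi)
            self.cluster = [VEB(self.lo) for _ in range(hi)]

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x < self.min:                          # min is stored lazily
            self.min, x = x, self.min
        if self.u > 2:
            h, l = divmod(x, self.lo)
            if self.cluster[h].min is None:
                self.summary.insert(h)            # cluster h now nonempty
            self.cluster[h].insert(l)
        if x > self.max:
            self.max = x

    def successor(self, x):
        if self.u == 2:
            return 1 if x == 0 and self.max == 1 else None
        if self.min is not None and x < self.min:
            return self.min
        h, l = divmod(x, self.lo)
        if self.cluster[h].max is not None and l < self.cluster[h].max:
            return h * self.lo + self.cluster[h].successor(l)
        nh = self.summary.successor(h)            # next nonempty cluster
        if nh is None:
            return None
        return nh * self.lo + self.cluster[nh].min

veb = VEB(16)
for v in (2, 3, 7, 14):
    veb.insert(v)
```

Each successor call recurses into either one cluster or the summary, never both, which is where the doubly logarithmic bound comes from.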
1-3: Quantum & Novel Computing Session (14:15-15:30)
Session Co-Chairs: Patrick Dreher & Steve Reinhardt
Invited Talk: How Does Nature Compute Reality
Dr. Sadas Shankar (Stanford University)
Optimized Quantum Circuit Generation with SPIRAL
Scott A Mionis (Carnegie Mellon University)*; Franz Franchetti (Carnegie Mellon University); Jason Larkin (Carnegie Mellon
University Software Engineering Institute)
Quantum computers have been at the bleeding edge of computing technology for nearly 40 years and promise to make
several classically-difficult problems tractable. Specifically, these systems have been theorized to violate the security of RSA
encryption and possibly even the extended Church-Turing thesis. In an effort to harness the revolutionary potential of these
systems, research has largely focused on the physical construction of at-scale quantum devices. However, despite many
hardware and algorithm breakthroughs, the software infrastructure that compiles programs for these devices requires further
development in order to produce sufficiently scalable code. In the near term, Noisy Intermediate-Scale Quantum (NISQ)
devices maintain only sparse connectivity between qubits, meaning that quantum programs assuming dense connectivity must
be efficiently routed onto the hardware. If done naively, this process often requires the compiler to insert an overwhelming
number of data movement operations; these alone can render the program impractical, since quantum states are
fragile and decohere rapidly if the critical path is too long. Existing approaches have made great strides in making this process
more efficient, but have not yet fully capitalized on relevant advancements in the classical domain due to input representations
that often limit the scope of program transformations that can be applied. In this work we present a novel approach to
compiling more efficient quantum programs. Rather than mapping individual programs onto quantum hardware, we capture the
input as an abstract mathematical transform and generate a multitude of architecture-compliant programs directly from that
specification, allowing us to search over a much larger space of implementations to select the best output. Specifically, this is
achieved by casting circuit generation as a sparse matrix factorization task and recursively searching over a host of divide-
and-conquer decomposition rules. By generating programs from the top down, we retain high-level information about the
program we are compiling and can explore algorithm-specific rewrites that are nearly impossible to recognize given only a
program stream. We implement the proposed framework with SPIRAL, a code generation platform founded on the GAP
computer algebra system; we ultimately demonstrate that SPIRAL is a promising supplemental tool for future quantum
frameworks.
Interrogating the performance of quantum annealing for the solution of steady-state subsurface flow
Jessie M Henderson (Los Alamos National Laboratory)*; Daniel O'Malley (Los Alamos National Laboratory); Hari S
Viswanathan (Los Alamos National Laboratory)
In the 1990s, the principles of quantum mechanics were leveraged to describe how discrete combinatorial optimization
problems could be solved using a method called quantum annealing. D-Wave Systems, Inc. has since implemented
commercial embodiments. We focus on the use of that hardware to solve discrete hydrologic inverse problems, which can
become computationally intense at even moderate sizes. In particular, we interrogate whether contemporary quantum
annealers with more qubits, lower noise, and reverse annealing features can improve performance. We find modest
improvements in speed and accuracy are possible; however, those benefits are currently constrained to a regime of very rapid,
approximate solutions on small meshes. As quantum annealers continue to improve, we expect further encroachment on the
performance of classical numerical solvers.
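Quantum annealers of the kind discussed here minimize quadratic unconstrained binary optimization (QUBO) objectives over binary variables. A toy brute-force illustration of the problem form follows; the coefficients are invented and unrelated to the paper's hydrologic inverse problems.

```python
from itertools import product

# Quantum annealers minimize QUBO (quadratic unconstrained binary
# optimization) objectives E(x) = sum over (i, j) of Q[i, j] * x_i * x_j
# for binary vectors x. The coefficients below are invented for
# illustration only.

Q = {(0, 0): -1.0, (1, 1): -1.0, (0, 1): 2.0}

def qubo_energy(x, Q):
    return sum(v * x[i] * x[j] for (i, j), v in Q.items())

# Exhaustive search stands in for the annealer on this 2-variable toy:
# the quadratic term penalizes setting both bits, so the minimum sets
# exactly one of them.
best = min(product((0, 1), repeat=2), key=lambda x: qubo_energy(x, Q))
```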
1-4: High Performance Data Analysis Session (15:45-17:00)
Session Co-Chairs: Franz Franchetti & Muthu Baskaran
An interface for multidimensional arrays in Arkouda
Mitesh Kothari (Verizon Media); Richard Vuduc (Georgia Institute of Technology)*
This paper describes an interface and prototype implementation that would extend Arkouda to support multidimensional
arrays. It is implemented as a prototype layer that sits independently and on top of Arkouda. The technical focus of the
prototype is on the representation and manipulation of multidimensional index sets, and the current implementation relies on
a naïve linearization of multi-index representations into 1-D indices, which are then easy to map into Arkouda's existing
infrastructure. The aim of the prototype is to serve as a starting point for an end-user and developer community discussion
about what a multidimensional array interface ought to look like for Arkouda.
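The row-major linearization such a prototype relies on can be sketched as follows; the function names are illustrative, not Arkouda's actual API.

```python
# Row-major linearization of a multi-index into a flat 1-D index: the
# kind of mapping that lets an n-D interface reuse existing 1-D
# infrastructure. Function names are illustrative, not Arkouda's API.

def strides(shape):
    """Row-major strides: the last axis varies fastest."""
    s, out = 1, []
    for dim in reversed(shape):
        out.append(s)
        s *= dim
    return tuple(reversed(out))

def linearize(index, shape):
    """Map a multi-index (i, j, ...) to its flat 1-D offset."""
    return sum(i * s for i, s in zip(index, strides(shape)))

flat = linearize((1, 2, 3), (2, 3, 4))   # 1*12 + 2*4 + 3*1 = 23
```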
Knowledge-guided Tensor Decomposition for Baselining and Anomaly Detection
Dimitri Leggas (Reservoir Labs, Inc.); Christopher Coley (US Air Force Academy)*; Teresa M Ranadive (Laboratory for
Physical Sciences)
We introduce a flexible knowledge-guided penalty for incorporating known or expected patterns of activity into tensor
decomposition. Our modified tensor decomposition both enables efficient identification of semantically-related patterns across
data sets and provides a means for identifying anomalous patterns. Specifically, we modify the loss function for a CP tensor
decomposition such that a subset of the columns of the factor matrices are guided to align with the provided knowledge. We
validate the effectiveness of the method for separating baseline patterns from anomalous patterns in the context of cyber
network traffic logs. After performing daily decompositions of tensors formed from network logs, we derive a set of expected
components describing baseline behavior. We then decompose a new tensor, created from network logs and containing a
known anomaly, providing the baseline as knowledge guidance. Notably, the anomalous behavior appears among the
unguided components, resulting in a drastic reduction in the search space for anomalies. Additionally, we show that our
knowledge-guided decomposition is robust to incorrect knowledge in that any enforced components that are not found in the
original data have low weight. Our method, implemented in the tensor software package ENSIGN, is an efficient routine that
reduces the post-processing needed for an anomaly detection workflow.
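The shape of such a knowledge-guided objective can be sketched abstractly as a standard fit term plus a term pulling the guided factor columns toward the provided knowledge; the matrices, squared Frobenius norm, and weight below are illustrative, not ENSIGN's implementation.

```python
# Sketch of a knowledge-guided penalty: the usual decomposition fit term
# plus a term pulling the "guided" factor columns toward provided
# knowledge columns. Shapes and the weight lam are illustrative; this is
# not ENSIGN's implementation.

def fro2(A, B):
    """Squared Frobenius distance between two equally shaped matrices."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def guided_loss(X, M, guided_cols, knowledge, lam):
    """||X - M||_F^2 + lam * ||guided_cols - knowledge||_F^2"""
    return fro2(X, M) + lam * fro2(guided_cols, knowledge)

X = [[1.0, 0.0], [0.0, 1.0]]          # toy stand-in for a (matricized) tensor
fit_only = guided_loss(X, X, [[1.0], [0.0]], [[1.0], [0.0]], 0.5)  # 0.0
mismatch = guided_loss(X, X, [[1.0], [0.0]], [[0.0], [0.0]], 0.5)  # 0.5
```

The penalty is zero when the guided columns already match the knowledge and grows with the mismatch, which is how the decomposition is steered toward baseline patterns without forbidding unguided components.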
Privateer: Multi-versioned Memory-mapped Data Stores for High-Performance Data Science
Karim Youssef (Virginia Tech)*; Keita Iwabuchi (Lawrence Livermore National Laboratory); Wu-chun Feng (Virginia Tech);
Roger Pearce (Lawrence Livermore National Laboratory)
The exponential growth in dataset sizes necessitates the use of high-performance computing (HPC) for large-scale data
science. Furthermore, the sizes of these datasets shift the performance bottleneck from the compute subsystem towards the
memory and I/O subsystems. To address this shift, modern HPC clusters are equipped with low-latency and high-bandwidth
storage devices, such as non-volatile memory, and, in turn, redesigned I/O subsystems to improve performance. However, an
overlooked bottleneck arises due to the size of these storage devices being dwarfed by the storage footprint of large-scale
data science applications. For instance, applications that process and store consistent snapshots of incrementally growing
data streams require a significant storage footprint that far outstrips the size of the storage devices.
To address this bottleneck, we present Privateer, a general-purpose data store that optimizes the tradeoff between storage
space utilization and I/O performance. Privateer uses memory-mapped I/O with private mapping and an optimized write-back
mechanism to maximize write parallelism and eliminate redundant writes; it also uses content-addressable storage to
optimize storage space via de-duplication. We evaluate the effectiveness of Privateer by using it as the data-store
management layer of Metall, a persistent C++ data structure allocator. Using a microbenchmark that incrementally constructs
and stores snapshots of an incremental graph data structure, Privateer can reduce the storage space used by approximately
30% while delivering comparable performance to the baseline Metall implementation.
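The content-addressable de-duplication idea can be sketched as follows: blocks are keyed by a digest of their contents, so identical blocks across snapshots are stored once. This toy in-memory store is illustrative only, not Privateer's design.

```python
import hashlib

# Sketch of content-addressable de-duplication: each block is keyed by a
# digest of its contents, so identical blocks across snapshots are stored
# once while snapshots keep only lists of digests. Toy in-memory store,
# not Privateer's implementation.

class BlockStore:
    def __init__(self):
        self.blocks = {}                        # digest -> block bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(digest, data)    # store each block once
        return digest

store = BlockStore()
snapshot1 = [store.put(b"aaaa"), store.put(b"bbbb")]
snapshot2 = [store.put(b"aaaa"), store.put(b"cccc")]   # b"aaaa" reused
unique_blocks = len(store.blocks)               # 3 blocks, not 4
```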
An All-at-Once CP Decomposition Method for Count Tensors
Teresa M Ranadive (Laboratory for Physical Sciences)*; Muthu M Baskaran (Reservoir Labs)
CANDECOMP/PARAFAC (CP) tensor decomposition is a popular method for detecting latent behaviors in real-world data
sets. As data sets grow larger and more elaborate, more sophisticated CP decomposition algorithms are required to enable
these discoveries. Data sets from many applications can be represented as count tensors. To decompose count tensors,
one should minimize the sum of the generalized Kullback-Leibler divergences from each tensor entry to each corresponding
decomposition entry. Most often, this is done using the algorithm CP-APR (CP-Alternating Poisson Regression). In view of
the fact that all-at-once optimization algorithms for related CP decomposition problems often achieve better decomposition
accuracy than alternating algorithms like CP-APR, we develop CP-POPT-GDGN, an all-at-once algorithm for count tensor
decomposition that utilizes a generalized damped Gauss-Newton method. We then implement a highly efficient version of
CP-POPT-GDGN into the tensor package ENSIGN. After decomposing several tensors formed from real-world data sets,
many of which are related to network traffic, we see that CP-POPT-GDGN typically outperforms CP-APR, both in terms of
decomposition accuracy and latent behavior detection.
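The per-entry objective described here, the generalized Kullback-Leibler divergence for count data, can be written as d(x, m) = m - x + x log(x/m), with the x = 0 term taken as m. A minimal sketch (flattening the tensor into a list of entries for simplicity):

```python
import math

# Generalized Kullback-Leibler divergence between a count-tensor entry x
# and the corresponding model entry m: d(x, m) = m - x + x*log(x/m),
# with the x = 0 term taken as m. This is the per-entry objective that
# CP-APR and all-at-once methods like the one described both minimize;
# the sketch flattens the tensor into a list of entries for simplicity.

def gkl(x, m):
    return m - x + (x * math.log(x / m) if x > 0 else 0.0)

def gkl_loss(entries, model):
    return sum(gkl(x, m) for x, m in zip(entries, model))

loss_zero = gkl_loss([1.0, 2.0], [1.0, 2.0])  # exact fit -> 0.0
```

The divergence is zero exactly when the model matches the data and positive otherwise, which is what makes it a suitable loss for Poisson-distributed counts.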
Embedded Compute Matrix Processing and FFTs using Floating Point FPGAs
Michael A Parker (Raytheon)*
The techniques in this paper are applicable to next-generation antenna array processing, in applications such as 5G wireless
and radar. Advances in data converter technology are enabling direct RF sampling at very high rates, allowing large arrays to
be fully digitized and removing the need for analog combining and data compression methods. This does require much higher
levels of digital signal processing to support the aggregate data rates from these large digitized arrays, which can be several
orders of magnitude higher than in existing systems. In addition, due to the high data rates from such arrays, it is very
advantageous to implement the digital signal processing as close to the array as possible, in devices well suited for embedded
computing such as the FPGA. Two representative algorithms often used in array processing are described, the QR
decomposition and the FFT, both implemented in single-precision floating point to support the dynamic range requirements of
array processing.
1-S1: Scaling HPC Education Special (17:30-19:30)
Organizer(s): Julie Mullen & Lauren Milechin
Theme 1: Training a Broader HPC Audience (Panel)
Theme 2: Evaluating Workshops and Informal Learning (Talks)
1-S2: BRAIDS Special (17:30-19:30)
Organizer(s): Courtland VanDam
Invited Talk: Pushing the Cyber Horizon Past Protected Enclaves
Dr. Sandeep Pisharody (MIT Lincoln Laboratory)
Invited Talk: Applying AI to Improve Cyber Situational Awareness
Dr. Steven McElwee (PJM Interconnection)
Invited Talk: Entity Resolution for the Cyber Domain
Trang Nguyen (MIT Lincoln Laboratory)
Invited Talk: A First Step Toward Relative Network Baselining
Raul Harnasch (MIT Lincoln Laboratory)
Monday, September 20
2021 Abstract Book