28th Annual
IEEE High Performance Extreme Computing Virtual Conference
23 - 27 September 2024

IEEE HPEC 2024

All times are EDT (UTC/GMT -04 hours)

Speaker/Presenting Author in Italics

 

Friday, September 27

 

5-K: Keynote Session (10:30-11:00)

Co-Chairs: J. Kepner & A. Reuther

Keynote Talk: AI/ML Applications for Global Security

Eric Evans (MIT)

5-1: High Performance Computing 1 Session (11:00-12:15)

Co-Chairs: D. Ricke & B. Raut

Cycle-Stealing in Load-Imbalanced HPC Applications [Outstanding Student Paper Award]

Po Hao Chen (Brown University), Akshaya Bali, Shining Yang, Pouya Haghi, Carlton Knox, Benjamin Li (Boston Univ.), Amr Abouelmagd, Anthony Skjellum (Tennessee Tech), Martin Herbordt (Boston Univ.)
It is practical to steal cycles when Message Passing Interface (MPI) programs are load imbalanced, whether from natural algorithmic load imbalance, from collective communication and/or process skew, or from variable message loads and the need for message progress within the MPI implementation. We introduce TimeLord, a runtime library that provides fungibility for compute cycles without significantly degrading the nominal performance of the parallel application. We propose three means of exploiting wasted cycles in LAMMPS, miniFE, and miniAMR. The results show that, on average, 40% of the runtime can be used to execute extra computations and that runtime overheads are less than 4%. We envisage that these extra computations would be related to the target application; here, for simplicity and proof of concept, we assume arbitrary commodity applications. Additionally, TimeLord uncovers inefficiencies in the MPI progress engine and improves baseline performance. We explore some fundamental issues of load imbalance and its interaction with system software, including the predictability of extra time and the relative benefits of prediction versus preemption. TimeLord requires no changes to the MPI application, MPI implementation, or other system components.
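The general mechanism of reclaiming otherwise-wasted wait time can be illustrated with a minimal MPI sketch. This is not the TimeLord runtime; do_extra_work and the chunk size are hypothetical stand-ins for the arbitrary commodity computations mentioned above.

```c
/* Minimal illustration of reclaiming wait time during MPI communication.
 * This is NOT the TimeLord runtime; it only sketches the general idea of
 * interleaving extra work while a pending request has not yet completed. */
#include <mpi.h>
#include <stddef.h>

static void do_extra_work(size_t chunk)   /* hypothetical commodity work */
{
    volatile double sink = 0.0;
    for (size_t i = 0; i < chunk; i++) sink += (double)i;
    (void)sink;
}

void wait_and_steal(MPI_Request *req)
{
    int done = 0;
    while (!done) {
        /* Poll the request; MPI_Test also drives the MPI progress engine. */
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
        if (!done)
            do_extra_work(1024);          /* spend otherwise-idle cycles */
    }
}
```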
Tightly-Coupled FPGA Accelerator for Molecular Dynamics Simulation: Hardware-Software Co-Design and Fine-Grained Task Management [Outstanding Student Paper Award]

Zekang Cheng, Zerong S He, Xi Jin (Univ. of Science and Technology of China)
Over the past decades, Molecular Dynamics (MD) has been extensively utilized for drug design, protein structure prediction, and system analysis in computational chemistry and biology. However, previous acceleration efforts have often faced challenges with either host-device execution modes, leading to costly communication overheads, or complete FPGA implementations that sacrifice flexibility and programmability. In this paper, we present a novel tightly-coupled MD execution framework, combining a single FPGA with a hard-core CPU, resulting in an impressive 10x performance improvement compared to state-of-the-art CPU implementations (e.g., Gromacs). Our system exhibits characteristics reminiscent of a fine-grained CPU-based MD execution with low-latency computation accelerators. To achieve such efficiency, our hardware-software co-design approach emphasizes reducing scheduling expenses through a decentralized dependency management strategy and a hardware-assisted Multi-Producer Multi-Consumer (MPMC) queue. Importantly, our methodology is not limited to MD applications alone but can be readily applied to a wide range of fine-grained task-based applications. Moreover, the MPMC queue has the potential to scale and be adapted for use in larger FPGA clusters, further extending its applicability and relevance in diverse computing environments.
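As a rough software analogue of the hardware-assisted MPMC queue mentioned above (whose design is not detailed in this abstract), a lock-based ring buffer illustrates the multi-producer multi-consumer contract; the names, the int task type, and the capacity are illustrative.

```c
/* Software analogue of a multi-producer multi-consumer (MPMC) queue.
 * The paper's queue is hardware-assisted; this lock-based ring buffer only
 * illustrates the enqueue/dequeue contract shared by many producers and
 * consumers. */
#include <pthread.h>
#include <stdbool.h>

#define QCAP 256

typedef struct {
    int buf[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
} mpmc_queue;

void mpmc_init(mpmc_queue *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
}

bool mpmc_push(mpmc_queue *q, int task)
{
    pthread_mutex_lock(&q->lock);
    bool ok = q->count < QCAP;
    if (ok) { q->buf[q->tail] = task; q->tail = (q->tail + 1) % QCAP; q->count++; }
    pthread_mutex_unlock(&q->lock);
    return ok;                      /* false: queue full, caller retries */
}

bool mpmc_pop(mpmc_queue *q, int *task)
{
    pthread_mutex_lock(&q->lock);
    bool ok = q->count > 0;
    if (ok) { *task = q->buf[q->head]; q->head = (q->head + 1) % QCAP; q->count--; }
    pthread_mutex_unlock(&q->lock);
    return ok;                      /* false: queue empty */
}
```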
MST in Incremental Graphs Through Tree Contractions

Akanksha Dwivedi, Sameer Sharma, Dip Sankar Banerjee (IIT Jodhpur)
Dynamic graphs, characterized by rapid changes in their topological structure through the addition or deletion of edges or vertices, pose significant challenges for algorithm design. This paper presents a parallel algorithm for dynamic graphs in a batched setting. We employ a popular tree contraction mechanism to create a hierarchical representation of the input graph that allows the identification of localized areas, which in turn enables maintaining a critical graph primitive such as the minimum spanning tree (MST) without requiring re-computation from scratch. We perform experiments to demonstrate the application of our algorithm to real-world graphs where batch-dynamic algorithms on large trees are essential for incremental updates. We show experimental validations on GPUs where our proposed technique can provide up to 3.43x speedup over equivalent parallel implementations on shared-memory CPUs. Additionally, our methods can provide up to 4.23x speedup over conventional parallel computation from scratch.
Syndeo: Portable Ray Clusters with Secure Containerization

William Li, Rodney S Lafuente Mercado, Jaime Pena, Ross E Allen (MIT Lincoln Laboratory)
We present Syndeo: a software framework for container orchestration of Ray on Slurm. In general, the idea behind Syndeo is to write code once and deploy it anywhere. Specifically, Syndeo is designed to address the issues of portability, scalability, and security for parallel computing. The design is portable because the containerized Ray code can be re-deployed on Amazon Web Services, Microsoft Azure, Google Cloud, or Alibaba Cloud. The process is scalable because we optimize for multi-node, high-throughput computing. The process is secure because users are forced to operate with unprivileged profiles, meaning administrators control the access permissions. We demonstrate Syndeo’s portable, scalable, and secure design by deploying containerized parallel workflows on Slurm, which Ray does not officially support.
Evaluating One-Sided Communication on Graph500 with MPI-RMA and OpenSHMEM

Jefferson Boothe (Univ. of Pittsburgh), Alan George (NSF Center for High Performance Reconfigurable Computing)
While traditionally utilized for two-sided and collective communication, the latest MPI standards support remote memory access (RMA) between processes to enable one-sided communication. This paradigm is typically associated with fine-grained communication and irregular memory accesses. Many graph analysis problems feature such irregular memory and communication patterns, making them a good choice for performance evaluation. Graph500 is a popular benchmark built upon breadth-first search (BFS) on an undirected graph, which is well known for its sparse data accesses and fine-grained communication. In this research, we analyze and compare the scalability of multiple implementations of BFS using MPI-RMA against previously developed OpenSHMEM-based implementations optimized to maximize the benefits of one-sided communication. Additionally, we evaluate these implementations against the state-of-the-art MPI reference code using different numbers of processing elements and various problem sizes. Our experimental evaluation shows consistently improved performance with MPI-RMA over the best OpenSHMEM implementation on Graph500’s BFS kernel at scales up to 32 nodes on the Pittsburgh Supercomputing Center Bridges-2 Regular Memory partition and the University of Pittsburgh Center for Research Computing (Pitt CRC) MPI Cluster. Because graph processing has a higher ratio of communication to computation, the latency-hiding aspects of one-sided communication could not be fully exploited. While we demonstrate that MPI-RMA achieves ∼1.8× better performance than the MPI reference implementation on 32 nodes when using only 4 cores per node, the reference version was more performant in the majority of configurations tested. We conclude that while one-sided communication has shown promising performance on some large-scale computing tasks, it remains difficult from a development standpoint to leverage its benefits on more complex kernels.
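For orientation, the basic MPI-RMA one-sided pattern that such BFS implementations build on looks roughly like the following generic sketch (not the benchmark code): each rank exposes a window of memory, and peers write into it with MPI_Put under fence synchronization.

```c
/* Generic MPI-RMA one-sided communication pattern (not the Graph500 code):
 * each rank exposes a window of memory and remote ranks write into it with
 * MPI_Put; synchronization here is provided by fence epochs. */
#include <mpi.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;
    int64_t *local;                 /* window buffer, e.g. per-vertex flags */
    MPI_Win win;
    MPI_Win_allocate(n * sizeof(int64_t), sizeof(int64_t),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &local, &win);
    for (int i = 0; i < n; i++) local[i] = -1;

    MPI_Win_fence(0, win);          /* open an access epoch */
    int64_t value = rank;
    int target = (rank + 1) % size;
    /* One-sided write: no matching receive is posted on the target rank. */
    MPI_Put(&value, 1, MPI_INT64_T, target, 0, 1, MPI_INT64_T, win);
    MPI_Win_fence(0, win);          /* complete the epoch */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```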

5-P1 (12:15-13:15): Poster Session 5-1

Chair(s)/Host(s): TBD

Synthesizing Numerical Linear Algebra using Julia [Outstanding Short Paper Award]

Sophie Xuan, Rabab MA Alomairy, Evelyne Ringoot, Felipe Tome, Julian Samaroo, Alan Edelman (MIT)
This paper presents work in progress to implement generic Julia functions for dense numerical linear algebra that support various data types and hardware (CPU and GPU) with a single API. This implementation transcends the traditional type-specific and hardware-specific LAPACK/BLAS libraries by leveraging the Julia programming language and its LLVM-based compilation process, without sacrificing performance. This work aims to demonstrate the potential of the Julia language to advance the field of high-performance computing by providing future-proof, efficient, and generic alternatives to legacy libraries.
Towards LibraryX: A Framework for Cross-Library Call Optimization [Outstanding Short Paper Award]

Sanil Rao, Anant Prakash, Franz Franchetti (Carnegie Mellon Univ.)
In scientific computing, key algorithms are implemented using domain-specific libraries. In many cases these algorithms cannot be expressed using just one domain library and instead must combine library calls from several domains. Cross-library implementations are productive but suffer in performance because library developers cannot optimize library calls outside their domain. Additionally, automated solutions such as compilers struggle to optimize these cases because they cannot cross the library-call boundary. This leaves developers manually implementing optimized algorithms to reach ideal performance, hindering readability and productivity. We propose LibraryX, a framework that expresses key algorithms using domain libraries while automatically providing the high-performance implementations that would otherwise be written by hand. LibraryX uses a combination of library-call semantic capture, abstraction lifting/code generation, and runtime compilation to provide optimized implementations without modifying the source application. We demonstrate LibraryX using the example of FFT convolution.
Compressed Cannon’s Algorithm

Louis Jencka, Amanda J Bienz (Univ. of New Mexico)
Parallel matrix-matrix multiplication (GEMM) is a linear-algebra operation underpinning many applications in high-performance and scientific computing. Under strong scaling, GEMM’s communication latency among processes, nodes, and accelerators can be a leading performance bottleneck. Data compression is a common solution for reducing transport costs but is often not applied for this purpose in HPC, where low-latency, high-throughput networks are difficult for compressors to match. However, some parallel GEMM algorithms — such as Cannon’s method — exhibit a communication pattern in which data is rarely altered as it travels between processes, providing an opportunity for a simple compression strategy: compress submatrices once at the outset, and then circulate the compressed versions. This minimizes compression, which is typically significantly slower than decompression, at the price of additional memory utilization. In this work, we measure and model scaling studies of this strategy using six existing and widely used lossless compressors (NDZip, ZStandard, Zip, BloscLZ, ZFP, and LZ4). We find that most of these compressors can improve communication latency, with speedups of up to 1.2x observed. Additionally, we evaluate and discuss the accuracy of our model against our measured results.
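A minimal sketch of the compress-once strategy, assuming ZStandard and a square process grid, is shown below. It is not the paper's implementation: only the A blocks are compressed, the initial skewing step is omitted, local_gemm is a stub, and error handling is dropped.

```c
/* Sketch of the "compress once, circulate compressed" idea for Cannon's
 * algorithm using ZStandard. Simplified: only A circulates compressed,
 * the initial skew is omitted, and local_gemm is a stub. */
#include <mpi.h>
#include <zstd.h>
#include <stdlib.h>
#include <string.h>

void local_gemm(const double *A, const double *B, double *C, int nb); /* stub */

void cannon_compressed(double *Ablk, double *Bblk, double *Cblk,
                       int nb, int q, int left, int right, int up, int down,
                       MPI_Comm comm)
{
    size_t raw = (size_t)nb * nb * sizeof(double);
    size_t cap = ZSTD_compressBound(raw);
    char *sendA = malloc(cap), *recvA = malloc(cap);
    double *Atmp = malloc(raw);

    /* Compress the A block once; only the compressed bytes circulate. */
    size_t clen = ZSTD_compress(sendA, cap, Ablk, raw, 1);

    for (int step = 0; step < q; step++) {
        /* Multiply using the current (decompressed) A block. */
        ZSTD_decompress(Atmp, raw, sendA, clen);
        local_gemm(Atmp, Bblk, Cblk, nb);

        /* Shift the compressed A block left; receive from the right. */
        MPI_Status st;
        MPI_Sendrecv(sendA, (int)clen, MPI_BYTE, left, 0,
                     recvA, (int)cap,  MPI_BYTE, right, 0, comm, &st);
        int rlen;
        MPI_Get_count(&st, MPI_BYTE, &rlen);   /* incoming compressed size */
        memcpy(sendA, recvA, (size_t)rlen);
        clen = (size_t)rlen;

        /* B shifts uncompressed in this simplified sketch. */
        MPI_Sendrecv_replace(Bblk, nb * nb, MPI_DOUBLE, up, 1, down, 1,
                             comm, MPI_STATUS_IGNORE);
    }
    free(sendA); free(recvA); free(Atmp);
}
```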
Multiplication of Sparse Matrices and their Transpose using Compressed Sparse Diagonals

Sardar Anisul Haque (Oryx Universal College in Partnership with LJMU (UK)), Mohammad Tanvir Parvez (Qassim Univ.), Shahadat Hossain (Univ. of Northern British Columbia)
Matrix-matrix multiplication is one of the most important kernels in linear algebra, with a multitude of applications in scientific and engineering computing. Sparse matrix computation on modern high-performance computing (HPC) architectures presents challenges such as load balancing, data-locality optimization, and computational scalability. Data structures for storing sparse matrices are designed to minimize overhead information as well as to optimize the operation count and memory access. In this study, we present a new data structure, “compressed sparse diagonal” (CSD), to efficiently store and compute with general sparse matrices. CSD builds upon the previously developed diagonal storage, an orientation-independent uniform scheme for computing with “structured” matrices [Hossain 2019]. Compared with the widely used compressed sparse row/column (CSR/CSC) formats, the CSD scheme avoids an explicit transposition operation when multiplying a matrix with its transpose. Results from preliminary numerical experiments with the aforementioned types of matrices demonstrate the CSD scheme’s effectiveness in matrix-transposed-matrix multiplication.
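The CSD layout itself is not specified in this abstract. For background, a conventional DIA-style storage by diagonals, of the kind that diagonal schemes build upon, can be sketched as follows; this is illustrative only and is not the authors' CSD format.

```c
/* Conventional DIA-style storage by diagonals (illustrative only; this is
 * not the authors' CSD format). Stored diagonal d holds the entries
 * A[i][i + offsets[d]], padded where the diagonal falls outside the matrix. */
#include <stddef.h>

typedef struct {
    int     n;        /* matrix dimension (n x n)                */
    int     ndiag;    /* number of stored (non-empty) diagonals  */
    int    *offsets;  /* ndiag diagonal offsets, 0 = main diagonal */
    double *values;   /* ndiag * n values, one row per diagonal  */
} dia_matrix;

/* y = A * x for a matrix stored by diagonals */
void dia_matvec(const dia_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) y[i] = 0.0;
    for (int d = 0; d < A->ndiag; d++) {
        int off = A->offsets[d];
        for (int i = 0; i < A->n; i++) {
            int j = i + off;
            if (j >= 0 && j < A->n)
                y[i] += A->values[(size_t)d * A->n + i] * x[j];
        }
    }
}
```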
Augmenting HPC Profilers with Analysis Capabilities

Abhishek N Patil, Shamjith K V, Senthil Kumar RK, Dr. S D Sudarsan (C-DAC)
This paper describes a framework designed and developed by integrating multiple profiler modules to profile HPC applications. There is a constant demand for application performance on High Performance Computing (HPC) systems. Multiple software tools are currently available in the performance-profiling realm to help application developers profile and identify performance bottlenecks. Application developers strive to achieve optimal execution of applications on the available hardware resources using the profiling tools of their choice. They have to comprehend all the information provided by the tools and identify the bottlenecks to arrive at possible modifications. Assimilating the bottleneck information from the profiling tools’ outputs and representations requires experience and expertise in the application algorithm, programming methodologies, domain knowledge, and hardware resources. It is often challenging for developers to make changes to the application program based on profiler output. We have developed a profiling and analysis framework to identify the bottlenecks in an HPC application and help developers improve application performance. The framework is built by augmenting existing open-source profilers with analysis capabilities for bottleneck identification and potential performance suggestions. This paper describes the architecture of the software framework and the benefits of the adopted mechanism.

5-2: High Performance Computing 2 Session (12:30-13:45)

Co-Chairs: D. Ricke & J. Mullen

An Efficient Multi-DNN Accelerator Based on Multiple Systolic Array

Jianjun Chen, Han Jiao, Wenjin Huang, Yihua Huang (Sun Yat-Sen Univ.)
With the rapid evolution of artificial intelligence (AI) technology, the scope of AI applications has extended beyond individual Deep Neural Network (DNN) models, leading to a growing emphasis on multi-DNN scenarios. Google’s TPU v1, a prominent DNN hardware accelerator, employs a systolic array as its core computing module, showcasing high data reuse and computational parallelism and thereby enabling efficient processing of DNN tasks. However, applying a fixed-size monolithic systolic array in multi-DNN scenarios leads to low hardware utilization due to the computational heterogeneity of DNN tasks, and it lacks the capability to concurrently execute multiple DNN tasks. To address these issues, we propose a multi-core accelerator in which each core is designed around a systolic array. Additionally, we design a flexible Network-on-Chip (NoC) specifically tailored for the systolic arrays, allowing combinations of systolic arrays to form new systolic arrays of different sizes and shapes to adapt to DNN tasks with varying computational characteristics. To fully leverage the composability of the multi-systolic-array blocks, we also design a DNN layer execution compiler and a core allocation compiler, both of which significantly improve performance. The experimental results show that our design significantly outperforms a TPU-like architecture and DM-NPU, achieving average reductions of 65.5% and 34.8% in average normalized turnaround time (ANTT), and average improvements of 104.3% and 30.1% in system throughput (STP), under the multitasking scenario.
JACC.shared: Leveraging HPC Metaprogramming and Performance Portability for Computations That Use Shared Memory GPUs

Pedro Valero-Lara, William Godoy, Keita Teranishi, Jeffrey Vetter (ORNL)
In this work, we present JACC.shared, a new feature of Julia for ACCelerators (JACC), the performance-portable metaprogramming model for the just-in-time, LLVM-based Julia language. This new feature allows JACC applications to leverage the high-performance computing (HPC) capabilities of high-bandwidth, on-chip GPU memory. Historically, exploiting high-bandwidth GPU shared memory has not been a priority for high-level programming solutions. JACC.shared covers that gap for the first time, providing a high-level, portable, and easy-to-use solution for programmers to exploit this memory and supporting all current major accelerator architectures. Well-known HPC and AI workloads, such as multi/hyperspectral imaging and AI convolutions, have been used to evaluate JACC.shared on two exascale GPU architectures hosted by some of the most powerful US Department of Energy supercomputers: Perlmutter (NVIDIA A100) and Frontier (AMD MI250X). The performance evaluation reports speedups of up to 3.5x from adding only one line of code to the base codes, thus providing important acceleration in a simple, portable, and transparent way and elevating the programming-productivity and performance-portability capabilities of Julia/JACC HPC, AI, and scientific applications.
OCO-GAT: An Accelerator for Graph Attention Network with Optimized Calculation Order

Qi Liu, Wenjin Huang, Wenlu Peng, Yihua Huang (Sun Yat-Sen Univ.)
The Graph Attention Network (GAT) introduces an attention mechanism to focus on the most significant aspects of the data and utilizes weighted sums for aggregation. As a result, GAT exhibits superior performance on tasks involving graph data compared to previous Graph Neural Networks (GNNs). However, this also introduces a more complex computational process and stronger data dependencies, making previous GNN accelerators inadequate for GAT inference. Therefore, we propose an optimized calculation order for GAT along with a corresponding accelerator architecture, OCO-GAT, specifically tailored for GAT inference, which includes an efficient pipeline design. Additionally, we introduce a distributed fine-grained on-chip storage scheme, ensuring computational parallelism while mitigating significant growth in storage resource consumption. We deployed OCO-GAT on a Xilinx Alveo U250 FPGA, and the experimental results demonstrate that the optimized calculation order reduces the workload of division operations by an average of 17.2% across four datasets. OCO-GAT achieves high energy efficiency and improves inference performance by factors of 1.2 to 301.1 compared to CPUs, GPUs, and three peer works.
Task-Level Parallelism for the Multifrontal Method in Tightly Coupled CPU-FPGA Architectures

Zerong S He, Zekang Cheng, Zhongguang Xu, Xi Jin (Univ. of Science and Technology of China)
The multifrontal method is an efficient direct method for solving sparse linear systems. This algorithm transforms the factorization of a large and sparse matrix into a sequence of dense matrix operations. Some of these dense operations can be accelerated with FPGAs, while others are suitable for CPUs. In this paper, we propose a tightly coupled heterogeneous domain-specific architecture (DSA) for the efficient execution of the multifrontal algorithm. This tightly coupled hardware architecture ensures low communication overhead between the CPU and FPGA. Each matrix operation is assigned to the CPU or FPGA based on its characteristics. Furthermore, we apply the task-based scheduling model for multi-core architectures to our heterogeneous DSA. Considering the difference in the storage structure of the CPU and FPGA, we modify the original task graph and propose a data-centric task graph that is more suitable for the FPGA. Based on this scheduling model, we propose two optimizations to further improve performance. Finally, we test the scheduling overhead of the system and determine the finest task granularity accordingly. We evaluate our architecture on Xilinx ZCU102. Compared to MUMPS, for a set of 5 matrices, our implementation can achieve an average of 4.3× performance improvement with just 10% of the computing power.
LLload: An Easy-to-Use HPC Utilization Tool

Chansup Byun, Albert Reuther, Julie Mullen, LaToya Anderson, William Arcand, Bill Bergeron, David Bestor, Alexander Bonn, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Piotr Luszczek, Peter Michaleas (MIT Lincoln Laboratory), Lauren Milechin (MIT), Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner (MIT Lincoln Laboratory)
The increasing use and cost of high performance computing (HPC) require new easy-to-use tools to enable HPC users and HPC systems engineers to transparently understand the utilization of resources. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed a simple command, LLload, to monitor and characterize HPC workloads. LLload plays an important role in identifying opportunities for better utilization of compute resources. LLload can be used to monitor jobs both programmatically and interactively, and its various options can be used to characterize users’ jobs to achieve better efficiency. This information can help users optimize HPC workloads and improve both CPU and GPU utilization, including through judicious oversubscription of the computing resources. Preliminary results suggest significant improvement in GPU utilization and overall throughput performance with GPU overloading in some cases. By enabling users to observe and fix incorrect job submissions and/or inappropriate execution setups, LLload can increase resource usage and improve overall throughput performance. LLload is a lightweight, easy-to-use tool for both HPC users and HPC systems engineers to monitor HPC workloads to improve system utilization and efficiency.

5-P2 (13:45-14:45): Poster Session 5-2

Chair(s)/Host(s): TBD

A Framework for Analyzing the Performance of Sparse Matrix and Graph Operations

Khaled Abdelaal, Richard M Veras (Univ. of Oklahoma)
Thorough performance analysis is crucial for developing and optimizing algorithms, particularly for compute-intensive operations like those in scientific libraries such as the Basic Linear Algebra Subprograms (BLAS). While the performance of dense computations can be predicted from dataset dimensions and strides, sparse matrix operations require more detailed analysis due to their complexity. Standard performance evaluations for sparse operations use a canonical set of matrices, but generalizing these results to new datasets is of limited value. This paper presents a framework for evaluating sparse matrix and graph operations by visualizing performance through parameterized graph models. It assesses the impact of different parameter sets and noise sources on performance, providing a modular and extensible approach that includes system-wide performance considerations and allows users to integrate various graph model generators, operation implementations, and noise types.
Efficient Eigenvalue Computation of Parahermitian Matrices Using Neural Networks

Diyari A. Hassan (Qaiwan Intl. Univ.), Yunus Egi, Soydan Redif (American Univ. of Middle East)
In recent years, computing the eigenvalue decomposition of a polynomial matrix has become increasingly essential in many areas of adaptive signal processing. However, traditional iterative algorithms, such as sequential matrix diagonalisation (SMD), often incur high computational costs. This paper proposes two artificial neural network (ANN)-based approaches for computing the polynomial eigenvalues estimated by SMD. By utilizing feed-forward and convolutional neural network models, we significantly reduce computational costs, including CPU time and floating-point operations (FLOPs). The results demonstrate that ANN-based polynomial eigenvalue decomposition (PEVD) offers a more efficient solution for large matrix computations compared to traditional methods.
Towards a RISC-V Instruction Set Extension for Multi-word Arithmetic

Youngjin Eum, Naifeng Zhang, Larry Tang, Franz Franchetti (Carnegie Mellon Univ.)
Multi-word arithmetic is a method of performing calculations on data wider than the machine’s registers; for example, it enables 128-bit arithmetic on a 64-bit machine. One application of multi-word arithmetic is cryptography, since working with large numbers is a critical security feature. Currently, many CPUs and GPUs support multi-word arithmetic by providing a single carry flag. This is useful for implementing simple 128-bit operations such as addition and subtraction but is limiting for more complicated operations such as multiplication or modulo. Our key contributions are an extension to the RISC-V ISA that accelerates multi-word arithmetic by incorporating multiple carry bits to enable modulo and multiplication, and an implementation of the ISA using the Chipyard framework and the RoCC interface, achieving a 1.3x speedup in clock cycles on a number-theoretic transform benchmark compared to an implementation without the accelerator.
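For context, the single-carry pattern that today's ISAs support corresponds to the usual software emulation of 128-bit addition from 64-bit words, as in this small C sketch (a generic illustration, not the proposed extension).

```c
/* Multi-word (128-bit) addition built from 64-bit limbs: each limb addition
 * may wrap around, and the resulting carry is propagated into the next limb.
 * Per the abstract, the proposed ISA extension adds multiple carry bits to
 * go beyond this single-carry pattern for multiplication and modulo. */
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    uint64_t carry = (r.lo < a.lo);   /* 1 if the low-limb addition wrapped */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```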
AstraMQ: Distributed MQTT Broker

Rohan M Doshi, Sanika Inamdar, Tanmay Karmarkar, Madhuri S Wakode (Pune Inst. Of Computer Tech.)
AstraMQ is a distributed MQTT broker that offers a scalable, fault-tolerant solution for efficient message brokering in IoT deployments. This architecture integrates a custom load balancer, MQTT broker nodes, and Redis Streams. The load balancer ensures a single point of entry and load distribution among broker nodes, enhancing system reliability. MQTT broker nodes handle standard operations and utilize Redis Streams for asynchronous message passing and state synchronization, ensuring decoupled communication and fan-out delivery. The system also explores alternative approaches using etcd with Raft for client state synchronization and gRPC with protocol buffers for message passing, as well as Kafka for message storage and inter-broker communication. AstraMQ addresses the limitations of existing distributed brokers by leveraging open-source components and the efficiency of Golang, providing a robust, cost-effective solution for large-scale IoT applications.
Computational and Numerical Properties of a Broadband Subspace-Based Likelihood Ratio Test

Cornelius Pahalson, Louise Crockett, Stephan Weiss (Univ. of Strathclyde)
This paper investigates the performance of a likelihood ratio test in combination with a polynomial subspace projection approach to detect weak transient signals in broadband array data. Based on previous empirical evidence that a likelihood ratio test is advantageously applied in a lower-dimensional subspace, we present analysis that highlights how the polynomial subspace projection whitens a crucial part of the signals, enabling a detector to operate with a shortened temporal window. This reduction in temporal correlation, together with a spatial compaction of the data, also leads to both computational and numerical advantages over a likelihood ratio test that is directly applied to the array data. The results of our analysis are illustrated by examples and simulations.

5-3: High Performance Computing 3 Session (14:15-15:30)

Co-Chairs: J. Mullen & TBD

Persistent and Partitioned MPI for Stencil Communication

Gerald Collom (Univ. of New Mexico), Jason Burmark, Olga Pearce (LLNL), Amanda J Bienz (Univ. of New Mexico)
Many parallel applications rely on iterative stencil operations, whose performance is dominated by communication costs at large scales. Several MPI optimizations, such as persistent and partitioned communication, reduce overheads and improve communication efficiency through amortized setup costs and reduced synchronization of threaded sends. This paper presents the performance of stencil communication in the Comb benchmarking suite when using non-blocking, persistent, and partitioned communication routines. The impact of each optimization is analyzed at various scales. Further, the paper presents an analysis of the impact of process count, thread count, and message size on partitioned communication routines. Measured timings show that persistent MPI communication can provide a speedup of up to 37% over the baseline MPI communication, and partitioned MPI communication can provide a speedup of up to 68%.
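A generic persistent-communication halo exchange (not the Comb benchmark code) shows how the requests are created once and restarted every iteration, which is the setup amortization being measured; buffer packing and the interior update are left as comments.

```c
/* Generic persistent MPI halo-exchange pattern (not the Comb benchmark code):
 * the send/recv requests are created once and restarted each iteration,
 * amortizing setup cost across all timesteps of the stencil loop. */
#include <mpi.h>

void stencil_loop(double *sendbuf, double *recvbuf, int count,
                  int left, int right, int iters, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Send_init(sendbuf, count, MPI_DOUBLE, right, 0, comm, &reqs[0]);
    MPI_Recv_init(recvbuf, count, MPI_DOUBLE, left,  0, comm, &reqs[1]);

    for (int t = 0; t < iters; t++) {
        /* pack sendbuf with this iteration's halo data here */
        MPI_Startall(2, reqs);                  /* restart both requests */
        /* overlap: update interior cells while the halo is in flight */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        /* unpack recvbuf and update boundary cells here */
    }
    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
}
```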
HPC Network Simulation Tuning via Automatic Extraction of Hardware Parameters

Joshua Suetterlein, Stephen Young, Jesun S Firoz, Joseph Manzano, Nathan Tallent, Ryan Friese, Kevin Barker, Timothy Stavenger (PNNL)
Popular HPC network interconnection simulators such as SST/macro provide a variety of configurable parameters to explore the design space of hardware components such as network interface cards (NICs), switches, and the links among them. While such knobs provide flexibility to explore design trade-offs for novel hardware, manually configuring simulations to match configurations of existing hardware in order to focus on topology exploration can be cumbersome and error-prone, leading to widely inaccurate simulations. This challenge is compounded when specifications of various (proprietary) technologies are not readily available or are intentionally omitted. In this work, we propose a framework to autotune the multiple network models’ simulation configurations within SST/macro using Tree-structured Parzen Estimator-based Bayesian optimization to observe the effect on simulation accuracy across different message regimes. These regimes range from small to large message sizes and from latency-bound to bandwidth-bound messages. We provide a detailed analysis of the simulation error for four representative HPC systems. Our Bayesian optimization-based autotuning framework for network models achieves up to a 5x improvement in accuracy over best-effort manual configurations based on available hardware specifications.
Accelerating Multi-Agent DDPG Training on Multi-GPU Platforms

Samuel Wiggins, Viktor K Prasanna (USC)
Multi-Agent Reinforcement Learning (MARL) is a crucial technology in artificial intelligence applications. Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is a state-of-the-art MARL algorithm that has gained widespread adoption and is considered a popular baseline for comparison against novel MARL algorithms. The training process of MADDPG systems involves a high volume of data communication and computation when scaling to larger problem sizes. While state-of-the-art CPU-GPU systems provide the necessary computing power for high-throughput training, they do not efficiently utilize the underlying platform resources and are unable to facilitate training using multiple GPU devices. We propose a novel accelerated system for MADDPG training that leverages multiple GPU devices while additionally increasing the utilization of CPU resources. We assess our MADDPG system on multiple benchmarks on a multi-GPU platform, achieving up to 1.5× higher system throughput compared to state-of-the-art CPU-GPU systems.
Binary Bleed: Fast Distributed and Parallel Method for Automatic Model Selection

Ryan C Barron, Maksim Eren (LANL), Manish Bhattarai (Los Alamos National Lab), Ismael Boureima (LANL), Cynthia Matuszek (UMBC), Boian Alexandroe (LANL)
In several Machine Learning (ML) clustering and dimensionality-reduction approaches, such as non-negative matrix factorization (NMF), RESCAL, and K-Means clustering, users must select a hyper-parameter k to define the number of clusters or components that yields an ideal separation of samples or clean clusters. This selection, while difficult, is crucial to avoid overfitting or underfitting the data. Several ML applications use scoring methods (e.g., Silhouette and Davies-Bouldin scores) to evaluate the cluster pattern stability for a specific k. The score is calculated for different trials over a range of k, and the ideal k is heuristically selected as the value before the model starts overfitting, indicated by a drop or increase in the score resembling an elbow-curve plot. While a grid search can accurately find a good k value, visiting a range of k can become time-consuming and computationally resource-intensive. In this paper, we introduce the Binary Bleed method, based on binary search, which significantly reduces the k search space for these grid-search ML algorithms by truncating target k values from the search space using a heuristic with thresholding over the scores. Binary Bleed is designed to work with single-node serial, single-node multi-processing, and distributed computing resources. In our experiments, we demonstrate the reduction in search space over a naive sequential search for the ideal k and the accuracy of Binary Bleed in identifying the correct k for NMFk, K-Means, pyDNMFk, and pyDRESCALk with Silhouette and Davies-Bouldin scores. We make our implementation of Binary Bleed for the NMF algorithm available on GitHub.
GPU Sharing with Triples Mode

Chansup Byun, Albert Reuther, LaToya Anderson, William Arcand, Bill Bergeron, David Bestor, Alexander Bonn, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Piotr Luszczek, Peter Michaleas (MIT Lincoln Laboratory), Lauren Milechin (MIT), Guillermo Morales, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner (MIT Lincoln Laboratory)
There is a tremendous amount of interest in AI/ML technologies due to the proliferation of generative AI applications such as ChatGPT. This trend has significantly increased demand for GPUs, which are the workhorses for training AI models. Due to the high cost and limited supply of GPUs, it has become important to optimize GPU usage in HPC centers. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed an easy-to-use GPU sharing feature supported by LLSC-developed tools including LLsub and LLMapReduce. This approach overcomes some of the limitations of existing methods for GPU sharing. It allows users to apply GPU sharing whenever possible while they are developing their AI/ML models, doing parametric studies on their AI models, or executing other GPU applications. Based on our initial experimental results, GPU sharing with triples mode is easy to use and achieves significant improvements in GPU usage and throughput performance for certain types of AI applications.

5-4: High Performance Computing 4 Session (15:45-17:30)

Co-Chairs: TBD & C. Byun

BB-CVXOPT: Basic Block Execution Count Estimation and Extrapolation using Constrained Convex Optimization

Youssef A Aly (New Mexico State Univ.), Atanu Barai, Nandakishore Santhi (LANL), Hameed Badawy (New Mexico State Univ.)
Program execution time often scales with input size, making performance prediction without execution valuable if the model is efficient and scalable. While machine learning has been used for such predictions, we propose using Constrained Convex Optimization at the finer Basic Block (BB) level. Our BB-CVXOPT system models program performance by breaking a target program down into BBs and using a numerical solver to generate polynomial equations for each BB as a function of input size. These equations predict BB execution counts, which can then estimate runtime when multiplied by BB execution times for a given system. BB-CVXOPT is architecture-independent, achieving error rates as low as 1.39e-16 and a MAPE below 1.0e-07 for the largest 30% of the dataset, and a MAPE of 1.30e-03 for 65% of the dataset.
Parallel Online Directed Acyclic Graph Exploration for Atlasing Soft-Matter Assembly Configuration Spaces

Rahul Prabhu, Amit Verma, Meera Sitharam (Univ. of Florida)
The paper formalizes a version of parallel online directed acyclic graph (DAG) exploration that is general enough to be readily mapped to many computational scenarios. In both the offline and online versions, vertices are weighted with the work units required for their processing, at least one parent must be completely processed before a child is processed, and at any given time only one processor can work on any given vertex. The online version has the following additional natural restriction: only after a vertex is processed are its required work units and its children known. Using the Actor Model of parallel computation, it is shown that a natural class of parallel online algorithms meets a simple competitive-ratio bound. We demonstrate and focus on the problem’s occurrence in the scenario of energy landscape roadmapping or atlasing under pair-potentials, a highly compute- and storage-intensive modeling component integral to diverse applications involving soft-matter assembly. The method is experimentally validated using a C++ Actor Framework (CAF) software implementation built atop EASAL (Efficient Atlasing and Search of Assembly Landscapes), a substantial open-source software suite, running on multiple CPU cores of the HiPerGator supercomputer, demonstrating linear speedup.
Towards Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication on GPUs

Seth Kay, H. Howie Huang (George Washington Univ.)
Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in neural networks and high-performance graph algorithms. Most popular solutions to SpMM use ahead-of-time (AOT) compilation, where the entire program is compiled before execution. With the introduction of GPU threading in SpMM, there have been significant leaps in performance. However, AOT compilation for threaded SpMM has four major limitations: unnecessary memory access, greater branch overhead, unnecessarily retained memory, and slow memory transfers between the host and the device. These limitations stem from the nature of AOT compilation, where key information is inaccessible at compilation time but becomes available at run time. In this paper, we propose a SIMD GPU JIT compilation solution to SpMM, improving SpMM resource usage and run time on GPUs. The most significant advantage of our framework is that JIT compilation provides our solution with runtime information. This information is used to dynamically create precise thread structures based on the input matrices. Operations are evenly distributed to each thread in their smallest denominations to achieve maximum efficiency and a balanced workload distribution. Runtime information is also used to manage a self-maintained memory pool, which increases memory efficiency, bandwidth, and SpMM speed. Improved forms of memory allocation and storage are used, decreasing the number of transfers needed between the device and host, as well as increasing GPU memory bandwidth and decreasing memory access times. We conduct a performance evaluation of our framework and compare it to an AOT baseline provided by NVIDIA: a threaded matrix multiplier using the CUDA framework, adjusted for SpMM using an array in CSR format. Our JIT framework consistently delivers a significant performance enhancement over the native CUDA solution, achieving on average a 1.3x improvement in execution time and a 3x improvement in GPU memory efficiency.
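As background, the CSR-based sparse-times-dense product that such a baseline computes is the textbook kernel below; it is unrelated to the authors' JIT framework and is included only to fix terminology (rowptr/colind/vals are the standard CSR arrays).

```c
/* Reference CSR sparse-matrix x dense-matrix product C = A * B, with
 * row-major dense B and C. This is only the textbook kernel a baseline
 * would compute; the paper's JIT framework is not shown here. */
#include <stddef.h>

void spmm_csr(int m, int n,                   /* A is m x k, B is k x n */
              const int *rowptr, const int *colind, const double *vals,
              const double *B, double *C)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++)
            C[(size_t)i * n + j] = 0.0;
        for (int p = rowptr[i]; p < rowptr[i + 1]; p++) {
            double a = vals[p];
            const double *brow = &B[(size_t)colind[p] * n];
            for (int j = 0; j < n; j++)
                C[(size_t)i * n + j] += a * brow[j];
        }
    }
}
```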
HBM-based Hardware Accelerator for GNN Sampling and Aggregation

Yuchen Gui (Univ. of Science and Technology of China), Qizhe Wu, Wei Yuan (USTC), Huawen Liang, Xiaotian Wang, Xi Jin (Univ. of Science and Technology of China)
The GNN computation process described by message-passing networks mainly consists of two steps: combination and aggregation. The combination step (matrix multiplication) matches the matrix-multiplication computation in traditional DNNs and has no characteristics special to GNNs. The aggregation step, on the other hand, is unique to GNNs and imposes heavy random-access pressure on memory. Most currently available accelerator solutions focus on the whole GNN computation or on sample-aggregation on GPUs/CPUs, while there are fewer FPGA/ASIC-based accelerators dedicated to the sample-aggregation process. The hardware acceleration scheme proposed in this paper focuses on two aspects. First, an innovative streaming sampler is proposed to accelerate sampling by exploiting the power-law degree distribution of GNN datasets. Second, feature acquisition is accelerated by the high-bandwidth advantage of HBM, and an innovative aggregation scheme is realized to match the data read out of HBM. The proposed hardware streaming-sampler scheme is tested on an FPGA and achieves a speedup of 2x to 20x relative to traditional FPGA-based node sampling, and the feature acquisition and aggregation accelerator achieves up to a 300x speedup relative to the GPU platform.
GPU Accelerated Construction of Time Respecting Data Structure for Temporal Graphs

Animan Naskar, Venkata Kalyan Tavva (IIT Ropar), Subhasis Banerjee (Shell India Markets Pvt. Ltd.)
The transformation of a temporal graph into its corresponding time-respecting graph (TRG) offers significant advantages for neighborhood search problems compared to traditional edge-stream representations. However, constructing TRGs is a time-consuming process that can be substantially accelerated using GPU-based parallelism. In this paper, we propose a novel, highly parallel method to construct the TRG from the edge list by leveraging GPU parallelism. For graphs of various sizes, ranging from 0.05 to 33 million edges, we achieve a speedup of up to 387x over the sequential variant on comparable hardware. We demonstrate that our method produces the TRG in the standard CSR form that can be input to high-performance graph analytics libraries such as nvGRAPH. The TRG produced is also optimized for a minimal number of vertices and edges. Our method also provides support for efficient real-time dynamic updates to the temporal graph.
Comparative Analysis of GCC and LLVM for Performance Optimization on Aarch64

Mriganka Bezbaruah, Samruddhi Dhakulkar, Prachi Pandey, Haribabu Pasupuleti, S A Kumar, S D Sudarsan (C-DAC)
The emergence of ARM-based architectures in high-performance computing (HPC) has necessitated a closer examination of compiler behavior and optimizations. This paper investigates the performance and optimization capabilities of the GCC and LLVM compilers for Aarch64 processors on HPC workloads. We focus on ARM-based HPC processors used in systems like Grace Hopper and Fujitsu A64FX, emphasizing their implementations of the Scalable Vector Extension (SVE) and other architectural features that enhance performance. Through comprehensive benchmarking, including vectorization analysis with the Test Suite for Vectorizing Compilers (TSVC), we evaluate compiler performance based on execution times, applied optimizations, and code portability across these systems. By comparing recent versions of GCC (v14) and LLVM (v18) with their previous versions, we highlight performance improvements and identify optimization gaps. Our analysis of assembly code offers insights into vectorization, register allocation, and SIMD instruction generation, providing valuable recommendations for future compiler enhancements. This study aims to provide insights for the development of more efficient compilers, ultimately advancing the performance of Aarch64-based HPC systems.
