28th Annual
IEEE High Performance Extreme Computing Virtual Conference
23 - 27 September 2024

All times are EDT (UTC/GMT -04 hours)

Speaker/Presenting Author in Italics

 

Thursday, September 26

4-K: Keynote Session (10:30-11:00)

Co-Chairs: J. Kepner & A. Reuther

Keynote Talk: Convergence across the Computing Continuum: The NSF Leadership Class Computing Facility meets the Edge, Interactive Computing, and Low-Precision AI

Dan Stanzione (Texas Advanced Computing Center)

4-1: AI at Scale and AI on the Edge Session (11:00-12:15)

Co-Chairs: B. Sroka & K. Gettings

Breakthrough Edge AI Inference Performance using NorthPole in 3U VPX Form Factor

Filipp Akopyan, William Risk, John Arthur, Andrew Cassidy, Michael Debole, Carlos Ortega Otero, Jun Sawada, Evan Colgan, Michael Criscolo (IBM Research), Phillip Mann (IBM), Heinz Baier, Kai Schleupen, Arnon Amir (IBM Research), Alexander Andreopoulos (IBM), Rathinakumar Appuswamy, Deepika Bablani, Peter Carlson, Pallab Datta, Steven Esser, Myron Flickner, Rajamohan Gandhasri, Guillaume Garreau, Megumi Ito, Jennifer Klamo, Jeffrey Kusnitz, Nathaniel McClatchey, Neil McGlohon, Jeffrey McKinstry, Yutaka Nakamura (IBM Research), Tapan Nayak (IBM Corporation), Jay Sivagnaname, Daniel Smith, Rafael Sousa, Brian Taba, Ignacio Terrizzano, Takanori Ueda, Dharmendra Modha (IBM Research)
We present preliminary results demonstrating AI (artificial intelligence) inference using the IBM AIU NorthPole Chip [1], [2] incorporated into a compact, rugged 3U VPX form factor module (NP-VPX) [3]. NP-VPX allows NorthPole to be used in edge applications with stringent cooling requirements, high-speed switch fabrics, and rugged environments. NP-VPX processes 965 frames per second (fps) with a Yolo-v4 network with 640 × 640 pixel images at 73.5 W at full-precision accuracy, achieving 13.2 frames/J (fps/W). NP-VPX processes over 40,300 fps with a ResNet-50 network with 224 × 224 pixel images at 65.9 W at full-precision accuracy, achieving 611 frames/J.
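
A quick arithmetic check of the efficiency figures quoted above, using only the throughput and power numbers from the abstract (any difference from the reported values is rounding in the quoted inputs):

    # Energy-efficiency check using the throughput and power figures in the abstract.
    # frames/J is numerically identical to fps/W.
    def frames_per_joule(fps: float, watts: float) -> float:
        return fps / watts

    print(frames_per_joule(965, 73.5))      # Yolo-v4, 640x640: ~13.1 frames/J (reported: 13.2)
    print(frames_per_joule(40_300, 65.9))   # ResNet-50, 224x224: ~611 frames/J (reported: 611)
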
Breakthrough LLM Inference Performance using NorthPole

Rathinakumar Appuswamy, Michael Debole, Brian Taba, Steven Esser, Andrew Cassidy, Arnon Amir (IBM Research), Alexander Andreopoulos (IBM), Deepika Bablani, Pallab Datta, Jeffrey Kusnitz, Nathaniel McClatchey, Neil McGlohon, Jeffrey McKinstry (IBM Research), Tapan Nayak (IBM Corporation), Daniel Smith, Rafael Sousa, Ignacio Terrizzano, Filipp Akopyan, Peter Carlson, Rajamohan Gandhasri, Guillaume Garreau, Nelson Gonzalez, Megumi Ito, Jennifer Klamo, Yutaka Nakamura, Carlos Ortega Otero, William Risk, Jun Sawada, Kai Schleupen, Jay Sivagnaname, Matthew Stallone, Takanori Ueda, Myron Flickner, John Arthur (IBM Research), Rameswar Panda, David Cox (MIT-IBM Watson AI Lab), Dharmendra Modha (IBM Research)
For a 3-billion-parameter LLM, a research prototype inference appliance with 16 IBM AIU NorthPole processors delivers a massive 28,356 tokens/second of system throughput and sub-1 ms/token (per-user) latency while consuming merely 672 W for 16 NorthPole cards in a compact 2U form factor. With a focus on low latency and high energy efficiency, when NorthPole (in 12 nm) is compared to a suite of GPUs (in 7/5/4 nm) at various power consumptions, at the lowest GPU latency, NorthPole provides 72.7× better energy metric (tokens/second/W) while providing better latency.
A Framework to Enable Algorithmic Design Choice Exploration in DNNs

Timothy Cronin, Sanmukh Kuppannagari (Case Western Reserve Univ.)
Deep learning technologies, particularly deep neural networks (DNNs), have demonstrated significant success across many domains. This success has been accompanied by substantial advancements and innovations in the algorithms behind the operations required by DNNs. These enhanced algorithms hold the potential to greatly increase the performance of DNNs. However, discovering the best-performing algorithm for a DNN and altering the DNN to use such an algorithm is a difficult and time-consuming task. To address this, we introduce an open-source framework which provides easy-to-use, fine-grained algorithmic control for DNNs, enabling algorithmic exploration and selection. Along with built-in high-performance implementations of common deep learning operations, the framework enables users to implement and select their own algorithms to be utilized by the DNN. The framework’s built-in accelerated implementations are shown to yield outputs equivalent to, and exhibit similar performance as, implementations in PyTorch, a popular DNN framework. Moreover, the framework incurs no additional performance overhead, meaning that performance depends solely on the algorithms chosen by the user.
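
The abstract does not show the framework's interface; the sketch below is a hypothetical illustration of the kind of per-operation algorithm selection it describes. The registry, decorator, and function names are invented for illustration and are not the authors' API.

    import numpy as np

    CONV_ALGORITHMS = {}  # hypothetical registry of interchangeable implementations

    def register_conv(name):
        def wrap(fn):
            CONV_ALGORITHMS[name] = fn
            return fn
        return wrap

    @register_conv("naive")
    def conv1d_naive(x, w):
        # Straightforward valid-mode 1-D cross-correlation.
        n, k = len(x), len(w)
        return np.array([np.dot(x[i:i + k], w) for i in range(n - k + 1)])

    @register_conv("numpy")
    def conv1d_numpy(x, w):
        # Same operation expressed through np.convolve with a reversed kernel.
        return np.convolve(x, w[::-1], mode="valid")

    # A user picks an algorithm per operation; all registered choices must agree.
    x, w = np.random.rand(1024), np.random.rand(5)
    assert np.allclose(CONV_ALGORITHMS["naive"](x, w), CONV_ALGORITHMS["numpy"](x, w))
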
Benchmarking Edge AI Platforms for High-Performance ML Inference

Rakshith Jayanth, Neelesh Gupta, Viktor K Prasanna (USC)
Edge computing’s growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads on these platforms can vary significantly, especially when it comes to parallel processing, which is a critical consideration for edge deployments. To address this, we conduct a comprehensive study comparing the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions. We find that the Neural Processing Unit (NPU) excels in matrix-vector multiplication (58.6% faster) and some neural network tasks (3.2× faster for video classification and large language models). GPU outperforms in matrix multiplication (22.6% faster) and LSTM networks (2.7× faster) while CPU excels at less parallel operations like dot product. NPU-based inference offers a balance of latency and throughput at lower power consumption. GPU-based inference, though more energy-intensive, performs best with large dimensions and batch sizes. We highlight the potential of heterogeneous computing solutions for edge AI, where diverse compute units can be strategically leveraged to boost accurate and real-time inference.
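
A minimal sketch of the kind of latency/throughput measurement the study performs, shown here with a CPU matrix-vector multiply as the workload; this is not the authors' benchmark harness, and the sizes are arbitrary.

    import time
    import numpy as np

    def benchmark(fn, warmup=10, iters=100):
        for _ in range(warmup):              # warm caches before timing
            fn()
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        elapsed = time.perf_counter() - start
        return elapsed / iters * 1e3, iters / elapsed   # (ms per call, calls per second)

    A = np.random.rand(4096, 4096).astype(np.float32)
    x = np.random.rand(4096).astype(np.float32)
    latency_ms, throughput = benchmark(lambda: A @ x)
    print(f"matrix-vector multiply: {latency_ms:.3f} ms per call, {throughput:.1f} calls/s")
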
Transformers: A Graph Processing Perspective

Manish Sri Sai Surya Routhu, Sai Dheeraj Yanduru, Nathaniel K Tomczak, Sanmukh Kuppannagari (Case Western Reserve Univ.)
Transformers, a variant of AI/ML models that utilize the attention mechanism to capture interactions in sequential data, have significantly advanced a variety of scientific and engineering applications. However, attention implementations suffer from high computational and memory complexity requirements, and despite efforts to reduce their complexity, the context length over which they can capture interactions is still several orders of magnitude lower than desired. The fundamental problem is that even though these models capture pairwise interactions between data elements, they are viewed as a sequence of tensor operations on tensor data, thereby limiting avenues of optimization. In this work, we take the first steps towards re-imagining transformer-based models as graph computing pipelines by implementing the attention mechanism. Our expectation is that this view will not only dramatically increase scalability by unlocking parallelism along the context-length dimension but will also enable researchers to apply the vast library of graph analytics algorithms to obtain better insights into the inner workings of these models. We implement graph algorithms for the attention mechanism and conduct extensive experimentation by varying sequence length, token dimensions, and sparsity factor, and observe a near-linear reduction in computation time with the sparsity factor.
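
An illustrative sketch (not the authors' implementation) of the graph view described above: tokens are vertices, attention is computed only along the edges of a supplied graph, and a causal mask is expressed simply as an edge list.

    import numpy as np

    def graph_attention(Q, K, V, edges):
        """Q, K, V: (n, d) arrays; edges: iterable of (dst, src) token pairs."""
        n, d = Q.shape
        neighbors = {i: [] for i in range(n)}
        for dst, src in edges:
            neighbors[dst].append(src)
        out = np.zeros_like(V)
        for i, srcs in neighbors.items():
            if not srcs:
                continue
            scores = Q[i] @ K[srcs].T / np.sqrt(d)   # scores only along graph edges
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()                 # softmax over the neighborhood
            out[i] = weights @ V[srcs]
        return out

    n, d = 8, 16
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    causal_edges = [(i, j) for i in range(n) for j in range(i + 1)]  # lower-triangular mask as a graph
    print(graph_attention(Q, K, V, causal_edges).shape)              # (8, 16)
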

4-P1 (12:15-13:15): Poster Session 4-1

Chair(s)/Host(s): TBD

Perspective-Aware AI (PAi) for Augmenting Critical Decision Making

Marjan Alirezaie, Daniel Platnick (Flybits), Hossein Rahnama (MIT Media Lab), Alex Pentland (MIT)
Perspective-Aware AI (PAi) is a computational innovation in human-AI interaction that allows users to view and interact through each other’s perspectives by creating personalized computational models called chronicles. Chronicles capture cognitive and behavioral tendencies from an individual’s digital footprint, enabling enhanced decision-making by recognizing and auditing biases. Utilizing federated learning, PAi preserves privacy and ensures data ownership while providing scalability and precision. This approach enhances transparency and clarity in critical decision-making across various domains, including healthcare, education, and business, promoting inclusivity and diverse viewpoints.
Evaluating the Impact of Noisy Blades on PROPELLER MRI Reconstruction Quality

Gulfam A Saju, Marjan Akhi, Yuchou Chang (UMass Dartmouth)
In clinical MRI, the management of image noise remains a challenge, particularly in Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction (PROPELLER) MRI, where the effects of Gaussian noise on image quality have not been extensively explored. To address this gap, this study investigates the impact of Gaussian noise on the quality of PROPELLER MRI images, a technique pivotal for reducing motion artifacts. Systematic introduction of Gaussian noise into the k-space data of PROPELLER blades, varying in number and intensity, allowed for the simulation of realistic clinical scenarios. The study quantified the effects on image quality using peak signal-to-noise ratio (PSNR) and visual inspections. Results demonstrated a significant decline in image quality as the number and intensity of noisy blades increased. Furthermore, it was observed that removing noisy blades from the reconstruction process could partially ameliorate image quality. These findings emphasize the need for enhanced noise management in PROPELLER MRI and suggest directions for algorithmic improvements to optimize clinical MRI imaging.
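
A minimal numpy sketch of the experimental idea above (not the study's code): inject complex Gaussian noise into k-space, reconstruct with an inverse FFT, and quantify the degradation with PSNR. A single Cartesian k-space stands in here for PROPELLER blades.

    import numpy as np

    def psnr(ref, test):
        mse = np.mean((ref - test) ** 2)
        return 10 * np.log10(ref.max() ** 2 / mse)

    rng = np.random.default_rng(0)
    image = rng.random((256, 256))                 # stand-in for a magnitude MR image
    kspace = np.fft.fft2(image)

    sigma = 0.05 * np.abs(kspace).mean()           # noise level relative to mean k-space magnitude
    noise = rng.normal(0, sigma, kspace.shape) + 1j * rng.normal(0, sigma, kspace.shape)
    noisy = np.abs(np.fft.ifft2(kspace + noise))

    print(f"PSNR after k-space noise: {psnr(image, noisy):.1f} dB")
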
CompJouleS: Energy Estimate Tool for Machine Learning Algorithms for Multiple Applications in CPU, GPU, and FPGA Architectures

Murat Isik (Stanford Univ.), Jens E. Pedersen (SLAC National Accelerator Laboratory), Vedant Karia (Univ. of Texas at San Antonio), Sadasivan Shankar (Stanford Univ.)
We introduce CompJouleS, a multi-platform energy estimation tool designed to measure the energy cost and performance of custom machine learning algorithms across various hardware architectures, including CPU, GPU, FPGA, Hybrid, and ASICs. Current energy estimation tools lack the flexibility and precision required for accurate analysis across different layers of computing, including applications, machine learning architectures, and hardware architectures. CompJouleS addresses these limitations by combining top-down and bottom-up approaches to provide accurate and efficient energy estimates. The tool is modular, allowing for expansion to additional newer and heterogeneous architectures, and incorporates a computational complexity calculator to estimate the workload of various operations for user-defined algorithms. The first version of CompJouleS is limited to machine learning algorithms on three specific architectures, with the second version expected to extend its capabilities to user-defined algorithms including scientific computations. The paper also reviews existing energy estimation tools and methodologies, highlighting the advantages and limitations of each.
Power Efficient Deep Learning Acceleration using Intel Xeon® Processors

Xiaofei Jiang, Mona Minakshi, Rajesh Poornachandran, Shamima Najnin (Intel)
With the exponential growth of AI applications in data centers, one of the foremost concerns is power consumption. Intel Optimized Power Mode (OPM) aims to lower power and reduce cooling costs when servers are not at full utilization. Most data center deployments keep the platform workload mix around the 30%–40% utilization range for TCO and to handle any spikes [13]. In this paper, performance/watt has been measured on 5th Gen Intel Xeon Scalable Processors using both Gen-AI and non-Gen-AI workloads to see the impact of OPM on power consumption. OPM yields up to a 25% improvement in performance/watt at 25% server utilization; the performance or performance/watt improvement varies depending on the use cases running. Meanwhile, at 100% server utilization, performance/watt with OPM is similar to out-of-the-box performance.
Impact of Estimation Errors of a Matrix of Transfer Functions onto Its Analytic Singular Values and Their Potential Algorithmic Extraction

Mohammed Bakhit, Faizan Ahmad Khattak, Ian K. Proudler, Stephan Weiss (Univ. of Strathclyde)
A matrix of analytic functions $\mathbf{A}(z)$, such as the matrix of transfer functions in a multiple-input multiple-output (MIMO) system, generally admits an analytic singular value decomposition (SVD), where the singular values themselves are functions. When evaluated on the unit circle, for the sake of analyticity, these singular values must be permitted to become negative. In this paper, we address how the estimation of such a matrix, causing a stochastic perturbation of $\mathbf{A}(z)$, results in fundamental changes to the analytic singular values: for the perturbed system, we show that the analytic singular values lose any algebraic multiplicities and are strictly non-negative with probability one. We present examples and highlight the impact that this has on algorithmic solutions to extracting an analytic or approximate analytic SVD.
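
A compact statement of the setting, with the abstract's macro written here as $\mathbf{A}(z)$; this is a notational sketch under the usual analytic-SVD conventions, not the paper's exact formulation.

    % Notation simplified from the abstract; (.)^P denotes the parahermitian conjugate.
    \[
      \mathbf{A}(z) = \mathbf{U}(z)\,\boldsymbol{\Sigma}(z)\,\mathbf{V}^{\mathrm{P}}(z),
      \qquad
      \hat{\mathbf{A}}(z) = \mathbf{A}(z) + \mathbf{E}(z),
    \]
    % where the analytic singular values in \boldsymbol{\Sigma}(e^{j\Omega}) may take
    % negative values on the unit circle, while those of the perturbed estimate
    % \hat{\mathbf{A}}(z) lose any algebraic multiplicities and are non-negative
    % with probability one.
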
Disaggregation Patterns for Secure AI Systems

Mohamed Ghamri, Marc A Lacoste, Divi De Lacour (Orange)
Disaggregation is a growing trend in large-scale artificial intelligence (AI) systems to overcome hardware and software resource limitations and improve performance while preserving security and privacy. This paper takes a closer look at different dimensions of the concept, in AI, security and hardware. We identify two key design patterns that may be combined to build optimized disaggregated AI architectures and discuss benefits and limitations for AI and security. Using a large language model use case, we also highlight some key trade-offs between performance, resource allocation and security for different disaggregation strategies in hardware and in software.

4-2: Large AI Models Session (12:30-13:45)

Co-Chairs: N. Pitsianis & B. Sroka

MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks [Outstanding Paper Award]

Tal Kadosh (Ben-Gurion Univ., IAEC), Niranjan Hasabnis (Intel), Vy Vo (Intel Labs), Nadav Schneider (Ben-Gurion University), Neva Krien (Independent), Mihai Capotă (Intel Labs), Abdul Wasay, Guy Tamir (Intel), Theodore L Willke, Nesreen Ahmed (Intel Labs), Yuval Pinter (Ben-Gurion University), Tim Mattson (Human Learning Group), Gal Oren (Technion, Stanford Univ.)
With easier access to powerful compute resources, there is a growing trend in AI for software development to develop large language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by fine-tuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing: why do we need LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question choices made by existing LLMs by developing smaller language models (LMs) for specific domains, which we call domain-specific LMs. Specifically, we start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LMs but delivers better performance on non-HPC and HPC codes. Specifically, we pre-trained MonoCoder on an HPC-specific dataset (named HPCorpus) of C and C++ programs mined from GitHub. We evaluated the performance of MonoCoder against state-of-the-art multilingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, outperforms other LLMs on normalized-perplexity tests (in relation to model size) while also delivering competitive CodeBLEU scores for high-performance and parallel code generation. In other words, results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs. The sources of this work are available at our GitHub MonoCoder repository.
LLM Inference Serving: Survey of Recent Advances and Opportunities [Outstanding Paper Award]

Baolin Li, Yankai Jiang (Northeastern Univ.), Vijay Gadepally (MIT Lincoln Laboratory), Devesh Tiwari (Northeastern Univ.)
This survey offers a comprehensive overview of recent advancements in Large Language Model (LLM) serving systems, focusing on research since the year 2023. We specifically examine system-level enhancements that improve performance and efficiency without altering the core LLM decoding mechanisms. By selecting and reviewing high-quality papers from prestigious ML and system venues, we highlight key innovations and practical considerations for deploying and scaling LLMs in real-world production environments. This survey serves as a valuable resource for LLM practitioners seeking to stay abreast of the latest developments in this rapidly evolving field.
Enhancing Code Translation in Language Models with Few-Shot Learning via Retrieval-Augmented Generation

Manish Bhattarai, Javier E Santos, Shawn M Jones, Ayan Biswas, Boian Alexandrov, Daniel O'Malley (LANL)
The advent of large language models (LLMs) has revolutionized the field of code translation, enabling automated translation between programming languages. Despite these advancements, the accuracy and reliability of these models often falter in complex translation tasks due to a lack of contextual understanding. This paper introduces a novel approach to enhance code translation through Few-Shot Learning augmented with retrieval-based techniques. By leveraging a repository of existing code translations, we dynamically retrieve the most relevant examples to guide the model in translating new code segments. Our method, based on Retrieval-Augmented Generation (RAG), significantly improves translation quality by providing contextual examples that the model can learn from in real-time. We chose RAG over traditional fine-tuning methods due to its ability to leverage existing codebases or a locally stored corpus of code, allowing it to dynamically adapt to diverse translation tasks without the need for extensive retraining. Extensive experiments on diverse datasets, using open LLM models such as Starcoder, Llama3-70B Instruct, CodeLlama-34B Instruct, Granite-34B Code Instruct, and Mixtral-8x22B, and commercial LLM models such as GPT-3.5 Turbo and GPT-4o, demonstrate the superiority of our approach over traditional zero-shot prompting, particularly in translating between Fortran and C++. We also explored different numbers of shots (examples provided to the model during inference), specifically 1, 2, and 3 shots, and various embedding models for RAG, including Nomic-Embed, Starencoder, and CodeBERT, to evaluate the robustness and effectiveness of our approach.
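
An illustrative sketch (not the authors' pipeline) of retrieval-augmented few-shot prompting for code translation: embed a query snippet, retrieve the most similar stored translation pairs by cosine similarity, and prepend them to the prompt. The embed() function below is a crude placeholder standing in for the real embedding models compared in the paper, and the corpus entry is invented.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Placeholder embedding (hashed bag of tokens), purely for illustration.
        v = np.zeros(256)
        for tok in text.split():
            v[hash(tok) % 256] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    def retrieve(query, corpus, k=3):
        # Return the k stored translation pairs whose source is most similar to the query.
        q = embed(query)
        return sorted(corpus, key=lambda ex: -float(q @ embed(ex["source"])))[:k]

    def build_prompt(query, shots):
        parts = ["Translate the following Fortran code to C++.", ""]
        for ex in shots:                      # retrieved few-shot examples
            parts += [f"Fortran:\n{ex['source']}", f"C++:\n{ex['target']}", ""]
        parts += [f"Fortran:\n{query}", "C++:"]
        return "\n".join(parts)

    corpus = [{"source": "do i = 1, n\n  a(i) = 0.0\nend do",
               "target": "for (int i = 0; i < n; ++i) a[i] = 0.0;"}]
    query = "do i = 1, n\n  b(i) = b(i) + 1.0\nend do"
    print(build_prompt(query, retrieve(query, corpus, k=1)))
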
High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures

Xiang Fu, Xinpeng Zhang, Jixiang Ma (Nanchang Hangkong Univ.), Peng Zhao (Microsoft), Shuai Lu (Nanchang Hangkong Univ.), Xu Liu (Univ. of Washington)
Convolution is the core component of deep neural networks, and it is computationally intensive and time-consuming. Tensor data layouts significantly impact convolution operations in terms of memory access and computational efficiency. Yet there is still a lack of comprehensive performance characterization of data layouts on SIMD architectures for convolution methods. This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8, and introduces a set of general optimization techniques for both direct and im2win convolutions. We compare the optimized im2win convolution with the direct convolution and PyTorch’s im2col-based convolution across the aforementioned layouts on SIMD machines. The experiments demonstrate that the im2win convolution with the new NHWC layout achieved up to a 355% performance speedup over the NCHW layout. Our optimizations also significantly improve the performance of both im2win and direct convolutions. Our optimized im2win and direct convolutions achieved up to 95% and 94% of the machine’s theoretical peak performance, respectively.
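
A small numpy illustration of why layout matters (a sketch, not the paper's implementation): the same activation tensor stored as NCHW versus NHWC differs only by a transpose, but the strides show which dimension is contiguous in memory, and hence which loops vectorize well on SIMD hardware.

    import numpy as np

    N, C, H, W = 2, 8, 4, 4
    x_nchw = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)
    x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))   # NCHW -> NHWC

    # Strides in bytes: the unit-stride (innermost) dimension is W for NCHW but
    # C for NHWC, so channel-wise accumulation is contiguous in NHWC.
    print("NCHW strides:", x_nchw.strides)   # (512, 64, 16, 4)
    print("NHWC strides:", x_nhwc.strides)   # (512, 128, 32, 4)
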

4-P2 (13:45-14:45): Poster Session 4-2

Chair(s)/Host(s): TBD

NeuroVM: Dynamic Neuromorphic Hardware Virtualization

Murat Isik (Drexel Univ.), Kayode Inadagbo (Prairie View A&M Univ.), Ismail Can Dikmen (TEMSA)
This paper introduces a novel approach in neuromorphic computing, integrating heterogeneous hardware nodes into a unified, massively parallel architecture. Our system transcends traditional single-node constraints, harnessing the neural structure and functionality of the human brain to efficiently process complex tasks. We present an architecture that dynamically virtualizes neuromorphic resources, enabling adaptable allocation and reconfiguration for various applications. Our evaluation, using diverse applications and performance metrics, provides significant insights into the system’s adaptability and efficiency. We observed scalable throughput increases across configurations of 1, 2, and 4 Virtual Machines (VMs), reaching up to 5.1 Gibibits per second (Gib/s) for different data transfer sizes. This scalability highlights the system’s proficiency in managing data-intensive tasks. Energy consumption analysis in our virtualized accelerator environment showed a near-linear growth with the addition of more NeuroVM accelerators, ranging from 25 to 45 millijoules (mJ) as the number of accelerators increased from 1 to 20. Additionally, our investigation into reconfiguration overheads revealed that partial reconfigurations significantly reduce time compared to full reconfigurations, particularly as the number of VMs increases, with time reductions evident in the logarithmic scale of time measurements.
LLMs for Closed-Library Multi-Document Query, Test Generation, and Evaluation

Claire Randolph (Dept. of the Air Force), Adam Michaleas, Darrell O Ricke (MIT Lincoln Laboratory)
Learning complex, detailed, and evolving knowledge is a challenge in multiple technical professions. Relevant source knowledge is contained within many large documents and information sources with frequent updates to these documents. Knowledge tests need to be generated on new material and existing tests revised, tracking knowledge base updates. Large Language Models (LLMs) provide a framework for artificial intelligence-assisted knowledge acquisition and continued learning. Retrieval-Augmented Generation (RAG) provides a framework to leverage available, trained LLMs combined with technical area-specific knowledge bases. Herein, two methods are introduced which together enable effective implementation of LLM-RAG question-answering on large documents. Additionally, the AI tools for knowledge intensive tasks (AIKIT) solution is presented for working with numerous documents for training and continuing education. AIKIT is provided as a containerized open-source solution that deploys on standalone, high-performance, and cloud systems. AIKIT includes LLMs, RAG, vector stores, and a relational database with a Ruby on Rails web interface.
LLM-Based Task Planning for Navigating Companion Robot from Emotion Signals

Yuchou Chang (UMass Dartmouth), Huy Anh Pham (Intelligent Medical Objects, Inc.), Gulfam A Saju (UMass Dartmouth)
The emergence of companion robots is promising for alleviating loneliness and improving mental health. It is critical to develop accurate task plans attuned to the various emotional states of a human partner. Given the complexity and variability inherent in human mental states, manually creating plans for companion robots is not feasible. A recent framework that integrates Large Language Models (LLMs) with the Planning Domain Definition Language (PDDL) for automated task planning produces precise and flexible task plans. However, this framework has not been applied to companion robots, especially those responding to emotional states. This work introduces a new task planning strategy utilizing LLMs and PDDL for companion robots. Simulation results demonstrate that the proposed method enables the robot to successfully navigate and offer support in response to a detected state of sadness. The method can convert unstructured natural language descriptions into structured task planning information. This strategy may enhance the interaction quality of companion robots and make them more empathetic and contextually aware in their social support roles.
Large Multimodal Model for Simulating Big Training Data in Deep PROPELLER MRI

Gulfam A Saju, Marjan Akhi, Yuchou Chang (UMass Dartmouth)
This paper presents a novel approach for generating synthetic PROPELLER MRI blades using the GPT-4 large multimodal model (LMM). The approach addresses the challenge of data scarcity in PROPELLER MRI. Our method simplifies the process of data synthesis. It makes the process accessible to researchers without extensive knowledge of complex MRI algorithms. The approach involves transforming Cartesian MRI data into PROPELLER blades. We utilize a Chain-of-Thought (CoT) prompting technique to guide the model in understanding the specific requirements of PROPELLER MRI. We compare this method with traditional algorithmic approaches. The comparison demonstrates that the GPT-4 based method can produce synthetic MRI data of comparable quality but with greater efficiency and ease of use. Crucially, this study shows that LMMs have the potential to generate synthetic data without requiring extensive computational resources. This capability could greatly assist researchers in training deep learning models more easily.
Artificial Intelligence Solution on Intel Xeon Processor Power and Performance Engineering

Zhongbin Liu, Xiaofei Jiang, Jiajia Zhang (Intel)
Today, the major Cloud Service Providers (CSPs) are building high-performance infrastructures to meet cloud customers’ diverse computing demands. To help CSP customers invest in the right areas to build high-performance Xeon-based systems for their specific usages, Intel invests significant engineering resources in the design, development, and validation of Xeon power and performance features, but this engineering cost is large and does not scale. This paper introduces an Artificial Intelligence (AI) solution named Bench Counselor for post-silicon power and performance development and validation. It can suggest the most valuable hardware investment areas for a given customer usage or benchmarking methodology while reducing engineering effort. Trained on historical Xeon processor performance results and system hardware configurations, the AI solution can efficiently assist power and performance engineers in categorizing and debugging outliers, and provide heuristics on the most valuable investment areas for significant performance gains.
Boosting the Performance of Reinforcement Learning-based Task Scheduling using Offline Inference

Chedi Morchdi (Univ. of Utah), Cheng-Hsiang Chiu (Univ. of Wisconsin), Yi Zhou (Univ. of Utah), Tsung-Wei Huang (Univ. of Wisconsin)
Modern computer-aided design (CAD) tools leverage complex algorithms incorporating millions of interdependent functional tasks. Scheduling these tasks efficiently across CPUs and GPUs is paramount, as it directly governs overall performance. However, existing scheduling approaches are typically hardcoded within applications, limiting their adaptability to non-stationary computing environments. To address this challenge, a recent paper introduced a novel reinforcement learning-based online inference task scheduling algorithm. While this approach can learn to adapt the performance optimization in dynamic environments, it integrates online task execution and online task inference, leading to significant overheads, such as querying the system status for each task. To address this concern, we propose a reinforcement learning-based offline inference task scheduling system. Our system separates task execution from inference, performing the inference offline to avoid the overheads. We evaluate our approach on a VLSI static timing analysis workload and demonstrate that our approach is consistently faster than the online inference method, albeit with slightly increased resource consumption.

4-3: Innovative Computing Session (14:15-15:30)

Co-Chairs: K. Keville & P. Luszczek

Reinforcement Learning-generated Topological Order for Dynamic Task Graph Scheduling

Cheng-Hsiang Chiu (Univ. of Wisconsin), Chedi Morchdi, Yi Zhou (Univ. of Utah), Boyang Zhang, Che Chang, Tsung-Wei Huang (Univ. of Wisconsin)
Dynamic task graph scheduling (DTGS) has become a powerful tool for parallel and heterogeneous applications, such as static timing analysis and large-scale machine learning. DTGS allows applications to define the task graph structure on-the-fly, enabling concurrent task creations and task executions. However, to schedule tasks, DTGS relies on applications to define a topological order for the task graph. Existing algorithms for generating this order primarily rely on heuristics like level-by-level sorting, which lack adaptability to dynamic computing environments. This paper proposes a novel method that leverages reinforcement learning to generate topological orders for DTGS systems. We will delve into the details of our design and present a real-world use case. For instance, when scheduling a large task graph with 3.9 million tasks and 7.4 million dependencies in a large-scale static timing analysis workload, our method achieves a speedup of up to 1.52× compared to the baseline.
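
A sketch of the underlying mechanism (not the authors' system): Kahn's algorithm produces a topological order, and a per-task priority score, standing in here for the learned reinforcement-learning policy, decides which ready task is released next.

    import heapq
    from collections import defaultdict

    def topological_order(num_tasks, edges, score):
        """edges: (u, v) pairs meaning u must precede v; score(v): lower is released earlier."""
        indeg = [0] * num_tasks
        succ = defaultdict(list)
        for u, v in edges:
            succ[u].append(v)
            indeg[v] += 1
        ready = [(score(v), v) for v in range(num_tasks) if indeg[v] == 0]
        heapq.heapify(ready)
        order = []
        while ready:
            _, u = heapq.heappop(ready)       # release the highest-priority ready task
            order.append(u)
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    heapq.heappush(ready, (score(v), v))
        return order

    edges = [(0, 2), (1, 2), (2, 3), (1, 3)]
    print(topological_order(4, edges, score=lambda v: v))   # [0, 1, 2, 3]
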
FPGA Acceleration for Scalable High-Resolution OPIR Target Detection

Daniel C Stumpp (Univ. of Pittsburgh), Alan George (NSF Center for High Performance Reconfigurable Computing)
The task of infrared small-object segmentation has a wide range of applications. One such application is in military early-warning systems where overhead persistent infrared (OPIR) sensors are leveraged for target detection. Although targets of interest often manifest as dim point-source targets, making them difficult to detect, recent machine-learning algorithms have enabled significant advances in detection capability. However, these more complex algorithms and increasing sensor resolution have made high-throughput, on-orbit processing challenging. This research explores FPGA acceleration of the Multiscale Local Contrast Learning Network (MLCLNet) target-detection model. MLCLNet was selected for its combination of high detection performance and architectural simplicity. Effects of model quantization required for acceleration were evaluated and shown to be minimal and, in some cases, positive. Additionally, the Xilinx Deep Learning Processor Unit (DPU) performance for the inference task is evaluated on the Xilinx UltraScale+ and Versal AI Core device architectures. Six DPU configurations and six MLCLNet model sizes were used to parameterize inference with 128×128 subframes. This research demonstrates that the Versal DPU can perform subframe inference at up to 1626 FPS with up to 6.7× speedup over the older UltraScale+ architecture. Informed by these findings, the Versal architecture’s heterogeneous nature is leveraged to implement an accelerated end-to-end target-detection pipeline. This pipeline enables inference on large OPIR frames by batching them into appropriately overlapped subframes, preprocessing, performing inference, and postprocessing into detections. Throughput ranging from 0.96 FPS to 75.79 FPS is achieved for frame sizes ranging from 4k×4k down to 500×500. The proposed architecture’s resource utilization, latency, and power consumption are also analyzed.
Hybrid Computing Architecture Based on Analog Phase-Change Memory Chips for Deep Neural Network Training

Zhenhao Jiao (Univ. of Science and Technology of China), Tao Hong, Xiaogang Chen, Weibang Dai (Shanghai Institute of Microsystem and Information Technology, CAS), Chengcai Tu (Donghua University), Shunfen Li, Houpeng Chen, Zhitang Song (Shanghai Institute of Microsystem and Information Technology, CAS)
Deep neural networks (DNNs) have revolutionized fields like image recognition and natural language processing but face limitations with traditional von Neumann architectures due to high energy consumption and limited computing speed. We propose a hybrid architecture for DNN training combining a digital processing unit (DPU) and analog phase-change memory (PCM) chips using 40 nm CMOS technology. The DPU manages precise computations, while the PCM chip handles matrix-vector multiplication (MVM) with a novel nonlinear pulse scheme for accurate conductance tuning. Our architecture successfully trained a 3-layer fully connected neural network, achieving a classification accuracy of 97.26%, on par with software-based training. Simulations confirm the feasibility of extending this approach to more complex convolutional neural networks, demonstrating its adaptability to PCM device characteristics and potential for high-efficiency DNN training.
Exploring the Trade-off Between Repair Time and Reliability in Large Scale Cluster Computers: A Simulation-Based Approach

Leslie Horace (Georgia Inst. of Tech.), Craig Walker, William M Jones (Coastal Carolina Univ.), Nathan DeBardeleben, Vivian Hafener, Steven Senator (LANL)
As high performance computing (HPC) clusters continue to increase in performance, scale, and component count, reliability, and particularly repair time, plays a significant role in system specification, procurement, and ultimate operation of such systems. System administrators must find a balance among competing factors: initial capital investment, operational costs, and observed system performance and utility from the end users’ perspectives are chief among them. In this paper, we explore the trade-off between reliability, performance, and node repair times in large-scale HPC clusters using real historical workloads from Los Alamos National Laboratory (LANL). We enhance an existing cluster simulator to more quickly perform the large-scale parameter sweeps necessary to obtain meaningful results for these studies, in some cases by several orders of magnitude. Our results show that these simulations can be parameterized to identify trends that can be used to make decisions about system procurement and operation as a function of the operational parameters and constraints.
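
A back-of-the-envelope model of the repair-time trade-off studied above, using the standard steady-state availability formula MTBF/(MTBF+MTTR) per node; this is a sketch with illustrative numbers only, not the paper's simulator or the LANL workloads.

    from math import comb

    def cluster_availability(mtbf_hours, mttr_hours, nodes_required, nodes_total):
        """Probability that at least nodes_required of nodes_total nodes are up."""
        p = mtbf_hours / (mtbf_hours + mttr_hours)        # single-node availability
        return sum(comb(nodes_total, k) * p**k * (1 - p)**(nodes_total - k)
                   for k in range(nodes_required, nodes_total + 1))

    # Longer repair times sharply reduce the chance that a 995-node job can start
    # on a 1000-node cluster (MTBF fixed at 10,000 hours per node, illustrative only).
    for mttr in (4, 24, 72):
        print(f"MTTR {mttr:3d} h: {cluster_availability(10_000, mttr, 995, 1000):.3f}")
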
Experiences with VITIS AI for Deep Reinforcement Learning

Nabayan Chaudhury, Atharva M Gondhalekar, Wu-chun Feng (Virginia Tech)
Deep reinforcement learning has found use cases in many applications, such as natural language processing, self-driving cars, and spacecraft control applications. Many use cases of deep reinforcement learning seek to achieve inference with low latency and high accuracy. As such, this work articulates our experiences with the AMD Vitis AI toolchain to improve the latency and accuracy of inference in deep reinforcement learning. In particular, we evaluate the soft actor-critic (SAC) model that is trained to solve the MuJoCo humanoid environment, where the objective of the humanoid agent is to learn a policy that allows it to stay in motion for as long as possible without falling over. During the training phase, we prune the model using the weight sparsity pruner from the Vitis AI optimizer at different timesteps. Our experimental results show that pruning leads to an improvement in the evaluation of the reinforcement learning policy, where the trained agent can remain balanced in the environment and accumulate higher rewards, compared to a trained agent without pruning. Specifically, we observe that pruning the network during training can deliver up to 20% better mean episode length and 23% higher reward (better accuracy), compared to a network without any pruning. Additionally, there is an improvement in decision-making latency up to 20%, which is the time between the observation of the agent’s state and a control decision.

4-4: Graph Challenge Session (15:45-17:30)

Co-Chairs: J. Kepner & A. Reuther

Mercury: Efficient Subgraph Matching on GPUs with Hybrid Scheduling

Zhiheng Lin (Inst. of Computing Tech, CAS), Changjie Xu (UCAS), Ke Meng, Guangming Tan (Inst. of Computing Tech, CAS)
Subgraph matching finds all distinct subgraphs in the given data graph G that are isomorphic to the pattern graph P. It is widely used in social networks, chemoinformatics, recommendation systems, anomaly detection, and network security. Unfortunately, subgraph matching is an NP-hard problem with a huge search space that can quickly exhaust computational resources and requires materializing a large number of intermediate results. Even with GPU acceleration, the processing time for subgraph matching tasks on large graphs often fails to meet the needs of real-world applications. Previous systems use coarse-grained thread-mapping strategies and static configurations for symmetry-breaking rules and intersection kernels, which lose the opportunity to exploit the fine-grained parallelism of GPUs. In this paper, we first discuss different optimization variants from three aspects: (i) symmetry breaking, (ii) thread mapping, and (iii) intersection kernels; we then propose a novel hybrid scheduling strategy to combine these optimization variants. Based on this scheduling, we developed Mercury to enable load-balanced and efficient subgraph matching on GPUs. Experiments show that Mercury outperforms Trust, SMOG, and H-INDEX by up to 3.92x, 19.7x, and 21.6x, respectively, in triangle counting. For general pattern matching tasks, it is up to 52.4x faster than G2Miner and can scale up to 1024 GPU cards.
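
A minimal CPU sketch of the per-edge set intersection at the heart of triangle counting, the simplest pattern discussed above; GPU systems such as Mercury parallelize and heavily optimize exactly this kernel, so the snippet is only a semantic illustration.

    def triangle_count(edges):
        adj = {}
        for u, v in edges:
            if u == v:
                continue
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        count = 0
        for u, v in edges:
            if u < v:                                                 # orient each edge once
                count += sum(1 for w in adj[u] & adj[v] if w > v)     # per-edge neighbor intersection
        return count

    edges = [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]
    print(triangle_count(edges))   # 2 triangles: (0,1,2) and (0,2,3)
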
Towards Faster Graph Partitioning via Pre-training and Inductive Inference

Meng Qin (HKUST), Chaorui Zhang (Huawei), Yu Gao (Independent), Yibing Ding, Weipeng Jiang (Huawei), Weixi Zhang (Huawei Technologies), Wei Han (Huawei), Bo Bai (Huawei Technologies)
Graph partitioning (GP) is a classic problem that divides the node set of a graph into densely-connected blocks. Following the IEEE HPEC Graph Challenge and recent advances in pre-training techniques (e.g., large-language models), we propose PR-GPT (Pre-trained & Refined Graph ParTitioning) based on a novel pre-training & refinement paradigm. We first conduct the offline pre-training of a deep graph learning (DGL) model on small synthetic graphs with various topology properties. By using the inductive inference of DGL, one can directly generalize the pre-trained model (with frozen model parameters) to large graphs and derive feasible GP results. We also use the derived partition as a good initialization of an efficient GP method (e.g., InfoMap) to further refine the quality of partitioning. In this setting, the online generalization and refinement of PR-GPT can not only benefit from the transfer ability regarding quality but also ensure high inference efficiency without re-training. Based on a mechanism of reducing the scale of a graph to be processed by the refinement method, PR-GPT also has the potential to support streaming GP. Experiments on the Graph Challenge benchmark demonstrate that PR-GPT can ensure faster GP on large-scale graphs without significant quality degradation, compared with running a refinement method from scratch. We will make our code public at https://github.com/KuroginQin/PRGPT.
Distributed-Memory Sparse Deep Neural Network Inference Using Global Arrays

Sayan Ghosh, Bruce Palmer, Andres Marquez (PNNL)
Partitioned Global Address Space (PGAS) models exhibit tremendous promise in developing efficient and productive distributed-memory parallel applications. Traditionally, PGAS communication models have been applied to dense/contiguously distributed data, but most modern applications depict varied levels of sparsity. Existing PGAS models require certain adaptations to support sparse data. A number of sparse computations also require efficient support for distributed arithmetic operations, in addition to data movement. The Global Arrays toolkit from Pacific Northwest National Laboratory (PNNL) is one of the earliest PGAS models to combine one-sided data communication and distributed matrix operations, and is still used in the popular NWChem quantum chemistry suite. Recently, we have expanded the Global Arrays toolkit to support common sparse operations, like sparse matrix-dense matrix multiplies (SpMM) and sparse matrix-sparse matrix multiplication (SpGEMM). As it turns out, these operations are the backbone of sparse Deep Learning (DL); sparse deep neural networks have gained increasing attention recently in achieving speedups on inference with reduced memory footprints. Unlike scientific applications in High Performance Computing (HPC), modern (distributed-memory capable) DL toolkits often rely on non-standardized and closed-source vendor software optimizations, creating challenges in software-hardware co-design at scale. Our goal is to support a variety of sparse matrix operations and helper functions in the newly created Sparse Global Arrays (SGA), such that it is possible to build portable and productive Machine Learning scenarios for co-design purposes. We demonstrate the usefulness of SGA by building Sparse Deep Neural Network (SpDNN) challenge scenarios as a case study. The current implementation is built on top of MPI-1 and uses CPUs to maximize portability across platforms.
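
A single-node SciPy sketch of one sparse-DNN inference layer of the kind SGA targets: sparse activations times sparse weights, a bias, and a saturating ReLU as in the SpDNN challenge. The paper distributes this SpMM/SpGEMM across ranks with Global Arrays, so this only illustrates the shapes and semantics, and the sizes and bias value are illustrative.

    import numpy as np
    import scipy.sparse as sp

    def spdnn_layer(Y, W, bias, ymax=32.0):
        """Y: (batch, n) sparse activations; W: (n, m) sparse weights."""
        Z = (Y @ W).toarray() + bias                   # sparse-sparse product, then bias
        return sp.csr_matrix(np.clip(Z, 0.0, ymax))    # ReLU capped at ymax

    Y = sp.random(64, 1024, density=0.01, random_state=0, format="csr")
    W = sp.random(1024, 1024, density=0.01, random_state=1, format="csr")
    out = spdnn_layer(Y, W, bias=-0.3)
    print(out.shape, out.nnz)
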
Anonymized Network Sensing Graph Challenge

Hayden R Jananthan, Michael Jones, William Arcand, David Bestor, William Bergeron, Daniel Burrill (MIT Lincoln Laboratory), Aydin Buluc (LBNL), Chansup Byun (MIT Lincoln Laboratory), Timothy Davis (Texas A&M), Vijay Gadepally (MIT Lincoln Laboratory), Daniel Grant (GreyNoise), Michael Houle, Matthew Hubbell, Piotr Luszczek, Peter Michaleas (MIT Lincoln Laboratory), Lauren Milechin (MIT), Chasen Milner, Guillermo Morales (MIT Lincoln Laboratory), Andrew Morris (GreyNoise), Julie Mullen, Ritesh Patel (MIT Lincoln Laboratory), Alex Pentland (MIT), Sandeep Pisharody, Andrew Prout, Albert Reuther, Antonio Rosa, Gabriel Wachman, Charles Yee, Jeremy Kepner (MIT Lincoln Laboratory)
The MIT/IEEE/Amazon GraphChallenge encourages community approaches to developing new solutions for analyzing graphs and sparse data derived from social media, sensor feeds, and scientific data to discover relationships between events as they unfold in the field. The anonymized network sensing Graph Challenge seeks to enable large, open, community-based approaches to protecting networks. Many large-scale networking problems can only be solved with community access to very broad data sets with the highest regard for privacy and strong community buy-in. Such approaches often require community-based data sharing. In the broader networking community (commercial, federal, and academic), anonymized source-to-destination traffic matrices with standard data sharing agreements have emerged as a data product that can meet many of these requirements. This challenge provides an opportunity to highlight novel approaches for optimizing the construction and analysis of anonymized traffic matrices using over 100 billion network packets derived from the largest Internet telescope in the world (CAIDA). This challenge specifies the anonymization, construction, and analysis of these traffic matrices. A GraphBLAS reference implementation is provided, but the use of GraphBLAS is not required in this Graph Challenge. As with prior Graph Challenges, the goal is to provide a well-defined context for demonstrating innovation. Graph Challenge participants are free to select (with accompanying explanation) the Graph Challenge elements that are appropriate for highlighting their innovations.
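
A toy sketch of the traffic-matrix construction described above: pseudonymize source and destination addresses and accumulate packet counts into a hypersparse matrix. The challenge specifies its own anonymization scheme and provides a GraphBLAS reference implementation at far larger scale; the keyed hash, key, addresses, and index space below are purely illustrative.

    import hashlib
    import scipy.sparse as sp

    def anonymize(ip: str, key: bytes, space: int = 2**20) -> int:
        # A keyed hash maps an address to a pseudonymous matrix index (illustrative only).
        digest = hashlib.blake2b(ip.encode(), key=key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % space

    packets = [("192.0.2.1", "198.51.100.7"),
               ("192.0.2.1", "198.51.100.7"),
               ("203.0.113.5", "192.0.2.1")]
    key = b"site-secret"
    rows = [anonymize(src, key) for src, _ in packets]
    cols = [anonymize(dst, key) for _, dst in packets]
    A = sp.coo_matrix(([1] * len(packets), (rows, cols)), shape=(2**20, 2**20)).tocsr()
    print(A.nnz, A.sum())   # 2 distinct (source, destination) pairs, 3 packets in total
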
Extracting TCP/IP Headers at High Speed for the Anonymized Network Traffic Graph Challenge

Zhaoyang Han, Andrew Briasco-Stewart (Northeastern Univ.), Michael Zink (UMass Amherst), Miriam Leeser (Northeastern Univ.)
Field Programmable Gate Arrays (FPGAs) play a significant role in computationally intensive network processing due to their flexibility and efficiency. Particularly with the high-level abstraction of the P4 network programming model, FPGAs show powerful potential for packet processing. By supporting the P4 language with FPGA processing, network researchers can create customized FPGA-based network functions and execute network tasks on accelerators directly connected to the network. A feature of the P4 language is that it is stateless; however, the FPGA implementation in this research requires state information. This is accomplished using P4 externs to describe the stateful portions of the design and to implement them on the FPGA using High-Level Synthesis (HLS). This paper demonstrates using an FPGA-based SmartNIC to efficiently extract source-destination IP address information from network packets and construct anonymized network traffic matrices for further analysis. The implementation is the first example of combining P4 and HLS to develop network functions on the latest AMD FPGAs. Our design achieves a processing rate of approximately 95 Gbps with the combined use of P4 and High-Level Synthesis and is able to keep up with 100 Gbps traffic received directly from the network.
Sans: Streaming Anonymized Network Sensing

Ketai Zhao, Yuhang Zhou, Hong Xu Pan, Zhibin Wang, Sheng Zhong, Chen Tian (Nanjing Univ.)
Large-scale network sensing is an important task with applications in various domains. Recently, researchers have proposed a network sensing algorithm based on GraphBLAS, which divides the input data into multiple disjoint blocks and constructs a graph for each block containing hypersparse network sensing data. However, this block-based approach may miss some anomalies between two consecutive blocks. In this paper, we aim to address this issue by developing a streaming anonymized network sensing system, Sans. Specifically, Sans combines the advantages of directly maintaining edges in a hashtable and of maintaining the vertices, as well as their adjacent edges, in a hashtable of lists to develop a dynamic, efficient, and compressed data structure for hypersparse network sensing data. Furthermore, we develop an incremental calibration algorithm based on gradient descent by leveraging the previous analysis parameters. We also propose a parallel version of the algorithm, which supports shared-memory lock-based and distributed-memory lock-free designs. We conduct extensive experiments to evaluate the performance of the proposed streaming network sensing algorithm. The results demonstrate that Sans outperforms the static CSR (GraphBLAS) approach by one million times.
Invited Talk: Lessons Learned from Implementing the Anonymized Network Sensing Graph Challenge with GPUs and Commodity Software

Siddharth Samsi, Dan Campbell, Emanuel Scoullos, and Oded Green (NVIDIA)

IEEE HPEC 2024