28th Annual
IEEE High Performance Extreme Computing Virtual Conference
23 - 27 September 2024

All times are EDT (UTC/GMT -04 hours)

Speaker/Presenting Author in Italics

 

Wednesday, September 25

 

3-K: Keynote Session (10:30-11:00)

Co-Chairs: J. Kepner & A. Reuther

 

Keynote Talk: The Building Blocks of Cloud – Research Enablement

Scott Yockel (Harvard Univ.)

3-1: AI / Machine Learning 1 Session (11:00-12:15)

Co-Chairs: P. Luszczek & TBD

 

ModelGauge: Inference Profiling of Deep-Learning Models [Outstanding Paper Award]

Calvin B Gealy (Univ. of Pittsburgh), David Langerman (NSF SHREC), Alan George (NSF Center for High Performance Reconfigurable Computing)
Identifying trends in on-device performance between different deep-learning models is often challenging given the variety of models published and the different devices used in deployment. ModelGauge is a proposed solution that reports the latency, memory, and bandwidth behavior of many different inference configurations. Utilizing ONNX Runtime for CPUs and NVIDIA TensorRT for GPUs, as well as the standardized ONNX model format for model definitions, ModelGauge can easily profile many architectures, giving deployment engineers easy access to statistics about inference performance. To demonstrate the utility of the tool, we compare 32 different ONNX model definitions and their on-device scaling behavior on an ARM Cortex-A76 embedded CPU, AMD EPYC 9374F server CPU, NVIDIA Jetson Orin Nano embedded GPU, and NVIDIA A100 server GPU. For this study, Pearson correlation is used to show the linear relationship between a metric and a device measurement, characterizing behavior and demonstrating the utility of bulk data collection. When comparing the number of floating-point operations in a model to the latency for single-image batch inference, the Pearson correlation is highest on the ARM Cortex-A76 at 0.990 and lowest on the highly parallel NVIDIA A100 at 0.388. Across all devices and models tested, the linear trend and Pearson correlation between the number of parameters in a model and the memory usage is consistently greater than 0.9. Additionally, we propose a new metric, derived from the data translation lookaside buffer load miss count and the device latency, that helps identify models that do not make significant use of the device. Overall, ModelGauge is useful for gathering statistics about a variety of models across compute platforms at many scales.
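The correlation analysis at the core of the study can be pictured with a short sketch. The snippet below (with made-up FLOP counts and latencies, not data from the paper) shows how a Pearson correlation between per-model compute cost and measured single-image latency is computed with SciPy; a value near 1.0 indicates latency scales almost linearly with FLOPs, as reported for the Cortex-A76, while a lower value suggests the device is far from saturated, as on the A100.

```python
# Illustrative only: Pearson correlation between model FLOPs and measured
# latency on one device. The numbers below are invented for the example.
import numpy as np
from scipy.stats import pearsonr

gflops = np.array([0.6, 1.8, 4.1, 7.8, 11.6, 15.5])        # per-model compute cost
latency_ms = np.array([3.2, 7.9, 16.5, 30.1, 44.8, 61.2])  # single-image latency (ms)

r, p_value = pearsonr(gflops, latency_ms)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```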
Enhanced Knowledge Graph Attention Networks for Efficient Graph Learning [Outstanding Student Paper Award]

Fernando P Vera Buschmann, Zhihui Du, David A Bader (New Jersey Inst. of Tech.)
This paper presents an innovative design for Enhanced Knowledge Graph Attention Networks (EKGAT), which focuses on improving representation learning to analyze more complex relationships of graph-structured data. By integrating TransformerConv layers, the proposed EKGAT model excels in capturing complex node relationships compared to traditional KGAT models. Additionally, our EKGAT model integrates disentanglement learning techniques to segment entity representations into independent components, thereby capturing various semantic aspects more effectively. Comprehensive experiments on the Cora, PubMed, and Amazon datasets reveal substantial improvements in node classification accuracy and convergence speed. The incorporation of TransformerConv layers significantly accelerates the convergence of the training loss function while either maintaining or enhancing accuracy, which is particularly advantageous for large-scale, real-time applications. Results from t-SNE and PCA analyses vividly illustrate the superior embedding separability achieved by our model, underscoring its enhanced representation capabilities. These findings highlight the potential of EKGAT to advance graph analytics and network science, providing robust, scalable solutions for a wide range of applications, from recommendation systems and social network analysis to biomedical data interpretation and real-time big data processing.
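For readers unfamiliar with the TransformerConv layer the abstract builds on, the sketch below shows a minimal two-layer node classifier on Cora using PyTorch Geometric. It is not the authors' EKGAT model; the architecture, hyperparameters, and training loop are illustrative assumptions, and the paper's disentanglement-learning component is omitted.

```python
# Minimal node classifier using TransformerConv (PyTorch Geometric).
# This is NOT the EKGAT implementation; it only illustrates the layer type.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import TransformerConv

class TransformerGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, heads=4):
        super().__init__()
        self.conv1 = TransformerConv(in_dim, hidden_dim, heads=heads)
        self.conv2 = TransformerConv(hidden_dim * heads, num_classes, heads=1)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

dataset = Planetoid(root="data/Cora", name="Cora")   # Cora, one of the evaluated datasets
data = dataset[0]
model = TransformerGNN(dataset.num_features, 64, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)

model.train()
for epoch in range(50):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```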
Mobile-Optimized Vessel Segmentation for Ultrasound-Guided Surgical Procedures

Mateusz Wolak, Fin Amin, Nancy DeLosa, Brian A Telfer, Lars Gjesteby (MIT Lincoln Laboratory)
Non-compressible torso hemorrhage is the leading cause of potentially survivable fatalities in civilian and battlefield trauma. An insufficient number of trauma surgeons are expected to be available in future large-scale combat operations and natural disasters, creating a need for assistive technology to enable fast and accurate vascular access in pre-hospital environments. AI-GUIDE is a handheld surgical tool designed for emergency medical operations; the prototype combines an ultrasound probe with real-time image processing software, which controls robotic needle insertion components. The goal of this work is to present an investigation of optimizations for mobile inference of the AI algorithm used in the AI-GUIDE prototype. The key to these optimizations is the use of mobile-optimized neural network models, quantization, and the exploitation of a software development kit to take advantage of hardware acceleration. We compare the tradeoff between speed and accuracy for different runtime targets, quantization methods, and model sizes for two resource-conscious neural networks. We run our experiments on a smartphone as a reliable proxy for performance on the AI-GUIDE.
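As a rough illustration of the quantize-then-benchmark workflow the abstract refers to, the sketch below applies PyTorch post-training dynamic quantization to a toy fully connected network and times inference. This is an assumption-laden stand-in: the study itself targets mobile runtimes, vendor SDKs, and convolutional segmentation models, none of which are reproduced here.

```python
# Illustrative only: dynamic quantization plus a crude CPU latency comparison.
# The real study benchmarks mobile-optimized segmentation networks on a phone.
import time
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 64))
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

def bench(model, runs=200):
    x = torch.randn(1, 512)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1e3  # ms per inference

print(f"fp32: {bench(model_fp32):.3f} ms   int8: {bench(model_int8):.3f} ms")
```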
GLITCHES: GPU-FPGA LLM Inference Through a Collaborative Heterogeneous System

Fan Yang (Tsinghua Univ., SenseTime Inc.), Xinhao Yang, Hongyi Wang, Zehao Wang, Zhenhua Zhu, Shulin Zeng, Yu Wang (Tsinghua Univ.)
Large language models (LLMs) demonstrate strong capabilities across various tasks. However, in latency-sensitive scenarios, a small batch or even a single batch is usually required. As a result, the prefill stage of LLM inference becomes compute-bound while the decode stage becomes memory-bound. It is therefore difficult for a homogeneous FPGA or GPU system to simultaneously address the different computational bottlenecks in the different stages of LLM inference, resulting in long prefill latency on FPGAs and low utilization during the decode stage on GPUs. This paper proposes GLITCHES, GPU-FPGA LLM inference through a collaborative heterogeneous system. We analyze the different characteristics of GPUs and FPGAs and employ GPUs for the prefill stage and FPGAs for the decode stage, leveraging the strengths of each. Based on HBM profiling results, we apply data prefetching to further improve off-chip memory bandwidth utilization during the decode computations on FPGAs. Experiments demonstrate that a GLITCHES heterogeneous LLM inference system with an A100 GPU and seven U280 FPGAs achieves a 1.28/1.34 times improvement in system throughput and a 2.38/1.90 times improvement in cost efficiency compared to a homogeneous system with 8-card A100/V100S GPUs.
Graphical Learning Optimization and Dimensionality Reduction with Geometric Multi-Resolution Analysis

Felicia Schenkelberg, Allison I Gunby-Mann, Emma Graham (Dartmouth Coll.), Shuoxuan Li (Carnegie Mellon Univ.), Peter Chin (Dartmouth Coll.)
This paper employs Geometric Multi-Resolution Analysis (GMRA) as a technique for dimensionality reduction and explores its impact on high-dimensional graphical learning tasks. The burgeoning surge in data collection practices, driven by technological advancements across diverse domains, has resulted in an influx of datasets wherein the number of features significantly exceeds the number of observations—a paradigm characteristic of high-dimensional datasets. Analyzing such high-dimensional datasets presents immediate challenges owing to the intricacies of dataset complexity as well as the wealth of information encapsulated within each data point. GMRA exploits redundant representations in such high-dimensional datasets, embedding the high-dimensional data into an intrinsic, underlying lower-dimensional structure. This process aims to preserve essential features while reducing dimensionality and facilitate analysis by mitigating the computational complexities associated with analyzing high-dimensional datasets. This paper proposes a novel application of Geometric Multi-Resolution Analysis to dimensionality reduction in graph embeddings. Empirically, its efficacy is validated by its performance in computing the intrinsic, underlying lower-dimensional structure for a comprehensive set of graph learning tasks, including node classification, edge classification, link prediction, anomaly detection, and graph clustering.

3-P1 (12:15-13:15): Poster Session 3-1

Chair(s)/Host(s): TBD

 

CLIP-Embed-KD: Computationally Efficient Knowledge Distillation Using Embeddings as Teachers [Outstanding Short Paper Award]

Lakshmi V Nair (Lightmatter)
This extended abstract investigates the application of Contrastive Language-Image Pre-training (CLIP) to efficient knowledge distillation by utilizing embeddings as teachers. Typical knowledge distillation frameworks require running forward passes through a teacher model, which is often prohibitive in the case of billion- or trillion-parameter teachers. Our initial findings show that using only the embeddings of the teacher models to guide distillation can outperform full-scale knowledge distillation while using 9x less memory and 8x less training time.
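A minimal sketch of the embeddings-as-teachers idea is given below, assuming precomputed teacher (e.g., CLIP image) embeddings are available offline. The projection head, cosine objective, and loss weighting are illustrative assumptions, not the authors' exact recipe; the point is simply that no teacher forward pass is needed during training.

```python
# Hypothetical embedding-based distillation step: match student features to
# cached teacher embeddings instead of running the teacher model.
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
classifier = nn.Linear(256, 10)
project = nn.Linear(256, 512)          # map student features to teacher embedding size
params = list(student.parameters()) + list(classifier.parameters()) + list(project.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

images = torch.randn(16, 3, 32, 32)    # toy batch
labels = torch.randint(0, 10, (16,))
teacher_emb = torch.randn(16, 512)     # would be precomputed once, offline

feats = student(images)
ce_loss = F.cross_entropy(classifier(feats), labels)
distill_loss = 1 - F.cosine_similarity(project(feats), teacher_emb, dim=-1).mean()
loss = ce_loss + 0.5 * distill_loss    # weighting is arbitrary in this sketch
loss.backward()
opt.step()
```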
Efficiency of Data Intensive Computing (DIC) in MEMS Research for Data Processing and Analysis

Yeligay Segizbay (Nazarbayev Univ.)
This paper examines how Data Intensive Computing (DIC) can significantly enhance MEMS research by enabling the efficient processing and analysis of large datasets. The paper analyzes a MEMS dataset using DIC techniques such as parallel computing. In particular, it analyzes an open-source dataset on the frequency response of high-frequency MEMS in terms of the quality factor (Q-factor) and resonance frequencies. The results show that the Q-factor generally increases with frequency, especially at higher frequencies.
Index Terms: quality factor, resonance frequencies, MEMS, parallel computing
Capturing the Carbon Impact of Deep Learning

Alexis Corona, Sanmukh Kuppannagari (Case Western Reserve Univ.)
Modern deep learning training is extremely carbon and energy intensive. Existing tools to evaluate carbon and energy consumption are either too coarse-grained, making them inaccurate, or too sophisticated, making them inaccessible. This work presents a framework that enables users to capture accurate carbon and energy consumption measurements for their deep learning training runs.
Transfer Learning Assisted Parameter Selection for Water-Fat Separation in Dixon MRI

Alan Okinaka (Ursinus College), Gulfam A Saju, Yuchou Chang (UMass Dartmouth)
The Dixon method is a clinical Magnetic Resonance Imaging (MRI) approach employed to differentiate and separate water and fat signals, and it plays a crucial role in various clinical applications. The efficiency of this method is largely influenced by the optimal selection of parameters such as Echo Time (TE) and Echo Spacing. However, acquiring these optimal parameters can be challenging due to the limited availability of training datasets and the complexity of manual selection. This study proposes a novel parameter selection method using transfer learning on simulated images to address these challenges. Under the framework of transfer learning, we leverage models pre-trained on one task as a starting point for a related task. This approach helps identify optimal TE and echo spacing parameters and thus aids in optimizing Dixon technique parameters. Our proposed method customizes these models to the specific task of differentiating water-only and fat-only images. Experimental results reveal that these pre-trained models can successfully classify the simulated images, thereby providing promising implications for enhancing the performance of the Dixon method in MRI.
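The transfer-learning pattern described above can be sketched as follows: a backbone pretrained on a generic task is frozen and only a small head is retrained for the new task (here, a placeholder two-class water-only vs. fat-only decision on random tensors standing in for simulated images). The backbone choice and all hyperparameters are assumptions for illustration, not the authors' setup.

```python
# Illustrative transfer learning: frozen pretrained backbone, new 2-class head.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # downloads weights
for p in backbone.parameters():
    p.requires_grad = False                              # freeze pretrained features
backbone.fc = nn.Linear(backbone.fc.in_features, 2)      # water-only vs. fat-only head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 224, 224)        # stand-in for simulated Dixon images
y = torch.randint(0, 2, (8,))
loss = criterion(backbone(x), y)
loss.backward()
optimizer.step()
```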
Traditional Costume Image Classification for Indian States Using Deep Learning

Sahana R Koti, Sahana Channappa Jatti, Anupama S Nandeppanavar, Medha Kudari (KLE Institute of Technology)
Traditional Costume Image Classification for Indian States Using Deep Learning is an emerging field in computer vision that aims to identify and categorize traditional attires from various states based on their visual features. This research addresses the development of a robust image classification model to recognize and classify costumes from different Indian states. The project leverages a deep learning approach, utilizing convolutional neural networks (CNNs) to process and analyze costume images. The primary dataset comprises images of traditional costumes from Indian states, ensuring diversity in attire styles, colors, patterns, and cultural representations. Preprocessing steps include image resizing, normalization, and augmentation to enhance model generalization. To evaluate the model’s performance, standard metrics such as accuracy, precision, recall, and F1-score are employed. The results demonstrate the model’s capability to achieve high classification accuracy, with notable precision in distinguishing between similar attires from neighboring states. The model is trained using different deep learning algorithms, comparing their accuracies and predictions. The initial analysis was made on the raw data; model performance improved with normalization, yielding accuracies of 95%, 98%, and 100% for VGG16, MobileNetV2 and ResNet50V2, and DenseNet121, respectively.
Scalable Approach for Analytic Polynomial Subspace Projection Matrices for a Space-Time Covariance Matrix

Faizan Ahmad Khattak, Mohammed Bakhit, Ian K. Proudler, Stephan Weiss (Univ. of Strathclyde)
In sensor array applications, it can be advantageous to project data onto a given signal subspace, for example, to improve the SNR or as part of direction finding algorithms. In the broadband case, a projection operator can be derived via polynomial matrices and, more specifically, from a space-time covariance matrix. Traditional methods perform a complete polynomial eigenvalue decomposition (PEVD) to achieve this, which can be computationally intensive. We propose a novel method to compute these subspace matrices directly, without the need for a full PEVD. Our approach is evaluated against existing methods using an ensemble of randomized para-Hermitian matrices, demonstrating significant improvements in both accuracy and computation time.

Tutorial Session: 3-T (12:15-15:45): Spiral Tutorial

Organizer(s): F. Franchetti and M. Franusich

 

 

3-2: Scaling Research Computing Education Session (12:30-13:45)

Co-Chairs: J. Mullen, L. Milechin & H. Jananthan

 

Invited Talk: Scaling Project-based Learning from Education to Research

Joel Grimm (MIT Lincoln Laboratory)
Invited Talk: Educational Game Dev from Start to Finish: A Short Example

Chasen Milner (USAF)
Invited Talk: HPC-ED: A Federated Catalog to Share and Discover CyberTraining Materials

Susan Mehringer (Cornell Center for Advanced Computing)
Invited Talk: The Wide Area Classroom – 10 Years On

John Urbanic (CMU and Pittsburgh Supercomputing Center)

3-P2 (13:45-14:45): Poster Session 3-2

Chair(s)/Host(s): K. Cain

 

Gesture Controlled System to Automate Shutdown, Screenshot and Volume Toggle

Prisha Bhosale, Ananya Dandekar, Ria Dcosta, Sri Aishwarya Jonnavittula, Shagufta Rajguru (Fr. Conceicao Rodrigues Institute of Technology)
Across 33 rich countries, only 5% of the population has high computer-related abilities, and only one third of people can complete medium-complexity tasks. Research shows that when people are divided into four groups based on their ability to use technology, 14% of the population still falls into the ‘below level 1’ category, i.e., they can perform at most one simple task on a computer. Keeping these users in mind, along with people with impairments, there is a need to automate and simplify certain tasks to ensure steady growth in the number of efficient users of technology. A gesture-based system that automates mundane and repetitive tasks is a step in the right direction toward a better-trained and better-equipped technological environment, and it also saves time in today’s fast-paced world. In this paper, a prototype for a gesture-enabled shutdown is proposed. The proposed system is designed to serve as a framework on which future systems can build, ensuring ease of use for every user.
Machine Learning Application for Smart Network Traffic Prediction

Islam Omar (New Mexico State Univ.), Whit Schonbein (SNL), Hameed Badawy (New Mexico State Univ.)
In this work, we take an initial step toward a comprehensive study of the application of machine learning models to network traffic prediction. Using real-world traffic data collected at Sandia National Labs for several scientific applications, we investigate the features of these datasets and how different preprocessing methods affect the behavior of different machine learning models. Specifically, we evaluate the performance of several machine learning models, including Linear Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and XGBoostRegressor (XGBR), against the AutoRegressive Integrated Moving Average (ARIMA) model. The models are trained and tested on both original and normalized versions of the network traffic dataset to understand the impact of data normalization on predictive accuracy and how the performance of the machine learning models varies based on the nature of the problem and the provided data. Our findings demonstrate that machine learning models are generally a good approach for forecasting network traffic, and that some models outperform others. The RF, XGBR, and ARIMA models outperformed all other machine learning models used in this work in terms of predictive accuracy and robustness. Furthermore, normalization of the dataset improves the performance of most machine learning models, particularly the RF and XGBR models, which show enhanced predictive capability. The comparative analysis highlights the strengths and limitations of each model and offers insights into best practices for network traffic prediction using machine learning. This study underscores the importance of choosing appropriate data preprocessing techniques and model selection to achieve optimal network traffic prediction performance.
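The raw-versus-normalized comparison at the heart of the study follows a simple pattern, sketched below on synthetic lag-feature data (the Sandia traffic traces are not reproduced here). The random-forest model, feature construction, and error metric are illustrative assumptions only.

```python
# Illustrative only: train the same model on raw and normalized features and
# compare error, using a synthetic traffic-like series with lag features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
t = np.arange(2000)
traffic = 50 + 10 * np.sin(2 * np.pi * t / 100) + rng.normal(0, 2, t.size)

lags = 5                                              # predict the next value from 5 lags
X = np.column_stack([traffic[i : traffic.size - lags + i] for i in range(lags)])
y = traffic[lags:]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False, test_size=0.2)

raw = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("raw MAE:", mean_absolute_error(y_te, raw.predict(X_te)))

scaler = StandardScaler().fit(X_tr)
norm = RandomForestRegressor(n_estimators=100, random_state=0).fit(scaler.transform(X_tr), y_tr)
print("normalized MAE:", mean_absolute_error(y_te, norm.predict(scaler.transform(X_te))))
```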
Model to Predict Inventory Demand in Retail SMEs Using CRISP-DM and Machine Learning

Jhomax R Torres, Diego Moises Carpio Andia, Victor Parasi (Univ. Peruana de Ciencias Aplicadas)
This study addresses efficient inventory management, a critical concern for small and medium-sized enterprises (SMEs) in the retail sector that affects their operational efficiency, cost management, and competitiveness. Despite its global prevalence, many SMEs lack efficient solutions that take advantage of available technology and information. The objective of this study is to train machine learning models to predict inventory demand in SMEs, addressing their unique challenges and limitations. The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology is employed to develop the model using four machine learning algorithms: Random Forest (RF), Long Short-Term Memory (LSTM), Extreme Gradient Boosting (XGBoost), and Decision Tree (DT). The methodology consists of five phases: business understanding, data understanding, data preparation, modeling, and evaluation. For the training phase, cross-validation was used on a dataset of 16,071 records collected from July 25, 2023 to March 29, 2024 from a Peruvian SME, considering a total of 14 variables. The results highlight XGBoost as the algorithm that best fit our data, with an R² of 0.82.
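The modeling and evaluation phases can be pictured with a short sketch: cross-validated XGBoost regression scored with R², as in the study. The 14 synthetic features and the target below are placeholders for the SME's sales records, and the hyperparameters are illustrative assumptions.

```python
# Illustrative only: cross-validated XGBoost regression reported as R².
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1600, 14))                              # 14 placeholder variables
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, 1600)    # synthetic demand target

model = XGBRegressor(n_estimators=300, max_depth=5, learning_rate=0.05)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean R^2 across folds: {scores.mean():.2f}")
```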
Determination of Game-based Design Equilibria by Using Machine Learning Approach

Sara Karimi, Ehsan Ghotbi (Alfred Univ.)
This paper presents a comprehensive study on the application of artificial intelligence and machine learning to enhance efficiency and precision in game-based design problems, with a specific focus on pressure vessels, bilevel problems with three followers, and speed reducers. It proposes an AI-enhanced machine learning framework to solve numerically complex engineering design optimization problems beyond the reach of traditional methods. This approach is demonstrated through three example problems, each viewed as a game with a set of players, presenting unique challenges and design requirements. The process begins by developing datasets from specific problem intervals and features, using MATLAB tools to obtain optimized solutions. These optimized results then serve as training data for a neural network designed to predict the rational reaction sets of the players involved in the design process, thereby facilitating more informed and accurate decision-making. By integrating advanced machine learning techniques and formulating problems through game theory, this approach not only streamlines the computational process but also significantly improves the reliability and adaptability of engineering solutions. This research highlights the transformative impact of machine learning in game-based design, offering adaptive, efficient, and robust design optimization that opens a new era in the field.
The Analysis of the Sparse Multi-GPU Parallel Method on the Large Sparse Power Flow Calculation

Lei Zeng, Shadi Alawneh (Oakland Univ.)
Power Flow (PF) calculation is used to analyze power systems efficiently, with the aim of obtaining voltage magnitudes and phase angles, especially as multiple energy supplies are developed. Due to the high sparsity and increasing complexity of power systems, solving large-scale sparse systems in a parallelized fashion has become a bottleneck for power engineers. To address this challenge, this paper proposes a sparse multi-GPU Fast Decoupled (FD) method to accelerate the PF calculation. Specifically, data parallelism is designed to enhance scalability and maintain load balancing across multiple GPUs. Additionally, GPUDirect technology is employed to reduce the communication overhead between multiple GPUs. As a result, the proposed method achieves a nearly 4x speedup on a large-scale power system with over 10,000 buses, compared to the optimized MATLAB-based library MATPOWER.

3-3: AI / Machine Learning 2 Session (14:15-15:30)

Co-Chairs: TBD & TBD

 

A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning

Yuesheng Xu, Arielle Carr (Lehigh Univ.)
The increasing complexity of deep learning models and the demand for processing vast amounts of data make the utilization of large-scale distributed systems for efficient training essential. These systems, however, face significant challenges such as communication overhead, hardware limitations, and node failure. This paper investigates various optimization techniques in distributed deep learning, including Elastic Averaging SGD (EASGD) and the second-order method AdaHessian. We propose a dynamic weighting strategy to mitigate the problem of straggler nodes due to failure, enhancing the performance and efficiency of the overall training process. We conduct experiments with different numbers of workers and communication periods to demonstrate improved convergence rates and test performance using our strategy.
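One way to picture a dynamic weighting strategy (an illustrative sketch, not necessarily the paper's exact rule) is to weight each worker's parameters by how recently it reported, so that failed or straggling workers contribute less to the averaged model.

```python
# Hypothetical dynamic weighting: exponentially down-weight stale workers
# when averaging their parameter vectors.
import numpy as np

def dynamic_average(worker_params, staleness, tau=2.0):
    """worker_params: list of parameter vectors; staleness: steps since last update."""
    weights = np.exp(-np.asarray(staleness, dtype=float) / tau)
    weights /= weights.sum()
    return sum(w * p for w, p in zip(weights, worker_params))

params = [np.array([1.0, 2.0]), np.array([1.1, 2.1]), np.array([5.0, 9.0])]
staleness = [0, 1, 20]                     # the third worker has effectively failed
print(dynamic_average(params, staleness))  # dominated by the two healthy workers
```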
P-YOLOv8: Efficient and Accurate Real-Time Detection of Distracted Driving

Mohamed R. Elshamy (New Mexico State Univ.), Heba Emara (Pyramids High Institute of Electronic Engineering), Mohamed R. Shoaib (Nanyang Tech. Univ.), Hameed Badawy (New Mexico State Univ.)
Distracted driving is a critical safety issue that leads to numerous fatalities and injuries worldwide. This study addresses the urgent need for efficient and real-time machine learning models to detect distracted driving behaviors. Leveraging the Pretrained-YOLOv8 (P-YOLOv8) model, a real-time object detection system is introduced, optimized for both speed and accuracy. This approach addresses the computational constraints and latency limitations commonly associated with conventional detection models. The study demonstrates P-YOLOv8’s versatility in both object detection and image classification tasks using the Distracted Driver Detection dataset from State Farm, which includes 22,424 images across ten behavior categories. Our research explores the application of P-YOLOv8 for image classification, evaluating its performance compared to deep learning models such as VGG16, VGG19, and ResNet. Some traditional models often struggle with low accuracy, while others achieve high accuracy but come with high computational costs and slow detection speeds, making them unsuitable for real-time applications. P-YOLOv8 addresses these issues by achieving competitive accuracy with significant computational cost and efficiency advantages. In particular, P-YOLOv8 generates a lightweight model with a size of only 2.84 MB and a lower number of parameters, totaling 1,451,098, due to its innovative architecture. It achieves a high accuracy of 99.46% with this small model size, opening new directions for deployment on inexpensive and small embedded devices using Tiny Machine Learning (TinyML). The experimental results show robust performance, making P-YOLOv8 a cost-effective solution for real-time deployment. This study provides a detailed analysis of P-YOLOv8’s architecture, training, and performance benchmarks, highlighting its potential for real-time use in detecting distracted driving.
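For orientation, the sketch below shows how a pretrained YOLOv8 classification model can be fine-tuned with the Ultralytics API. The dataset path, image size, and epoch count are placeholders (a folder-per-class image dataset is assumed); this is not the training configuration used in the paper.

```python
# Illustrative fine-tuning of a YOLOv8 classification model (Ultralytics API).
from ultralytics import YOLO

model = YOLO("yolov8n-cls.pt")                 # pretrained nano classification model
model.train(data="distracted_driver_dataset",  # placeholder: folder-per-class dataset
            epochs=20, imgsz=224)
metrics = model.val()                          # accuracy on the validation split
results = model("example_frame.jpg")           # classify a single image (placeholder file)
```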
Spike-driven YOLO: Ultra Low-Power Object Detection with Neuromorphic Computing

Mark Barnell, Courtney Raymond, Lisa Loomis (AFRL), Francesca Vidal, Daniel Brown, Darrek Isereau (SRC)
The latest Intel neuromorphic processor, Loihi 2, provides a breakthrough in Artificial Intelligence (AI) for computing at the edge, where sensor information is collected. The computing architecture does this by leveraging computations at the transistor level in a fashion analogous to the human brain’s biological neural networks (vs. a Von Neumann compute architecture). Loihi 2’s high performance, small form factor, and low power consumption make it well suited for a wide range of real-time deep learning applications such as target classification, object detection, and more. Our technical approach and findings support extreme computing needs for the internet of things (IoT) and various ground and airborne platforms’ applications. The recently released Loihi 2 ecosystem and the thorough research study completed on this effort were combined to accelerate the development, optimization, and demonstration of a new concept of operation for machine learning at the edge. This 2024 research included training and testing Spike-driven YOLO models on data from various sensors. Our concept uses representative sensor data to detect and classify targets of interest through a combination of image processing techniques and machine learning. Importantly, our technical approach allowed us to rapidly train and evaluate the performance of several models for benchmarking against current state-of-the-art algorithms, with mean average precision above 93% in some cases. The use of Intel’s latest Lava framework demonstrates the art of the possible in edge computing, showing capabilities on several sensor platforms with wide extensibility to other domains that can use this neuromorphic-computing hardware. In summary, this research included the use of new computing frameworks, processing algorithms, and a unique concept of operation.
Exploring sparse inference with SuiteSparse:GraphBLAS

Deepak Suresh, Timothy A Davis (Texas A&M Univ.)
As AI models grow in size and complexity, their resource demands grow as well. Sparse inference is an approach that takes advantage of sparsity in large AI models to make them run faster, thereby lowering resource requirements. SuiteSparse:GraphBLAS, an open-source implementation of GraphBLAS, offers a robust framework tailored for leveraging sparsity in matrix computations. This work explores the application of SuiteSparse:GraphBLAS to inference tasks, focusing in particular on the language model BERT. By showcasing the capabilities of SuiteSparse:GraphBLAS in handling sparse computations and explaining its underlying formulation using semirings, this work explores its application to improving inference efficiency. This work implements a complete inference pipeline using SuiteSparse:GraphBLAS, compares its performance with traditional frameworks such as PyTorch, and identifies areas for improvement. Through this investigation, the study aims to highlight the strengths and limitations of SuiteSparse:GraphBLAS in AI computations.
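The payoff of sparse inference can be illustrated with a small sketch: when most weights are zero and the matrix is stored sparsely, a layer's matrix-vector product only touches the stored nonzeros. The example below uses SciPy purely as an analogy; the work above uses SuiteSparse:GraphBLAS and its semiring formulation, whose API is not shown here.

```python
# Conceptual illustration of sparse inference (SciPy analogy, not GraphBLAS).
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
dense_W = rng.normal(size=(4096, 4096))
dense_W[rng.random(dense_W.shape) < 0.9] = 0.0   # prune 90% of the weights
sparse_W = sp.csr_matrix(dense_W)                # store only the ~10% nonzeros

x = rng.normal(size=4096)
y = np.maximum(sparse_W @ x, 0.0)                # sparse matvec + ReLU; work scales with nnz
print(f"stored nonzeros: {sparse_W.nnz} of {dense_W.size} entries")
```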
Improving Regression in Spiking Neural Networks for Oceanographic Data Analysis

Alissa Kane, Yuchou Chang (UMass Dartmouth)
Spiking neural networks (SNNs), as third-generation neural networks, can operate in an energy-efficient mode, unlike second-generation neural networks, which consume considerable energy and power. SNNs are suitable for oceanographic data analysis on underwater edge devices, since such devices generally have a constrained power supply and limited communication bandwidth in underwater environments. Although SNNs have been widely used in classification tasks, SNN-based regression tasks are less studied because SNNs are generally considered to process discrete and sequential spikes. The existing regression model based on the membrane potential of the Leaky Integrate-and-Fire (LIF) neuron uses constant settings, a mechanism that may not be adaptive or capable enough to analyze highly complicated and dynamic oceanographic data. In this paper, we propose three novel regression models, based on Adaptive Threshold Adjustment, Heterogeneous Neurons, and Nonlinear Integration, to improve the existing LIF-based model. Experimental results on real oceanographic data indicate that the proposed regression models outperform the existing model in both qualitative and quantitative analysis. These SNN regression models could be implemented on edge devices in underwater environments in the future.
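To make the LIF mechanism concrete, the sketch below simulates a single leaky integrate-and-fire neuron with an adaptive threshold, the kind of behavior the abstract contrasts with a fixed-threshold LIF. All constants are arbitrary illustrative values, not the paper's models or parameters.

```python
# Minimal LIF neuron with an adaptive firing threshold (illustrative values).
import numpy as np

rng = np.random.default_rng(1)
T, beta = 200, 0.9                     # time steps, membrane decay factor
v, theta, theta_rest, adapt = 0.0, 1.0, 1.0, 0.05
current = 0.3 + 0.2 * rng.random(T)    # noisy input current
spikes = []

for t in range(T):
    v = beta * v + current[t]          # leaky integration of the input
    if v >= theta:                     # spike when the potential crosses the threshold
        spikes.append(t)
        v = 0.0                        # reset the membrane potential
        theta += adapt                 # adaptive threshold: harder to spike next time
    else:
        theta += 0.01 * (theta_rest - theta)  # threshold relaxes back toward rest

print(f"{len(spikes)} spikes over {T} steps; final threshold {theta:.2f}")
```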

3-4: General Purpose GPU Computing 1 Session (15:45-17:30)

Co-Chairs: S. Gottlieb & N. Prajapati

 

Benchmarking Thread Block Cluster [Outstanding Paper Award]

Tim Lühnen, Tobias Marschner, Sohan Lal (TU Hamburg)
Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to their CUDA programming model: the thread block cluster (TBC). This feature allows thread blocks to be grouped together, facilitating direct communication and synchronization between them. For this communication link a dedicated network was added into the hardware, which connects the streaming multiprocessors (SMs). This paper delves into the performance characteristics of this new feature, specifically examining the latencies developers can anticipate when utilizing the direct communication channel provided by TBCs. We present an analysis of the SM-to-SM network behavior, which is crucial for developing accurate analytical and cycle-accurate simulation models. Our study includes a comprehensive evaluation of the impact of TBCs on application performance, highlighting scenarios where this feature can lead to significant improvements. Applications where a data-producing thread block writes data directly into the shared memory of the consuming thread block can be up to 2.3× faster than using global memory for data transfer. Additionally, applications constrained by shared memory can achieve up to a 2.1× speedup by employing thread block clusters. Our findings also reveal that utilizing large cluster dimensions can result in an execution time overhead exceeding 20%. By exploring the intricacies of the Hopper architecture and its new TBC feature, this paper equips developers with the knowledge needed to harness the full potential of modern GPUs and assists researchers in developing accurate analytical and cycle-accurate simulation models.
Understanding the Efficacy of Power Profiles: A Case Study of AMD Instinct MI100 GPU

Ghazanfar Ali, Mert Side (Texas Tech Univ.), Sridutt Bhalachandra (Univ. of North Carolina), Tommy Dang, Alan Sill, Yong Chen (Texas Tech Univ.)
Graphics Processing Units (GPUs) have become pivotal for modern high-performance computing (HPC) and artificial intelligence workloads due to their substantial computational prowess. However, this computational prowess comes at a cost, as GPUs consume vast amounts of power, presenting a challenge for high-end computing systems, including those aimed at achieving exascale computing capabilities. In response to the power efficiency problem, modern GPUs typically offer the ability to adjust clock frequencies and cap power consumption. However, the AMD Instinct MI100 takes a unique approach by introducing a set of predefined power profiles that internally manipulate clock frequencies to manage power. This study evaluates the effectiveness of these power profiles through a comparative analysis of various power and performance metrics. It indicates that, for most of the selected workloads and during significant portions of their execution, the GPU consumes power exceeding its specified Thermal Design Power (TDP). For instance, the GROMACS workload exceeded its TDP by one-third during almost half of its execution time. Furthermore, the study notes a significant increase in temperature reaching as high as 80°C. Moreover, DGEMM and STREAM workloads exhibit similar power consumption patterns, suggesting that the underlying power management scheme does not adapt power allocation based on the computational intensity of the workload. Thus, the study demonstrates that changing the power profile does not significantly impact crucial metrics such as performance, clock frequency, voltage, GPU utilization, or temperature. In summary, this research sheds light on the power dynamics of the AMD Instinct MI100 GPU, emphasizing the challenges associated with power, performance, and thermal management in HPC environments. The findings underscore the importance of fine-tuning power management strategies to enhance energy efficiency while maintaining optimal performance in GPUs.
Community Detection for Large Graphs on GPUs with Unified Memory

Emre Dinçer, Işıl Öz (Izmir Institute of Technology)
While GPUs accelerate applications from different domains with different characteristics, processing large datasets becomes infeasible on target systems with limited device memory. Unified memory support makes it possible to work with data larger than the available GPU memory. However, the page migration overhead for executions with irregular memory access patterns, such as graph processing workloads, induces severe performance degradation. While memory hints help to deal with page movements by keeping data in suitable memory spaces, coarse-grained configurations still cannot avoid migrations for executions with diverse data structures. In this work, we target the state-of-the-art CUDA implementation of the Louvain community detection algorithm and evaluate the impact of fine-grained unified memory hints on performance. Our experimental evaluation shows that memory hints configured for specific data structures yield significant performance improvements and enable us to work efficiently with large graphs.
Invited Talk: From Simple to Hyper Co-Design of HPC Platforms

Gary Grider (Los Alamos National Laboratory)
Invited Talk: Lessons Learned from Implementing the Anonymized Network Sensing Graph Challenge with GPUs and Commodity Software

Siddharth Samsi, Dan Campbell, Emanuel Scoullos, and Oded Green (NVIDIA)

3-S1: LLMs: Opportunities & Challenges Special (17:30-19:30)

Organizers: V. Gadepally & D. Burrill

 

Invited Talk: US DoD Chief Data and Artificial Intelligence Office LLM Strategy

Manuel Xavier Lugo (US Navy)
Invited Talk: How to Make an LLM Understand Human Conversation for Fun & Profit

Kartik Talamadupula (Symbl.ai)
Invited Talk: Innovations for Reducing the Environmental Impact of LLMs

Boris Gamazaychikov (Salesforce)
Invited Talk: Intel Gaudi and Large Language Models

Vasudev Lal (Intel)

IEEE HPEC 2024