27th Annual
IEEE High Performance Extreme Computing Virtual Conference
25 - 29 September 2023

HPEC 2023 AGENDA

Wednesday, September 27

3-K: Keynote Session (10:30-11:00)

Co-Chairs: J. Kepner & A. Reuther

AI for Digital Health & Computational Biology
Dr. Pradeep Dubey (Intel Senior Fellow)

3-1: AI / Machine Learning 2 Session (11:00-12:15)

Co-Chairs: N. Pitsianis & J. Mullen

Asymmetric Grouped Convolutions for Logarithmic Scale Efficient Convolutional Neural Networks [Outstanding Student Paper Award]
Li Jing, Rumen Dangovski, Marin Soljacic (MIT)
The design of convolutional neural networks has been increasingly focused on small and efficient models to meet the modern demands of edge devices. Thus, analyzing the theoretical limits of convolutional layers of previously unexplored complexities is critical. Here, we present a logarithmic-scale efficient convolutional neural network architecture. Our model is based on the well-known depthwise convolution and on two new layers introduced in this work: an asymmetric grouped convolution and a depthwise fast wavelet transform layer. By applying asymmetry in the channel dimensions and a provably optimal fast algorithm, we shrink the complexity of convolutional blocks by an O(log D / D) factor (from O(D^2) to O(D log D)), where D is the number of channels. Experiments on CIFAR-10, CIFAR-100, and ImageNet classification show that our model performs comparably to or better than strong classical baselines such as MobileNetV2 and ShuffleNet. The new convolutional layers we propose could serve a variety of applications, from designing efficient models by hand to augmenting the search space of AutoML architectures.
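As an illustrative point of reference only (the asymmetric grouped convolution and depthwise wavelet layers are the paper's contribution and are not reproduced here), the sketch below contrasts the parameter counts of a standard, a depthwise, and a grouped convolution in PyTorch, showing how grouping reduces the O(D^2) channel-mixing cost; the channel count D and group count are arbitrary example values.

```python
# Illustrative sketch (not the authors' code): compare parameter counts of a
# standard convolution, a depthwise convolution, and a grouped convolution.
import torch
import torch.nn as nn

D = 256          # number of channels (example value)
k = 3            # spatial kernel size

standard  = nn.Conv2d(D, D, k, padding=1)              # O(D^2) channel mixing
depthwise = nn.Conv2d(D, D, k, padding=1, groups=D)    # per-channel filtering, O(D)
grouped   = nn.Conv2d(D, D, 1, groups=16)              # 1x1 conv split into 16 groups

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, D, 32, 32)
for name, m in [("standard", standard), ("depthwise", depthwise), ("grouped", grouped)]:
    print(f"{name:9s} params={n_params(m):8d} output={tuple(m(x).shape)}")
```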
Machine Learning Across Network-Connected FPGAs
Dana Diaconu, Yanyue Xie, Mehmet Gungor, Suranga Handagala, Xue Lin, Miriam Leeser (Northeastern Univ.)
FPGAs often cannot implement machine learning inference with high-accuracy models due to significant storage and computing requirements. The corresponding hardware accelerators for such models are large designs that cannot be deployed on a single platform. In this research, we implement ResNet-50 with 4-bit precision for weights and 5-bit precision for activations, which offers a good trade-off between precision and accuracy. We train ResNet-50 using the quantization-aware training library Brevitas and build a hardware accelerator with the FINN framework from AMD. We map the result to three FPGAs that communicate directly with one another over the network via the User Datagram Protocol (UDP). The multi-FPGA implementation is compared to a single-FPGA ResNet-50 design with lower precision of 1-bit weights and 2-bit activations. While the latter can fit on a single FPGA, the former pays for higher accuracy with a three-fold increase in the required number of BRAM tiles and can only be deployed on multiple FPGAs. We show the difference in accuracy, resource utilization, and throughput for the designs deployed on AMD/Xilinx Alveo U280 data center accelerator cards available in the Open Cloud Testbed (OCT). The final multi-FPGA custom accelerator design for ResNet-50 achieves a 5.3% increase in accuracy and a throughput of 162.3 images/s at a frequency of 200 MHz, comparable to the single-FPGA lower-precision implementation’s throughput of 176.1 images/s at 160 MHz. We further explore a more efficient usage of the available memory on the target platform. By making use of the available Ultra RAM, we are able to fit the higher-precision accelerator on one U280 and achieve a throughput of 165 images/s.
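The sketch below is a rough illustration, not the authors' training code: it defines a quantized convolutional block with 4-bit weights and 5-bit activations, the precisions reported above, assuming Brevitas's QuantConv2d and QuantReLU layers and their bit-width keyword arguments.

```python
# Minimal sketch (assumed Brevitas usage, not the paper's model definition):
# a convolutional block with 4-bit weights and 5-bit activations.
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantReLU

class QuantBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = QuantConv2d(in_ch, out_ch, kernel_size=3, padding=1,
                                weight_bit_width=4)   # 4-bit weights
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = QuantReLU(bit_width=5)             # 5-bit activations

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```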
Image Segmentation with Topological Priors
Shakir Showkat Sofi, Nadezhda Alsahanova (Skolkovo Inst. of Sci. and Tech.)
Segmentation methods that incorporate topological priors have been shown to make fewer errors in fine-scale structures. In this work, we use topological priors both before and during the deep neural network training procedure. We compared the results of the two approaches on a simple segmentation task using various accuracy metrics and the Betti number error metric, which is directly related to topological correctness. We found that incorporating topological information into the classical U-Net model performed significantly better. We conducted experiments on the ISBI EM segmentation dataset to confirm the effectiveness of the proposed approaches.
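For reference, a minimal (non-authoritative) sketch of a Betti-0 error metric, counting the difference in the number of connected components between a predicted binary mask and the ground truth using scikit-image; the paper's Betti number error may be computed differently.

```python
# Illustrative sketch (not the paper's implementation): Betti-0 error between
# a predicted binary segmentation mask and the ground-truth mask.
import numpy as np
from skimage.measure import label

def betti0_error(pred_mask: np.ndarray, true_mask: np.ndarray) -> int:
    """Both inputs are binary 2D arrays; returns |b0(pred) - b0(true)|."""
    b0_pred = label(pred_mask, connectivity=2).max()   # number of components in prediction
    b0_true = label(true_mask, connectivity=2).max()   # number of components in ground truth
    return abs(int(b0_pred) - int(b0_true))
```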
Robust Fine-Tuning of Vision-Language Models for Domain Generalization
Kevin Vogt-Lowell, Noah Lee, Theodoros Tsiligkaridis, Marc Vaillant (MIT Lincoln Laboratory)
Transfer learning enables the sharing of common knowledge among models for a variety of downstream tasks, but traditional methods suffer in limited training data settings and produce narrow models incapable of effectively generalizing under distribution shifts. Foundation models have recently demonstrated impressive zero-shot inference capabilities and robustness under distribution shifts. However, zero-shot evaluation for these models has been predominantly confined to benchmarks with simple distribution shifts, limiting our understanding of their effectiveness under the more realistic shifts found in practice. Moreover, common fine-tuning methods for these models have yet to be evaluated against vision models in few-shot scenarios where training data is limited. To address these gaps, we present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP and evaluate its performance on challenging benchmark datasets with realistic distribution shifts from the WILDS collection. Our experiments demonstrate that, while zero-shot CLIP fails to match the performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts in terms of both in-distribution and out-of-distribution accuracy at all levels of training data availability. This provides a strong incentive for the adoption of foundation models within few-shot learning applications operating with real-world data.
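A rough sketch of the general idea only, not the paper's fine-tuning recipe: few-shot fine-tuning of CLIP by minimizing a cross-entropy loss over image/class-prompt similarities using the openai/CLIP package; the class names and data handling below are placeholders.

```python
# Sketch only (assumed setup, not the authors' recipe): one fine-tuning step
# for CLIP on a small labeled image set using class-name prompts.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()                                       # train in fp32 for stability

class_names = ["cat", "dog"]                                # placeholder label set
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def fine_tune_step(images, labels):
    """images: preprocessed image batch, labels: LongTensor of class indices."""
    image_feats = model.encode_image(images.to(device))
    text_feats = model.encode_text(text_tokens)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feats @ text_feats.T             # scaled cosine similarities
    loss = torch.nn.functional.cross_entropy(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```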
A Massively Parallel BWP Algorithm for Solving Large-Scale Systems of Nonlinear Equations
Bruno Silva (Univ. of Madeira and SRE-RG Madeira), Luiz Guerreiro Lopes (Univ. of Madeira)
This paper presents a GPU-based massively parallel implementation of the Best-Worst-Play (BWP) metaphor-less optimization algorithm, which results from the combination of two other simple and quite efficient population-based algorithms, Jaya and Rao-1, that have been used to solve a variety of problems. The proposed parallel GPU version of the algorithm is used here for solving large nonlinear equation systems, which are of enormous importance in different areas of science, engineering, and economics and are usually considered the most difficult class of problems to solve by traditional numerical methods. The proposed parallelization of the BWP algorithm was implemented using the Julia programming language on a GeForce RTX 3090 GPU with 10,496 CUDA cores and 24 GB of VRAM, and tested on a set of challenging scalable systems of nonlinear equations with dimensions between 500 and 2000. Depending on the tested problem and dimension, the GPU-based implementation of BWP exhibited speedups of up to 283.17x, with an average of 161.21x, which shows the efficiency of the proposed GPU-based parallel version of the BWP algorithm.
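For orientation only (the paper's implementation is in Julia on the GPU and is not shown here), the sketch below gives a NumPy version of the Jaya update rule, one of the two population-based schemes that BWP combines, applied to a toy two-equation nonlinear system solved by minimizing the residual norm.

```python
# Illustrative sketch (not the paper's Julia/GPU code): one iteration of the
# Jaya update rule with greedy selection; each row of X is a candidate solution
# and f maps a row to a residual norm to minimize.
import numpy as np

def jaya_step(X: np.ndarray, f) -> np.ndarray:
    fitness = np.apply_along_axis(f, 1, X)
    best, worst = X[fitness.argmin()], X[fitness.argmax()]
    r1 = np.random.rand(*X.shape)
    r2 = np.random.rand(*X.shape)
    X_new = X + r1 * (best - np.abs(X)) - r2 * (worst - np.abs(X))
    f_new = np.apply_along_axis(f, 1, X_new)
    keep = f_new < fitness                         # greedy selection
    return np.where(keep[:, None], X_new, X)

# Example: solve a toy nonlinear system F(x) = 0 by minimizing ||F(x)||.
F = lambda x: np.array([x[0]**2 + x[1] - 11.0, x[0] + x[1]**2 - 7.0])
f = lambda x: np.linalg.norm(F(x))
X = np.random.uniform(-5, 5, size=(64, 2))
for _ in range(200):
    X = jaya_step(X, f)
print("best residual:", min(np.apply_along_axis(f, 1, X)))
```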

3-T: Spiral Tutorial Session (12:15-15:45)

Organizer(s): F. Franchetti & M. Franusich

3-2: AI / Machine Learning 3 Session (12:30-13:45)

Co-Chairs: D. Campbell & L. Brattain

Meta-Learning and Self-Supervised Pretraining for Storm Event Imagery Translation
Ileana Rugina, Rumen Dangovski (MIT), Mark Veillette, Pooya Khorrami (MIT Lincoln Laboratory), Brian Cheung (MIT), Olga Simek (MIT Lincoln Laboratory), Marin Soljacic (MIT)
Recent advances in deep learning have produced impressive results across a wide range of computational problems such as computer vision, natural language processing, and reinforcement learning. Many of these improvements, however, are constrained to problems with large-scale curated datasets that require a great deal of human labor to gather. Additionally, these models tend to generalize poorly under both slight distributional shifts and low-data regimes. In recent years, emerging fields such as meta-learning and self-supervised learning have been closing the gap between proof-of-concept results and real-life applications of machine learning by extending deep learning to the semi-supervised and few-shot domains. We follow this line of work and explore spatiotemporal structure in a recently introduced image-to-image translation problem for storm event imagery in order to: (i) formulate a novel multi-task few-shot image generation benchmark in the field of AI for Earth and Space Science and (ii) explore data augmentations in contrastive pretraining for image translation downstream tasks. We present several baselines for the few-shot problem and discuss trade-offs between different approaches. Our implementation and instructions to reproduce the experiments, available at https://github.com/irugina/meta-image-translation, are thoroughly tested on MIT SuperCloud and scalable to other state-of-the-art HPC systems.
Accelerating GNN-based SAR Automatic Target Recognition on HBM-enabled Data-center FPGA
Bingyi Zhang (USC), Rajgopal Kannan (DEVCOM Army Research Lab), Viktor K Prasanna (USC), Carl Busart (DEVCOM Army Research Lab)
Synthetic Aperture Radar (SAR) automatic target recognition (ATR) is a key technique for remote-sensing image recognition. In real-world applications, massive SAR images captured by airplanes or satellites are sent to a data center, which requires high-throughput and low-latency processing. Recently, Graph Neural Networks (GNNs) have shown superior performance for SAR ATR in terms of accuracy and computational complexity. In this paper, we accelerate GNN-based SAR ATR on a data-center FPGA. In the proposed design, we develop a customized data path and memory organization to execute various computation kernels of GNNs, including feature aggregation and feature transformation. We exploit the high bandwidth memory (HBM) of the data-center FPGA to speed up data loading and store intermediate results. Since data-center FPGAs have multiple dies with limited cross-die interconnection, which can easily lead to routing failures, we employ the splitting kernel technique to improve the routability and frequency of the design. We implement the proposed design using High-Level Synthesis (HLS) on a state-of-the-art data-center FPGA board, the AMD/Xilinx Alveo U280. Compared with implementations on state-of-the-art CPUs (GPUs), our FPGA implementation achieves a 5.2x (1.57x) lower latency, a 10x (3.3x) higher throughput, and is 36.2x (7.35x) more energy efficient.
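As background, a minimal NumPy sketch of the two GNN computation kernels mentioned above, feature aggregation over a normalized adjacency matrix followed by a learned feature transformation; the FPGA data path, HBM layout, and kernel splitting in the paper are not represented.

```python
# Minimal sketch (not the FPGA kernels themselves): one GNN layer expressed as
# feature aggregation followed by feature transformation.
import numpy as np

def gnn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """A: n x n adjacency (with self-loops), H: n x d features, W: d x d' weights."""
    deg = A.sum(axis=1, keepdims=True)
    A_hat = A / np.maximum(deg, 1.0)     # simple row normalization
    H_agg = A_hat @ H                    # feature aggregation kernel
    return np.maximum(H_agg @ W, 0.0)    # feature transformation kernel + ReLU
```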
An Analysis of Energy Requirement for Computer Vision Algorithms [Outstanding Student Paper Award]
Daniel G Edelman, Siddharth Samsi, Joseph McDonald, Adam Michaleas, Vijay Gadepally (MIT Lincoln Laboratory)
The energy requirements of neural network training are growing at a rapid rate. Increased energy demands have created a global need to find ways to improve the energy efficiency of neural network training. This paper aims to establish a baseline for how adjusting basic parameters affects energy consumption when training neural networks on computer vision tasks. We catalog the effects of various adjustments, from simple batch size changes to more complicated hardware configurations such as power capping. Findings include that switching from a single-precision model to a mixed-precision model can reduce energy consumption by nearly 40%, and that power capping the GPU can reduce energy cost by an additional 10%.
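A brief sketch of the two adjustments quantified above, not the authors' benchmarking harness: mixed-precision training with torch.cuda.amp, with GPU power capping done outside Python (for example, `sudo nvidia-smi -pl 250` caps the board at 250 W).

```python
# Sketch only (assumed training loop): mixed-precision training with automatic
# mixed precision (AMP); requires a CUDA-capable GPU.
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, images, labels):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in mixed precision
        loss = loss_fn(model(images), labels)
    scaler.scale(loss).backward()            # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```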
Contextualizing Enhances Gradient Based Meta Learning for Few Shot Image Classification
Evan Vogelbaum, Rumen Dangovski, Li Jing, Marin Soljacic (MIT)
Meta learning methods have found success when applied to few shot classification problems, in which they quickly adapt to a small number of labeled examples. Prototypical representations, each representing a particular class, have been of particular importance in this setting, as they provide a compact form to convey information learned from the labeled examples. However, these prototypes are just one method of representing this information, and they are narrow in their scope and ability to classify unseen examples. We propose the implementation of contextualizers, which are generalizable prototypes that adapt to given examples and play a larger role in classification for gradient-based models. We demonstrate how to equip meta learning methods with contextualizers and show that their use can significantly boost performance on a range of few shot learning datasets. We also present figures of merit demonstrating the potential benefits of contextualizers, along with analysis of how models make use of them. Our approach is particularly apt for low-data environments where it is difficult to update parameters without overfitting. Our implementation and instructions to reproduce the experiments, available at https://github.com/naveace/proto-context/, are thoroughly tested on MIT SuperCloud, and scalable to other state-of-the-art HPC systems.
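As background only (contextualizers generalize this idea and are not shown here), a minimal sketch of standard prototypical classification for few-shot learning, where each class is represented by the mean embedding of its support examples.

```python
# Illustrative sketch (not the paper's method): prototypical classification of
# query embeddings against per-class mean support embeddings.
import torch

def prototypical_logits(support_emb, support_labels, query_emb, n_classes):
    """support_emb: (n_support, d), query_emb: (n_query, d); returns (n_query, n_classes)."""
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                                   # (n_classes, d)
    dists = torch.cdist(query_emb, prototypes)           # Euclidean distances to prototypes
    return -dists                                        # closer prototype -> higher logit
```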
Manifold Transfer Networks for Lens Distortion Rectification
Li Jing, Lay Jain, Rumen Dangovski, Marin Soljacic (MIT)
Convolutional neural networks (CNNs), well-known for their translational invariance property on translational manifolds, are not guaranteed to generalize to images on other types of manifolds. Existing works extending CNNs’ translational invariance property are limited to linear transformations such as rotation. We propose a novel framework, the Manifold Transfer Network, with an embedded inductive bias for any specified nonlinear manifold. Our model maps a nonlinear transformation to a linear translation on a translational manifold, making it suitable for a CNN to learn and predict. We design such a map through the solutions of a particular class of partial differential equations. We empirically apply our method to the domain of radial lens distortion rectification. In our experiments on the CelebA dataset we demonstrate superior performance of our model compared to conventional baselines.

3-3: AI / Machine Learning 4 Session (14:15-15:30)

Co-Chairs: D. Campbell & L. Brattain

Invited Talk: IARPA AGILE Program
Dr. Bill Harrod (IARPA Program Manager)
Automated Indexing Of TEM Diffraction Patterns Using Machine Learning
Nathaniel Tomczak, Sanmukh Kuppannagari (Case Western Reserve Univ.)
Indexing Transmission Electron Microscopy (TEM) diffraction patterns is a critical step in materials characterization. Although the indexing process is manually intensive, work applying Machine Learning (ML) to this space is sparse. We present an evaluation of current state-of-the-art classification models and a Convolutional Neural Network (CNN), found through a Neural Architecture Search (NAS), in the TEM diffraction domain. Both convolution- and transformer-based architectures were considered. Our NAS model achieved the greatest top-1 accuracy of 77.03% and F1 score of 0.751. The convolution-based architectures performed better overall, with EfficientNet-B3 achieving the highest average accuracy of 71.82% and tying the NAS model for the largest average F1 score of 0.686. These results can be used to guide further research into the better classification and creation of TEM diffraction data.
Scalable Deep Learning for Pilot Performance Analysis Using Multimodal Physiological Time Series
Noah V Lee (MIT Lincoln Laboratory), Patrick Moore (DAF-MIT/AIA), Laura Brattain (MIT Lincoln Laboratory)
Sensors used to collect human physiological data often necessitate the processing and classification of time series data, which can quickly become intractable with very lengthy inputs or many time series features. In this study, we compared two methods of time series feature extraction and dimensionality reduction, Minimally Random Convolutional Kernel Transform (MiniRocket) and statistical feature engineering using TSFresh, to determine optimal hardware configurations and the associated trade-offs between model speed, complexity, and accuracy. Our results show that MiniRocket scales extremely well with only linear complexity, while the scaling of TSFresh depends on the set of features selected for computation. Further, MiniRocket outperformed the TSFresh model in accuracy for all configurations except the most comprehensive (but slowest) feature extraction set, highlighting MiniRocket as a strong all-purpose dimensionality reduction tool for human physiological time series data.
PaCKD: Pattern-Clustered Knowledge Distillation for Compressing Memory Access Prediction Models
Neelesh Gupta, Pengmiao Zhang (USC), Rajgopal Kannan (DEVCOM Army Research Lab), Viktor K Prasanna (USC)
Deep neural networks (DNNs) have proven to be effective models for accurate Memory Access Prediction (MAP), a critical task in mitigating memory latency through data prefetching. However, existing DNN-based MAP models suffer from challenges such as significant physical storage requirements and poor inference latency, primarily due to their large number of parameters. These limitations render them impractical for deployment in real-world scenarios. In this paper, we propose PaCKD, a Pattern-Clustered Knowledge Distillation approach that compresses MAP models while maintaining prediction performance. The PaCKD approach encompasses three steps: clustering memory access sequences into distinct partitions with similar patterns, training a large pattern-specific teacher model for memory access prediction for each partition, and training a single lightweight student model by distilling the knowledge from the trained pattern-specific teachers. We assess the effectiveness of our approach by evaluating it on three models commonly used for image classification tasks, LSTM, MLP-Mixer, and ResNet, since these models exhibit diverse structures. We then evaluate their MAP performance across four widely utilized graph applications. Compared to the teacher models with 5.406M parameters and an F1-score of 0.4625, our student models achieve a 552x model size compression while maintaining an F1-score of 0.4538 (a 1.92% performance drop). Our approach yields an 8.70% higher result than student models trained with standard knowledge distillation and an 8.88% higher result than student models trained without any form of knowledge distillation.
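A generic sketch for reference, omitting PaCKD's pattern clustering and pattern-specific teachers: the standard knowledge-distillation loss that trains a student from a teacher's temperature-softened outputs alongside the ordinary task loss.

```python
# Generic sketch (not PaCKD itself): standard knowledge-distillation loss combining
# a softened KL term against the teacher with the usual hard-label loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale to match gradient magnitudes
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```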
A Composable Just-In-Time Programming Framework with LLMs and FBP
Andy Vidan, Lars Fiedler (Composable Analytics)
This paper introduces a computing framework that combines Flow-Based Programming (FBP) and Large Language Models (LLMs) to enable Just-In-Time Programming (JITP). JITP empowers users, regardless of their programming expertise, to actively participate in the development and automation process by leveraging their task-time algorithmic insights. By seamlessly integrating LLMs into the FBP workflow, the framework allows users to request and generate code in real-time, enabling dynamic code execution within a flow-based program. The paper explores the motivations, principles, and benefits of JITP, showcasing its potential in automating tasks, orchestrating data workflows, and accelerating software development. Through a fully implemented JITP framework using the Composable platform, we explore several examples and use cases to illustrate the benefits of the framework in data engineering, data science and software development. The results demonstrate how the fusion of FBP and LLMs creates a powerful and user-centric computing paradigm.

3-4: Scaling HPC Education Session (15:45-17:00)

Co-Chairs: J. Mullen, L. Milechin & H. Jananthan

Invited Talk: Team Building Inside a Highly Decentralized System: The MIT Office of Research Computing and Data
James Cuff (MIT Office of Research Computing and Data)
Invited Talk: High ‘PI’-Performance Computing: Leveraging Raspberry Pis to Introduce Young Learners to HPC, Linux, and Parallel Programming
Arianna Martin (BP, NAG Partner)
Invited Talk: Training Next Generation AI Users & Developers at NCSA
Dr. Daniel S. Katz (Chief Scientist, NCSA, Univ. of Illinois)
Invited Talk: Enhancing Education with AI-Generated Content
Rocael Hernández Rizzardini (Galileo Univ.)
Invited Talk: Building AI Mentors with Custom Indexes, Prompts, Guardrails and APIs
Miguel Amigot II (IBL Education)

3-S1: AI / Machine Learning 5 Special (17:30-19:30)

Co-Chairs: X. Sun & S. Kuppannagari

G-MAP: A Graph Neural Network-Based Framework for Memory Access Prediction
Abhiram Rao Gorle (Indian Institute of Technology, Madras), Pengmiao Zhang (USC), Rajgopal Kannan (DEVCOM Army Research Lab), Viktor K Prasanna (USC)
Memory access prediction is a crucial problem in data prefetching, as it helps improve memory performance and reduce latency in computing systems. Existing works model the problem as a sequence prediction problem, which can be limited in its ability to capture complex patterns and dependencies in memory access behavior. In recent years, Graph Neural Networks (GNNs) have emerged as a promising technique for modeling and predicting complex relationships in graph-structured data. In this paper, we introduce G-MAP, a novel Graph Neural Network-based framework for Memory Access Prediction. First, we propose Mem2Graph, a novel approach that maps a memory access sequence to a graph representation, capturing both the spatial and temporal locality in the sequence. Second, we implement various GNNs for G-MAP, including Graph Convolutional Network (GCN), Gated Graph Sequence Neural Network (GG-NN), and Graph Attention Network (GAT). These models take the graph generated by Mem2Graph as input and predict future memory address jumps (deltas). We evaluate the effectiveness of G-MAP using the SPEC 2006 benchmark. G-MAP using GG-NN performs the best among all models, achieving an average F1-score of 0.7526, which is 10.77% higher than the Multi-Layer Perceptron baseline.
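An illustrative sketch of the general idea only (the details of Mem2Graph are the paper's): converting a memory-address trace into address deltas and a small directed graph whose edge weights count observed delta transitions.

```python
# Sketch only (not Mem2Graph): build a delta-transition graph from an address trace.
import networkx as nx

def deltas_to_graph(addresses):
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    g = nx.DiGraph()
    for d_prev, d_next in zip(deltas, deltas[1:]):
        w = g.get_edge_data(d_prev, d_next, default={"weight": 0})["weight"]
        g.add_edge(d_prev, d_next, weight=w + 1)   # count how often this transition occurs
    return deltas, g

trace = [0x1000, 0x1040, 0x1080, 0x10c0, 0x2000, 0x2040]   # toy example trace
deltas, g = deltas_to_graph(trace)
print(deltas)                     # [64, 64, 64, 3904, 64]
print(list(g.edges(data=True)))
```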
Accelerating Multi-Agent DDPG on CPU-FPGA Heterogeneous Platform
Samuel Wiggins, Yuan Meng (USC), Rajgopal Kannan (DEVCOM Army Research Lab), Viktor K Prasanna (USC)
Multi-Agent Reinforcement Learning (MARL) is a key technology in artificial intelligence applications such as robotics, surveillance, energy systems, etc. Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is a state-of-the-art MARL algorithm that has been widely adopted and considered a popular baseline for novel MARL algorithms. However, existing implementations of MADDPG on CPU and CPU-GPU platforms do not exploit fine-grained parallelism between cooperative agents and handle inter-agent communication sequentially, leading to sub-optimal throughput performance in MADDPG training. In this work, we develop the first high-throughput MADDPG accelerator on a CPU-FPGA heterogeneous platform. Specifically, we develop dedicated hardware modules that enable parallel training of each agent’s internal Deep Neural Networks (DNNs) and support low-latency inter-agent communication using an on-chip agent interconnection network. Our experimental results show that the speed performance of agent neural network training improves by a factor of 3.6× – 24.3× and 1.5× – 29.5× compared with state-of-the-art CPU and CPU-GPU implementations. Our design achieves up to a 1.99× and 1.93× improvement in overall system throughput compared with CPU and CPU-GPU implementations, respectively.
Decreasing the Computing Time of Bayesian Optimization using Generalizable Memory Pruning
Alexander E Siemenn, Tonio Buonassisi (MIT)
Bayesian optimization (BO) suffers from long computing times when processing high-dimensional or large data sets. These long computing times result from the Gaussian process surrogate model having a polynomial time complexity in the number of experiments. Running BO on high-dimensional or massive data sets becomes intractable due to this time complexity scaling, in turn hindering experimentation. Alternative surrogate models have been developed to reduce the computing utilization of the BO procedure; however, these methods require mathematical alteration of the inherent surrogate function, restricting their use to that function alone. In this paper, we demonstrate a generalizable BO wrapper of memory pruning and bounded optimization that can be used with any surrogate model and acquisition function. Using this memory pruning approach, we show a decrease in wall-clock computing times per experiment of BO from a polynomially increasing pattern to a sawtooth pattern with a non-increasing trend, without sacrificing convergence performance. Furthermore, we illustrate the generalizability of the approach across two unique data sets, two unique surrogate models, and four unique acquisition functions. All model implementations are run on the MIT SuperCloud state-of-the-art computing hardware.
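A conceptual sketch, not the authors' wrapper: pruning the memory of observed points to a fixed budget before refitting a Gaussian-process surrogate, so the per-iteration fitting cost stays bounded instead of growing with every experiment; the keep-best-plus-most-recent policy shown is just one possible choice.

```python
# Sketch only (not the paper's method): fit a GP surrogate on a pruned memory of
# observations so the fitting cost per BO iteration stays bounded.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def fit_pruned_surrogate(X, y, budget=128):
    X, y = np.asarray(X), np.asarray(y)
    if len(y) > budget:
        # Keep the best half of the budget plus the most recent half (one simple policy).
        best = np.argsort(y)[: budget // 2]
        recent = np.arange(len(y) - budget // 2, len(y))
        keep = np.unique(np.concatenate([best, recent]))
        X, y = X[keep], y[keep]
    return GaussianProcessRegressor(normalize_y=True).fit(X, y)
```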
Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++ [Outstanding Student Paper Award]
Bin Lei, Caiwen Ding (Univ. of Connecticut), Le Chen, Pei-Hung Lin, Chunhua Liao (LLNL)
In this study, we present a novel dataset for training machine learning models to translate between OpenMP Fortran and C++ code. To ensure reliability and applicability, the dataset is created from a range of representative open-source OpenMP benchmarks and refined using a meticulous code similarity test. The effectiveness of our dataset is assessed using both quantitative (CodeBLEU) and qualitative (human evaluation) methods. We showcase how this dataset significantly elevates the translation competencies of large language models (LLMs). Specifically, models without prior coding knowledge experienced a 5.1x boost in their CodeBLEU scores, while models with some coding familiarity saw an impressive 9.9x increase. The best model fine-tuned on our dataset outperforms GPT-4 and approaches human-level accuracy. This work underscores the immense potential of our dataset in propelling advancements in the domain of code translation for high-performance computing. The dataset is accessible at https://github.com/bin123apple/Fortran-CPP-HPC-code-translation-dataset.
ANEDA: Adaptable Node Embeddings for Shortest Path Distance Approximation
Frank Pacini (Boston Univ.), Allison Gunby-Mann (Dartmouth Coll.), Sarel Cohen (Academic College of Tel Aviv-Yaffo), Peter Chin (Dartmouth Coll.)
Shortest path distance approximation is a crucial aspect of many graph algorithms, in particular the heuristic-based routing algorithms that make fast, scalable map navigation possible. Past literature has introduced deep learning models that try to approximate these distances by training on graph embeddings (e.g. node2vec, Gra100, ProNE, Poincare). We propose ANEDA, a more lightweight technique than the embedding-plus-graph-neural-network scheme, which involves training the embeddings directly, using either previous embedding techniques or geographic coordinates as a good initialization. We demonstrate the application of ANEDA to deep A* routing and learned road maps. Through experiments on several road and social networks, we show our model’s error reduction of up to 75% against two recent deep learning approaches, and its competitive performance against the larger, state-of-the-art architecture.
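A sketch of the general idea only (ANEDA's initialization and training details are in the paper): learning one embedding per node so that the Euclidean distance between two embeddings approximates the shortest-path distance between the corresponding nodes.

```python
# Illustrative sketch (not ANEDA itself): fit node embeddings whose pairwise
# Euclidean distances approximate known shortest-path distances.
import torch
import torch.nn as nn

class DistanceEmbedding(nn.Module):
    def __init__(self, n_nodes, dim=16):
        super().__init__()
        self.emb = nn.Embedding(n_nodes, dim)

    def forward(self, u, v):
        return (self.emb(u) - self.emb(v)).norm(dim=-1)   # predicted distance

def train(model, pairs, true_dists, epochs=200, lr=0.01):
    """pairs: (n, 2) LongTensor of node-id pairs, true_dists: (n,) shortest-path lengths."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(pairs[:, 0], pairs[:, 1])
        loss = nn.functional.mse_loss(pred, true_dists.float())
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```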

3-S2: Open SuperComputing Special (17:30-19:30)

Co-Chairs: K. Keville & Po Hao Chen

Cycle Stealing in Exascale HPC Workloads
Akshaya Bali (Boston Univ.)

MoSAIC, the RISC-V and Mesh Network Prototyping Environment
Farzad Fatollahi-Fard (LBL)