27th Annual
IEEE High Performance Extreme Computing Virtual Conference
25 - 29 September 2023

HPEC 2023 AGENDA

Thursday, September 28


4-K: Keynote Session (10:30-11:00)
Co-Chairs: J. Kepner & A. Reuther

Linux is Legacy. Hear What Should Replace It.
Prof. Michael Stonebraker (MIT CSAIL & ACM A.M. Turing Award Winner)

4-1: Case Studies & Benchmarking 1 Session (11:00-12:15)

Co-Chairs: H. Badawy & D. Cousins

An Analysis of Accelerator Data-Transfer Modes in NoC-Based SoC Architectures [Outstanding Student Paper Award]
Kuan-Lin Chiu, Davide Giri, Luca Piccolboni, Luca Carloni (Columbia Univ.)
Data movement is a key factor impacting the performance of hardware accelerators. In a complex SoC architecture, multiple accelerators compete for access to the on-chip communication resources and off-chip memory interfaces. For a program that invokes many accelerators, orchestrating the data movement is critically important to avoid degrading the speedup that each standalone accelerator can achieve. We present a comparative analysis of the two main data-transfer modes among accelerators: memory-based and point-to-point (p2p) communication. We describe their implementation on FPGA for both single-thread and multi-thread software programs. We analyze the implications on programmability, performance, and energy efficiency by using a variety of synthetic benchmarks to evaluate the data-transfer modes in different scenarios and by accelerating two real-world image processing applications: Nightvision and Wide-Area Motion Imagery (WAMI). We demonstrate that for various configurations of a tile-based many-accelerator SoC, p2p outperforms memory-based communication.
High-Level Framework for Solving Systems of the PDEs on Distributed Systems
Yevhen Pankevych, Oleg Farenyuk (Ukrainian Catholic Univ.)
This work develops a framework for the high-level and user-transparent distribution of calculations for solving systems of partial differential equations. The framework provides a simple high-level application programming interface and automatically partitions problems into sub-problems that are then executed on MPI-based clusters. The use of CUDA accelerators is also supported. The framework manages processes and accelerators through MPI services. The performance of the developed system was analyzed, the influence of the data distribution scheme on performance was investigated, and the performance of the CPU and GPU modes of the framework was compared. Computational fluid dynamics problems on a 2D unstructured mesh were used as the benchmark task for the framework. Linear speedup with the number of available CPUs is observed, with several super-linear cases.
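For illustration, the kind of MPI boilerplate such a framework hides from the user can be sketched as a 1D domain decomposition with halo exchange between neighboring ranks (a minimal mpi4py sketch; the variable names and stencil are assumptions, not the framework's API):

    # Minimal 1D domain-decomposition sketch with halo exchange (mpi4py).
    # Illustrates what the framework automates; not its actual API.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_local = 1000                      # interior cells owned by this rank
    u = np.zeros(n_local + 2)           # +2 ghost cells (left and right halos)
    u[1:-1] = rank                      # dummy initial data

    left = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    for step in range(10):
        # Exchange halos: send boundary cells, receive neighbors' cells into ghosts.
        comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
        comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
        # Explicit update of interior cells (1D diffusion stencil as a stand-in).
        u[1:-1] += 0.1 * (u[:-2] - 2.0 * u[1:-1] + u[2:])
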
Quantifying OpenMP: Statistical Insights into Usage and Adoption
Tal Kadosh (Ben-Gurion University), Niranjan Hasabnis, Timothy Mattson (Intel), Yuval Pinter (Ben-Gurion University), Gal Oren (Technion)

In high-performance computing (HPC), the demand for efficient parallel programming models has grown dramatically since the end of Dennard Scaling and the subsequent move to multi-core CPUs. OpenMP stands out as a popular choice due to its simplicity and portability, offering a directive-driven approach for shared-memory parallel programming. Despite its wide adoption, however, there is a lack of comprehensive data on the actual usage of OpenMP.

This paper presents a statistical analysis of OpenMP usage based on a novel and extensive database, HPCORPUS, compiled from GitHub repositories containing C, C++, and Fortran code. The results reveal that OpenMP is the dominant parallel programming model, accounting for 45% of all analyzed parallel APIs. Furthermore, it has demonstrated steady and continuous growth in popularity over the past decade. Analyzing specific OpenMP constructs, the study provides in-depth insights into their usage patterns and preferences across the three languages. Notably, we found that while OpenMP has a strong “common core” of constructs in common usage (the rest of the API is used far less), the use of newer constructs such as simd, target directives for accelerated computing, and tasks for irregular parallelism is growing as well.

Overall, this study sheds light on OpenMP’s significance in HPC applications and provides valuable data for researchers and practitioners. It showcases OpenMP’s versatility, evolving adoption, and relevance in contemporary parallel programming, underlining its continued role in HPC applications and beyond. These statistical insights are essential for making informed decisions about parallelization strategies and provide a foundation for further advancements in parallel programming models and techniques.

HPCORPUS, as well as the analysis scripts and raw results, are available at: https://github.com/Scientific-Computing-Lab-NRCN/HPCorpus

Solving Sparse Linear Systems via Flexible GMRES with In-Memory Analog Preconditioning
Vasileios Kalantzis, Mark S Squillante, Chai Wah Wu, Anshul Gupta, Shashanka Ubaru, Tayfun Gokmen, Lior Horesh (IBM Research)

Analog arrays of non-volatile crossbars leverage physics to compute approximate matrix-vector multiplications in a rapid, in-memory fashion. In this paper we consider exploiting this technology to precondition the Generalized Minimum Residual iterative solver (GMRES). Since the preconditioner must be applied through matrix-vector multiplication, approximate inverse preconditioners are a natural fit. At the same time, the errors introduced by the analog hardware render an iteration matrix that changes from one iteration to another. To remedy this, we propose to combine analog approximate inverse preconditioning with a flexible GMRES algorithm, which naturally incorporates variations of the preconditioner into its model. The benefit of this approach is that the analog circuit remains much simpler than one that corrects the errors at the hardware level. Our experiments with a simulator for analog hardware show that such an analog-flexible scheme can lead to fast convergence.
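For illustration, the flexible-GMRES idea of keeping the preconditioned directions so that the preconditioner may change every iteration can be sketched as follows; the noisy Jacobi-type approximate inverse here is only a crude stand-in for the analog-hardware simulator, and all parameters are assumptions:

    # Minimal flexible GMRES (FGMRES) sketch with a "noisy" approximate-inverse
    # preconditioner standing in for analog hardware error. Illustrative only.
    import numpy as np

    def fgmres(A, b, apply_prec, m=30, tol=1e-8):
        n = b.size
        x = np.zeros(n)
        r = b - A @ x
        beta = np.linalg.norm(r)
        V = np.zeros((n, m + 1))          # Arnoldi basis
        Z = np.zeros((n, m))              # preconditioned directions (kept!)
        H = np.zeros((m + 1, m))
        V[:, 0] = r / beta
        for j in range(m):
            Z[:, j] = apply_prec(V[:, j])         # preconditioner may vary per step
            w = A @ Z[:, j]
            for i in range(j + 1):                # modified Gram-Schmidt
                H[i, j] = w @ V[:, i]
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            V[:, j + 1] = w / H[j + 1, j]
            # Solve the small least-squares problem min ||beta*e1 - H y||.
            e1 = np.zeros(j + 2); e1[0] = beta
            y = np.linalg.lstsq(H[:j + 2, :j + 1], e1, rcond=None)[0]
            if np.linalg.norm(H[:j + 2, :j + 1] @ y - e1) < tol * beta:
                return x + Z[:, :j + 1] @ y
        return x + Z[:, :m] @ y

    # Noisy Jacobi-type approximate inverse: a crude model of analog variation.
    rng = np.random.default_rng(0)
    n = 200
    A = np.diag(4.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    d_inv = 1.0 / np.diag(A)
    noisy_prec = lambda v: (d_inv * v) * (1.0 + 0.05 * rng.standard_normal(n))
    b = rng.standard_normal(n)
    x = fgmres(A, b, noisy_prec)
    print("residual:", np.linalg.norm(b - A @ x))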

Invited Talk: Combining Physical Laws, AI, HPC, and Experiment to Predict Chemical Reactions and Properties
Prof. William Green (MIT ChemE & AAAS Fellow)

4-P: Poster Session (12:15-14:15)
Organizer(s): TBD & TBD

FAxM: FPGA-specific Approximate Multipliers for Accelerators of Machine Learning
Zainab Aizaz, Kavita Khare (Maulana Azad National Inst. of Tech.), Aizaz Tirmizi (IES Coll. of Tech.)
In this paper, a truncated FPGA-specific Approximate Multiplier, FAxM, is designed using the Adaptive Logic Module (ALM) and carry chains for energy-efficient multiplication on Intel FPGAs. The sums and carries of the three columns preceding the truncation are added in a unique way using the ALMs in arithmetic mode, resulting in accuracy that surpasses that of all existing approximate multipliers while maintaining low PDP on FPGAs. Results show that FAxM variants possess comparatively lower MREDs, ranging from 6.67×10^-7 to 1.08×10^-2. Compared to the exact multiplier, 16-bit FAxM achieves a 51.92%-67.31% reduction in area and a 45.53%-69.15% reduction in PDP.
IoT Security: Smart Doorbell to Botnet
Ayyappan Rajesh (UMass Dartmouth)
As we step further into the interconnected era of the Internet of Things (IoT), the security implications of these devices come to the forefront of the discussion. This comprehensive study zeroes in on the Qubo IoT doorbell, a pioneering product designed and manufactured in India. This device, however, is not exempt from the security vulnerabilities that often plague IoT devices, as highlighted by the issue tracked as CVE-2023-22906.
Navigating the intricacies of IoT security, we aim to throw light on the consequences of this specific vulnerability for the end-users, illustrating the potential threats and breaches to their digital safety. Our investigation reaches beyond the identification of these risks, striving to propose strategic and effective countermeasures.  Underpinning this exploration is an emphasis on the critical need for stringent security protocols in the IoT landscape. By assessing their incorporation during the essential stages of IoT development and deployment, we highlight the protective role they play in safeguarding digital ecosystems. By unveiling the intricacies of this particular vulnerability, this study offers essential insights into IoT security. Beyond diagnosing potential pitfalls, it paves the way for developers, manufacturers, and end-users to proactively engage with these threats, thereby fostering a more secure digital environment.
Nonlinear Spectral Clustering with C++ GraphBLAS [Outstanding Short Paper Award]
Dimosthenis Pasadakis, Olaf Schenk (Univ. della Svizzera italiana), Verner Vlacic, Albert-Jan Yzelman (Huawei Zurich Research Center)
Nonlinear reformulations of the spectral clustering method have gained a lot of recent attention due to their increased numerical benefits and their solid mathematical background. However, the estimation of the multiple nonlinear eigenvectors is associated with an increased computational cost. We present an implementation of a direct multiway spectral clustering algorithm in the p-norm, for p between 1 and 2, using a novel C++ GraphBLAS API. The key operations are expressed in linear algebraic terms and are executed over the resulting sparse matrices and dense vectors, parameterized in the algebra pertinent to the computation. We demonstrate the effectiveness and accuracy of our shared-memory algorithm on several artificial test cases. Our numerical examples and comparative results against competitive methods indicate that the proposed implementation attains high quality clusters in terms of the balanced graph cut metric. The strong scaling capabilities of our algorithm are showcased on a range of datasets with up to 8 million nodes and 48 million edges.
Hypersparse Traffic Matrix Construction using GraphBLAS on a DPU [Outstanding Short Paper Award]
William Bergeron, Michael Jones (MIT Lincoln Laboratory), Chase Barber, Kale DeYoung, George Amariucai, Kaleb Ernst, Nathan Fleming (KSU), Peter Michaleas, Sandeep Pisharody (MIT Lincoln Laboratory), Nathan Wells (KSU), Antonio Rosa (MIT Lincoln Laboratory), Eugene Y. Vasserman (KSU), Jeremy Kepner (MIT Lincoln Laboratory)
Low-power, small-form-factor data processing units (DPUs) enable offloading and acceleration of a broad range of networking and security services. DPUs have accelerated the transition to programmable networking by enabling the replacement of FPGAs/ASICs in a wide range of network-oriented devices. The GraphBLAS sparse matrix graph open standard math library is well-suited for constructing anonymized hypersparse traffic matrices of network traffic, which can enable a wide range of network analytics. This paper measures the performance of GraphBLAS on an ARM-based NVIDIA DPU (BlueField 2) and, to the best of our knowledge, represents the first reported GraphBLAS results on a DPU and/or ARM-based system. Anonymized hypersparse traffic matrices were constructed at a rate of over 18 million packets per second.
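For illustration, the core construction is simple: each observed packet increments one entry of an N x N sparse matrix indexed by anonymized source and destination addresses. The paper does this with GraphBLAS on the DPU; the SciPy-based sketch below (hash-based anonymization, toy sizes) is only an assumption-laden illustration of the idea:

    # Sketch of anonymized hypersparse traffic-matrix construction. The paper
    # uses GraphBLAS on a BlueField-2 DPU; this SciPy version only illustrates
    # the idea: each packet (src, dst) increments one matrix entry.
    import hashlib
    import numpy as np
    from scipy.sparse import coo_matrix

    def anonymize(addr: str, bits: int = 24) -> int:
        # Hash-based anonymization (an assumption, not the paper's scheme).
        digest = hashlib.blake2b(addr.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % (1 << bits)

    packets = [("10.0.0.1", "10.0.0.2"), ("10.0.0.1", "10.0.0.3"), ("10.0.0.1", "10.0.0.2")]
    N = 1 << 24                              # huge index space, few nonzeros: hypersparse
    rows = np.array([anonymize(s) for s, _ in packets])
    cols = np.array([anonymize(d) for _, d in packets])
    vals = np.ones(len(packets))

    traffic = coo_matrix((vals, (rows, cols)), shape=(N, N))
    traffic.sum_duplicates()                 # repeated (src, dst) pairs become packet counts
    print("distinct (src, dst) pairs:", traffic.nnz)
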
Spla: Generalized Sparse Linear Algebra Library with Vendor-Agnostic GPUs Acceleration [Outstanding Short Paper Award]
Egor Orachev, Semyon Grigorev (St. Petersburg St. Univ.)
Scalable high-performance graph analysis is a nontrivial challenge. Using sparse linear algebra operations as building blocks for graph analysis algorithms, which is the core idea of the GraphBLAS standard, is a promising way to attack it. While it is known that sparse linear algebra operations can be efficiently implemented on GPU, a full GraphBLAS implementation on GPU is a nontrivial task that is almost solved by the GraphBLAST project. It has been shown that utilizing GPUs for a GraphBLAS implementation significantly improves performance, but GraphBLAST is not portable because it is based on NVIDIA CUDA. In this work we propose the Spla library, which aims to solve this problem using the OpenCL API for vendor-agnostic GPU-accelerated computations. Evaluation shows that the proposed solution demonstrates performance comparable with GraphBLAST, outperforming it by up to 36 times in some cases, and remains portable across different GPU vendors.
Twiddle Factor Generation for a Vectorized Number Theoretic Transform [Outstanding Short Paper Award]
Patrick J Brinich (Drexel Univ.), Naifeng Zhang, Franz Franchetti (Carnegie Mellon Univ.), Jeremy Johnson (Drexel Univ.)
Implementations of Fast Fourier Transforms often precompute some twiddle factors, trading off storage and computation by generating additional twiddle factors on the fly. The same trade-offs can be made in implementations of the Number Theoretic Transform. In this extended abstract, we provide an approach for generating twiddle factors for a vectorized Number Theoretic Transform algorithm from a set of tables requiring little additional space, using mathematical formalizations. We also discuss an implementation using this approach for a fully homomorphic encryption accelerator.
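For illustration, a toy sketch of how a table of NTT twiddle factors is generated as powers of a primitive n-th root of unity modulo a prime (the modulus and transform length here are illustrative assumptions, not the accelerator's parameters):

    # Toy sketch: tabulating NTT twiddle factors as powers of a primitive
    # n-th root of unity modulo a prime p (parameters are illustrative only).
    def primitive_root_of_unity(n: int, p: int) -> int:
        # Find w with w^n = 1 (mod p) and w^(n/2) = -1 (mod p); requires n | p-1.
        assert (p - 1) % n == 0
        for g in range(2, p):
            w = pow(g, (p - 1) // n, p)
            if pow(w, n // 2, p) == p - 1:      # exact order n (for power-of-two n)
                return w
        raise ValueError("no primitive root found")

    def twiddle_table(n: int, p: int) -> list[int]:
        w = primitive_root_of_unity(n, p)
        return [pow(w, k, p) for k in range(n)]

    # Example: length-8 NTT over Z_17 (8 divides 17 - 1).
    print(twiddle_table(8, 17))
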
Optimizing Quotient Filters using Graveyard Hashing [Outstanding Short Paper Award]
Isabelle A Quaye, Temi Taylor (MIT)
We aim to improve the performance of the Quotient Filter at high load factors. Our Graveyard Filter is a variation of the Quotient Filter which incorporates Graveyard Hashing, a technique that uses tombstones to counteract the effects of primary clustering. We summarize our implementation of the graveyard filter and detail approaches to redistributing tombstones. Evaluating these variations under conditions similar to the original quotient filter paper, we found the performance of the graveyard filter to be competitive for insertion and query operations, with certain redistribution schemes showing stronger performance at high load factors. We discuss potential further improvements, such as using the current load factor to determine the employed redistribution approach.
Comparison of Quantum Simulators for Variational Quantum Search: A Benchmark Study
Mohammadreza Soltaninia, Junpeng Zhan (Alfred University)
Simulating quantum circuits using classical computers can accelerate the development and validation of quantum algorithms. Our newly developed algorithm, variational quantum search (VQS), has shown an exponential advantage over Grover’s algorithm in the range from 5 to 26 qubits, in terms of circuit depth, for searching unstructured databases. We need to further validate the VQS for more than 26 qubits. Numerous simulators have been developed. However, it is not clear which simulator is most suitable for executing VQS with many qubits. To solve this issue, we implement a typical quantum circuit used in VQS on eight mainstream simulators. Results show that the time and memory required by most simulators increase exponentially with the number of qubits and that Pennylane with GPU and Qulacs are the most suitable simulators for executing VQS efficiently. Our results aid researchers in selecting suitable quantum simulators without the need for exhaustive implementation, and we have made our codes available for community contributions.
Application of Natural Language Processing Techniques for Sentiment Analysis of Social Media Content
Ahmed Muntasir Hossain, Sree Veera Venkata Sai Saran Naraharisetti, Mehdi Mekni (Univ. of New Haven)
Digital reputation management systems are in high demand for individuals and businesses seeking to enhance and safeguard their online image. However, current systems like Online Social Network Interactions (OSNI) have limitations in effectiveness, accuracy, cost, and scope. To overcome these challenges, sentiment analysis, a powerful natural language processing technique for identifying opinions, emotions, and attitudes in text data, proves invaluable for digital reputation management. This study proposes the development of an open-source, multi-channel, multi-engine sentiment analysis software called Sentiment Analysis of Social Media (SASM). Specifically designed for social media and digital reputation management, SASM collects and analyzes data from platforms like Twitter, Reddit, and Tumblr. Leveraging sentiment analysis engines such as Microsoft Text Analytics (MTA), IBM Watson Natural Language Understanding (IWNLU), and Google Cloud Natural Language API (GCNLA), SASM filters, aggregates, and assesses sentiment trends. Through a case study on major information technology companies, Google, Amazon, and Microsoft, the feasibility and performance of this multi-channel, multi-engine platform, including GCNLA, MTA, and IWNLU, will be evaluated. 

4-2: Case Studies & Benchmarking 2 Session (12:30-13:45)

Co-Chairs: C. Long & B. Thoelen 

Invited Talk: Observing & Modeling the Earth System: Innovation at NOAA
Frank Indiviglio (CTO, NOAA)
Benchmarking Deep Learning Classifiers for SAR Automatic Target Recognition
Jacob Fein-Ashley, Tian Ye (USC), Rajgopal Kannan (DEVCOM Army Research Lab), Viktor K Prasanna (USC), Carl Busart (DEVCOM Army Research Lab)
Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique of remote-sensing image recognition, which can be supported by deep neural networks. Existing work on SAR ATR mostly focuses on improving the accuracy of target recognition while ignoring the system’s performance in terms of speed and storage, which is critical to real-world applications of SAR ATR. For decision-makers aiming to identify a proper deep learning model to deploy in a SAR ATR system, it is important to understand the performance of different candidate deep learning models and determine the best model accordingly. This paper comprehensively benchmarks several advanced deep learning models for SAR ATR with multiple distinct SAR imagery datasets. Specifically, we train and test five SAR image classifiers based on Residual Neural Networks (ResNet18, ResNet34, ResNet50), a Graph Neural Network (GNN), and a Vision Transformer for Small-Sized Datasets (SS-ViT). We select three datasets (MSTAR, GBSAR, and SynthWakeSAR) that offer heterogeneity. We evaluate and compare the five classifiers concerning their classification accuracy, runtime performance in terms of inference throughput, and analytical performance in terms of number of parameters, number of layers, model size, and number of operations. Experimental results show that the GNN classifier outperforms the others with respect to throughput and latency. However, no clear winner emerges across all of our chosen metrics, and a “one model rules all” case is doubtful in the domain of SAR ATR.
AOCL-Compression — A High Performance Optimized Lossless Data Compression Library
S Biplab Raut (AMD)
Data compression is the process of encoding (or compressing) information using fewer bits than originally present in the data stream or signal. Depending upon whether this process is invertible or not, data compression can be lossless or lossy. While various popular implementations of different lossless data compression methods exist, they are unable to completely meet the performance requirements demanded by the ever-increasing data usage of applications. In this paper, we present a comparative survey of different lossless compression methods and introduce a high-performance compression library called AOCL-Compression, optimized for the x86 architecture in general and AMD’s “Zen”-based processors in particular. This library supports the LZ4, LZ4HC, ZLIB, ZSTD, LZMA, BZIP2, and Snappy compression methods. This paper discusses the design features of the new library framework and the algorithmic optimizations implemented for the different compression methods. Results highlighting the performance benefits of this new library implementation over the reference implementations of the respective compression methods are also presented.
Performance of Graph Neural Networks for Point Cloud Applications
Dhruv Parikh, Bingyi Zhang (USC), Rajgopal Kannan (DEVCOM Army Research Lab), Viktor K Prasanna (USC), Carl Busart (DEVCOM Army Research Lab)

Graph Neural Networks (GNNs) have gained significant momentum recently due to their capability to learn on unstructured graph data. Dynamic GNNs (DGNNs) are the current state-of-the-art for point cloud applications; such applications (viz. autonomous driving) require real-time processing at the edge with tight latency and memory constraints. Conducting performance analysis on such DGNNs, thus, becomes a crucial task to evaluate network suitability.

This paper presents a profiling analysis of EdgeConv-based DGNNs applied to point cloud inputs. We assess their inference performance in terms of end-to-end latency and memory consumption on state-of-the-art CPU and GPU platforms. The EdgeConv layer has two stages: (1) dynamic graph generation using k-Nearest Neighbors (kNN) and (2) node feature update. The addition of dynamic graph generation via kNN in each (EdgeConv) layer enhances network performance compared to networks that work with the same static graph in each layer; such performance enhancement comes, however, at the added computational cost associated with the dynamic graph generation stage (via the kNN algorithm). Understanding its costs is essential for identifying the performance bottleneck and exploring potential avenues for hardware acceleration. To this end, this paper aims to shed light on the performance characteristics of EdgeConv-based DGNNs for point cloud inputs. Our performance analysis on a state-of-the-art EdgeConv network for classification shows that the dynamic graph construction via kNN takes upwards of 95% of network latency on the GPU and almost 90% on the CPU. Moreover, we propose a quasi-Dynamic Graph Neural Network (qDGNN) that halts dynamic graph updates after a specific depth within the network to significantly reduce latency on both CPU and GPU whilst matching the original network’s inference accuracy.
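For illustration, the dynamic graph generation stage that dominates the measured latency is, per EdgeConv layer, a k-nearest-neighbor search over the current node features; a brute-force sketch (not the profiled implementation) makes its O(N^2·d) cost explicit:

    # Brute-force kNN graph construction over node features, the per-layer step
    # that dominates EdgeConv runtime in the profiling above (illustrative sketch).
    import numpy as np

    def knn_graph(features: np.ndarray, k: int) -> np.ndarray:
        # Return an (N, k) array of neighbor indices for each of N feature vectors.
        sq = np.sum(features ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T   # O(N^2 * d)
        np.fill_diagonal(d2, np.inf)              # exclude self-loops
        return np.argsort(d2, axis=1)[:, :k]      # k closest nodes per row

    points = np.random.rand(2048, 3)              # e.g. a small point cloud
    edges = knn_graph(points, k=20)               # rebuilt at every EdgeConv layer
    print(edges.shape)                            # (2048, 20)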

On the Three P’s of Parallel Programming for Heterogeneous Computing: Performance, Productivity, and Portability [Outstanding Student Paper Award]
Atharva M Gondhalekar, Wu-chun Feng (Virginia Tech)

As FPGAs and GPUs continue to make inroads into high-performance computing (HPC), the need for languages and frameworks that offer performance, productivity, and portability across heterogeneous platforms, such as FPGAs and GPUs, continues to grow.

OpenCL and SYCL have emerged as frameworks that offer cross-platform functional portability between FPGAs and GPUs.
While functional portability across a diverse set of platforms is an important feature of portable frameworks, achieving performance portability often requires vendor- and platform-specific optimizations. Achieving performance portability, therefore, comes at the expense of productivity.

This paper presents a quantification of the tradeoffs between performance, portability, and productivity of OpenCL and SYCL. It extends and complements our prior work on quantifying performance-productivity tradeoffs between Verilog and OpenCL for the FPGA.  In addition to evaluating the performance-productivity tradeoffs between OpenCL and SYCL, this work quantifies the performance portability (PP) of OpenCL and SYCL as well as their code convergence (CC), i.e., a measure of productivity across different platforms (e.g., FPGA and GPU).
Using two applications as case studies (i.e., edge detection using the Sobel filter, and graph link prediction using the Jaccard similarity index), we characterize the tradeoffs between performance, portability, and productivity.

Our results show that OpenCL and SYCL offer complementary tradeoffs. While OpenCL delivers better performance portability than SYCL, SYCL offers better code convergence and a 1.6x improvement in source lines of code over OpenCL. 
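For reference, performance portability is commonly quantified as the harmonic mean of an application's efficiency over a set of platforms (Pennycook et al.); whether the paper uses exactly this definition is an assumption, but a minimal sketch of that metric is:

    # Sketch of the commonly used performance-portability metric (harmonic mean
    # of per-platform efficiency); whether the paper uses this exact definition
    # is an assumption here.
    def performance_portability(efficiency: dict[str, float]) -> float:
        # efficiency maps platform name -> achieved/best-known efficiency in (0, 1].
        if any(e <= 0.0 for e in efficiency.values()):
            return 0.0                     # unsupported on some platform -> PP = 0
        return len(efficiency) / sum(1.0 / e for e in efficiency.values())

    print(performance_portability({"FPGA": 0.35, "GPU": 0.80}))   # ~0.49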

4-3: Case Studies & Benchmarking 3 Session (14:15-15:30)

Co-Chairs: S. Gottlieb & B. Sroka

Leveraging Mixed Precision in Exponential Time Integration Methods [Outstanding Paper Award]

Cody J. Balos, Steven Roberts, David J. Gardner (LLNL)
The machine learning explosion has created a prominent trend in modern computer hardware towards low precision floating-point operations. In response, there have been growing efforts to use low and mixed precision in general scientific computing. One important area that has received limited exploration is time integration methods, which are used for solving differential equations that are ubiquitous in science and engineering applications. In this work, we develop two new approaches for leveraging mixed precision in exponential time integration methods. The first approach is based on a reformulation of the exponential Rosenbrock–Euler method allowing for low precision computations in matrix exponentials independent of the particular algorithm for matrix exponentiation. The second approach is based on an inexact and incomplete Arnoldi procedure in Krylov approximation methods for computing matrix exponentials and is agnostic to the chosen integration method. We show that both approaches improve accuracy over approaches that use purely low precision and offer better efficiency than using only double precision when solving an advection-diffusion-reaction partial differential equation.
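For reference, the exponential Rosenbrock–Euler step that the first approach reformulates advances the solution as u_{n+1} = u_n + h·phi_1(h·J_n)·f(u_n) with phi_1(z) = (e^z - 1)/z; a dense toy sketch (not the paper's mixed-precision code) is:

    # Sketch of one exponential Rosenbrock-Euler step: u_{n+1} = u_n + h*phi_1(h*J)*f(u_n),
    # with phi_1(z) = (e^z - 1)/z. Dense, double-precision toy version only.
    import numpy as np
    from scipy.linalg import expm

    def phi1_times(A: np.ndarray, v: np.ndarray) -> np.ndarray:
        # Compute phi_1(A) @ v = A^{-1} (e^A - I) v for a small dense A.
        return np.linalg.solve(A, (expm(A) - np.eye(A.shape[0])) @ v)

    def exp_rosenbrock_euler_step(f, jac, u, h):
        J = jac(u)
        return u + h * phi1_times(h * J, f(u))

    # Toy linear test problem u' = A u; for linear f the step is exact: expm(h*A) @ u0.
    A = np.array([[-2.0, 1.0], [1.0, -2.0]])
    f = lambda u: A @ u
    jac = lambda u: A
    u0 = np.array([1.0, 0.0])
    h = 0.1
    print(exp_rosenbrock_euler_step(f, jac, u0, h), expm(h * A) @ u0)
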
Exploring Challenges Associated with Employing SmartNICs as General-Purpose HPC Accelerators
Brody Williams, Yong Chen (Texas Tech Univ.), Wendy Poole, Steve Poole (LANL)
Galvanized by waning general-purpose CPU performance improvements, large-scale system architectures have shifted towards increasingly heterogeneous designs in recent years. Systems devised under this regime are constructed using the principles of hardware/software codesign and incorporate domain specific accelerators in order to optimize system performance for targeted use cases. In this context, smartNICs represent a class of accelerator devices currently experiencing a resurgence. In the past, power and flexibility limitations largely relegated these devices to performing only tasks associated with low-level networking operations. More recent smartNIC incarnations have addressed many of these design constraints. The full range of capabilities of these new smartNICs, however, is not well understood. Moreover, the suitability of these devices as general-purpose accelerators, particularly within the high performance computing domain, as well as any potential barriers to their more widespread adoption, remain to be seen. In this work, we detail our efforts to explore these open questions.
A Holistic Optimisation – Success Mantra for HPC Performance
Ashish Bisht, Deepika H V, Haribabu Pasupuleti, S A Kumar, S D Sudarsan (CDAC)
High Performance Computing (HPC) aids in solving numerous complex scientific problems such as weather forecasting, drug discovery, physical simulations, molecular modeling, nuclear research, cryptanalysis, and oil and gas exploration. To scale these kinds of applications across the nodes of a cluster or supercomputer, the Message Passing Interface (MPI) is used. The performance of MPI is one of the key aspects of achieving the expected speedup in application performance on an HPC cluster, which in turn depends on the architecture of the servers; hence it is important for HPC practitioners to understand this. Application performance depends not only on an application’s ability to scale but also on how efficiently the application executes on an architecture. In this paper, we compare the performance of applications using MPI on two architectures, based on Intel and ARM. This is important as the former is a predominant architecture and the latter is one of the upcoming architectures among supercomputers. We have used benchmarks that cover a wide variety of applications belonging to the Berkeley dwarfs. This helps analyze the comprehensive behavior of HPC applications on both architectures. Finally, we present our observations in terms of computation, communication, data size, and functional behavior of the applications.
pPython Performance Study
Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Anna Klein, Peter Michaleas (MIT Lincoln Laboratory), Lauren Milechin (MIT), Guillermo Morales, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Siddharth Samsi, Charles Yee, Jeremy Kepner (MIT Lincoln Laboratory)
pPython seeks to provide a parallel capability that provides good speed-up without sacrificing the ease of programming in Python by implementing partitioned global array semantics (PGAS) on top of a simple file-based messaging library (PythonMPI) in pure Python. pPython follows a SPMD (single program multiple data) model of computation. pPython runs on a single node (e.g., a laptop) running Windows, Linux, or MacOS operating systems, or on any combination of heterogeneous systems that support Python, including on a cluster through a Slurm scheduler interface so that pPython can be executed in a massively parallel computing environment. It is interesting to see what performance pPython can achieve compared to traditional socket-based MPI communication because of its unique file-based messaging implementation. In this paper, we present the point-to-point and collective communication performance of pPython and compare it with that obtained using mpi4py with OpenMPI. For large messages, pPython demonstrates performance comparable to mpi4py.
High-Level Frameworks: Effect on Transformer Inference Time and Power on Embedded GPU Devices
Marika E Schubert (Univ. of Pittsburgh), David Langerman (NSF Center for Space, High-Performance, and Resilient Computing), Alan George (Univ. of Pittsburgh)
Developing software for machine- and deep-learning (ML and DL) workloads is often a daunting task for individuals with minimal programming experience or for organizations with limited engineering capacity. ML frameworks address these issues by providing a high-level API able to perform otherwise complex tasks with less engineering time. This high level of abstraction can reduce and hide many of the challenges that are induced by unclean datasets, complicated pre/postprocessing pipelines, and low-level dependencies like CUDA. This high-level approach encourages model portability and can dramatically increase design iteration speed, as well as providing model speedup in some cases. This research demonstrates that these high-level ML frameworks are also more performant out-of-the-box on embedded systems than their pure PyTorch reference implementations, likely due to their myriad optimizations related to data movement and memory management. In this research, we benchmark a state-of-the-art transcription model, wav2vec2, and compare performance across different frameworks: the reference implementation from the Fairseq framework and the two higher-level frameworks HuggingFace and Lightning Flash.
Overall, we observe that both Lightning Flash and HuggingFace are substantially faster than the original unoptimized PyTorch model. In general, these models ran between 1.8x and 2.0x faster than the base PyTorch implementation on the embedded NVIDIA Jetson platforms targeted. As a secondary result, we also observe the high-level frameworks to be more power efficient for the same computation. 
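For illustration, the high-level path being benchmarked reduces, in HuggingFace's case, to a few lines (a minimal sketch; the checkpoint name and audio file are assumptions, not necessarily those profiled):

    # Sketch of high-level transcription with HuggingFace transformers; the
    # checkpoint name and audio file are assumptions, not necessarily the ones benchmarked.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
    result = asr("sample.wav")          # framework handles loading, resampling, decoding
    print(result["text"])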

4-4: General Purpose GPU Computing Session (15:45-17:00)

Co-Chairs: S. Gottlieb & B. Sroka

Acceleration of Synthetic Aperture Radar for On-board Space Systems
Marc Solé, Ivan Rodriguez (Universitat Politècnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC)), David Steenari (ESA), Leonidas Kosmidis (Barcelona Supercomputing Center (BSC))

There is a recent trend in modern space systems to move on board the satellite processing that, until now, was performed on the ground after the data was transmitted down. Synthetic Aperture Radar (SAR) is an example of such processing. However, such computationally intensive processing requires high-performance hardware. In this paper we present the CPU and GPU acceleration of a SAR processing application, part of ESA’s open-source benchmarking suite OBPMark.

We benchmark several embedded multicore and GPU platforms which are promising candidates for future on-board systems. Our results show that both embedded multicores and especially GPUs can provide significant speedups in this type of processing and achieve performance levels similar to those of high-performance ground stations.

FAST-CON: a Multi-source Approach for Efficient ST Connectivity on Sparse Graphs
Leonardo Fraccaroli, Rosalba Giugno (Univ. of Verona), Samuele Cancellieri (Univ. Trento), Federico Busato (NVIDIA), Nicola Bombieri (Univ. of Verona)
S-T connectivity is a decision problem asking, for vertices s and t in a graph, whether t is reachable from s. Many parallel solutions for GPUs have been proposed in the literature to solve the problem. The most efficient, which rely on two concurrent BFSs starting from s and t, have shown limitations when applied to sparse graphs (i.e., graphs with low average degree). In this paper we present FAST-CON, an alternative solution based on multi-source BFS and an adjacency matrix to better exploit the massive parallelism of GPU architectures with any type of graph. The results show that FAST-CON achieves speedups of up to one order of magnitude for dense graphs and up to two orders of magnitude for sparse graphs compared to state-of-the-art solutions.
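For illustration, the frontier-based idea that FAST-CON builds on can be sketched sequentially for an undirected graph: BFS frontiers grown from s and t report connectivity as soon as they touch (a baseline sketch only, not FAST-CON's multi-source GPU kernel):

    # Sequential sketch of s-t connectivity via two alternating BFS frontiers on an
    # undirected graph; FAST-CON's multi-source GPU approach is not reproduced here.
    from collections import deque

    def st_connected(adj: dict[int, list[int]], s: int, t: int) -> bool:
        if s == t:
            return True
        seen_s, seen_t = {s}, {t}
        frontier_s, frontier_t = deque([s]), deque([t])
        while frontier_s and frontier_t:
            # Expand the smaller frontier first to limit work.
            if len(frontier_s) <= len(frontier_t):
                frontier, seen, other = frontier_s, seen_s, seen_t
            else:
                frontier, seen, other = frontier_t, seen_t, seen_s
            for _ in range(len(frontier)):
                u = frontier.popleft()
                for v in adj.get(u, []):
                    if v in other:          # the two frontiers touched: s reaches t
                        return True
                    if v not in seen:
                        seen.add(v)
                        frontier.append(v)
        return False

    graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
    print(st_connected(graph, 0, 2), st_connected(graph, 0, 3))   # True False
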
A GPU Parallel Algorithm for Finding a Negative Subset Disjoint Cycle in a Graph
Piotr Sielski, Akif Coerduek, Hugo Linsenmaier, Alex Fender (NVIDIA)
Many problems in combinatorial optimization rely on local search heuristics attempting to swap elements between subsets. The best results are achieved by combining multiple exchanges, where the multi-subset move can be described as a subset disjoint cycle. Finding a valid cycle becomes the core part of the search routine, and the valid cycle of minimum weight corresponds to the best possible move. The problem complexity has been growing at a rapid pace in recent years, hitting the limits of both exact and heuristic approaches. We propose a parallel algorithm for finding subset disjoint negative cycles of minimum weight and a GPU accelerated implementation. The approach is exact if the memory limit is sufficient and heuristic otherwise. A parallel hashmap structure is provided to meet the specific needs of the algorithm. The computational experiments demonstrate up to 60x speedup with respect to the sequential alternative. We showcase the practical benefits of our algorithm for a time sensitive variant of the vehicle routing problem.
Build Energy-Efficient GPU Computing Environment for Machine Learning Algorithms with Register File Packing Technique
Xin Wang (Virginia Commonwealth Univ.), Wei Zhang (Univ. of Louisville)
Popular machine learning algorithms built around large numbers of matrix multiplications can be parallelized well, and GPUs are a desirable computing environment for these applications. However, energy consumption on GPUs has become a major concern that prevents further performance increases for machine learning algorithms. In this work, we aim to build an energy-efficient GPU computing environment for popular machine learning algorithms with a GPU register file (RF) management technique named narrow-width operand packing. First, we observed that the RF occupancies of modern machine learning algorithms are relatively low, leaving much of the GPU’s RF leakage energy wasted. Second, we found that the data maintained in the RF contains a large fraction of narrow-width operands for machine learning algorithms. We propose to pack multiple narrow-width operands into a single register. After register packing, RF occupancy can be further reduced. Finally, we attempt to save both static and dynamic energy consumption of the GPU’s RF by smartly shutting down the unused portion of the RF. We evaluated the energy reduction of this GPU RF management technique with five state-of-the-art machine learning algorithms. The experimental results show that the register packing techniques reduce total GPU energy consumption by up to 14.14% and by 10.71% on average.
Multi-Sweep-Line Algorithm for Rectangle Union on GPU and Its Application for VLSI Density Calculation
Chang-Hung Wu, Che-Rung Lee (National Tsing Hua Univ.)
Graphics Processing Units (GPUs) have been widely used for computationally intensive applications owing to their massive number of computing cores. However, only a few algorithms can fully utilize the computational power of GPUs, because doing so requires not only a sufficient degree of parallelism but also synergetic control of computation and data access at the finest granularity. In this paper, we propose the Multi-Sweep-Line (MSL) algorithm on GPU for VLSI layout density calculation, which computes the union area of components in a layout. The shapes of most components are rectilinear rectangles. The MSL algorithm divides the input layout hierarchically into windows, slabs, and sweep-line regions to expose a large degree of parallelism. In addition, to overcome the memory pressure of the GPU, tasks are partitioned into batches based on estimated memory usage. Optimization techniques, including fast segmented sort, reduced atomic instructions, and load balancing, are also applied to further improve performance. The experimental results show that our MSL implementation can achieve 75 to 160 times speedup over the CPU version using one NVIDIA GTX 1080 Ti.
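For illustration, the quantity MSL parallelizes, the union area of axis-aligned rectangles, has a simple sequential reference via coordinate compression and per-strip interval union (an O(n^2 log n) sketch, not the GPU algorithm):

    # Simple sequential reference for the union area of axis-aligned rectangles
    # (coordinate compression + per-strip y-interval union). Not the GPU MSL
    # algorithm; only defines the quantity it computes.
    def union_area(rects: list[tuple[float, float, float, float]]) -> float:
        # rects are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
        xs = sorted({x for r in rects for x in (r[0], r[2])})
        total = 0.0
        for x_left, x_right in zip(xs, xs[1:]):
            # Union length of y-intervals of rectangles spanning this x-strip.
            intervals = sorted((r[1], r[3]) for r in rects if r[0] <= x_left and r[2] >= x_right)
            covered, cur_lo, cur_hi = 0.0, None, None
            for lo, hi in intervals:
                if cur_hi is None or lo > cur_hi:
                    if cur_hi is not None:
                        covered += cur_hi - cur_lo
                    cur_lo, cur_hi = lo, hi
                else:
                    cur_hi = max(cur_hi, hi)
            if cur_hi is not None:
                covered += cur_hi - cur_lo
            total += covered * (x_right - x_left)
        return total

    print(union_area([(0, 0, 2, 2), (1, 1, 3, 3)]))   # 7.0 (4 + 4 - 1 overlap)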

4-S1: AI / Machine Learning 6 Special (17:30-19:30)

Co-Chairs: H. Badawy & J. Mullen 

Scalable and Portable Pipelines for Predicting 3D Protein Structures on Standalone and HPC Systems
Adam Michaleas, Darrell O Ricke (MIT Lincoln Laboratory)
Advances in machine learning techniques are enabling improved protein structure prediction solutions. These solutions include AlphaFold, ESMFold, OmegaFold, and others. Configuring each solution with associated software dependencies and data files is a barrier for many scientists. Singularity containers were developed for AlphaFold, ESMFold, and OmegaFold to enable parallelization of these solutions on high performance computing (HPC) systems. These containers also enable portability to cloud-based platforms. These folding prediction solutions were characterized for performance with a series of human proteins with increasing protein sequence lengths. The current solutions all encounter scaling limitations by protein length due to memory usage. The Singularity containers for AlphaFold, ESMFold, and OmegaFold are provided as open source.
Lumpy Skin Disease Detection using GUI-based Deep Learning Model in Cattle
Manjunath Naikar, Anupama S Nandeppanavar, Medha Kudari (KLE Inst. of Tech.)
Cattle are susceptible to the extremely contagious viral infection known as Lumpy Skin Disease (LSD), which results in skin lesions and substantial financial losses for the livestock industry. VGG16, VGG19, and DenseNet121 are a few examples of Convolutional Neural Networks (CNNs) that have shown success in correctly diagnosing LSD lesions. Transfer learning is frequently used to optimize previously trained models on certain datasets. Model performance is measured using evaluation metrics such as accuracy, precision, recall, and F1-score. In comparison to VGG16 and VGG19, the CNN and DenseNet121 provide a balance between accuracy and processing economy. A CNN design is chosen based on the required accuracy levels and available processing resources. Accurately classifying LSD lesions is important in order to implement effective control measures and lessen the financial burden on cattle populations, as it enables early detection and intervention. The results of this study improve LSD detection.
Machine Learning at the Edge Using Neural Network Processors
Edwin Lee, Michael A Parker, Michael Cervantes, Benjamin Plotner (Raytheon Technologies)
Increasingly, power-, space-, and cooling-constrained embedded computing assets need to run machine learning applications to make autonomous, low-latency decisions within operational environments. Where GPUs, CPUs, and FPGAs are not suitable for these constraints, custom ASICs, with their non-standard development processes and frameworks, are the best available option for bringing the power of machine learning to the edge. This paper discusses a neural network processing architecture and why it surpasses all size, power, throughput, and weight requirements while maintaining performance and allowing interoperability with commercial vendors’ licensed comprehensive development environments and machine learning libraries. We compare this neural network processor architecture with existing AI and machine learning platforms that leverage common architectures such as GPUs, CPUs, and FPGAs. The plug-and-play nature of the neural network processor architecture is optimized for applications such as EO-IR (Electro-optical/Infrared) sensor and radar systems without respinning the hardware, reducing costs. This paper explores best practices for deploying common AI and machine learning models for object detection, super resolution, and natural language processing on a neural network processor at the edge. The neural network processor offers an alternative for deploying autonomous machine learning at the edge that brings the power of a robust development ecosystem together with an architecture favorable for power- and space-constrained use cases.
Modeling and Analyzing Wind Velocity at Entrance Doors to Avoid Accidents
Abu Asaduzzaman, Luke Mercer, Md Raihan Uddin, Yoel Woldeyes (Wichita State Univ.)
There are safety threats due to unexpected, uncontrolled, sudden opening and closing of entrance doors. This work aims to develop a computer-simulated wind velocity model to study the doors’ risky behavior by analyzing the relationship between wind velocity and the corresponding door movements. We develop a microcontroller-based system to detect when a door is opened and to record the wind velocity and door-open distance when the door is opened and closed. This is done using an anemometer to measure the wind velocity, a magnetic door switch to detect when the door opens, an ultrasonic sensor to measure the door distance, and the Arduino timer to calculate the time the door was open. The experiments are conducted inside a room, where wind speed and maximum door-open distance can be controlled. The preliminary results show that the door-open speed and distance increase significantly with increased wind speed. The proposed model can be extended as a potential remedy to dangerous threats to buildings and building occupants.
Bridge Crack Detection using Horse Herd Optimization Algorithm
Rishitha Ponnuru, Anuradha Govada, Uppu Venkata Sai, Dyutik Chaudhary Suryadevara (V.R Siddhatha Engr. Coll.)
Detection of bridge cracks is an important task for ensuring the safety and structural integrity of bridges. Cracks in bridges can lead to serious safety hazards and require frequent inspections to prevent accidents. However, manual inspections can be time-consuming and expensive. Automated detection systems can help in identifying cracks and reducing the cost and time associated with manual inspections. In previous years, several methods have been proposed for automating the crack detection procedure using machine learning and optimization techniques. In this project, a novel approach to detect bridge cracks is described using the Horse Herd Optimization Algorithm (HHOA). The HHOA is a population-based meta-heuristic algorithm motivated by the behaviour of horses in a herd. It is a relatively new algorithm that has shown promising results in several optimization problems. The Convolutional Neural Network (CNN) is a popular deep learning technique used for image classification, object detection, and segmentation tasks. The basic idea behind a CNN is to extract important features from an image using convolutional and pooling layers, followed by fully connected layers to classify the image. Here, an algorithm that detects cracks in bridge structures efficiently and accurately using HHOA with a CNN achieves an accuracy of 97%.
Performance Analysis of Graph Neural Networks for Manufacturing Feature Recognition Problem
Igor Betkier, Mateusz Oszczypała (Military Univ. of Technology), Janusz Pobożniak (Cracow Univ. of Tech.), Sergiusz Sobieski (TIZ Implements)
This scientific paper presents a comprehensive performance analysis of Graph Neural Networks (GNNs) for the task of manufacturing feature recognition. The manufacturing industry heavily relies on accurate identification and classification of various features in order to ensure efficient production processes and quality control. Traditional methods for feature recognition often suffer from limitations in handling complex manufacturing datasets with intricate interdependencies. In this study, we investigate the effectiveness of GNNs in addressing these challenges by leveraging their ability to capture and model graph-structured data. We propose a novel framework that employs GNNs to recognize manufacturing features based on their spatial and relational characteristics. Extensive experiments are conducted using real-world manufacturing datasets, and the results demonstrate the superior performance of GNNs compared to traditional approaches. Furthermore, we analyze the impact of different GNN architectures, hyperparameters, and training strategies on the recognition accuracy and computational efficiency. Our findings shed light on the potential of GNNs as a powerful tool for manufacturing feature recognition, providing valuable insights for researchers and practitioners in the field. 

4-S2: Graph Challenge Special (17:30-19:30)

Co-Chairs: J. Kepner & A. Reuther 

Adaptive Sparse Deep Neural Network Inference on Resource-Constrained Cost-Efficient GPUs [Champion]
Ming Dun, Xu Zhang, Huawei Cao, Yuan Zhang, Junying Huang, Xiaochun Ye (Inst. of Computing Tech, CAS)
Sparse Deep Neural Networks (SpDNNs) have gained great popularity and been widely applied in various machine learning areas. Compared to traditional dense DNNs, the unpredictable irregularity and sparsity in the sparse weight matrices of SpDNNs make them difficult to parallelize efficiently. Moreover, most of the recent advanced efforts to optimize SpDNNs are based on high-end GPUs like the NVIDIA V100, which may not be affordable to individuals and smaller research groups. However, migrating SpDNNs to cost-efficient but resource-constrained GPUs confronts enormous challenges, including limitations in both memory and computing resources, as well as tiresome hyper-parameter tuning in batch parallelism. In this paper, we accelerate SpDNNs on GPUs with more restricted resources by exploiting their memory and computing resources. On one hand, we design an adaptive memory-aware data partition scheme to reduce memory consumption automatically. On the other hand, we propose a Tensor core/CUDA core fusion mechanism to efficiently utilize the heterogeneous computing resources of modern GPU architectures. To the best of our knowledge, we are the first to improve SpDNN performance through adaptive memory tuning and heterogeneous computing core concurrency. We compare our implementation with the state-of-the-art previous champions, and the results demonstrate that our work achieves the highest speedups of 1.35× and 1.39× compared to the 2022 champion S&Z on single and multiple NVIDIA Tesla T4 GPUs, respectively. Moreover, our work can reach similar or even better throughput compared to the 2020 champion H&P on 6 V100 GPUs with only 4 T4 GPUs.
GLARE: Accelerating Sparse DNN Inference Kernels with Global Memory Access Reduction [Innovation Award]
Shui Jiang (Chinese Univ. of Hong Kong), Tsung-Wei Huang (Univ. of Wisconsin), Tsung-Yi Ho (Chinese Univ. of Hong Kong)
Sparse deep neural networks (DNNs) leverage sparse representations to achieve faster inference and lower memory footprint. However, deploying sparse DNNs comes with challenges, such as irregular memory access patterns, workload imbalance, etc. To address these challenges, IEEE HPEC has organized the Sparse DNN Graph Challenge (SDGC), seeking new methods from the high-performance computing community. For many years, SDGC has yielded innovative works on accelerating sparse DNN inference. However, none of them have identified redundant global memory access that contributes to significant runtime overhead. To overcome this challenge, we propose GLARE, a framework that can assist existing sparse inference kernels in effectively reducing redundant global memory access. We have applied GLARE to previous SDGC champions and a recent sparse inference engine SNICIT. Evaluated on SDGC benchmarks, we demonstrate the promising performance of GLARE and its generalizability in accelerating existing sparse inference kernels, for instance, up to 31.56× speed-up over one of the previous SDGC champions.
An Integrated Approach to Accelerating Stochastic Block Partitioning [Champion]
Frank D Wanye (Virginia Tech), Vitaliy Gleyzer, Edward Kao (MIT Lincoln Laboratory), Wu-chun Feng (Virginia Tech)
Community detection, or graph partitioning, is a fundamental problem in graph analytics with applications in a wide range of domains including bioinformatics, social media analysis, and anomaly detection. Stochastic block partitioning (SBP) is a community detection algorithm based on sequential Bayesian inference. SBP is highly accurate even on graphs with a complex community structure. However, it does not scale well to large real-world graphs that can contain upwards of a million vertices due to its sequential nature. Approximate methods that break computational dependencies improve the scalability of SBP via parallelization and data reduction. However, these relaxations can lead to low accuracy on graphs with complex community structure. In this paper, we introduce additional synchronization steps through vertex-level data batching to improve the accuracy of such methods. We then leverage batching to develop a high-performance parallel approach that improves the scalability of SBP while maintaining accuracy. Our approach is the first to integrate data reduction, shared-memory parallelization, and distributed computation, thus efficiently utilizing distributed computing resources to accelerate SBP. On a one-million vertex graph processed on 64 compute nodes with 128 cores each, our approach delivers a speedup of 322X over the sequential baseline and 6.8X over the distributed-only implementation. To the best of our knowledge, this Graph Challenge submission is the highest-performing SBP implementation to date and the first to process the one-million vertex graph using SBP.
uSAP: An Ultra-Fast Stochastic Graph Partitioner [Innovation Award]
Chih-Chun Chang, Tsung-Wei Huang (Univ. of Wisconsin)
Stochastic graph partitioning (SGP) plays a crucial role in many real-world applications, such as social network analysis and recommendation systems. Unlike the typical combinatorial graph partitioning problem, SGP presents unique computational difficulties due to time-consuming sampling processes. To address this challenge, HPEC recently launched the Stochastic Graph Partitioning Challenge (SGPC) to seek novel solutions from the high-performance computing community. Despite the many SGP algorithms proposed over the last few years, their speed-ups are not remarkable because of various algorithmic limitations. Consequently, we propose uSAP, an ultra-fast stochastic graph partitioner, to largely enhance SGP performance. uSAP introduces a novel strongly connected component-based initial block merging strategy to significantly reduce the number of partitioning iterations. To further improve runtime and memory performance, uSAP adopts a dynamic batch parallel nodal block assignment algorithm and a dynamic matrix representation. We have evaluated uSAP on the 2022 official HPEC SGPC benchmarks. The results demonstrate the promising performance of uSAP on graphs of different sizes and complexities. For example, uSAP achieves a 129.4× speed-up over the latest champion on a graph of 50K nodes.
RaftGP: Random Fast Graph Partitioning [Innovation Award]
Yu Gao (Huawei Technologies), Meng Qin (HKUST), Yibin Ding, Li Zeng, Chaorui Zhang, Weixi Zhang, Wei Han, Rongqian Zhao, Bo Bai (Huawei Technologies)
Graph partitioning (GP), a.k.a. community detection, is a classic problem that divides the node set of a graph into densely-connected blocks. Following prior work on the IEEE HPEC Graph Challenge benchmark and recent advances in graph machine learning, we propose a novel RAndom FasT Graph Partitioning (RaftGP) method based on an efficient graph embedding scheme. It uses the Gaussian random projection to extract community-preserving features from classic GP objectives. These features are fed into a graph neural network (GNN) to derive low-dimensional node embeddings. Surprisingly, our experiments demonstrate that a randomly initialized GNN even without training is enough for RaftGP to derive informative community-preserving embeddings and support high-quality GP. To enable the derived embeddings to tackle GP, we introduce a hierarchical model selection algorithm that simultaneously determines the number of blocks and the corresponding GP result. We evaluate RaftGP on the Graph Challenge benchmark and compare the performance with five baselines, where our method can achieve a better trade-off between quality and efficiency. In particular, compared to the baseline algorithm of the IEEE HPEC Graph Challenge, our method is 6.68x to 23.9x faster on graphs with 1E3 to 5E4 nodes and at least 64.5x faster on larger (1E5 node) graphs on which the baseline takes more than 1E4 seconds. Our method achieves better accuracy on all test cases. We also develop a new graph generator to address some limitations of the original generator in the benchmark.
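For illustration, the Gaussian random projection step described above can be sketched in a few lines (the choice of projecting the raw adjacency matrix and the dimensions are assumptions, not RaftGP's exact pipeline):

    # Sketch of the Gaussian random projection step: project a (toy) adjacency
    # matrix onto a low-dimensional random basis. The matrix choice and sizes
    # here are assumptions, not RaftGP's exact pipeline.
    import numpy as np
    from scipy.sparse import random as sparse_random

    rng = np.random.default_rng(0)
    n, d = 10_000, 128
    A = sparse_random(n, n, density=1e-3, random_state=0, format="csr")   # toy graph
    A = A + A.T                                    # symmetrize

    G = rng.standard_normal((n, d)) / np.sqrt(d)   # Gaussian projection matrix
    features = A @ G                               # community-preserving node features
    print(features.shape)                          # (10000, 128), fed to an untrained GNN
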
Decontentioned Stochastic Block Partition [Honorable Mention]
Ahsen J Uppal (George Washington Univ.), Thomas Rolinger (Laboratory for Physical Sciences), H. Howie Huang (George Washington Univ.)
Stochastic block partitioning (SBP) is an important community detection algorithm that can achieve good accuracy, even on graphs with irregular structure. But SBP is difficult to parallelize because updates to its internal state are interdependent. This inherently serial nature limits its scalability and applicability to large, real-world graph data. In this work we address this challenge by introducing a Decontentioned approach to reduce the write contention on its internal state data. We apply a lock-free compressed data structure to handle writes, and split the parallel nodal movement procedure into a read phase for generating proposals and a write phase for updating shared state. Finally, we find an optimal batch size to balance parallelism in worker threads against the overhead and convergence rate of the algorithm. Compared to our previous approach that buffers and combines updates to shared state, Decontentioned scales to the maximum number of CPU cores, where it yields a speedup of up to 5.38x on a 100k node input graph, and allows parallel processing of larger graphs due to its more efficient memory usage.
SMOG: Accelerating Subgraph Matching on GPUs [Champion]
Zhibin Wang, Ziheng Meng (Nanjing Univ.), Xue Li (Alibaba), Xi Lin (NJU), Long Zheng (Huazhong Univ. of Science and Tech.), Chen Tian, Sheng Zhong (Nanjing Univ.)
Subgraph matching is a crucial problem in graph theory with diverse applications in fields such as bioinformatics, social networks, and recommendation systems. Accelerating subgraph matching can be greatly facilitated by GPUs, which offer exceptional parallelism and high memory bandwidth. By leveraging the power of multiple GPU cards, subgraph matching can be scaled to achieve unprecedented levels of performance.
In this paper, we propose SMOG, an abbreviation for Subgraph Matching On Multi-Card GPUs. It is a general, high-performance and scalable subgraph matching system that utilizes multi-card GPUs. To address the issue of duplication resulting from subgraph automorphism, SMOG introduces a two-step approach. Firstly, it analyzes the symmetry within the subgraph. Then, it adaptively adjusts the graph preprocessing and generates subgraph-aware GPU codes tailored to the given subgraph. Furthermore, SMOG leverages multi-level parallelism by designing the specific strategy for each level, enabling it to scale from 1 to 1,024 GPU cards, resulting in an extraordinary 553× speedup.
We evaluate SMOG on various subgraph queries and datasets. The experimental results demonstrate that SMOG outperforms the triangle-specific system TRUST with an average speedup of 2.94×. It also performs significantly better than the subgraph matching system RPS by 203.55× and the graph processing system Gunrock by 35,455.52× on average.
Triangle Counting Through Cover-Edges [Student Innovation Award]
David A Bader, Fuhuan Li, Anya Ganeshan, Ahmet Gundogdu, Jason Lew, Oliver Alvarado Rodriguez, Zhihui Du (New Jersey Inst. of Tech.)
Counting and finding triangles in graphs is often used in real-world analytics to characterize cohesiveness and identify communities in graphs. In this paper, we propose the novel concept of a cover-edge set that can be used to find triangles more efficiently. We use a breadth-first search (BFS) to quickly generate a compact cover-edge set. Novel sequential and parallel triangle counting algorithms are presented that employ cover-edge sets. The sequential algorithm avoids unnecessary triangle-checking operations, and the parallel algorithm is communication-efficient. The parallel algorithm can asymptotically reduce communication on massive graphs such as those from real social networks and synthetic graphs from the Graph500 Benchmark. In our estimate from massive-scale Graph500 graphs, our new parallel algorithm can reduce the communication on a scale 36 graph by 1156x and on a scale 42 graph by 2368x.
Fast Triangle Counting [Innovation Award]
David A Bader (New Jersey Inst. of Tech.)
Listing and counting triangles in graphs is a key algorithmic kernel for network analyses including community detection, clustering coefficients, k-trusses, and triangle centrality. We design and implement a new serial algorithm for triangle counting that performs competitively with the fastest previous approaches on both real and synthetic graphs, such as those from the Graph500 Benchmark and the MIT/Amazon/IEEE Graph Challenge. The experimental results use the recently-launched Intel Xeon Platinum 8480+ and CPU Max 9480 processors.
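For illustration, the kernel both triangle-counting entries accelerate can be stated in a few lines as a set-intersection count over edges (the classic baseline, not the cover-edge or the paper's serial algorithm):

    # Classic set-intersection triangle counting: the baseline kernel that the
    # cover-edge and serial algorithms above improve upon (illustrative only).
    def triangle_count(edges: list[tuple[int, int]]) -> int:
        adj: dict[int, set[int]] = {}
        for u, v in edges:
            if u == v:
                continue
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        count = 0
        # Consider each undirected edge exactly once; each triangle is then
        # counted once per each of its three edges, so divide by 3.
        for u, v in {(min(u, v), max(u, v)) for u, v in edges if u != v}:
            count += len(adj[u] & adj[v])
        return count // 3

    # A 4-clique contains exactly 4 triangles.
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(triangle_count(k4))   # 4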