28th Annual
IEEE High Performance Extreme Computing Virtual Conference
23 - 27 September 2024

All times are EDT (UTC/GMT -04 hours)

Speaker/Presenting Author in Italics

 

Tuesday, September 24

 

2-K: Keynote Session (10:30-11:00)

Co-Chairs: J. Kepner & A. Reuther

Keynote Talk: Energy Efficiency Scaling for 2 Decades (EES2) Roadmap for Computing

Tina Kaarsberg (Dept. of Energy)

2-1: Case Studies, Benchmarking, and Tools 1 Session (11:00-12:15)

Co-Chairs: B. Raut & C. Byun

A Neural Network Based GCC Cost Model for Faster Compiler Tuning 

Hafsah Shahzad (Boston Univ.), Ahmed Sanaullah, Sanjay Arora, Ulrich Drepper (Red Hat), Martin Herbordt (Boston Univ.)

Machine learning models have been found to be effective in predicting compiler heuristics, but are limited by their very long training times. This is because computing the impact of transformations on a program, e.g., through measured performance values, involves invoking the downstream compiler. One way to circumvent this cost-computation bottleneck is to devise accurate cost models that can be trained to predict the target metric. In this paper, we develop a neural-network-based cost function that can accurately predict binary code size for GCC-based compilation. The input to the model is a comprehensive list of features extracted offline from GCC’s intermediate tree representation, together with the compiler flags to be evaluated by a compiler-tuning workload. To extract the code features, we have built a GIMPLE analysis framework that can generate feature sets from intermediate representations at different stages of the compilation process. Our results show that the cost model has a mean absolute percentage error of just 8% and a Spearman correlation of 0.98 between predicted and measured binary size of the test applications. We also demonstrate that compiler pass selection for feature extraction significantly benefits the accuracy of the model. Finally, we show that the cost model can reduce metric evaluation time by multiple orders of magnitude.
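A minimal sketch of this style of learned cost model, using scikit-learn with synthetic stand-in features rather than the authors' GIMPLE-derived features and GCC build data (all feature counts, values, and hyperparameters below are illustrative assumptions):

# Illustrative sketch only: a small regression model standing in for the paper's cost model.
# Feature vectors are synthetic stand-ins for GIMPLE-derived code features concatenated
# with a binary encoding of candidate GCC flags.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_samples, n_code_feats, n_flags = 2000, 64, 32

X_code = rng.normal(size=(n_samples, n_code_feats))      # stand-in code features
X_flags = rng.integers(0, 2, size=(n_samples, n_flags))  # stand-in flag settings
X = np.hstack([X_code, X_flags])
# Synthetic "binary size" target; in the paper this comes from real GCC builds.
y = 50_000 + X_code[:, :8].sum(axis=1) * 1_000 + X_flags[:, :4].sum(axis=1) * 5_000

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
mape = np.mean(np.abs((pred - y_te) / y_te)) * 100       # evaluation metrics of the same
rho, _ = spearmanr(pred, y_te)                           # kind reported in the abstract
print(f"MAPE: {mape:.1f}%  Spearman: {rho:.2f}")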

HENNC: Hardware Engine for Artificial Neural Network-Based Chaotic Oscillators 

 

Mobin Vaziri (Polytechnique Montréal), Shervin Vakili (Institut National de la Recherche Scientifique), Mohammad Mehdi Rahimifar (Interdisciplinary Institute for Technological Innovation), Pierre Langlois (Polytechnique Montréal)
This paper introduces a framework for automatically generating hardware cores for Artificial Neural Network (ANN)-based chaotic oscillators. The framework starts with training a model to approximate a chaotic system and subsequently conducts design space exploration to identify potential hardware architectures for implementation. From the selected solution, the framework generates synthesizable High-Level Synthesis (HLS) code and a validation testbench tailored for Field-Programmable Gate Arrays (FPGAs). The proposed framework enables a rapid hardware design process for candidate architectures, offering superior hardware cost efficiency and throughput compared to manually designed solutions. The source code is available on GitHub.

A Graph-Based Algorithm for Optimizing GCC Compiler Flag Settings 

Reza Sajjadinasab (Boston Univ.), Sanjay Arora, Ulrich Drepper, Ahmed Sanaullah (Red Hat), Martin Herbordt (Boston Univ.)
Compiler tuning through external mechanisms — such as source code modifications, e.g., with pragmas, and adjusting compiler flags — is well-explored. Many researchers have shown significant performance improvement through different approaches, including heuristics and machine learning. Most of these approaches, however, require a few hundred iterations to converge towards an optimal answer. A number of studies have addressed this problem by reducing the number of iterations required, but we find that further improvements are still possible. In this work, we explore the optimization of GCC compiler flag settings with the goal of faster convergence. We find as an ancillary result that the effectiveness of the compilation itself is sometimes improved with respect to both code size reduction and application program execution time. The proposed graph-based approach can reduce the code size to 90% of the convergence point with 15 compilations fewer on average, i.e., the solution found after running Opentuner for many compilations. It also can reduce the execution time over the -O3 level for different versions of the Smith-Waterman and Bubble Sort by 1.2x and 5x, respectively.

Analyzing an In-line Compression on the Matrix Matrix Multiplication Kernel  

 

Steven Platt, Jon C Calhoun (Clemson Univ.)
High-performance computing (HPC) has enabled advancements in computation speed and resource cost by utilizing all available server resources and using parallelization for speedup. This computation scheme encourages simulation model development, massive data collection, and AI computation models, all of which store and compute on massive amounts of data. Data compression has enhanced the performance of storing and transferring this HPC application data to enable acceleration, but the benefits of data compression can also be extended to the active allocated memory used by the application. In-line compression is a compression method that keeps the application memory compressed in allocated memory, decompressing data as it is needed by the algorithm. The actively decompressed data size in allocated memory can be limited by grouping and compressing the data into blocks informed by the computational kernel’s data access patterns and a selected compressor. This research explores the effect of factors such as block size and compressor choice on the runtime and memory usage of the matrix multiplication kernel. Matrix multiplication (MM) is a fundamental HPC algorithm that exemplifies the memory access patterns and use cases present in other HPC kernels. MM kernels provide a baseline for evaluating in-line compression due to MM’s row-based data access patterns and the use of several matrices to compute the resulting matrix. Trade-offs between memory size and compression runtime necessitate tuned parameters for each in-line-compression-enabled kernel. The results of this research demonstrate essential parameter trends, trade-offs, and the importance of locality in the kernel’s data access patterns.
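A simplified sketch of the blocked in-line compression idea, with zlib and NumPy standing in for the compressor choices and kernel studied in the paper (the block size and compressor here are illustrative assumptions, not the paper's configuration):

# Minimal sketch (not the paper's code): keep matrix blocks compressed in memory and
# decompress a block only when the kernel touches it.
import zlib
import numpy as np

def compress_blocks(M, bs):
    """Store an (n x n) matrix as a dict of zlib-compressed (bs x bs) blocks."""
    n = M.shape[0]
    return {(i, j): zlib.compress(np.ascontiguousarray(M[i:i+bs, j:j+bs]).tobytes())
            for i in range(0, n, bs) for j in range(0, n, bs)}

def get_block(blocks, i, j, bs, dtype):
    """Decompress one block on demand."""
    return np.frombuffer(zlib.decompress(blocks[(i, j)]), dtype=dtype).reshape(bs, bs)

def blocked_matmul(Ablocks, Bblocks, n, bs, dtype):
    C = np.zeros((n, n), dtype=dtype)
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                # Only two input blocks are decompressed at a time; the rest stay compressed.
                block = get_block(Ablocks, i, k, bs, dtype) @ get_block(Bblocks, k, j, bs, dtype)
                C[i:i+bs, j:j+bs] += block
    return C

n, bs = 256, 64                      # n must be a multiple of bs in this simplified sketch
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = blocked_matmul(compress_blocks(A, bs), compress_blocks(B, bs), n, bs, A.dtype)
assert np.allclose(C, A @ B)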

On the Scalability of Computing Genomic Diversity Using SparkLeBLAST: A Feasibility Study [Outstanding Student Paper Award] 

 

Ritvik R Prabhu, Bernard Moussad (Virginia Tech), Karim Youssef (LLNL), Emil Vatai (RIKEN), Wu-chun Feng (Virginia Tech)
Studying the genomic diversity of viruses can help us understand how viruses evolve and how that evolution can impact human health. Rather than use a laborious and tedious wet-lab approach to conduct a genomic diversity study, we take a computational approach, using the ubiquitous NCBI BLAST and our parallel and distributed SparkLeBLAST, across 53 patients (~40,000,000 query sequences) on Fugaku, the world’s fastest homogeneous supercomputer with 158,976 nodes, where each node contains a 48-core A64FX processor and 32 GB RAM. To project how long BLAST and SparkLeBLAST would take to complete a genomic diversity study of COVID-19, we first perform a feasibility study on a subset of 50 query sequences from a single COVID-19 patient to identify bottlenecks in sequence alignment processing. We then create a model using Amdahl’s law to project the run times of NCBI BLAST and SparkLeBLAST on supercomputing systems like Fugaku. Based on the data from this 50-sequence feasibility study, our model predicts that NCBI BLAST, when running on all the cores of the Fugaku supercomputer, would take approximately 26.7 years to complete the full-scale study. In contrast, SparkLeBLAST, using both our query and database segmentation, would reduce the execution time to 0.026 years (i.e., 22.9 hours) – resulting in more than a 10,000x speedup over the ubiquitous NCBI BLAST.
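The projection model described above is based on Amdahl's law; a small sketch of such a projection follows, with placeholder timings and parallel fractions rather than the paper's measured values:

# Sketch of an Amdahl's-law-style projection. The baseline time and parallel fraction
# below are placeholders, not numbers from the paper.
def amdahl_runtime(t_serial_1core, parallel_fraction, cores):
    """Projected runtime when only `parallel_fraction` of the work scales with core count."""
    serial_part = t_serial_1core * (1.0 - parallel_fraction)
    parallel_part = t_serial_1core * parallel_fraction / cores
    return serial_part + parallel_part

t_1core_hours = 10.0          # placeholder single-core measurement for a small feasibility run
p = 0.999                     # placeholder parallelizable fraction
fugaku_cores = 158_976 * 48   # nodes x cores per node, as quoted in the abstract
print(f"Projected runtime: {amdahl_runtime(t_1core_hours, p, fugaku_cores):.4f} hours")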

2-P1 (12:15-13:15): Poster Session 2-1

Chair(s)/Host(s): K. Keville

Running GraphBLAS on the FABRIC testbed [Outstanding Short Paper Award] 

 

Vaneshi Ramdhony, Hyunsuk Bang, Nik Sultana (Illinois Institute of Technology)
This paper describes the first FABRIC deployment of GraphBLAS, a powerful linear algebra-based framework for network traffic analysis. FABRIC is an international network testbed that is used for teaching and research in networking. This work has two goals that provide a foundation for further research on network security: (1) deploying an advanced network monitoring technique on FABRIC, and (2) making this deployment shareable with other FABRIC users. These goals are achieved by creating FABRIC experiments that build and deploy GraphBLAS to process network traffic on FABRIC. These experiments are described as executable Jupyter notebooks that are shareable with other researchers. These experiments also include the replaying of network workloads (from pcap files), which we use to test and evaluate the deployment.

Solving Hard Combinatorial Problems in Parallel Using Lift-and-Project Preconditioning 

 

Bogdan Zavalnij (Renyi Institute)
The original Lift-and-Project method was a useful tool for solving some ILP problems. Encouraged by this idea, we propose a new method for preconditioning graphs for the k-clique problem. After a blow-up phase, some standard preconditioning is applied, and the results are then projected back to the original graph. Because the blow-up may produce a graph that is too big, we work with a series of smaller blown-up graphs that can be processed independently and thus in parallel. Numerical experiments back up the usefulness of this approach.

Community Detection in Stochastic Block Model Variations 

 

Allison I Gunby-Mann, Peter Chin (Dartmouth Coll.)
In this paper we expand upon a spectral algorithm for the stochastic block model first presented by Chin et al. with additional variations of construction. Our algorithm works with graphs that contain unequal-sized blocks and non-uniform densities, which were not considered by the original algorithm. It builds upon the spectral algorithm and remains robust and simple while capable of solving additional cases.
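A toy sketch of spectral recovery on a stochastic block model with unequal block sizes, using NumPy and scikit-learn; this is a generic illustration of the setting, not the authors' algorithm, and all sizes and probabilities are illustrative:

# Sample an SBM adjacency matrix with unequal blocks, embed nodes with the leading
# eigenvectors, and cluster the rows with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
sizes, p_in, p_out = [150, 50], 0.30, 0.05            # unequal blocks, non-uniform density
labels = np.repeat(np.arange(len(sizes)), sizes)
n = labels.size

P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                           # symmetric adjacency, no self-loops

k = len(sizes)
_, vecs = np.linalg.eigh(A)                           # eigenvalues in ascending order
embedding = vecs[:, -k:]                              # k leading eigenvectors
pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
agree = max(np.mean(pred == labels), np.mean(pred != labels))   # account for label swap (k = 2)
print(f"recovered {agree:.0%} of node labels")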

Explainable DiGCNs for Decomposition of Opaque Node Ranking Functions 

 

Vishal Chandra (MIT Lincoln Laboratory)
From Snap Score in social media, to ResearchGate score in professional networks, to College Football Playoff ranking in sports, networks everywhere are home to opaque measurements with unknown factors. Thus far, standard techniques in explainable AI, such as Grad-CAM, have allowed the fitting and inspection of models based on black-box data. This work aims to do the same with these black-box rankings in directed networks. The proposed work uses the DiGCN architecture, a directed and multi-scale variant of the vanilla Graph Convolutional Network, for learning-to-rank tasks based on black-box input and output. Then, we extract multi-scale activation maps from the network to determine what factors at each neighborhood size contribute to the ranking of nodes. By leveraging intuitive ideas from existing centrality measures, this work allows the decomposition of black-box node influence functions in a more granular fashion than standard attention map analysis.

Hypersparse Traffic Matrices from Suricata Network Flows using GraphBLAS [Outstanding Short Paper Award] 

 

Michael D Houle, Michael Jones (MIT Lincoln Laboratory), Dan Wallmeyer, Risa Brodeur, Justin Burr (Center for Internet Security),
Hayden Jananthan (MIT Lincoln Laboratory), Sam Merrell (Center for Internet Security), Peter Michaleas (MIT Lincoln Laboratory), Anthony Perez (Center for Internet Security), Andrew Prout, Jeremy Kepner (MIT Lincoln Laboratory)
Hypersparse traffic matrices constructed from network packet source and destination addresses are a powerful tool for gaining insights into network traffic. SuiteSparse:GraphBLAS, an open-source package for building, manipulating, and analyzing large hypersparse matrices, is one approach to constructing these traffic matrices. Suricata is a widely used open-source network intrusion detection software package. This work demonstrates how Suricata network flow records can be used to efficiently construct hypersparse matrices using GraphBLAS.
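A dependency-light sketch of the traffic-matrix construction: map addresses to integer indices and accumulate counts in a sparse matrix. SciPy's COO/CSR formats stand in here for the SuiteSparse:GraphBLAS hypersparse matrices used in the paper, and the flow records are invented placeholders rather than Suricata output:

import numpy as np
from scipy.sparse import coo_matrix

# Stand-in for (source, destination) pairs parsed from Suricata flow records.
flows = [("10.0.0.1", "10.0.0.9"), ("10.0.0.1", "10.0.0.9"), ("10.0.0.2", "10.0.0.9")]

ids = {}
def idx(ip):
    """Assign each address a dense integer index on first sight."""
    return ids.setdefault(ip, len(ids))

rows = np.array([idx(s) for s, _ in flows])
cols = np.array([idx(d) for _, d in flows])
vals = np.ones(len(flows))

n = len(ids)
traffic = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()  # duplicate pairs are summed
print(traffic.toarray())   # row = source, column = destination, value = flow count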

2-2: Case Studies, Benchmarking, and Tools 2 Session (12:30-13:45)

Co-Chairs: C. Valentine & C. Byun

Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine [Outstanding Student Paper Award] 

 

Zuoning Zhang, Dhruv Parikh (USC), Youning Zhang (UC, Berkeley), Viktor K Prasanna (USC)
Transformer-based Large Language Models (LLMs) have recently reached state-of-the-art performance in Natural Language Processing (NLP) and Computer Vision (CV) domains. LLMs use the Multi-Headed Self-Attention (MHSA) mechanism to capture long-range global attention relationships among input words or image patches, drastically improving their performance over prior deep learning approaches. In this paper, we evaluate the performance of LLMs on the Cerebras Wafer Scale Engine (WSE). The Cerebras WSE is a high-performance computing system with 2.6 trillion transistors, 850,000 cores, and 40 GB of on-chip memory. The WSE’s Sparse Linear Algebra Compute (SLAC) cores eliminate multiply-by-zero operations, and its 40 GB of on-chip memory is uniformly distributed among the SLAC cores, enabling fast local access to model parameters. Moreover, Cerebras software configures routing between cores at runtime, optimizing communication overhead among cores. As LLMs become more commonly used, new hardware architectures are needed to accelerate LLM training and inference. We benchmark the effectiveness of this hardware architecture at accelerating LLM training and inference. Additionally, we analyze whether the Cerebras WSE can overcome the memory wall associated with traditionally memory-bound compute tasks using its 20 PB/s high-bandwidth memory. Furthermore, we examine the performance scalability of the Cerebras WSE through a roofline model. By plotting performance metrics against computational intensity, we assess its effectiveness at handling compute-intensive LLM training and inference tasks.
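For reference, the roofline bound used in such an analysis can be sketched in a few lines; the peak-compute figure below is a placeholder, while the 20 PB/s bandwidth is the number quoted in the abstract:

# Roofline model: attainable performance is the minimum of peak compute and
# memory bandwidth times arithmetic intensity.
def roofline(ai_flops_per_byte, peak_flops, bandwidth_bytes_per_s):
    return min(peak_flops, bandwidth_bytes_per_s * ai_flops_per_byte)

peak = 1.0e15          # placeholder peak FLOP/s, not a Cerebras specification
bw = 20.0e15           # 20 PB/s on-chip bandwidth, as quoted in the abstract
for ai in (0.01, 0.1, 1.0, 10.0):
    print(f"AI = {ai:5.2f} FLOP/byte -> {roofline(ai, peak, bw):.3e} FLOP/s")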

Characterization and Optimization of the Fitting of Quantum Correlation Functions 

 

Pi-Yueh Chuang, Niteya M Shah (Virginia Tech), Patrick Barry, Ian Cloet, Emil Constantinescu (Argonne National Laboratory), Nobuo Sato (Jefferson Lab), Wu-chun Feng (Virginia Tech)
This case study presents a characterization and optimization of an application code for fitting quantum correlation functions (QCFs) to femtoscale experimental data, developed by the QuantOm project at Argonne National Laboratory, Jefferson Laboratory, and Virginia Tech. Our profiling reveals that the bottleneck component consumed approximately 93% of the overall execution time and that 78% of the time to solution was spent on CPU idling (due to load imbalance) when scaling up to 128 CPU cores. We address the bottleneck by re-implementing the code in C++ and tackling the load imbalance via a hybrid scheduling strategy that combines dynamic and static scheduling. These efforts result in a 62% reduction in CPU idle time and a 2.46x speedup in overall execution time per node. In addition, the typically enabled power-management mechanisms (e.g., AMD Turbo Core or RAPL) in supercomputers significantly impact intra-node scalability when more than 50% of CPU cores are used. This finding underscores the importance of understanding system details like Turbo Boost, Turbo Core, and RAPL, as they can adversely impact performance, and highlights the necessity of intra-node scaling tests to identify performance degradation that inter-node scaling tests might overlook.

Elucidating US Import Supply Chain Dynamics: A Spatial-Temporal Graph Neural Network Approach 

 

Nikolay Aristov (MIT-CTL), Ziyan Li, Thomas Koch, Elenna Dugundji (MIT)
To enhance the understanding of congestion points at ports and provide visibility into incoming container ships in the USA, this study focuses on maritime ports and corresponding terminals, using several ports along the East Coast of the United States as case studies. We analyzed Automatic Identification System (AIS) data from January 2015 to September 2023, deriving comprehensive historical berthing data. This analysis enabled us to build several port congestion prediction models, such as Extreme Gradient Boosting (XGBoost), Long Short-Term Memory (LSTM), and a more sophisticated Spatial-Temporal Graph Neural Network (ST-GNN) model. Additionally, this research traced the historical routes of container ships and accurately mapped the sectors where ships berth within terminals, thereby providing deeper insights into ship scheduling. This study offers considerable value to stakeholders in the supply chain industry, contributing to both theoretical and practical applications in maritime logistics.

The Genomic Computing Revolution: Defining the Next Decades of Accelerating Genomics 

 

Harisankar Sadasivan (AMD), Artur Klauser (-NA-), Juergen Hench (University Hospital Basel), Yatish Turakhia (UCSD), Gagandeep Singh, Alberto Zeni (AMD), Sarah Beecroft (Pawsey Supercomputing Research Centre), Satish Narayanasamy (University of Michigan), Jeff Nivala (Univ. of Washington Seattle), Bob Robey (AMD), Onur Mutlu (ETH), Kristof Denolf, Gina Sitaraman (AMD)
The rapid growth of genomic data has positioned genomics to become one of the world’s most storage-intensive and computationally demanding fields. This review paper provides a comprehensive analysis of emerging trends in software and hardware for genomics, addressing a critical gap in the current literature. We categorize and examine key algorithms in genomic sequencing workflows, including sequence data indexing, retrieval and pile-up, dynamic programming, and machine learning. We analyze how computational and memory requirements scale with input sizes for each category, offering insights into potential bottlenecks and optimization opportunities. Our review synthesizes recent advancements in hardware-software co-design, compression schemes, and acceleration techniques for genomic data processing. We propose graphics processing units (GPUs) as a promising first step for deploying genomics workflows due to their high throughput, memory bandwidth, and programmability. By providing a holistic perspective on the computational challenges in genomics, this paper attempts to set a promising direction for future researchers to work collectively to optimize all the components of genomic data processing workflows. Our work aims to foster more targeted and effective advancements in the field, potentially leading to significant improvements in the efficiency and scalability of genomic analyses. We also discuss performance portability as a second step to meet the diverse requirements in point-of-care workflows.

Comparison of Vectorization Capabilities of Different Compilers for X86 and ARM CPUs 

 

Nazmus Sakib (New Mexico State Univ.), Tarun Prabhu, Nandakishore Santhi (LANL), John Shalf (LBNL), Abdel-Hameed Badawy (New Mexico State Univ.)
Many modern processors contain vector units that simultaneously perform the same arithmetic operation over multiple sets of operands. The ability of compilers to automatically vectorize code is a critical factor in effectively using these units. Understanding the abilities of compilers to vectorize code on different hardware platforms is important for anyone writing compute-intensive, high-performance, and portable code. We tested the vectorization capabilities of several compilers on x86 and ARM using a modified version of the TSVC2 test suite. On x86, GCC reported ∼54% of the loops in the TSVC suite as having been vectorized, ICX reported ∼50%, and Clang ∼46%. On ARM, GCC reported ∼56% of the loops in the TSVC suite as having been vectorized, ACFL reported ∼54%, and Clang reported ∼47%. We found that the vectorized code did not always outperform the unvectorized code. In some cases, given two very similar vectorizable loops, a compiler would vectorize one but not the other. We also report multiple cases where a compiler vectorized a specific loop on one hardware platform but failed to do so on a different platform.

2-P2 (13:45-14:45): Poster Session 2-2

Chair(s)/Host(s): TBD

Performance Analysis of Falcon Post-Quantum Cryptography in Embedded Hardware-Software Integration [Outstanding Short Paper Award] 

 

John Biselx (HES-SO), Andrea Guerrieri (HES-SO and EPFL)
Post-Quantum Cryptography (PQC) algorithms are gaining significant interest as they transition from theoretical prototypes to practical implementations, particularly in embedded applications such as the Internet of Things (IoT). Falcon, a PQC signature standard candidate selected by NIST, is the focus of this study. This paper aims to (1) benchmark Falcon’s performance on an embedded target and (2) explore hardware-software codesign approaches to enhance the performance of cryptographic functions. Our hardware-software codesign approach demonstrates a speedup of up to 42x for the most critical functions.

A Performance Analysis of GPU-Aware MPI Implementations Over the Slingshot-11 Interconnect 

Michael Beebe (Texas Tech Univ.), Rahulkumar Gayatri, Kevin Gott, Adam Lavely (LBNL), Muhammad Haseeb (Nvidia), Brandon Cook (LBNL), Yong Chen (Texas Tech Univ.)
This study evaluates the performance of three GPU-aware MPI implementations—Cray MPICH, MPICH, and Open MPI—on the Slingshot-11 interconnect using the Perlmutter supercomputer. We use the OSU microbenchmarks to evaluate the bandwidth and latency of specific MPI routines, and two Exascale Computing Project applications, LAMMPS and WarpX, to evaluate application performance. Results confirm that Cray MPICH outperforms the others over Slingshot-11, significantly influencing HPC application efficiency by leveraging proprietary hardware mechanisms. This research assists domain scientists in selecting MPI libraries, enhancing application performance on Slingshot-11 systems and contributing to future software and hardware optimization studies.

Application of Virtual Client for Azure Hardware Qualification 

 

Anna Mary Mathew, Bryan DeYoung, Michael Chhor, Sharjil Khan (Microsoft)
Virtual Client (VC) is an open-source platform intended to evaluate system performance by running industry available benchmarks. VC helps to standardize the workflow for validation and qualification, compare relevant system performance generation over generation, and correlate workload performance and system telemetry for results review.

Applying Natural Language Processing for Initial Categorizing of Product Descriptions 

 

Nikolay Aristov, Thomas Koch, Elenna Dugundji (MIT-CTL)
The procurement process is a critical aspect of any company’s operations, and optimizing it can lead to significant cost savings. This study presents a methodology for enhancing the categorization of procurement items by mapping them to the United Nations Standard Products and Services Code (UNSPSC). By applying this approach, companies can analyze procurement patterns more effectively, consolidate purchases, and negotiate better terms with suppliers, thereby reducing overall costs. Given the lack of substantial labeled training data often encountered in real-world scenarios, traditional machine learning techniques for categorization are not feasible. To overcome this challenge, we employed Natural Language Processing (NLP) techniques using spaCy to clean and preprocess item descriptions, followed by embedding methods to generate vector representations for both the procurement data and UNSPSC categories. Our approach resulted in the initial mapping and categorization of approximately 29% of the dataset, providing a solid foundation for further analysis. This initial success demonstrates the potential for significant improvements in procurement efficiency. Additionally, we propose several advanced techniques to enhance the categorization process, suggesting that a higher categorization rate is achievable.
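A minimal sketch of the embedding-and-similarity matching described above, not the authors' pipeline: it assumes a spaCy model with word vectors (e.g., en_core_web_md) is installed, and the UNSPSC codes and titles shown are illustrative placeholders:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")   # assumption: a vector-bearing spaCy model is available

# Placeholder UNSPSC entries; the real taxonomy has tens of thousands of codes.
unspsc = {
    "31201600": "Adhesives and sealants",
    "43211500": "Computers",
    "44121600": "Office supplies",
}
cat_vecs = {code: nlp(title).vector for code, title in unspsc.items()}

def categorize(description, min_similarity=0.5):
    """Return the best-matching UNSPSC code by cosine similarity, or None if nothing is close."""
    v = nlp(description.lower()).vector
    best_code, best_sim = None, min_similarity
    for code, cv in cat_vecs.items():
        denom = np.linalg.norm(v) * np.linalg.norm(cv)
        sim = float(v @ cv / denom) if denom else 0.0
        if sim > best_sim:
            best_code, best_sim = code, sim
    return best_code, best_sim

print(categorize("epoxy glue, 2-part, industrial"))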

Privacy-Preserving AI for Document Understanding with Controlled Unclassified Information 

 

Scott M Sawyer (Paperless Parts, Inc.)
In manufacturing, manual exchange and review of engineering drawings constrains the supply chain. AI techniques have the potential to streamline these processes. This paper presents two important use cases and proposes solutions powered by generative AI models, including text and multi-modal Large Language Models. The solutions preserve the privacy of documents used for training and inference and are thus suitable for controlled data. Performance is measured and discussed in the context of the applications, including technical and other practical considerations. Results indicate generative AI is comparable or better than previous solutions.

2-3: Graph Analytics & Network Science 1 Session (14:15-15:30)

Co-Chairs: P. Luszczek & TBD

Multilevel Diffusion Based Spectral Graph Clustering [Outstanding Paper Award] 

 

Malik Lechekhab, Dimosthenis Pasadakis, Olaf Schenk (Univ. della Svizzera italiana)
Spectral clustering and kernel k-means are ubiquitous methods for dividing non-linearly separable data into distinct groups. Spectral methods, while effective in partitioning non-convex spaces, are computationally intensive as they involve eigenvector computations. Conversely, kernel k-means maps data to higher dimensions and circumvents the need for this costly computation, with a weighted variant of it being mathematically equivalent to spectral clustering. This work extends this equivalence to diffusion based spectral methods and introduces the Multilevel Diffusion Clustering (MDC) algorithm. MDC leverages diffusion principles to minimize the normalized cut, adds flexibility through a diffusion parameter, and retrieves high-quality partitions in a multilevel fashion without computing eigenpairs. Our numerical examples and comparative results with modern multilevel graph clustering packages reveal that the proposed method can improve the clustering of graphs both in terms of balanced cut criteria and classification accuracy.

Batch-Parallel Compressed Sparse Row: A Locality-Optimized Dynamic-Graph Representation [Outstanding Student Paper Award] 

 

Brian Wheatman, Randal Burns (Johns Hopkins Univ.), Helen Xu (Georgia Inst. of Tech.)
The default data structure for storing sparse graphs is Compressed Sparse Row (CSR), which enables efficient algorithms but is not designed to accommodate changes to the graph. Since many real-world graphs are dynamic (i.e., they change over time), there has been significant work towards developing dynamic-graph data structures that can support fast algorithms as well as updates to the graph.

This paper introduces Batch-Parallel Compressed Sparse Row (BP-CSR), a batch-parallel data structure optimized for storing and processing dynamic graphs based on the Packed Memory Array (PMA). At a high level, Batch-Parallel Compressed Sparse Row extends Packed Compressed Sparse Row (PCSR, HPEC ’18), a serial dynamic-graph data structure built on a PMA. However, since the original PCSR runs only on one thread, it cannot take advantage of the parallelism available in multithreaded machines. In contrast, Batch-Parallel Compressed Sparse Row is built on the batch-parallel Packed Memory Array data structure (PPoPP ’24) and can support fast parallel algorithms and updates.

The empirical evaluation demonstrates that Batch-Parallel Compressed Sparse Row supports fast parallel updates with minimal cost to algorithm performance. Specifically, Batch-Parallel Compressed Sparse Row performs up to 420 million inserts per second. Across a suite of 10 graph algorithms and 10 input graphs, Batch-Parallel Compressed Sparse Row incurs 1.05× slowdown on average and about 1.5× slowdown at most compared to Compressed Sparse Row (CSR), a classical static graph representation. Furthermore, the empirical results show that Batch-Parallel Compressed Sparse Row outperforms existing tree-based and PMA-based dynamic-graph data structures on both algorithms and updates.
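For background, a few lines of plain CSR (not BP-CSR) illustrate why static CSR is fast to scan but hard to update, which is the gap PCSR and BP-CSR address; the tiny graph below is an invented example:

# Plain CSR: all adjacency lists live in two flat arrays, which is cache-friendly for
# scans but makes inserting a single edge an O(|E|) shift of the arrays.
import numpy as np

# Directed graph with edges 0->1, 0->2, 1->2, 2->0
indptr  = np.array([0, 2, 3, 4])   # indptr[v]..indptr[v+1] spans v's neighbor slice
indices = np.array([1, 2, 2, 0])   # concatenated adjacency lists

def neighbors(v):
    return indices[indptr[v]:indptr[v + 1]]

for v in range(len(indptr) - 1):
    print(v, "->", list(neighbors(v)))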

Indexed Binary Operations in the GraphBLAS  

Tim Mattson (Human Learning Group), Manaswinee Bezbaruah, Matthias Maier (Texas A&M Univ.), Scott McMillan (CMU Software Engineering Institute), Michel Pelletier (Graphegon), Erik Welch (Nvidia), Timothy A Davis (Texas A&M Univ.)
GraphBLAS is a sparse matrix library for graph algorithms. We could improve the GraphBLAS and support a wider range of applications if we had an indexed binary operator; i.e., a binary operator that depends on elements from two matrices and their positions (indices). We propose a design for these operators and illustrate their impact on performance and expressiveness for several algorithms including breadth-first search, argmin/argmax, vector search, and finite-element assembly.
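A conceptual sketch of why an index-aware combine is useful, shown for a per-row argmin over a COO triple list in plain Python; this illustrates the idea only and is not the GraphBLAS API proposed in the paper:

import math

coo = [(0, 2, 5.0), (0, 4, 1.5), (1, 0, 3.0), (1, 3, 3.0)]  # (row, col, value) entries

def row_argmin(entries, nrows):
    best = [(math.inf, None)] * nrows              # (value, column) per row
    for i, j, x in entries:
        # the "indexed" combine: compares values but carries the column index along
        if x < best[i][0]:
            best[i] = (x, j)
    return [col for _, col in best]

print(row_argmin(coo, 2))   # -> [4, 0]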

VF2-PS: Parallel and Scalable Subgraph Monomorphism in Arachne 

 

Mohammad Dindoost, Oliver Alvarado Rodriguez (New Jersey Inst. of Tech.), Sounak Bagchi (Edison Academy Magnet School), Palina Pauliuchenka, Zhihui Du, David A Bader (New Jersey Inst. of Tech.)
This paper introduces a novel, parallel, and scalable implementation of the VF2 algorithm for subgraph monomorphism developed in the high-productivity language Chapel. Efficient graph analysis in large and complex network datasets is crucial across numerous scientific domains. We address this need by enhancing the VF2 implementation, widely utilized in subgraph matching, and integrating it into Arachne, a Python-accessible, open-source, large-scale graph analysis framework. Leveraging the parallel computing capabilities of modern hardware architectures, our implementation achieves significant performance improvements. Benchmarks on synthetic and real-world datasets, including social, communication, and neuroscience networks, demonstrate speedups of up to 97X on 128 cores compared to existing Python-based tools like NetworkX and DotMotif, which do not exploit parallelization. Our results on large-scale graphs demonstrate scalability and efficiency, establishing our implementation as a viable tool for subgraph monomorphism, the backbone of numerous graph analytics such as motif counting and enumeration. Arachne, including our VF2 implementation, can be found on GitHub: https://github.com/Bears-R-Us/arkouda-njit.

MESM: A Query-Agnostic and Memory-Efficient Parallel Subgraph Matching Algorithm 

 

Shubhashish Kar, Shaikh Arifuzzaman (UNLV)
Subgraph matching, also known as motif finding, is a fundamental problem in graph analysis with extensive applications. However, identifying subgraphs in large-scale graphs is challenging due to its NP-Hard complexity. In addition to the time complexity, previous solutions often suffer from excessive memory usage when dealing with large-scale graphs.  This issue is exacerbated in shared-memory systems, where memory is more limited compared to distributed settings. Therefore, achieving a balance between execution time and memory efficiency is vital in such environments. In this paper, we present a query-agnostic shared-memory parallel algorithm that incorporates ordering in set intersection, resulting in an 8% reduction in enumeration time for large graphs. Our approach also achieves memory usage reductions ranging from 2× to 8.2× compared to state-of-the-art techniques, while maintaining comparable runtime performance on large datasets. Extensive experiments with various query and graph datasets demonstrate improved scalability and effective workload balancing of our approach compared to other methods.

2-4: Graph Analytics & Network Science 2 Session (15:45-17:30)

Co-Chairs: X. Sun & TBD

Constant-Memory Graph Coarsening 

 

Christopher Brissette (NVIDIA), George M Slota (RPI)
Graph coarsening is an important step for many multi-level algorithms, most notably graph partitioning. However, such methods often utilize an iterative approach, where a new coarser graph representation is explicitly constructed and retained in memory at each level of coarsening. These overheads can be prohibitive for processing massive datasets or in constrained-memory environments like GPUs. We develop a data structure (CM-Graph) for representing coarsened graphs, which can be used with any adjacency-based graph representation. The CM-Graph data structure uses a constant amount of memory, regardless of the desired level of coarsening. In addition, CM-Graph does not require modification to the existing graph representation, offers several-fold memory savings in practice, and can even accelerate graph coarsening because it does not have to explicitly construct coarser graph structures. We further describe efficient GPU parallelizations of the CM-Graph subroutines for adjacency access, which can also be utilized in most arbitrary graph computations without modification.
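A simplified illustration of the underlying idea (not the CM-Graph implementation): keep the fine graph in CSR together with a vertex-to-cluster map, and enumerate a coarse vertex's neighbors on the fly instead of materializing the coarse graph. The tiny graph and clustering below are invented examples:

import numpy as np

# Fine graph: 4 vertices on an undirected cycle 0-1-2-3-0, stored in CSR.
indptr  = np.array([0, 2, 4, 6, 8])
indices = np.array([1, 3, 0, 2, 1, 3, 2, 0])
coarse_of = np.array([0, 0, 1, 1])     # vertices {0,1} -> cluster 0, {2,3} -> cluster 1

def coarse_neighbors(c):
    """Yield the coarse neighbors of coarse vertex c without building the coarse graph."""
    seen = set()
    for v in np.where(coarse_of == c)[0]:            # fine vertices inside cluster c
        for u in indices[indptr[v]:indptr[v + 1]]:   # their fine neighbors
            cu = int(coarse_of[u])
            if cu != c and cu not in seen:
                seen.add(cu)
                yield cu

print(list(coarse_neighbors(0)))   # -> [1]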

Algebraic Vertex Ordering of a Sparse Graph for Adjacency Access Locality and Graph Compression 

 

Dimitris Floros (Duke Univ.), Nikos P Pitsianis (Aristotle Univ. of Thessaloniki), Xiaobai Sun (Duke Univ.)
In this work, we establish theoretical and practical connections between vertex indexing for sparse graph/network compression and matrix ordering for sparse matrix-vector multiplication and variable elimination. We present a fundamental analysis of adjacency access locality in vertex ordering from the perspective of graph composition of, or decomposition into, elementary compact graphs. We introduce an algebraic indexing approach that maintains the advantageous features of existing methods, mitigates their shortcomings, and adapts to the degree distribution. The new method demonstrates superior and versatile performance in graph compression across diverse types of graphs. It also renders proportional improvement in the efficiency of matrix-vector multiplications for subspace iterations in response to random walk queries on a large network.

An Efficient Multi-core Parallel Implementation of SSSP Algorithm with Decreasing Delta-stepping 

 

Rakibul Hassan, Shaikh Arifuzzaman (UNLV)
Single-Source Shortest Path (SSSP) algorithms are crucial in diverse applications, ranging from optimizing transportation networks to analyzing social networks. This research focuses on the implementation and optimization of a parallel Delta-Stepping algorithm to compute SSSP on shared-memory systems. We evaluate the performance of our implemented algorithms using synthetic Kronecker graphs and large-scale real-world graphs, including LiveJournal, Orkut, Twitter, and others. Our parallel Delta-Stepping algorithm achieves substantial speedup by employing several techniques: gradual reduction of the delta value, property-driven graph reordering, bucket fusion, and dynamic load balancing. Extensive experimentation demonstrates that our Parallel Delta-Stepping with Decreasing Delta (PD3) yields speedups between 10× and 65× with respect to the best-known serial implementation and is faster than or comparable with other existing parallel methods. These findings underscore the importance of customized parallelization strategies and algorithmic optimizations for efficient SSSP computation in shared-memory systems, with significant implications for a wide array of graph-based applications.
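A sequential, simplified sketch of Delta-Stepping for context: it omits the light/heavy edge split, parallel bucket processing, graph reordering, and the decreasing-delta schedule that PD3 adds. The graph format and delta value below are illustrative assumptions:

import math
from collections import defaultdict

def delta_stepping(adj, source, delta):
    """adj maps node -> list of (neighbor, weight); weights must be non-negative."""
    dist = defaultdict(lambda: math.inf)
    dist[source] = 0.0
    buckets = {0: {source}}
    while buckets:
        i = min(buckets)                           # process buckets in increasing order
        frontier = buckets.pop(i)
        while frontier:                            # re-relax within the current bucket
            nxt = set()
            for u in frontier:
                for v, w in adj.get(u, ()):
                    nd = dist[u] + w
                    if nd < dist[v]:
                        old = dist[v]
                        if old < math.inf:         # move v out of its old bucket, if any
                            buckets.get(int(old // delta), set()).discard(v)
                        dist[v] = nd
                        b = int(nd // delta)
                        if b == i:
                            nxt.add(v)             # stays in the bucket being processed
                        else:
                            buckets.setdefault(b, set()).add(v)
            frontier = nxt
        buckets = {k: s for k, s in buckets.items() if s}   # drop emptied buckets
    return dict(dist)

g = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0), (3, 5.0)], 2: [(3, 1.0)], 3: []}
print(delta_stepping(g, 0, delta=2.0))   # -> {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}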

IRIS-MEMFLOW: Data Flow Enabled Portable Memory Orchestration in IRIS Runtime for Diverse Heterogeneity 

 

Mohammad Alaul Haque Monil, Narasinga Rao Miniskar, Seyong Lee, Beau Johnston, Pedro Valero-Lara, Aaron Young, Keita Teranishi, Jeffrey Vetter (ORNL)
Task-based programming models and execution paradigms provide a means to decompose a computation by expressing it as a graph in which each node represents a specific computation operating on memory objects and the edges define the dependencies in the execution flow. In this execution model, independent nodes in the graph can be executed concurrently in different computing devices, making it suitable for heterogeneous systems in which computing devices with different architectures coexist. However, careful memory orchestration across heterogeneous devices is needed because copies of the same memory object may reside in multiple devices during execution. Manually ensuring such an orchestration is quite challenging. Not only must an application developer guard against race conditions, but they must also optimize data movement between the host and devices because unnecessary data movement significantly impacts performance. To mitigate these challenges, we enhance the IRIS heterogeneous runtime and introduce IRIS-MEMFLOW–a data flow-enabled portable memory abstraction for seamlessly orchestrating memory in diverse heterogeneous computing environments. By using data-flow analysis, IRIS-MEMFLOW guards against race conditions while multiple heterogeneous devices access memory objects. IRIS-MEMFLOW also optimizes data movement between the host and devices without manual intervention. As a result, IRIS provides improved programming productivity, performance, and portability for multidevice heterogeneous executions in high-performance computing and cloud systems that run diverse architectures from different vendors. The efficacy of IRIS-MEMFLOW is evaluated through experiments that show its capability in terms of programming productivity, multidevice heterogeneity, portability, and low overhead versus the state of the art.

A Deployment Tool for Large Scale Graph Analytics Framework Arachne 

Garrett R Gonzalez-Rivas, Zhihui Du, David A Bader (New Jersey Inst. of Tech.)
Data sets have grown exponentially in size, rapidly exceeding the scale at which traditional exploratory data analysis (EDA) tools can be used effectively to analyze real-world graphs. This led to the development of Arachne, a user-friendly tool enabling interactive graph analysis at terabyte scales while using familiar Python code and utilizing a high-performance back-end powered by Chapel that can be run on nearly any *nix-like system. Various disciplines, including biological, information, and social sciences, use large-scale graphs to represent the flow of information through a cell, connections between neurons, interactions between computers, relationships between individuals, etc. However, to take advantage of Arachne, a new user has to go through a long and convoluted installation process, which often takes a week or more to complete, even with assistance from the developers. To support Arachne’s mission of being an easy-to-use exploratory graph analytics tool that increases accessibility to high performance computing (HPC) resources, a better deployment experience was needed for users and developers. In this paper, we propose a tool specially designed to greatly simplify the deployment of Arachne for users and offer the ability to rapidly and automatically test the software for compatibility with new releases of its dependencies. The highly portable nature of Arachne necessitates that this deployment tool be able to install and configure the software in diverse combinations of hardware, operating system, initial system environment, and handle evolving packages and libraries in Arachne. The tool was tested in virtual and real-world environments, where its success was evaluated by an improvement to efficiency and productivity for both users and developers. Current results show that the installation and configuration process has been greatly improved, with a significant reduction in the time and effort spent by both users and developers.

2-S1: GraphBLAS BoF Special (17:30-19:30)

Organizers: T. Mattson, B. Brock & S. McMillan

Wild and Crazy Ideas for GraphBLAS 3.0

Raye Kimmerer (MIT)
Report on the binsparse Specification

Ben Brock (Intel)
SuiteSparse Update

Tim Davis (Texas A&M Univ.)
Python GraphBLAS Update

Jim Kitchen (Anaconda), Erik Welch (Nvidia)
Julia GraphBLAS Update

Raye Kimmerer (MIT)
Postgres GraphBLAS Update

Michel Pelletier (Graphegon)
Keynote Talk: The Future of Sparse Computing is Compilers

Fredrik Kjolstad (Stanford)

IEEE HPEC 2024