28th Annual
IEEE High Performance Extreme Computing Virtual Conference
23 - 24 September 2024

All times are EDT (UTC/GMT -04 hours)

Speaker/Presenting Author in Italics

 

Monday, September 23

 

1-K: Kickoff Session (10:30-11:00)

Co-Chairs: J. Kepner & A. Reuther

Kickoff Talk: Where We Stand: Education, Research, and High Performance Computing

Peter Fisher (MIT)

1-1: Advanced Multicore Software Technologies Session (11:00-12:15)

Co-Chairs: A. Conard & C. Byun

Supercomputer 3D Digital Twin for User Focused Real-Time Monitoring [Outstanding Paper Award]

William Bergeron, Matthew Hubbell, Daniel Mojica, Albert Reuther, William Arcand, David Bestor, Daniel Burrill, Chansup Byun, Vijay Gadepally, Michael Houle, Hayden Jananthan, Michael Jones, Piotr Luszczek, Peter Michaleas (MIT Lincoln Laboratory), Lauren Milechin (MIT), Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner (MIT Lincoln Laboratory)
Real-time supercomputing performance analysis is a critical aspect of evaluating and optimizing computational systems in a dynamic user environment. The operation of supercomputers produces vast quantities of analytic data from multiple sources and of varying types, so compiling this data in an efficient manner is critical to the process. The MIT Lincoln Laboratory Supercomputing Center has been using the Unity 3D game engine for several years to create a Digital Twin of our supercomputing systems for system monitoring. Unity offers robust visualization capabilities, making it ideal for creating a sophisticated representation of the computational processes. As we scale the systems to include a diversity of resources, such as accelerators, and add more users, we need to implement new analysis tools for the monitoring system. The workloads in research continuously change, as does the capability of Unity, and this allows us to adapt our monitoring tools to scale and incorporate features enabling efficient replay of system-wide events, user isolation, and machine-level granularity. Our system takes full advantage of the modern capabilities of the Unity Engine in a way that intuitively represents the real-time workload performed on a supercomputer. It allows HPC system engineers to quickly diagnose usage-related errors with its responsive user interface, which scales efficiently with large data sets.
Dynamic Task Scheduling with Data Dependency Awareness Using Julia

Rabab MA Alomairy, Felipe Tome, Julian Samaroo, Alan Edelman (MIT)
Dynamic task scheduling is vital for optimizing performance and resource utilization, particularly in heterogeneous computing environments. The LLVM-based Julia programming language offers a unique opportunity for developing efficient task-based runtime systems. This paper introduces the Dagger.jl package, a Julia-native implementation of dynamic task scheduling with data dependency awareness. We design a high-performance scheduler that leverages Julia’s type inference capabilities to support various computational tasks and data types. Our approach provides a unified API, facilitating the development and deployment of applications across different architectures. We evaluate the performance and overhead of Dagger through several tiled dense linear algebra computations on shared memory systems. Notably, our results show that Dagger with data dependency awareness outperforms other parallel paradigms in Julia and achieves performance comparable to vendor-optimized operations. Dagger also supports an implementation of the communication-avoiding QR algorithm, delivering significant performance improvements and highlighting its potential for scalable and efficient parallel computing.
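For readers unfamiliar with data-dependency-aware scheduling, the toy sketch below illustrates the general idea only: tasks declare the data they read and write, and each task launches as soon as the tasks producing its inputs have finished. Dagger.jl itself is a Julia package; this Python scheduler and its task names are hypothetical and do not reflect Dagger's actual API.

```python
# Minimal sketch of dependency-aware task scheduling (not Dagger.jl).
from concurrent.futures import ThreadPoolExecutor

class TaskGraph:
    def __init__(self):
        self.futures = {}                 # output name -> Future producing it
        self.pool = ThreadPoolExecutor()

    def spawn(self, fn, out, reads=()):
        # Wait only on the futures this task actually reads: data dependencies,
        # not program order, determine when the task may run.
        def run():
            args = [self.futures[r].result() for r in reads]
            return fn(*args)
        self.futures[out] = self.pool.submit(run)
        return self.futures[out]

g = TaskGraph()
g.spawn(lambda: 2.0, out="a")
g.spawn(lambda: 3.0, out="b")
g.spawn(lambda a, b: a * b, out="c", reads=("a", "b"))  # runs once a and b are done
print(g.futures["c"].result())                           # -> 6.0
```
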
Optimization Strategies to Accelerate BLAS Operations with ARM SVE

Aniket P Garade, Sushil Pratap Singh, Juliya James, Deepika H V, Haribabu Pasupuleti, S A Kumar, Sudarsan S D (C-DAC)
Optimized mathematical libraries designed for specific hardware platforms are critical for achieving maximum performance in scientific and engineering applications. These libraries play a key role in accelerating computations and improving code efficiency. The Scalable Vector Extension (SVE) for the ARM architecture is a recent development that enhances vectorization capabilities, with wide vectors, leading to significant performance improvements. This paper explores vector optimizations for Basic Linear Algebra Subprograms (BLAS) routines, targeting both single and double precision data. It details the strategies for vectorizing BLAS operations using SVE. The approach is implemented with OpenBLAS, and experimental results reveal notable performance gains, demonstrating the efficacy of SVE in accelerating computational tasks on ARM platforms.
A Highly Scalable Parallel Design for Data Compression

S Biplab Raut (AMD)
With the ever-increasing use of digital data, many applications rely on data compression for their needs related to processing, storage, and communication of large volumes of data over the network. While compression saves memory/disk space and decreases communication time, considerable runtime is spent in the process. Parallel compression algorithms and solutions developed to speed up these operations do not scale well on multi-core CPUs. The data-parallel schemes implemented by prior art are inefficient in partitioning the data and scaling the performance on multi-core processors. Another major drawback of existing multi-threaded compression solutions is non-compliance with the single-threaded compression format. In this paper, we propose a set of novel, high-performance parallel compression and decompression schemes. We introduce novel designs for dynamic-threading-based parallel compression and random-access-point-based parallel decompression. With our solution, we mitigate both the scaling issues on multi-core x86 CPUs and the format compliance issues encountered in multi-threading the compression operations. Our test results demonstrate manyfold speedups and performance scaling not seen before on x86 CPUs, especially AMD’s recent “Zen”-based processors that come with very high core counts.
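As context for the scaling and format-compliance issues mentioned above, here is a minimal Python sketch of the conventional data-parallel scheme: the input is split into fixed-size chunks that are compressed independently on a thread pool. The chunk size and pool are illustrative, and the concatenated output is a sequence of independent streams rather than one format-compliant stream, which is exactly the drawback the paper contrasts against.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB chunks (illustrative choice)

def parallel_compress(data: bytes) -> list[bytes]:
    """Compress fixed-size chunks independently on a thread pool.

    zlib releases the GIL while compressing, so threads give real parallelism.
    The result is a list of independent zlib streams -- NOT a single
    single-threaded-format-compliant stream.
    """
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(zlib.compress, chunks))

def parallel_decompress(streams: list[bytes]) -> bytes:
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, streams))

data = b"example data " * 100000
assert parallel_decompress(parallel_compress(data)) == data
```
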
Investigating Resilience of Loops in HPC Programs: A Semantic Approach with LLMs

Hailong Jiang, Jianfeng Zhu (Kent State Univ.), Bo Fang (PNNL), Chao Chen (Intel), Qiang Guan (Kent State Univ.)
Transient hardware faults, resulting from particle strikes, are significant concerns in High-Performance Computing (HPC) systems. As these systems scale, the likelihood of soft errors rises. Traditional methods like Error-Correcting Codes (ECCs) and checkpointing address many of these errors, but some evade detection, leading to silent data corruptions (SDCs). This paper evaluates the resilience of HPC program loops, which are crucial for performance and error handling, by analyzing their computational patterns, known as the thirteen dwarfs of parallelism. We employ fault injection techniques to quantify SDC rates and utilize Large Language Models (LLMs) with prompt engineering to identify the loop semantics of the dwarfs in real source code. Our contributions include defining and summarizing loop patterns for each dwarf, quantifying their resilience, and leveraging LLMs for precise identification of these patterns. These insights enhance the understanding of loop resilience, aiding in the development of more resilient HPC applications.
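A minimal illustration of the kind of fault injection described above: flip one bit of an intermediate floating-point value inside a loop iteration and check whether the final result is silently corrupted. The loop (a dot product) and the injection site are hypothetical and are not taken from the paper's benchmark suite.

```python
import struct, random

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of a 64-bit float, emulating a transient hardware fault."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

def dot(a, b, inject_at=None, bit=0):
    s = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        p = x * y
        if i == inject_at:              # transient fault in a single iteration
            p = flip_bit(p, bit)
        s += p
    return s

a = [1.0] * 1000
b = [2.0] * 1000
golden = dot(a, b)
faulty = dot(a, b, inject_at=random.randrange(1000), bit=random.randrange(64))
print("SDC" if faulty != golden else "benign", golden, faulty)
```
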

1-P1 (12:15-13:15): Poster Session 1-1

Chair(s)/Host(s): K Keville & K. Cain

Performance Benchmarking of H2O AutoML and Individual Models on Malware Detection Tasks

Minakshi Arya (NDSU), Shubhavi Arya (Indiana Univ.), Saatvik Arya (Univ. of Washington)
This paper presents a comprehensive comparative analysis of H2O AutoML and individual machine learning models including Distributed Random Forest (DRF), Gradient Boosting Machine (GBM), XGBoost, and Deep Learning, applied to malware detection. We evaluate these models using key performance metrics such as accuracy, AUC, log loss, precision, recall, and F1 score across different time frames. Our findings highlight the efficiency and effectiveness of H2O AutoML in identifying optimal models, providing insights into its potential advantages and limitations compared to traditional manual model selection.
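For reference, a minimal sketch of how an H2O AutoML run like the one benchmarked here is typically launched from Python; the file name and column names are placeholders, and the leaderboard reports metrics such as the AUC and log loss discussed above.

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Placeholder dataset: a labeled malware-feature table with a binary target column.
frame = h2o.import_file("malware_features.csv")
frame["label"] = frame["label"].asfactor()            # treat target as categorical
train, test = frame.split_frame(ratios=[0.8], seed=1)

aml = H2OAutoML(max_models=20, seed=1, sort_metric="AUC")
aml.train(y="label", training_frame=train)            # all other columns are features

print(aml.leaderboard.head())                          # candidate models ranked by AUC
print(aml.leader.model_performance(test).auc())        # hold-out AUC of the best model
```
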
IOS: A Low Cost Defense to Mitigate Meltdown and Spectre like Attacks

Xin Wang (Virginia Commonwealth Univ.), Wei Zhang (Univ. of Louisville)
The Meltdown and Spectre attacks bring severe security issues to a wide range of modern processors. Meltdown steals sensitive information in the kernel memory space by exploiting the nature of out-of-order execution, while Spectre exploits speculative execution to access secret data. Both attacks transmit the secrets via cache side-channels. Although software patches have been applied to protect processors from the Meltdown and Spectre attacks, these countermeasures introduce huge performance degradation and close the door on the benefits of out-of-order and speculative execution. A hardware-based solution can be more performance friendly while preserving the benefits of out-of-order and speculative execution. In this paper, we propose a hardware-based mitigation technique named Invalidation on Squash (IOS), which closes the cache covert channel and stops Meltdown and Spectre from exposing secrets to the adversary. To simplify the additional defense logic and minimize the hardware overhead, IOS targets squashed load instructions and invalidates the corresponding cache lines introduced by these squashed loads. Compared to existing Meltdown and Spectre countermeasures, IOS incurs only negligible hardware overhead by taking advantage of simple invalidation logic.
Authentication in High Noise Environments using PUF-Based Parallel Probabilistic Searches

Brian Donnelly, Michael Gowanlock (Northern Arizona Univ.)
Enabling secure communication in noisy environments is a major challenge. In these environments, the outputs of cryptographic algorithms undergo errors in which several bits change state, and since these algorithms cannot tolerate any error, authenticating and securing communication between parties is disrupted. We propose a noise-resistant public key infrastructure protocol that employs physical unclonable functions (PUFs). PUFs act as a unique fingerprint for each device in a network; however, their state may drift over time due to fluctuations in temperature and other factors. Using a PUF requires a search to identify flipped bits, which is conducted on a secure server. This has the benefit of removing error correction from low-powered client devices. This paper exploits the probabilistic nature of PUF bit error rates (BERs) and uses this information to aid the search process that resolves the noise imparted by the environment. We show that using a 256-bit PUF-generated seed (a PUF response), our protocol is robust to a PUF BER of approximately 11% (or 30 of 256 bits) and a transmission bit error rate (TBER) of 30%. In this scenario, the authentication mechanism on a secure server requires <5 s on average. We also show results for higher PUF BERs, which have a <100% authentication success rate, indicating the upper limit on the PUF BER tolerance of our protocol.
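The server-side search idea described above can be sketched roughly as follows: given a noisy PUF response and a per-bit error-probability estimate, candidate corrections are generated with the most error-prone bits flipped first and checked against a stored hash of the enrolled response. The bit width, BER values, and hash choice below are illustrative assumptions, not the paper's parameters.

```python
import hashlib
from itertools import combinations

N_BITS = 32                                    # small width for illustration only

def enroll(response: int) -> bytes:
    return hashlib.sha256(response.to_bytes(N_BITS // 8, "big")).digest()

def recover(noisy: int, enrolled_hash: bytes, bit_error_prob, max_flips=4):
    """Search for the enrolled response, flipping the most error-prone bits first."""
    order = sorted(range(N_BITS), key=lambda i: bit_error_prob[i], reverse=True)
    for k in range(max_flips + 1):             # try 0 flips, then 1, then 2, ...
        for subset in combinations(order, k):
            candidate = noisy
            for b in subset:
                candidate ^= 1 << b
            if enroll(candidate) == enrolled_hash:
                return candidate
    return None

# Illustrative use: an enrolled response and a noisy readout with two flipped bits.
true_resp = 0x1234ABCD
h = enroll(true_resp)
noisy = true_resp ^ (1 << 3) ^ (1 << 17)
ber = [0.02] * N_BITS
ber[3] = ber[17] = 0.3                         # these bits are known to be flaky
print(hex(recover(noisy, h, ber)))
```
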
Intel Xeon Optimization for Efficient Media Workload Acceleration

Karan Puttannaiah, Rajesh Poornachandran (Intel)
This paper discusses key methodologies for performing workload affinity characterization and for characterizing the power-performance tradeoff across fine-grained Intel Xeon CPU parameters over a variety of popular industry media use cases. Key results from the detailed study, along with business acumen, helped define the first-ever media-workload-optimized Intel Xeon CPU.
Towards an End-to-End Processing-in-DRAM Acceleration of Spectral Library Search

Tianyun Zhang, Eric Tang (Carnegie Mellon Univ.), Farzana A Siddique, Kevin Skadron (Univ. of Virginia), Franz Franchetti (Carnegie Mellon Univ.)
This work explores accelerating spectral library searches, a key mass spectrometry (MS) workload, using processing-in-memory (PIM) architectures through an end-to-end, co-designed approach. We apply signal processing and approximate computing techniques for pre-filtering MS data and implement a sum of absolute differences (SAD) algorithm optimized for PIM to compare spectral similarity. Our methodology is evaluated using a DRAM-based PIM simulator and compared against traditional CPU implementations. While initial results with small datasets favor CPUs, our analysis indicates potential benefits for PIM with larger, more realistic proteomics datasets. This work represents an initial step towards investigating PIM acceleration for MS applications.
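The sum-of-absolute-differences similarity kernel referred to above is straightforward; this NumPy sketch (with made-up array sizes) shows the computation that the PIM design maps into memory: the library spectrum with the smallest SAD is the best match.

```python
import numpy as np

rng = np.random.default_rng(0)
library = rng.random((10000, 2048))   # binned reference spectra (illustrative sizes)
query = rng.random(2048)              # binned query spectrum

# Sum of absolute differences against every library entry; smaller = more similar.
sad = np.abs(library - query).sum(axis=1)
best = int(np.argmin(sad))
print(best, sad[best])
```
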
Neuromorphic Circuits with Spiking Astrocytes for Increased Energy Efficiency, Fault Tolerance, and Memory Capacitance

Murat Isik (Drexel Univ.), Kaushal Gawri (SemaAI), Maurizio De Pitta (University Health Network)
In the rapidly advancing field of neuromorphic computing, integrating biologically inspired models like the Leaky Integrate-and-Fire Astrocyte (LIFA) into spiking neural networks (SNNs) enhances system robustness and performance. This paper introduces the LIFA model in SNNs, addressing energy efficiency, memory management, routing mechanisms, and fault tolerance. Our core architecture consists of neurons, synapses, and astrocyte circuits, with each astrocyte supporting multiple neurons for self-repair. This clustered model improves fault tolerance and operational efficiency, especially under adverse conditions. We developed a routing methodology to map the LIFA model onto a fault-tolerant, many-core design, optimizing network functionality and efficiency. Rigorous evaluation showed that our design is area- and power-efficient while achieving superior fault tolerance compared to existing approaches. Our model features a fault tolerance rate of 81.10% and a resilience improvement rate of 18.90%, significantly surpassing other implementations. The results validate our approach to memory management, highlighting its potential as a robust solution for advanced neuromorphic computing applications. The integration of astrocytes represents a significant advancement, setting the stage for more resilient and adaptable neuromorphic systems.

1-2: Advanced Processor Architectures Session (12:30-13:45)

Co-Chairs: M. Barnell & K. Gettings

VeBPF Many-Core Architecture for Network Functions in FPGA-based SmartNICs and IoT

Zaid Tahir (Boston Univ.), Ahmed Sanaullah (Red Hat), Sahan Bandara (Boston Univ.), Ulrich Drepper (Red Hat), Martin Herbordt (Boston Univ.)
FPGA-based SmartNICs and IoT devices integrated with soft processors for executing network functions have been introduced to overcome hardware-reconfigurability limitations in DPUs and MCUs, respectively. However, existing FPGA-based SmartNICs and IoT devices lack a highly configurable many-core architecture that specializes in network packet processing. This work introduces a resource-optimized, highly configurable VeBPF (Verilog eBPF) many-core architecture built upon VeBPF CPU cores that we have developed for specialized network packet processing in FPGAs. These VeBPF cores are eBPF ISA compliant and have been developed in Verilog HDL for easy integration with existing FPGA IP blocks/subsystems. The VeBPF many-core architecture executes multiple eBPF rules on multiple VeBPF cores in parallel for low-latency network packet processing. Due to its highly configurable hardware design, any number (N) of VeBPF cores can be instantiated by setting a parameter in the Verilog code, and any number of eBPF rules can be uploaded, with FPGA resources as the only constraint. The architecture has been designed to process eBPF rules faster as N is increased, and the eBPF rules can be changed dynamically at run time without requiring new bitstreams. It uses various hardware and computer architecture optimizations to support its implementation on low-end FPGA-based IoT devices as well as high-end FPGA-based SmartNICs for network packet processing. We have also developed automatic testing and simulation frameworks for the proposed VeBPF many-core architecture using open-source tools such as Python and Cocotb.
Hunting the Needle – The Potential of Innovation in Architecture

Peter Kogge (Univ. of Notre Dame), Janice McMahon (Self), Timothy Dysart (Tactical Computing Labs)
Subgraph isomorphism involves using a small graph as a pattern to identify, within a larger graph, a set of vertices whose edges match, and it is becoming increasingly important in many application areas. Such problems exhibit the potential for very significant fine-grain parallelism, with individual threads having short lifetimes while touching potentially “distant” memory objects in a very unpredictable and irregular fashion. This is difficult for conventional distributed memory systems to handle efficiently, but an alternative that combines cheap multi-threading with threads that can migrate freely through a large memory is a more natural fit. This paper demonstrates the potential of such an architecture by comparing its execution characteristics for a large graph to those of several conventional parallel implementations on modern but conventional architectures. The gains exhibited by the migrating threads are significant.
Predictive Performance of Photonic SRAM-based In-Memory Computing for Tensor Decomposition [Outstanding Student Paper Award]

Sasindu Wijeratne (USC), Sugeet Sunder (USC Information Sciences Institute), Md Abdullah-Al Kaiser, Akhilesh Jaiswal (Univ. of Wisconsin), Clynn Mathew, Ajey Jacob (USC Information Sciences Institute), Viktor K Prasanna (USC)
Photonic-based in-memory computing systems have demonstrated a significant speedup over traditional transistor-based systems because of their ultra-fast operating frequencies and high data bandwidths. Photonic static random access memory (pSRAM) is a crucial component for achieving ultra-fast photonic in-memory computing systems. In this study, we model and evaluate the performance of a new photonic SRAM array architecture in development, using predictive performance metrics for key photonic components. Additionally, we examine hyperspectral operation through wavelength division multiplexing (WDM) to enhance throughput. We map the tensor operation, Matricized Tensor Times Khatri-Rao Product (MTTKRP), used in tensor decomposition, onto the proposed photonic SRAM array architecture. Our predictive performance model indicates that the proposed architecture can sustain a performance of 17 PetaOps while executing MTTKRP on extremely large tensors.
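For reference, MTTKRP for a third-order tensor can be written in a few lines of NumPy; the tensor sizes and CP rank below are arbitrary. This is the operation being mapped onto the photonic SRAM array, shown here purely in software form.

```python
import numpy as np

I, J, K, R = 64, 64, 64, 16                  # arbitrary tensor sizes and CP rank
rng = np.random.default_rng(0)
X = rng.random((I, J, K))                    # third-order tensor
B = rng.random((J, R))                       # factor matrices
C = rng.random((K, R))

# MTTKRP along mode 1: M[i, r] = sum_{j,k} X[i, j, k] * B[j, r] * C[k, r]
M = np.einsum("ijk,jr,kr->ir", X, B, C)

# Check against a direct per-column computation (same result, less efficient).
M_ref = np.zeros((I, R))
for r in range(R):
    M_ref[:, r] = np.einsum("ijk,j,k->i", X, B[:, r], C[:, r])
assert np.allclose(M, M_ref)
```
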
Accelerating Sensor Fusion in Neuromorphic Computing: A Case Study on Loihi-2

Murat Isik (Drexel Univ.), Karn Tiwari (IIS Bangalore), Burak Eryilmaz (Bilkent Univ.), Ismail Can Dikmen (TEMSA)
In our study, we utilized Intel’s Loihi-2 neuromorphic chip to enhance sensor fusion in fields like robotics and autonomous systems, focusing on datasets such as AIODrive, Oxford Radar RobotCar, D-Behavior (D-Set), nuScenes by Motional, and Comma2k19. Our research demonstrated that Loihi-2, using spiking neural networks, significantly outperformed traditional computing methods in speed and energy efficiency. Compared to conventional CPUs and GPUs, Loihi-2 showed remarkable energy efficiency, being over 100 times more efficient than a CPU and nearly 30 times more than a GPU. Additionally, our Loihi-2 implementation achieved faster processing speeds on various datasets, marking a substantial advancement over existing state-of-the-art implementations. This paper also discusses the specific challenges encountered during the implementation and optimization processes, providing insights into the architectural innovations of Loihi-2 that contribute to its superior performance.
A Multilevel Approach For Solving Large-Scale QUBO Problems With Noisy Hybrid Quantum Approximate Optimization

Filip B Maciejewski (NASA/USRA), Bao Gia Bach (Univ. of Delaware), Maxime Dupont (Rigetti Computing), Paul A Lott (Universities Space Research Association), Bhuvanesh Sundar (Rigetti Computing), David Neira (Purdue University/USRA), Ilya Safro (Univ. of Delaware), Davide Venturelli (Universities Space Research Association)
Quantum approximate optimization is one of the promising candidates for useful quantum computation, particularly in the context of finding approximate solutions to Quadratic Unconstrained Binary Optimization (QUBO) problems. However, existing quantum processing units (QPUs) are of relatively small size, and canonical mappings of QUBO via the Ising model require one qubit per variable, rendering direct large-scale optimization infeasible. In classical optimization, a general strategy for addressing many large-scale problems is via multilevel/multigrid methods, where the large target problem is iteratively coarsened and the global solution is constructed from multiple small-scale optimization runs. In this work, we experimentally test how existing QPUs perform when used as a sub-solver within such a multilevel strategy. To this aim, we combine and extend (via additional classical processing steps) the recently proposed Noise-Directed Adaptive Remapping (NDAR) and Quantum Relax & Round (QRR) algorithms. We first demonstrate the effectiveness of our heuristic extensions on Rigetti’s superconducting transmon device Ankaa-2. We find approximate solutions to 10 instances of fully connected 82-qubit Sherrington-Kirkpatrick graphs with random integer-valued coefficients, obtaining normalized approximation ratios (ARs) in the range ~0.98-1.0, and for the same class with real-valued coefficients, ARs of ~0.94-1.0. Then, we implement the extended NDAR and QRR algorithms as sub-solvers in the multilevel algorithm for 6 large-scale graphs with at most ~27,000 variables. In practice, the QPU (with classical post-processing steps) is used to find approximate solutions to dozens of at most 82-qubit problems, which are iteratively used to construct the global solution. We observe that quantum optimization results are competitive in terms of the quality of solutions when compared to classical heuristics used as sub-solvers within the multilevel approach.
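To make the quantities above concrete, the sketch below evaluates a Sherrington-Kirkpatrick-style Ising objective for a candidate spin assignment and one common form of approximation ratio. The instance, the reference best-known value, and the ratio convention are illustrative placeholders; the paper's normalization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 82
J = np.triu(rng.integers(-10, 11, size=(n, n)), 1)   # random integer couplings
J = J + J.T                                           # symmetric, zero diagonal

def energy(spins: np.ndarray) -> float:
    """Ising objective sum_{i<j} J_ij * s_i * s_j for spins in {-1, +1}."""
    return 0.5 * float(spins @ J @ spins)

spins = rng.choice([-1, 1], size=n)                   # a candidate solution
E = energy(spins)

# One common convention: found objective divided by the best-known objective.
E_best = -1234.0                                      # placeholder reference value
print(E, E / E_best)
```
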

1-P2 (13:45-14:45): Poster Session 1-2

Chair(s)/Host(s): P. Luszczek

Quantum Machine Learning in the Cognitive Domain: Alzheimer’s Disease Study

Emine Akpinar (Yıldız Technical Univ.)
Alzheimer’s disease (AD) is the most prevalent neurodegenerative disorder, primarily affecting the elderly population and leading to significant cognitive decline. This decline manifests in various mental faculties such as attention, memory, and higher-order cognitive functions, severely impacting an individual’s ability to comprehend information, acquire new knowledge, and communicate effectively. One of the tasks influenced by cognitive impairments is handwriting. By analyzing specific features of handwriting, including pressure, velocity, and spatial organization, researchers can detect subtle changes that may indicate early-stage cognitive impairments, particularly AD. Recent developments in classical artificial intelligence (AI) methods have shown promise in detecting AD through handwriting analysis. However, as the dataset size increases, these AI approaches demand greater computational resources, and diagnoses are often affected by limited classical vector spaces and feature correlations. Recent studies have shown that quantum computing technologies, developed by harnessing the unique properties of quantum particles such as superposition and entanglement, can not only address the aforementioned problems but also accelerate complex data analysis and enable more efficient processing of large datasets. In this study, we propose a variational quantum classifier with fewer circuit elements to facilitate early AD diagnosis based on handwriting data. Our model has demonstrated comparable classification performance to classical methods and underscores the potential of quantum computing models in addressing cognitive problems, paving the way for future research in this domain.
On the Design of the Quantum-Classical Hybrid-Service Architecture

Yi Liu, Yuchou Chang (UMass Dartmouth)
As one of the latest advancing technologies, quantum computing shows potential to rapidly solve problems that are too complex for traditional computing methods. However, due to its unique nature, including the complexity of the underlying quantum mechanics, quantum computing is still not widely used in many fields. In the software engineering domain, one of the challenges is the lack of methodologies for constructing applications that use both quantum and classical computing environments. This study proposes an innovative Hybrid-Service architectural design approach for constructing hybrid quantum-classical applications. The design approach is then applied to the construction of a hybrid quantum Magnetic Resonance Imaging (MRI) application that accelerates MRI, as a case study demonstrating the effectiveness of the design approach.
Quantum Computing for Data Calibration in Parallel Magnetic Resonance Imaging Reconstruction

Girish Babu Reddy, Gulfam A Saju, Yi Liu, Yuchou Chang (UMass Dartmouth)
Parallel imaging techniques such as GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA) play an important role in Magnetic Resonance Imaging (MRI) by significantly reducing scan times and enhancing patient comfort without compromising image quality. GRAPPA’s algorithmic framework, involving calibration and synthesis stages, is critical to reconstructing high-quality images. However, the computational load of the calibration stage, especially with large convolutional kernel sizes or an increased number of receiver coils, poses a significant bottleneck and limits its efficiency and applicability in clinical settings. In this paper, we introduce an approach based on a hybrid software architecture that integrates quantum computing into the GRAPPA reconstruction process. Our method exploits the computational capabilities of quantum computing to accelerate the calibration phase, thereby enabling real-time processing speeds. Through experimentation, we demonstrate that the quantum-enhanced approach can expedite the calibration process to around 10-20 milliseconds of Quantum Processing Unit (QPU) programming time for each Linear Time-Invariant (LTI) system solved. The method maintains the integrity of calibration outcomes, achieving results on par with conventional central processing unit (CPU)-based processes. This represents progress towards real-time MRI reconstruction by reducing clinical MRI workflow times, improving patient throughput, and potentially enabling new diagnostic capabilities. Solving LTI systems with more attributes on a D-Wave quantum computer will be studied in future work to show the advantages of the QPU over the CPU.
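The calibration stage targeted above amounts to solving linear systems for kernel weights from auto-calibration data. A minimal classical (NumPy) version of that step is sketched below with fabricated matrix sizes; the paper formulates the same kind of linear system for a QPU instead of the CPU least-squares solve shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: rows = calibration (ACS) source points,
# cols = kernel size x number of coils, one target column per coil.
n_rows, n_src, n_coils = 500, 2 * 5 * 8, 8
A = rng.standard_normal((n_rows, n_src)) + 1j * rng.standard_normal((n_rows, n_src))
b = rng.standard_normal((n_rows, n_coils)) + 1j * rng.standard_normal((n_rows, n_coils))

# Calibration: solve A w = b in the least-squares sense for the kernel weights w.
w, *_ = np.linalg.lstsq(A, b, rcond=None)

# Synthesis would then apply w to acquired neighborhoods to fill missing k-space lines.
print(w.shape)   # (n_src, n_coils)
```
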
Ultra Low Latency Hardware Optimised Radix-4 FFT for Optical Wireless FPGA Transceivers via Hermitian Symmetry Characteristics

Michael Codd, Ciara McDonald (Maynooth Univ.), Yiyue Jiang, Chunan Chen (Northeastern Univ.), Holger Claussen (Tyndall National Institute), Miriam Leeser (Northeastern Univ.), John Dooley (Maynooth Univ.)
Future telecommunication networks are expected to deliver exponential performance increases across all domains, and with the increased prevalence of real-time IoT devices, greater emphasis is placed on reducing the latency of network links. Traditionally, wireless networking requirements have been fulfilled primarily through use of the RF spectrum, which is rapidly approaching saturation and will very likely become insufficient to meet all future network demands. The optical spectrum, however, offers enormous amounts of unrestricted and unallocated bandwidth. Efficient high-modulation indoor LED lighting fixtures could potentially integrate with and complement the RF spectrum for short- to medium-distance low-latency applications. The Fast Fourier Transform (FFT) is a ubiquitous operation in many communication network topologies. Typically the FFT is computed via serial methods that are optimised for low resource usage; however, these architectures fall short of the Ultra Low Latency (ULL) requirements for optical wireless communication. Fully parallel FFT computations can achieve nanosecond latency and tens of gigasamples per second of throughput, far surpassing serial methods, but their high resource utilization has limited their practical use. In this work, we introduce a hardware-optimised, fully parallel architecture for optical wireless communication that leverages Hermitian symmetry characteristics of real-valued optical signals and properties of the DFT to reduce the footprint of a fully parallel FFT on an FPGA. The final architecture is implemented on an AMD RFSoC2x2 and requires only 3 clock cycles to compute a 256-point real-valued FFT, a 290-fold reduction compared to an equivalent serial model. The design was tested at 122.88 MHz, resulting in a 24-nanosecond latency, demonstrating its potential for use in optical wireless communication and other high-performance 5G+ networks.
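The Hermitian-symmetry property exploited above is easy to state: for a real-valued input, the FFT output satisfies X[N-k] = conj(X[k]), so only about half the output bins are unique. The short NumPy check below demonstrates the property; it says nothing about the paper's radix-4 hardware mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
x = rng.standard_normal(N)                       # real-valued time-domain signal

X = np.fft.fft(x)
k = np.arange(1, N)
assert np.allclose(X[N - k], np.conj(X[k]))      # Hermitian symmetry of a real signal

# Consequently only N//2 + 1 unique bins exist; rfft computes exactly those.
assert np.allclose(np.fft.rfft(x), X[: N // 2 + 1])
```
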
Fully Transparent Client-Side Caching for Key-Value Store Applications Using FPGAs

Sahan Bandara, Noah Cherry, Martin Herbordt (Boston Univ.)
Key-value stores (KVS) are a critical component of current data center infrastructure. They help address the extreme demand on data centers for high-bandwidth, low-latency access to large amounts of data. Due to their importance, many efforts have been made to improve their performance, including using FPGAs to offload some functionality. These efforts have focused on improving the performance of the key-value store itself and reducing the load on the server running the KVS. However, with more FPGAs being deployed in data centers by many cloud service providers, some use models that were not previously practical are becoming more realistic. In this work, we explore one such use case in which we cache key-value entries at the network interface of the client server. We propose an FPGA design that is capable of caching the KVS data transparently to the KVS client application. The proposed solution improves application throughput while also reducing the network traffic generated by the KVS client. Also, as the proposed solution targets client servers that are typically shared by multiple clients, we discuss the importance of, and present our vision for, an FPGA design supporting multiple tenants.
Impact of Grid Processing on Signal Cross-Correlation

Rhea Senthil Kumar, Nathan Simard, Jonathan Mathews, Jeremy Kepner, Timothy Collard (MIT Lincoln Laboratory)
Systems involving signal processing produce large amounts of data, requiring an efficient computing architecture to measure data characteristics on feasible time frames. Grid processing offers scalability and user-controlled resource optimization, making it a viable method for large-scale computing applications such as machine learning, financial modeling, and data analytics. In this paper, we focus on grid processing of collected signal data: we adapted a cross-correlation algorithm to be compatible with grid processing, improving runtime by several orders of magnitude. This paper measures the performance of the MIT Lincoln Laboratory Supercomputing Center (LLSC) cluster using Intel Xeon 64-core processors for signal cross-correlation. To the best of our knowledge, this is the first successful implementation of fast Fourier transform (FFT)-based signal cross-correlation utilizing CPU-based grid processing.
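For context, FFT-based cross-correlation follows directly from the correlation theorem: multiply one spectrum by the conjugate of the other and inverse-transform. The NumPy sketch below shows that single-node kernel only; how many such jobs are dispatched across the cluster (the grid-processing contribution) is not shown, and the signal lengths are illustrative.

```python
import numpy as np

def xcorr_fft(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Circular cross-correlation of equal-length signals via the FFT.

    c[m] = sum_n a[n] * conj(b[n - m])  (indices mod N), computed in O(N log N).
    """
    A = np.fft.fft(a)
    B = np.fft.fft(b)
    return np.fft.ifft(A * np.conj(B)).real

rng = np.random.default_rng(0)
a = rng.standard_normal(1 << 16)
b = np.roll(a, 1000)                 # b lags a by 1000 samples
c = xcorr_fft(b, a)                  # peak index gives the lag of b relative to a
print(int(np.argmax(c)))             # -> 1000
```
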

1-3: ASIC and FPGA Advances Session (14:15-15:30)

Co-Chairs: TBD & TBD

A High-Performance Curve25519 and Curve448 Unified Elliptic Curve Cryptography Accelerator

Aniket Banerjee (IISc), Utsav Banerjee (Indian Institute of Science)
In modern critical infrastructure such as power grids, it is crucial to ensure security of data communications between network-connected devices while following strict latency criteria. This necessitates the use of cryptographic hardware accelerators. We propose a high-performance unified elliptic curve cryptography accelerator supporting NIST standard Montgomery curves Curve25519 and Curve448 at 128-bit and 224-bit security levels respectively. Our accelerator implements extensive parallel processing of Karatsuba-style large-integer multiplications, restructures arithmetic operations in the Montgomery Ladder and exploits special mathematical properties of the underlying pseudo-Mersenne and Solinas prime fields for optimized performance. Our design ensures efficient resource sharing across both curve computations and also incorporates several standard side-channel countermeasures. Our ASIC implementation achieves record performance and energy of 10.38 µs / 54.01 µs and 0.72 µJ / 3.73 µJ respectively for Curve25519 / Curve448, which is significantly better than state-of-the-art.
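As background on the "Karatsuba-style" multiplications mentioned above: the recursion replaces one n-digit multiply with three half-size multiplies. A minimal Python version over integers is shown below; the hardware operates on fixed-width field elements with far more parallelism, which this sketch does not capture.

```python
def karatsuba(x: int, y: int, cutoff_bits: int = 64) -> int:
    """Multiply two non-negative integers with Karatsuba's three-multiply recursion."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y                       # small operands: use the native multiply
    m = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> m, x & ((1 << m) - 1)    # split x = xh * 2^m + xl
    yh, yl = y >> m, y & ((1 << m) - 1)
    z2 = karatsuba(xh, yh)                 # high * high
    z0 = karatsuba(xl, yl)                 # low * low
    z1 = karatsuba(xh + xl, yh + yl) - z2 - z0   # cross terms via one extra multiply
    return (z2 << (2 * m)) + (z1 << m) + z0

import random
a = random.getrandbits(448)                # Curve448-sized operands, for flavor
b = random.getrandbits(448)
assert karatsuba(a, b) == a * b
```
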
Direct RF FPGAs built with Multi-Chip Packaging Overcome Technology Challenges

Marjorie Catt, Dustin J Henderson (Altera)
The wideband Altera™ Agilex™ Direct RF-Series portfolio is a leap forward that enables the implementation of direct RF systems in a single-package RF FPGA. This development involved integrating existing technologies and creating solutions for new challenges.
A Run-Time Configurable NTT Architecture for Homomorphic Encryption Based on 3D Algorithm

Weicong Lu, Xiaojie Chen, Dihu Chen, Tao Su (Sun Yat-Sen Univ.)
Homomorphic encryption (HE) allows computations on encrypted data without compromising data privacy, making it ideal for scenarios like privacy-preserving computing. The primary bottleneck within HE schemes is polynomial multiplication, which can be accelerated using the number theoretic transform (NTT). This paper proposes a run-time configurable (RTC) NTT/INTT accelerator supporting HE parameter sets based on the 3D NTT algorithm. A conflict-free memory access pattern is proposed to efficiently implement the 3D NTT algorithm without additional hardware units. Additionally, an on-the-fly twiddle factor generator (TFG) is proposed to optimize memory utilization for twiddle factors (TFs). The proposed design achieves significant improvements in performance and area efficiency compared to state-of-the-art FPGA implementations.
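For readers unfamiliar with the NTT, it is the DFT over a prime field: with modulus q and a primitive n-th root of unity, pointwise products in the transform domain correspond to cyclic convolutions of polynomial coefficients. The direct O(n^2) sketch below uses tiny toy parameters, not the paper's HE parameter sets or its 3D decomposition; it only illustrates the property the accelerator computes at scale.

```python
q, n, w = 17, 8, 2          # toy modulus, transform size, primitive n-th root of unity
assert pow(w, n, q) == 1 and pow(w, n // 2, q) != 1

def ntt(a, invert=False):
    """Direct O(n^2) number theoretic transform over Z_q (illustration only)."""
    root = pow(w, q - 2, q) if invert else w           # w^-1 for the inverse transform
    out = [sum(x * pow(root, i * j, q) for j, x in enumerate(a)) % q for i in range(n)]
    if invert:
        n_inv = pow(n, q - 2, q)
        out = [x * n_inv % q for x in out]
    return out

# Cyclic (mod x^n - 1) polynomial multiplication via pointwise products in NTT domain.
a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
c = ntt([x * y % q for x, y in zip(ntt(a), ntt(b))], invert=True)

# Check against the schoolbook cyclic convolution.
ref = [sum(a[j] * b[(i - j) % n] for j in range(n)) % q for i in range(n)]
assert c == ref
```
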
Optimizing FPGA Memory Allocation for Matrix-Matrix Multiplication using Bayesian Optimization

Mehmet Gungor, Stratis Ioannidis, Miriam Leeser (Northeastern Univ.)
Matrix-matrix multiplication (MM) of large matrices plays a crucial role in various applications, including machine learning. MM requires significant computational resources, but accessing memory can quickly become the bottleneck. Field-Programmable Gate Arrays (FPGAs) offer a range of memory options, such as Block RAM (BRAM), UltraRAM (URAM), and High Bandwidth Memory (HBM), each with unique characteristics. In this study, we explore the optimal combination of HBM with either BRAM, URAM, or both, depending on the size of the input data. We employ Bayesian optimization to optimize the FPGA implementation, analyze the trade-offs between different memory types, and determine the most suitable memory allocation based on memory sizes. Our findings provide insights for designers seeking to optimize their designs and demonstrate that URAM outperforms a combination of BRAM and URAM when data fits in URAM. Overall, our approach enables more efficient memory allocation for larger matrix sizes on FPGAs compared to prior research.
pc-COP: An Efficient and Configurable 2048-p-Bit Fully-Connected Probabilistic Computing Accelerator for Combinatorial Optimization

Kiran Magar (IISc), Shreya Bharathan (National Inst. of Tech., Tiruchirappalli), Utsav Banerjee (Indian Institute of Science)
Probabilistic computing is an emerging quantum-inspired computing paradigm capable of solving combinatorial optimization and various other classes of computationally hard problems. In this work, we present pc-COP, an efficient and configurable probabilistic computing hardware accelerator with 2048 fully connected probabilistic bits (p-bits) implemented on Xilinx UltraScale+ FPGA. We propose a pseudo-parallel p-bit update architecture with speculate-and-select logic which improves overall performance by 4× compared to the traditional sequential p-bit update. Using our FPGA-based accelerator, we demonstrate the standard G-Set graph maximum cut benchmarks with near-99% average accuracy. Compared to state-of-the-art hardware implementations, we achieve similar performance and accuracy with lower FPGA resource utilization.
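To connect the p-bit terminology above to something concrete: a probabilistic bit samples +1 or -1 with a probability set by a sigmoid of its local input, and repeatedly sweeping all p-bits while raising the inverse temperature anneals the network toward low-energy (high-cut) states. The small software emulation below uses an illustrative random graph, schedule, and a sequential sweep; the paper's contribution is the pseudo-parallel hardware update, which is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 64
W = np.triu(rng.integers(0, 2, size=(n, n)), 1)       # random unweighted graph
W = W + W.T

def cut_value(s):
    """Max-cut objective: number of edges whose endpoints have opposite spins."""
    return 0.25 * np.sum(W * (1 - np.outer(s, s)))

s = rng.choice([-1, 1], size=n)                        # p-bit states
for beta in np.linspace(0.1, 3.0, 2000):               # inverse-temperature schedule
    for i in range(n):                                  # sequential sweep (the hardware
        h = -W[i] @ s                                   # does this pseudo-parallel)
        z = np.clip(2.0 * beta * h, -60, 60)
        p_plus = 1.0 / (1.0 + np.exp(-z))               # sigmoid activation of p-bit i
        s[i] = 1 if rng.random() < p_plus else -1

print(cut_value(s))
```
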

1-4: BRAINS – Building Resilience through Artificial Intelligence for Networked Systems Session (15:45-17:30)

Co-Chairs: S. Pisharody & J. Holodnak

Invited Talk: The SWARM Project: Reimagining Workflow and Resource Management Systems with Swarm Intelligence

Prasanna Balaprakash (ORNL)
Invited Talk: The Convergence of Intuitive AI and Exascale Computing: Redefining What’s Possible

Eliu Huerta (ANL)
Invited Talk: Operational AI/ML Opportunities

Scott Weed (US Air Force)
Hardware Trojan Detection Utilizing Graph Neural Networks and Structural Checking

Hunter Nauman, Jia Di (Univ. of Arkansas)
The integrated circuit (IC) industry has experienced exponential growth in the complexity and scale of hardware designs. To sustain this growth, faster development cycles and cost-effective solutions have been the focus of many companies, notably through the incorporation of third-party intellectual property (IP). Outsourcing the production of sub-components reduces development time and enables faster time-to-market; however, this approach also introduces the threat of Hardware Trojans, malicious modifications or additions to an IC that pose significant security risks due to their small size, low activation frequency, and complex obfuscation techniques. This research proposes an advancement to the Trojan detection mechanisms incorporated in the Structural Checking Tool, a Trojan detection tool that focuses on the identification of logical Trojans embedded within soft IPs. Leveraging graph structures generated by the tool and signal-level features, this research develops a new dataset and three graph neural network architectures. Each neural network corresponds to classical graph neural network layers and executes graph-level probabilistic binary classification of Trojan inclusion. Through rigorous testing with two potential sets of node-level feature vectors, this research offers a faster, more accurate, and more adaptable approach than those existing within the current tool.
Break

Composable Mission-Critical Embedded System Architecture for High Assurance

Michael Vai, Eric Simpson, Alice Lee, Huy Nguyen, Jeffrey Hughes, Ben Nahill, Jeffery Lim, Roger Khazan, Sean O’melia (MIT Lincoln Laboratory), Fred Schneider (Cornell University)
Mission-critical systems must go through a laborious and lengthy high assurance certification process. Slight modifications of a certified system often trigger a new certification cycle. We have leveraged a Modular Open Systems Approach (MOSA) and developed a composable Embedded-Security-as-a-Service (ESaaS) architecture for mission-critical embedded systems. A zero-trust approach has been applied to incorporate security and resilience technologies and address mission assurance requirements. In this paper, we discuss an ecosystem that supports the acquisition and certification processes of high assurance ESaaS modular embedded systems for critical missions.
What is Normal? A Big Data Observational Science Model of Anonymized Internet Traffic

Jeremy Kepner, Hayden Jananthan, Michael Jones, William Arcand, David Bestor, William Bergeron, Daniel Burrill (MIT Lincoln Laboratory), Aydin Buluc (LBNL), Chansup Byun (MIT Lincoln Laboratory), Timothy Davis (Texas A&M), Vijay Gadepally (MIT Lincoln Laboratory), Daniel Grant (GreyNoise), Michael Houle, Matthew Hubbell, Piotr Luszczek (MIT Lincoln Laboratory), Lauren Milechin (MIT), Chasen Milner, Guillermo Morales (MIT Lincoln Laboratory), Andrew Morris (GreyNoise), Julie Mullen, Ritesh Patel (MIT Lincoln Laboratory), Alex Pentland (MIT), Sandeep Pisharody, Andrew Prout, Albert Reuther, Antonio Rosa, Gabriel Wachman, Charles Yee, Peter Michaleas (MIT Lincoln Laboratory)
Understanding what is normal is a key aspect of protecting a domain. Other domains invest heavily in observational science to develop models of normal behavior to better detect anomalies. Recent advances in high performance graph libraries, such as the GraphBLAS, coupled with supercomputers enable processing of the trillions of observations required. We leverage this approach to synthesize low-parameter observational models of anonymized Internet traffic with a high regard for privacy.
Invited Talk: National Centers of Academic Excellence in Cybersecurity Program

Teddy Lynch (NSA)
Keynote Talk: Verification in ML

Shafi Goldwasser (Simons Theory of Computing Institute)

IEEE HPEC 2024