27th Annual
IEEE High Performance Extreme Computing Virtual Conference
25 - 29 September 2023

HPEC 2023 AGENDA

Friday, September 29

 

5-K: Keynote Session (10:30-11:00)

Co-Chairs: J. Kepner & A. Reuther
Speeding Progress, Advice for Research Managers
Dr. Ivan Sutherland (von Neumann Medal & ACM A.M. Turing Award Winner)

5-1: Quantum & Advanced Processor Architectures Session (11:00-12:15)

Co-Chairs: C. Byun & D. Ricke
Lincoln AI Computing Survey (LAICS)
Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, Jeremy Kepner (MIT Lincoln Laboratory)
This paper is an update of the survey of AI accelerators and processors from the past four years, which is now called the Lincoln AI Computing Survey (LAICS, pronounced “lace”). As in past years, this paper collects and summarizes the current commercial accelerators that have been publicly announced with peak performance and peak power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are again discussed and analyzed. Market segments are highlighted on the scatter plot, and zoomed plots of each segment are also included. Finally, a brief description of each of the new accelerators added to the survey this year is included.
Accelerating Garbled Circuits in the Open Cloud Testbed with Multiple Network-Attached FPGAs
Kai Huang (Google), Mehmet Gungor, Suranga Handagala, Stratis Ioannidis, Miriam Leeser (Northeastern Univ.)
Field Programmable Gate Arrays are increasingly used in cloud computing to increase the run time performance of applications. For complex applications or applications that operate over large amounts of data, users may want to use more than one FPGA. The challenge is how to map and parallelize applications to a multi-FPGA cloud computing platform such that the problem is partitioned evenly over the FPGAs, memory resources are used effectively, communication is minimized, and speedup is maximized. In this research, we build a framework to map Garbled Circuit applications, an implementation of Secure Function Evaluation, to the Open Cloud Testbed, which has FPGA cards attached to computing nodes. The FPGAs are directly connected to 100 GbE switches and can communicate directly through the network; we use the Xilinx UDP stack for this. Preprocessing generates efficient memory allocation and partitioning maps and schedules executions to different FPGAs to minimize communication and maximize processing overlap. This framework achieves close to perfect speedup on a two-FPGA setup compared to a one-FPGA implementation, and can handle large examples that cannot fit on a single FPGA.
UNet Performance with Wafer Scale Engine (Optimization Case Study)
Vyacheslav Romanov (NETL)
Science-based simulations can significantly improve the efficiency and effectiveness of carbon storage operations and subsurface reservoir management and help mitigate the impact of atmospheric carbon dioxide on climate change. However, they can be computationally very expensive, especially when inversion modeling is employed. Training robust, trustworthy, and reliable site-specific, field-scale, science-informed artificial intelligence models is also computationally and data intensive. In this work, compute acceleration of such model training was achieved by exploiting the extreme workflow parallelism afforded by embedding the compute resources on a single wafer. Successful kernel mapping to the system fabric and accelerated training were accomplished using a prototype UNet architecture and dynamic loss scaling.
Hybrid Quantum-Classical Multilevel Approach for Maximum Cuts on Graphs
Anthony Angone (Univ. of Delaware), Xiaoyuan Liu (Fujitsu Research), Ruslan Shaydulin (JPMorgan Chase), Ilya Safro (Univ. of Delaware)
Combinatorial optimization is one of the fields where near-term quantum devices are being utilized with hybrid quantum-classical algorithms to demonstrate potentially practical applications of quantum computing. One of the most well-studied problems in combinatorial optimization is the Max-Cut problem. The problem is also highly relevant to quantum and other types of “post Moore” architectures due to its similarity with the Ising model, among other reasons. In this paper, we introduce a scalable hybrid multilevel approach to solve large instances of Max-Cut using both classical-only solvers and the quantum approximate optimization algorithm (QAOA). We compare the results of our solver to existing state-of-the-art large-scale Max-Cut solvers. We demonstrate excellent performance of both classical and hybrid quantum-classical approaches and show that using QAOA within our framework is comparable to classical approaches. Our solver is publicly available at https://github.com/angone/MLMax-cut.
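For context, the standard formulation such solvers target (a textbook statement, not specific to this paper) maximizes the total weight of cut edges over a bipartition s of the vertices, and maps directly to the Ising-type cost Hamiltonian that QAOA samples:

    \max_{s \in \{-1,+1\}^{|V|}} \; \tfrac{1}{2} \sum_{(i,j) \in E} w_{ij}\,(1 - s_i s_j),
    \qquad
    C = \tfrac{1}{2} \sum_{(i,j) \in E} w_{ij}\,(1 - Z_i Z_j).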
Optimization and Performance Analysis of Shor’s Algorithm in Qiskit
Dewang Sun, Naifeng Zhang, Franz Franchetti (Carnegie Mellon Univ.)
Shor’s algorithm is widely renowned in the field of quantum computing due to its potential to efficiently break RSA encryption in polynomial time. In this paper, we optimized an end-to-end library-based implementation of Shor’s algorithm using the IBM Qiskit quantum library and derived a speed-of-light (i.e., theoretical peak) performance model that calculates the minimum runtime required for executing Shor’s algorithm with input size N on a certain machine by counting the total operations as a function of different numbers of gates. We evaluated our model by running Shor’s algorithm on CPUs and GPUs, and simulated the factorization of a number up to 4,757. Comparing the speed-of-light runtime to our real-world measurement, we are able to quantify the margin for future quantum library improvements.
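As a reminder of the classical post-processing arithmetic behind Shor’s algorithm, here is a textbook worked example (N = 15, base a = 7; not drawn from the paper’s experiments):

    # Classical post-processing step of Shor's algorithm for N = 15, a = 7.
    # Illustrative only: the period r would come from the quantum part.
    from math import gcd
    N, a, r = 15, 7, 4               # 7**4 % 15 == 1, so the order of 7 mod 15 is 4
    p = gcd(pow(a, r // 2) - 1, N)   # gcd(48, 15) = 3
    q = gcd(pow(a, r // 2) + 1, N)   # gcd(50, 15) = 5
    assert p * q == N                # recovers the factors 3 and 5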

5-P: Poster Session (12:15-14:15)

Chair(s)/Host(s): D. Enright & TBD
FFTX-IRIS: A Dynamic Execution System for Heterogeneous Platforms
Sanil Rao, Het Mankad (Carnegie Mellon Univ.), Mohammad Alaul Haque Monil (ORNL), Jeffrey Vetter (ORNL), Franz Franchetti (Carnegie Mellon Univ.)
FFTX-IRIS is a dynamic system to efficiently utilize novel heterogeneous platforms. This system links two next-generation frameworks, FFTX and IRIS, to navigate the complexity of different hardware architectures. FFTX provides a runtime code generation framework for high performance kernels. IRIS provides a heterogeneous runtime environment allowing computation on any available compute resource. Together, FFTX-IRIS enables seamless portability and performance without user involvement. We show the design of the FFTX-IRIS system as well as a simple example of a common Fast Fourier Transform (FFT).
EVPFFTX: A First Look at FFTX Applications in Material Science
Het Mankad (Carnegie Mellon Univ.), Andrea Rovinelli, Miroslav Zecevic (LANL), Peter McCorquodale (LBNL), Franz Franchetti, Naifeng Zhang, Sanil Rao (Carnegie Mellon Univ.), R. A. Lebensohn, Laurent Capolungo (LANL)
This work presents a first look at EVPFFTX, an FFTX-based library for the elasto-viscoplastic FFT (EVPFFT) algorithm used in material science. FFTX is a high-performance library with a SPIRAL backend, built for running fast Fourier transform (FFT)-based applications on the latest exascale machines. SPIRAL is a code generation system developed to generate highly optimized code for different types of linear transforms. EVPFFT is used to study polycrystalline materials, which form an integral part of molten salt reactors. The aim of this work is to provide a brief overview of the procedure required to translate the EVPFFT algorithm into the SPIRAL framework.
SRAM Performance Tuning Via Flex-Gate Biasing
Elijah E Racz, Maher Rizkalla (Indiana Univ.-Purdue Univ. Indianapolis), Trond Ytterdal (NTNU), John Lee (Indiana Univ.-Purdue Univ. Indianapolis)
As demands on computing memory increase, the need for increasingly small, fast, and power-efficient SRAM cache grows. As SRAM occupies the majority of chip area, development of reliable and highly manufacturable chips is critically important. This paper proposes an architecture for tuning SRAM cells post-fabrication, which has the potential to vastly improve yield and provide performance benefits depending on the use case.
The Use of Differential Equations to Model Multiprocessors Architecture
Ricardo Citro, Kayla Zantello (Grand Canyon Univ.)
This paper shows a method to mathematically determine different parameters important for evaluating a specific computer science or computer engineering scenario. Differential equations are used to model a computer architecture, and the method can be extended to any other type of investigation in the topic of multicore computers, symmetric multiprocessors, or multiple processors with input/output components. The system of differential equations is built from a hypothetical architecture. We discuss two case scenarios: the first is formed by a 3×3 matrix and the second by a 2×2 matrix. The 2×2 matrix is solved using the matrix method. Results reveal that the solutions to the system of first-order linear differential equations for a 2×2 matrix represent data degradation, system degradation, or performance degradation as the system moves in the time domain. It is represented by an exponential distribution with symmetric behavior.
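As a generic illustration of the matrix (eigenvalue) method for a 2×2 first-order linear system x'(t) = Ax(t), a minimal sketch follows; the coefficient matrix and initial condition are hypothetical, not taken from the paper:

    # Matrix method for x'(t) = A x(t): decompose A, then x(t) = sum_i c_i e^{lam_i t} v_i.
    import numpy as np
    A = np.array([[-1.0, 0.5],
                  [0.5, -1.0]])          # hypothetical 2x2 coefficient matrix
    lam, V = np.linalg.eig(A)            # eigenvalues lam, eigenvectors as columns of V
    x0 = np.array([1.0, 0.0])            # hypothetical initial condition
    c = np.linalg.solve(V, x0)           # coefficients of x0 in the eigenbasis
    def x(t):
        return (V * np.exp(lam * t)) @ c # negative eigenvalues give exponential decay
    print(x(0.0), x(2.0))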
A Comparison of the Performance of the Molecular Dynamics Simulation Package GROMACS Implemented in the SYCL and CUDA Programming Models
Leonard Apanasevich, Yogesh Kale, Himanshu Sharma, Ana Marija Sokovic (Univ. of Illinois Chicago)
For many years, systems running Nvidia-based GPU architectures have dominated the heterogeneous supercomputer landscape. However, recently GPU chipsets manufactured by Intel and AMD have cut into this market and can now be found in some of the world’s fastest supercomputers. The June 2023 edition of the TOP500 list of supercomputers ranks the Frontier supercomputer at the Oak Ridge National Laboratory in Tennessee as the top system in the world. This system features AMD Instinct MI250X GPUs and is currently the only true exascale computer in the world. In the near future, another exascale system, Aurora, equipped with Intel Delta GPUs, is expected to come online at Argonne National Laboratory in Illinois. As the use of different GPU architectures becomes more prevalent in today’s supercomputing centers, it is becoming crucial to have a programming model that can support different platforms without the need for separate codebases. The first framework that enabled support for heterogeneous platforms across multiple hardware vendors was OpenCL, in 2009. Since then, a number of frameworks have been developed to support vendor-agnostic heterogeneous environments, including OpenMP, OpenCL, Kokkos, and SYCL. SYCL, which combines the concepts of OpenCL with the flexibility of single-source C++, is one of the more promising programming models for heterogeneous computing devices. In recent years, there has been growing interest in using heterogeneous computing architectures to accelerate molecular dynamics simulations. Some of the more popular molecular dynamics packages include Amber, NAMD, and GROMACS. However, to the best of our knowledge, only GROMACS has been successfully ported to SYCL to date. In this paper, we compare the performance of GROMACS compiled using the SYCL and CUDA frameworks for a variety of standard GROMACS benchmarks. In addition, we compare its performance across three different Nvidia GPU chipsets, P100, V100, and A100.
P2Prop: A Benchmarking Tool for Assessing Power to Performance Proportionality in HPC-grade GPUs
Ghazanfar Ali (Texas Tech Univ.)
The power and performance signatures of each GPU device vary based on GPU architecture and vendor. Studies show that, generally, more performance is delivered at the cost of more power consumption. The higher power consumption is justifiable as long as it is proportional to the increase in performance. However, determining the proportionality between power and performance across different GPU architectures and vendors is a challenge. Many researchers have explored energy proportionality, but existing approaches are limited in terms of scope and goals. In this study, we design a power-to-performance proportionality framework (the P2Prop GPU benchmark tool) that defines four key metrics: performance per watt, power-to-thermal-design-power ratio, energy, and energy-delay product. We evaluated the power and performance proportionality of the state-of-the-art NVIDIA (GV100, GA100) and AMD Instinct (MI100, MI210) GPUs using benchmarks and real-world HPC and ML workloads. We observed that AMD GPUs have improved power-to-performance proportionality compared to NVIDIA GPUs (TBC).

5-2: ASIC/FPGA Advances & Design Tools 1 Session (12:30-13:45)

Co-Chairs: J. Hughes & B. Thoelen
Pruning Binarized Neural Networks Enables Low-Latency, Low-Power FPGA-Based Handwritten Digit Classification
Syamantak Payra (Stanford Univ.), Gabriel Loke, Yoel Fink, Joseph Steinmeyer (MIT)
As neural networks are increasingly deployed on mobile and distributed computing platforms, there is a need to lower latency and increase computational speed while decreasing power and memory usage. Rather than using FPGAs as accelerators in tandem with CPUs or GPUs, we directly encode individual neural network layers as combinational logic within FPGA hardware. Utilizing binarized neural networks minimizes the arithmetic computation required, shrinking latency to only the signal propagation delay. We evaluate size-optimization strategies and demonstrate network compression via weight quantization and weight-model unification, achieving 96% of the accuracy of baseline MNIST digit classification models while using only 3% of the memory. We further achieve an 86% decrease in model footprint, 8 mW dynamic power consumption, and <9 ns latency, validating the versatility and capability of feature-strength-based pruning approaches for binarized neural networks to flexibly meet performance requirements amid application resource constraints.
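For orientation only, a minimal bipolar ({-1, +1}) dense layer of the kind binarized networks build on; the paper’s contribution, encoding pruned layers of this form directly as FPGA combinational logic, is not captured by this software sketch:

    # Binarized (bipolar) dense layer: sign-quantized weights, dot product, sign activation.
    import numpy as np
    rng = np.random.default_rng(0)
    W = np.sign(rng.standard_normal((10, 784)))  # hypothetical binarized weight matrix
    def bnn_layer(x_bits):
        return np.sign(W @ x_bits)               # bipolar products reduce to sign flips and sums
    x = np.sign(rng.standard_normal(784))        # bipolar input vector
    print(bnn_layer(x))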
Leveraging Mathworks Tools to Accelerate the Prototyping of Custom 5G Applications in Hardware
Joshua Geyster, Karen Gettings, Paul Monticciolo, Matthew Rebholz (MIT Lincoln Laboratory)
The ubiquity and flexibility of 5G networks make them an attractive technology to customize for particular needs. In this work, we highlight some of the challenges associated with implementing custom 5G applications in hardware and the growing trend of utilizing high-level synthesis tools to alleviate these issues. We present an overview of the 5G resource grid and the MATLAB-Simulink-HDL Coder workflow. We then demonstrate the workflow through an example 5G resource grid transmitter design.
Generating High-Performance Number Theoretic Transform Implementations for Vector Architectures
Naifeng Zhang (Carnegie Mellon Univ.), Austin Ebel, Negar Neda (New York Univ.), Benedict Reynwar, Andrew G. Schmidt (USC Information Sciences Institute), Brandon Reagen (New York Univ.), Franz Franchetti (Carnegie Mellon Univ.)
Fully homomorphic encryption (FHE) offers the ability to perform computations directly on encrypted data by encoding numerical vectors onto mathematical structures. However, the adoption of FHE is hindered by substantial overheads that make it impractical for many applications. Number theoretic transforms (NTTs) are a key optimization technique for FHE by accelerating vector convolutions. Towards practical usage of FHE, we propose to use SPIRAL, a code generator renowned for generating efficient linear transform implementations, to generate high-performance NTT on vector architectures. We identify suitable NTT algorithms and translate the dataflow graphs of those algorithms into SPIRAL’s internal mathematical representations. We then implement the entire workflow required for generating efficient vectorized NTT code. In this work, we target the Ring Processing Unit (RPU), a multi-tile long vector accelerator designed for FHE computations. On average, the SPIRAL-generated NTT kernel achieves a 1.7× speedup over naive implementations on RPU, showcasing the effectiveness of our approach towards maximizing performance for NTT computations on vector architectures.
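For reference, a naive O(n²) NTT over Z_q with small textbook parameters (n = 8, q = 17, ω = 2); SPIRAL’s role in the paper is to generate fast, vectorized equivalents of this transform for the RPU:

    # Naive number theoretic transform: X[k] = sum_j x[j] * w^(j*k) mod q.
    def ntt(x, q=17, w=2):
        n = len(x)   # w must be a primitive n-th root of unity modulo q (2^8 = 256 = 1 mod 17)
        return [sum(x[j] * pow(w, j * k, q) for j in range(n)) % q for k in range(n)]
    print(ntt([1, 2, 3, 4, 0, 0, 0, 0]))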
Selective Encryption of Compressed Image Regions on the Edge with Reconfigurable Hardware
Justin Kawakami, Dominik Zajac, Miriam Leeser (Northeastern Univ.)
It is becoming more and more common for people to take photos on edge devices such as smartphones, and wish to transmit them in a secure manner. For example, a patient may wish to send an image to a medical practitioner for a quick screening analysis and be given advice for further follow-up. We investigate possible implementations of DCT followed by RSA encryption. We use the AMD Zynq processor System on Chip comprising an embedded ARM processor and FPGA fabric to investigate design trade-offs. Our implementation of DCT plus RSA exhibits a 3.4x improvement in latency over an optimized software implementation. We further investigate selectively encrypting the compressed image for both speeding up the encryption process and reducing the amount of data needed to be transmitted. This design space exploration results in lessons that can be applied to future implementations.
Quantifying the Gap Between Open-Source and Vendor FPGA Place and Route Tools
Shachi Vaman Khadilkar (UMass Lowell), Ahmed Sanaullah (Red Hat Research), Martin Margala (Univ. of Louisiana)
The use of Field Programmable Gate Arrays (FPGAs) has increased greatly as a result of their flexibility, power efficiency, and hardware acceleration capabilities. The CAD tools needed to map hardware description language (HDL) code to the FPGA are complex and challenging to build. Open-source CAD tools for FPGAs have been developed to facilitate restriction-free customization and give users more control over the mapping process. A significant milestone in the development of open-source CAD tools was the ability to target real commercial devices. In recent years, multiple academic CAD tools have become able to map circuits to commercial FPGAs. It is essential to identify and quantify the performance gap between academic and vendor tools targeting commercial devices. To this end, we compare relevant hardware quality metrics after placement and routing for five tool-flows targeting a commercial FPGA. Our results show the divide between academic and commercial place and route tools for device utilization, run time, and maximum circuit speeds.

5-3: ASIC/FPGA Advances & Design Tools 2 Session (14:15-15:30)

Co-Chairs: J. Hughes & D. Ricke
Errant Beam Detection Using the AMD Versal ACAP and Vitis AI
Anthony M Cabrera, Yigit Yucesan, Frank Liu, Willem Blokland, Jeffrey S Vetter (ORNL)
The prevalence of ML and AI-powered solutions along with the slowing of Moore’s Law has given rise to novel hardware platforms aimed at accelerating ML and AI. While programming these hardware platforms can be difficult, particularly for non-hardware experts, hardware vendors provide high-level tooling in an effort to address this difficulty. The Versal ACAP is an SoC designed by AMD that combines CPU cores, FPGA fabric, and a tiled vector architecture called the AI Engine, all on the same socket. In an effort to more easily program this heterogeneous system, AMD has provided the Vitis AI development stack. In this work, we leverage Vitis AI to program a Versal ACAP to perform errant beam detection in the Spallation Neutron Source at Oak Ridge National Laboratory. Our initial work shows that after quantization and compilation of the model for the Versal ACAP, the classification accuracy, as measured by the AUC metric, is over 95%, while achieving this accuracy in 46 microseconds on average.
Improved Models for Policy-Agent Learning of Compiler Directives in HLS
Robert P Munafo, Hafsah Shahzad (Boston Univ.), Ahmed Sanaullah, Sanjay Arora, Uli Drepper (Red Hat), Martin Herbordt (Boston Univ.)
Acceleration by Field-Programmable Gate Arrays (FPGAs) continues to be deployed in data center and edge computing hardware designs, and the tools and integration for accelerating computationally intensive tasks continue to increase in practicality. In this paper, we build on previous work in applying machine learning to automatically tune the transformation of high-level language (HLL) C code by a High Level Synthesis (HLS) system to generate an FPGA hardware design that runs at high speed. This tuning is done primarily through the selection of code transformations (optimizations) and an ordering in which to apply them. We present more detailed results from the use of reinforcement learning (RL) and improve on previous results in several ways: by developing additional strategies that perform better and more consistently, by normalizing the learning rate to the frequency of new (yet untried) action sequences, and by informing the model with aggregate statistics of optimization sub-orderings.
Feature-Oriented FSMs for FPGAs
Justin Deters (SimpleRose), Peyton Gozon, Max Camp-Oberhauser, Ron K. Cytron (Washington Univ. St. Louis)
In this paper we consider a feature-oriented approach for specifying finite-state machines, which form the basis of cache controllers (and other components) for RISC-V implementations and which are commonly found in hardware designs. Using a library we constructed for Chisel, developers can apply features at will, with the resulting machine containing only the circuitry needed to support the desired features. Our library offers two constructs for building features. The first, inspired by aspect-oriented programming, applies incremental changes to the states and edges of a finite-state machine to alter and customize its behavior in response to features of interest. The second construct couples the behavior of separate finite-state machines into a single machine that processes its inputs simultaneously. We illustrate each construct separately using a vending machine and the game of Nim, respectively. Our approach offers significant leverage in supporting both the number and size of the generated designs. We present results from synthesis that show the size of the design endpoints compared with the much smaller size of their specification.
Towards a Flexible Hardware Implementation for Mixed-Radix Fourier Transforms
Mario Vega, Xiaokun Yang, John Shalf, Doru Thom Popovici (LBNL)
The discrete Fourier transform is a versatile mathematical kernel widely used in a myriad of applications from physics, chemistry, and even machine learning. While most applications typically rely on power-of-two sizes, for which high-performance implementations have been developed as software libraries or custom hardware IPs, there are codes in chemistry and even machine learning that require non-power-of-two or even prime-sized Fourier transforms. The classic algorithms that target such sizes fall short compared to those geared towards power-of-two sizes, being less performant and more complicated both in software and in hardware. However, recent work (Popovici et al., 2020) has shown that by casting small prime-sized Fourier transforms as specialized matrix operations, a unifying kernel can be developed that outperforms all classical implementations on most modern CPUs. Therefore, in this work, we focus on designing the corresponding hardware unit for prime-sized Fourier transforms and integrating the custom design within any mixed-radix Fourier transform. We provide a detailed description of the hardware design choices for the custom IP, outlining the characteristics and the trade-offs between latency, throughput, and resource utilization on Xilinx FPGAs for both the standalone and mixed-radix designs. We provide an end-to-end implementation for the mixed-radix Fourier transform, showing that even oddly shaped Fourier transforms can become appealing when building custom hardware designs for general-size Fourier transforms.
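A minimal sketch of the “specialized matrix operation” viewpoint for a prime-sized DFT (illustrative only; the paper’s contribution is the corresponding hardware unit and its mixed-radix integration):

    # Prime-sized DFT (p = 7) cast as a dense matrix-vector product.
    import numpy as np
    def dft_matrix(p):
        j, k = np.meshgrid(np.arange(p), np.arange(p))
        return np.exp(-2j * np.pi * j * k / p)   # F[k, j] = exp(-2*pi*i*j*k/p)
    x = np.random.rand(7)
    y = dft_matrix(7) @ x                        # one matrix-vector product
    assert np.allclose(y, np.fft.fft(x))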
Tyche: A Compact and Configurable Accelerator for Scalable Probabilistic Computing on FPGA
Yashash Jain, Utsav Banerjee (IISc Bangalore)
Probabilistic computing is an emerging computing paradigm that involves the systematic control and manipulation of unstable stochastic units called p-bits. Multiple p-bits are connected together to implement p-circuits, which have been shown to be capable of solving interesting computationally hard problems. In this work, we present Tyche, a compact and configurable hardware accelerator for scalable probabilistic computing on FPGA. Our architecture allows p-circuits requiring different numbers of p-bits to be implemented using the same hardware. The use of a single p-bit computing core instead of an array of processing elements provides significant logic resource savings. A logarithmic adder tree is used for single-cycle weight logic computation while ensuring reasonable performance even for large numbers of p-bits. Various application-specific p-circuits are experimentally demonstrated using our proposed hardware accelerator implemented on a Xilinx UltraScale+ FPGA, emphasizing the viability of practical scalable probabilistic computing on modern FPGAs.
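For context, the widely used p-bit update rule that such p-circuits realize in hardware, sketched in software with a hypothetical two-p-bit coupling (not the paper’s accelerator):

    # p-bit dynamics: m_i = +1 with probability (1 + tanh(beta * I_i)) / 2,
    # where I_i = sum_j J_ij * m_j + h_i.
    import numpy as np
    rng = np.random.default_rng(1)
    J = np.array([[0.0, 1.0],
                  [1.0, 0.0]])                 # hypothetical coupling matrix
    h = np.zeros(2)                            # hypothetical biases
    m = rng.choice([-1, 1], size=2)            # p-bit states
    beta = 1.0
    for _ in range(1000):
        for i in range(2):
            I_i = J[i] @ m + h[i]              # synaptic input to p-bit i
            m[i] = 1 if rng.uniform(-1, 1) < np.tanh(beta * I_i) else -1
    print(m)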

5-4: Advanced Multicore Software Technologies 1 Session (15:45-17:00)

Co-Chairs: C. Byun & C. Long
Invited Talk: The Confluence of HPC, Data and AI for Science
Dr. Sudip Dosanjh (NERSC Director, LBL)
Finding Your Niche: An Evolutionary Approach to HPC Topologies [Best Paper Award]
Stephen J. Young, Joshua Suetterlein, Jesun Firoz, Joseph Manzano, Kevin Barker (PNNL)
Traditional interconnection network design approaches focus on building general network topologies by optimizing the bisection bandwidth or minimizing the network’s diameter to reduce the maximum distance between any two nodes, thus amortizing the overall execution time of the HPC workloads. While such network topologies may accommodate a wide variety of applications in general, this may result in sub-optimal performance for many frequently-executed or dynamic workloads. In this paper, instead of focusing on designing an all-encompassing, general-purpose network topology, we develop a methodology to design customized network interconnects, evolved by “finding” the optimal topologies for a particular target workload given by its communication and contention profiles. To this end, we implement a Genetic Algorithm (GA)-based approach for network topology design tailored to improve the overall execution time of a particular workload of interest. We conducted extensive experiments with well-known motifs in physics-based workloads (Sweep3D and FFT), as well as with a representative graph application (MiniVite), using the well-known Structural Simulation Toolkit (SST) Macroscale Element Library (SST/macro) simulator for network interconnect evaluation. We demonstrate that our genetic algorithm-based approach is robust enough to find the underlying optimal topology of a particular workload.
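A generic genetic-algorithm skeleton of the kind described; illustrative only, with genomes assumed to be bit-lists encoding candidate topologies and fitness abstracted as an evaluate callback (in the paper, fitness comes from SST/macro simulations of the target workload):

    # GA skeleton: selection, one-point crossover, bit-flip mutation (lower score = better).
    import random
    def evolve(population, evaluate, generations=50, mutation_rate=0.05):
        for _ in range(generations):
            scored = sorted(population, key=evaluate)
            parents = scored[:len(scored) // 2]              # keep the fitter half
            children = []
            while len(children) < len(population) - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, len(a))
                child = a[:cut] + b[cut:]                    # one-point crossover
                child = [1 - g if random.random() < mutation_rate else g
                         for g in child]                     # bit-flip mutation
                children.append(child)
            population = parents + children
        return min(population, key=evaluate)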
IRIS-DMEM: Efficient Memory Management for Heterogeneous Computing [Outstanding Paper Award]
Narasinga Rao Miniskar, Mohammad Alaul Haque Monil, Pedro Valero-Lara, Frank Y. Liu, Jeffrey S. Vetter (ORNL)
This paper proposes an efficient data memory management approach for the Intelligent RuntIme System (IRIS) heterogeneous computing framework along with new data transfer policies. IRIS provides a task-based programming model for extreme heterogeneous computing (e.g., CPU, GPU, DSP, FPGA) with support for today’s most important programming languages (e.g., OpenMP, OpenCL, CUDA, HIP, OpenACC). However, the IRIS framework either forces the programmer to introduce data transfer commands for each task or relies on suboptimal memory management for automatic and transparent data transfers. The work described here extends IRIS with novel heterogeneous memory handling and introduces novel data transfer policies by employing the Distributed data MEMory handler (DMEM) for efficient and optimal movement of data among the various computing resources. The proposed approach achieves performance gains of up to 7× for tiled LU factorization and tiled DGEMM (i.e., matrix multiplication) benchmarks. Moreover, this approach also reduces data transfers by up to 71% when compared to previous IRIS heterogeneous memory management handlers. This work compares the performance results of the IRIS framework’s novel DMEM with the StarPU runtime and the MAGMA math library for GPUs. Experiments show a performance gain of up to 1.95× over StarPU and 2.1× over MAGMA.
The Aggressive Oversubscribing Scheduling for Interactive Jobs on a Supercomputing System
Shohei Minami, Toshio Endo, Akihiro Nomura (Tokyo Inst. of Tech.)
As interactive usage of supercomputing systems becomes popular, especially in the AI and machine learning (ML) field, the systems are expected to provide resources in real time. Because interactive jobs have different features from traditional batch jobs, the systems should be designed to accept both types of jobs efficiently. This paper shows that aggressive oversubscribing scheduling, in which multiple jobs share computational resources regardless of job type, can effectively process such hybrid workloads. We investigate the behavior of real interactive jobs with fluctuating CPU utilization and describe a simulation method that combines existing workload trace data with data on CPU utilization. Through the evaluation, we demonstrate that oversubscribing scheduling achieves a short response time for interactive jobs. Our solution also eliminates the need to configure dedicated queues per job type and is robust to changes in the demand for interactive jobs.
In-Place Multi-Core SIMD FFTs
Benoît Dupont de Dinechin (Kalray), Julien Hascoët (INSA Rennes, IETR / Kalray), Orégane Desrentes (INSA Lyon, CITI and Kalray)
We revisit 1D Fast Fourier Transform (FFT) implementation approaches in the context of compute units composed of multiple cores with SIMD ISA extensions sharing a multi-banked local memory. A main constraint is to be sparing in the use of local memory, which motivates us to use in-place FFT implementations and to generate the twiddle factors with trigonometric recurrences. A key objective is to maximize bandwidth of the multi-banked local memory system by ensuring that cores issue maximum-width aligned non-temporal SIMD accesses. We propose combining the SIMD lane-slicing and sample partitioning techniques to derive multicore FFT implementations that do not require matrix transpositions and only involve one stage of bit-reverse unscrambling. This approach is demonstrated on the Kalray MPPA3 processor compute unit, where it outperforms the classic six-step algorithm for multicore FFT implementation.
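A minimal sketch of generating twiddle factors by trigonometric recurrence rather than a stored table (illustrative only; production implementations use compensated recurrences to control rounding error):

    # Twiddle factors w_k = exp(-2*pi*i*k/n) generated by the recurrence w_{k+1} = w_k * w_1.
    import cmath
    def twiddles(n):
        w1 = cmath.exp(-2j * cmath.pi / n)
        w, out = 1.0 + 0.0j, []
        for _ in range(n):
            out.append(w)
            w *= w1
        return out
    print(twiddles(8))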

5-S1: Advanced Multicore Software Technologies 2 Special (17:30-19:30)

Co-Chairs: D. Enright & TBD
Multiarchitecture Hardware Acceleration of Hyperdimensional Computing
Ian Peitzsch, Mark Ciora, Alan George (Univ. of Pittsburgh)
Hyperdimensional computing (HDC) is a machine-learning method that seeks to mimic the high-dimensional nature of data processing in the cerebellum. To achieve this goal, HDC represents data as large vectors, called hypervectors, and uses a set of well-defined operations to perform symbolic computations on these hypervectors. Using this paradigm, it is possible to create HDC models for classification tasks. These HDC models work by first transforming the input data into hypervectors and then combining hypervectors of the same class to create a single hypervector representing that class. The models classify new information by transforming new input data into hypervectors, comparing the similarity of the data hypervector with each class hypervector, and then classifying it based on which class has the highest similarity. Over the past few years, HDC models have greatly improved in accuracy and now compete with more common classification techniques for machine learning, such as neural networks. Additionally, manipulating hypervectors involves many repeated basic operations, making HDC models easy to accelerate on different hardware platforms. This research seeks to exploit this ease of acceleration and utilizes oneAPI libraries with SYCL to create multiple accelerators for HDC learning tasks on CPUs, GPUs, and field-programmable gate arrays (FPGAs). The oneAPI tools are used in this research to accelerate single-pass learning, gradient-descent learning using the NeuralHD algorithm, and inference. Each of these tasks is benchmarked on the Intel Xeon Platinum 8256 CPU, Intel UHD 11th-generation GPU, and Intel Stratix 10 FPGA. The GPU implementation showcased the fastest training times for single-pass training and NeuralHD training, with 0.89s and 126.55s, respectively. The FPGA implementation exhibited the lowest inference latency, with an average of 0.28ms.
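For context, a minimal single-pass HDC classifier with bipolar hypervectors; this sketches the general paradigm using a hypothetical random-projection encoder and is not the paper’s oneAPI/SYCL implementation:

    # Single-pass HDC: encode, bundle per class, classify by cosine similarity.
    import numpy as np
    rng = np.random.default_rng(0)
    D, F = 10000, 64                              # hypervector dimension, input features
    proj = rng.choice([-1, 1], size=(F, D))       # random projection encoder
    def encode(x):
        return np.sign(x @ proj)                  # feature vector -> bipolar hypervector
    def train(xs, ys, n_classes):
        cls = np.zeros((n_classes, D))
        for x, y in zip(xs, ys):
            cls[y] += encode(x)                   # bundle hypervectors of the same class
        return cls
    def classify(x, cls):
        hv = encode(x)
        sims = (cls @ hv) / (np.linalg.norm(cls, axis=1) * np.linalg.norm(hv) + 1e-9)
        return int(np.argmax(sims))               # highest cosine similarity wins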
Optimizing Compression Schemes for Parallel Sparse Tensor Algebra
Helen Xu (LBNL), Tao B. Schardl (MIT CSAIL), Michael Pellauer (NVIDIA), Joel Emer (MIT CSAIL)
This paper studies compression techniques for parallel in-memory sparse tensor algebra. We find that applying simple existing compression schemes can lead to performance loss in some cases. To resolve this issue, we introduce an optimized algorithm for processing compressed inputs that can improve both the space usage as well as the performance compared to uncompressed inputs. We implement the compression techniques on top of a suite of sparse matrix algorithms generated by taco, a compiler for sparse tensor algebra. On a machine with 48 hyperthreads, our empirical evaluation shows that compression reduces the space needed to store the matrices by over 2× without sacrificing algorithm performance.
ProtoX: A First Look
Het Mankad, Sanil Rao (Carnegie Mellon Univ.), Phillip Colella, Brian Van Straalen (LBNL), Franz Franchetti (Carnegie Mellon Univ.)
We present a first look at ProtoX, a code generation framework for stencil and pointwise operations that occur frequently in the numerical solution of partial differential equations. ProtoX has Proto as its library frontend and SPIRAL as the backend. Proto is a C++ based domain-specific library that optimizes the algorithms used to compute the numerical solution of partial differential equations. Meanwhile, SPIRAL is a code generation system that focuses on generating highly optimized target code. Although the current design layout of Proto and its high level of abstraction provide a user-friendly setup, there is still room for improving its performance by applying various techniques either at the compiler level or at the algorithmic level. Hence, in this paper we propose adding SPIRAL as the library backend for Proto, enabling abstraction fusion, which is usually difficult for any compiler to perform. We demonstrate the construction of ProtoX by considering the 2D Poisson equation as a model problem from Proto. We provide the final generated code for CPU, multicore CPU, and GPU, as well as some performance numbers for CPU.
Dynamic Data Partitioning in the WAFL File System
Jian Hu, Matthew Curtis-Maury, Vinay Devadas (NetApp)
The WAFL file system leverages data partitioning to manage parallel processing of client and internal operations. Such operations are mapped to data partitions based on the data they touch. Changes in load over time can result in imbalances between data partitions, which can limit system parallelism. Ideally, the system would respond by periodically re-optimizing the partition mappings. However, there are a variety of challenges in making such changes on-the-fly without disrupting client accesses. In this paper, we present an approach that dynamically repartitions file system objects to balance partition load and increase parallelism. Our evaluation shows substantial performance gains across a variety of realistic workloads and configurations.
Accelerating Training Data Generation Using Optimal Parallelization and Thread Counts
Jonathan Levine, Leonard MacEachern (Carleton Univ.)
This paper presents a method for accelerating training data generation by optimizing the thread allocation and the number of simulations run in parallel on commercially available numerical simulation software targeting consumer-level CPUs. Hardware facilities for thread management and disparate CPU core capabilities are addressed by the method. The method scales with the number of CPU cores, and a demonstrated speed-up in data generation throughput of approximately 550% compared to relevant previous work is reported in support of the method. In general, the proposed method involves a relatively minor pre-processing step that enables drastic throughput improvements in subsequent dataset generation steps, with direct application to neural network development.