4-K: Keynote Session (10:30-11:00)
Co-Chairs: J. Kepner & A. Reuther
Co-Chairs: H.Badawy & D.Cousins
In high-performance computing (HPC), the demand for efficient parallel programming models has grown dramatically since the end of Dennard Scaling and the subsequent move to multi-core CPUs. OpenMP stands out as a popular choice due to its simplicity and portability, offering a directive-driven approach for shared-memory parallel programming. Despite its wide adoption, however, there is a lack of comprehensive data on the actual usage of OpenMP.
This paper presents a statistical analysis of OpenMP usage based on a novel and extensive database, HPCORPUS, compiled from GitHub repositories containing C, C++, and Fortran code. The results reveal that OpenMP is the dominant parallel programming model, accounting for 45% of all analyzed parallel APIs. Furthermore, it has demonstrated steady and continuous growth in popularity over the past decade. Analyzing specific OpenMP constructs, the study provides in-depth insights into their usage patterns and preferences across the three languages. Notably, we found that while OpenMP has a strong “common core” of constructs in common use (the rest of the API sees far less use), newer constructs such as simd, target directives for accelerated computing, and tasks for irregular parallelism are also growing in adoption.
Overall, this study sheds light on OpenMP’s significance in HPC applications and provides valuable data for researchers and practitioners. It showcases OpenMP’s versatility, evolving adoption, and relevance in contemporary parallel programming, underlining its continued role in HPC applications and beyond. These statistical insights are essential for making informed decisions about parallelization strategies and provide a foundation for further advancements in parallel programming models and techniques.
HPCORPUS, as well as the analysis scripts and raw results, are available at: https://github.com/Scientific-Computing-Lab-NRCN/HPCorpus
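The kind of per-construct statistics described above can be gathered with simple pattern matching over source files. The sketch below is not the paper's analysis script; it is a minimal illustration, for C/C++ sources only, of counting OpenMP directive usage (Fortran's `!$omp` sentinels would need a second pattern), and the sample source is invented:

```python
import re
from collections import Counter

# OpenMP directives in C/C++ appear as "#pragma omp <directive>".
# "[ \t]+for" (same line only) catches combined forms like "parallel for".
OMP_PRAGMA = re.compile(r"#pragma\s+omp\s+([a-z]+(?:[ \t]+for)?)")

def count_openmp_constructs(source: str) -> Counter:
    """Tally OpenMP construct usage in a C/C++ source string."""
    return Counter(" ".join(m.group(1).split())
                   for m in OMP_PRAGMA.finditer(source.lower()))

# Invented sample source mixing a "common core" construct with newer ones.
code = """
#pragma omp parallel for
for (int i = 0; i < n; i++) a[i] += b[i];
#pragma omp simd
for (int i = 0; i < n; i++) c[i] = a[i] * 2;
#pragma omp target
{ /* offloaded region */ }
"""
print(count_openmp_constructs(code))
```

Aggregated over a corpus the size of HPCORPUS, such tallies yield the per-construct usage distributions the abstract describes.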
Analog arrays of non-volatile crossbars leverage physics to compute approximate matrix-vector multiplications in a rapid, in-memory fashion. In this paper we consider exploiting this technology to precondition the Generalized Minimum Residual iterative solver (GMRES). Since the preconditioner must be applied through matrix-vector multiplication, approximate inverse preconditioners are a natural fit. At the same time, the errors introduced by the analog hardware yield an iteration matrix that changes from one iteration to the next. To remedy this, we propose to combine analog approximate inverse preconditioning with a flexible GMRES algorithm, which naturally incorporates variations of the preconditioner into its model. The benefit of this approach is that it keeps the analog circuit much simpler than it would need to be if the errors were corrected at the hardware level. Our experiments with a simulator for analog hardware show that such an analog-flexible scheme can lead to fast convergence.
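The interplay between a noisy analog preconditioner and an iterative solver can be illustrated with a deliberately simplified sketch. The snippet below uses a preconditioned Richardson iteration rather than the paper's flexible GMRES, and the 2x2 system, approximate inverse, and 1% noise level are all invented for illustration. It shows that an approximate inverse whose application is perturbed differently on every iteration can still drive the residual down:

```python
import random

def matvec(A, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def noisy_apply(M, v, rel_noise=0.01, rng=random.Random(0)):
    """Apply preconditioner M with multiplicative noise, mimicking an
    analog crossbar whose matrix-vector products are only approximate."""
    y = matvec(M, v)
    return [y_i * (1 + rng.uniform(-rel_noise, rel_noise)) for y_i in y]

# Invented 2x2 SPD system Ax = b; M is an approximate inverse of A.
A = [[4.0, 1.0], [1.0, 3.0]]
M = [[3/11, -1/11], [-1/11, 4/11]]
b = [1.0, 2.0]

x = [0.0, 0.0]
for _ in range(30):
    r = [b_i - ax_i for b_i, ax_i in zip(b, matvec(A, x))]
    dx = noisy_apply(M, r)        # preconditioner differs on every step
    x = [x_i + dx_i for x_i, dx_i in zip(x, dx)]

residual = max(abs(b_i - ax_i) for b_i, ax_i in zip(b, matvec(A, x)))
print(x, residual)
```

Flexible GMRES generalizes this idea by storing the preconditioned directions explicitly, so a preconditioner that varies per iteration is accommodated by construction rather than treated as an error.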
4-P: Poster Session (12:15-14:15)
Organizer(s): TBD & TBD
Co-Chairs: C. Long & B. Thoelen
Graph Neural Networks (GNNs) have gained significant momentum recently due to their capability to learn on unstructured graph data. Dynamic GNNs (DGNNs) are the current state of the art for point cloud applications; such applications (e.g., autonomous driving) require real-time processing at the edge under tight latency and memory constraints. Performance analysis of such DGNNs is therefore crucial for evaluating network suitability.
This paper presents a profiling analysis of EdgeConv-based DGNNs applied to point cloud inputs. We assess their inference performance in terms of end-to-end latency and memory consumption on state-of-the-art CPU and GPU platforms. The EdgeConv layer has two stages: (1) dynamic graph generation using k-Nearest Neighbors (kNN) and (2) node feature update. Generating a dynamic graph via kNN in each EdgeConv layer improves network performance compared to networks that reuse the same static graph in every layer; this improvement, however, comes at the added computational cost of the dynamic graph generation stage. Understanding this cost is essential for identifying the performance bottleneck and exploring potential avenues for hardware acceleration. To this end, this paper aims to shed light on the performance characteristics of EdgeConv-based DGNNs for point cloud inputs. Our performance analysis on a state-of-the-art EdgeConv network for classification shows that dynamic graph construction via kNN accounts for upwards of 95% of network latency on the GPU and almost 90% on the CPU. Moreover, we propose a quasi-Dynamic Graph Neural Network (qDGNN) that halts dynamic graph updates after a specific depth within the network, significantly reducing latency on both CPU and GPU while matching the original network's inference accuracy.
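To see why the kNN stage dominates, note that brute-force kNN graph construction over n points costs O(n² · d) work and, in a fully dynamic network, is repeated in every EdgeConv layer. The pure-Python sketch below is illustrative only (real implementations run batched GPU kernels), and the point cloud is invented:

```python
def knn_graph(points, k):
    """Return, for each point, the indices of its k nearest neighbors.
    Brute force: all-pairs squared distances, O(n^2 * d)."""
    n = len(points)
    neighbors = []
    for i in range(n):
        dists = []
        for j in range(n):
            if i == j:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            dists.append((d2, j))
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Tiny invented 2-D point cloud: three clustered points and one outlier.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(knn_graph(cloud, k=2))
```

The qDGNN idea amounts to running this stage only for the first few layers and reusing the resulting neighbor lists thereafter.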
As FPGAs and GPUs continue to make inroads into high-performance computing (HPC), the need for languages and frameworks that offer performance, productivity, and portability across such heterogeneous platforms continues to grow.
OpenCL and SYCL have emerged as frameworks that offer cross-platform functional portability between FPGAs and GPUs.
While functional portability across a diverse set of platforms is an important feature of portable frameworks, achieving performance portability often requires vendor and platform-specific optimizations. Achieving performance portability, therefore, comes at the expense of productivity.
This paper presents a quantification of the tradeoffs between performance, portability, and productivity of OpenCL and SYCL. It extends and complements our prior work on quantifying performance-productivity tradeoffs between Verilog and OpenCL for the FPGA. In addition to evaluating the performance-productivity tradeoffs between OpenCL and SYCL, this work quantifies the performance portability (PP) of OpenCL and SYCL as well as their code convergence (CC), i.e., a measure of productivity across different platforms (e.g., FPGA and GPU).
Using two applications as case studies (i.e., edge detection using the Sobel filter, and graph link prediction using the Jaccard similarity index), we characterize the tradeoffs between performance, portability, and productivity.
Our results show that OpenCL and SYCL offer complementary tradeoffs. While OpenCL delivers better performance portability than SYCL, SYCL offers better code convergence and a 1.6x improvement in source lines of code over OpenCL.
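The performance-portability metric mentioned above has a commonly cited formal definition, due to Pennycook et al.: the harmonic mean of the application's performance efficiency on each platform in the set of interest, with PP = 0 if any platform is unsupported. Whether this paper uses exactly that definition is an assumption, and the efficiencies below are invented. A minimal sketch:

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform performance efficiencies in [0, 1];
    0 if the application fails to run on any platform in the set."""
    if not efficiencies or any(e == 0 for e in efficiencies):
        return 0.0
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)

# e.g., fraction of best-known performance achieved on FPGA and GPU
print(performance_portability([0.8, 0.4]))
```

The harmonic mean penalizes a single poor platform heavily, which is why a framework can deliver good peak performance somewhere yet still score low on performance portability.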
Co-Chairs: S.Gottlieb & B.Sroka
Leveraging Mixed Precision in Exponential Time Integration Methods [Outstanding Paper Award]
A recent trend in modern space systems is to move on board the satellite processing that until now was transmitted to the ground. Synthetic Aperture Radar (SAR) is an example of such processing. However, such computationally intensive processing requires high-performance hardware. In this paper we present the CPU and GPU acceleration of a SAR processing application, part of ESA’s open-source benchmarking suite OBPMark.
We benchmark several embedded multicore and GPU platforms that are promising candidates for future on-board systems. Our results show that both embedded multicores and, especially, GPUs can provide significant speedups in this type of processing and achieve performance levels similar to those of high-performance ground stations.
Co-Chairs: H. Badawy & J. Mullen