2021
IEEE High Performance Extreme Computing
Virtual Conference
20 - 24 September 2021
Tuesday, September 21
2-V: Policy Spotlight Session (10:30-11:00)
Session Chair(s): Albert Reuther
Invited Talk: Public Investments In Science for the Economic Security of the United States
Prof. Jonathan Gruber (MIT Sloan School)
Spectral Graph Partitioning Using Geodesic Distance-based Projection
Yasunori Futamura (University of Tsukuba)*; Ryota Wakaki (University of Tsukuba); Tetsuya Sakurai (University of
Tsukuba)
The graph partitioning problem is one of the most fundamental combinatorial problems with a wide variety of
applications. In this paper, we propose an efficient approach to spectral graph partitioning, a method with a
solid theoretical foundation that is nevertheless less efficient in practice than standard multilevel partitioners. The proposed method
approximates the eigenvectors of a graph-Laplacian matrix using basis vectors derived by
a geodesic distance-based approach, instead of a standard linear algebraic approach. This geodesic distance-based
approach is the key to making our method competitive in comparison to the multilevel partitioners. The primary
building blocks of our method are the breadth-first search and the sparse matrix-vector product, which are the main
kernels intensively studied in the high-performance computing field. Based on a performance evaluation in the
shared-memory parallel setting, we demonstrate that our method is comparable to standard partitioners in terms of
both quality and speed. We also show that our method has the potential to facilitate reproducibility-aware parallel
partitioning.
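As a rough illustration of the standard linear-algebraic baseline that spectral partitioners build on (not the authors' geodesic method), spectral bisection can be done by approximating the Fiedler vector of the graph Laplacian; every name below is a hypothetical pure-Python stand-in:

```python
# Illustrative spectral bisection: approximate the Fiedler vector of a
# small graph's Laplacian by power iteration on shift*I - L (deflating
# the constant eigenvector), then split the vertices by sign.
import math

def laplacian(n, edges):
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1.0
        L[v][v] += 1.0
        L[u][v] -= 1.0
        L[v][u] -= 1.0
    return L

def spectral_bisect(n, edges, iters=2000):
    L = laplacian(n, edges)
    shift = 2.0 * max(L[i][i] for i in range(n))   # shift*I - L is PSD
    x = [math.sin(i + 1.0) for i in range(n)]      # arbitrary start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]                # deflate constant eigenvector
        y = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return [0 if xi < 0 else 1 for xi in x]

# two triangles joined by one bridge edge: the cut separates them
parts = spectral_bisect(6, [(0, 1), (1, 2), (0, 2),
                            (3, 4), (4, 5), (3, 5), (2, 3)])
assert parts[0] == parts[1] == parts[2] != parts[3] == parts[4] == parts[5]
```

The paper's contribution is to replace this linear-algebraic eigenvector approximation with basis vectors built from BFS-computed geodesic distances, which is what makes its kernels reduce to BFS and sparse matrix-vector products.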
Vertical, Temporal, and Horizontal Scaling of Hierarchical Hypersparse GraphBLAS Matrices
Jeremy Kepner (MIT Lincoln Laboratory)*; Timothy A Davis (Texas A&M University); Chansup Byun (MIT Lincoln
Laboratory); William Arcand (MIT); David Bestor (MIT); William Bergeron (MIT); Vijay Gadepally (MIT Lincoln
Laboratory); Michael Houle (MIT); Matthew Hubbell (MIT Lincoln Laboratory); Michael S Jones (MIT Lincoln
Laboratory); Anna Klein (MIT); Lauren Milechin (MIT); Julie S Mullen (MIT Lincoln Laboratory); Andrew Prout (MIT);
Albert Reuther (MIT Lincoln Laboratory); Antonio Rosa (MIT); Siddharth Samsi (MIT Lincoln Laboratory); Charles Yee
(MIT Lincoln Laboratory); Peter Michaleas (MIT Lincoln Laboratory)
Hypersparse matrices are a powerful enabler for a variety of network, health, finance, and social applications.
Hierarchical hypersparse GraphBLAS matrices enable rapid streaming updates while preserving algebraic analytic
power and convenience. In many contexts, the rate of these updates sets the bounds on performance. This paper
explores hierarchical hypersparse update performance on a variety of hardware with identical software
configurations. The high-level language bindings of the GraphBLAS readily enable performance experiments on
diverse hardware simultaneously. The best single process performance measured was 20,000,000 updates per
second. The best single node performance measured was 170,000,000 updates per second. The hardware used
spans nearly a decade and allows a direct comparison of hardware improvements for this computation over this time
range, showing a 2x increase in single-core performance, a 3x increase in single process performance, and a 5x
increase in single node performance. Running on nearly two thousand MIT SuperCloud nodes simultaneously
achieved a sustained update rate of over 200,000,000,000 updates per second. Hierarchical hypersparse
GraphBLAS allows the MIT SuperCloud to analyze extremely large streaming network data sets.
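As a loose illustration of the hierarchical accumulation idea (an assumption about the scheme, not the GraphBLAS implementation), associative addition lets a small, fast layer absorb streaming updates and be add-merged into successively larger layers when it fills; all class and parameter names here are hypothetical:

```python
# Hierarchical hypersparse accumulation sketch: stream (row, col, value)
# updates into the smallest layer; when a layer exceeds its cutoff,
# add-merge it into the next larger layer. Because merging is additive,
# the summed matrix keeps standard plus-times semantics throughout.
class HierSparse:
    def __init__(self, cutoffs=(4, 16, 64)):
        self.cutoffs = cutoffs
        self.layers = [dict() for _ in cutoffs] + [dict()]  # last layer unbounded

    def update(self, row, col, val):
        layer = self.layers[0]
        layer[(row, col)] = layer.get((row, col), 0) + val
        for i, cut in enumerate(self.cutoffs):              # cascade merges upward
            if len(self.layers[i]) > cut:
                nxt = self.layers[i + 1]
                for k, v in self.layers[i].items():
                    nxt[k] = nxt.get(k, 0) + v
                self.layers[i] = {}

    def total(self):
        # collapse all layers into one matrix (dict of entries)
        out = {}
        for layer in self.layers:
            for k, v in layer.items():
                out[k] = out.get(k, 0) + v
        return out

m = HierSparse()
for i in range(100):
    m.update(i % 7, i % 5, 1)
assert sum(m.total().values()) == 100   # no update is lost in the cascade
```

Most updates touch only the tiny first layer, which is why the update rate stays high as the accumulated matrix grows.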
Delayed Asynchronous Iterative Graph Algorithms
Mark P Blanco (Carnegie Mellon University)*; Tze Meng Low (Carnegie Mellon University); Scott McMillan (CMU
Software Engineering Institute)
Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated
output values are used as inputs for computations in current or subsequent iterations; hence the number of iterations
required for values to converge can potentially be reduced if the newest values are asynchronously made available to
other updates computed in the same iteration. In a multi-threaded shared memory system, the immediate
propagation of updated values can cause memory contention that may offset the benefit of propagating updates
sooner. In some cases, the benefit of a smaller number of iterations may be diminished by each iteration taking
longer. Our key idea is to combine the low memory contention that synchronous approaches have with the faster
information sharing of asynchronous approaches. Our hybrid approach buffers updates from threads locally before
committing them to the global store to control how often threads may cause conflicts for others while still sharing
data within one iteration and hence speeding convergence. On a 112-thread CPU system, our hybrid approach
attains a 4.5%-19.4% speedup over an asynchronous approach for PageRank and a 1.9%-17% speedup
over asynchronous Bellman-Ford SSSP. Further, our hybrid approach attains 2.56x better performance than the
synchronous approach. Finally, we provide insights as to why delaying updates is not helpful on certain graphs where
connectivity is clustered on the main diagonal of the adjacency matrix.
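Read as pseudocode, the buffering idea can be simulated sequentially on Bellman-Ford SSSP: relax edges into a small local buffer and only min-merge it into the shared distance array every few updates, trading contention against freshness. This is a hypothetical single-threaded sketch, not the paper's multi-threaded code:

```python
# Buffered-update Bellman-Ford sketch: a worker keeps its freshest
# tentative distances in a local buffer and commits (min-merges) to the
# shared array only every `flush` buffered entries.
import math

def buffered_sssp(n, edges, src, flush=4):
    dist = [math.inf] * n
    dist[src] = 0.0
    changed = True
    while changed:
        changed = False
        buf = {}
        def commit():
            nonlocal changed
            for v, d in buf.items():
                if d < dist[v]:
                    dist[v] = d
                    changed = True
            buf.clear()
        for u, v, w in edges:
            base = min(dist[u], buf.get(u, math.inf))   # newest local value
            if base + w < min(dist[v], buf.get(v, math.inf)):
                buf[v] = base + w
                if len(buf) >= flush:
                    commit()                            # delayed commit
        commit()
    return dist

edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0), (2, 3, 1.0)]
assert buffered_sssp(4, edges, 0) == [0.0, 1.0, 2.0, 3.0]
```

A larger `flush` means fewer writes to the shared store (less contention) but staler values for other workers, which is exactly the trade-off the hybrid approach tunes.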
Inverse-Deletion BFS - Revisiting Static Graph BFS Traversals with Dynamic Graph Operations
Oded Green (NVIDIA)*
Breadth-First Search (BFS) traversals appear in a wide range of applications and domains. BFS traversals determine
the distance between key vertices and the remaining vertices in the network. This distance, often
called the number of hops, is the length of the shortest path between a root and each remaining vertex in the graph. Given its
applicability across multiple domains, BFS has received significant attention from theoretical and applied
communities, including many algorithms for parallelizing BFS. BFS traversals are typically conducted on a graph
snapshot (commonly referred to as a static graph). In this paper, we show a novel algorithm for executing a BFS
traversal. While our algorithm also uses a static graph for its input, we show how to perform a BFS traversal using
dynamic graph operations. Specifically, in each BFS level, we remove the subset of the edges that are unnecessary
for the next phase. These edges do not impact finding the vertices in the next level and reduce random memory
accesses. We show a top-down BFS variation of our new algorithm. While our implementation does not outperform
state-of-the-art implementations, its performance is competitive. Furthermore, it shows a novel way to implement BFS
and opens up many research opportunities.
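A minimal sketch of the level-by-level deletion idea, assuming a mutable adjacency-set representation (illustrative only, not the paper's implementation):

```python
# Inverse-deletion-style BFS sketch: after settling each level, delete
# edges that point into already-visited vertices; they cannot contribute
# to any later level, so later levels scan fewer edges.
def bfs_with_deletion(adj, root):
    # adj: dict vertex -> set of neighbors (mutated by deletions)
    level = {root: 0}
    frontier = [root]
    depth = 0
    while frontier:
        depth += 1
        nxt = []
        for u in frontier:
            for v in list(adj[u]):
                if v not in level:
                    level[v] = depth
                    nxt.append(v)
        # dynamic-graph step: drop edges into the visited set
        for u in adj:
            adj[u].difference_update(level)
        frontier = nxt
    return level

adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}
assert bfs_with_deletion(adj, 0) == {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

The deletions shrink the edge set as the traversal advances, which is where the reduction in random memory accesses comes from.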
Invited Talk: The New Internet: From Communication Medium to Transaction Medium
Prof. Sandy Pentland (MIT Media Lab)
2-1: Graph Analytics & Network Science 1 Session (11:00-12:15)
Session Co-Chairs: Sadas Shankar & TBD
2-2: Graph Analytics & Network Science 2 Session (12:30-13:45)
Session Co-Chairs: Sadas Shankar & Nikos Pitsianis
The GraphBLAS in Julia and Python: the PageRank and Triangle Centralities
Michel Pelletier (Graphegon)*; William R Kimmerer (Massachusetts Institute of Technology); Timothy A Davis (Texas A&M
University); Timothy Mattson (Intel)
The GraphBLAS is a standard API for expressing graph algorithms in the language of linear algebra. The goal is to provide high
performance while exploiting the fundamental simplicity of graph algorithms in terms of a common set of “Basic Linear
Algebra Subprograms”. A robust parallel implementation of the GraphBLAS C specification is available as the
SuiteSparse:GraphBLAS library. The simplicity of the GraphBLAS, so apparent “in the math”, is diminished when
expressed in terms of a low level language such as C. To see the full expressive power of the GraphBLAS, a high level
interface is needed so that the elegance of the mathematical underpinnings of the GraphBLAS is clearly apparent in the
code. In this paper we introduce the Julia interface to the SuiteSparse:GraphBLAS library and compare it to the Python
interface [2]. We implement the PageRank and Triangle Centrality algorithms with remarkably little code and show
that no significant performance is sacrificed by moving from C to the more productive Python and Julia interfaces.
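To give a flavor of the formulation those interfaces express, PageRank is a damped sparse matrix-vector iteration; the following is a hypothetical pure-Python sketch (the GraphBLAS versions replace the explicit loops with a single masked matrix-vector product per iteration):

```python
# PageRank as repeated sparse matrix-vector products: each iteration
# redistributes rank along out-edges with damping, plus a uniform
# teleport term; dangling vertices spread their rank evenly.
def pagerank(out_edges, n, damping=0.85, iters=100):
    rank = [1.0 / n] * n
    out_deg = [len(out_edges.get(u, [])) for u in range(n)]
    for _ in range(iters):
        nxt = [(1.0 - damping) / n] * n
        for u in range(n):
            if out_deg[u]:
                share = damping * rank[u] / out_deg[u]
                for v in out_edges[u]:
                    nxt[v] += share
            else:                      # dangling vertex
                for v in range(n):
                    nxt[v] += damping * rank[u] / n
        rank = nxt
    return rank

# small cycle plus a chord: vertex 0 has the most in-links
r = pagerank({0: [1], 1: [2], 2: [0], 3: [0]}, 4)
assert abs(sum(r) - 1.0) < 1e-9     # rank mass is conserved
assert r.index(max(r)) == 0
```

In a GraphBLAS binding the whole inner loop becomes roughly `rank = damping * (A_T @ rank_scaled) + teleport`, which is the conciseness the paper demonstrates in Julia and Python.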
HyPC-Map: A Hybrid Parallel Community Detection Algorithm Using Information-Theoretic Approach
Md Abdul Motaleb Faysal (University of New Orleans)*; Shaikh M. Arifuzzaman (University of New Orleans); Cy Chan
(Berkeley Lab); Maximilian Bremer (Berkeley Lab); Doru Thom Popovici (Lawrence Berkeley National Laboratory); John Shalf
(Lawrence Berkeley National Laboratory)
Community detection has become an important graph analysis kernel due to the tremendous growth of social networks and
genomics discoveries. Even though there exist a large number of algorithms in the literature, studies show that community
detection based on an information-theoretic approach (known as Infomap) delivers better quality solutions than others. Being
inherently sequential, the Infomap algorithm does not scale well for large networks. In this work, we develop a hybrid parallel
approach for community detection in graphs using Information Theory. We perform extensive benchmarking and analyze
hardware parameters to identify and address performance bottlenecks. Additionally, we use cache-optimized data structures
to improve cache locality. All of these optimizations lead to an efficient and scalable community detection algorithm, HyPC-
Map, which demonstrates a 25-fold speedup (much higher than the state-of-the-art map-based techniques) without sacrificing
the quality of the solution.
Extended Abstract: K-Truss Implementation in Arkouda
Joseph T Patchett (New Jersey Institute of Technology)*; ZHIHUI DU (New Jersey Institute of Technology); David Bader
(New Jersey Institute of Technology)
We present a k-truss implementation in Arkouda, an open-source data science framework for graph analytics.
A GraphBLAS implementation of Triangle Centrality
Fuhuan Li (New Jersey Institute of Technology); David Bader (New Jersey Institute of Technology)*
Identifying key members in large social network graphs is an important graph analytic. Recently, a new centrality measure
called triangle centrality was proposed that finds members based on the triangle support of vertices in a graph. In this paper, we describe
our rapid implementation of triangle centrality using GraphBLAS, an API specification for describing graph algorithms in
the language of linear algebra. We use triangle centrality’s algebraic algorithm and easily implement it using the
SuiteSparse GraphBLAS library. A set of experiments on large, sparse graph datasets is conducted to verify the
implementation.
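The algebraic core of triangle-based measures is the masked product A .* (A @ A), which counts, for each edge, the common neighbors of its endpoints. Below is a pure-Python sketch of the per-vertex triangle counts that triangle centrality builds on (illustrative, not the authors' code):

```python
# Per-vertex triangle counts via the entrywise-masked product A .* (A @ A):
# the (u, v) entry of A @ A restricted to existing edges is the number of
# common neighbors of u and v, i.e. triangles through edge (u, v).
def triangle_counts(adj):
    # adj: dict vertex -> set of neighbors (undirected, no self loops)
    tri = {v: 0 for v in adj}
    for u in adj:
        for v in adj[u]:
            tri[u] += len(adj[u] & adj[v])   # common neighbors of edge (u, v)
    return {v: t // 2 for v, t in tri.items()}  # each triangle counted twice per vertex

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
assert triangle_counts(adj) == {0: 1, 1: 2, 2: 2, 3: 1}
```

In SuiteSparse:GraphBLAS the double loop collapses to one masked matrix multiply over the plus-times semiring, which is why the paper's implementation is so short.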
2-3: Graph Analytics & Network Science 3 Session (14:15-15:30)
Session Co-Chairs: Scott McMillan & Nikos Pitsianis
Enabling Exploratory Large Scale Graph Analytics through Arkouda
ZHIHUI DU (New Jersey Institute of Technology); Oliver A Alvarado Rodriguez (New Jersey Institute of Technology)*;
David Bader (New Jersey Institute of Technology)
Exploratory graph analytics helps maximize the informational value from a graph. However, the increasing graph size
makes it impossible for existing popular exploratory data analysis tools to handle dozens-of-terabytes or even larger data
sets in the memory of a common laptop/personal computer. Arkouda is a framework under early development that brings
together the productivity of Python at the user side with the high performance of Chapel at the server side. In this paper,
we present preliminary work on overcoming the memory limit and the HPC coding roadblock that keep high-level
Python users from performing large graph analysis. A simple and succinct graph data structure design and implementation
at both the Python front-end and the Chapel back-end in the Arkouda framework are provided. A typical graph algorithm,
Breadth-First Search (BFS), is used to show how we can use Chapel to develop high performance parallel graph algorithms
productively. Two Chapel-based parallel Breadth-First Search (BFS) algorithms, one high level version and one
corresponding low level version, have been implemented in Arkouda to support analyzing large graphs. Multiple graph
benchmarks are used to evaluate the performance of the provided graph algorithms. Experimental results show that we
can optimize the performance by tuning the selection of different Chapel high level data structures and parallel constructs.
Our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda).
Digraph Clustering by the BlueRed Method
Tiancheng Liu (Duke University)*; Dimitris Floros (Aristotle University of Thessaloniki); Nikos P Pitsianis (Aristotle
University of Thessaloniki); Xiaobai Sun (Duke University)
We introduce a new method for vertex clustering or community detection on directed graphs (digraphs). The new method
is an extension of the BlueRed method introduced initially for undirected graphs. Complementary to supervised or
semisupervised classification, unsupervised graph clustering is indispensable to exploratory data analysis and knowledge
discovery. Conventional graph clustering methods are fundamentally hindered in effectiveness and efficiency by either the
resolution limit or various problems with resolution parameter selection. BlueRed is original in its analysis, modeling, and
solution approach. Its clustering process is simple, fully autonomous and unsupervised. Among other potential impacts,
BlueRed breaks new ground for high-throughput, low-cost and high-performance graph clustering computation, as it has
removed the barrier of parameter tuning/selection. We report benchmark studies with real-world graph data for evaluating
the new method. The clustering results are in remarkable agreement with the ground truth labels. We also present an
important study on the U.S. patent citation graph CITE75_99. More than a quarter of 3.7 million patents have no electronic
records of category codes. With BlueRed, we are able to efficiently and economically give a semantic presentation of the
patents without category codes.
An Efficient Algorithm for the Construction of Dynamically Updating Trajectory Networks
Deniz Gurevin (University of Connecticut)*; Chris Michael (Naval Research Laboratory); Omer Khan (University of
Connecticut)
Trajectory based spatiotemporal networks (STN) are useful in a wide range of applications, such as crowd behavior
analysis. A significant portion of trajectory-network research focuses on optimizing the analysis of STNs to
characterize, control, and predict network behavior. However, these mining algorithms are typically carried out on a
preconstructed network structure that tracks all moving objects and their trajectories in real time. The construction of such a
trajectory network is itself a computationally expensive task and it is becoming a bigger burden with advancements in
analysis algorithms. The traditional approach is to construct static networks from the temporal snapshots of trajectory data
that cannot handle spatiotemporally changing data. This paper proposes an efficient algorithm that successfully generates
and maintains an ST network from raw trajectory data. The proposed method is based on a customized R-Tree based
constructor to keep track of object trajectories and their interactions. It avoids redundant updates as trajectories evolve
over time, resulting in a significant reduction in STN construction time. Based on the experiments we conducted, our
method reduces the number of updates by 71.25% as compared to the static naive STN construction method.
Graph Embedding and Field Based Detection of Non-Local Webs in Large Scale Free Networks
Michael E Franusich (Carnegie Mellon University)*; Franz Franchetti (CMU)
Scalable graph algorithms are usually discrete and heuristics-based and their performance may strongly depend on the
topology and size of the targeted graph. In this paper we investigate an alternative approach where we translate graph
problems into a continuous-space problem: we embed graphs into a higher-dimensional Euclidean space using an
N-body approach. Then we define a synthetic field as a function of the Euclidean-space vertex locations and find peaks in the
field. These peaks are used to identify regions and thus vertices of interest, and are an emergent property. We test the
approach to identify a low-degree web that is connecting otherwise unrelated vertices in a power law graph. We implement
the algorithm in CUDA and show good scalability to graphs with 7 million nodes.
2-4: Data Intensive Computing Session (15:45-17:00)
Session Co-Chairs: Seung Woo Son & Tim Mattson
Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA
Sasindu Wijeratne (University of Southern California)*; Rajgopal Kannan (USC); Viktor K Prasanna (University of
Southern California)
Tensor decomposition has become an essential tool in many applications in various domains, including machine learning.
Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is one of the most computationally expensive kernels in
tensor computations. Despite having significant computational parallelism, MTTKRP is a challenging kernel to optimize due
to its irregular memory access characteristics. This paper focuses on a multi-faceted memory system, which explores the
spatial and temporal locality of the data structures of MTTKRP. Further, users can reconfigure our design depending on the
behavior of the compute units used in the FPGA accelerator. Our system efficiently accesses all the MTTKRP data
structures while reducing the total memory access time, using a distributed cache and Direct Memory Access (DMA)
subsystem. Moreover, our work improves the memory access time by 3.5x compared with commercial memory controller
IPs. Also, our system shows 2x and 1.26x speedups compared with cache-only and DMA-only memory systems,
respectively.
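For reference, the MTTKRP kernel itself is simple to state; the difficulty the paper targets comes from the scattered row accesses it generates. A hypothetical pure-Python sketch for a 3-way sparse tensor in COO form:

```python
# MTTKRP sketch (mode-0) for a 3-way sparse tensor: for each nonzero
# X[i,j,k], accumulate X[i,j,k] * (B[j,:] .* C[k,:]) into row i of the
# output M. The row reads of B and C and the row updates of M are the
# irregular memory accesses a custom memory system must serve.
def mttkrp(coo, B, C, n_i, rank):
    # coo: list of (i, j, k, val); B and C: lists of length-`rank` rows
    M = [[0.0] * rank for _ in range(n_i)]
    for i, j, k, val in coo:
        for r in range(rank):
            M[i][r] += val * B[j][r] * C[k][r]
    return M

coo = [(0, 0, 0, 2.0), (0, 1, 1, 3.0), (1, 0, 1, 1.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
M = mttkrp(coo, B, C, 2, 2)
assert M == [[73.0, 120.0], [7.0, 16.0]]
```

Because consecutive nonzeros can land on arbitrary rows of B, C, and M, locality depends entirely on the tensor's sparsity pattern, which is what motivates a reconfigurable cache/DMA memory system.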
Non-Volatile Memory Accelerated Geometric Multi-Scale Resolution Analysis
Andrew E Wood (Boston University)*; Moshik Hershcovitch (IBM Research); Daniel G Waddington (IBM Research); Sarel
Cohen (Hasso Plattner Institute); Meredith G Wolf (Williams College); Hongjun Suh (Seoul National University); Weiyu Zong
(Boston University); Sang Chin (Boston University)
Dimensionality reduction algorithms are standard tools in a researcher's toolbox. They are frequently used to augment
downstream tasks such as machine learning and data science, and also serve as exploratory methods for understanding
complex phenomena. For instance, dimensionality reduction is commonly used in Biology as well as
Neuroscience to understand data collected from biological subjects. However, dimensionality reduction techniques are
limited by the von Neumann architectures on which they execute. Specifically, data-intensive algorithms such as
dimensionality reduction often require fast, high-capacity, persistent memory, which hardware has historically been
unable to provide all at once. In this paper, we present a re-implementation of an existing dimensionality
reduction technique called Geometric Multi-Scale Resolution Analysis (GMRA) which has been accelerated via novel
persistent memory technology called Memory Centric Active Storage (MCAS). Our implementation uses a specialized
version of MCAS called PyMM that provides native support for Python datatypes including NumPy arrays and PyTorch
tensors. We compare our PyMM implementation against a DRAM implementation, and show that when data fits in DRAM,
PyMM offers competitive runtimes. When data does not fit in DRAM, our PyMM implementation is still able to process the
data.
Large Scale String Analytics In Arkouda
ZHIHUI DU (New Jersey Institute of Technology)*; Oliver A Alvarado Rodriguez (New Jersey Institute of Technology); David
Bader (New Jersey Institute of Technology)
Large scale data sets from the web, social networks, and bioinformatics are widely available and can often be represented
as strings, and suffix arrays are highly efficient data structures for string analysis. However, our personal devices and
corresponding exploratory data analysis (EDA) tools cannot handle big data sets beyond the local memory. Arkouda is a
framework under early development that brings together the productivity of Python at the user side with the high
performance of Chapel at the server side. In this paper, an efficient suffix array data structure design and integration
method are given first. A suffix array algorithm library integration method instead of one single suffix algorithm is presented
to enable runtime performance optimization in Arkouda since different suffix array algorithms may have very different
practical performances for strings in various applications. A parallel suffix array construction algorithm framework is given to
further exploit hierarchical parallelism on multiple locales in Chapel. A corresponding benchmark is developed to evaluate
the feasibility of the provided suffix array integration method and measure the end-to-end performance. Experimental
results show that the proposed solution can provide data scientists an easy and efficient method to build suffix arrays with
high performance in Python. All our codes are open source and available from GitHub
(https://github.com/Bader-Research/arkouda/tree/string-suffix-array-functionality).
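For orientation, a suffix array is just the list of suffix start positions in lexicographic order; the naive construction below makes the data structure concrete (the library approach in the paper swaps in asymptotically faster algorithms chosen per workload):

```python
# Naive suffix-array construction: sort suffix start positions by the
# suffix each one begins. O(n^2 log n) worst case, fine for short
# strings; production algorithms (e.g. divsufsort-style) are far faster.
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

assert suffix_array("banana") == [5, 3, 1, 0, 4, 2]
```

Once built, the sorted order supports binary-search substring queries, which is what makes suffix arrays a workhorse for large-scale string analytics.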
Streaming Detection and Classification Performance of a POWER9 Edge Supercomputer
Wesley Brewer (HPCMP PET)*; Chris Geyer (Parsons Corporation); Dardo Kleiner (CIPS, Corp.); Connor Horne (Naval
Research Laboratory)
We present training and inference benchmarks on a POWER9 edge supercomputer called SCOUT using two
production-level artificial intelligence systems: the Video Processing Exploitation Framework (VPEF) and the Ordnance Threat Target
Automated Recognition (OTTAR) system. For training benchmarks, we use Horovod to train ResNet-50 on synthetic data
for a baseline benchmark, and then train Faster R-CNN on an available WASABI dataset using SCOUT's six-way V100
GPU nodes. For inference benchmarks, we use GStreamer to stream both synthetic and real Motion Imagery (MI) and then
use a MobileNetV2-based Single-Shot Detector with real and synthetic MI at four different resolutions (720p, 1080p,
1280p, and 2160p) while measuring average inference time. We also test reduced-precision inference performance using
TensorRT and distributed inference performance using the Inference Benchmark "iBench" framework. We compare our
results with equivalent work on x86_64 systems and provide suggestions on tuning for optimal performance. We find that
V100s work well for training and offline inferencing in batches, while T4 GPUs outperform V100 GPUs only in very specific
usage scenarios, such as streaming detection at reduced precision.
Scaling of Evolutionary Search of Algorithm Space to Speed-Up Scientific Image Understanding Workflows
Nicholas J Grabill (University of Michigan); Kai Pinckard (Reed College); Dirk Colbry (Michigan State University)*
Scientific image understanding is an integral part of many research workflows. Studies in everything from self-driving cars
to the makeup of cells rely on image understanding through computer vision techniques. Unfortunately, almost every new
scientific question requires a new algorithm to be developed by researchers. Exploring the possible algorithms (and their
parameters) to find one that fits a particular scientific problem requires time and a broad understanding of the many options
available. For this reason, many scientists resort to making measurements "by hand" instead of making the considerable
investment required to develop a tool that can automate their specialized workflow. This paper explores the scaling of a tool
(SEE-Segment) that searches the "algorithm space" of possible image segmentation algorithms and their parameters for
solutions to unique scientific problems. The goal is to have the researchers do this manual annotation of their images
("measure by hand") using a Graphical User Interface front end while the SEE-Segment program searches through the
algorithms and their parameters on the back end. In the best case, the SEE-Segment tool finds a good candidate algorithm
that can be used on the remaining images in the researcher's data-set. In the worst case, the researchers annotate their
images as they would without the tool. Unfortunately, the search space for different segmentation algorithms and their
parameters is quite large and non-linear. This research leverages the pleasantly parallel nature of Genetic Algorithms and
explores the use of large-scale computing to speed up the search both on a local HPC and on the cloud using Kubernetes.
2-S1: GraphBLAS BoF Special (17:30-19:30)
Organizer(s): Tim Mattson & Scott McMillan
Invited Talk: The GraphBLAS C++ Interface
Ben Brock (UC Berkeley)
Invited Talk: Binary Storage Formats
Erik Welsh (Anaconda)
Lightning Talks: Updates/news from the GraphBLAS implementors
2-S2: Remote Sensing for Disaster Relief Special (17:30-19:30)
Organizer(s): John Aldridge & Dan Dumanis & Andrew Weinert
Supercomputing Enabled Deployable Analytics for Disaster Response
Jeremy Kepner (MIT Lincoln Laboratory); Kaira M Samuel (MIT)*
First responders and other forward deployed essential workers can benefit from advanced analytics. Limited network
access and software security requirements prevent the usage of standard cloud based microservice analytic platforms
that are typically used in industry. One solution is to precompute a wide range of analytics as files that can be used with
standard preinstalled software that does not require network access or additional software and can run on a wide range
of legacy hardware. In response to the COVID-19 pandemic, this approach was tested for providing geo-spatial census
data to allow quick analysis of demographic data for better responding to emergencies. These data were processed using
the MIT SuperCloud to create several thousand Google Earth and Microsoft Excel files representative of many advanced
analytics. The fast mapping of census data using Google Earth and Microsoft Excel has the potential to give emergency
responders a powerful tool to improve emergency preparedness. Our approach displays relevant census data (total
population, population under 15, population over 65, median age) per census block, sorted by county, through a Microsoft
Excel spreadsheet (xlsx file) and Google Earth map (kml file). The spreadsheet interface includes features that allow
users to convert between different longitude and latitude coordinate units. For the Google Earth files, a variety of absolute
and relative colors maps of population density have been explored to provide an intuitive and meaningful interface. Using
several hundred cores on the MIT SuperCloud, new analytics can be generated in a few minutes.
Invited Talk: What Do We Want to Know About the Internet?
Dr. David Clark (MIT CSAIL)
Invited Talk: Bot-Match: Social Bot Detection with Semi-Supervised Recursive Nearest Neighbors Search
LTC Dr. David Beskow (Army Cyber)
Invited Talk: TBD
TBD
Invited Talk: TBD
Matthew Lengel
Invited Talk: TBD
Dr. Brian Vegetabile (RAND)
Invited Talk: Augmented Reality for Rapid Disaster Response
Capt Victor “Salsa” Lopez (USAF AI Accelerator)
Invited Talk: Multimodal Vision for Synthetic Aperture Radar
TSgt Armando Cabrera (USAF AI Accelerator)
2021 Abstract Book