2021
IEEE High Performance Extreme Computing
Virtual Conference
20 - 24 September 2021
Tuesday, September 21
2-V: Policy Spotlight Session (10:30-11:00)
Session Chair(s): Albert Reuther
Invited Talk: Public Investments In Science for the Economic Security of the United States
Prof. Jonathan Gruber (MIT Sloan School)
Spectral Graph Partitioning Using Geodesic Distance-based Projection
Yasunori Futamura (University of Tsukuba)*; Ryota Wakaki (University of Tsukuba); Tetsuya Sakurai (University of
Tsukuba)
The graph partitioning problem is one of the most fundamental combinatorial problems with a wide variety of
applications. In this paper, we propose an efficient approach to spectral graph partitioning, a method with a
solid theoretical foundation that is nevertheless less efficient in practice than standard multilevel partitioners. The proposed method
approximates the eigenvectors of a graph-Laplacian matrix using basis vectors derived by
a geodesic distance-based approach, instead of a standard linear algebraic approach. This geodesic distance-based
approach is the key to making our method competitive in comparison to the multilevel partitioners. The primary
building blocks of our method are the breadth-first search and the sparse matrix-vector product, which are the main
kernels intensively studied in the high-performance computing field. Based on a performance evaluation in the
shared-memory parallel setting, we demonstrate that our method is comparable to standard partitioners in terms of
both quality and speed. We also show that our method has the potential to facilitate reproducibility-aware parallel
partitioning.
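As a rough illustration of the standard linear-algebraic baseline that spectral partitioners build on (not the authors' geodesic method), spectral bisection can be done by approximating the Fiedler vector of the graph Laplacian; every name below is a hypothetical pure-Python stand-in:

```python
# Illustrative spectral bisection: approximate the Fiedler vector of a
# small graph's Laplacian by power iteration on shift*I - L (deflating
# the constant eigenvector), then split the vertices by sign.
import math

def laplacian(n, edges):
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1.0
        L[v][v] += 1.0
        L[u][v] -= 1.0
        L[v][u] -= 1.0
    return L

def spectral_bisect(n, edges, iters=2000):
    L = laplacian(n, edges)
    shift = 2.0 * max(L[i][i] for i in range(n))   # shift*I - L is PSD
    x = [math.sin(i + 1.0) for i in range(n)]      # arbitrary start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]                # deflate constant eigenvector
        y = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return [0 if xi < 0 else 1 for xi in x]

# two triangles joined by one bridge edge: the cut separates them
parts = spectral_bisect(6, [(0, 1), (1, 2), (0, 2),
                            (3, 4), (4, 5), (3, 5), (2, 3)])
assert parts[0] == parts[1] == parts[2] != parts[3] == parts[4] == parts[5]
```

The paper's contribution is to replace this linear-algebraic eigenvector approximation with basis vectors built from BFS-computed geodesic distances, which is what makes its kernels reduce to BFS and sparse matrix-vector products.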
Vertical, Temporal, and Horizontal Scaling of Hierarchical Hypersparse GraphBLAS Matrices
Jeremy Kepner (MIT Lincoln Laboratory)*; Timothy A Davis (Texas A&M University); Chansup Byun (MIT Lincoln
Laboratory); William Arcand (MIT); David Bestor (MIT); William Bergeron (MIT); Vijay Gadepally (MIT Lincoln
Laboratory); Michael Houle (MIT); Matthew Hubbell (MIT Lincoln Laboratory); Michael S Jones (MIT Lincoln
Laboratory); Anna Klein (MIT); Lauren Milechin (MIT); Julie S Mullen (MIT Lincoln Laboratory); Andrew Prout (MIT);
Albert Reuther (MIT Lincoln Laboratory); Antonio Rosa (MIT); Siddharth Samsi (MIT Lincoln Laboratory); Charles Yee
(MIT Lincoln Laboratory); Peter Michaleas (MIT Lincoln Laboratory)
Hypersparse matrices are a powerful enabler for a variety of network, health, finance, and social applications.
Hierarchical hypersparse GraphBLAS matrices enable rapid streaming updates while preserving algebraic analytic
power and convenience. In many contexts, the rate of these updates sets the bounds on performance. This paper
explores hierarchical hypersparse update performance on a variety of hardware with identical software
configurations. The high-level language bindings of the GraphBLAS readily enable performance experiments on
diverse hardware simultaneously. The best single process performance measured was 20,000,000 updates per
second. The best single node performance measured was 170,000,000 updates per second. The hardware used
spans nearly a decade and allows a direct comparison of hardware improvements for this computation over this time
range, showing a 2x increase in single-core performance, a 3x increase in single process performance, and a 5x
increase in single node performance. Running on nearly two thousand MIT SuperCloud nodes simultaneously
achieved a sustained update rate of over 200,000,000,000 updates per second. Hierarchical hypersparse
GraphBLAS allows the MIT SuperCloud to analyze extremely large streaming network data sets.
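As a loose illustration of the hierarchical accumulation idea (an assumption about the scheme, not the GraphBLAS implementation), associative addition lets a small, fast layer absorb streaming updates and be add-merged into successively larger layers when it fills; all class and parameter names here are hypothetical:

```python
# Hierarchical hypersparse accumulation sketch: stream (row, col, value)
# updates into the smallest layer; when a layer exceeds its cutoff,
# add-merge it into the next larger layer. Because merging is additive,
# the summed matrix keeps standard plus-times semantics throughout.
class HierSparse:
    def __init__(self, cutoffs=(4, 16, 64)):
        self.cutoffs = cutoffs
        self.layers = [dict() for _ in cutoffs] + [dict()]  # last layer unbounded

    def update(self, row, col, val):
        layer = self.layers[0]
        layer[(row, col)] = layer.get((row, col), 0) + val
        for i, cut in enumerate(self.cutoffs):              # cascade merges upward
            if len(self.layers[i]) > cut:
                nxt = self.layers[i + 1]
                for k, v in self.layers[i].items():
                    nxt[k] = nxt.get(k, 0) + v
                self.layers[i] = {}

    def total(self):
        # collapse all layers into one matrix (dict of entries)
        out = {}
        for layer in self.layers:
            for k, v in layer.items():
                out[k] = out.get(k, 0) + v
        return out

m = HierSparse()
for i in range(100):
    m.update(i % 7, i % 5, 1)
assert sum(m.total().values()) == 100   # no update is lost in the cascade
```

Most updates touch only the tiny first layer, which is why the update rate stays high as the accumulated matrix grows.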
Delayed Asynchronous Iterative Graph Algorithms
Mark P Blanco (Carnegie Mellon University)*; Tze Meng Low (Carnegie Mellon University); Scott McMillan (CMU
Software Engineering Institute)
Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated
output values are used as inputs for computations in current or subsequent iterations; hence the number of iterations
required for values to converge can potentially be reduced if the newest values are asynchronously made available to
other updates computed in the same iteration. In a multi-threaded shared memory system, the immediate
propagation of updated values can cause memory contention that may offset the benefit of propagating updates
sooner. In some cases, the benefit of a smaller number of iterations may be diminished by each iteration taking
longer. Our key idea is to combine the low memory contention that synchronous approaches have with the faster
information sharing of asynchronous approaches. Our hybrid approach buffers updates from threads locally before
committing them to the global store to control how often threads may cause conflicts for others while still sharing
data within one iteration and hence speeding convergence. On a 112-thread CPU system, our hybrid approach
attains a 4.5%-19.4% speedup over an asynchronous approach for PageRank and a 1.9%-17% speedup
over asynchronous Bellman-Ford SSSP. Further, our hybrid approach attains 2.56x better performance than the
synchronous approach. Finally, we provide insights as to why delaying updates is not helpful on certain graphs where
connectivity is clustered on the main diagonal of the adjacency matrix.
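Read as pseudocode, the buffering idea can be simulated sequentially on Bellman-Ford SSSP: relax edges into a small local buffer and only min-merge it into the shared distance array every few updates, trading contention against freshness. This is a hypothetical single-threaded sketch, not the paper's multi-threaded code:

```python
# Buffered-update Bellman-Ford sketch: a worker keeps its freshest
# tentative distances in a local buffer and commits (min-merges) to the
# shared array only every `flush` buffered entries.
import math

def buffered_sssp(n, edges, src, flush=4):
    dist = [math.inf] * n
    dist[src] = 0.0
    changed = True
    while changed:
        changed = False
        buf = {}
        def commit():
            nonlocal changed
            for v, d in buf.items():
                if d < dist[v]:
                    dist[v] = d
                    changed = True
            buf.clear()
        for u, v, w in edges:
            base = min(dist[u], buf.get(u, math.inf))   # newest local value
            if base + w < min(dist[v], buf.get(v, math.inf)):
                buf[v] = base + w
                if len(buf) >= flush:
                    commit()                            # delayed commit
        commit()
    return dist

edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0), (2, 3, 1.0)]
assert buffered_sssp(4, edges, 0) == [0.0, 1.0, 2.0, 3.0]
```

A larger `flush` means fewer writes to the shared store (less contention) but staler values for other workers, which is exactly the trade-off the hybrid approach tunes.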
Inverse-Deletion BFS - Revisiting Static Graph BFS Traversals with Dynamic Graph Operations
Oded Green (NVIDIA)*
Breadth-First Search (BFS) traversals appear in a wide range of applications and domains. BFS traversals determine
the distance between key vertices and the remaining vertices in the network. This distance, often
called the number of hops, is the length of the shortest path between a root and each remaining vertex in the graph. Given its
applicability across multiple domains, BFS has received significant attention from theoretical and applied
communities, including many algorithms for parallelizing BFS. BFS traversals are typically conducted on a graph
snapshot (commonly referred to as a static graph). In this paper, we show a novel algorithm for executing a BFS
traversal. While our algorithm also uses a static graph for its input, we show how to perform a BFS traversal using
dynamic graph operations. Specifically, in each BFS level, we remove the subset of the edges that are unnecessary
for the next phase. These edges do not impact finding the vertices in the next level and reduce random memory
accesses. We show a top-down BFS variation of our new algorithm. While our implementation does not outperform
state-of-the-art implementations, its performance is competitive. Furthermore, it shows a novel way to implement BFS
and opens up many research opportunities.
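A minimal sketch of the level-by-level deletion idea, assuming a mutable adjacency-set representation (illustrative only, not the paper's implementation):

```python
# Inverse-deletion-style BFS sketch: after settling each level, delete
# edges that point into already-visited vertices; they cannot contribute
# to any later level, so later levels scan fewer edges.
def bfs_with_deletion(adj, root):
    # adj: dict vertex -> set of neighbors (mutated by deletions)
    level = {root: 0}
    frontier = [root]
    depth = 0
    while frontier:
        depth += 1
        nxt = []
        for u in frontier:
            for v in list(adj[u]):
                if v not in level:
                    level[v] = depth
                    nxt.append(v)
        # dynamic-graph step: drop edges into the visited set
        for u in adj:
            adj[u].difference_update(level)
        frontier = nxt
    return level

adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}
assert bfs_with_deletion(adj, 0) == {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

The deletions shrink the edge set as the traversal advances, which is where the reduction in random memory accesses comes from.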
Invited Talk: The New Internet: From Communication Medium to Transaction Medium
Prof. Sandy Pentland (MIT Media Lab)
2-1: Graph Analytics & Network Science 1 Session (11:00-12:15)
Session Co-Chairs: Sadas Shankar & TBD
2-2: Graph Analytics & Network Science 2 Session (12:30-13:45)
Session Co-Chairs: Sadas Shankar & Nikos Pitsianis
The GraphBLAS in Julia and Python: the PageRank and Triangle Centralities
Michel Pelletier (Graphegon)*; William R Kimmerer (Massachusetts Institute of Technology); Timothy A Davis (Texas A&M
University); Timothy Mattson (Intel)
The GraphBLAS is a standard API for expressing graph algorithms in the language of linear algebra. The goal is to provide high
performance while exploiting the fundamental simplicity of graph algorithms in terms of a common set of “Basic Linear
Algebra Subprograms”. A robust parallel implementation of the GraphBLAS C specification is available as the
SuiteSparse:GraphBLAS library. The simplicity of the GraphBLAS, so apparent “in the math”, is diminished when
expressed in terms of a low level language such as C. To see the full expressive power of the GraphBLAS, a high level
interface is needed so that the elegance of the mathematical underpinnings of the GraphBLAS is clearly apparent in the
code. In this paper we introduce the Julia interface to the SuiteSparse:GraphBLAS library and compare it to the Python
interface [2]. We implement the PageRank and Triangle Centrality algorithms with remarkably little code and show
that no significant performance is sacrificed by moving from C to the more productive Python and Julia interfaces.
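To give a flavor of the formulation those interfaces express, PageRank is a damped sparse matrix-vector iteration; the following is a hypothetical pure-Python sketch (the GraphBLAS versions replace the explicit loops with a single masked matrix-vector product per iteration):

```python
# PageRank as repeated sparse matrix-vector products: each iteration
# redistributes rank along out-edges with damping, plus a uniform
# teleport term; dangling vertices spread their rank evenly.
def pagerank(out_edges, n, damping=0.85, iters=100):
    rank = [1.0 / n] * n
    out_deg = [len(out_edges.get(u, [])) for u in range(n)]
    for _ in range(iters):
        nxt = [(1.0 - damping) / n] * n
        for u in range(n):
            if out_deg[u]:
                share = damping * rank[u] / out_deg[u]
                for v in out_edges[u]:
                    nxt[v] += share
            else:                      # dangling vertex
                for v in range(n):
                    nxt[v] += damping * rank[u] / n
        rank = nxt
    return rank

# small cycle plus a chord: vertex 0 has the most in-links
r = pagerank({0: [1], 1: [2], 2: [0], 3: [0]}, 4)
assert abs(sum(r) - 1.0) < 1e-9     # rank mass is conserved
assert r.index(max(r)) == 0
```

In a GraphBLAS binding the whole inner loop becomes roughly `rank = damping * (A_T @ rank_scaled) + teleport`, which is the conciseness the paper demonstrates in Julia and Python.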
HyPC-Map: A Hybrid Parallel Community Detection Algorithm Using Information-Theoretic Approach
Md Abdul Motaleb Faysal (University of New Orleans)*; Shaikh M. Arifuzzaman (University of New Orleans); Cy Chan
(Berkeley Lab); Maximilian Bremer (Berkeley Lab); Doru Thom Popovici (Lawrence Berkeley National Laboratory); John Shalf
(Lawrence Berkeley National Laboratory)
Community detection has become an important graph analysis kernel due to the tremendous growth of social networks and
genomics discoveries. Even though there exist a large number of algorithms in the literature, studies show that community
detection based on an information-theoretic approach (known as Infomap) delivers better quality solutions than others. Being
inherently sequential, the Infomap algorithm does not scale well for large networks. In this work, we develop a hybrid parallel
approach for community detection in graphs using Information Theory. We perform extensive benchmarking and analyze
hardware parameters to identify and address performance bottlenecks. Additionally, we use cache-optimized data structures
to improve cache locality. All of these optimizations lead to an efficient and scalable community detection algorithm, HyPC-
Map, which demonstrates a 25-fold speedup (much higher than the state-of-the-art map-based techniques) without sacrificing
the quality of the solution.
Extended Abstract: K-Truss Implementation in Arkouda
Joseph T Patchett (New Jersey Institute of Technology)*; ZHIHUI DU (New Jersey Institute of Technology); David Bader
(New Jersey Institute of Technology)
We present a k-truss implementation in Arkouda, an open-source data science framework for graph analytics.
A GraphBLAS implementation of Triangle Centrality
Fuhuan Li (New Jersey Institute of Technology); David Bader (New Jersey Institute of Technology)*
Identifying key members in large social network graphs is an important graph analytic. Recently, a new centrality measure
called triangle centrality was proposed that finds members based on the triangle support of vertices in a graph. In this paper, we describe
our rapid implementation of triangle centrality using GraphBLAS, an API specification for describing graph algorithms in
the language of linear algebra. We use triangle centrality’s algebraic algorithm and easily implement it using the
SuiteSparse GraphBLAS library. A set of experiments on large, sparse graph datasets is conducted to verify the
implementation.
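The algebraic core of triangle-based measures is the masked product A .* (A @ A), which counts, for each edge, the common neighbors of its endpoints. Below is a pure-Python sketch of the per-vertex triangle counts that triangle centrality builds on (illustrative, not the authors' code):

```python
# Per-vertex triangle counts via the entrywise-masked product A .* (A @ A):
# the (u, v) entry of A @ A restricted to existing edges is the number of
# common neighbors of u and v, i.e. triangles through edge (u, v).
def triangle_counts(adj):
    # adj: dict vertex -> set of neighbors (undirected, no self loops)
    tri = {v: 0 for v in adj}
    for u in adj:
        for v in adj[u]:
            tri[u] += len(adj[u] & adj[v])   # common neighbors of edge (u, v)
    return {v: t // 2 for v, t in tri.items()}  # each triangle counted twice per vertex

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
assert triangle_counts(adj) == {0: 1, 1: 2, 2: 2, 3: 1}
```

In SuiteSparse:GraphBLAS the double loop collapses to one masked matrix multiply over the plus-times semiring, which is why the paper's implementation is so short.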
2-3: Graph Analytics & Network Science 3 Session (14:15-15:30)
Session Co-Chairs: Scott McMillan & Nikos Pitsianis
Enabling Exploratory Large Scale Graph Analytics through Arkouda
ZHIHUI DU (New Jersey Institute of Technology); Oliver A Alvarado Rodriguez (New Jersey Institute of Technology)*;
David Bader (New Jersey Institute of Technology)
Exploratory graph analytics helps maximize the informational value from a graph. However, the increasing graph size
makes it impossible for existing popular exploratory data analysis tools to handle dozens-of-terabytes or even larger data
sets in the memory of a common laptop/personal computer. Arkouda is a framework under early development that brings
together the productivity of Python at the user side with the high performance of Chapel at the server side. In this paper,
we present preliminary work on overcoming the memory limit and the HPC coding roadblock that keep high-level
Python users from performing large graph analysis. A simple and succinct graph data structure design and implementation
at both the Python front-end and the Chapel back-end in the Arkouda framework are provided. A typical graph algorithm,
Breadth-First Search (BFS), is used to show how we can use Chapel to develop high performance parallel graph algorithms
productively. Two Chapel-based parallel Breadth-First Search (BFS) algorithms, one high level version and one
corresponding low level version, have been implemented in Arkouda to support analyzing large graphs. Multiple graph
benchmarks are used to evaluate the performance of the provided graph algorithms. Experimental results show that we
can optimize the performance by tuning the selection of different Chapel high level data structures and parallel constructs.
Our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda).
Digraph Clustering by the BlueRed Method
Tiancheng Liu (Duke University)*; Dimitris Floros (Aristotle University of Thessaloniki); Nikos P Pitsianis (Aristotle
University of Thessaloniki); Xiaobai Sun (Duke University)
We introduce a new method for vertex clustering or community detection on directed graphs (digraphs). The new method
is an extension of the BlueRed method introduced initially for undirected graphs. Complementary to supervised or
semisupervised classification, unsupervised graph clustering is indispensable to exploratory data analysis and knowledge
discovery. Conventional graph clustering methods are fundamentally hindered in effectiveness and efficiency by either the
resolution limit or various problems with resolution parameter selection. BlueRed is original in its analysis, modeling, and
solution approach. Its clustering process is simple, fully autonomous and unsupervised. Among other potential impacts,
BlueRed breaks new ground for high-throughput, low-cost and high-performance graph clustering computation, as it has
removed the barrier of parameter tuning/selection. We report benchmark studies with real-world graph data for evaluating
the new method. The clustering results are in remarkable agreement with the ground truth labels. We also present an
important study on the U.S. patent citation graph CITE75_99. More than a quarter of 3.7 million patents have no electronic
records of category codes. With BlueRed, we are able to efficiently and economically give a semantic presentation of the
patents without category codes.
An Efficient Algorithm for the Construction of Dynamically Updating Trajectory Networks
Deniz Gurevin (University of Connecticut)*; Chris Michael (Naval Research Laboratory); Omer Khan (University of
Connecticut)
Trajectory based spatiotemporal networks (STN) are useful in a wide range of applications, such as crowd behavior
analysis. A significant portion of trajectory-network research focuses on optimizing the analysis of STNs to
characterize, control, and predict network behavior. However, these mining algorithms are typically carried out on a
preconstructed network structure that tracks all moving objects and their trajectories in real time. The construction of such a
trajectory network is itself a computationally expensive task and it is becoming a bigger burden with advancements in
analysis algorithms. The traditional approach is to construct static networks from the temporal snapshots of trajectory data
that cannot handle spatiotemporally changing data. This paper proposes an efficient algorithm that successfully generates
and maintains an ST network from raw trajectory data. The proposed method is based on a customized R-Tree based
constructor to keep track of object trajectories and their interactions. It avoids redundant updates as trajectories evolve
over time, resulting in a significant reduction in STN construction time. Based on the experiments we conducted, our
method reduces the number of updates by 71.25% as compared to the static naive STN construction method.
Graph Embedding and Field Based Detection of Non-Local Webs in Large Scale Free Networks
Michael E Franusich (Carnegie Mellon University)*; Franz Franchetti (CMU)
Scalable graph algorithms are usually discrete and heuristics-based and their performance may strongly depend on the
topology and size of the targeted graph. In this paper we investigate an alternative approach where we translate graph
problems into a continuous-space problem: we embed graphs into a higher-dimensional Euclidean space using an
N-body approach. Then we define a synthetic field as a function of the Euclidean-space vertex locations and find peaks in the
field. These peaks are used to identify regions and thus vertices of interest, and are an emergent property. We test the
approach to identify a low-degree web that is connecting otherwise unrelated vertices in a power law graph. We implement
the algorithm in CUDA and show good scalability to graphs with 7 million nodes.
2-4: Data Intensive Computing Session (15:45-17:00)
Session Co-Chairs: Seung Woo Son & Tim Mattson
Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA
Sasindu Wijeratne (University of Southern California)*; Rajgopal Kannan (USC); Viktor K Prasanna (University of
Southern California)
Tensor decomposition has become an essential tool in many applications in various domains, including machine learning.
Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is one of the most computationally expensive kernels in
tensor computations. Despite having significant computational parallelism, MTTKRP is a challenging kernel to optimize due
to its irregular memory access characteristics. This paper focuses on a multi-faceted memory system, which explores the
spatial and temporal locality of the data structures of MTTKRP. Further, users can reconfigure our design depending on the
behavior of the compute units used in the FPGA accelerator. Our system efficiently accesses all the MTTKRP data
structures while reducing the total memory access time, using a distributed cache and Direct Memory Access (DMA)
subsystem. Moreover, our work improves the memory access time by 3.5x compared with commercial memory controller
IPs. Also, our system shows 2x and 1.26x speedups compared with cache-only and DMA-only memory systems,
respectively.
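For reference, the MTTKRP kernel itself is simple to state; the difficulty the paper targets comes from the scattered row accesses it generates. A hypothetical pure-Python sketch for a 3-way sparse tensor in COO form:

```python
# MTTKRP sketch (mode-0) for a 3-way sparse tensor: for each nonzero
# X[i,j,k], accumulate X[i,j,k] * (B[j,:] .* C[k,:]) into row i of the
# output M. The row reads of B and C and the row updates of M are the
# irregular memory accesses a custom memory system must serve.
def mttkrp(coo, B, C, n_i, rank):
    # coo: list of (i, j, k, val); B and C: lists of length-`rank` rows
    M = [[0.0] * rank for _ in range(n_i)]
    for i, j, k, val in coo:
        for r in range(rank):
            M[i][r] += val * B[j][r] * C[k][r]
    return M

coo = [(0, 0, 0, 2.0), (0, 1, 1, 3.0), (1, 0, 1, 1.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
M = mttkrp(coo, B, C, 2, 2)
assert M == [[73.0, 120.0], [7.0, 16.0]]
```

Because consecutive nonzeros can land on arbitrary rows of B, C, and M, locality depends entirely on the tensor's sparsity pattern, which is what motivates a reconfigurable cache/DMA memory system.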
Non-Volatile Memory Accelerated Geometric Multi-Scale Resolution Analysis
Andrew E Wood (Boston University)*; Moshik Hershcovitch (IBM Research); Daniel G Waddington (IBM Research); Sarel
Cohen (Hasso Plattner Institute); Meredith G Wolf (Williams College); Hongjun Suh (Seoul National University); Weiyu Zong
(Boston University); Sang Chin (Boston University)
Dimensionality reduction algorithms are standard tools in a researcher's toolbox. They are frequently used to augment
downstream tasks such as machine learning and data science, and also serve as exploratory methods for understanding
complex phenomena. For instance, dimensionality reduction is commonly used in Biology as well as
Neuroscience to understand data collected from biological subjects. However, dimensionality reduction techniques are
limited by the von Neumann architectures on which they execute. Specifically, data-intensive algorithms such as
dimensionality reduction often require fast, high-capacity, persistent memory, which hardware has historically been
unable to provide all at once. In this paper, we present a re-implementation of an existing dimensionality
reduction technique called Geometric Multi-Scale Resolution Analysis (GMRA) which has been accelerated via novel
persistent memory technology called Memory Centric Active Storage (MCAS). Our implementation uses a specialized
version of MCAS called PyMM that provides native support for Python datatypes including NumPy arrays and PyTorch
tensors. We compare our PyMM implementation against a DRAM implementation, and show that when data fits in DRAM,
PyMM offers competitive runtimes. When data does not fit in DRAM, our PyMM implementation is still able to process the
data.
Large Scale String Analytics In Arkouda
ZHIHUI DU (New Jersey Institute of Technology)*; Oliver A Alvarado Rodriguez (New Jersey Institute of Technology); David
Bader (New Jersey Institute of Technology)
Large scale data sets from the web, social networks, and bioinformatics are widely available and can often be represented
as strings, and suffix arrays are highly efficient data structures for string analysis. However, our personal devices and
corresponding exploratory data analysis (EDA) tools cannot handle big data sets beyond the local memory. Arkouda is a
framework under early development that brings together the productivity of Python at the user side with the high
performance of Chapel at the server side. In this paper, an efficient suffix array data structure design and integration
method are given first. A suffix array algorithm library integration method instead of one single suffix algorithm is presented
to enable runtime performance optimization in Arkouda since different suffix array algorithms may have very different
practical performances for strings in various applications. A parallel suffix array construction algorithm framework is given to
further exploit hierarchical parallelism on multiple locales in Chapel. A corresponding benchmark is developed to evaluate
the feasibility of the provided suffix array integration method and measure the end-to-end performance. Experimental
results show that the proposed solution can provide data scientists an easy and efficient method to build suffix arrays with
high performance in Python. All our codes are open source and available from GitHub
(https://github.com/Bader-Research/arkouda/tree/string-suffix-array-functionality).
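For orientation, a suffix array is just the list of suffix start positions in lexicographic order; the naive construction below makes the data structure concrete (the library approach in the paper swaps in asymptotically faster algorithms chosen per workload):

```python
# Naive suffix-array construction: sort suffix start positions by the
# suffix each one begins. O(n^2 log n) worst case, fine for short
# strings; production algorithms (e.g. divsufsort-style) are far faster.
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

assert suffix_array("banana") == [5, 3, 1, 0, 4, 2]
```

Once built, the sorted order supports binary-search substring queries, which is what makes suffix arrays a workhorse for large-scale string analytics.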
Streaming Detection and Classification Performance of a POWER9 Edge Supercomputer
Wesley Brewer (HPCMP PET)*; Chris Geyer (Parsons Corporation); Dardo Kleiner (CIPS, Corp.); Connor Horne (Naval
Research Laboratory)
We present training and inference benchmarks on a POWER9 edge supercomputer called SCOUT using two
production-level artificial intelligence systems: the Video Processing Exploitation Framework (VPEF) and the Ordnance Threat Target
Automated Recognition (OTTAR) system. For training benchmarks, we use Horovod to train ResNet-50 on synthetic data
for a baseline benchmark, and then train Faster R-CNN on an available WASABI dataset using SCOUT's six-way V100
GPU nodes. For inference benchmarks, we use GStreamer to stream both synthetic and real Motion Imagery (MI) and then
use a MobileNetV2-based Single-Shot Detector with real and synthetic MI at four different resolutions (720p, 1080p,
1280p, and 2160p) while measuring average inference time. We also test reduced-precision inference performance using
TensorRT and distributed inference performance using the Inference Benchmark "iBench" framework. We compare our
results with equivalent work on x86_64 systems and provide suggestions on tuning for optimal performance. We find that
V100s work well for training and offline inferencing in batches, while T4 GPUs outperform V100 GPUs only in very specific
usage scenarios, such as streaming detection at reduced precision.
Scaling of Evolutionary Search of Algorithm Space to Speed-Up Scientific Image Understanding Workflows
Nicholas J Grabill (University of Michigan); Kai Pinckard (Reed College); Dirk Colbry (Michigan State University)*
Scientific image understanding is an integral part of many research workflows. Studies in everything from self-driving cars
to the makeup of cells rely on image understanding through computer vision techniques. Unfortunately, almost every new
scientific question requires a new algorithm to be developed by researchers. Exploring the possible algorithms (and their
parameters) to find one that fits a particular scientific problem requires time and a broad understanding of the many options
available. For this reason, many scientists resort to making measurements "by hand" instead of making the considerable
investment required to develop a tool that can automate their specialized workflow. This paper explores the scaling of a tool
(SEE-Segment) that searches the "algorithm space" of possible image segmentation algorithms and their parameters for
solutions to unique scientific problems. The goal is to have the researchers do this manual annotation of their images
("measure by hand") using a Graphical User Interface front end while the SEE-Segment program searches through the
algorithms and their parameters on the back end. In the best case, the SEE-Segment tool finds a good candidate algorithm
that can be used on the remaining images in the researcher's data-set. In the worst case, the researchers annotate their
images as they would without the tool. Unfortunately, the search space for different segmentation algorithms and their
parameters is quite large and non-linear. This research leverages the pleasantly parallel nature of Genetic Algorithms and
explores the use of large-scale computing to speed up the search both on a local HPC and on the cloud using Kubernetes.
2-S1: GraphBLAS BoF Special (17:30-19:30)
Organizer(s): Tim Mattson & Scott McMillan
Invited Talk: The GraphBLAS C++ Interface
Ben Brock (UC Berkeley)
Invited Talk: Binary Storage Formats
Erik Welsh (Anaconda)
Lightning Talks: Updates/news from the GraphBLAS implementors
2-S2: Remote Sensing for Disaster Relief Special (17:30-19:30)
Organizer(s): John Aldridge & Dan Dumanis & Andrew Weinert
Supercomputing Enabled Deployable Analytics for Disaster Response
Jeremy Kepner (MIT Lincoln Laboratory); Kaira M Samuel (MIT)*
First responders and other forward deployed essential workers can benefit from advanced analytics. Limited network
access and software security requirements prevent the usage of standard cloud based microservice analytic platforms
that are typically used in industry. One solution is to precompute a wide range of analytics as files that can be used with
standard preinstalled software that does not require network access or additional software and can run on a wide range
of legacy hardware. In response to the COVID-19 pandemic, this approach was tested for providing geo-spatial census
data to allow quick analysis of demographic data for better responding to emergencies. These data were processed using
the MIT SuperCloud to create several thousand Google Earth and Microsoft Excel files representative of many advanced
analytics. The fast mapping of census data using Google Earth and Microsoft Excel has the potential to give emergency
responders a powerful tool to improve emergency preparedness. Our approach displays relevant census data (total
population, population under 15, population over 65, median age) per census block, sorted by county, through a Microsoft
Excel spreadsheet (xlsx file) and Google Earth map (kml file). The spreadsheet interface includes features that allow
users to convert between different longitude and latitude coordinate units. For the Google Earth files, a variety of absolute
and relative colors maps of population density have been explored to provide an intuitive and meaningful interface. Using
several hundred cores on the MIT SuperCloud, new analytics can be generated in a few minutes.
Invited Talk: What Do We Want to Know About the Internet?
Dr. David Clark (MIT CSAIL)
Invited Talk: Bot-Match: Social Bot Detection with Semi-Supervised Recursive Nearest Neighbors Search
LTC Dr. David Beskow (Army Cyber)
Invited Talk: TBD
TBD
Invited Talk: TBD
Matthew Lengel
Invited Talk: TBD
Dr. Brian Vegetabile (RAND)
Invited Talk: Augmented Reality for Rapid Disaster Response
Capt Victor “Salsa” Lopez (USAF AI Accelerator)
Invited Talk: Multimodal Vision for Synthetic Aperture Radar
TSgt Armando Cabrera (USAF AI Accelerator)
2021 Abstract Book