2021
IEEE High Performance Extreme Computing
Virtual Conference
20 - 24 September 2021
4-S1: Graph Challenge Special (17:30-19:30)
Organizer(s): Jeremy Kepner
Fast Sparse Deep Neural Network Inference with Flexible SpMM Optimization Space Exploration
Jie Xin (Huazhong University of Science and Technology); Xianqi Ye (Huazhong University of Science and Technology);
Long Zheng (Huazhong University of Science and Technology)*; Qinggang Wang (Huazhong University of Science and
Technology); Yu Huang (Huazhong University of Science and Technology); Pengcheng Yao (Huazhong University of
Science and Technology); Linchen Yu (Huazhong University of Science and Technology); Xiaofei Liao (HUST); Hai Jin
(Huazhong University of Science and Technology)
Deep neural networks (DNNs) have been widely used in many fields. With ever-increasing model sizes, DNN
scalability suffers. Sparse deep neural networks (SpDNN) are promising for resolving this problem, but the sparse data
makes it difficult to execute efficiently on GPUs due to load imbalance and irregular memory accesses. The recent
MIT/IEEE/Amazon GraphChallenge has produced several significant advances in fitting sparse DNNs onto GPUs, but we
observe that none of these earlier efforts is an absolute winner across all datasets, due to their limited exploration of the
optimization space. In this paper, we identify new opportunities for optimizing the execution of SpDNN via a
comprehensive analysis of previous works. Based on this new large design space of SpDNN, we present sparsity-aware
SpMM algorithms that can systematically explore a performance-optimal solution of SpDNN execution on GPU, and further
generate optimized SpMM kernel implementations. Compared to the 2020 HPEC Sparse DNN Challenge champions, our
approach achieves up to 55.6 TeraEdges per second inference throughput, with speedups of up to 13.74x [1] and
22.29x [2] on a single NVIDIA V100 GPU. We also show that our approach on 4 GPUs can even be superior to the
2020 Challenge champion using 768 GPUs in many cases. The source code is available at
https://github.com/CGCL-codes/Graphchallenge21.
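At the core of each SpDNN layer is one SpMM followed by a bias addition and a capped ReLU. The following is a minimal single-machine sketch of that layer chain, with scipy.sparse standing in for the paper's tuned GPU kernels; the bias and cap values are assumptions modeled on the Sparse DNN Challenge reference, not taken from this paper:

```python
# Sketch of sparse DNN inference as a chain of SpMM calls.
# scipy.sparse stands in for a tuned GPU kernel; bias/cap values
# are assumptions (Challenge-style defaults), not from the paper.
import numpy as np
import scipy.sparse as sp

def spdnn_infer(Y0, weights, bias=-0.3, cap=32.0):
    """Apply ReLU(Y @ W + bias), capped at `cap`, per sparse layer W."""
    Y = Y0
    for W in weights:
        Z = Y @ W                      # the SpMM at the heart of each layer
        Z.data += bias                 # bias only affects stored nonzeros
        Z.data = np.minimum(np.maximum(Z.data, 0.0), cap)  # capped ReLU
        Z.eliminate_zeros()            # keep the activations sparse
        Y = Z
    return Y
```

The optimization space the paper explores (tiling, thread mapping, sparsity-aware kernel selection) lives entirely inside the `Y @ W` step, which this sketch delegates to scipy.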
Faster Stochastic Block Partition using Aggressive Initial Merging, Compressed Representation, and Parallelism
Control
Ahsen J Uppal (The George Washington University)*; H. Howie Huang (The George Washington University); Jaeseok Choi
(The George Washington University); Thomas Rolinger (Laboratory for Physical Sciences)
The community detection problem continues to be challenging, particularly for large graph data. Although optimal graph
partitioning is NP-hard, stochastic methods, such as in the IEEE HPEC GraphChallenge, can provide good approximate
solutions in reasonable time. But the scalability of such solutions with increasing graph size remains a challenge. In this
work, we describe three new techniques to speed up the stochastic block partition algorithm. The first technique relies on
reducing the initial number of communities via aggressive agglomerative merging (a portion of the algorithm with high
parallel scalability) to quickly reduce the amount of data that must be processed, resulting in an independent speedup of
1.85x for a 200k node graph. Our second technique uses a novel compressed data structure to store the main
bookkeeping information of the algorithm. Our compressed representation allows the processing of large graphs that
would otherwise be impossible due to memory constraints, and has a speedup of up to 1.19x over our uncompressed
baseline representation. The third technique carefully manages the amount of parallelism during different phases of the
algorithm. Compared to our best baseline configuration using a fixed number of threads, this technique yields an
independent speedup of 2.26x for a 200k node graph. Combined, our techniques result in speedups of 3.78x for a
50k node graph, 4.71x for a 100k node graph, and 5.13x for a 200k node graph over our previous best parallel algorithm.
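The "main bookkeeping information" of stochastic block partition is, in essence, the block-to-block edge-count matrix. A hypothetical sketch of keeping it sparse (a dict per row; this is an illustration of the idea, not the authors' compressed data structure) so that memory tracks the number of connected block pairs rather than B²:

```python
# Hypothetical sketch: sparse bookkeeping for the B x B inter-block
# edge-count matrix used by stochastic block partition. Each row is a
# dict storing only nonzero block pairs. This illustrates the idea of
# a compressed representation; it is not the paper's data structure.
from collections import defaultdict

class BlockEdgeCounts:
    def __init__(self, num_blocks):
        self.rows = [defaultdict(int) for _ in range(num_blocks)]
        self.row_sum = [0] * num_blocks   # total out-edges per block
        self.col_sum = [0] * num_blocks   # total in-edges per block

    def add(self, r, s, count=1):
        """Record `count` edges from block r to block s."""
        self.rows[r][s] += count
        self.row_sum[r] += count
        self.col_sum[s] += count

    def move_vertex(self, old_block, new_block, out_nbrs, in_nbrs):
        """Update counts when a vertex moves between blocks;
        *_nbrs map neighbor blocks to edge multiplicities."""
        for s, c in out_nbrs.items():
            self.add(old_block, s, -c)
            self.add(new_block, s, c)
        for r, c in in_nbrs.items():
            self.add(r, old_block, -c)
            self.add(r, new_block, c)
```

Vertex moves, the inner loop of the algorithm, touch only the rows and columns of the two blocks involved, which is what makes a per-row sparse layout practical.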
Towards Distributed Square Counting in Large Graphs
Trevor Steil (Lawrence Livermore National Laboratory)*; Geoffrey Sanders (LLNL); Roger Pearce (Lawrence Livermore
National Laboratory)
Counting all squares (4-cycles) in a large graph is an expensive but important graph computation for network scientists to
understand the higher-order structures within their large relational datasets. This is particularly true for bipartite graphs,
where 4-cycles are fundamental (as one of the smallest motifs not enumerable locally at vertices or edges) and highly
useful for measuring notions of connectivity and clustering. We build a distributed and asynchronous approach to counting
squares and demonstrate its ability to perform this task on large real-world datasets. We build on several techniques from
previous approaches, most notably degree-based vertex ordering. We note that a
bipartite graph with one vertex set significantly smaller than the other is a special and common case, and we describe the
benefits of our approach applied to this case. Although our approach is able to count hundreds of trillions of squares in a
massive distributed graph, we report poor scaling, setting the stage for much algorithmic improvement by the
GraphChallenge community. We discuss the bottlenecks within our approach that must be mitigated in order to achieve a
highly performant and scalable method.
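As a point of reference for the task itself: squares in a bipartite graph follow from common-neighbor counts, since every pair of common neighbors of two same-side vertices closes one 4-cycle. A minimal single-node sketch (quadratic in the chosen side, so purely illustrative; it is not the distributed asynchronous method of the paper):

```python
# Illustrative single-node square (4-cycle) counting in a bipartite
# graph: for each pair of vertices on one side, every pair of common
# neighbors closes one square, so the total is
#   sum over pairs (u, v) of C(|N(u) & N(v)|, 2).
# Not the paper's distributed method; a baseline for the definition.
from itertools import combinations

def count_squares_bipartite(adj_left):
    """adj_left maps each left-side vertex to its set of right-side
    neighbors. Iterating the smaller side mirrors the special case
    the abstract highlights."""
    total = 0
    for u, v in combinations(adj_left, 2):
        k = len(adj_left[u] & adj_left[v])
        total += k * (k - 1) // 2       # squares through this pair
    return total
```

The pair loop is exactly what distributed approaches must partition and reorder (e.g., by degree) to avoid quadratic blowup on skewed real-world graphs.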
Sparse Deep Neural Network Acceleration on HBM-Enabled FPGA Platform
Abhishek K Jain (Xilinx)*; Sharan Kumar (Xilinx); Aashish Tripathi (Xilinx); Dinesh Gaitonde (Xilinx)
Domain-specific architectures (DSAs) for deep neural networks (DNNs) are becoming mainstream because of high energy-
efficiency and performance. Currently, most of these DSAs perform dense linear algebra efficiently by exploiting temporal
and spatial locality, high data reuse, and regular memory access patterns. However, state-of-the-art DNN models require
DSAs with very large amounts of compute, interconnect, and memory resources. Since sparsity in DNN parameters can be
exploited to reduce the resources and complexity of the DSAs, emerging DSAs are adding native support for sparsity. This
paper presents a sparse DNN inference engine on HBM-enabled Alveo U280 platform, built around FPGA-optimized DSA
blocks. The DSA blocks have native support for completely unstructured sparsity. We demonstrate the effectiveness of
the proposed inference engine by running the inference models of the Sparse DNN Challenge. Results for the challenge
benchmarks show that the proposed inference engine and multi-DSA parallelization on HBM-enabled Alveo U280 achieve
up to 3.7 billion edges per second inference throughput.
Productive High-Performance k-Truss Decomposition on GPU Using Linear Algebra
Runze Wang (Huazhong University of Science and Technology)*; Linchen Yu (Huazhong University of Science and
Technology); Qinggang Wang (Huazhong University of Science and Technology); Jie Xin (Huazhong University of
Science and Technology); Long Zheng (Huazhong University of Science and Technology)
The k-truss decomposition algorithm aims to find the cohesive subgraphs in a graph. It includes a two-step procedure of
triangle counting and weak edge deletion. Due to the significant differences among graph topologies, the best performer
in last year’s MIT/IEEE/Amazon GraphChallenge manually explores a well-optimized combination of many k-truss
techniques for each graph dataset, but this requires expertise from multiple disciplines. Linear algebra provides a unified
framework for writing parallel and scalable k-truss decomposition algorithms easily. However, work imbalance and low
memory efficiency remain. In this work, we propose a productive yet efficient k-truss decomposition implementation on
GPU based on fine-grained linear algebra. We propose a preprocessing method that renumbers the vertices to eliminate
the longest adjacency lists, which balances the workloads, and we place the adjacency list of each source vertex into
shared memory to improve memory access efficiency. Results show that our approach outperforms a linear algebra-based
implementation, Eager (a 2019 champion), by up to 24.81x and the manually optimized KTrussExplorer by up to 2.69x.
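The two-step procedure the abstract names, triangle counting followed by weak-edge deletion, can be sketched serially as a naive fixed-point loop (an illustration of the definition, not the paper's fine-grained linear-algebra GPU formulation):

```python
# Naive serial sketch of k-truss decomposition: compute each edge's
# triangle support, then repeatedly delete weak edges (support below
# k - 2) until a fixed point. Illustrative only; not the paper's
# linear-algebra GPU implementation.
def k_truss(edges, k):
    """edges: iterable of undirected (u, v) pairs.
    Returns the set of (min, max) edges in the k-truss."""
    adj, E = {}, set()
    for u, v in edges:
        if u == v:
            continue
        E.add((min(u, v), max(u, v)))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u, v in list(E):
            if (u, v) not in E:
                continue
            support = len(adj[u] & adj[v])   # triangles on this edge
            if support < k - 2:              # weak edge: delete it
                E.discard((u, v))
                adj[u].discard(v)
                adj[v].discard(u)
                changed = True
    return E
```

The set intersection `adj[u] & adj[v]` is the triangle-counting step; the paper's contribution is balancing and accelerating exactly this step on the GPU.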
HyKernel: A Hybrid Selection of One/Two-Phase Kernels for Triangle Counting on GPUs
Mohammad Almasri (University of Illinois at Urbana-Champaign)*; Neo Vasudeva (University of Illinois at
Urbana-Champaign); Rakesh Nagi (University of Illinois at Urbana-Champaign); Jinjun Xiong (IBM Thomas J. Watson
Research Center); Wen-Mei Hwu (University of Illinois at Urbana-Champaign)
We present a new two-phase dynamic triangle counting approach on GPUs. The approach targets graphs with high
maximum degrees and high degree standard deviations. The first phase analyzes the edges of the graph and determines
the most appropriate list intersection method and thread granularity that achieve the highest performance for each edge.
The second phase performs the list intersection with the thread granularity selected in the first phase. Furthermore, we
improve our static triangle counting kernels used in the two-phase approach, namely the two-pointer and binary-search
intersections. We introduce a hierarchical binary-search approach that loads the initial levels of the logical search tree into
the GPU fast shared memory to reduce redundant accesses to global memory. The search space reduction is another
improvement to our static kernels that reduces the size of adjacency lists when orienting the graph by vertex ID. We also
present HyKernel, a hybrid approach that selects between our two-phase dynamic and our 2019 one-phase dynamic
approaches using simple heuristics. Lastly, a study of the impact of two graph orientation approaches on the graph
preprocessing times and kernel times for our 2019 approach and the proposed two-phase approach is provided. For
evaluation, we compare our two-phase and HyKernel approaches with our 2019 submission and the 2019 Champion
approach, H-Index. When orienting the graph by vertex ID, our proposed two-phase and HyKernel approaches significantly
outperform the H-Index with geomean speedups of 4.0x (up to 31.2x) and 5.1x (up to 31.2x), resp. Our proposed two-
phase and HyKernel approaches significantly outperform the H-Index with geomean speedups of 4.7x (up to 26.8x) and
7.2x (up to 43.6x) with vertex degree orientation, resp.
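The two static intersection kernels named above, two-pointer and binary search, can be sketched on the CPU as follows (illustrative only; the paper's hierarchical shared-memory caching is a GPU refinement of the binary-search variant):

```python
# CPU sketches of the two list-intersection kernels for triangle
# counting: merge-style two-pointer (good for lists of similar
# length) and binary search of the shorter list into the longer one
# (good for highly skewed degree pairs). Illustrative only.
from bisect import bisect_left

def intersect_two_pointer(a, b):
    """Count common elements of sorted lists a and b by merging."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

def intersect_binary_search(short, long):
    """Count common elements by binary-searching each element of the
    shorter sorted list within the longer one; the upper levels of
    this search tree are what a GPU kernel would cache in fast
    shared memory."""
    count = 0
    for x in short:
        i = bisect_left(long, x)
        if i < len(long) and long[i] == x:
            count += 1
    return count
```

The per-edge choice between these two kernels, driven by degree skew, is precisely what the paper's first (analysis) phase decides.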
DPGS Graph Summarization Preserves Community Structure
L JK Durbeck (Virginia Tech)*; Peter Athanas (Virginia Tech)
Community detection, or clustering, is an NP-hard problem for which computationally efficient approximations have long
been sought. Methods have sought a balance between accuracy, simplicity, latency, and memory usage. In this paper, we
assess whether a recently developed graph summarization technique preserves the underlying community structure of the
graph. The technique is Degree-Preserving Graph Summarization (DPGS). It lossily encodes the graph using fewer nodes
and edges but does not assume a uniform distribution of node degrees, as is common in most Minimum Description
Length-based GS methods. Instead, it strives to improve the reconstruction model for graphs with realistic skewed degrees
using the configuration model. We show that the supernodes produced by DPGS capture the latent communities within the
graph for a series of graphs with known community structure, while reducing the graph description length by 25%.
4-S2: TBD Special (17:30-19:30)
Organizer(s):
2021 Abstract Book