2021 IEEE High Performance Extreme Computing Virtual Conference 20 - 24 September 2021
4-S1: Graph Challenge Special (17:30-19:30) Organizer(s): Jeremy Kepner
Fast Sparse Deep Neural Network Inference with Flexible SpMM Optimization Space Exploration
Jie Xin (Huazhong University of Science and Technology); Xianqi Ye (Huazhong University of Science and Technology); Long Zheng (Huazhong University of Science and Technology)*; Qinggang Wang (Huazhong University of Science and Technology); Yu Huang (Huazhong University of Science and Technology); Pengcheng Yao (Huazhong University of Science and Technology); Linchen Yu (Huazhong University of Science and Technology); Xiaofei Liao (Huazhong University of Science and Technology); Hai Jin (Huazhong University of Science and Technology)
Deep neural networks (DNNs) are widely used in many fields, but as model sizes grow ever larger, DNN scalability suffers. Sparse deep neural networks (SpDNNs) promise to resolve this problem, yet sparse data is difficult to process efficiently on GPUs due to load imbalance and irregular memory accesses. The recent MIT/IEEE/Amazon GraphChallenge has produced several significant advances in fitting sparse DNNs onto GPUs, but we observe that none of these earlier efforts is an absolute winner across all datasets, because each considers only a limited optimization space. In this paper, we identify new opportunities for optimizing SpDNN execution through a comprehensive analysis of previous work. Based on this enlarged design space, we present sparsity-aware SpMM algorithms that systematically search for a performance-optimal SpDNN execution plan on the GPU and then generate optimized SpMM kernel implementations. Compared to the 2020 HPEC Sparse DNN Challenge champions, our approach achieves up to 55.6 TeraEdges per second of inference throughput, with speedups of up to 13.74x [1] and 22.29x [2] on a single NVIDIA V100 GPU. We also show that our approach on 4 GPUs often outperforms the 2020 Challenge champion running on 768 GPUs. The source code is available at https://github.com/CGCL-codes/Graphchallenge21.

Faster Stochastic Block Partition using Aggressive Initial Merging, Compressed Representation, and Parallelism Control
Ahsen J Uppal (The George Washington University)*; H. Howie Huang (The George Washington University); Jaeseok Choi (The George Washington University); Thomas Rolinger (Laboratory for Physical Sciences)
The community detection problem continues to be challenging, particularly for large graph data. Although optimal graph partitioning is NP-hard, stochastic methods, such as those in the IEEE HPEC GraphChallenge, can provide good approximate solutions in reasonable time. But the scalability of such solutions with increasing graph size remains a challenge. In this work, we describe three new techniques to speed up the stochastic block partition algorithm. The first uses aggressive agglomerative merging (a portion of the algorithm with high parallel scalability) to quickly cut the initial number of communities, and hence the amount of data that must be processed, yielding an independent speedup of 1.85x for a 200k-node graph. The second uses a novel compressed data structure to store the algorithm's main bookkeeping information; this compressed representation allows the processing of large graphs that would otherwise be impossible due to memory constraints and delivers a speedup of up to 1.19x over our uncompressed baseline. The third carefully manages the amount of parallelism during different phases of the algorithm: compared to our best baseline configuration with a fixed number of threads, it yields an independent speedup of 2.26x for a 200k-node graph. Combined, our techniques produce speedups of 3.78x for a 50k-node graph, 4.71x for a 100k-node graph, and 5.13x for a 200k-node graph over our previous best parallel algorithm.
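As a concrete illustration of the first technique above, the sketch below shows agglomerative merging in miniature. It is not the authors' implementation: the real stochastic block partition scores candidate merges by their change in description length, whereas this toy version simply merges the pair of blocks joined by the most edges until a target block count is reached; the aggressive_merge name and the two-triangle example are ours.

```python
# Toy sketch of aggressive agglomerative merging (hypothetical helper,
# not the paper's code): repeatedly merge the pair of blocks connected
# by the most edges until only `target_blocks` communities remain.
from collections import defaultdict

def aggressive_merge(edges, num_vertices, target_blocks):
    block = list(range(num_vertices))      # block[v] = block id of vertex v
    while len(set(block)) > target_blocks:
        between = defaultdict(int)         # edge counts between distinct blocks
        for u, v in edges:
            bu, bv = block[u], block[v]
            if bu != bv:
                between[(min(bu, bv), max(bu, bv))] += 1
        if not between:
            break                          # no inter-block edges left to merge
        (a, b), _ = max(between.items(), key=lambda kv: kv[1])
        block = [a if x == b else x for x in block]   # merge block b into a
    return block

# Two triangles joined by one bridge edge collapse onto the triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(aggressive_merge(edges, 6, 2))       # [0, 0, 0, 3, 3, 3]
```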
Towards Distributed Square Counting in Large Graphs
Trevor Steil (Lawrence Livermore National Laboratory)*; Geoffrey Sanders (Lawrence Livermore National Laboratory); Roger Pearce (Lawrence Livermore National Laboratory)
Counting all squares (4-cycles) in a large graph is an expensive but important computation that helps network scientists understand the higher-order structure of their large relational datasets. This is particularly true for bipartite graphs, where 4-cycles are fundamental (as one of the smallest motifs not enumerable locally at vertices or edges) and highly useful for measuring notions of connectivity and clustering. We build a distributed, asynchronous approach to counting squares and demonstrate its ability to handle large real-world datasets, building on several techniques from previous approaches, most notably degree-based vertex ordering. We note that a bipartite graph with one vertex set significantly smaller than the other is a special and common case, and we describe the benefits of our approach in this case. Although our approach can count hundreds of trillions of squares in a massive distributed graph, its scaling is poor, setting the stage for algorithmic improvement by the GraphChallenge community. We discuss the bottlenecks in our approach that must be mitigated to achieve a highly performant and scalable method.

Sparse Deep Neural Network Acceleration on HBM-Enabled FPGA Platform
Abhishek K Jain (Xilinx)*; Sharan Kumar (Xilinx); Aashish Tripathi (Xilinx); Dinesh Gaitonde (Xilinx)
Domain-specific architectures (DSAs) for deep neural networks (DNNs) are becoming mainstream because of their high energy efficiency and performance. Currently, most of these DSAs perform dense linear algebra efficiently by exploiting temporal and spatial locality, high data reuse, and regular memory access patterns. However, state-of-the-art DNN models require DSAs with very large amounts of compute, interconnect, and memory resources. Since sparsity in DNN parameters can be exploited to reduce the resources and complexity of DSAs, emerging DSAs are adding native support for sparsity. This paper presents a sparse DNN inference engine on the HBM-enabled Alveo U280 platform, built around FPGA-optimized DSA blocks with native support for completely unstructured sparsity. We demonstrate the effectiveness of the proposed inference engine by running the inference models of the Sparse DNN Challenge. Results on the challenge benchmarks show that the proposed inference engine with multi-DSA parallelization on the HBM-enabled Alveo U280 achieves up to 3.7 billion edges per second of inference throughput.
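Both sparse DNN papers in this session accelerate the same Sparse DNN Challenge workload: a stack of layers, each a sparse matrix product followed by a biased, clipped ReLU. The sketch below restates that loop with scipy.sparse standing in for the hand-tuned GPU and FPGA kernels; the bias and the YMAX = 32 clip follow the challenge's reference formulation, but the sizes, density, and constants here are illustrative only.

```python
# Minimal Sparse DNN Challenge inference loop (illustrative, not the
# papers' kernels): per layer, Y <- min(max(Y*W + bias, 0), ymax).
import numpy as np
import scipy.sparse as sp

def spdnn_inference(Y, layers, bias=-0.3, ymax=32.0):
    """Y: dense (batch, n) features; layers: list of (n, n) sparse weights."""
    for W in layers:
        Z = np.asarray(Y @ W)              # SpMM: the kernel being optimized
        Z = np.maximum(Z + bias, 0.0)      # biased ReLU
        Y = np.minimum(Z, ymax)            # clip at YMAX, per the challenge
    return Y

rng = np.random.default_rng(0)
n, batch = 1024, 4
layers = [sp.random(n, n, density=0.01, format="csr", random_state=i)
          for i in range(3)]
print(spdnn_inference(rng.random((batch, n)), layers).shape)   # (4, 1024)
```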
Productive High-Performance k-Truss Decomposition on GPU Using Linear Algebra
Runze Wang (Huazhong University of Science and Technology)*; Linchen Yu (Huazhong University of Science and Technology); Qinggang Wang (Huazhong University of Science and Technology); Jie Xin (Huazhong University of Science and Technology); Long Zheng (Huazhong University of Science and Technology)
The k-truss decomposition algorithm finds cohesive subgraphs in a graph via a two-step procedure of triangle counting and weak-edge deletion. Because graph topologies differ significantly, the best performer in last year's MIT/IEEE/Amazon GraphChallenge manually explores a well-optimized combination of many k-truss techniques for each graph dataset, which requires expertise from multiple disciplines. Linear algebra provides a unified framework for easily writing parallel, scalable k-truss decomposition algorithms, but work imbalance and low memory efficiency remain. In this work, we propose a productive yet efficient k-truss decomposition implementation on GPU based on fine-grained linear algebra. We introduce a preprocessing step that renumbers vertices to eliminate the longest adjacency lists, balancing the workload, and we place the adjacency list of each source vertex into shared memory to improve memory access efficiency. Results show that our approach outperforms eager, a linear-algebra-based implementation and 2019 champion, by up to 24.81x, and the manually optimized KTRUSSEXPLORER by up to 2.69x.

HyKernel: A Hybrid Selection of One/Two-Phase Kernels for Triangle Counting on GPUs
Mohammad Almasri (University of Illinois at Urbana-Champaign)*; Neo Vasudeva (University of Illinois at Urbana-Champaign); Rakesh Nagi (University of Illinois at Urbana-Champaign); Jinjun Xiong (IBM Thomas J. Watson Research Center); Wen-Mei Hwu (University of Illinois at Urbana-Champaign)
We present a new two-phase dynamic triangle counting approach on GPUs, targeting graphs with high maximum degrees and high degree standard deviations. The first phase analyzes the edges of the graph and determines, for each edge, the list intersection method and thread granularity that achieve the highest performance; the second phase performs the list intersection with the selected granularity. Furthermore, we improve the static triangle counting kernels used in the two-phase approach, namely the two-pointer and binary-search intersections. We introduce a hierarchical binary-search approach that loads the initial levels of the logical search tree into the GPU's fast shared memory to reduce redundant accesses to global memory, and a search-space reduction that shrinks the adjacency lists when orienting the graph by vertex ID. We also present HyKernel, a hybrid approach that uses simple heuristics to select between our two-phase dynamic approach and our 2019 one-phase dynamic approach. Lastly, we study the impact of two graph orientation approaches on graph preprocessing and kernel times for our 2019 approach and the proposed two-phase approach. For evaluation, we compare our two-phase and HyKernel approaches with our 2019 submission and the 2019 champion, H-Index. When orienting the graph by vertex ID, our two-phase and HyKernel approaches significantly outperform H-Index, with geomean speedups of 4.0x (up to 31.2x) and 5.1x (up to 31.2x), respectively; with vertex-degree orientation, the geomean speedups are 4.7x (up to 26.8x) and 7.2x (up to 43.6x), respectively.
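For readers new to the GraphChallenge kernels, the sketch below is a plain-Python rendering (not the HyKernel GPU code) of two devices the HyKernel abstract names: orienting each edge from lower to higher vertex ID so every triangle is counted exactly once, and a two-pointer merge intersection over sorted adjacency lists. On the GPU, the same intersection runs per edge with a per-edge choice of method and thread granularity.

```python
# Triangle counting with vertex-ID orientation and two-pointer
# intersection (reference sketch only; HyKernel runs this per edge on
# the GPU with dynamically chosen kernels and thread granularity).
def count_triangles(edges):
    adj = {}
    for u, v in edges:                     # orient edges low ID -> high ID
        if u != v:
            lo, hi = (u, v) if u < v else (v, u)
            adj.setdefault(lo, set()).add(hi)
    nbrs = {u: sorted(vs) for u, vs in adj.items()}
    total = 0
    for u, a in nbrs.items():
        for v in a:                        # for each oriented edge (u, v)
            b, i, j = nbrs.get(v, []), 0, 0
            while i < len(a) and j < len(b):   # two-pointer intersection
                if a[i] == b[j]:
                    total += 1; i += 1; j += 1
                elif a[i] < b[j]:
                    i += 1
                else:
                    j += 1
    return total

# The complete graph K4 contains exactly four triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4))                 # 4
```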
DPGS Graph Summarization Preserves Community Structure
L. J. K. Durbeck (Virginia Tech)*; Peter Athanas (Virginia Tech)
Community detection, or clustering, is an NP-hard problem for which computationally efficient approximations have long been sought, balancing accuracy, simplicity, latency, and memory usage. In this paper, we assess whether a recently developed graph summarization technique, Degree-Preserving Graph Summarization (DPGS), preserves the underlying community structure of a graph. DPGS lossily encodes the graph using fewer nodes and edges, but it does not assume a uniform distribution of node degrees, as most Minimum Description Length-based summarization methods do; instead, it uses the configuration model to improve the reconstruction of graphs with realistically skewed degree distributions. We show that, for a series of graphs with known community structure, the supernodes produced by DPGS capture the latent communities while reducing the graph description length by 25%.
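To make the object of study concrete, the toy sketch below (our illustration, not DPGS) collapses a graph with known communities into one supernode per community and counts the resulting superedges. DPGS instead discovers the supernodes itself by minimizing a Minimum Description Length objective under the configuration model, and the paper's 25% figure refers to that encoding, not to this naive edge count.

```python
# Toy graph summarization (hypothetical summarize() helper, not DPGS):
# collapse each known community into a supernode; superedge weights
# count the original edges within or between communities.
from collections import Counter

def summarize(edges, community):
    superedges = Counter()
    for u, v in edges:
        cu, cv = community[u], community[v]
        superedges[(min(cu, cv), max(cu, cv))] += 1
    return superedges

# Two planted communities joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
summary = summarize(edges, community)
print(dict(summary))     # {('A', 'A'): 3, ('B', 'B'): 3, ('A', 'B'): 1}
print(len(edges), "edges ->", len(summary), "superedges")   # 7 edges -> 3 superedges
```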
4-S2: TBD Special (17:30-19:30) Organizer(s):