27th Annual
IEEE High Performance Extreme Computing Virtual Conference
25 - 29 September 2023

HPEC 2023 AGENDA

Tuesday, September 26

 

2-K: Keynote Session (10:30-11:00)

Co-Chairs: J. Kepner & A. Reuther
Mission Critical: Power of Operationalizing Data & AI
Eileen Vidrine (Dept. of the Air Force Chief Data & AI Officer)

2-1: Graph Analytics & Network Science 1 Session (11:00-12:15)

Co-Chairs: M. Barnell & X. Sun
Focusing and Calibration of Large Scale Network Sensors using GraphBLAS Anonymized Hypersparse Matrices
Jeremy Kepner, Michael S Jones (MIT Lincoln Laboratory), Phil Dykstra (HPCMP DREN), Chansup Byun (MIT Lincoln Laboratory), Timothy Davis (Texas A&M), Hayden Jananthan, William Arcand, David Bestor, William Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Anna Klein (MIT Lincoln Laboratory), Lauren Milechin (MIT), Guillermo Morales, Julie Mullen, Ritesh Patel (MIT Lincoln Laboratory), Alex Pentland (MIT), Sandeep Pisharody, Andrew Prout, Albert Reuther, Antonio Rosa, Siddharth Samsi, Tyler Trigg, Charles Yee, Peter Michaleas (MIT Lincoln Laboratory)
Defending community-owned cyberspace requires community-based efforts. Large-scale network observations that uphold the highest regard for privacy are key to protecting our shared cyberspace. Deployment of the necessary network sensors requires careful sensor placement, focusing, and calibration with significant volumes of network observations. This paper demonstrates novel focusing and calibration procedures on a multi-billion packet dataset using high-performance GraphBLAS anonymized hypersparse matrices. The run-time performance on a real-world dataset confirms previously observed real-time processing rates for high-bandwidth links while achieving significant data compression. The output of the analysis demonstrates the effectiveness of these procedures at focusing the traffic matrix and revealing the underlying stable heavy-tail statistical distributions that are necessary for anomaly detection. A simple model of the corresponding probability of detection (Pd) and probability of false alarm (Pfa) for these distributions highlights the criticality of network sensor focusing and calibration. Once a sensor is properly focused and calibrated, it is in a position to carry out two of the central tenets of good cybersecurity: (1) continuous observation of the network and (2) minimizing unbrokered network connections.
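As a minimal illustration of the traffic-matrix idea (a Python/SciPy sketch with hypothetical toy indices, not the authors' GraphBLAS pipeline), anonymized (source, destination) packet observations can be accumulated into a sparse matrix whose row and column sums drive focusing and calibration:
    # Toy sketch: accumulate anonymized (source, destination) packet pairs into a
    # sparse traffic matrix; indices and counts below are hypothetical.
    import numpy as np
    from scipy.sparse import coo_matrix

    src = np.array([3, 3, 7, 1, 3, 7])      # anonymized source indices
    dst = np.array([5, 5, 2, 5, 9, 2])      # anonymized destination indices
    n = 16                                   # toy address-space size

    T = coo_matrix((np.ones_like(src), (src, dst)), shape=(n, n)).tocsr()
    # duplicate (src, dst) pairs are summed, giving per-link packet counts

    out_packets = np.asarray(T.sum(axis=1)).ravel()   # packets sent per source
    in_packets = np.asarray(T.sum(axis=0)).ravel()    # packets received per destination
    print(T.nnz, out_packets[3], in_packets[5])       # 4 links, 3 from source 3, 3 to dest 5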
Parallel Longest Common SubSequence Analysis In Chapel
Soroush Vahidi, Baruch Schieber (New Jersey Inst. of Tech.), Zhihui Du, David Bader (New Jersey Inst. of Tech.)
One of the most critical problems in the field of string algorithms is the longest common subsequence problem (LCS). The problem is NP-hard for an arbitrary number of strings but can be solved in polynomial time for a fixed number of strings. In this paper, we select a typical parallel LCS algorithm and integrate it into our large-scale string analysis algorithm library to support different types of large string analysis. Specifically, we take advantage of the high-level parallel language, Chapel, to integrate Lu and Liu’s parallel LCS algorithm into Arkouda, an open-source framework. Through Arkouda, data scientists can easily handle large string analytics on the back-end high-performance computing resources from the front-end Python interface. The Chapel-enabled parallel LCS algorithm can identify the longest common subsequences of two strings, and experimental results are given to show how the number of parallel resources and the length of input strings can affect the algorithm’s performance.
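For context, the textbook sequential dynamic program for the two-string LCS (the baseline the parallel approach improves on; this sketch is not Lu and Liu's algorithm or its Chapel/Arkouda integration) looks like this:
    # Standard O(|a|*|b|) dynamic program for the longest common subsequence
    # of two strings; sequential baseline for illustration only.
    def lcs(a, b):
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        # backtrack to recover one longest common subsequence
        out, i, j = [], m, n
        while i and j:
            if a[i - 1] == b[j - 1]:
                out.append(a[i - 1]); i -= 1; j -= 1
            elif dp[i - 1][j] >= dp[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return "".join(reversed(out))

    print(len(lcs("ABCBDAB", "BDCABA")))   # 4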
Exploiting Fusion Opportunities in Linear Algebraic Graph Query Engines
Yuttapichai Kerdcharoen, Upasana Sridhar, Tze Meng Low (Carnegie Mellon Univ.)
Queries in a graph database are often converted into a sequence of graph operations by a graph query engine. In recent years, it has been recognized that the query engine benefits from using high-performance graph libraries via the GraphBLAS interface to implement time-consuming operations such as graph traversal. However, using GraphBLAS requires explicitly casting data into linear algebra objects and decomposing the query into multiple operations, some of which are expressible by the GraphBLAS. The combination of these two requirements translates into increased memory footprints and additional execution times. In this paper, we show that fusing different stages of the query engines into GraphBLAS calls can reduce the size of the intermediate data generated during the query. Furthermore, by relaxing the semi-ring constraints imposed by GraphBLAS, more aggressive fusions of the stages can be performed. We show a speedup of up to 1235.89x (8.82x on geometric average) relative to an open-source graph query engine using GraphBLAS (i.e. RedisGraph) for processing undirected subgraph enumeration queries.
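A minimal sketch of the fusion idea (toy Python dictionaries standing in for sparse matrices; not RedisGraph or a GraphBLAS implementation): applying the query's mask inside the sparse product avoids materializing the full intermediate that a staged execution would build.
    from collections import defaultdict

    # Unfused: materialize the whole product C = A*B, then filter by the mask.
    def spgemm_then_filter(A, B, mask):
        C = defaultdict(int)
        for (i, k), a in A.items():
            for (kk, j), b in B.items():
                if k == kk:
                    C[(i, j)] += a * b              # full intermediate built here
        return {ij: v for ij, v in C.items() if ij in mask}

    # Fused: the mask is applied inside the kernel, so only needed entries exist.
    def fused_masked_spgemm(A, B, mask):
        C = defaultdict(int)
        for (i, k), a in A.items():
            for (kk, j), b in B.items():
                if k == kk and (i, j) in mask:
                    C[(i, j)] += a * b
        return dict(C)

    A = {(0, 1): 1, (1, 2): 1, (2, 0): 1}           # directed 3-cycle
    mask = {(0, 2)}
    print(spgemm_then_filter(A, A, mask) == fused_masked_spgemm(A, A, mask))  # True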
Parallel Clustering with Resolution Variation
Nikos P Pitsianis (Aristotle Univ. of Thessaloniki), Dimitris Floros, Tiancheng Liu, Xiaobai Sun (Duke University)
We introduce a novel approach for parallel data clustering with resolution variation. Conventional graph clustering is typically governed by a function defined over all possible cluster configurations but at a fixed value of the resolution hyperparameter denoted as 𝛾. Such clustering suffers from issues related to the so-called resolution limit or requires resolution tuning. This has been changed by recent theories and algorithms for graph clustering with resolution variation. The requirement for specifying or tuning the 𝛾-hyperparameter is effectively removed, and the clustering function becomes or transforms to a functional form with 𝛾 as an internal resolution variable. We address a standing and significant challenge in parallel clustering with resolution variation. We identify and remove the key bottlenecks in search operations confined to a specific 𝛾 value, and reduce and minimize redundant search operations at different 𝛾 values. We show impressive performance achieved with our parallel approach on real-world datasets.

2-P: Poster Session (12:15-14:15)

Chair(s)/Host(s): TBD & TBD
Photonic Accelerators for Image Segmentation in Autonomous Driving and Defect Detection
Lakshmi V Nair, David Widemann, Brad Turcott, Nick Moore, Alexandra Wleklinski, Darius Bunandar (Lightmatter), Ioannis Papavasileiou, Shihu Wang, Eric Logan (Corning)
Photonic computing promises faster and more energy-efficient deep neural network (DNN) inference than traditional digital hardware. Advances in photonic computing can have profound impacts on applications such as autonomous driving and defect detection that depend on fast, accurate and energy efficient execution of image segmentation models. In this paper, we investigate image segmentation on photonic accelerators to explore: a) the types of image segmentation DNN architectures that are best suited for photonic accelerators, and b) the throughput and energy efficiency of executing the different image segmentation models on photonic accelerators, along with the trade-offs involved therein. Specifically, we demonstrate that certain segmentation models exhibit negligible loss in accuracy (compared to digital Float32 models) when executed on photonic accelerators, and explore the empirical reasoning for their robustness. We also discuss techniques for recovering accuracy in the case of models that do not perform well. Further, we compare throughput (inferences-per-second) and energy consumption estimates for different image segmentation workloads on photonic accelerators. We discuss the challenges and potential optimizations that can help improve the application of photonic accelerators to such computer vision tasks.
Deep Learning Recommendation Model Training Co-design with the Dynamic Opera Network
Connor Imes, Andrew J Rittenbach, Peng Xie, Dong-In Kang, John Paul Walters, Stephen Crago (USC Information Sciences Institute)
Training deep learning recommendation models (DLRMs) is increasingly dominated by all-to-all and many-to-many communication patterns. Current solutions often involve designing and implementing fully-connected—and costly—high-speed interconnects. The recently proposed dynamic Opera network optimizes bulk data flows using direct forwarding through time-varying circuits and has been shown to be particularly useful for all-to-all traffic patterns while remaining cost-equivalent with static network topologies. We propose co-designing DLRM models with the Opera network to improve training time while matching network infrastructure cost with a traditional fat-tree topology. We simulate strong-scaling DLRM training on Opera networks up to 1024 nodes, identify shifting bottlenecks, and suggest where co-designers should focus their efforts.
Similarity Computation based on the Tails of the Rank Distributions and the Related Graphs
Cecilia Bolea (Romanian Academy), Mike Teodorescu (Univ. of Washington), Silviu Bejnariu, Daniela Gifu (Romanian Academy), Horia Teodorescu (Technical Univ. Iasi), Vasile Apopei (Romanian Academy)
Tools for document analysis, characterization, and retrieval are introduced based on a rigorous framework. The procedure is based on a rigorous alignment method for texts of different lengths. A discussion of the algorithmic approach is also presented. The algorithm, which is suitable for parallelization, will be presented in the extended paper.
Automated and Waterless Cleaning of Solar Panels using Unmanned Aerial Vehicle and Machine Learning
Sahaj Dargan, Abhinav Vaddiraju, Ayush Singh, Jigyasu Pant (Vellore Inst. of Tech.)
Energy from the Sun is a very abundant resource, the uses of which are manifold. Water is another essential resource and is vital for the sustenance of life on Earth. According to a report, the amount of water needed to clean a single solar panel is 3-5 litres in normal areas and 7-8 litres in desert areas. While the installation of solar panels has a key role in protecting the environment, the water wasted in their maintenance is a serious concern, as it undermines that very purpose. Thus, the proposed work introduces a waterless and automated mechanism for the selective cleaning of solar panels using an unmanned aerial vehicle (UAV) and machine learning. The term selective indicates that only the panels that are dirty are cleaned, in contrast to the traditional approach where all panels are cleaned regardless of their condition. The novelty of our approach lies in data collection using a UAV and an automated, machine-learning-based cleaning mechanism that does not use water. The UAV monitors the solar panels and periodically collects images of them. The collected images are compared to a preloaded dataset and classified as either clean or dirty. According to the status of the panels, non-abrasive microfibers on a pole mounted on the solar panel are activated for cleaning.
Accuracy Analysis of Hotel Review Information using Machine Learning
Abu Asaduzzaman, Md Raihan Uddin, Ghana S. Kutala, Yoel Woldeyes (Wichita State Univ.)
Websites and online programs that offer hotel booking services have sections where customers can provide reviews about their experience at the hotel. However, users often cannot filter through all the available reviews due to the sheer volume of information. Sentiment analysis can help solve this problem by categorizing reviews into positive or negative attitudes. This study aims to explore how lemmatization and bi-gram techniques can be applied in sentiment analysis classification. The study involves the following steps: collecting hotel review data from the GitHub website; cleaning and formatting the data; and implementing lemmatization using various n-gram methods, including bigram, opinion lexicon, Valence Aware Dictionary and sEntiment Reasoner (VADER) lexicon, and TextBlob. Based on the outcomes, we classify the reviews using support vector machine (SVM) and logistic regression methods and evaluate their accuracy. It is found that SVM produces very accurate (up to 99.72%) results. Applying Synthetic Minority Oversampling Technique (SMOTE) with SVM may improve accuracy.
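As a rough illustration of this kind of pipeline (hypothetical toy reviews, scikit-learn, bigram TF-IDF features, and a linear SVM; not the study's data or code):
    # Illustrative sketch: bigram TF-IDF features + linear SVM for review polarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    reviews = ["great room and friendly staff", "dirty bathroom, never again",
               "loved the breakfast and the view", "rude reception and noisy hallway"]
    labels = [1, 0, 1, 0]                       # 1 = positive, 0 = negative

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    clf.fit(reviews, labels)
    print(clf.predict(["friendly staff and great view"]))   # expected: [1]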
Big Data Opportunities in Production Records at Air Force Maintenance Depots
Braden Eichmeier (US Air Force)
The Ogden Air Logistics Complex at Hill Air Force Base performs depot maintenance for a wide range of systems. The depot maintains a database to record work, manage logistics, forecast future workload, and facilitate report generation. This study investigates big data and machine learning applications within these data. Findings show various categories of data with sufficient quantity for big data analysis. Weaknesses of the database include infrequent and unreliable connections to legacy systems, a lack of automated user notifications, and low-quality labor hour data. Suggestions for better utilizing the current data include developing data cleaning and imputation tools to improve labor time reporting, implementing user alerts for anomalous data according to a predictive maintenance framework, developing a digital twin scheme to rapidly analyze and forecast base-wide maintenance conditions, and researching the feasibility of generating preliminary workload plans and schedules using reinforcement learning.
Examining the Impact of Artificial Intelligence on Cybersecurity within the Internet of Things
Mayur Rele (Parachute Health), Dipti Patil (Univ. of Cumberlands)
The explosive growth of the Internet of Things (IoT) has created unprecedented cybersecurity challenges and opportunities. As the Internet of Things expands to include more connected devices, it becomes more difficult to safeguard the security and privacy of sensitive data. In this context, artificial intelligence (AI) has become a potent instrument for enhancing the security of Internet of Things (IoT) devices. This article seeks a greater understanding of how artificial intelligence may support cybersecurity in the IoT ecosystem. To reduce vulnerabilities, identify threats, and enhance the overall resilience of Internet of Things (IoT) systems, this paper examines the application of AI techniques. The proposed research investigates the numerous applications of AI in IoT cybersecurity, such as threat intelligence, behavioral analysis, anomaly detection, and predictive modeling. The study will include a summary of current AI-driven IoT security methods and algorithms and an evaluation of their strengths and limitations. In addition, it will underscore the importance of combining AI with other cybersecurity technologies, such as blockchain and cloud computing, to construct a robust defense system. Additionally, the paper discusses the ethical implications of AI use in IoT cybersecurity. It will discuss potential issues such as biases in AI algorithms, privacy concerns, and the need for accountability and transparency in decision-making. In conclusion, this article provides useful insights into AI’s role in IoT cybersecurity and its potential to alter IoT system defense fundamentally.
Assessing Generative Adversarial Networks for Advanced Deepfake Creation Using Network Analysis
Minakshi Arya (NDSU), Shubhavi Arya (Indiana Univ.), Saatvik Arya (Univ. of Washington)
A GAN architecture comprises a generator model that outputs new plausible synthetic images and a discriminator model that classifies images as authentic (from the dataset) or fake (generated). The generator model learns to fool the discriminator and updates itself via the discriminator model, while the discriminator model updates itself directly. Deepfake GANs are applied to image generation, high-resolution image generation, 3D object generation, age estimation, cartoon character/animation/sketch generation, natural language processing, video generation, data augmentation, and detection. However, different models address different tasks, so we seek a combination of deepfake GAN models that can carry out the tasks of all the categories. After analyzing the different studies, we conclude that the AttGAN and Pix2Pix models have the highest degree centrality and that this combination can generate deepfakes in all the categories.
Human Capital Risk Frameworks
Taylor Hilliard, Rebekah C Magness, Jordan Tribble (DAF-MIT/AIA)
Effective leadership should be informed by data. In the Department of the Air Force (DAF), the DAF/A1 community stores many human capital datasets that hold potential for predicting risk and strategic advantage. Accurate predictions are crucial in this context, as they can guide decision-making from the unit level to the headquarters level. To achieve the highest level of accuracy, we propose utilizing Deep Learning, a subset of Artificial Intelligence and Machine Learning, which has demonstrated exceptional performance in digesting and interpreting complex data. Such methods are being widely adopted across various industries to enhance employee satisfaction and effectiveness. In the DAF, this approach could be harnessed to improve readiness.
E-Learning Platform for Medical Students
Dany Poly, Ishika Gupta, Hanah Susan Zachariah, Shagufta Rajguru, Rakhi Kalantri (Fr. Conceicao Rodrigues Inst. of Tech.)
E-learning is an integral part of smart education. With the growth of e-learning platforms, the need for 3D learning environments is at its peak. Our challenge is to integrate a web application with a 3D learning environment for students in the medical field. With this, learners will be able to interpret human anatomy, including surrounding structures. Additionally, students can upload their own DICOM files and get a live 3D representation of them. The design of this e-learning platform also includes tests that help students check their understanding.

2-2: Graph Analytics & Network Science 2 Session (12:30-13:45)

Co-Chairs: K. Cain & F. Indiviglio
Property Graphs in Arachne
Oliver Alvarado Rodriguez, Fernando Vera Buschmann, Zhihui Du, David Bader (New Jersey Inst. of Tech.)
Analyzing large-scale graphs poses challenges due to their increasing size and the demand for interactive and user-friendly analytics tools. These graphs arise from various domains, including cybersecurity, social sciences, health sciences, and network sciences, where networks can represent interactions between humans, neurons in the brain, or malicious flows in a network. Exploring these large graphs is crucial for revealing hidden structures and metrics that are not easily computable without parallel computing. Currently, Python users can leverage the open-source Arkouda framework to efficiently execute Pandas and NumPy-related tasks on thousands of cores. To address large-scale graph analysis, Arachne, an extension to Arkouda, enables easy transformation of Arkouda dataframes into graphs. This paper proposes and evaluates three distributable data structures for property graphs, implemented in Chapel, that are integrated into Arachne. Enriching Arachne with support for property graphs will empower data scientists to extend their analysis to new problem domains. Property graphs present additional complexities, requiring efficient storage for extra information on vertices and edges, such as labels, relationships, and properties.
Opportunistic Query Execution on SmartNICs for Analyzing In-Transit Data
Jianshen Liu, Carlos Maltzahn (UC Santa Cruz), Craig Ulmer (SNL)
High-performance computing (HPC) systems researchers have proposed using current, programmable network interface cards (or SmartNICs) to offload data management services that would otherwise consume host processor cycles in a platform. While this work has successfully mapped data pipelines to a collection of SmartNICs, users require a flexible means of inspecting in-transit data to assess the live state of the system. In this paper, we explore SmartNIC-driven opportunistic query execution, i.e., enabling the SmartNIC to make a decision about whether to execute a query operation locally (i.e., “offload”) or defer execution to the client (i.e., “push-back”). Characterizations of different parts of the end-to-end query path allow the decision engine to make complexity predictions that would not be feasible by the client alone.
Parallel Algorithms for Computing Jaccard Weights on Graphs using Linear Algebra
Elaheh Hassani, Md Taufique Hussain, Ariful Azad (Indiana Univ.)
Jaccard similarity between a pair of vertices in a graph measures the relative overlap among their adjacent vertices. This metric is used to estimate the strength of existing edges and predict new edges between pairs of disconnected vertices.  Computing Jaccard similarity for all pairs of vertices or for all edges is computationally expensive. Existing sequential and parallel algorithms are either too slow or do not scale well for large scale graphs.  We present a shared-memory parallel algorithm for computing Jaccard weights.  Our algorithm relies on sparse linear algebraic operations that utilize masking, semirings, vector iterators, and other GraphBLAS features for performance. Our implementation, albeit simple, outperforms recent state-of-the-art implementations by a factor of up to 20 times and exhibits an average speedup of 9 times.
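A minimal SciPy sketch of the underlying linear-algebraic identity (not the authors' GraphBLAS implementation): with adjacency matrix A, the neighborhood intersection of u and v is (A·A)[u, v], and the union follows from the degrees.
    # Jaccard weight of (u, v) = |N(u) & N(v)| / |N(u) | N(v)| via sparse algebra.
    import numpy as np
    from scipy.sparse import csr_matrix

    edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
    n = 4
    rows, cols = zip(*edges)
    A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))
    A = A + A.T                                   # undirected adjacency, no self-loops

    deg = np.asarray(A.sum(axis=1)).ravel()
    common = (A @ A).toarray()                    # common[u, v] = |N(u) & N(v)|

    def jaccard(u, v):
        inter = common[u, v]
        union = deg[u] + deg[v] - inter
        return inter / union if union else 0.0

    print(jaccard(1, 2))                          # N(1)={0,2,3}, N(2)={0,1,3} -> 0.5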
Fast Spectral Graph Partitioning with a Randomized Eigensolver
Heliezer J Espinoza (Cal Poly Pomona), Jennifer A Loe, Erik Boman (SNL)
A known problem in parallel computing is how to partition a matrix such that work can be distributed among several processors efficiently. One technique to do this is spectral graph partitioning, which uses the eigenvectors of the graph Laplacian to determine the optimal way for the matrix to be divided. This partitioning method is particularly suited for parallelization, specifically for GPUs, as it mainly relies on linear algebra operations. However, this increased parallelism may come at the cost of accuracy.  In this work, we present a novel improvement to spectral graph partitioning by replacing the exact eigensolver (LOBPCG) with a randomized eigensolver roughly an order of magnitude faster. While the accuracy of the eigensolver is typically worse, we show that for graph partitioning this is sufficient. Our algorithm is implemented in the Sphynx spectral graph partitioner, contained in the Zoltan2 package of Trilinos. Results show this randomized method in general gives a substantial speedup with minimal loss in the quality of the edge cut. In some cases the randomized method even gives slightly better edge cuts than the LOBPCG eigensolver.
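The underlying idea, independent of which eigensolver is used, is spectral bisection by the sign of the Fiedler vector; a dense toy sketch (illustrative only, not Sphynx, LOBPCG, or the randomized solver):
    # Spectral bisection: split vertices by the sign of the second-smallest
    # eigenvector (Fiedler vector) of the graph Laplacian.
    import numpy as np

    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]  # two triangles + bridge
    n = 6
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    L = np.diag(A.sum(axis=1)) - A               # graph Laplacian

    vals, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    print(fiedler > 0)                           # {0,1,2} on one side, {3,4,5} on the other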

2-3: Graph Analytics & Network Science 3 Session (14:15-15:30)

Co-Chairs: C. Hillegas & X. Sun
Decomposition Based Refinement for the Network Interdiction Problem
Krish Matta (Carnegie Mellon Univ.), Xiaoyuan Liu (Fujitsu Research), Ilya Safro (Univ. of Delaware)
The shortest path network interdiction (SPNI) problem poses significant computational challenges due to its NP-hardness. Current solutions, primarily based on integer programming methods, are inefficient for large-scale instances. In this paper, we introduce a novel hybrid algorithm that can utilize Ising Processing Units (IPUs) alongside classical solvers. This approach decomposes the problem into manageable sub-problems, which are then offloaded to the slow but high-quality classical solvers or IPU. Results are subsequently recombined to form a global solution. Our method demonstrates comparable quality to existing whole problem solvers while reducing computational time for large-scale instances. Furthermore, our approach is amenable to parallelization, allowing for simultaneous processing of decomposed sub-problems.
TenSQL: An SQL Database Built on GraphBLAS [Outstanding Paper Award]
Jonathan P Roose (SNL), Miheer Vaidya, Ponnuswamy Sadayappan (Univ. of Utah), Siva Rajamanickam (SNL)
Relational Database Management Systems (RDBMS) have been the most prominent form of database in the world for several decades. While relational databases are often applied within high-frequency/low-volume transactional applications such as website backends, the poor performance of relational databases on low-frequency/high-volume queries often precludes their application to big data analysis fields like graph analytics. This work explores the construction of an RDBMS solution that uses the GraphBLAS API to execute Structured Query Language (SQL) in an effort to improve performance on high-volume queries. Tables are redefined to be collections of sparse scalars, vectors, matrices, and more generally sparse tensors. The explicit values (nonzeros) in these sparse tensors define the rows and NULL values within the tables. A prototype database called TenSQL was constructed and evaluated against several SQL implementations including PostgreSQL. Preliminary results comparing the performance on queries common in graph analysis applications offer performance improvements as high as 1,400x over PostgreSQL for moderately sized datasets when returning results in a columnar format.
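A rough sketch of the storage idea (plain Python dictionaries standing in for sparse tensors; hypothetical columns, not TenSQL's implementation): only non-NULL values are stored explicitly, so a NULL-filtering equi-join reduces to an index intersection.
    # Each column is a sparse map keyed by row id; absent keys are NULL.
    age    = {1: 34, 2: 51, 4: 29}
    salary = {1: 70000, 3: 48000, 4: 52000}

    # SELECT id, age, salary FROM t WHERE age IS NOT NULL AND salary IS NOT NULL
    joined = {i: (age[i], salary[i]) for i in age.keys() & salary.keys()}
    print(joined)    # row ids 1 and 4 survive the join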
A Framework for Analyzing the Robustness of Graph Models
Khaled Abdelaal, Richard Veras (Univ. of Oklahoma)
Graphs, and sparse matrices, provide a powerful representation for expressing the complex structural relationships between elements in a set, which is why they are used extensively in graph machine learning, network analytics, and scientific computing. One of the challenges in this field is obtaining large-scale graph data for performance evaluation. Here, parameterized graph models and their corresponding generators fill the gap. While there is much work on how well these models represent real data, there are open questions as to how sensitive, or robust, these parameters are to noise. In this paper we present a framework for evaluating parameterized graph models in order to study how perturbations to these parameters affect the global structure of the resulting graph. We discuss how this framework is extensible to any graph model and choice of graph features. Further, we provide a case study for Kronecker graphs and analyze the effects on global features of varying the parameters of the Kronecker graph's initiator matrix and of injecting noise into the graph. We find that certain features have varying degrees of robustness depending on the parameter being modified.
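As a concrete reference point for the case study (a sketch with a hypothetical 2x2 initiator, not the paper's framework): a stochastic Kronecker graph is sampled from repeated Kronecker powers of the initiator matrix, so perturbing the initiator entries perturbs the whole graph.
    # Sample a stochastic Kronecker graph from a small initiator matrix.
    import numpy as np

    initiator = np.array([[0.9, 0.5],
                          [0.5, 0.3]])            # hypothetical edge-probability initiator

    P = initiator
    for _ in range(3):                            # three more Kronecker powers -> 16x16
        P = np.kron(P, initiator)

    rng = np.random.default_rng(0)
    A = (rng.random(P.shape) < P).astype(int)     # realize edges by coin flips
    print(A.shape, A.sum())                       # (16, 16) and the sampled edge count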
A Look into a GraphBLAS entry point into an LLVM Lowering Pass, with A Precision Formatting Example
Roy P Gulla (Oasis Gaming)
The Posit standard has shown itself to be well suited to high-performance and, in particular, AI-oriented processors, as well as being a viable storage-format alternative to IEEE 754 float types. Its format contains an optional fraction bit field, which is often simply bypassed when the number of storage bits does not allow for it. The purpose of the precision computing circuit presented here is to offer a preprocessing approach that addresses the storage-space issues found when implementing the new format, and to show its compatibility with new offloaded memory array structures. The backbone of the compilation approach to instruction-set optimization is a single-stage forwarding constraint, much like the carry flag in traditional adder circuits, implemented via a new computing pathway presented for the GraphBLAS toolchain.
Mapping of Internet “Coastlines” via Large Scale Anonymized Network Source Correlations
Hayden R Jananthan, Jeremy Kepner, Michael Jones, William Arcand, David Bestor, William Bergeron, Chansup Byun (MIT Lincoln Laboratory), Timothy Davis (Texas A&M), Vijay Gadepally (MIT Lincoln Laboratory), Daniel Grant (GreyNoise), Michael Houle, Matthew Hubbell, Anna Klein (MIT Lincoln Laboratory), Lauren Milechin (MIT), Guillermo Morales (MIT Lincoln Laboratory), Andrew Morris (GreyNoise), Julie Mullen, Ritesh Patel (MIT Lincoln Laboratory), Alex Pentland (MIT), Sandeep Pisharody, Andrew Prout, Albert Reuther, Antonio Rosa, Siddharth Samsi, Tyler Trigg, Gabriel Wachman, Charles Yee, Peter Michaleas (MIT Lincoln Laboratory)
Expanding the scientific tools available to protect computer networks can be aided by a deeper understanding of the underlying statistical distributions of network traffic and their potential geometric interpretations. Analysis of large-scale network observations provides a unique window into these phenomena. Newly developed GraphBLAS hypersparse matrices and D4M associative array technologies enable the efficient anonymized analysis of these data on the scale of trillions of events. This work analyzes over 100,000,000,000 anonymized packets from the largest Internet telescope (CAIDA) and over 10,000,000 anonymized sources from the largest commercial honeyfarm (GreyNoise). Neither of these locations actively emits Internet traffic, and each provides distinct observations of unsolicited Internet traffic (primarily botnets and scanners). Analysis of these observations confirms the previously observed Cauchy-like distributions describing temporal correlations between Internet sources. The Gull lighthouse problem is a well-known geometric characterization of the standard Cauchy distribution and motivates a potential geometric interpretation for Internet observations. This work generalizes the Gull lighthouse problem to accommodate larger classes of coastlines, deriving a closed-form solution for the resulting probability distributions, stating and examining the inverse problem of identifying an appropriate coastline given a continuous probability distribution, identifying a geometric heuristic for solving this problem computationally, and applying that heuristic to examine the temporal geometry of different subsets of network observations. Application of this method to the CAIDA and GreyNoise data reveals a difference of several orders of magnitude between known-benign and other traffic, which can lead to potentially novel ways to protect networks.
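The classical straight-coastline case that the paper generalizes can be stated compactly (a sketch of the standard result, not the paper's generalized coastline model): a lighthouse a distance b off a straight coastline, flashing at uniformly random angles, illuminates coastline positions x0 + b*tan(theta), which follow a Cauchy distribution with location x0 and scale b.
    # Sampling the standard (straight-coastline) Gull lighthouse problem.
    import numpy as np

    rng = np.random.default_rng(0)
    b, x0 = 2.0, 0.0                               # offshore distance and lighthouse position
    theta = rng.uniform(-np.pi / 2, np.pi / 2, size=200_000)
    hits = x0 + b * np.tan(theta)                  # where flashes strike the coast

    # Cauchy(x0, b) quartiles are x0 - b, x0, x0 + b
    print(np.percentile(hits, [25, 50, 75]))       # approximately [-2, 0, 2]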

2-4: AI / Machine Learning 1 Session (15:45-17:00)

Co-Chairs: F. Indiviglio & D. Ricke
Advanced Ultra Low-Power Deep Learning Applications with Neuromorphic Computing
Mark Barnell, Courtney Raymond, Lisa Loomis (AFRL), Darrek Isereau, Daniel Brown, Francesca Vidal, Steven Smiley (SRC)
The latest Intel neuromorphic processor, Loihi 2, provides a breakthrough in Artificial Intelligence (AI) for computing at the edge, where sensor information is collected. The computing architecture does this by leveraging computations at the transistor level in a fashion analogous to the human brain's biological neural networks (vs. a von Neumann compute architecture). The Loihi 2's high performance, small form factor, and low power consumption make it a unique capability that is well suited for use in devices. Our technical approach and findings support extreme computing needs for the Internet of Things (IoT) and various airborne platform applications. The recently released Loihi 2 and the novel research completed on this effort were combined to accelerate the development and demonstration of a new concept of operation for machine learning at the edge. This research included the development of spiking neural networks (SNN) on sensor data representative of information sources from a small research platform. Our concept uses the representative sensor data to predict the platform mode through machine learning. Importantly, our technical approach allowed us to rapidly scale from IBM's TrueNorth Corelet framework to the Lava framework, which Intel's Loihi 2 neuromorphic processor utilizes. The use of the Lava framework shows the art of the possible in edge computing by demonstrating capabilities on small airborne platform sensor data with wide extensibility to other domains that can use this neuromorphic compute hardware. In summary, this research included the use of new compute frameworks, novel processing algorithms, and a unique concept of operation. This technical approach resulted in the classification of the platform mode from the sensor information with accuracies up to 97.6%.
From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference
Siddharth Samsi (MIT Lincoln Laboratory), Dan Zhao (NYU), Joseph McDonald (MIT Lincoln Laboratory), Baolin Li (Northeastern Univ.), Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner (MIT Lincoln Laboratory), Devesh Tiwari (Northeastern Univ.), Vijay Gadepally (MIT Lincoln Laboratory)
Generative AI, and in particular large language models (LLMs), have exploded in popularity due to significant new capabilities in text generation that go far beyond the prior state of the art. These technologies are increasingly being leveraged in a wide range of domains including education, engineering, government, law, finance, medicine, and many more. However, training and deploying these models pose significant computational challenges. In particular, the compute and energy costs required for inference receive less attention than the energy costs of training LLMs, despite how often these large models are called on to conduct inference in practice (e.g., ChatGPT). As these state-of-the-art LLMs see increasing usage and deployment in various domains, a better understanding of their resource utilization is crucial for cost savings, scaling performance, efficient hardware usage, and optimal inference strategies. In this paper, we describe experiments on compute and energy usage for a large language model running inference. We benchmark the LLaMA model on two generations of GPUs and use two different datasets for inference.
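One common way to approximate such measurements (a sketch using NVML power sampling over a fixed window; the sampling interval, window, and workload placeholder are assumptions, not the paper's benchmarking harness):
    # Approximate GPU energy over a measurement window as mean power x elapsed time.
    import time
    import pynvml                                   # NVIDIA Management Library bindings

    pynvml.nvmlInit()
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

    samples = []
    t0 = time.time()
    while time.time() - t0 < 10.0:                  # stand-in for the inference run
        samples.append(pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0)  # mW -> W
        time.sleep(0.1)                             # ~10 Hz power sampling

    elapsed = time.time() - t0
    mean_w = sum(samples) / len(samples)
    print(f"mean power {mean_w:.1f} W, energy ~{mean_w * elapsed:.1f} J")
    pynvml.nvmlShutdown()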
Towards the FAIR Asset Tracking across Models, Datasets, and Performance Evaluation Scenarios
Piotr Luszczek, Tokey Tahmid (Univ. of Tennessee)
In order to ensure reproducibility and give a full account of the results obtained from scientific simulations assisted by ML/AI models, a new set of methodological innovations has to take place that accounts for the provenance, deployment, usage, updates, and archiving of a variety of digital assets. We present the design and implementation of a methodology that not only addresses these aspects of modern computational science but is also a major step toward practically achieving this goal for a variety of established models and their datasets that are of particular importance to the progress of scientific simulations utilizing ML/AI models. We also show experimental results of applying our approach to a specific evaluation scenario and show how it maintains performance efficiency, delivers accurate training results, and captures a sufficiently rich context of runtime behavior to inform both the domain science and machine learning communities.
Continuous Deep Equilibrium Models: Training Neural ODEs Faster by Integrating Them to Infinity [Best Student Paper Award]
Avik Pal, Alan Edelman, Christopher Rackauckas (MIT)
Implicit models separate the definition of a layer from the description of its solution process. While implicit layers allow features such as depth to adapt automatically to new scenarios and inputs, this adaptivity makes its computational expense challenging to predict. In this manuscript, we increase the “implicitness” of the DEQ by redefining the method in terms of an infinite time neural ODE, which paradoxically decreases the training cost over a standard neural ODE by 2-4x. Additionally, we address the question: is there a way to simultaneously achieve the robustness of implicit layers while allowing the reduced computational expense of an explicit layer? To solve this, we develop Skip and Skip Reg. DEQ, an implicit-explicit (IMEX) layer that simultaneously trains an explicit prediction followed by an implicit correction. We show that training this explicit predictor is free and even decreases the training time by 1.11-3.19x. Together, this manuscript shows how bridging the dichotomy of implicit and explicit deep learning can combine the advantages of both techniques.
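To make the notion of an implicit layer concrete (a toy NumPy fixed-point layer under assumed weights; this is not the paper's continuous or infinite-time DEQ formulation): the layer output is defined as an equilibrium of f rather than as the result of a fixed number of explicit layers.
    # Toy DEQ-style implicit layer: the output z* satisfies z* = f(z*, x).
    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    W = 0.4 * rng.standard_normal((d, d)) / np.sqrt(d)   # scaled so the iteration contracts
    U = rng.standard_normal((d, d))
    x = rng.standard_normal(d)

    def f(z, x):
        return np.tanh(W @ z + U @ x)

    z = np.zeros(d)
    for _ in range(100):                                  # naive fixed-point iteration
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < 1e-10:
            break
        z = z_next

    print(np.linalg.norm(f(z, x) - z))                    # ~0: z is an equilibrium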
Energy Estimates Across Layers of Computing: From Devices to Large-Scale Applications in Natural Language Processing, Scientific Computing, and Cryptocurrency Mining
Sadasivan Shankar (Stanford Univ.)
We estimate energy usage in all layers of computing, from devices to algorithms and software. Building on our previous analysis [3], we map the energy estimates from single devices to large-scale computing applications including AI/Machine Learning for Natural Language Processing, Scientific Simulations, and Cryptocurrency Mining. In contrast to the switching level, where transistors have become energetically efficient, higher energy is expended at the instruction level. Additionally, the analysis of AI/ML accelerators indicates that architectures can achieve energy efficiency on an older semiconductor process comparable to that of a newer technology. As we go from the bit level to the system and application levels, our analysis indicates large energy requirements for instructions and algorithms, corresponding to over 10^24 in magnitude in energy requirements. Our work underscores the need for energy efficiency in computing and for including energy as a design parameter to enable the growing needs of digitalization.

2-S1: New Application Frontiers Special (17:30-19:30)

Co-Chairs: TBD & TBD
MAVR: Multi-functional Point Cloud Annotations Using Virtual Reality
Xiao Zhang, Zhanhong Huang, Xinming Huang (WPI)
Learning-based point cloud perception methods rely on labeled data for training data-driven models, necessitating the development of precise and efficient tools for point cloud annotations. In this paper, we propose MAVR, a multi-functional annotation framework based on virtual reality (VR) technology, capable of accurately labeling point cloud data for diverse applications, including part segmentation and object detection. We begin by evaluating the user interface (UI) efficiency through interactive efficiency analysis. Subsequently, a comprehensive three-step process is introduced, which consists of pre-processing, point selection, and post-tagging. For 3D object part segmentation and scene perception, we propose two distinct tagging pipelines. Our experimental results on various datasets validate the effectiveness of MAVR in accurately annotating point clouds from different data sources within an immersive workspace.
Addressing Endpoint-induced Congestion for Accelerator Scale-Out in a Medium-Scale Domain
Timothy Chong, Venkata Krishnan (Intel)
The rapid advancement of network link bandwidth in modern data centers, coupled with the relatively slower processing capabilities of host systems that include accelerators, has given rise to a new challenge: endpoint congestion. This paper presents a reactive scheme built upon a standard reliability protocol to mitigate the impact of endpoint congestion in a medium-scale domain of endpoints—one that is reachable within a few switch hops. This is arguably the sweet-spot for an accelerator scale-out domain. The proposed policy, which overloads duplicate-ACK as a reactive congestion signal, enables controlled pacing of packets from the initiator to match with target processing bandwidth, thereby avoiding packet loss due to endpoint congestion. Our results demonstrate that, for unicast and incast streaming PUT (RDMA write) flows, the proposed scheme effectively mitigates packet drops and achieves minimal, or in most cases zero, packet retransmissions when there is a drop in the endpoint processing speed. Traditional approaches fail to achieve this behavior even with high target queue capacity. Thus, our scheme also has the potential benefit of reducing the buffering requirements at the endpoint and consequently, the cost. To the best of our knowledge, our scheme is the first to explicitly consider and mitigate packet loss due to endpoint congestion, offering an effective approach to address this emerging challenge.
Automatic Differentiation for Inverse Problems with Applications in Quantum Transport
Ivan Williams, Eric Polizzi (UMass Amherst)
A neural solver and differentiable simulation of the quantum transmitting boundary model is presented for the inverse quantum transport problem. The neural solver is used to engineer continuous transmission properties and the differentiable simulation is used to engineer current-voltage characteristics.
Parallel Quasi-concave Set Function Optimization for Scalability Even without Submodularity
Praneeth Vepakomma (MIT), Yulia Kempner (Holon Inst. of Tech.), Rodmy Paredes Alfaro, Ramesh Raskar (MIT)
Classes of set functions along with a choice of ground set are a bedrock to determine corresponding variants of greedy algorithms. These algorithms in turn obtain approximate and efficient solutions for combinatorial optimization of these set functions. The class of constrained submodular optimization has seen huge advances at the intersection of good computational efficiency, versatility and approximation guarantees while unconstrained submodular optimization is NP-hard. What is an alternative to situations when submodularity does not hold? Can efficient and globally exact solutions be obtained? We introduce one such new frontier: The class of quasi-concave set functions induced as a dual class to monotone linkage functions. We provide a parallel algorithm with a time complexity over $n$ processors of $\mathcal{O}(n^2g) +\mathcal{O}(\log{\log{n}})$ where $n$ is the cardinality of the ground set and $g$ is the complexity to compute the monotone linkage function that induces a corresponding quasi-concave set function via a duality. The complexity reduces to $\mathcal{O}(gn\log(n))$ on $n^2$ processors and to $\mathcal{O}(gn)$ on $n^3$ processors. Our approach reduces the currently existing cubic computational complexity to those mentioned above. Our algorithm provides a globally optimal solution to a maxi-min problem as opposed to submodular optimization which is approximate. We show a potential for widespread applications via an example of diverse feature subset selection with exact global maxi-min guarantees upon showing that a statistical dependency measure called distance correlation can be used to induce a quasi-concave set function.

2-S2: GraphBLAS BoF Special (17:30-19:30)

Co-Chairs: T. Mattson & S. McMillan

IEEE HPEC 2023