2021
IEEE High Performance Extreme Computing
Virtual Conference
20 - 24 September 2021
Wednesday, September 22
3-1: AI / Machine Learning 1 Session (11:00-12:15)
Session Co-Chairs: Siddharth Samsi & Sanmukh Rao Kuppannagari
Efficient Neighbor-Sampling-based GNN Training on CPU-FPGA Heterogeneous Platform
Bingyi Zhang (University of Southern California)*; Sanmukh Rao Kuppannagari (University of Southern California); Rajgopal
Kannan (Army Research Lab-West); Viktor K Prasanna (University of Southern California)
Graph neural networks (GNNs) have become increasingly important in many real-world applications. However, training GNNs
on large-scale real-world graphs remains challenging. Many sampling-based GNN training algorithms have been proposed to
facilitate mini-batch training. The well-known Neighbor-Sampling-based (NS) GNN training algorithms, such as GraphSAGE,
have shown great advantages in accuracy, generalization, and scalability on large-scale graphs. Nevertheless, efficient
hardware acceleration for such algorithms has not been systematically studied. In this paper, we perform an experimental
study to understand the computational characteristics of NS GNN training. The evaluation results show that neighbor
sampling and feature aggregation take the majority of the execution time due to irregular memory accesses and extensive
memory traffic. We then propose a system design for NS GNN training that exploits a CPU-FPGA heterogeneous platform.
We develop an optimized parallel neighbor sampling implementation and an efficient FPGA accelerator to enable
high-throughput GNN training, and we propose neighbor sharing and task pipelining techniques to improve training
throughput. We implement a prototype system on an FPGA-equipped server. The evaluation results demonstrate that our
CPU-FPGA design achieves a 12-21x speedup over a CPU-only platform and a 0.4-3.2x speedup over a CPU-GPU platform.
Moreover, our FPGA accelerator achieves 2.3x higher energy efficiency than the GPU board.
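The two dominant steps the abstract profiles can be sketched in a few lines. This is a minimal illustration of GraphSAGE-style neighbor sampling (NS) and mean feature aggregation, assuming a dict-of-lists adjacency and dense feature lists; it is not the authors' CPU-FPGA implementation.

```python
import random

def sample_neighbors(adj, batch, fanout, seed=0):
    """Uniformly sample up to `fanout` neighbors per mini-batch node (NS step)."""
    rng = random.Random(seed)
    sampled = {}
    for v in batch:
        nbrs = adj.get(v, [])
        sampled[v] = list(nbrs) if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
    return sampled

def aggregate_mean(features, sampled):
    """Mean-aggregate the sampled neighbor features (feature aggregation step)."""
    out = {}
    for v, nbrs in sampled.items():
        if not nbrs:
            out[v] = features[v]
            continue
        dim = len(features[v])
        acc = [0.0] * dim
        for u in nbrs:
            for i in range(dim):
                acc[i] += features[u][i]
        out[v] = [a / len(nbrs) for a in acc]
    return out
```

Both steps chase pointers through `adj` and `features`, which is the source of the irregular memory accesses the paper measures.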
Serving Machine Learning Inference Using Heterogeneous Hardware
Baolin Li (Northeastern University)*; Vijay Gadepally (MIT Lincoln Laboratory); Siddharth Samsi (MIT Lincoln Laboratory);
Mark Veillette (MIT Lincoln Laboratory); Devesh Tiwari (Northeastern University)
The growing popularity of machine learning algorithms and the wide availability of hardware accelerators have brought
new challenges to inference serving. This paper explores the opportunity to serve inference queries with a heterogeneous
system. The system has a central optimizer that allocates heterogeneous hardware resources to cooperatively serve queries.
The optimizer supports both energy minimization and throughput maximization while satisfying a latency target. The optimized
heterogeneous serving system is evaluated against a homogeneous system, on two representative real-world applications of
radar nowcasting and object detection. Our evaluation results show that the power-optimized heterogeneous system can
achieve up to 36% power savings, and the throughput-optimized heterogeneous system can increase query throughput by
up to 53%.
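The central-optimizer idea can be illustrated with a toy allocator: pick the subset of heterogeneous devices that meets a latency target and an aggregate-throughput requirement at minimum total power. The device profiles and the brute-force search below are hypothetical illustrations, not the paper's optimizer or measured numbers.

```python
from itertools import combinations

def allocate(devices, target_latency_ms, required_qps):
    """Minimum-power subset of devices meeting latency and throughput goals.
    Brute force over subsets is fine for a handful of device types."""
    best = None
    names = list(devices)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            # every chosen device must individually meet the latency target
            if any(devices[d]["latency_ms"] > target_latency_ms for d in subset):
                continue
            qps = sum(devices[d]["throughput_qps"] for d in subset)
            watts = sum(devices[d]["power_w"] for d in subset)
            if qps >= required_qps and (best is None or watts < best[1]):
                best = (subset, watts, qps)
    return best
```

Swapping the objective (maximize `qps` subject to a power budget) gives the throughput-optimized variant.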
Improved Compression for Word Embeddings by Scaling Principal Components
Joseph P McDonald (MIT Lincoln Laboratory)*; Siddharth Samsi (MIT Lincoln Laboratory); Daniel Edelman (Massachusetts
Institute of Technology); Jeremy Kepner (MIT Lincoln Laboratory); Chansup Byun (MIT Lincoln Laboratory); Vijay Gadepally
(MIT Lincoln Laboratory)
Word embeddings have been adopted as a fundamental component of many natural language processing applications for
their ability to capture meaningful semantic relationships. However, they often present a significant computational bottleneck
due to memory requirements. In this article we present a postprocessing technique for embeddings, based on modifying their
principal components, that enables compression while maintaining comparable if not better performance relative to the
original embedding. Specifically, our technique can reduce the overall memory footprint of popular embeddings such as GloVe
and word2vec by 50% while maintaining the same performance on different metrics, including commonly used similarity and
analogy tasks as well as on end-to-end tasks such as text classification. Compared to the original embeddings and previous
postprocessing methods, this approach improves accuracy on these tasks and leads to better semantic vector
representations, particularly when using compressed versions of these vectors for memory and performance savings. While
directly compressing the original vectors is possible, this approach outperforms direct compression as well as other
postprocessing methods across a range of compressed sizes.
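A sketch of the general postprocess-then-compress pattern: remove the mean, compute principal components via SVD, damp the dominant components (which tend to carry little lexical information), and keep only a fraction of the dimensions. The damping factor and the "top two components" choice are illustrative assumptions, not the paper's exact scaling scheme.

```python
import numpy as np

def compress_embeddings(E, keep_frac=0.5, damp=0.5):
    """Compress an embedding matrix E (words x dims) to keep_frac of its width."""
    X = E - E.mean(axis=0)                  # remove the common mean vector
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    k = int(X.shape[1] * keep_frac)         # e.g. 50% of the dimensions
    S = S.copy()
    S[:2] *= damp                           # scale down the top components
    return U[:, :k] * S[:k]                 # PCA scores = compressed vectors
```

With `keep_frac=0.5` the output halves the memory footprint of the vectors, matching the 50% figure the abstract reports for GloVe and word2vec.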
Instance Segmentation of Neuronal Nuclei Leveraging Domain Adaptation
Kevin Brady (MIT Lincoln Laboratory)*; Pooya Khorrami (MIT Lincoln Laboratory); Lars Gjesteby (MIT Lincoln Laboratory);
Laura Brattain (MIT Lincoln Laboratory)
The detection and localization of individual cell nuclei in dense neural scenes collected by microscopy traditionally depends
on human-expert-intensive manual markup for training and evaluating automatic algorithms. These approaches are
expensive, time-intensive, and require domain expertise. To develop automatic approaches, the annotated content needs to
match the collection conditions (e.g., stain, cell type), and small changes to these conditions often require additional
matching annotated content. Our approach leverages supervised domain adaptation with an application to the instance segmentation of
nuclei in the brain. The efficacy of this approach is demonstrated experimentally by characterizing the performance of
adapting models learned on content not well matched to the target domain. Quantitative results demonstrate performance
improvements relative to previous related work.
Even Faster SNN Simulation with Lazy+Event-driven Plasticity and Shared Atomics
Dennis Bautembach (FORTH)*; Iason Oikonomidis (FORTH); Antonis A Argyros (CSD-UOC and ICS-FORTH)
We present two novel optimizations that accelerate clock-based spiking neural network (SNN) simulators. The first one targets
spike timing dependent plasticity (STDP). It combines lazy- with event-driven plasticity and efficiently facilitates the
computation of pre- and post-synaptic spikes using bitfields and integer intrinsics. It offers higher bandwidth than event-driven
plasticity alone and achieves a 1.5x-2x speedup over our closest competitor. The second optimization targets spike delivery.
We partition our graph representation in a way that bounds the number of neurons that need to be updated at any given time,
which allows us to perform said update in shared memory instead of global memory. This is 2x-2.5x faster than our closest
competitor. Both optimizations represent the final evolutionary stages of years of iteration on STDP and spike delivery inside
"Spice" (/spaIk/), our state-of-the-art SNN simulator. The proposed optimizations are not exclusive to our graph representation
or pipeline but are applicable to a multitude of simulator designs. We evaluate our performance on three well-established
models and compare ourselves against three other state-of-the-art simulators.
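The bitfield-plus-intrinsics idea for STDP bookkeeping can be shown in miniature: pack recent spike times into an integer and answer "how many pre-/post-synaptic spikes fell in this window?" with a mask and popcount. This is only the flavor of the trick; the simulator itself uses packed words and hardware popcount intrinsics on the GPU.

```python
def record_spike(history, t_bit):
    """Mark a spike at time-slot t_bit in an integer bitfield."""
    return history | (1 << t_bit)

def spikes_in_window(history, lo, hi):
    """Count spikes in slots [lo, hi) with a mask and a popcount."""
    mask = ((1 << (hi - lo)) - 1) << lo
    return bin(history & mask).count("1")
```

Because the whole window is examined in one masked popcount rather than one branch per timestep, plasticity updates can be deferred (lazy) and then settled in bulk when a spike event arrives.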
3-2: AI / Machine Learning 2 Session (12:30-13:45)
Session Co-Chairs: Siddharth Samsi & Julie Mullen
Model Quantization and Synthetic Aperture Data Analyses Increasing Throughput and Energy Efficiency
Mark Barnell (Air Force Research Laboratory)*; Darrek Isereau (-); Courtney Raymond (-); Anthony Salmin (SRC); Daniel
Brown (SRC, Inc.)
New model quantization techniques have been researched to inform future information/data processing approaches. This
hardware evaluation and the associated data exploitation methods developed support systems that require high performance
computing (HPC) and machine learning (ML) models where significant throughput is desired. Specifically, these applications
include, but are not limited to, low cost compute that supports the detection and classification of objects in a scene, where it
is not feasible to spend valuable resources on HPC capabilities. Additional applications include data exploitation upstream,
near sensors, where size, weight and power (SWAP) are constrained. This research included the analyses of representative
data, synthetic aperture radar (SAR) imagery called the Moving and Stationary Target Acquisition and Recognition (MSTAR)
data. The NVIDIA Tesla, Xavier, and Titan compute architectures were used and analyzed as part of this research. These
graphics processing units (GPUs) represent architectures that span a range of operating powers (~10 W to several hundred
watts). Additionally, the energy utilization per frame was determined and analyzed; e.g., the energy use of the Tesla went
from 104.4 to 50.68 micro-Joules/frame when precision was reduced to 8-bit integers. The Tesla architecture also improved
processing throughput from 448 frames per second (FPS) to 1085 FPS when quantized to 8-bit integers. An important part of
this new research showed the compute systems retained SAR detection/classification performance of over 97% mean
average precision (mAP) on the MSTAR imagery data after quantization – thereby retaining its capability to detect and
classify objects.
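The quantization step behind those throughput and energy numbers follows a standard recipe: map float weights to 8-bit integers with a shared scale. The sketch below is a generic symmetric post-training quantizer, not the vendor toolchain used in the study.

```python
def quantize_int8(weights):
    """Symmetric linear quantization: w ~= q * scale with q in [-127, 127]."""
    m = max(abs(w) for w in weights) or 1.0
    scale = m / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]
```

Each weight shrinks from 32 bits to 8, and the quantization error is bounded by half the scale step, which is why detection/classification accuracy can survive the conversion.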
Non-Volatile Memory Accelerated Posterior Estimation
Andrew E Wood (Boston University)*; Moshik Hershcovitch (IBM Research); Daniel G Waddington (IBM Research); Sarel
Cohen (Hasso Plattner Institute); Sang Chin (Boston University)
Bayesian inference allows machine learning models to express uncertainty. Current machine learning models use only a
single learnable parameter combination when making predictions, and as a result are highly overconfident when their
predictions are wrong. To use more learnable parameter combinations efficiently, these samples must be drawn from the
posterior distribution. Unfortunately, computing the posterior directly is infeasible, so often researchers approximate it with a
well known distribution such as a Gaussian. In this extended abstract, we show that through the use of high capacity
persistent storage, models whose posterior distribution was too big to approximate are now feasible, leading to improved
predictions in downstream tasks.
A Survey: Handling Irregularities in Neural Network Acceleration with FPGAs
Tong Geng (Pacific Northwest National Laboratory)*; Chunshu Wu (Boston University); Cheng Tan (Pacific Northwest
National Laboratory); Chenhao Xie (Pacific Northwest National Laboratory); Anqi Guo (Boston University); Pouya Haghi
(Boston University); Sarah Yuan He (Boston College); Jiajia Li (Pacific Northwest National Laboratory); Martin Herbordt
(Boston University); Ang Li (Pacific Northwest National Laboratory)
In the last decade, Artificial Intelligence (AI) through Deep Neural Networks (DNNs) has penetrated virtually every aspect of
science, technology, and business. Many types of DNNs have been and continue to be developed, including Convolutional
Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs). The overall problem for
all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and
throughput, while also having strict accuracy requirements. There have been many previous efforts in creating hardware to
accelerate NNs. The problem designers face is that optimal NN models typically have significant irregularities, making them
hardware-unfriendly. In this paper, we first define the problems in NN acceleration by characterizing common irregularities in
NN processing into four types; then we summarize the existing works that handle these four types of irregularities efficiently using
hardware, especially FPGAs; finally, we provide a new vision of next-generation FPGA-based NN acceleration: that the
emerging heterogeneity in the next-generation FPGAs is the key to achieving higher performance.
Classification frameworks comparison on 3D point clouds
Francis P Medina (Yeshiva University)*; Randy Paffenroth (Worcester Polytechnic Institute)
We present a preliminary comparison study for various classification frameworks that includes several types of feature
engineering. The presentation in this paper is based on our work in [10]. We demonstrate that providing context by
augmenting each point in the data with information about its neighboring points can improve the performance of downstream learning
algorithms. We also experiment with several dimension reduction strategies, ranging from Principal Component Analysis
(PCA) to neural network based auto-encoders, and demonstrate how they affect classification performance in LiDAR point
clouds. For example, we observe that by combining feature engineering with a dimension reduction method such as PCA,
there is an improvement in the accuracy of the classification with respect to doing a straightforward classification with the raw
data. We test such classification strategies on 3D LiDAR point clouds.
Index Terms—Classification, 3D point clouds, dimension reduction, PCA, auto-encoders, neural networks, machine learning.
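The two ideas the abstract combines, neighbor-context feature engineering followed by PCA dimension reduction, can be sketched directly. The k-nearest-neighbor statistics used here (mean and standard deviation of neighbor coordinates) are an illustrative choice of context features, not necessarily the exact features of the paper.

```python
import numpy as np

def augment_with_neighbors(points, k=3):
    """Append the mean and std of each point's k nearest neighbors as context."""
    P = np.asarray(points, dtype=float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)  # pairwise distances
    feats = []
    for i in range(len(P)):
        idx = np.argsort(d[i])[1:k + 1]          # skip the point itself
        nb = P[idx]
        feats.append(np.concatenate([P[i], nb.mean(0), nb.std(0)]))
    return np.stack(feats)

def pca_reduce(X, k):
    """Project centered features onto the top-k principal components."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

A downstream classifier is then trained on `pca_reduce(augment_with_neighbors(points), k)` instead of the raw coordinates.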
3-3: AI / Machine Learning 3 Session (14:15-15:30)
Session Co-Chairs: Julie Mullen & Sanmukh Rao Kuppannagari
Toward HDL Extensions for Rapid AI/ML Accelerator Generation
Ryan Kabrick (Tactical Computing Labs)*; David Donofrio (Tactical Computing Labs); John Leidel (Tactical Computing
Labs)
StoneCutter, a language construct and compiler embedded in the OpenSoC System Architect family of tools, is designed to
provide software architects the ability to rapidly prototype instruction set extensions and hardware accelerators. The
StoneCutter compilation flow ingests high level syntax and outputs optimized and pipelined Chisel HDL for further
compilation to platform-specific RTL. However, unlike other HDL approaches, StoneCutter is rooted in the notion that users
define syntactic blocks that map directly to individual instruction definitions as opposed to classic finite state machines.
When integrated with the adjacent System Architect design flow, StoneCutter provides a familiar, C-like language construct
by which to develop the implementation for individual, programmable instructions. The LLVM-based StoneCutter compiler
performs individual instruction and whole-ISA optimizations in order to generate a high performance, Chisel HDL
representation of the target design. Utilizing the existing Chisel tools, users can also generate C++ cycle accurate
simulation models as well as Verilog representations of the target design. As a result, the StoneCutter language and
associated tooling provide an instruction set-centric design environment for rapid development and
experimentation. This work describes initial efforts to extend the StoneCutter infrastructure in order to encapsulate linear
algebraic constructs for direct compilation into optimized AI/ML instructions. This functionality provides users and architects
the ability to utilize the StoneCutter high-level language constructs to develop target and domain specific AI/ML instructions
using optimized linear algebraic constructs compiled directly to target-specific RTL. This enables users to create highly
optimized AI/ML hardware implementations with minimal effort in traditional hardware development flows.
Filtered Tensor Construction and Decomposition for Drug Repositioning
Dimitri Leggas (Reservoir Labs, Inc.)*; Muthu M Baskaran (Reservoir Labs); James Ezick (Reservoir Labs); Brendan von
Hofe (Reservoir Labs, Inc.)
Drug repositioning (also called "drug repurposing") is a drug development strategy that saves time and money by finding
new uses for existing drugs. While a variety of computational approaches to drug repositioning exist, recent work has
shown that tensor decomposition, an unsupervised learning technique for finding latent structure in multidimensional data,
is a useful tool for drug repositioning. The known relationships between drugs, targets, and diseases can easily be encoded
as a tensor, and by learning a low-rank representation of this tensor, decompositions can complete missing entries and
therefore predict novel drug-disease associations. Multiple recent works, in the context of cancer and COVID-19 drug
discovery, have used joint tensor decompositions to suggest drug repositioning candidates. While these methods make
high-quality predictions, they rely on specialized decompositions formulated for specific problems. In this work, we use
ENSIGN, a suite of tensor decomposition tools, to show that CP tensor decompositions of a single tensor encoding drug-
target-disease associations are capable of predicting verifiable drug repositioning candidates. Because the tensors
generated by drug repositioning problems are sparse, we introduce a filtered tensor construction to limit the span of the
tensor without losing information needed to learn the relevant associations. We show that our method predicts verifiable
novel drug-disease associations in cancer and COVID-19 data. The simplicity of our approach makes it an attractive tool for
biomedical researchers looking for out-of-the-box solutions, and ENSIGN brings an added level of usability and scalability.
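The core operation, a CP decomposition of a drug-target-disease tensor whose reconstruction fills in missing associations, can be sketched with textbook dense alternating least squares. ENSIGN's sparse, scalable solvers are far more elaborate; this is only the underlying algorithm in miniature.

```python
import numpy as np

def cp_als(T, rank, iters=500, seed=0):
    """Rank-`rank` CP decomposition of a dense 3-way tensor via ALS."""
    rng = np.random.default_rng(seed)
    A = [rng.standard_normal((n, rank)) for n in T.shape]
    for _ in range(iters):
        for m in range(3):
            o1, o2 = [i for i in range(3) if i != m]
            # Khatri-Rao product of the other two factor matrices
            KR = np.einsum('ir,jr->ijr', A[o1], A[o2]).reshape(-1, rank)
            Tm = np.moveaxis(T, m, 0).reshape(T.shape[m], -1)  # mode-m unfolding
            G = (A[o1].T @ A[o1]) * (A[o2].T @ A[o2])
            A[m] = Tm @ KR @ np.linalg.pinv(G)
    return A

def reconstruct(A):
    """Dense tensor implied by the factors; its entries score unseen associations."""
    return np.einsum('ir,jr,kr->ijk', *A)
```

In the drug-repositioning setting, large entries of `reconstruct(A)` at positions that were zero in the input tensor are the predicted novel drug-disease associations.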
Machine Learning Fairness is Computationally Difficult and Algorithmically Unsatisfactorily Solved
Mike H.M. Teodorescu (Boston College)*; Xinyu Yao (Rice University)
The main purpose of the paper is to analyze the computational difficulties of selecting the suitable classification algorithms
that satisfy specific ethical criteria, when real data is used in training. Employing an imbalanced credit decision dataset
largely used for credit scoring and applying a set of algorithms and several fairness criteria, we show that many typical
classification algorithms do not satisfy in a reasonable manner more than one fairness criterion when considering more than
one protected attribute. This adds a layer of difficulty to those posed by the need for large databases and data- and
computationally-intensive decision-making systems used in domains such as credit scoring and hiring. A novel aspect
of this study is directly relating ML/AI fairness criteria and computational complexity. We reframe the problem of complexity
by connecting it to the search of an ethically acceptable solution instead of just an accurate solution. The results suggest
the continued need for human input in fairness decisions, especially when deciding tradeoffs between fairness criteria.
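One of the fairness criteria involved can be made concrete. The sketch below computes the demographic-parity gap (difference in positive-prediction rates across a protected attribute's groups); checking it separately per attribute illustrates why satisfying one criterion for one attribute does not imply satisfying it for another. This is a standard definition, not the paper's specific test battery.

```python
def demographic_parity_gap(y_pred, group):
    """Max difference in positive-prediction rate across groups (0 = parity)."""
    rates = {}
    for yp, g in zip(y_pred, group):
        n, pos = rates.get(g, (0, 0))
        rates[g] = (n + 1, pos + yp)
    vals = [pos / n for n, pos in rates.values()]
    return max(vals) - min(vals)
```

A classifier is then audited by evaluating this gap once per protected attribute and comparing each gap to a tolerance.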
Using Computation Effectively for Scalable Poisson Tensor Factorization: Comparing Methods Beyond
Computational Efficiency
Jeremy M Myers (College of William & Mary, Sandia National Laboratories)*; Daniel Dunlavy (Sandia National Laboratories)
Poisson Tensor Factorization (PTF) is an important data analysis method for analyzing patterns and relationships in
multiway count data. In this work, we consider several algorithms for computing a low-rank PTF of tensors with sparse
count data values via maximum likelihood estimation. Such an approach reduces to solving a nonlinear, non-convex
optimization problem, which can leverage considerable parallel computation due to the structure of the problem. However,
since the maximum likelihood estimator corresponds to the global minimizer of this optimization problem, it is important to
consider how effective methods are at both leveraging this inherent parallelism as well as computing a good approximation
to the global minimizer. In this work we present comparisons of multiple methods for PTF that illustrate the tradeoffs in
computational efficiency and accurately computing the maximum likelihood estimator. We present results using synthetic
and real-world data tensors to demonstrate some of the challenges when choosing a method for a given tensor.
3-4: Case Studies & Benchmarking 1 Session (15:45-17:00)
Session Co-Chairs: Chansup Byun & Kurt Keville
StressBench: A Configurable Full System Network and I/O Benchmark Framework
Dean G Chester (University of Warwick)*; Taylor Groves (NERSC); Simon Hammond (Sandia National Laboratories); Tim
Law (AWE); Steven Wright (University of York); Richard Smedley-Stevenson (AWE); Suhaib A. Fahmy (University of
Warwick); Gihan Mudalige (University of Warwick); Stephen Jarvis (University of Birmingham)
We present StressBench, a network benchmarking framework written for testing MPI operations and file I/O concurrently. It is
designed specifically to execute MPI communication and file access patterns that are representative of real-world scientific
applications. Existing tools consider either the worst case congestion with small abstract patterns or peak performance with
simplistic patterns. StressBench allows for a richer study of congestion by allowing orchestration of network load scenarios
that are representative of those typically seen at HPC centres, something that is difficult to achieve with existing tools. We
demonstrate the versatility of the framework from microbenchmarks through to finely controlled congested runs across a
cluster. Validation of the results using four proxy application communication schemes within StressBench against parent
applications shows a maximum difference of 15%. Using the I/O modeling capabilities of StressBench, we are able to quantify
the impact of file I/O on application traffic showing how it can be used in procurement and performance studies.
Performance Evaluation of Mixed-Precision Runge-Kutta Methods
Ben Burnett (University of Massachusetts Dartmouth)*; Sigal Gottlieb (UMass Dartmouth); Alfa Heryudono (UMass
Dartmouth); Zachary Grant (Oak Ridge National Labs)
Additive Runge-Kutta methods designed for preserving highly accurate solutions in mixed-precision computation were
proposed and analyzed in [8]. These specially designed methods use reduced precision for the implicit computations and full
precision for the explicit computations. We develop a FORTRAN code to solve a nonlinear system of ordinary differential
equations using the mixed precision additive Runge-Kutta (MP-ARK) methods on IBM POWER9 and Intel x86\_64 chips. The
convergence, accuracy, runtime, and energy consumption of these methods are explored. We show that these MP-ARK
methods efficiently produce accurate solutions with significant reductions in runtime (and by extension energy consumption).
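The split the abstract describes, implicit work in reduced precision and explicit work in full precision, can be seen in a one-stage toy analogue: additive (IMEX) Euler for y' = lam*y + g(t, y), with the implicit solve done in float32. This is an illustration of the mixed-precision idea only, not one of the MP-ARK schemes of the paper.

```python
import numpy as np

def imex_euler_mixed(lam, g, y0, t0, t1, n):
    """IMEX Euler: stiff term lam*y implicit in float32, g explicit in float64."""
    h = (t1 - t0) / n
    y, t = float(y0), float(t0)
    for _ in range(n):
        expl = g(t, y)                                   # full precision (explicit)
        lam32, h32 = np.float32(lam), np.float32(h)
        rhs32 = np.float32(y + h * expl)
        y = float(rhs32 / (np.float32(1.0) - h32 * lam32))  # implicit solve, float32
        t += h
    return y
```

The reduced-precision implicit solve keeps the scheme stable on stiff problems while the cheap part of each step retains full accuracy, which is the source of the runtime and energy savings.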
A More Portable HeFFTe: Implementing a Fallback Algorithm for Scalable Fourier Transforms
Daniel Sharp (Massachusetts Institute of Technology)*; Stanimire Tomov (University of Tennessee); Miroslav K Stoyanov
(Oak Ridge National Laboratory); Jack Dongarra (University of Tennessee)
The Highly Efficient Fast Fourier Transform for Exascale (heFFTe) numerical library is a C++ implementation of distributed
multidimensional FFTs targeting heterogeneous and scalable systems. To date, the library has relied on users to provide at
least one installation from a selection of well-known libraries for the single node/MPI-rank one-dimensional FFT calculations
that heFFTe is built on. In this paper, we describe the development of a CPU-based backend to heFFTe as a reference, or
"stock", implementation. This allows the user to install and run heFFTe without any external dependencies that may include
restrictive licensing or mandate specific hardware. Furthermore, this stock backend was implemented to take advantage of
SIMD capabilities on the modern CPU, and includes both a custom vectorized complex data-type and a run-time generated
call-graph for selecting which specific FFT algorithm to call. The performance of this backend greatly increases when
vectorized instructions are available and, when vectorized, it provides reasonable scalability in both performance and
accuracy compared to an alternative CPU-based FFT backend. In particular, we illustrate a highly-performant O(N log N)
code that is about 10 times faster compared to non-vectorized code for the complex arithmetic, and a scalability that matches
heFFTe's scalability when used with vendor or other highly-optimized 1D FFT backends. The same technology can be used to
derive other Fourier-related transformations that may not even be available in vendor libraries, e.g., the discrete sine (DST)
or cosine (DCT) transforms, as well as their extension to multiple dimensions and O(N log N) timing.
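The kind of single-node O(N log N) kernel a "stock" backend provides can be sketched as a radix-2 Cooley-Tukey FFT with vectorized twiddle application. heFFTe's actual stock backend adds a custom SIMD complex type and a run-time call-graph for mixed radices; this is the textbook power-of-two case only.

```python
import numpy as np

def fft_pow2(x):
    """Recursive radix-2 Cooley-Tukey FFT for power-of-two lengths."""
    x = np.asarray(x, dtype=complex)
    n = x.size
    if n == 1:
        return x
    even = fft_pow2(x[0::2])                 # DFT of even-indexed samples
    odd = fft_pow2(x[1::2])                  # DFT of odd-indexed samples
    tw = np.exp(-2j * np.pi * np.arange(n // 2) / n) * odd  # twiddle factors
    return np.concatenate([even + tw, even - tw])           # butterfly combine
```

The butterfly line is where vectorization pays off: each level combines all N/2 pairs with a single elementwise multiply-add, which is what SIMD instructions accelerate in the real backend.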
A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential
Equation Solutions
Yingbo Ma (Julia Computing); Vaibhav Dixit (Julia Computing); Mike J Innes (Julia Computing); Xingjian Guo (New York
University); Christopher V Rackauckas (Massachusetts Institute of Technology)*
Derivatives of differential equation solutions are commonly used for parameter estimation, fitting neural differential equations, and
as model diagnostics. However, with a litany of choices and a Cartesian product of potential methods, it can be difficult for
practitioners to understand which method is likely to be the most effective on their particular application. In this manuscript we
investigate the performance characteristics of Discrete Local Sensitivity Analysis implemented via Automatic Differentiation
(DSAAD) against continuous adjoint sensitivity analysis. Non-stiff and stiff biological and pharmacometric models, including a
PDE discretization, are used to quantify the performance of sensitivity analysis methods. Our benchmarks show that on small
stiff and non-stiff systems of ODEs (approximately <100 parameters+ODEs), forward-mode DSAAD is more efficient than
both reverse-mode and continuous forward/adjoint sensitivity analysis. The scalability of continuous adjoint methods is shown
to be more efficient than discrete adjoints and forward methods after crossing this size range. These comparative studies
demonstrate a trade-off between memory usage and performance in the continuous adjoint methods that should be
considered when choosing the technique, while numerically unstable backsolve techniques from the machine learning
literature are demonstrated as unsuitable for most scientific models. The performance of adjoint methods is shown to be
heavily tied to the reverse-mode AD method used for the vector-Jacobian product calculations, with tape-based AD methods
shown to be 2 orders of magnitude slower on nonlinear partial differential equations than static AD techniques. In addition,
these results demonstrate the out-of-the-box applicability of DSAAD to differential-algebraic equations, delay differential
equations, and hybrid differential equation systems where the event timing and effects are dependent on model parameters,
showcasing an ease of implementation advantage for DSAAD approaches. Together, these benchmarks provide a guide to
help practitioners to quickly identify the best mixture of continuous sensitivities and automatic differentiation for their
applications.
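For readers unfamiliar with the continuous forward approach being benchmarked, here is its simplest instance: for y' = -p*y, augment the state with the sensitivity s = dy/dp, whose ODE is s' = -y - p*s, and integrate both together (explicit Euler here for brevity). This toy is a generic illustration, not the manuscript's solver stack.

```python
def forward_sensitivity(p, y0, t1, n):
    """Integrate y' = -p*y and its forward sensitivity s = dy/dp jointly."""
    h = t1 / n
    y, s = y0, 0.0
    for _ in range(n):
        # simultaneous update: RHS uses the old (y, s) pair
        y, s = y + h * (-p * y), s + h * (-y - p * s)
    return y, s
```

Against the exact solution y = y0*exp(-p*t), the returned `s` approximates dy/dp = -t*y0*exp(-p*t); forward-mode AD of the solver (DSAAD) computes the same quantity by differentiating the loop itself.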
3-S1: AI Challenges Special (17:30-19:30)
Organizer(s): Vijay Gadepally
The MIT Supercloud Dataset
Siddharth Samsi (MIT Lincoln Laboratory)*; Matthew Weiss (MIT Lincoln Laboratory); Vijay Gadepally (MIT Lincoln
Laboratory)
Artificial intelligence (AI) and machine learning (ML) workloads are an increasingly large share of the compute
workloads in traditional High-Performance Computing (HPC) centers and commercial cloud systems. This has led to
changes in deployment approaches of HPC clusters and the commercial cloud, as well as a new focus on approaches
to optimized resource usage, allocations and deployment of new AI frameworks, and capabilities such as Jupyter
notebooks to enable rapid prototyping and deployment. With these changes, there is a need to better understand
cluster/datacenter operations with the goal of developing improved scheduling policies, identifying inefficiencies in
resource utilization and energy/power consumption, predicting failures, and identifying policy violations. In this paper we
introduce the MIT Supercloud Dataset which aims to foster innovative AI/ML approaches to the analysis of large scale
HPC and datacenter/cloud operations. We provide detailed monitoring logs from the MIT Supercloud system, which
include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data. This paper
discusses the details of the dataset, the collection methodology, and data availability, and describes potential challenge
problems being developed using this data. Datasets and future challenge announcements will be available via
https://dcc.mit.edu.
Maneuver Identification Challenge
Kaira M Samuel (MIT)*; Jeremy Kepner (MIT Lincoln Laboratory)
AI algorithms that identify maneuvers from trajectory data could play an important role in improving flight safety and pilot
training. AI challenges allow diverse teams to work together to solve hard problems and are an effective tool for
developing AI solutions. AI challenges are also a key driver of AI computational requirements. The Maneuver
Identification Challenge hosted at maneuver-id.mit.edu provides thousands of trajectories collected from pilots
practicing in flight simulators, descriptions of maneuvers, and examples of these maneuvers performed by experienced
pilots. Each trajectory consists of positions, velocities, and aircraft orientations normalized to a common coordinate
system. Construction of the data set required significant data architecture to transform flight simulator logs into AI-ready
data, which included using a supercomputer for deduplication and data conditioning. There are three proposed
challenges. The first challenge is separating physically plausible (good) trajectories from unfeasible (bad) trajectories.
Human labeled good and bad trajectories are provided to aid in this task. Subsequent challenges are to label
trajectories with their intended maneuvers and to assess the quality of those maneuvers.
Invited Talk: Robust Neural Differential Models for Navigation and Beyond
Maj Andrew Bowne (USAF AI Accelerator)
Invited Talk: Objective Performance Prediction & Optimization Using Physiological and Cognitive Metrics
Capt Kyle “Gouge” McAlpin (USAF AI Accelerator)
Invited Talk: Large-Scale Simulation of Mechanical Instabilities in Soft Materials
Prof. Raul Radovitzky (MIT AeroAstro)
3-S2: OpenSuperComputing BoF Special (17:30-19:30)
Organizer(s): Kurt Keville
Invited Talk: Introduction to the Open-Source FPGA Foundation
Pierre-Emmanuel Gaillardon (University of Utah)
Invited Talk: An HPEC Retrospective
Gaurav Mitra (Texas Instruments)
2021 Abstract Book