2013 IEEE High Performance
Extreme Computing Conference
(HPEC '13)
Seventeenth Annual HPEC Conference
10 - 12 September 2013
Westin Hotel, Waltham, MA USA
Robust Graph Traversal: Resiliency Techniques for Data Intensive Supercomputing
Saurabh Hukerikar, Information Sciences Institute; Pedro Diniz, University of Southern California; Robert Lucas, University of Southern California
Abstract: Emerging large-scale, data-intensive applications that use the graph abstraction to represent problems across a broad spectrum of scientific and analytics domains have radically different features from floating-point-intensive scientific applications. Besides having large working datasets, these complex graph applications exhibit very low spatial and temporal locality, which makes designing algorithmic fault tolerance for them quite challenging. They will run on future exascale-class High Performance Computing (HPC) systems that will contain a massive number of components and will be built from devices far less reliable than those used today. In this paper we propose software-based approaches that provide robustness for these data-intensive, graph-based applications by managing resiliency in terms of data-flow progress and validation of pointer computations. Our experimental results show that this simple approach incurs fairly low execution-time overheads while allowing these computations to survive a large number of faults that would otherwise always terminate the computation.
Task Scheduling for Reconfigurable Systems in Dynamic Fault-Rate Environments
Adam Jacobs, University of Florida; Alan George, University of Florida; Nicholas Wulf, University of Florida
Abstract: Commercial SRAM-based, field-programmable gate arrays (FPGAs) have the capability to provide space applications with the
necessary performance, energy-efficiency, and adaptability to meet next-generation mission requirements. However, mitigating an
FPGA’s susceptibility to radiation-induced faults is challenging. Triple-modular redundancy (TMR) techniques are traditionally used to
mitigate radiation effects, but TMR incurs substantial overheads such as increased area and power requirements. Using partial
reconfiguration (PR), FPGAs could be used to dynamically adjust the fault-tolerance scheme as the radiation environment changes over
time. In order to manage these dynamic adjustments, a fault-tolerant task scheduler is necessary. We improve scheduling in the
presence of time-varying fault rates by developing a fault-tolerant scheduling heuristic. Our heuristic combines task execution time and
system fault rate to determine the optimal fault-tolerance mode for the task. The heuristic is evaluated using software simulations of a
system in periodic and burst fault environments. Results show our scheduling technique is capable of reducing the task rejection ratio in
periodic environments by 94% and in burst environments by 48% over static TMR, and the adaptive heuristic approaches the
performance of an optimal predetermined heuristic. Integration of our fault-tolerant scheduling heuristic with other pre-existing PR
architectures can enable their use in dynamic fault environments.
Adaptive Routing in Hexagonal Torus Interconnection Network
Arash Shamaei, Oregon State University; Bella Bose; Mary Flahive
Abstract: The hexagonal torus network is a degree-six toroidal network with rich topological properties. It was used in the design of the HARTS machine at the University of Michigan, and more recently it has been proposed for cellular networks. Its low diameter and lower average hop distance make it advantageous over other 2D toroidal networks such as meshes and tori. This paper proposes a fully adaptive, deadlock-free routing algorithm for hexagonal torus networks based on virtual-channel partitioning; the algorithm requires three virtual channels per physical channel to remain deadlock-free. Simulation results show that this algorithm is superior to the fully adaptive routing algorithm for 2D meshes and 2D tori of the same size.
A Novel Fast Modular Multiplier Architecture for 8,192-bit RSA Cryptosystem
Wei Wang, Worcester Polytechnic Institute; Xinming Huang, Worcester Polytechnic Institute
Abstract: Modular multiplication is the most crucial component of the RSA cryptosystem. In this paper, we present a new modular multiplication architecture using the Strassen multiplication algorithm and Montgomery reduction. The architecture differs from the interleaved version of Montgomery multiplication traditionally used in RSA designs. By selecting different bases of 16 or 24 bits, it can perform 8,192-bit or 12,288-bit modular multiplication. The design was synthesized on Altera's Stratix-V FPGA using Quartus II. It performs one modular multiplication in 2,030 cycles. When operating at 209 MHz, the execution time for an 8K- or 12K-bit modular multiplication is about 9.7 µs.
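The Montgomery reduction at the heart of this architecture can be illustrated in software. The following is a minimal Python sketch of the textbook REDC step (an illustration of the algorithm only, not the paper's hardware design), assuming an odd modulus n and operands below n:

```python
def montgomery_multiply(a, b, n, r_bits):
    """Compute a*b*R^-1 mod n, where R = 2^r_bits, assuming n is odd,
    n < R, and 0 <= a, b < n (textbook Montgomery REDC)."""
    R = 1 << r_bits
    # n' satisfies n * n' ≡ -1 (mod R)
    n_prime = (-pow(n, -1, R)) % R
    t = a * b
    # m makes (t + m*n) divisible by R
    m = (t * n_prime) % R
    u = (t + m * n) >> r_bits
    # One conditional subtraction brings the result into [0, n)
    return u - n if u >= n else u
```

In a full RSA design, operands are first converted into the Montgomery domain (multiplied by R mod n) so that the extra R^-1 factor cancels across a chain of multiplications; the hardware pipelines these steps rather than computing them sequentially as above.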
Efficient Gbit/s Data Transceivers Designed for Verification and SoC Integration
William Ellersick, Analog Circuit Works
Abstract: High performance computing systems generate extreme amounts of data that must be reliably transported between chips and modules. While SERDES research, including that of the author, continues to demonstrate extraordinary data rates per channel, practical interfaces do not approach these limits; instead, they find the sweet spot where circuit complexity, power, and area start to increase faster than the data rate. Advances in packaging allow massively parallel optical or electrical links where power and silicon area per bit can be more important than the data rate per link. This presentation describes a transceiver design that has evolved through implementation and testing of a number of massively parallel communication and imaging systems, with a focus on optimizing system cost, robustness, and time to market. Based on circuits developed in 2001 for a 16 Gbps link that consumed 69 mW/Gbps due to the complex equalization needed, performance has scaled with process technology to links with simpler equalization that require only 2 mW/Gbps and fit under a pair of pads with room for ESD protection. Synchronization circuits are presented that simplify the chip-level design. Retiming circuits cross clock domains from digital system clocks to transceiver clocks without the need for expensive FIFOs. Quarter- or eighth-bit-rate clocks are distributed to minimize power and obviate the need for a PLL for each link. Duty-cycle-correcting clock buffers allow clocks to be driven across long chips without degradation. First-silicon success is paramount in large systems-on-chip, where digital design effort, power, and area often dwarf those of the data transceivers. Toward this end, transmit equalization is used, providing eye openings that can be verified with straightforward simulations and measurements, at the expense of 1-2 dB larger signal swings to compensate for reduced SNR compared to complex and power-hungry decision feedback equalization (DFE). Real-valued behavioral models of the data transceivers enable event-driven simulation alongside large digital blocks to ensure proper control and port configuration. The portable syntax used in the behavioral models is presented, which allows the models to be verified in analog (SPICE-based) simulators, ensuring accurate correspondence between the models and the transistor-level link designs. Finally, guidelines for chip and board layout are discussed, which, along with design reviews by transceiver designers, help maintain healthy design margins. The circuits described allow highly complex systems to be designed with robust and efficient data transceivers, and the verification and integration methodology allows the focus to remain on digital processing and system design.
Exploiting Free Silicon for Energy-Efficient Computing Directly in NAND Flash-based Solid-State Storage Systems
Peng Li, University of Minnesota; Kevin Gomez, Seagate Technology; David Lilja, University of Minnesota
Abstract: Energy consumption is a fundamental issue in today's data centers as data volumes continue to grow dramatically. How to process these data in an energy-efficient way becomes more and more important. Prior work has proposed several methods to build an energy-efficient system. The basic idea is to attack the memory-wall issue (i.e., the performance gap between CPUs and main memories) by moving computing closer to the data. However, these methods are not widely adopted due to high cost and limited performance improvements. In this paper, we propose the storage processing unit (SPU), which adds computing power to NAND flash memories at standard solid-state drive (SSD) cost. By pre-processing the data using the SPU, the volume of data that needs to be transferred to host CPUs for further processing is significantly reduced. Simulation results show that the SPU-based system uses at least 100 times less energy per operation than a conventional system for data-intensive applications.
Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware
Qiuling Zhu, Carnegie Mellon University; Tobias Graf; H. Ekin Sumbul; Larry Pileggi; Franz Franchetti
Abstract: This paper introduces a 3D-stacked logic-in-memory (LiM) system to accelerate the processing of sparse matrix data held in a 3D DRAM system. We build a customized content-addressable memory (CAM) hardware structure to exploit the inherent sparse data patterns and model the LiM-based hardware accelerator layers that are stacked between DRAM dies for efficient sparse matrix operations. Through-silicon vias (TSVs) are used to provide the required high inter-layer bandwidth. Furthermore, we adapt the algorithm and data structure to fully leverage the underlying hardware capabilities, and develop the necessary design framework to facilitate design-space evaluation and LiM hardware synthesis. Our simulations demonstrate improvements of more than two orders of magnitude in performance and energy efficiency compared with a traditional multi-threaded implementation on modern processors.
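The kernel this hardware accelerates, sparse matrix-matrix multiplication, can be illustrated in software. Below is a minimal Python sketch of Gustavson's row-wise algorithm over a dictionary-based sparse format (an illustration of the operation only; the paper's CAM-based design matches nonzeros in hardware rather than with hash lookups):

```python
def spgemm(A, B):
    """Sparse matrix-matrix product C = A * B using Gustavson's
    row-wise algorithm. A and B map row index -> {col index: value},
    storing only nonzero entries."""
    C = {}
    for i, row in A.items():
        acc = {}  # accumulator for row i of C
        for k, a_ik in row.items():
            # Each nonzero A[i][k] scales row k of B into the accumulator
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C
```

The inner merge step, combining partial products that land on the same output column, is exactly the index-matching work that the CAM structure is designed to perform in parallel.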
Instruction Set Extensions for Photonic Synchronous Coalesced Accesses
Paul Keltcher, MIT Lincoln Laboratory
Abstract: Microprocessors have evolved over the last forty-plus years from purely sequential single-operation machines, to pipelined super-scalar, to threaded and SIMD, and finally to multi-core and massive multi-core/multi-thread machines. Despite these advances, the conceptual model programmers use to program them is still that of a single-threaded, register-file-bound math unit that can only be loosely synchronized with other such processors. This lack of explicit synchrony, caused by limitations of metal interconnect, limits parallel efficiency. Recent advances in silicon-photonic-enabled architectures [1, 5, 7] promise high synchrony over long distances (centimeters or more). In this paper, it is shown that global synchrony changes the way computers can be programmed by introducing a new class of ISA-level instruction: the globally synchronous load-store. In the context of multiple load-store machines, the globally synchronous load-store architecture allows the programmer to treat a collection of independent load-store machines as a single load-store machine. This operation is described, and its ISA implications are explored in the context of the distributed matrix transpose, which exhibits a high degree of data non-locality and is difficult to parallelize efficiently on modern architectures.
SIMD Acceleration of Modular Arithmetic on Contemporary Embedded Platforms
Krishna Pabbuleti, Virginia Tech; Deepak Mane, Virginia Tech; Avinash Desai, Virginia Tech; Curt Albert, Virginia Tech; Patrick Schaumont,
Virginia Tech
Abstract: The computational workload of Elliptic Curve Cryptography (ECC) consists of point additions and doublings on elliptic curves over finite fields. The execution time of ECC is completely dominated by the modular multiplications in these fields. In this contribution, we propose vector-processing techniques to accelerate modular multiplications in prime fields. We demonstrate implementations for the VeNum (NEON) coprocessor in Qualcomm's Scorpion (ARM) CPU, as well as for the SSE2 instruction-set extensions in Intel's Atom CPU. Our implementations, which use NIST-standard prime-field curves, run more than two times faster than the OpenSSL versions of the same ECC operations on the same processor.
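NIST prime-field curves are attractive for such implementations because their moduli have a special form that makes reduction cheap. As an illustration (scalar Python, not the paper's vectorized NEON/SSE2 code), the P-192 prime p = 2^192 - 2^64 - 1 lets the high half of a double-width product fold back with shifts and adds, since 2^192 ≡ 2^64 + 1 (mod p):

```python
# NIST P-192 prime: reduction exploits 2^192 ≡ 2^64 + 1 (mod p)
P192 = 2**192 - 2**64 - 1

def reduce_p192(t):
    """Reduce t (e.g. a 384-bit product of two field elements)
    modulo P-192 using the prime's special form."""
    while t >= 2**192:
        hi, lo = t >> 192, t & (2**192 - 1)
        # Fold: hi * 2^192 ≡ hi * (2^64 + 1) (mod P192)
        t = lo + hi + (hi << 64)
    # A few conditional subtractions finish the reduction
    while t >= P192:
        t -= P192
    return t
```

Vectorized implementations like those in the paper carry out the same folding on limbs of the operands, packing several partial products into one SIMD register per step.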