Robust Graph Traversal: Resiliency Techniques for Data Intensive Supercomputing
Saurabh Hukerikar, Information Sciences Institute; Pedro Diniz, University of Southern California; Robert Lucas, University of Southern California
Abstract: Emerging large-scale, data-intensive applications that use the graph abstraction to represent problems across a broad spectrum of scientific and analytics domains have radically different characteristics from floating-point-intensive scientific applications. Besides having large working datasets, these complex graph applications exhibit very low spatial and temporal locality, which makes designing algorithmic fault tolerance for them quite challenging. They will run on future exascale-class High Performance Computing (HPC) systems that will contain a massive number of components and will be built from devices far less reliable than those used today. In this paper we propose software-based approaches that provide robustness for these data-intensive, graph-based applications by managing resiliency in terms of data-flow progress and validation of pointer computations. Our experimental results show that this simple approach incurs fairly low execution-time overheads while allowing these computations to survive a large number of faults that would otherwise always terminate the computation.

Task Scheduling for Reconfigurable Systems in Dynamic Fault-Rate Environments
Adam Jacobs, University of Florida; Alan George, University of Florida; Nicholas Wulf, University of Florida
Abstract: Commercial SRAM-based field-programmable gate arrays (FPGAs) have the capability to provide space applications with the performance, energy efficiency, and adaptability needed to meet next-generation mission requirements. However, mitigating an FPGA's susceptibility to radiation-induced faults is challenging.
Triple-modular redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but TMR incurs substantial overheads such as increased area and power requirements. Using partial reconfiguration (PR), FPGAs can dynamically adjust the fault-tolerance scheme as the radiation environment changes over time. Managing these dynamic adjustments requires a fault-tolerant task scheduler. We improve scheduling in the presence of time-varying fault rates by developing a fault-tolerant scheduling heuristic that combines task execution time and system fault rate to determine the optimal fault-tolerance mode for each task. The heuristic is evaluated using software simulations of a system in periodic and burst fault environments. Results show that our scheduling technique reduces the task rejection ratio over static TMR by 94% in periodic environments and by 48% in burst environments, and that the adaptive heuristic approaches the performance of an optimal predetermined heuristic. Integrating our fault-tolerant scheduling heuristic with pre-existing PR architectures can enable their use in dynamic fault environments.

Adaptive Routing in Hexagonal Torus Interconnection Network
Arash Shamaei, Oregon State University; Bella Bose; Mary Flahive
Abstract: The hexagonal torus network is a degree-six toroidal network with rich topological properties. It was used in the design of the HARTS machine at the University of Michigan, and more recently it has been proposed for cellular networks. The low diameter and lower average hop distance of this network make it advantageous over other 2D toroidal networks such as meshes and tori. This paper proposes a fully adaptive, deadlock-free routing algorithm for hexagonal torus networks based on virtual channel partitioning; the algorithm requires three virtual channels per physical channel to remain deadlock-free.
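As an illustration of the topology described above, the following sketch (ours, not from the paper) enumerates the six neighbors of a node in an m-by-n hexagonal torus under one common coordinate convention, in which the six link directions are plus/minus x, plus/minus y, and plus/minus (x+y):

```python
# Illustrative sketch (not from the paper): neighbor enumeration in a
# hexagonal torus. Nodes are (row, col) pairs on an m x n grid with
# wraparound links. Under one common convention, each node has six
# neighbors along the +/-x, +/-y, and +/-(x+y) directions, which gives
# the degree-6 topology the abstract describes.

HEX_DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1)]

def hex_torus_neighbors(node, m, n):
    """Return the six neighbors of `node` in an m-by-n hexagonal torus."""
    r, c = node
    return [((r + dr) % m, (c + dc) % n) for dr, dc in HEX_DIRECTIONS]
```

The wraparound (modulo) links are what introduce the cyclic channel dependencies that the paper's virtual-channel partitioning must break to stay deadlock-free.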
Simulation results show that this algorithm is superior to the fully adaptive routing algorithm for 2D meshes and 2D tori of the same size.

A Novel Fast Modular Multiplier Architecture for 8,192-bit RSA Cryptosystem
Wei Wang, Worcester Polytechnic Institute; Xinming Huang, Worcester Polytechnic Institute
Abstract: Modular multiplication is the most crucial component of the RSA cryptosystem. In this paper, we present a new modular multiplication architecture using the Strassen multiplication algorithm and Montgomery reduction. The architecture differs from the interleaved version of Montgomery multiplication traditionally used in RSA designs. By selecting a base of 16 or 24 bits, it can perform 8,192-bit or 12,288-bit modular multiplication. The design was synthesized on Altera's Stratix-V FPGA using Quartus II. It performs one modular multiplication in 2,030 cycles; when operating at 209 MHz, the execution time for an 8K- or 12K-bit modular multiplication is about 9.7 µs.

Efficient Gbit/s Data Transceivers Designed for Verification and SoC Integration
William Ellersick, Analog Circuit Works
Abstract: High-performance computing systems generate extreme amounts of data that must be reliably transported between chips and modules. While SERDES research, including that of the author, continues to demonstrate extraordinary data rates per channel, practical interfaces do not approach these limits; instead they find the sweet spot where circuit complexity, power, and area start to increase faster than the data rate. Advances in packaging allow massively parallel optical or electrical links where power and silicon area per bit can be more important than the data rate per link. This presentation describes a transceiver design that has evolved through implementation and testing of a number of massively parallel communication and imaging systems, with a focus on optimizing system cost, robustness, and time to market.
Based on circuits developed in 2001 for a 16 Gbps link that consumed 69 mW/Gbps due to the complex equalization needed, performance has scaled with process technology to links with simpler equalization that require only 2 mW/Gbps and fit under a pair of pads with room for ESD protection. Synchronization circuits are presented that simplify the chip-level design: retiming circuits cross clock domains from digital system clocks to transceiver clocks without the need for expensive FIFOs; quarter- or eighth-bit-rate clocks are distributed to minimize power and obviate the need for a PLL for each link; and duty-cycle-correcting clock buffers allow clocks to be driven across long chips without degradation. First-silicon success is paramount in large systems-on-chip, where digital design effort, power, and area often dwarf those of the data transceivers. Toward this end, transmit equalization is used, providing eye openings that can be verified with straightforward simulations and measurements, at the expense of 1-2 dB larger signal swings to compensate for reduced SNR compared with complex and power-hungry decision-feedback equalization (DFE). Real-valued behavioral models of the data transceivers enable event-driven simulation with large digital blocks to ensure proper control and port configuration. The portable syntax used in the behavioral models is presented; it allows the models to be verified in analog (SPICE-based) simulators, ensuring accurate correspondence between the models and the transistor-level link designs. Finally, guidelines for chip and board layout are discussed, which, along with design reviews by transceiver designers, help maintain healthy design margins. The circuits described allow highly complex systems to be designed with robust and efficient data transceivers.
The verification and integration methodology allows the focus to remain on digital processing and system design.

Exploiting Free Silicon for Energy-Efficient Computing Directly in NAND Flash-based Solid-State Storage Systems
Peng Li, University of Minnesota; Kevin Gomez, Seagate Technology; David Lilja, University of Minnesota
Abstract: Energy consumption is a fundamental issue in today's data centers as data volumes continue to grow dramatically. Processing these data in an energy-efficient way is becoming more and more important. Prior work has proposed several methods to build an energy-efficient system; the basic idea is to attack the memory-wall issue (i.e., the performance gap between CPUs and main memories) by moving computation closer to the data. However, these methods have not been widely adopted due to high cost and limited performance improvements. In this paper, we propose the storage processing unit (SPU), which adds computing power to NAND flash memories at standard solid-state drive (SSD) cost. By pre-processing the data in the SPU, the amount of data that must be transferred to host CPUs for further processing is significantly reduced. Simulation results show that the SPU-based system uses at least 100 times less energy per operation than a conventional system for data-intensive applications.

Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware
Qiuling Zhu, Carnegie Mellon University; Tobias Graf; H. Ekin Sumbul; Larry Pileggi; Franz Franchetti
Abstract: This paper introduces a 3D-stacked logic-in-memory (LiM) system to accelerate the processing of sparse matrix data held in a 3D DRAM system. We build a customized content-addressable memory (CAM) hardware structure to exploit the inherent sparse data patterns and model the LiM-based hardware accelerator layers that are stacked between DRAM dies for efficient sparse matrix operations.
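The irregular, data-dependent index matching that a CAM structure can accelerate is visible in even a minimal software sparse matrix-matrix multiply. The sketch below is our illustration of that access pattern, not the paper's hardware algorithm; the dict-of-dicts storage format is an assumption chosen for brevity:

```python
# Illustrative sketch: sparse matrix-matrix multiply (SpGEMM) with
# dict-of-dicts storage, {row: {col: value}}. The inner lookup --
# finding which entries of B's row k exist for each nonzero A[i][k] --
# is the irregular index-matching step that a CAM-based LiM layer
# can perform associatively in hardware.

def spgemm(A, B):
    """Multiply sparse matrices A and B; returns A @ B in the same format."""
    C = {}
    for i, row_a in A.items():
        acc = {}
        for k, a_ik in row_a.items():
            # Match A's column index k against B's row index k.
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C
```

In software, each `B.get(k, {})` lookup is a pointer chase with poor locality; the abstract's point is that stacking the matching logic next to the DRAM dies removes that round trip.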
Through-silicon vias (TSVs) are used to provide the required high inter-layer bandwidth. Furthermore, we adapt the algorithm and data structure to fully leverage the underlying hardware capabilities, and develop the necessary design framework to facilitate design-space evaluation and LiM hardware synthesis. Our simulations demonstrate more than two orders of magnitude improvement in performance and energy efficiency compared with a traditional multi-threaded implementation on modern processors.

Instruction Set Extensions for Photonic Synchronous Coalesced Accesses
Paul Keltcher, MIT Lincoln Labs
Abstract: Microprocessors have evolved over the last forty-plus years from purely sequential single-operation machines, to pipelined super-scalar, to threaded and SIMD, and finally to multi-core and massively multi-core/multi-threaded machines. Despite these advances, the conceptual model programmers use is still that of a single-threaded, register-file-bound math unit that can only be loosely synchronized with other such processors. This lack of explicit synchrony, caused by the limitations of metal interconnect, limits parallel efficiency. Recent advances in silicon-photonic-enabled architectures [1, 5, 7] promise high synchrony over long distances (centimeters or more). In this paper, it is shown that global synchrony changes the way computers can be programmed by introducing a new class of ISA-level instruction: the globally synchronous load-store. In the context of multiple load-store machines, the globally synchronous load-store architecture allows the programmer to think of a collection of independent load-store machines as a single load-store machine.
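A toy software analogue of that single-load-store-machine abstraction (our illustration; the class and its address split are assumptions, not the paper's ISA) decodes a flat global address into a (node, local offset) pair, so one load or store transparently targets one node's local memory:

```python
# Toy software analogue (not the paper's ISA): a flat global address
# space spanning several independent "load-store machines". A global
# address decodes into (node, local offset); a globally synchronous
# load-store would issue many such accesses in lockstep across nodes.

class GlobalMemory:
    def __init__(self, num_nodes, words_per_node):
        self.words_per_node = words_per_node
        self.nodes = [[0] * words_per_node for _ in range(num_nodes)]

    def decode(self, addr):
        """Split a global address into (node index, local offset)."""
        return divmod(addr, self.words_per_node)

    def store(self, addr, value):
        node, off = self.decode(addr)
        self.nodes[node][off] = value

    def load(self, addr):
        node, off = self.decode(addr)
        return self.nodes[node][off]
```

In this picture, the programmer addresses one memory; the decode step hides which physical machine actually owns the word, which is exactly the view the abstract argues photonic synchrony makes practical.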
This operation is described, and its ISA implications are explored in the context of the distributed matrix transpose, which exhibits a high degree of data non-locality and is difficult to parallelize efficiently on modern architectures.

SIMD Acceleration of Modular Arithmetic on Contemporary Embedded Platforms
Krishna Pabbuleti, Virginia Tech; Deepak Mane, Virginia Tech; Avinash Desai, Virginia Tech; Curt Albert, Virginia Tech; Patrick Schaumont, Virginia Tech
Abstract: Elliptic curve cryptography (ECC) is built from point additions and doublings on elliptic curves over finite fields. The execution time of ECC is completely dominated by modular multiplications in these fields. In this contribution, we propose vector processing techniques to accelerate modular multiplications in prime fields. We demonstrate implementations for the Venom (NEON) coprocessor in Qualcomm's Scorpion (ARM) CPU, as well as for the SSE2 instruction-set extensions in Intel's Atom CPU. Our implementations, which use NIST-standard prime-field curves, run more than two times faster than the OpenSSL versions of the same ECC operations on the same processor.
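Both the RSA multiplier abstract and this ECC abstract center on fast modular multiplication via Montgomery-style reduction. The plain-Python sketch below is our illustration of the textbook Montgomery algorithm, not either paper's hardware or SIMD implementation; it shows why the reduction replaces division by the modulus with a cheap shift:

```python
# Illustrative sketch of Montgomery modular multiplication, the
# reduction technique underlying many fast modular multipliers.
# For an odd modulus N, pick R = 2^k > N. Operands are kept in
# "Montgomery form" a*R mod N, and each product is reduced by an
# exact division by R -- a k-bit shift -- instead of dividing by N.

def montgomery_setup(N, k):
    """Precompute R = 2^k and N' with N * N' == -1 (mod R)."""
    R = 1 << k
    assert N % 2 == 1 and R > N
    n_prime = (-pow(N, -1, R)) % R  # requires Python 3.8+ for pow(x, -1, m)
    return R, n_prime

def montgomery_mul(a_bar, b_bar, N, R, n_prime):
    """Given a_bar = a*R mod N and b_bar = b*R mod N, return a*b*R mod N."""
    t = a_bar * b_bar
    m = (t * n_prime) % R    # low k bits only: a truncated multiply
    u = (t + m * N) >> R.bit_length() - 1  # exact division by R (a shift)
    return u - N if u >= N else u
```

Converting out of Montgomery form is just one more call with the plain integer 1 as the second operand. The papers' contributions lie in how this arithmetic is mapped to hardware (Strassen-based multipliers on FPGA) or to NEON/SSE2 vector lanes.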