High-performance Dynamic Programming on FPGAs with OpenCL
Sean Settle, Altera Corporation
Abstract: Field programmable gate arrays (FPGAs) provide reconfigurable computing fabrics that can be tailored to a wide range of time and power sensitive applications. Traditionally, programming FPGAs required expertise in complex hardware description languages (HDLs) or proprietary high-level synthesis (HLS) tools. Recently, Altera released the world’s first OpenCL conformant SDK for FPGAs. OpenCL is an open, royalty-free standard for cross-platform, parallel programming of heterogeneous systems that, together with Altera extensions, significantly reduces FPGA development time and costs in high-performance computing environments. In this paper, we demonstrate dynamic programming on FPGAs with OpenCL by implementing the Smith-Waterman algorithm for DNA, RNA, or protein sequencing in bioinformatics in a manner readily familiar to both hardware and software developers. Results show that Altera FPGAs significantly outperform leading CPU and GPU parallel implementations by over an order of magnitude in both absolute performance and relative power efficiency.

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs
Ren Chen, USC; Neungsoo Park, Konkuk University; Viktor Prasanna, University of Southern California
Abstract: To process streaming data, throughput is a key performance metric for FFT designs. However, high throughput FFT architectures consume a large amount of power due to complex routing and memory access. In this paper, we propose a high throughput energy efficient multi-FFT architecture based on the Radix-x Cooley-Tukey algorithm. In the proposed architecture, multiple time-multiplexed pipeline FFT processors are used to achieve the same throughput as a fully spatial parallel FFT architecture. This design avoids complex routing, thus reducing the interconnection power. Furthermore, a dynamic memory activation scheme is developed to reduce the memory power.
Post place-and-route results show that, for N-point FFT (64<=N<=4096), our designs improve the energy efficiency (defined as GOPS/Joule) by 17% to 26%, compared with a state-of-the-art design. For various throughput requirements, the proposed design achieves 50-63 GOPS/Joule, i.e., up to 78% of the peak energy efficiency of FFT designs on FPGAs.

Evaluating Energy Efficiency of Floating Point Matrix Multiplication on FPGAs
Kiran Matam, USC; Hoang Le, USC; Viktor Prasanna, University of Southern California
Abstract: Energy efficiency is emerging as one of the key metrics in scientific computing. In this work, we evaluate the energy efficiency of floating point matrix multiplication on state-of-the-art FPGAs. First, we implement a design parameterized with the problem size and the type of storage memory. Next, to understand the efficiency of our implementations, we propose and implement a minimal architecture to measure an upper bound on the energy efficiency of any matrix multiplication implementation. Lastly, we model and estimate the energy efficiency of large-scale matrix multiplication using external DRAM. Our implementations can achieve energy efficiency up to 7.07 and 2.28 GFlops/Joule at worst-case signal rate for single and double precision, respectively. When compared to the measured upper bound on energy efficiency, our implementations can sustain up to 72% and 84% for single and double precision, respectively. For large-scale matrix multiplication we estimate an energy efficiency of 5.21 and 1.60 GFlops/Joule for single and double precision, respectively.

Image Search System
Peter Cho, MIT Lincoln Lab; Michael Yee, MIT Lincoln Lab
Abstract: Digital images are currently shot and stored in vast numbers. Billions of photos and video clips may now be accessed via public internet and private offline archives. Yet navigating through image repositories generally requires clicking through seas of thumbnails.
Aside from occasional human-tagged keywords, little connection typically exists between archived images to help users find stills or frames of interest. New search capabilities are consequently needed to mine huge imagery volumes. In this talk, we present a prototype system which enables user exploration of global structure as well as individual picture drill-down for O(10^4) images. Our search engine builds upon advances made over the past decade in SIFT matching, image feature clustering and large data handling. Using our software tools, image analysts can investigate disparate data sets such as 30K photos shot semi-cooperatively around MIT, 6K Grand Canyon JPEG files downloaded from Flickr, and 2K video frames extracted from a news broadcast. All these data sets were processed on Lincoln Laboratory's grid cluster. Our system's netcentric design enables multi-user collaboration on different sets of archived pictures. The search engine is based upon a Postgres database which stores topological relationships, space-time metadata and derived image attributes. Its front-end includes a web browser thin client and a graph viewer thick client whose states remain synchronized. The browser displays a currently selected image as well as sibling thumbnails. The graph viewer represents digital pictures as nodes whose coloring indicates SIFT feature overlap. Hierarchical node clustering generates family pyramid structures for a priori unorganized sets of input images. Thin and thick client perusing of higher pyramid levels provides a practical means for gaining comprehensive insight into large data collections. Users can assign captions and annotate image regions of interest with our system. Moreover, various known or automatically derived attributes are searchable. For example, air/ground views and EO/IR sensing modalities are readily identified via graph node coloring.
Color histograms for individual pictures and dominant color content for entire collections may similarly be retrieved from the database. Human face detection results based upon work by Kalal, Matas and Mikolajczyk (2008) are also exploitable in the thin and thick clients. We close by discussing relations automatically uncovered among YouTube video clips of the April 2013 Boston bombings. The results compellingly demonstrate our search system's potential for exploiting massive social media imagery archives.

Optimizing Performance of HPC Storage Systems: Optimizing performance for reads and writes
Torben Kling Petersen, Xyratex; John Fragalla, Xyratex
Abstract: The performance of HPC storage systems depends upon a variety of factors. The results of using any of the standard benchmark suites for storage depend not only on the storage architecture, but also on the type of disk drives, the type and design of the interconnect, and the type and number of clients. In addition, each of the benchmark suites has a number of different parameters and test methodologies that require careful analysis to determine the optimal settings for a successful benchmark run. To reliably benchmark a storage solution, every stage of the solution needs to be analyzed, including block and file performance of the RAID, and network and client throughput to the entire filesystem and metadata servers. For a filesystem to perform at peak performance, there needs to be a balance between the actual performance of the disk drives, the SAS chain supporting the RAID sets, the RAID code used (whether hardware RAID controllers or software MD-RAID), the interconnect and finally the clients. This paper describes these issues with respect to the Lustre filesystem. The dependence of benchmark results on various parameters is shown.
Using a single storage enclosure consisting of 8 RAID sets (8+2 drives each), it is possible to achieve both read and write performance in excess of 6 GB/s, which translates to more than 36 GB/s per rack of measured client-based throughput. This paper will focus on using Linux performance tools, obdfilter-survey, and IOR to measure different levels of filesystem performance using Lustre.

A Clustered Manycore Processor Architecture for Embedded and Accelerated Applications
Benoit Dupont de Dinechin, Kalray
Abstract: The Kalray MPPA-256 processor integrates 256 user cores and 32 system cores on a chip with 28nm CMOS technology. Each core implements a 32-bit 5-issue VLIW architecture. These cores are distributed across 16 compute clusters of 16+1 cores, and 4 quad-core I/O subsystems. Each compute cluster and I/O subsystem owns a private address space, while communication and synchronization between them is ensured by data and control Networks-On-Chip (NoC). The MPPA-256 processor is also fitted with a variety of I/O controllers, in particular DDR, PCI, Ethernet, Interlaken and GPIO. We demonstrate that the MPPA-256 processor's clustered manycore architecture is effective on two different classes of applications: embedded computing, with the implementation of a professional H.264 video encoder that runs in real-time at low power; and high-performance computing, with the acceleration of a financial option pricing application. In the first case, a cyclostatic dataflow programming environment is exploited that automates application distribution over the execution resources. In the second case, an explicit parallel programming model based on POSIX processes, threads, and IPC is used.
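
As a quick consistency check on the MPPA-256 core counts quoted above, the following minimal Python sketch reproduces the totals from the per-cluster figures. The abstract gives only the totals and the cluster layout; treating the "+1" core in each compute cluster and all I/O subsystem cores as system cores is an assumption made here for illustration, and the constant and function names are hypothetical.

```python
# Consistency check for the MPPA-256 core counts stated in the abstract.
# Assumption (not stated in the abstract): the "+1" core of each compute
# cluster and every core of the quad-core I/O subsystems are system cores.

COMPUTE_CLUSTERS = 16          # "16 compute clusters"
USER_CORES_PER_CLUSTER = 16    # the "16" in "16+1 cores"
SYSTEM_CORES_PER_CLUSTER = 1   # the "+1" in "16+1 cores" (assumed system core)
IO_SUBSYSTEMS = 4              # "4 quad-core I/O subsystems"
CORES_PER_IO_SUBSYSTEM = 4     # "quad-core" (assumed system cores)


def core_counts():
    """Return (user_cores, system_cores) under the assumptions above."""
    user = COMPUTE_CLUSTERS * USER_CORES_PER_CLUSTER
    system = (COMPUTE_CLUSTERS * SYSTEM_CORES_PER_CLUSTER
              + IO_SUBSYSTEMS * CORES_PER_IO_SUBSYSTEM)
    return user, system


user_cores, system_cores = core_counts()
print(user_cores, system_cores, user_cores + system_cores)
```

Under these assumptions the sketch yields 256 user cores and 32 system cores, matching the totals in the abstract.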