GPU Accelerated Blood Flow Computation using the Lattice Boltzmann Method
Cosmin Nita, Transilvania University of Brasov; Lucian Mihai Itu, Transilvania University of Bra; Constantin Suciu, Siemens Corporate Technology
Abstract: We propose a Graphics Processing Unit (GPU) based numerical implementation that reduces the execution time of the Lattice Boltzmann Method (LBM). The study focuses on the application of the LBM to patient-specific blood flow computations; hence, to obtain higher accuracy, double precision computations are employed. The LBM-specific operations are grouped into two kernels, only one of which uses information from neighboring nodes. Since in blood flow computations typically only 1/5 or fewer of the nodes are fluid nodes, an indirect addressing scheme is used to reduce the memory requirements. Three GPU cards are evaluated with different 3D benchmark applications (Poiseuille flow, lid-driven cavity flow and flow in an elbow-shaped domain), and the best performing card is used to compute blood flow in a patient-specific aorta geometry with coarctation. The speed-up over a multi-threaded CPU code is 19.42x. The comparison with a basic GPU-based LBM implementation demonstrates the importance of the optimization activities.

Enhancing a Cross-Platform Hyperspectral Image Analysis Library for Heterogeneous CUDA Support
Brian Landrón, University of Puerto Rico; Jonathan Torres, University of Puerto Rico Mayagüez Campus; Nayda Santiago, University of Puerto Rico Mayagüez Campus
Abstract: Modern hyperspectral image (HSI) analysis makes use of the high performance computing power provided by General Purpose Graphics Processing Units (GPGPUs) due to the large volume of hyperspectral sensor data involved. HSI analysis applications range in purpose from military reconnaissance devices to devices that aid medical diagnostics.
These applications require a substantial amount of learning and development effort in order to implement GPGPU-based algorithms correctly. To aid rapid prototyping of HSI analysis platforms that make use of GPGPUs, an open source software library, libdect, supported by the NVIDIA Compute Unified Device Architecture (CUDA), is being improved. The libdect library includes implementations of the Reed-Xiaoli (RX) and Matched Filter (MF) target detection algorithms, and its infrastructure is supported by the CMake build system, which gives libdect cross-platform compatibility. Testing of coding guidelines was integrated into libdect's CMake infrastructure using KWStyle to help maintain code quality, which can positively impact the software's life cycle. A build log is also supported with CTest and CDash to facilitate development across heterogeneous platforms. The library can currently handle only small data sets that allow the entire HSI to be stored in a CUDA-capable device's global memory as a single workload. To solve this issue, the libdect development team is designing algorithms that partition the entire HSI analysis process into a set of smaller tasks that a CUDA-capable device can handle. The solution must choose an appropriate number of grids, blocks, and threads to handle each detector's workloads efficiently, and must determine which workloads are large enough to benefit from parallel execution. The libdect library is currently in its prototype phase and will emerge as an open source, cross-platform HSI analysis software library with heterogeneous CUDA support.
It will encapsulate the detection algorithm implementations so that future developers need not be concerned with them, while balancing performance.

Acceleration of Particle Physics using FPGA-based Vector Processors with Narrow Custom Vector ALU Operations
Aaron Severance, VectorBlox Computing, Inc.; Joe Edwards, VectorBlox Computing, Inc.; Guy Lemieux, VectorBlox Computing, Inc.
Abstract: Embedded systems with fast, low-latency data processing needs are often built around FPGAs. Since most FPGA-based processors are slow, the computation is often implemented as custom logic, which is inflexible and difficult to develop, debug and maintain. In contrast, an FPGA-based soft vector processor can achieve high processing rates using wide, SIMD-style parallelism. However, a key drawback of this SIMD parallelism is the high cost of replicating complex operators like square-root, divide, and floating-point across all parallel lanes. Since these operators are infrequently used, they can be implemented as narrow, custom ALU operators to achieve good performance when required, yet keep area overhead low. This paper demonstrates a novel way to connect narrow custom vector operators with minimal overhead, and uses a particle physics algorithm as a case study to demonstrate how to judiciously add custom vector ALU operators to achieve desired performance levels with limited resource usage.

Efficient Parallel Runtime Bounds Checking with the TAU Performance System
John Linford, ParaTools; Sameer Shende, ParaTools; Allen Malony, ParaTools; Andrew Wissink, Ames Research Center
Abstract: Memory errors, such as an invalid memory access, misaligned allocation, or write to deallocated memory, are among the most difficult problems to debug because popular debugging tools do not fully support state inspection when examining failures. This is particularly true for applications written in a combination of Python, C++, C, and Fortran.
We present a tool that can help identify and debug memory errors in a multi-language program at the point of failure. Integrated in the TAU Performance System®, this debugging tool allocates pages of protected memory immediately before and after dynamic memory allocations. Accessing these “guard pages” raises an error signal that causes TAU to capture performance data at the point of failure, store detailed information for each frame in the callstack, and generate a file that may be sent to the developers for analysis. The tool works on parallel programs, providing feedback about every process regardless of whether it experienced the fault. It is useful both to software developers and to users experiencing memory errors, as the file output may be exchanged between the user and the development team without disclosing potentially sensitive application data. This paper describes the tool and demonstrates its application to the multi-language CREATE-AV applications Kestrel and Helios. Since those codes are export controlled, we present results from an analogous code written specifically for testing but with structure and content derived from Helios and Kestrel. The analogous performance and debugging data closely match the data obtained from the CREATE-AV codes.

Parallel CPU and GPU Computations to Solve the Job Shop Scheduling Problem with Blocking
Abdelhakim AitZai, USTHB; Adel Dabah, USTHB; Mourad Boudhar, USTHB
Abstract: In this paper, we studied the parallelization of an exact method for solving the job shop scheduling problem with blocking (JSB). We used a graph-theoretic model based on alternative graphs. We proposed an original parallelization technique that performs parallel computation in the various branches of the search tree. This technique is implemented on a network of computers, where the number of computers is not limited. Its advantage is that it uses a new concept: a logical ring combined with the notion of a token.
We also proposed two different parallelization paradigms based on genetic algorithms. The first uses a network of computers and the second uses a GPU with CUDA technology. The results are very interesting. In addition, we see a very significant reduction in computation time compared to the sequential method.

Accelerating Monte Carlo Molecular Simulations Using Novel Extrapolation Schemes Combined with Fast Database Generation on Massively Parallel Machines
Sahar Amir, KAUST; Ahmad Kadoura; Amgad Salama; Shuyu Sun, King Abdullah University of Science and Technology
Abstract: In this paper we introduce an efficient, thermodynamically consistent technique to extrapolate and interpolate normalized canonical NVT ensemble averages for Lennard-Jones (L-J) fluids at different thermodynamic conditions from expensively simulated data points, leading to a significant speed-up in generating intensive data. Preliminary results show promising applicability in oil and gas modelling, where accurate determination of thermodynamic properties in reservoirs is challenging. The methods reweight and reconstruct previously generated database values of Markov chains at neighbouring temperature and density conditions. To investigate the efficiency of these methods, two databases corresponding to different combinations of normalized density and temperature are generated. One contains 175 Markov chains with 10,000,000 MC cycle results and the other contains 3000 Markov chains with 61,000,000 MC cycle results. For such massive database creation, some parallelization algorithms have been investigated. The relative errors of the thermodynamic extrapolation and thermodynamic interpolation schemes were investigated with respect to classical interpolation and extrapolation.

Big Data Analysis using Distributed Actors Framework
Sanjeev Mohindra, MIT Lincoln Laboratory
Abstract: The amount of data generated by sensors, machines, and individuals is increasing exponentially.
The increasing volume, velocity, and variety of data place demands not only on storage and compute resources, but also on the analysts entrusted with the task of exploiting big data. Machine analytics can be used to ease the burden on the analyst. Analysts often need to compose these machine analytics into workflows to process big data. Different parts of a workflow may need to run in different geographical locations, depending on the availability of data and compute resources. This paper presents a framework for composing and executing big data analytics in batch, streaming, or interactive workflows across the enterprise.

Tuning HPEC Linux clusters for real-time determinism
David Tetley, GE Intelligent Platforms; Joe Rolfe, GE Intelligent Platforms
Abstract: There is a growing trend in the High Performance Embedded Computing (HPEC) market to take advantage of Linux cluster architectures from the High Performance Computing (HPC) and Supercomputer communities. The adoption of embedded x86 and GPU compute clusters tied together with a high speed, RDMA-capable fabric brings a lot of compute capability to bear in form factors suitable for the rugged embedded Military and Aerospace market. These HPC architectures are built on the Linux operating system and offer the application developer a wide spectrum of open standard software. But how suitable are these platforms for applications requiring real-time performance? This paper investigates interrupt response times and message passing latencies using OpenMPI on three different versions of the Linux kernel: one ‘standard’ server grade, one server grade with real-time pre-empt patches applied, and one with a proprietary, real-time micro kernel. In order to characterize these platforms, a series of measurements were made with and without a background CPU load. This paper will also highlight some system and Linux kernel tuning techniques that can improve determinism and affect system performance.
The results are presented in a series of graphs showing histograms of interrupt response and MPI message latencies under the various workloads and tuning scenarios. The measurements demonstrate interrupt response latencies of less than 10 µs and MPI latencies of around 1 µs, which can meet the performance requirements of a wide range of embedded applications. The results also show that ‘standard’ server grade distributions of Linux may be suitable for ‘soft’ real-time applications but are not generally suited for ‘hard’ or ‘firm’ real-time applications, since interrupt response times cannot be guaranteed and may be well beyond 200 µs. However, with some system and kernel tuning, even ‘standard’ server grade distributions can meet ‘soft’ real-time performance criteria. The addition of real-time pre-empt patches provides a significant improvement to determinism that can address the needs of many applications, but for true ‘hard’ real-time behavior a real-time flavor of Linux with micro-kernel modifications in addition to the pre-empt patches is needed. With these real-time Linux kernels, deterministic behavior can be guaranteed, making them suitable for the real-time system control aspects of an HPEC system as well as for the signal, image, sensor and data processing back end where they are typically deployed. The advent of these capabilities moves ruggedized HPEC into a new era built around a Modular Open System Approach (MOSA), enabling shortened software development cycles by using open standard middleware and common APIs, thus protecting software investment over time and reducing technical risk, program development schedules and cost.