2013 IEEE High Performance
Extreme Computing Conference
(HPEC ’13)
Seventeenth Annual HPEC Conference
10 - 12 September 2013
Westin Hotel, Waltham, MA USA
GPU Accelerated Blood Flow Computation using the Lattice Boltzmann Method
Cosmin Nita, Transilvania University of Brasov; Lucian Mihai Itu, Transilvania University of Brasov; Constantin Suciu,
Siemens Corporate Technology
Abstract: We propose a numerical implementation based on a Graphics Processing Unit (GPU) to accelerate the execution
of the Lattice Boltzmann Method (LBM). The study focuses on the application of the LBM to patient-specific blood flow
computations, and hence, to obtain higher accuracy, double precision computations are employed. The LBM specific operations are
grouped into two kernels, of which only one uses information from neighboring nodes. Since for blood flow computations
typically only 1/5 or less of the nodes represent fluid nodes, an indirect addressing scheme is used to reduce the memory
requirements. Three GPU cards are evaluated with different 3D benchmark applications (Poiseuille flow, lid-driven cavity flow, and flow
in an elbow-shaped domain), and the best performing card is used to compute blood flow in a patient-specific aorta geometry with
coarctation. The speed-up over a multi-threaded CPU code is 19.42x. The comparison with a basic GPU-based LBM implementation
demonstrates the importance of the optimizations.
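As a rough illustration of the indirect addressing idea above (our sketch, not code from the paper; the kernel and array names stream, fluid_pos, lattice_to_fluid, and c are hypothetical), the streaming step can look up neighbors through a lattice-to-fluid index map so that distributions are stored only for fluid nodes:

    // Hypothetical CUDA sketch of indirect addressing for a sparse fluid domain.
    // Distributions are packed per fluid node; fluid_pos maps a fluid node to its
    // linear lattice position and lattice_to_fluid maps back (-1 marks a solid node).
    // Assumes a one-cell solid padding so neighbor coordinates stay in range.
    #include <cuda_runtime.h>

    __global__ void stream(const double *f_in, double *f_out,
                           const int *fluid_pos,        // fluid node -> lattice index
                           const int *lattice_to_fluid, // lattice index -> fluid node
                           const int *c,                // q x 3 discrete velocity set
                           int nx, int ny, int n_fluid, int q)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_fluid) return;               // one thread per *fluid* node only

        int p = fluid_pos[i];                   // recover Cartesian coordinates
        int x = p % nx, y = (p / nx) % ny, z = p / (nx * ny);

        for (int k = 0; k < q; ++k) {
            int xn = x + c[3 * k], yn = y + c[3 * k + 1], zn = z + c[3 * k + 2];
            int j = lattice_to_fluid[(zn * ny + yn) * nx + xn]; // indirect lookup
            if (j >= 0)                          // neighbor is a fluid node
                f_out[(size_t)j * q + k] = f_in[(size_t)i * q + k];
            // bounce-back at walls omitted for brevity
        }
    }

Packing per-node data this way cuts distribution storage roughly five-fold when only 1/5 of the lattice is fluid, at the cost of one extra index lookup per neighbor access.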
Enhancing a Cross-Platform Hyperspectral Image Analysis Library for Heterogeneous CUDA Support
Brian Landrón, University of Puerto Rico; Jonathan Torres, University of Puerto Rico Mayagüez Campus; Nayda Santiago,
University of Puerto Rico Mayagüez Campus
Abstract: Modern hyperspectral image (HSI) analysis makes use of the high performance computing power provided by General
Purpose Graphics Processing Units (GPGPUs) due to the large volume of hyperspectral sensor data involved. HSI analysis applications
range in purpose from military reconnaissance devices to devices that aid medical diagnostics. Correctly implementing GPGPU-based
algorithms for these applications requires substantial learning and development effort. To aid rapid
prototyping of HSI analysis platforms that make use of GPGPUs, an open source software library, libdect, supported by the NVIDIA
Compute Unified Device Architecture (CUDA), is being improved. The libdect library includes an implementation of the Reed-Xiaoli (RX)
and Matched Filter (MF) target detection algorithms, and its infrastructure is supported by the CMake build system, incorporating
cross-platform compatibility into libdect. Testing of coding guidelines was integrated into libdect’s CMake infrastructure using KWStyle to help
maintain code quality, which can positively impact the software's life cycle. A build log is also supported with CTest and CDash to
facilitate development across heterogeneous platforms. This software library can only handle small data sets that allow the entire HSI
to be stored in a CUDA capable device's global memory as a single workload. In an effort to solve this issue, the libdect development
team is designing algorithms that partition the entire HSI analysis process into a set of smaller tasks that a CUDA capable device can
handle. The solution must choose an appropriate number of grids, blocks, and threads to efficiently handle each detector's
workload, and must determine which workloads are large enough to benefit from parallel execution. The libdect library is currently
in its prototype phase and will emerge as an open source, cross-platform, HSI analysis software library with heterogeneous CUDA
support. It will encapsulate the detection algorithm implementations, so that future developers need not concern themselves with
their internals while still balancing performance.
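As a loose sketch of the partitioning strategy being designed (our illustration only; detect, run_chunked, and the chunk sizing are hypothetical, not libdect's actual API), the host can stream fixed-size pixel chunks through device global memory and pick a grid/block shape per chunk:

    // Hypothetical sketch of chunked HSI processing; the kernel body is a
    // stand-in for per-pixel detector arithmetic, not libdect's RX or MF code.
    #include <cuda_runtime.h>
    #include <cstddef>

    __global__ void detect(const float *pixels, float *scores,
                           int n_pixels, int n_bands)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= n_pixels) return;
        float s = 0.0f;
        for (int b = 0; b < n_bands; ++b)       // placeholder per-pixel work
            s += pixels[(size_t)p * n_bands + b];
        scores[p] = s;
    }

    // Process an image too large for device memory in chunks of 'chunk' pixels,
    // each sized to fit global memory alongside its output buffer.
    void run_chunked(const float *img, float *out, size_t n_pixels,
                     int n_bands, size_t chunk)
    {
        float *d_pix, *d_out;
        cudaMalloc((void **)&d_pix, chunk * n_bands * sizeof(float));
        cudaMalloc((void **)&d_out, chunk * sizeof(float));
        for (size_t off = 0; off < n_pixels; off += chunk) {
            size_t n = (n_pixels - off < chunk) ? n_pixels - off : chunk;
            cudaMemcpy(d_pix, img + off * n_bands,
                       n * n_bands * sizeof(float), cudaMemcpyHostToDevice);
            int threads = 256;                   // a typical block size
            int blocks = (int)((n + threads - 1) / threads);
            detect<<<blocks, threads>>>(d_pix, d_out, (int)n, n_bands);
            cudaMemcpy(out + off, d_out, n * sizeof(float),
                       cudaMemcpyDeviceToHost);
        }
        cudaFree(d_pix);
        cudaFree(d_out);
    }

Choosing chunk so that chunk * n_bands * sizeof(float) fits comfortably in global memory is the constraint the abstract alludes to; whether a workload is large enough to merit a launch at all can be gated on n before the kernel call.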
Acceleration of Particle Physics using FPGA-based Vector Processors with Narrow Custom Vector ALU Operations
Aaron Severance, VectorBlox Computing, Inc.; Joe Edwards, VectorBlox Computing, Inc.; Guy Lemieux, VectorBlox Computing, Inc.
Abstract: Embedded systems with fast, low-latency data processing needs are often built around FPGAs. Since most FPGA-based
processors are slow, the computation is often implemented as custom logic, which is inflexible and difficult to develop, debug, and
maintain. In contrast, an FPGA-based soft vector processor can achieve high processing rates using wide, SIMD-style parallelism.
However, a key drawback of this SIMD parallelism is the high cost of replicating complex operators like square-root, divide, and
floating-point across all parallel lanes. Since these operators are infrequently used, they can be implemented as narrow, custom ALU operators
to achieve good performance when required, yet keep area overhead low. This paper demonstrates a novel way to connect narrow
custom vector operators with minimal overhead, and uses a particle physics algorithm as a case study to demonstrate how to
judiciously add custom vector ALU operators to achieve desired performance levels with limited resource usage.
Efficient Parallel Runtime Bounds Checking with the TAU Performance System
John Linford, ParaTools; Sameer Shende, ParaTools; Allen Malony, ParaTools; Andrew Wissink, Ames Research Center
Abstract: Memory errors, such as an invalid memory access, misaligned allocation, or write to deallocated memory, are among the
most difficult problems to debug because popular debugging tools do not fully support state inspection when examining failures. This is
particularly true for applications written in a combination of Python, C++, C, and Fortran. We present a tool that can help identify
and debug memory errors in a multi-language program at the point of failure. Integrated in the TAU Performance System®, this
debugging tool allocates pages of
protected memory immediately before and after dynamic memory allocations. Accessing these “guard pages” raises an error signal that
causes TAU to capture performance data at the point of failure, store detailed information for each frame in the callstack, and generate
a file that may be sent to the developers for analysis. The tool works on parallel programs, providing feedback about every process
regardless of whether it experienced the fault, and is useful to both software developers and users experiencing memory errors, since
the file output may be exchanged between the user and the development team without disclosing potentially sensitive application
data. This paper describes the tool and demonstrates its application to the multi-language CREATE-AV applications Kestrel and Helios.
Since those codes are export controlled, we present results from an analogous code written specifically for testing but with structure
and content derived from Helios and Kestrel. The analogous performance and debugging data closely match the data obtained from the
CREATE-AV codes.
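The guard-page mechanism itself is standard; a minimal sketch of the general technique (not TAU's actual implementation; guarded_alloc is a hypothetical name) looks like this:

    // Illustration of the general guard-page technique the abstract describes:
    // surround each dynamic allocation with PROT_NONE pages so an out-of-bounds
    // access raises SIGSEGV exactly at the faulting instruction.
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>

    void *guarded_alloc(size_t size)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t payload = (size + page - 1) / page * page; // round up to whole pages
        char *base = (char *)mmap(nullptr, payload + 2 * page,
                                  PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == (char *)MAP_FAILED) return nullptr;
        mprotect(base, page, PROT_NONE);                  // leading guard page
        mprotect(base + page + payload, page, PROT_NONE); // trailing guard page
        return base + page;   // touching either guard now faults immediately
        // matching guarded_free via munmap omitted for brevity
    }

A SIGSEGV handler installed by the tool can then walk the callstack and snapshot performance data at the faulting access, which is what enables the point-of-failure reports described above.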
Parallel CPU and GPU Computations to Solve the Job Shop Scheduling Problem with Blocking
Abdelhakim AitZai, USTHB; Adel Dabah, USTHB; Mourad Boudhar, USTHB
Abstract: In this paper, we study the parallelization of an exact method to solve the job shop scheduling problem with blocking (JSB).
We use a graph-theoretic model that exploits alternative graphs. We propose an original parallelization technique
for performing parallel computation across the various branches of the search tree. This technique is implemented on a computer
network, with no limit on the number of computers. Its advantage is a new concept: a logical ring combined with the
notion of a token. We also propose two different parallelization paradigms based on genetic algorithms: the first uses a network of
computers and the second uses a GPU with CUDA technology. The results are promising, showing a very significant
reduction in computation time compared to the sequential method.
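The abstract does not spell out the ring-and-token protocol. One classic pattern it resembles, shown below purely as a hedged illustration (the function and protocol details are our assumptions, not the paper's), is Dijkstra-style token termination detection on a logical ring of MPI ranks:

    // Hedged sketch of a ring-plus-token pattern for a distributed tree search.
    // Each rank explores branches and, when idle, forwards a colored token around
    // the logical ring so rank 0 can detect global termination.
    #include <mpi.h>

    enum { WHITE = 0, BLACK = 1 };

    // Returns 1 on rank 0 when a white token completes the round trip, i.e.,
    // no rank has produced new work since the token was injected.
    int idle_token_pass(int rank, int size, int *my_color)
    {
        int token = WHITE;
        if (size < 2) return 1;                  // single rank: trivially done
        if (rank == 0)                           // root injects a white token
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, (rank + size - 1) % size, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (*my_color == BLACK) token = BLACK;   // taint: this rank sent out work
        *my_color = WHITE;
        if (rank == 0)
            return token == WHITE;               // clean round trip: terminate
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        return 0;
    }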
Accelerating Monte Carlo Molecular Simulations Using Novel Extrapolation Schemes Combined with Fast Database Generation on
Massively Parallel Machines
Sahar Amir, KAUST; Ahmad Kadoura; Amgad Salama; Shuyu Sun, King Abdullah University of Science and Technology
Abstract: In this paper we introduce an efficient, thermodynamically consistent technique to extrapolate and interpolate normalized
canonical (NVT) ensemble averages for Lennard-Jones (L-J) fluids at different thermodynamic conditions from expensively simulated data
points, leading to a significant speed-up in generating intensive data. Preliminary results show promising applicability in oil and gas
modelling, where accurate determination of thermodynamic properties in reservoirs is challenging. The methods reweight and
reconstruct previously generated database values of Markov chains at neighbouring temperature and density conditions. To investigate
the efficiency of these methods, two databases corresponding to different combinations of normalized density and temperature are
generated. One contains 175 Markov chains with 10,000,000 MC cycle results and the other contains 3000 Markov chains with
61,000,000 MC cycle results. For such massive database creation, several parallelization algorithms have been investigated. The relative
errors of the thermodynamic extrapolation and interpolation schemes were assessed with respect to classical
interpolation and extrapolation.
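The paper's exact reweighting formulas are not given here, but a classic single-chain scheme in the same spirit (Ferrenberg-Swendsen-style reweighting, offered only as an illustration) estimates the average of an observable A at inverse temperature beta from samples generated at beta0 via weights w_i = exp(-(beta - beta0) * E_i):

    // Illustrative single-histogram reweighting: reuse a Markov chain generated
    // at inverse temperature beta0 to estimate an ensemble average at a nearby
    // beta. Not necessarily the paper's exact scheme.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    double reweighted_average(const std::vector<double> &A, // observable samples
                              const std::vector<double> &E, // sample energies
                              double beta0, double beta)
    {
        if (E.empty() || A.size() != E.size()) return 0.0;
        // Shift exponents by their maximum for numerical stability.
        double m = -(beta - beta0) * E[0];
        for (double e : E) m = std::max(m, -(beta - beta0) * e);
        double num = 0.0, den = 0.0;
        for (std::size_t i = 0; i < E.size(); ++i) {
            double w = std::exp(-(beta - beta0) * E[i] - m);
            num += A[i] * w;
            den += w;
        }
        return num / den;   // <A> at beta, estimated from the beta0 chain
    }

The accuracy of such a scheme degrades as beta moves away from beta0, which is why the databases above cover a grid of neighbouring temperature and density conditions.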
Big Data Analysis using Distributed Actors Framework
Sanjeev Mohindra, MIT Lincoln Laboratory
Abstract: The amount of data generated by sensors, machines, and individuals is increasing exponentially. The increasing volume,
velocity, and variety of data places a demand not only on the storage and compute resources, but also on the analysts entrusted with
the task of exploiting big data. Machine analytics can be used to ease the burden on analysts, who often need to compose
these machine analytics into workflows to process big data. Different parts of a workflow may need to run in different
geographical locations depending on the availability of data and compute resources. This paper presents a framework for composing
and executing big data analytics in batch, streaming, or interactive workflows across the enterprise.
Tuning HPEC Linux clusters for real-time determinism
David Tetley, GE Intelligent Platforms; Joe Rolfe, GE Intelligent Platforms
Abstract: There is a growing trend in the High Performance Embedded Computing market to take advantage of Linux cluster
architectures from the High Performance Computing (HPC) and Supercomputer communities. The adoption of embedded x86 and GPU
compute clusters tied together with a high speed, RDMA capable fabric brings a lot of compute capability to bear in form factors
suitable for the rugged embedded Military and Aerospace market. These HPC architectures are built on the Linux operating system and
offer the application developer a wide spectrum of open standard software. But how suitable are these platforms for applications
requiring real-time performance? This paper investigates interrupt response times and message passing latencies using OpenMPI on
three different versions of the Linux kernel: one ‘standard’ server grade, one server grade with real-time pre-empt patches applied, and
one with a proprietary, real-time micro kernel. In order to characterize these platforms, a series of measurements were made with and
without a background CPU load. This paper will also highlight some system and Linux kernel tuning techniques that can improve
determinism and affect system performance. The results are presented in a series of graphs showing histograms of interrupt response
and MPI message latencies under the various workloads and tuning scenarios. The measurements demonstrate interrupt response
latencies of less than 10 µs and MPI latencies of around 1 µs, which can meet the performance requirements of a wide range of
embedded applications. The results also show that ‘standard’ server grade distributions of Linux may be suitable for ‘soft’ real-time
applications but are not generally suited for ‘hard’ or ‘firm’ real-time applications, since interrupt response times cannot be guaranteed
and may be well beyond 200 µs. However, with some system and kernel tuning, even ‘standard’ server grade distributions can meet
‘soft’ real-time performance criteria. The addition of real-time pre-empt patches provides a significant improvement to determinism
that can address the needs of many applications, but for true ‘hard’ real-time behavior a real-time flavor of Linux with micro-kernel
modifications in addition to the pre-empt patches is needed. With these real-time Linux kernels, deterministic behavior can be
guaranteed, making them suitable for the real-time system control aspects of an HPEC system as well as for the signal, image, sensor,
and data processing back end where they are typically deployed. The advent of these capabilities moves ruggedized HPEC into a new
era built around a Modular Open System Approach (MOSA), enabling shortened software development cycles through open standard
middleware and common APIs, thus protecting software investment over time and reducing technical risk, program development
schedules, and cost.
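For readers reproducing such measurements, two of the standard userspace tuning steps (general Linux practice; the paper's specific configuration is not reproduced here) are to move the latency-critical thread onto a real-time scheduling policy and to lock its memory:

    // Standard Linux real-time tuning steps, shown as general practice rather
    // than configuration taken from the paper: run the latency-critical thread
    // under SCHED_FIFO and lock all pages in RAM so page faults cannot add
    // jitter. Requires root or CAP_SYS_NICE / CAP_IPC_LOCK privileges.
    #include <sched.h>
    #include <sys/mman.h>
    #include <cstdio>

    int enter_realtime(int priority)             // e.g. 1..99 for SCHED_FIFO
    {
        struct sched_param sp;
        sp.sched_priority = priority;
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");        // insufficient privileges?
            return -1;
        }
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");                  // prevent page-fault jitter
            return -1;
        }
        return 0;                                // caller can now measure latency
    }

Kernel-side measures such as CPU isolation, interrupt affinity, and the real-time pre-empt patches discussed in the paper complement these userspace steps.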