A Mechanism to Improve the Performance of Hybrid MPI-OpenMP Applications in Grid
Shikha Mehrotra, C-DAC; Shamjith K V, C-DAC; Prachi Pandey, C-DAC; Asvija B, C-DAC; Sridharan R, C-DAC
Abstract: In the current scenario of grid computing, heterogeneous resources are distributed across different administrative domains and geographical boundaries. Every node in a cluster consists of multicore CPUs, so that distributed memory across nodes co-exists with shared memory within a node, paving the way for hybrid architectures. The hybrid programming approach combines the MPI and OpenMP libraries to exploit this hierarchical multicore architecture. A clear statement of such a hybrid application's requirements, together with knowledge of the system architecture, helps boost application performance. Scheduling these hybrid applications on the grid is therefore a critical task for obtaining better performance. In this paper, we outline our attempt to improve the scheduling mechanism for hybrid applications based on the requirements of the application.

Expanding the High Performance Embedded Computing Tool Chest - Mixing C and Java™
Nazario Irizarry, The MITRE Corporation
Abstract: High performance embedded computing systems are often implemented in the C language to achieve the utmost in speed. In light of continued budget reductions and the ever-present desire for quicker development timelines, safer and more productive languages need to be used as well. Java is often overlooked due to the perception that it is slow. Oracle Java 7 and C were compared to better understand their relative performance in single and multicore applications. Java performed as well as C in many of the tests.
The quantitative findings, and the conditions under which Java performs well, help in designing solutions that exploit Java's code safety and productivity.

Re-Introduction of Communication-Avoiding FMM-Accelerated FFTs with GPU Acceleration
M Harper Langston, Reservoir Labs; Muthu Baskaran, Reservoir Labs; Benoit Meister, Reservoir Labs; Nicolas Vasilache, Reservoir Labs; Richard Lethin, Reservoir Labs
Abstract: As distributed memory systems grow larger, communication demands have increased. Unfortunately, while the costs of arithmetic operations continue to decrease rapidly, communication costs have not. As a result, there has been growing interest in communication-avoiding algorithms for some of the classic problems in numerical computing. For example, there have been exciting new innovations in the development of communication-avoiding Fast Fourier Transforms (FFTs). A previously developed low-communication FFT, however, has remained largely out of the picture, partly due to its reliance on the Fast Multipole Method (FMM), an algorithm that accelerates dense computations. In light of renewed interest in this method and other low-communication FFTs, we have begun an algorithmic investigation and re-implementation design for the FMM-FFT. This design exploits the ability, inherent in the mathematical nature of the FMM, to tune the precision of the result in order to reduce power-burning communication and computation, with the potential benefit of reducing the energy required for the fundamental transform of digital signal processing.
We reintroduce this algorithm and discuss new innovations we have developed to separate the distinct portions of the FMM into a CPU-dedicated process, which relies on inter-processor communication for approximate interactions, and a GPU-dedicated process for dense interactions with no communication.

Accelerating a Novel Particle-based Fluid Simulation on the GPU
Zhilu Chen, WPI; James Kingsley, WPI; Xinming Huang, Worcester Polytechnic Institute; Erkan Tuzel
Abstract: Stochastic Rotation Dynamics (SRD) is a novel particle-based simulation method that can be used to model complex fluids, such as binary and ternary mixtures and polymer solutions, in either two or three dimensions. Although SRD is efficient compared to traditional methods, it is still computationally expensive for large system sizes, e.g. when using a large array of particles to simulate dense polymer solutions. Recently, as the power offered by Graphics Processing Units (GPUs) has risen, General-Purpose GPU (GPGPU) computing has been introduced as an effective way to improve performance for parallel computation tasks. This work focuses on the acceleration of SRD simulations using Nvidia's GPGPU architecture, CUDA. We find that while the speed improvements delivered by GPU acceleration vary with the simulation version and parameters used, our GPU implementation runs around 10 times faster than the CPU version for basic simulations, and up to 50 times faster for polymers in solution.

GPU Accelerated Elevation Map based Registration of Aerial Images
Joseph Fernando, University of Dayton Research
Abstract: This paper proposes a lower-latency implementation of the georegistration algorithm proposed by Jovanovic et al. The algorithm has been modified to mitigate registration errors and has been parallelized to map to a Graphics Processing Unit (GPU). In addition, the target-image offset and painting-value computations have been combined into a single loop to eliminate the use of shared memory.
The equations and the algorithm required to generate accurate orthorectified and georegistered images from digital satellite images and aerial photographs are presented. The modified algorithm has been implemented in the Compute Unified Device Architecture (CUDA) to reduce latency. A fixed coordinate system is used to represent the image, focal, and projection planes. Experimental results show that the proposed algorithm is capable of generating accurate georegistered images for high-flying airborne vehicles. While this method has been tested using aerial photographs, it can be extended to satellite images as well as other image data. A speedup of over 10x has been achieved over the CPU version.