2012 IEEE High Performance
Extreme Computing Conference
(HPEC ’12)
Sixteenth Annual HPEC Conference
10 - 12 September 2012
Westin Hotel, Waltham, MA USA
A Third Generation Many-Core Processor for Secure Embedded Computing Systems
John Irza*, Coherent Logix, Inc.
Abstract:
As compute-intensive products proliferate, there is an ever-growing need to provide security features to detect tampering, identify cloned or counterfeit hardware, and deter cybersecurity threats. This paper describes the security features of the third-generation 100-core HyperX™ processor, which addresses these needs. Programmable security barriers allow the processor to implement a red-black System-on-Chip solution.
The implementation of Physically Unclonable Functions (PUFs), encryption/decryption engines, a secure boot controller, and anti-tamper features
enable the engineer to realize a secure embedded computing solution in an ultra-low power, many-core, C programmable processor-memory
network.
:::::::
Exploiting SPM-aware Scheduling on EPIC Architectures for High-Performance Real-Time Systems
Wei Zhang*, Virginia Commonwealth University
Abstract:
In contemporary computer architectures, Explicitly Parallel Instruction Computing (EPIC) permits microprocessors to implement Instruction Level Parallelism (ILP) by using the compiler, rather than the complex on-die circuitry that controls parallel instruction execution in superscalar architectures. Building on EPIC, this paper proposes a time-predictable two-level scratchpad-based memory architecture, together with an ILP-based static memory-object assignment algorithm in the compiler that preserves the time predictability of the scratchpad memories. Then, to exploit the load/store latencies that are statically known in this architecture, we study a Scratchpad-aware Scheduling method that improves performance by optimizing the Load-To-Use Distance. Our experimental results indicate that Scratchpad-aware Scheduling improves the performance of the two-level scratchpad-based architecture on EPIC processors while preserving time predictability.
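The core idea can be illustrated with a minimal sketch: compute the load-to-use distance of each load in a straight-line schedule, then hoist loads past independent instructions so that the statically known scratchpad latency is hidden. The instruction format, the independence test, and the greedy hoisting pass below are illustrative assumptions, not the paper's compiler algorithm.

```python
# Minimal sketch (not the paper's algorithm): measure and improve load-to-use
# distance in a straight-line schedule. Instruction format is an assumption.
from dataclasses import dataclass

@dataclass
class Instr:
    op: str      # e.g. "load", "add", "mul"
    dst: str     # destination register
    srcs: tuple  # source registers

def load_to_use_distance(schedule, load_idx):
    """Slots between a load and the first instruction that reads its result."""
    dst = schedule[load_idx].dst
    for j in range(load_idx + 1, len(schedule)):
        if dst in schedule[j].srcs:
            return j - load_idx
    return len(schedule) - load_idx   # result never used in this block

def independent(a, b):
    """True if a can move above b: no true, anti, or output dependence."""
    return a.dst not in b.srcs and b.dst not in a.srcs and a.dst != b.dst

def hoist_loads(schedule):
    """Greedily move each load earlier past independent instructions to
    lengthen its load-to-use distance (memory dependences are ignored here
    for brevity; a real compiler pass would respect them)."""
    sched = list(schedule)
    for i, ins in enumerate(sched):
        if ins.op != "load":
            continue
        k = i
        while k > 0 and independent(sched[k], sched[k - 1]):
            sched[k - 1], sched[k] = sched[k], sched[k - 1]
            k -= 1
    return sched

if __name__ == "__main__":
    block = [
        Instr("add",  "r1", ("r2", "r3")),
        Instr("mul",  "r4", ("r2", "r2")),
        Instr("load", "r5", ("r6",)),       # load from address in r6
        Instr("add",  "r7", ("r5", "r1")),  # first use of the loaded value
    ]
    print("before:", load_to_use_distance(block, 2))
    new_block = hoist_loads(block)
    load_idx = next(i for i, ins in enumerate(new_block) if ins.op == "load")
    print("after: ", load_to_use_distance(new_block, load_idx))
```

In this example block, hoisting raises the load-to-use distance from one slot to three, giving a statically known scratchpad access that much more time to complete.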
:::::::
Using Copper Water Loop Heat Pipes to Efficiently Cool CPUs and GPUs
Stephen Fried*, Microway Inc. and Passive Thermal Technology
Abstract:
As the amount of power being rejected by 1U servers starts to approach and exceed 2 kW, the question in HPC continues to be not only how we can cool devices that reject this amount of heat, but also how we can reject that heat efficiently.
:::::::
High locality and increased intra-node parallelism for solving finite element models on GPUs by novel element-by-element
implementation
Zsolt Badics*, Tensor Research LLC
Abstract:
The utilization of Graphical Processing Units (GPUs) for the element-by-element (EbE) finite element method (FEM) is demonstrated. EbE FEM
is a long-known technique by which a conjugate gradient (CG) type iterative solution scheme can be entirely decomposed into computations at the element level, i.e., without assembling the global system matrix. In our implementation, NVIDIA’s parallel computing solution, the Compute Unified Device Architecture (CUDA), is used to perform the required element-wise computations in parallel. Since element matrices need not be stored, the memory requirement can be kept extremely low. It is shown that this low-storage but computation-intensive technique is better suited for GPUs than techniques requiring the massive manipulation of large data sets. This first study of the proposed parallel model illustrates a highly
improved locality and minimization of data movement, which could also significantly reduce energy consumption in other HPC architectures.
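A minimal NumPy sketch conveys the structure of a matrix-free, element-by-element CG solve: the global matrix-vector product is accumulated from small per-element products, so the global system matrix is never assembled. The 1D linear-element problem below is an illustrative stand-in; the paper performs the same element-wise products with CUDA kernels on the GPU.

```python
# Minimal sketch of element-by-element (matrix-free) conjugate gradient.
# The 1D element matrices are illustrative only, not the paper's problem.
import numpy as np

def ebe_matvec(p, elements, Ke):
    """y = A @ p without forming A: accumulate each element's contribution."""
    y = np.zeros_like(p)
    for dofs in elements:          # dofs: global node indices of this element
        y[dofs] += Ke @ p[dofs]    # small dense element matvec
    return y

def ebe_cg(b, elements, Ke, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - ebe_matvec(x, elements, Ke)
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        Ap = ebe_matvec(p, elements, Ke)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

if __name__ == "__main__":
    # 1D model problem -u'' + u = 1 with natural BCs; exact FE solution u = 1.
    n_elem = 8
    h = 1.0 / n_elem
    Ke = ((1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
          + (h / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]]))   # stiffness + mass
    elements = [np.array([e, e + 1]) for e in range(n_elem)]
    b = np.zeros(n_elem + 1)
    for dofs in elements:
        b[dofs] += h / 2.0         # consistent load vector for f = 1
    print(ebe_cg(b, elements, Ke))  # prints values close to 1.0 at every node
```

On the GPU, the per-element products inside `ebe_matvec` are the independent work items; only the accumulation into the global vector needs coordination.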
:::::::
Accelerating Fully Homomorphic Encryption Using GPU
Wei Wang, ECE, Worcester Polytechnic Institute; Yin Hu, ECE, Worcester Polytechnic Institute; Lianmu Chen, ECE, Worcester Polytechnic
Institute; Xinming Huang*, ECE, Worcester Polytechnic Institute; Berk Sunar, ECE, Worcester Polytechnic Institute
Abstract:
In a major breakthrough, in 2009 Gentry introduced the first plausible construction of a fully homomorphic encryption (FHE) scheme. FHE allows
the evaluation of arbitrary functions directly on encrypted data on untrusted servers. In 2010, Gentry and Halevi presented the first FHE implementation on an IBM x3500 server. However, this implementation remains impractical due to the high latency of encryption and recryption. The Gentry-Halevi (GH) FHE primitives utilize multi-million-bit modular multiplications and additions – time-consuming tasks for general-purpose processors. In the GH-FHE implementation, the most computation-intensive arithmetic operation is modular multiplication. In this paper, the million-bit modular multiplication is calculated in two steps: large-number multiplication and modular reduction. Strassen’s FFT-based algorithm is used so that graphics processing units (GPUs) can employ massive parallelism to accelerate the large-number multiplication. The Barrett modular reduction algorithm is then used to complete the modular multiplication. We implemented the encryption, decryption and recryption primitives on the NVIDIA C2050. Experimental results show speedups of up to 7.68, 7.4 and 6.59 for encryption, decryption and recryption, respectively, when compared to the GH implementation for the small setting in dimension 2048.
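The reduction step can be sketched with Python's arbitrary-precision integers: Barrett reduction replaces the per-operation division by the modulus with a multiplication by a precomputed constant and a shift. The modulus size below is illustrative, and Python's built-in bignum multiply stands in for the Strassen FFT-based multiplication that the GPU implementation uses.

```python
# Hedged sketch: Barrett reduction over Python big integers. Parameter sizes
# are illustrative; the GPU code works on multi-million-bit operands.
import random

class BarrettReducer:
    """Reduce x mod m for 0 <= x < m**2 without a per-call division by m."""
    def __init__(self, m):
        self.m = m
        self.k = m.bit_length()
        self.mu = (1 << (2 * self.k)) // m   # precomputed once per modulus

    def reduce(self, x):
        q = (x * self.mu) >> (2 * self.k)    # estimate of x // m
        r = x - q * self.m
        while r >= self.m:                   # at most a few corrections
            r -= self.m
        return r

def modmul(a, b, reducer):
    """Modular multiplication = large-number multiply + Barrett reduction."""
    return reducer.reduce(a * b)             # bignum multiply stands in for FFT

if __name__ == "__main__":
    random.seed(0)
    m = random.getrandbits(4096) | (1 << 4095) | 1   # illustrative odd modulus
    red = BarrettReducer(m)
    a, b = random.getrandbits(4096) % m, random.getrandbits(4096) % m
    assert modmul(a, b, red) == (a * b) % m
    print("Barrett reduction matches direct reduction")
```

The precomputation of `mu` is done once per modulus, so each modular multiplication costs one big multiply, one multiply-and-shift, and a couple of subtractions, all of which parallelize well.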
:::::::
Use of CUDA for the Continuous Space Language Model
Elizabeth Thompson*, Purdue University Fort Wayne
Abstract:
The training phase of the Continuous Space Language Model (CSLM) was implemented in the NVIDIA hardware/software architecture Compute
Unified Device Architecture (CUDA). The implementation was accomplished using a combination of CUBLAS library routines and CUDA kernel calls on three different CUDA-enabled devices of varying compute capability, and a time savings over the traditional CPU approach was demonstrated.
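The computation that CUBLAS accelerates is essentially a pair of dense matrix products per training batch. The sketch below shows a forward pass of a standard feed-forward CSLM in NumPy; the layer sizes, context length, and shortlist size are illustrative assumptions rather than the configuration used in the paper.

```python
# Hedged sketch: the dense matrix products that dominate CSLM training,
# written with NumPy; in the paper these map onto CUBLAS GEMM calls and
# custom CUDA kernels. All dimensions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
vocab, ctx, proj, hidden, shortlist = 10_000, 3, 128, 256, 2_000

E  = rng.normal(scale=0.01, size=(vocab, proj))          # projection (embedding) table
W1 = rng.normal(scale=0.01, size=(ctx * proj, hidden))   # hidden-layer weights
W2 = rng.normal(scale=0.01, size=(hidden, shortlist))    # output-layer weights

def forward(batch_contexts):
    """batch_contexts: (batch, ctx) word indices -> softmax probabilities."""
    x = E[batch_contexts].reshape(len(batch_contexts), -1)  # (batch, ctx*proj)
    h = np.tanh(x @ W1)                                     # GEMM 1
    logits = h @ W2                                         # GEMM 2
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

probs = forward(rng.integers(0, vocab, size=(64, ctx)))
print(probs.shape)   # (64, 2000); each row sums to 1
```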
:::::::
Graph Programming Model - An Efficient Approach for Sensor Signal Processing Domain
Steve Kirsch*, Raytheon
Abstract:
The HPC community has struggled to find an optimal parallel programming model that can efficiently expose algorithmic parallelism in a
sequential program and automate the implementation of a highly efficient parallel program. A plethora of parallel programming languages have
been developed along with sophisticated compilers and runtimes, but none of these approaches has been successful enough to become the de facto standard. The Graph Programming Model has the capability and efficiency to become that ubiquitous standard for the signal processing domain.
:::::::
An Application of Constraint Programming to the Design and Operation of Synthetic Aperture Radars
Michael Holzrichter*, Sandia National Laboratories
Abstract:
The design and operation of synthetic aperture radars require compatible sets of hundreds of quantities. Compatibility is achieved when these
quantities satisfy constraints arising from physics, geometry, etc. In the aggregate, these quantities and constraints form a logical model of the radar. In practice, the logical model is distributed over multiple people, documents and software modules, thereby becoming fragmented.
Fragmentation gives rise to inconsistencies and errors. The SAR Inference Engine addresses the fragmentation problem by implementing the
logical model of a Sandia synthetic aperture radar in a form that is intended to be usable from system design to mission planning to actual
operation of the radar. These diverse contexts require extreme flexibility that is achieved by employing the constraint programming paradigm.
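A toy example conveys how constraint programming keeps such quantities consistent: each textbook radar relation is a bidirectional constraint that fills in whichever quantity is unknown, and propagation runs to a fixed point. The two relations and the quantity names below are illustrative, not the SAR Inference Engine's actual model.

```python
# Hedged illustration (not the SAR Inference Engine): a few textbook radar
# relations expressed as bidirectional constraints, propagated to a fixed point.
C = 299_792_458.0   # speed of light, m/s

def reciprocal_pair(a, b, const):
    """Constraint a * b == const; fills in whichever of a, b is unknown."""
    def propagate(q):
        if q[a] is None and q[b] is not None:
            q[a] = const / q[b]
            return True
        if q[b] is None and q[a] is not None:
            q[b] = const / q[a]
            return True
        return False   # nothing new derivable (a full engine would also check consistency)
    return propagate

quantities = {
    "bandwidth_hz": 500e6,
    "range_resolution_m": None,
    "prf_hz": 1500.0,
    "unambiguous_range_m": None,
}
constraints = [
    reciprocal_pair("range_resolution_m", "bandwidth_hz", C / 2),   # dr = c / (2 B)
    reciprocal_pair("unambiguous_range_m", "prf_hz", C / 2),        # Ru = c / (2 PRF)
]

# sweep until no constraint adds information
changed = True
while changed:
    changed = any(con(quantities) for con in constraints)

for name, value in quantities.items():
    print(f"{name}: {value:,.3f}")
```

Scaling this idea to hundreds of quantities, with constraints owned by a single shared model rather than scattered across documents and code, is what removes the fragmentation described above.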
:::::::
LLMORE: A Framework for Data Mapping and Architecture Analysis
Michael Wolf*, MIT Lincoln Laboratory
Abstract:
We outline our recent efforts in developing MIT Lincoln Laboratory’s Mapping and Optimization Runtime Environment (LLMORE). The LLMORE
framework consists of several components that together estimate and optimize performance-critical sections of an application. This framework
can be used to improve the performance of parallel applications and as an important tool for analyzing different hardware architectures. In this
paper, we describe the use cases that have driven the development of LLMORE. We also give two concrete examples of how LLMORE can be
used to improve the parallel performance of a numerical operation and characterize the power efficiency of numerical algorithms and computer
architectures.
:::::::
Unitary Qubit Lattice Algorithm for Two-Component Bose-Einstein Condensate Gases: the Kelvin-Helmholtz and Counter-Superflow
Instabilities
George Vahala*, William & Mary
Abstract:
A unitary qubit lattice algorithm, employing four qubits per lattice site, is introduced to model a set of coupled Bose-Einstein condensates (BECs)
described by the Gross-Pitaevskii (GP) equation for the ground state wave functions. Using a series of unitary collide-stream-rotate operators,
the ideally parallelized (tested to over 210,000 cores) mesoscopic algorithm recovers the coupled GP equations in the diffusion limit. Both the
quantum Kelvin-Helmholtz (KH) and quantum counter-superflow instabilities will be examined on high resolution grids for both 2D and 3D. With
mean velocity shear between the two components, the Kelvin-Helmholtz and counter-superflow instabilities are driven. Recent 2D simulations of
Tsubota et al. [1] on such BECs, using pseudospectral codes, have uncovered novel features not seen in the classical analogues of these instabilities. In particular, as the shear velocity interface forms a sawtooth oscillation of increasing amplitude, quantum vortices are spun off the crests and troughs and propagate within their own condensates, and so stabilize the KH instability. For thicker 2D interface boundaries, the two-stream counterflow instability leads to the creation of quantum vortex pairs with complex dynamical behavior. These results will first be verified by our 2D qubit algorithms and then extended to 3D, where the quantum vortices can now interact strongly and undergo reconnection and loop
ejection. Because the qubit algorithms are so well parallelized, detailed 3D structures will be examined with excellent spatial resolution. The
principal significance to DoD is in the development of unitary qubit codes that are immediately portable to quantum computers as they come
online. It aids in the interplay between quantum and classical turbulence and in the control of BECs. [1] H. Takeuchi, N. Suzuki, K. Kasamatsu, H.
Saito and M. Tsubota, Phys. Rev. B81, 094517 (2010).
:::::::
Early Experiences with Energy-Aware Scheduling
Kathleen Smith*, ARL DSRC
Abstract:
This paper documents the early experiences and recent progress with employing the Energy-Aware Scheduler (EAS) at the DoD
Supercomputing Resource Centers (DSRC). The U.S. Army Research Laboratory (ARL) has partnered with Lockheed Martin, Altair, and
Instrumental to assess feasibility on current DSRC High Performance Computing (HPC) systems. Developmental work was completed on the
ARL DSRC Test and Development systems and ported to the production systems at the ARL DSRC. The EAS is written in Python and works with the current program-wide scheduler, Altair PBS Professional, which is deployed across the DSRCs. EAS reduces power and cooling costs by intelligently powering off compute nodes that are neither in use by currently running jobs nor reserved for near-future jobs. It has been estimated that the Energy-Aware Scheduler could potentially save millions of kilowatt-hours each year throughout the program. We will describe
the extent of our work to date at the DSRC centers and our plans to complete our work by September 30, 2012.
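The scheduling decision at the heart of EAS can be sketched in a few lines of Python: a node is a candidate for power-off only if it is idle and has no reservation starting within a lookahead window. The node records, the 30-minute window, and the power-control hook below are illustrative assumptions; the production scheduler works through Altair PBS Professional's interfaces.

```python
# Hedged sketch of the power-off decision described in the abstract. The data
# model, lookahead window, and power-control hook are illustrative assumptions.
import time
from dataclasses import dataclass, field

LOOKAHEAD_S = 30 * 60   # keep nodes up if reserved within 30 minutes (assumed window)

@dataclass
class Node:
    name: str
    busy: bool                                           # assigned to a running job
    reservations: list = field(default_factory=list)     # future job start times (epoch s)

def nodes_to_power_off(nodes, now=None):
    now = now if now is not None else time.time()
    candidates = []
    for n in nodes:
        if n.busy:
            continue
        reserved_soon = any(now <= t <= now + LOOKAHEAD_S for t in n.reservations)
        if not reserved_soon:
            candidates.append(n.name)
    return candidates

def power_off(node_name):
    # placeholder: a real scheduler would call out to IPMI / vendor tooling
    print(f"powering off {node_name}")

if __name__ == "__main__":
    now = time.time()
    cluster = [
        Node("n001", busy=True),
        Node("n002", busy=False, reservations=[now + 10 * 60]),   # reserved soon, stays up
        Node("n003", busy=False, reservations=[now + 4 * 3600]),  # reserved much later
        Node("n004", busy=False),
    ]
    for name in nodes_to_power_off(cluster, now):
        power_off(name)   # n003 and n004
```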
:::::::
Isolating Runtime Faults with Callstack Debugging using TAU
Sameer Shende*, ParaTools, Inc.
Abstract:
We present a tool that can help identify the nature of runtime errors in a program at the point of failure. This debugging tool, integrated in the TAU
Performance System, allows a developer to isolate the fault in a multi-language program by capturing the signal associated with the fault and examining the program callstack. It captures the performance data at the point of failure, stores detailed information for each frame in the callstack, and generates a file that may be shipped back to the developers for further analysis.
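The mechanism can be illustrated in miniature with Python's signal and traceback modules: install a handler for fatal signals, and on failure write the signal name and the callstack at the point of failure to a report file. TAU does this for compiled, multi-language programs and records far richer per-frame performance data; the file name and handler below are illustrative.

```python
# Miniature illustration of signal-based callstack capture (not TAU itself).
import os
import signal
import sys
import traceback

REPORT_PATH = "failure_report.txt"   # illustrative output file

def fault_handler(signum, frame):
    with open(REPORT_PATH, "w") as f:
        f.write(f"caught signal {signum} ({signal.Signals(signum).name})\n")
        f.write(f"pid={os.getpid()} argv={sys.argv}\n\n")
        # walk the callstack at the point of failure, innermost frame last
        traceback.print_stack(frame, file=f)
    sys.exit(1)

for sig in (signal.SIGSEGV, signal.SIGFPE, signal.SIGABRT):
    signal.signal(sig, fault_handler)

def buggy():
    os.kill(os.getpid(), signal.SIGSEGV)   # simulate a crash

if __name__ == "__main__":
    buggy()
    # failure_report.txt now holds the signal name and the callstack,
    # ready to be shipped back to the developers.
```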
:::::::
Fast Functional Simulation with a Dynamic Language
Craig Steele*, Exogi LLC
Abstract:
Simulation of large computational systems-on-a-chip (SoCs) is increasingly challenging as the number and complexity of components are scaled up. With the ubiquity of programmable components in computational SoCs, fast functional instruction-set simulation (ISS) is increasingly important. Much ISS has been done with straightforward unit-delay models of a non-pipelined fetch-decode-execute iteration written in a low-to-mid-level C-family static language, delivering mid-level efficiency. Some ISS programs, such as QEMU, perform binary translation to allow software
emulation to reach more usable speeds. This relatively complex methodology has not been widely adopted for system modeling. We
demonstrate a fresh approach to ISS that achieves much better performance than a fast binary-to-binary translator by exploiting recent advances
in just-in-time (JIT) compilers for dynamic languages, such as JavaScript and Lua, together with a specific programming idiom inspired by
pipelined processor design. We believe that this approach is relatively accessible to system designers familiar with C-family functional simulator
coding styles, and generally useful for fast modeling of complex SoC components.
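A stripped-down functional ISS written in a dynamic language looks like the sketch below, with Python standing in for JavaScript or Lua. The toy ISA, its tuple encoding, and the dispatch-table idiom are illustrative only; the paper's programming idiom is additionally shaped by pipelined processor design so that a tracing JIT can specialize the hot fetch-decode-execute loop.

```python
# Hedged sketch of a functional ISS in a dynamic language (Python standing in
# for JavaScript/Lua). The toy ISA and its encoding are illustrative.

class CPU:
    def __init__(self, program):
        self.regs = [0] * 8
        self.mem = list(program) + [0] * 256
        self.pc = 0
        self.halted = False

    # each handler implements one opcode; the decoder indexes this table
    def op_li(self, rd, imm):
        self.regs[rd] = imm

    def op_add(self, rd, rs):
        self.regs[rd] += self.regs[rs]

    def op_bnez(self, rs, off):
        if self.regs[rs] != 0:
            self.pc += off

    def op_halt(self, *_):
        self.halted = True

    def run(self):
        # instructions are (opcode, a, b) tuples; a real ISS decodes binary words
        handlers = {"li": self.op_li, "add": self.op_add,
                    "bnez": self.op_bnez, "halt": self.op_halt}
        while not self.halted:
            op, a, b = self.mem[self.pc]
            self.pc += 1            # fetch / advance
            handlers[op](a, b)      # decode / execute

if __name__ == "__main__":
    # sum 5 + 4 + 3 + 2 + 1 into r1; r2 is the loop counter, r3 holds -1
    program = [
        ("li",   1, 0), ("li",   2, 5),
        ("add",  1, 2), ("li",   3, -1), ("add", 2, 3),
        ("bnez", 2, -4),
        ("halt", 0, 0),
    ]
    cpu = CPU(program)
    cpu.run()
    print(cpu.regs[1])   # 15
```

A tracing JIT sees the `while` loop and the dictionary dispatch as one hot trace per simulated basic block, which is where the speedup over naive interpreted models comes from.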
:::::::
Power and Performance Comparison of HPEC Challenge Benchmarks on Various Processors
Sharad Mehta*, Mercury Computer Systems, Inc.
Abstract:
The HPEC Challenge Benchmarks may be used to compare power and performance characteristics of various processors. The objective is to
enable data-driven decisions to be taken during selection of components and system architectures for high performance embedded computing
applications. The system architect has a wide range of choices available in terms of selection of software, firmware and hardware. These choices
include various types of processors (CPUs, GPUs, FPGAs). These components may be configured within various network topologies to
accommodate the processing and data rate requirements of the application at hand. In addition to the complexity of the algorithms and the
volume and rate of the incoming data, embedded systems are challenged to be deployed in harsh conditions with restrictions in size, weight and
power (SWaP). The system architect is driven to use new computer processing elements within new architectures. The prediction and
comparison of the performance of different processors is difficult and a uniform methodology is needed to compare their performance in various
possible architectures. Several metrics may be used for comparison. For example, the processing latency, data transfer rate in and out of the
processor and power consumption for key mathematical kernels are some of the measurable parameters that drive the decision-making process.
Various component vendors provide peak theoretical performance of a component or sub-system. However, when the system is constructed, the
overall performance of the system is generally found to be much lower in comparison to the peak theoretical performance of the sub-systems.
The performance may be improved significantly by using optimization techniques which depend, among other things, upon processor features, system capabilities, and programming and diagnostic tools.
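The comparison ultimately reduces to a few derived metrics per kernel and device, such as sustained throughput, efficiency relative to the vendor's peak figure, and throughput per watt. The sketch below computes these for an FFT kernel using the standard 5N log2 N flop-count estimate; the timing and power numbers are made-up placeholders, not measurements from the paper.

```python
# Hedged example of derived benchmark metrics; the inputs are placeholders.
import math

def flops_fft(n):
    """Approximate flop count of a complex radix-2 FFT of length n."""
    return 5.0 * n * math.log2(n)

def kernel_metrics(flops, runtime_s, avg_power_w, peak_gflops):
    sustained = flops / runtime_s / 1e9           # sustained GFLOPS
    return {
        "sustained_gflops": sustained,
        "efficiency_vs_peak": sustained / peak_gflops,
        "gflops_per_watt": sustained / avg_power_w,
    }

if __name__ == "__main__":
    m = kernel_metrics(flops_fft(2 ** 20), runtime_s=1.2e-3,
                       avg_power_w=45.0, peak_gflops=500.0)
    for key, value in m.items():
        print(f"{key}: {value:.3f}")
```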
:::::::
Synthetic Aperture Radar on Low Power Multi-Core Digital Signal Processor
Dan Wang*, Texas Instruments
Abstract:
Commercial off-the-shelf (COTS) components have recently gained popularity in Synthetic Aperture Radar (SAR) applications. The compute capabilities of these devices have advanced to a level where real-time processing of complex SAR algorithms has become feasible. In this paper, we focus on a low-power multi-core Digital Signal Processor (DSP) from Texas Instruments Inc. and evaluate its capability for SAR signal processing. The specific DSP studied here is an eight-core device, the TMS320C6678, that provides a peak performance of 128 GFLOPS (single precision) for only 10 watts. We describe how the basic SAR operations, such as compression and corner turning, can be implemented efficiently in such a device. Our results indicate that a baseline SAR range-Doppler algorithm takes 0.2 sec for a 16 M (4K × 4K)
image.
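The two basic operations named in the abstract can be sketched with NumPy: range compression as frequency-domain matched filtering of each pulse, and the corner turn as a transpose from range-major to azimuth-major order. Data sizes and the chirp below are illustrative; the DSP implementation maps these onto the device's optimized FFT routines and DMA-based data movement.

```python
# Hedged NumPy sketch of range compression and the corner turn. The chirp and
# array sizes are illustrative, not the paper's data set.
import numpy as np

def range_compress(raw, chirp_replica):
    """Matched-filter each pulse (row) against the transmitted chirp."""
    n = raw.shape[1]
    H = np.conj(np.fft.fft(chirp_replica, n))           # matched filter spectrum
    return np.fft.ifft(np.fft.fft(raw, axis=1) * H, axis=1)

def corner_turn(data):
    """Reorder from (pulses, range bins) to (range bins, pulses) so the
    subsequent azimuth processing reads contiguous memory."""
    return np.ascontiguousarray(data.T)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pulses, bins, chirp_len = 512, 4096, 256
    t = np.arange(chirp_len)
    chirp = np.exp(1j * np.pi * (t ** 2) / chirp_len)    # illustrative LFM chirp
    raw = rng.standard_normal((pulses, bins)) + 1j * rng.standard_normal((pulses, bins))
    rc = range_compress(raw, chirp)
    az_ready = corner_turn(rc)
    print(az_ready.shape)   # (4096, 512)
```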
:::::::
Integration and Development of the 500 TFLOPS Heterogeneous Cluster (Condor)
Mark Barnell*, Air Force Research Laboratory
Abstract:
The Air Force Research Laboratory Information Directorate Advanced Computing Division (AFRL/RIT) High Performance Computing Affiliated
Resource Center (HPC-ARC) is the host to a very large scale interactive computing cluster consisting of about 1800 nodes. Condor, the largest
interactive Cell cluster in the world, consists of integrated heterogeneous processors of IBM Cell Broadband Engine (Cell BE) multicore CPUs,
NVIDIA General Purpose Graphic Processing Units (GPGPUs) and Intel x86 server nodes in a 10Gb Ethernet Star Hub network and 20Gb/s
InfiniBand Mesh, with a combined capability of 500 trillion floating-point operations per second (500 TFLOPS). Applications developed and running on
CONDOR include large-scale computational intelligence models, video synthetic aperture radar (SAR) back-projection, Space Situational
Awareness (SSA), video target tracking, linear algebra and others. This presentation will discuss the design and integration of the system. It will
also show progress on performance optimization efforts and lessons learned on algorithm scalability on a heterogeneous architecture.
:::::::
Ruggedization of MXM Graphics Modules
Ivan Straznicky*, Curtiss-Wright Controls Defense Solutions
Abstract:
MXM modules, used to package graphics processing devices for use in benign environments, have been tested for use in harsh environments
typical of deployed defense and aerospace systems. Results show that MXM GP-GPU modules with special mechanical designs can survive these environments and successfully provide the enormous processing capability offered by the latest generation of GPUs to harsh-environment
applications.
:::::::
Parallel Search of k-Nearest Neighbors with Synchronous Operations
Nikos Pitsianis*, Aristotle University and Duke University
Abstract:
We present a new study of parallel algorithms for locating k-nearest neighbors of each single query in a high dimensional (feature) space on a
many-core processor or accelerator that favors synchronous operations, such as on a graphics processing unit. Exploiting the intimate
relationships between two primitive operations, select and sort, we introduce a cohort of truncated sort algorithms for select. The truncated bitonic
sort (TBiS) in particular has desirable data locality, synchronous concurrency and simple data and program structures, which outweigh its single drawback of requiring more logical comparisons. TBiS can serve two special roles. One is as a reference point or benchmark for quantitative study of the integrated effect of multiple performance factors in algorithms and architectures for kNN search. The other is as the current record holder for fast kNN search on a parallel processor that imposes high synchronization cost. We provide algorithm analysis and experimental results.
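The select-by-truncated-merge idea can be sketched in NumPy: keep the data in sorted blocks of size k, pair an ascending block with a reversed partner, and take the element-wise minimum, which by the bitonic half-cleaner property retains the k smallest of each pair. This simplified sketch is not the authors' TBiS kernel; on the GPU each step is a synchronous network of compare-exchange operations.

```python
# Hedged NumPy sketch of truncated-bitonic-style top-k selection (not the
# authors' TBiS kernel). Each reduction keeps only the half-cleaner's lower
# half, which provably contains the k smallest of every merged pair.
import numpy as np

def truncated_bitonic_topk(values, k):
    """Return the k smallest values, sorted ascending. Padding with +inf
    ensures the pad elements never win a comparison."""
    n = len(values)
    pad = (-n) % k
    v = np.concatenate([values, np.full(pad, np.inf)])
    blocks = np.sort(v.reshape(-1, k), axis=1)         # sorted blocks of size k
    while blocks.shape[0] > 1:
        if blocks.shape[0] % 2:                        # odd count: add an +inf block
            blocks = np.vstack([blocks, np.full((1, k), np.inf)])
        a = blocks[0::2]                               # ascending blocks
        b = blocks[1::2, ::-1]                         # partners reversed -> descending
        lower = np.minimum(a, b)                       # half-cleaner: k smallest of each pair
        blocks = np.sort(lower, axis=1)                # finish the merge of the kept half
    return blocks[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal(64)                        # a query point
    pts = rng.standard_normal((100_000, 64))           # candidate points
    d2 = ((pts - q) ** 2).sum(axis=1)                  # squared distances
    k = 16
    assert np.allclose(truncated_bitonic_topk(d2, k), np.sort(d2)[:k])
    print("top-k matches full sort")
```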
:::::::
An Ingenious Approach for Improving Turnaround Time of Grid Jobs with Resource Assurance and Allocation Mechanism
Prachi Pandey*,
Abstract:
In a heavily used grid scenario, where there are many jobs competing for the best resource, the meta-scheduler is burdened with the task of
judiciously allocating appropriate resources to the jobs. However, as the demand for resources increases, it becomes increasingly difficult to manage the jobs and allocate resources to them, and hence most of the jobs remain in the queued state waiting for resources to become free. Gradually, this leads to a situation where jobs stay in the queued state longer than in the execution state, resulting in greatly increased turnaround times. The challenge therefore is to make sure that the jobs do not take an unreasonable time to complete because of the increased waiting time. In this paper, we discuss the advance reservation mechanism adopted in the Garuda Grid for assuring the availability of compute resources and QoS-based resource allocation. Results of the experiments carried out with this setup confirm the reduction in queuing time of jobs in the grid, thereby improving the turnaround time.
:::::::
Large-Scale Molecular Dynamics Simulations of Early- and Intermediate-Stage Sintering of Nanocrystalline SiC
Bryce Devine*, US Army Corps of Engineers
Abstract:
Polycrystalline silicon carbide (SiC) has tremendous potential as a lightweight structural material if its fracture toughness and tensile strength
could be significantly (factor of 4) improved, which is the long-term goal of this and related research. Such a “super” ceramic would allow for two-
thirds weight reduction, or more, over that of steel and aluminum for most structural applications. The potential impact on military logistics is
enormous. Key to the realization of such a super ceramic is the development of appropriate SiC composite designs and the development of
methods to fabricate SiC composites to meet these designs through sintering. Technologies to support SiC composite design development are
addressed in a companion paper. This paper discusses research to develop sintering fabrication methods. Recently developed sintering
techniques allow for the production of ceramic materials with nanocrystalline grain structures and for the incorporation of organic reinforcements
in ceramic composites. Both reduction in grain size and the incorporation of tensile members have been shown to improve the fracture toughness
of SiC. We are performing multi-million-atom classical molecular dynamics (MD) simulations of early- and intermediate-stage Spark Plasma
Sintering (SPS) of nanocrystalline SiC to better understand, and then engineer, the sintering process. We have developed continuum models to
predict the thermal, electric, and displacement fields inside the sintering chamber. These provide boundary and initial conditions for the MD
simulations of sintering. Several mechanisms were observed during each stage of sintering consolidation, with the rate limiting mechanism
dependent upon temperature, pressure and grain size. This research helps lay the technical foundation for development of a lightweight
structural “super” ceramic matrix composite.
:::::::
High Performance Java
Jordan Ruloff*, DRC
Abstract:
At this point in time, it is apparent that future programming paradigms will be based around many-core processors and heterogeneous
computing. Diversity in new processor architectures has led to a large variety of processors, which were designed to address different issues found in past architectures while, unfortunately and unintentionally, burdening programmers with the task of using these new architectures effectively. As more
programming libraries and languages are developed, programmers will be able to design algorithms for these different architectures to maximize
their code efficiency, whether to maximize performance or minimize power usage. Unfortunately, not all code can scale efficiently in many-core
architectures nor can all code efficiently utilize heterogeneous architectures. Sometimes, a programmer may even have to deal with a task that is
inherently serial in nature. Even if the task is trivially parallel, a programmer may find that, due to the limiting constraints of a particular
architecture, like memory or interconnect speed, an algorithm best suited for a particular problem may not be the most desirable to maximize
performance. In order to efficiently utilize the computing hardware, programmers must have a basic understanding of the fundamental differences
between the various architectures and how best to utilize them. This paper covers the methods employed for addressing task and data parallelism within the Java language to maximize the performance of the World Wind Java Ballistic Interface code, Java 7’s fork/join framework and AMD’s Aparapi Java bindings, as well as the importance of parallel execution time and how to map it to the various execution frameworks.
:::::::
HPC-VMs: Virtual Machines in High Performance Computing Systems
Albert Reuther*, MIT Lincoln Laboratory
Abstract:
The concept of virtual machines dates back to the 1960s. Both IBM and MIT developed operating system features that enabled user and
peripheral time sharing, the underpinnings of which were early virtual machines. Modern virtual machines present a translation layer of system
devices between a guest operating system and the host operating system executing on a computer system, while isolating each of the guest
operating systems from each other. In the past several years, enterprise computing has embraced virtual machines to deploy a wide variety of
capabilities from business management systems to email server farms. Those who have adopted virtual deployment environments have
capitalized on a variety of advantages including server consolidation, service migration, and higher service reliability. But they have also ended
up with some challenges including a sacrifice in performance and more complex system management. Some of these advantages and
challenges also apply to HPC in virtualized environments. In this paper, we analyze the effectiveness of using virtual machines in a high
performance computing (HPC) environment. We propose adding some virtual machine capability to already robust HPC environments for specific
scenarios where the productivity gained outweighs the performance lost by using virtual machines. Finally, we discuss an implementation that adds virtual machines to the software stack of an HPC cluster, and we analyze the effect of this implementation on job launch time.
:::::::
Complex Network Modeling with an Emulab HPC
Virginia Ross*, AFRL/RITB
Abstract:
To support DoD networks in the field, next generation complex network product designs need to be evaluated for optimum performance. Network
emulation plays an important role in evaluating these next generation complex network product designs. From the component level to the
system-of-systems level, emulation enables evaluation in a real system context, greatly reducing the cost and time of testing and validation
throughout the design cycle. For accurate network synthesis, emulation must support real-time speed, full packet fidelity, and provide
transparency. For example, the Joint Tactical Radio System (JTRS) has critical needs for network evaluation, including researching the JTRS
networking waveforms. With JTRS currently undergoing massive revision, this emulation can help save time and resources in modeling the
network for system development and testing. The Network Modeling and Simulation Environment (NEMSE) capability was developed and
installed on the Air Force Research Laboratory/Information Directorate (AFRL/RI) EMULAB high performance computer (HPC), a network
emulation testbed, to demonstrate this capability for future network modeling. The NEMSE environment has demonstrated the capability to
incorporate hardware and software elements to provide hardware-in-the-loop network emulation testing and support true network emulation.
NEMSE provides parallel execution, high-fidelity models, and the scalability and interactivity required to test and evaluate advanced network communication devices and architectures. This capability benefits the DoD by enabling rapid technology transition of complex network architectures from research laboratories to the field. Actual JTRS radios, Operations Network (OPNET) emulations, and GNU (a recursive acronym for GNU's Not Unix) open-source software-defined-radio software/firmware/hardware emulations can be
accommodated.
:::::::
Signal & Image Processing Technology Transfer to Army Fielded Combat Robots
Peter Raeth*, DRC
Abstract:
:::::::
Use of Code Execution Profiles and Traces in the HPCMP Sustained Systems Performance Test
Paul Bennett*, U.S. Army Engineer Research and Development Center DoD Supercomputing
Abstract:
The High Performance Computing Modernization Program (HPCMP) sustained systems performance (SSP) test plays a vital role in ensuring that
the highest level of performance is delivered to users of the HPCMP HPC systems. A subset of the benchmark codes from the system
acquisition cycle is used to benchmark system performance in order to quantitatively evaluate updates to system software, hardware repairs,
modifications to job queuing policies, and revisions to the job scheduler. The SSP codes have proven migration capability to HPCMP HPC
systems and nonempirical tests for numerical accuracy. Metrics such as compilation time, queue wait time, benchmark execution time, and total
test throughput time are gathered and compared against data from previous tests to monitor the systems under test while minimizing impact to
the users. Jobs failing to execute properly or in anomalously short or long times are investigated, and the results are reported to systems
administrators and Center Directors at each Center for appropriate actions. In the past few years, many of the SSP performance issues have
been found to arise from contention for the interconnecting networks, and as such, are transient in nature. Unfortunately, without additional
investigation, it is impossible to determine whether any given performance issue is a systemic problem or arises from network contention. This
poster presents the results of a study of the feasibility of using a lightweight profiling and tracing tool to more easily distinguish between systemic performance problems and transient interconnection problems.
:::::::
Parallel Circuit and Interconnect Simulation Using Multi-core PC
Chun-Jung Chen*, Chinese Culture University
Abstract:
This paper presents methods that utilize a multi-core PC to perform MOSFET circuit simulation and transmission line calculation. A very coarse-grained parallel computing strategy is proposed for circuit simulation. A parallel transmission line calculation method based on the Method of Characteristics is also described. All proposed methods have been implemented and tested, and experimental results are reported.
:::::::
A Novel Probe Concept for Computational Imaging of the Physiological Activity of Large Numbers of Cells
Michael Henninger*, Massachusetts Institute of Technology
Abstract:
We propose a new kind of imaging probe that is small enough to be implanted in the body where it can monitor the physiology of many individual
cells embedded in an intact tissue environment (e.g., neurons in the brain). Imaging cells in intact tissue presents dual challenges of probe
miniaturization and microscopic imaging in a highly scattering medium. We find a simultaneous solution to both of these challenges by replacing
the lens optics of a conventional imaging system with a patterned aperture of opaque and transparent areas. The probe consists of a long, thin
shank densely arrayed with CMOS imaging pixels. The pixels are covered with a transparent standoff, a patterned array of apertures, and a
fluorescence emission filter. When implanted in conjunction with an excitation light source, such a device enables the measurement of
fluorescence from arbitrary locations in the brain. The probe’s patterned array causes the CMOS sensor to record both spatial and angular
information about the incident light. The captured 4D light field—two spatial and two angular dimensions—provides a powerful dataset for a
variety of computational imaging techniques and reconstruction algorithms. Indeed, we demonstrate that this data can be used to reconstruct
single cell sources in the full 3D volume from a single image shot, without any moving parts. We present representative numerical simulations of
the light field probe’s operation and feasibility, and demonstrate simple examples of aperture patterns and reconstruction algorithms.
:::::::