Diptorup Deb, Robert J. Fowler, and Allan Porterfield.
QUARC: an optimized DSL framework using LLVM.
In The Fourth Workshop on the LLVM Compiler Infrastructure in
HPC (LLVM-HPC2017), Denver, CO, November 2017. ACM SIGHPC.
[ bib ]
We describe aspects of the implementation of QUARC, a framework layered on C++ used for a domain-specific language for Lattice Quantum Chromodynamics. It is built on top of Clang/LLVM to leverage long term support and performance portability. QUARC implements a general array extension to C++ with implicit data parallelism. A notable innovation is the method for using templates to capture and encode the high-level abstractions and to communicate these abstractions transparently to LLVM through an unmodified Clang. Another notable feature is a general array transformation mechanism used to improve memory hierarchy performance and maximize opportunities for vectorization. This reshapes and transposes arrays of structures containing nested complex arrays into arrays of structures of arrays. We discuss an example for which QUARC generated code has performance competitive with the very best hand-optimized libraries.
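The array relayout described in this abstract can be sketched in a few lines (an illustration of the AoS-to-SoA idea only, not QUARC's template machinery; the names and shapes below are hypothetical):

```python
# Hypothetical sketch: transform an array of structures (AoS), where
# each lattice site stores a small matrix, into a structure-of-arrays
# (SoA) layout so the same matrix element across consecutive sites
# becomes contiguous and easy to vectorize.

def aos_to_soa(aos, rows, cols):
    """aos[site][r][c] -> soa[r][c][site]."""
    n = len(aos)
    return [[[aos[s][r][c] for s in range(n)]
             for c in range(cols)]
            for r in range(rows)]

# Four sites, each a 2x2 "structure" of values.
aos = [[[10 * s + 2 * r + c for c in range(2)] for r in range(2)]
       for s in range(4)]
soa = aos_to_soa(aos, 2, 2)

# Element (r, c) of site s is preserved under the relayout, and
# soa[r][c] is now one contiguous run over all sites.
assert soa[1][0][3] == aos[3][1][0]
print(soa[0][0])
```

In the SoA form each soa[r][c] run is unit-stride across sites, which is what makes vectorization straightforward.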
Sridutt Bhalachandra, Allan Porterfield, Stephen L. Olivier, Jan F. Prins, and
Robert J. Fowler.
Improving energy efficiency in memory-constrained applications using
core-specific power control.
In 5th International Workshop on Energy Efficient Supercomputing
(E2SC), Denver, CO, November 2017. ACM SIGHPC.
[ bib ]
Power is increasingly the limiting factor in High Performance Computing (HPC) at Exascale and will continue to influence future advancements in supercomputing to mitigate large operating costs and carbon footprints. Recent processors equipped with on-board hardware counters allow real time monitoring of operating conditions such as energy and temperature, in addition to performance measures such as instructions retired and memory accesses. Significantly, recent processors also provide the ability to dynamically control processor power utilization at the per-core level. We present an experimental memory study on modern CPU architectures, Intel Sandybridge and Haswell, to identify opportunities to reduce CPU frequency and save energy with minimal performance impact. Using on-board hardware counters, we identify a metric, TORo core, that detects bandwidth saturation and increased latency in the memory system. This metric is then used to construct a dynamic policy to modulate per-core power controls on Haswell machines. The policy is evaluated when applied at coarse and fine-grained levels on six MPI mini-applications. The best energy savings with the coarse and fine-grained application of the dynamic policy is 32.1% and 19.5% respectively with a 2% increase in execution time in both cases. On average, the fine-grained dynamic policy yields a 1% speedup while the coarse-grained dynamic policy yields a 3% slowdown. Energy savings through frequency reduction not only provide cost advantages, they also reduce resource contention and create additional thermal headroom for non-throttled cores that can lead to improved performance.
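A hedged sketch of the kind of per-core policy the abstract describes (the thresholds and frequency steps below are hypothetical, not the authors' TORo core implementation): when a saturation metric crosses a high-water mark the core is stepped down, and when it falls below a low-water mark the core is stepped back up.

```python
# Hypothetical per-core frequency policy (illustration only).
FREQ_STEPS = [1.2, 1.6, 2.0, 2.4]   # GHz, lowest to highest
HIGH, LOW = 0.8, 0.4                # hypothetical saturation thresholds

def next_freq_index(idx, saturation):
    """Return the new frequency-step index for one core."""
    if saturation > HIGH and idx > 0:
        return idx - 1              # memory-bound: throttle down
    if saturation < LOW and idx < len(FREQ_STEPS) - 1:
        return idx + 1              # compute-bound: restore frequency
    return idx

idx = len(FREQ_STEPS) - 1           # start at full frequency
for sample in [0.9, 0.85, 0.5, 0.3, 0.2]:
    idx = next_freq_index(idx, sample)
print(FREQ_STEPS[idx])
```

The hysteresis between the two thresholds keeps the policy from oscillating when the metric hovers near a single cutoff.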
Sridutt Bhalachandra, Allan Porterfield, Stephen L. Olivier, and Jan F.
Prins.
An adaptive core-specific runtime for energy efficiency.
In 31st IEEE International Parallel & Distributed Processing Symposium
(IPDPS), Orlando, FL, May 2017. IEEE.
[ bib ]
R. Idaszak, R. Arthur, R. Bartlett, I. Baxter, D.E. Bernholdt, R. Boisvert,
K. Fecho, R. Fowler, S. Greenspan, M.A. Heroux, C. Iancu, C. Kartsaklis,
D.S. Katz, Q. Koziol, S. Landsberg, E. Lucier, J. McGregor,
T. Ndousse-Fetter, A. Pawlik, A.I. Reuther, W. Scarborough, and
W. Schroeder.
Economics of CSE software tools.
In Computational Science and Engineering Software Productivity and
Sustainability (CSESSP) Challenges Workshop Report, Washington, DC, October
2016.
[ bib ]
Diptorup Deb, Robert Fowler, and Allan Porterfield.
QUARC: An Array Programming Approach to High Performance
Computing, volume 10136 of Lecture Notes in Computer Science.
Springer International Publishing, Rochester, NY, September 2016.
[ bib ]
We present QUARC, a framework for the optimized compilation of domain-specific extensions to C++. Driven by needs for programmer productivity and portable performance for lattice QCD, the framework focuses on stencil-like computations on arrays with an arbitrary number of dimensions. QUARC uses a template meta-programming front end to define a high-level array language. Unlike approaches that generate scalarized loop nests in the front end, the instantiation of QUARC templates retains high-level abstraction suitable for optimization at the object (array) level. The back end compiler (CLANG/LLVM) is extended to implement array transformations such as transposition, reshaping, and partitioning for parallelism and for memory locality prior to scalarization.
Allan Porterfield, Sridutt Bhalachandra, Wei Wang, and Robert Fowler.
Variability: A tuning headache.
In Symposium on Parallel and Distributed Processing (IPDPS) -
Workshop Proceedings, Chicago, IL, May 2016. IEEE.
[ bib ]
Performance tuning is an ongoing activity at most HPC sites. Small performance improvements can save thousands of dollars. Run-to-run performance variations significantly impact performance tuning. Not being able to tell which code version is faster (or more energy efficient) in a single run greatly increases the computational expense and uncertainty for the programmer. We show examples where autotuning frameworks could easily choose a sub-optimal kernel, and we also examine the difficulty of optimizing a real-world HPC application.
Allan Porterfield, Rob Fowler, Sridutt Bhalachandra, Barry Rountree,
Diptorup Deb, and Robert Lewis.
Application runtime variability and power optimization for exascale
computers.
In International Workshop on Runtime and Operating Systems at Scale
(ROSS2015), Portland, OR, June 2015.
[ bib ]
Paul Ruth, Anirban Mandal, Claris Castillo, Robert Fowler, Jeff Tilson, Ilya
Baldin, and Yufeng Xin.
Achieving performance isolation on multi-tenant networked clouds using
advanced block storage mechanisms.
In 6th Workshop on Scientific Cloud Computing (ScienceCloud'15), pages
29--32, Portland, OR, June 2015.
[ bib | DOI ]
Kevin A. Huck, Allan Porterfield, Rob Fowler, Nick Chaimov, Hartmut Kaiser,
Allen D. Malony, and Thomas Sterling.
An autonomic performance environment for exascale.
In Supercomputing Frontiers 2015, Singapore, March 2015.
[ bib ]
Kevin Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen Malony,
Thomas Sterling, and Rob Fowler.
An autonomic performance environment for exascale.
Supercomputing Frontiers and Innovations, 2(3), 2015.
[ bib ]
Exascale systems will require new approaches to performance observation, analysis, and runtime decision-making to optimize for performance and efficiency. The standard first-person model, in which multiple operating system processes and threads observe themselves and record first-person performance profiles or traces for offline analysis, is not adequate to observe and capture interactions at shared resources in highly concurrent, dynamic systems. Further, it does not support mechanisms for runtime adaptation. Our approach, called APEX (Autonomic Performance Environment for eXascale), provides mechanisms for sharing information among the layers of the software stack, including hardware, operating and runtime systems, and application code, both new and legacy. The performance measurement components share information across layers, merging first-person data sets with information collected by third-person tools observing shared hardware and software states at node- and global-levels. Critically, APEX provides a policy engine designed to guide runtime adaptation mechanisms to make algorithmic changes, re-allocate resources, or change scheduling rules when appropriate conditions occur.
Allan Porterfield, Rob Fowler, Sridutt Bhalachandra, and Wei Wang.
OpenMP and MPI application energy measurement variation.
In 1st International Workshop on Energy Efficient SuperComputing (E2SC),
Denver, CO, November 2013.
[ bib ]
Kevin Huck, Sameer Shende, Allen Malony, Hartmut Kaiser, Allan Porterfield,
Rob Fowler, and Ron Brightwell.
An early prototype of an autonomic performance environment for exascale.
In International Workshop on Runtime and Operating Systems at Scale
(ROSS2013), Eugene, OR, June 2013.
[ bib ]
Jee Whan Choi, Daniel Bedard, Robert Fowler, and Richard Vuduc.
A roofline model of energy.
In Proceedings of the International Parallel and Distributed Processing
Symposium (IPDPS13), Boston, MA, May 2013.
[ bib ]
Anirban Mandal, Robert Fowler, and Allan Porterfield.
System-wide introspection for accurate attribution of performance
bottlenecks.
In Workshop on High-performance Infrastructure for Scalable Tools (WHIST),
Venice, Italy, June 2012.
[ bib ]
Walter Ermler, Jeffrey Tilson, and Robert J. Fowler.
Spin-orbit configuration interaction calculations of electronic
spectra of RuO2+ and OsO2+ catalytic cores.
In Southwest Regional Meeting of the American Chemical Society
(SWRMACS 2012), Baton Rouge, LA, 2012.
[ bib ]
Low-lying potential energy curves of RuO2+ and OsO2+ catalytic core molecules are analyzed using large-scale spin-orbit configuration interaction (SOCI) calculations based on multireference molecular wavefunctions. Relativistic effects are included using effective core potentials. The large spin-orbit splitting energies of the 4d and 5d subshells of Ru and Os require that the spin-orbit coupling operator be included when calculating the electronic spectra. The ground states of both molecules are triply bonded systems of 0+(1Σ+) symmetry having bond lengths, harmonic frequencies and dissociation energies of 1.63 and 1.70 Å, 757 and 772 cm-1, and 83.8 and 97.2 kcal/mol, respectively. These results are consistent with experimental observation that the ground state of complexed RuO2+ is diamagnetic.
Jeffrey L. Tilson, Walter C. Ermler, and Robert J. Fowler.
CI potential energy curves for three states of RuO2+.
Chemical Physics Letters, 516(4--6):131--136, 2011.
[ bib | DOI | http ]
Min Yeol Lim, Allan Porterfield, and Robert Fowler.
SoftPower: Fine-grain power estimations using performance counters.
In The ACM International Symposium on High Performance Distributed Computing
(HPDC), Chicago, July 2010. ACM.
Best short paper award.
[ bib ]
Todd Gamblin, Bronis de Supinski, Martin Schulz, Rob Fowler, and Daniel
Reed.
Efficiently clustering performance data at massive scales.
In Proceedings of the International Conference on Supercomputing 2010
(ICS2010), Tsukuba, Japan, June 2010. ACM.
[ bib ]
Daniel Bedard, Min Yeol Lim, Robert Fowler, and Allan Porterfield.
PowerMon: Fine-grained and integrated power monitoring for commodity
computer systems.
In Proceedings Southeastcon 2010, Charlotte, NC, March 2010. IEEE.
[ bib ]
Anirban Mandal, Rob Fowler, and Allan Porterfield.
Modeling memory concurrency for multi-socket multi-core systems.
In Proceedings of the 2010 IEEE International Symposium on Performance
Analysis of Systems and Software (ISPASS2010), pages 56--75, White Plains,
NY, March 2010. IEEE.
[ bib ]
Anirban Mandal, Min Yeol Lim, Allan Porterfield, and Rob Fowler.
Effects of multi-core memory concurrency limits on multi-threaded
applications.
Technical Report TR-10-03, RENCI, 2010.
[ bib ]
Anirban Mandal, Min Yeol Lim, Allan Porterfield, and Robert Fowler.
Implications for applications and compilers of multi-core memory
concurrency.
Poster at International Workshop on Languages and Compilers for Parallel
Computing (LCPC'10), 2010.
[ bib ]
Allan Porterfield, Nassib Nassar, and Rob Fowler.
Multi-threaded library for many-core systems.
In Workshop on Multithreaded Architectures and Applications, Rome, May
2009. IEEE.
[ bib ]
Allan Porterfield, Rob Fowler, Anirban Mandal, and Min Yeol Lim.
Empirical evaluation of multi-core memory concurrency.
Technical Report TR-09-01, RENCI, Chapel Hill, North Carolina, 2009.
[ bib ]
Multi-socket, multi-core computers are becoming ubiquitous, especially as nodes in compute clusters of all sizes. Common memory benchmarks and memory performance models treat memory as characterized by well-defined maximum bandwidth and average latency parameters. In contrast, current and future systems are based on deep hierarchies and NUMA memory systems, which are not easily described this simply. Memory performance characterization of multi-socket, multi-core systems requires measurements and models more sophisticated than simple peak bandwidth/minimum latency models. To investigate this issue, we performed a detailed experimental study of the memory performance of a variety of AMD multi-socket quad-core systems. We used the pChase benchmark to generate memory system loads with a variable number of concurrent memory operations in the system across a variable number of threads pinned to specific chips in the system. While processor differences had a minor but measurable impact on bandwidth, the make-up and structure of the memory have a major impact on achievable bandwidth. Our experiments exposed three different bottlenecks at different levels of the hardware architecture: limits on the number of references outstanding per thread; limits on the memory requests serviced by a single memory channel; and limits on the total global memory references outstanding. We discuss the impact of these limits on constraints in tuning code for these systems, the impact on compilers and operating systems, and on future system implementation decisions.
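The pChase-style load generation the study relies on can be illustrated as follows (a pure-Python sketch of the idea only; the real benchmark times native loads, which Python cannot): each thread walks k independent randomly-permuted pointer chains, so k is the number of memory references "in flight" per thread.

```python
import random

def make_chain(n, seed):
    """next[i] gives the successor of i in one random cyclic chain."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    nxt = [0] * n
    for a, b in zip(idx, idx[1:] + idx[:1]):
        nxt[a] = b          # link the shuffled order into one n-cycle
    return nxt

def chase(chains, steps):
    """Advance all chains in lockstep, like k concurrent loads."""
    pos = [0] * len(chains)
    for _ in range(steps):
        pos = [chain[p] for chain, p in zip(chains, pos)]
    return pos

chains = [make_chain(64, seed) for seed in range(4)]   # k = 4
print(chase(chains, 64))
```

Because each chain is a single 64-cycle, chasing 64 steps returns every chain to its start; varying k is how the benchmark varies per-thread memory concurrency.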
Robert Fowler, L. Adhianto, Bronis de Supinski, Michael Fagan, Todd Gamblin,
Mark Krentel, John Mellor-Crummey, Martin Schulz, and Nathan Tallent.
Frontiers of performance analysis on leadership-class systems.
Journal of Physics: Conference Series, 180, 2009.
[ bib | DOI ]
Allan Porterfield, Robert J. Fowler, Anirban Mandal, and Min Yeol Lim.
Performance consistency on multi-socket AMD Opteron systems.
Technical Report TR-08-07, RENCI, North Carolina, December 2008.
(Submitted for publication).
[ bib ]
Compute nodes with multiple sockets, each of which has multiple cores, are starting to dominate in the area of scientific computing clusters. Performance inconsistencies from one execution to the next make any performance debugging or tuning difficult. The resulting performance inconsistencies are bigger for memory-bound applications but still noticeable for all but the most compute-intensive applications. Memory and thread placement across sockets has significant impact on performance of these systems. We test overall performance and performance consistency for a number of OpenMP and pthread benchmarks including Stream, pChase, the NAS Parallel Benchmarks, and SPEC OMP. The tests are run on a variety of multi-socket quad-core AMD Opteron systems. We examine the benefits of explicitly pinning each thread to a different core before any data initialization, thus improving performance and reducing its variability by controlling data-to-thread co-location. Execution time variability falls to less than 2% and peak performance increases by over 40%. On clusters with hundreds or thousands of nodes, reducing variability will improve load balance and total application performance. Careful memory and thread placement is critical for the successful performance tuning of nodes on a modern scientific compute cluster.
Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel
A. Reed.
Scalable load-balance measurement for SPMD codes.
In Proceedings of Supercomputing 2008, Austin, TX, November 2008. ACM/IEEE.
[ bib ]
Jeffrey L. Tilson, Mark S.C. Reed, and Robert J. Fowler.
Workflows for performance evaluation and tuning.
In Proceedings 2008 IEEE International Conference on Cluster Computing
(Cluster 2008), 8 pp., Tsukuba, Japan, September 2008. IEEE.
[ bib ]
Bronis R. de Supinski, Rob Fowler, Todd Gamblin, Frank Mueller, Prasun Ratn,
and Martin Schulz.
An open infrastructure for scalable, reconfigurable analysis.
In International Workshop on Scalable Tools for High-End Computing (STHEC
2008). ACM/SIGARCH, July 2008.
[ bib ]
Allan Porterfield, Robert Fowler, and Mark Neyer.
MAESTRO: dynamic runtime power control.
In Workshop on Managed Multicore Systems (MMCS), Boston, MA, 2008.
[ bib ]
Microprocessors are gaining cores quickly enough that, for many applications, the processor will provide greater computational resources than the memory system can supply with data. Performance of even simple benchmarks is noticeably impacted by co-scheduling multiple copies on the cores of a current microprocessor system. Runtimes, such as MAESTRO, can use the excess computational resources to reduce power consumption of memory-bound applications with little performance impact. By using hardware performance counters for the shared resources from a dedicated core, chip-wide bottlenecks are detected. When a threshold is exceeded, the frequency (and power) of the cores is reduced. MAESTRO is prototyped on an AMD Phenom as a daemon and currently detects high miss rates for the shared L3 cache and reduces the processor frequency until the rate drops below a second threshold. On a desktop system, running at low frequency can save up to 36%, while MAESTRO allows compute-bound jobs to run at full frequency and saves power when the bottlenecks exist during parallel execution. Performance degradation is currently 20-25% but is expected to fall as MAESTRO is tuned and the core counts on a chip increase.
Todd Gamblin, Rob Fowler, and Daniel A. Reed.
Scalable methods for monitoring and detecting behavioral classes in
scientific codes.
In Proceedings of the International Parallel and Distributed
Processing Symposium 2008, Miami, FL, April 2008.
[ bib ]
Emerging petascale systems will have many hundreds of thousands of processors, while traditional task-level tracing tools already fail to scale to much smaller systems because the I/O backbones of these systems are not configured to handle the peak offered load of all their cores. Complete event traces of all processes are thus infeasible. To retain the benefits of detailed performance measurement while reducing volume of collected data, we developed AMPL, a general-purpose toolkit that reduces data volume using stratified sampling. We adopt a scalable sampling strategy, since the sample size required to measure a system varies sub-linearly with process count. By grouping, or stratifying, processes that behave similarly, we can further reduce data overhead while also providing insight into an application's behavior. In this paper, we describe the AMPL toolkit and we report our experiences using it on large-scale scientific applications. We show that AMPL can successfully reduce the overhead of tracing scientific applications by an order of magnitude or more, and we show that our tool scales sub-linearly, so the improvement will be more dramatic on petascale machines. Finally, we illustrate the use of AMPL to monitor applications by performance-equivalent strata, and we show that this technique can allow for further reductions in trace data volume and traced execution time.
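The stratified-sampling arithmetic behind AMPL's sub-linear scaling can be illustrated with the classic finite-population sample-size formula (the constants and the grouping below are hypothetical, not AMPL's defaults):

```python
import math

def sample_size(N, stddev, error, z=1.96):
    """Processes to sample from a stratum of N to estimate the mean
    within +/- error at ~95% confidence (finite-population corrected)."""
    n0 = (z * stddev / error) ** 2
    return math.ceil(n0 / (1 + n0 / N))

# The required sample grows sub-linearly with process count:
for N in (1_000, 10_000, 100_000):
    print(N, sample_size(N, stddev=5.0, error=1.0))
```

Grouping processes into behaviorally similar strata lowers the per-stratum standard deviation, which shrinks each stratum's required sample further.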
Howard M. Lander, Robert J. Fowler, Lavanya Ramakrishnan, and Steven R. Thorpe.
Stateful grid resource selection for related asynchronous tasks.
Technical Report TR-08-02, RENCI, North Carolina, April 2008.
[ bib ]
In today's grid deployments, resource selection is based on prior knowledge of the performance characteristics of the application on a particular resource and on real-time monitoring status of the resource, such as load on the system, network bandwidth, etc. Any lag between a resource selection decision and the time the job appears in the system's monitoring facility will cause subsequent decisions to be based on incorrect information. If two or more jobs arrive within this hysteresis window, the incorrect assessment of system state can have negative consequences on job response time and system throughput. In this paper we describe a stateful resource selection protocol we designed to mitigate this problem for a real-time storm surge modeling project. We present results from real experiments on a regional grid. We use emulation to compare and study the effect of our protocol under varying load conditions. Based on our evaluation we argue that the enhanced protocol should be made available as a globally-aware grid resource selection service.
Robert J. Fowler, Todd Gamblin, Gopi Kandaswamy, Anirban Mandal, Allan K.
Porterfield, Lavanya Ramakrishnan, and Daniel A. Reed.
Challenges of scale: When all computing becomes grid computing.
In Lucio Grandinetti, editor, High Performance Computing and Grids in
Action, Advances in Parallel Computing. IOS Press, Amsterdam, March 2008.
[ bib ]
Robert J. Fowler, Todd Gamblin, Allan K. Porterfield, Patrick Dreher, Song
Huang, and Balint Joo.
Performance engineering challenges: the view from RENCI.
J. Phys.: Conf. Ser., 5 pp., 2008.
[ bib ]
Todd Gamblin, Prasun Ratn, Bronis R. de Supinski, Martin Schulz, Frank
Mueller, Robert J. Fowler, and Daniel A. Reed.
An open framework for scalable, reconfigurable performance analysis.
In Supercomputing 2007 (SC'07), Reno, NV, November 2007. (Poster).
[ bib ]
David Bailey and Robert Lucas et al.
Performance engineering: Understanding and improving the performance
of large-scale codes.
CT Watch Quarterly, 3(4):18--23, November 2007.
[ bib ]
Achieving good performance on high-end computing systems is growing ever more challenging due to enormous scale, increasing architectural complexity, and increasing application complexity. To address these challenges in DOE's SciDAC-2 program, the Performance Engineering Research Institute (PERI) has embarked on an ambitious research plan encompassing performance modeling and prediction, automatic performance optimization and performance engineering of high-profile applications. The principal new component is a research activity in automatic tuning software, which is spurred by the strong user preference for automatic tools.
Y. Zhang, R. Fowler, K. Huck, A. Malony, A. Porterfield, D. Reed, S. Shende,
V. Taylor, and X. Wu.
US QCD computational performance studies with PERI.
J. Phys.: Conf. Ser., 78(012083):5 pp., August 2007.
[ bib ]
Todd Gamblin, Rob Fowler, and Daniel A. Reed.
Runtime methods for automatic behavioral stratification of scientific
codes.
In Proceedings of the Los Alamos Computer Science Institute 7th Annual
Symposium, Santa Fe, NM, October 2006. (Best Poster Award).
[ bib ]
Nathan Froyd, Nathan Tallent, John Mellor-Crummey, and Rob Fowler.
Call path profiling for unmodified, optimized binaries.
In Proceedings of the GCC and GNU Toolchain Developers' Summit,
Ottawa, Canada, June 2006.
[ bib ]
Although gprof has long been the standard for call graph profiling in the GNU toolchain, it suffers from several shortcomings. First, in modern object-oriented programs, costs must be attributed to full calling contexts because the cost of function calls may be context dependent; gprof ignores this issue. Second, gprof uses instrumentation in procedure prologues to gather performance data. Gprof's instrumentation-based profiling imposes four costs: (1) recompilation is used to add instrumentation, (2) the presence of instrumentation distorts the measurements taken, (3) the overhead of instrumentation can significantly increase running time, and (4) the presence of instrumentation can weaken compiler optimization. We have developed a call-path profiler that avoids all of these shortcomings by measuring the performance of unmodified, fully-optimized binaries. Rather than inserting instrumentation, we use periodic sampling of hardware performance counters and stack unwinding to attribute samples to calling contexts and collect frequency counts for call graph edges. Experiments with the SPEC CPU2000 benchmark suite yield good accuracy and low (2%-7%) overhead. A call-path profiler based on stack sampling needs information to unwind the stack at any point in the execution. In particular, it requires precise information about procedure epilogues -- more information than is required to handle C++ exceptions. In this paper, we describe the changes we have made to GCC to emit such information and we present results of experiments on the x86-64 platform. Accuracy and overhead are comparable to the measurements made on Tru64/Alpha. For profiling binaries that lack sufficient information, we propose to augment binutils to collect the necessary information using binary analysis.
Nathan Froyd, John Mellor-Crummey, and Robert Fowler.
Efficient call-stack profiling of unmodified, optimized code.
In Proceedings of the International Conference on Supercomputing
(ICS2005), pages 81--90, Cambridge, MA, June 2005.
[ bib ]
Call path profiling associates resource consumption with the calling context in which resources were consumed. We describe the design and implementation of a low-overhead call path profiler based on stack sampling. The profiler uses a novel sample-driven strategy for collecting frequency counts for call graph edges without instrumenting every procedure's code to count them. The data structures and algorithms used are efficient enough to construct the complete calling context tree exposed during sampling. The profiler leverages information recorded by compilers for debugging or exception handling to record call path profiles even for highly-optimized code. We describe an implementation for the Tru64/Alpha platform. Experiments profiling the SPEC CPU2000 benchmark suite demonstrate the low (2%-7%) overhead of this profiler. A comparison with instrumentation-based profilers, such as xgprof, shows that for call-intensive programs, our sampling-based strategy for call path profiling has over an order of magnitude lower overhead.
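The calling context tree (CCT) aggregation described in this abstract can be sketched as follows (an illustration of the data structure only, not the authors' profiler; the sampled stacks are hypothetical):

```python
# Each timer sample delivers the current call stack; samples are merged
# into a calling context tree so cost is attributed to full calling
# contexts rather than to flat function names.

class CCTNode:
    def __init__(self, name):
        self.name, self.samples, self.children = name, 0, {}

    def record(self, stack):
        """Attribute one sample to the path root -> ... -> leaf."""
        node = self
        for frame in stack:
            node = node.children.setdefault(frame, CCTNode(frame))
        node.samples += 1

root = CCTNode("<root>")
# Stacks as they might be captured by unwinding at each sample:
for stack in [("main", "solve", "dgemm"),
              ("main", "solve", "dgemm"),
              ("main", "io", "write")]:
    root.record(stack)

dgemm = root.children["main"].children["solve"].children["dgemm"]
print(dgemm.samples)  # cost attributed to main -> solve -> dgemm
```

Because nodes are created lazily per observed path, the tree's size tracks the contexts actually exercised, not every procedure in the binary.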
Ken Kennedy, Bradley Broom, Arun Chauhan, Robert Fowler, John Garvin, Charles
Koelbel, Cheryl McCosh, and John Mellor-Crummey.
Telescoping languages: A system for automatic generation of domain
languages.
Proceedings of the IEEE, 93(2):387--408, February 2005.
[ bib ]
The software gap - the discrepancy between the need for new software and the aggregate capacity of the workforce to produce it - is a serious problem for scientific software. Although users appreciate the convenience (and, thus, improved productivity) of using relatively high-level scripting languages, the slow execution speeds of these languages remain a problem. Lower level languages, such as C and Fortran, provide better performance for production applications, but at the cost of tedious programming and optimization by experts. If applications written in scripting languages could be routinely compiled into highly optimized machine code, a huge productivity advantage would be possible. It is not enough, however, to simply develop excellent compiler technologies for scripting languages (as a number of projects have succeeded in doing for MATLAB). In practice, scientists typically extend these languages with their own domain-centric components, such as the MATLAB signal processing toolbox. Doing so effectively is difficult. Our approach calls for using a library-preprocessing phase to extensively analyze and optimize collections of libraries that define an extended language. Results of this analysis are collected into annotated libraries and used to generate a library-aware optimizer. The generated library-aware optimizer uses the knowledge gathered during preprocessing to carry out fast and effective optimization of high-level scripts. This enables script optimization to benefit from the intense analysis performed during preprocessing without repaying its price. Since library preprocessing is performed only at infrequent "language-generation" times, its cost is amortized over many compilations of individual scripts that use the library. We call this strategy "telescoping languages" because it merges knowledge of a hierarchy of extended languages into a single library-aware optimizer.
We present our vision and plans for compiler frameworks based on telescoping languages and report on the preliminary research that has established the effectiveness of this approach.
Robert Wilhelmson, Jay Alameda, Kelvin Droegemeier, Michael Folk, Rob Fowler,
Dennis Gannon, Sara Graves, Dale Haidvogel, Parry Husbands, Charles
Lee Isbell Jr., Dan Weber, Paul Woodward, Bryant W. York, Sarah Anderson,
Brian Jewett, Christopher Moore, David Nolan, David Porter, Dave Semeraro,
and Steve Tanner.
MEAD (a modeling environment for atmospheric discovery).
In 20th International Conference on Interactive Information and
Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology,
Seattle, WA, January 2004.
in conjunction with the 84th AMS annual meeting.
[ bib ]
The goal of the MEAD Expedition is the development and adaptation of Grid and TeraGrid-enabled cyberinfrastructure for enabling ensemble or very large domain model simulations coupled with data handling, analysis, data mining, and visualization services. This includes a dynamic workflow and data management environment applicable in a variety of fluid flow modeling environments. The specific applications chosen for MEAD are mesoscale storm and hurricane research and education. The MEAD Expedition is a cyberinfrastructure proving ground that has been funded for two years by the National Computational Science Alliance, an NSF PACI program. The MEAD project is documented at http://www.ncsa.uiuc.edu/AboutUs/FocusAreas/MEADExpedition.html.
Bradley Broom, Rob Fowler, Ken Kennedy, Charles Koelbel, and Michael
Paleczny.
Compiler support for out-of-core arrays on parallel machines.
In Daniel Reed, editor, Scalable Input/Output, pages 155--174. MIT Press,
Cambridge, MA, October 2003.
[ bib ]
Alain Darte, John Mellor-Crummey, Robert Fowler, and Daniel
Chavarría-Miranda.
Generalized multipartitioning of multi-dimensional arrays for
parallelizing line-sweep computations.
Journal of Parallel and Distributed Computing, 63(9):887--991, 2003.
[ bib ]
Multipartitioning is a strategy for decomposing multi-dimensional arrays into tiles and mapping the resulting tiles onto a collection of processors. This class of partitionings enables efficient parallelization of "line-sweep" computations that solve one-dimensional recurrences along each dimension of a multi-dimensional array. Multipartitionings yield balanced parallelism for line sweeps by assigning each processor the same number of data tiles to compute at each step of a sweep along any array dimension. Also, they induce only coarse-grain communication. This paper considers the problem of computing generalized multipartitionings, which decompose d-dimensional arrays, d ≥ 2, onto an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning onto all of the processors for this general case. We use a cost model to select the dimensionality of the best partitioning and the number of cuts to make along each array dimension; then, we show how to construct a mapping that assigns the resulting data tiles to each of the processors. The assignment of tiles to processors induced by this class of multipartitionings corresponds to an instance of a latin hyper-rectangle, a natural extension of latin squares, which have been widely studied in mathematics and statistics. Finally, we describe how we extended the Rice dHPF compiler for High Performance Fortran to generate code that employs our strategy for generalized multipartitioning and show that the compiler's generated code for the NAS SP computational fluid dynamics benchmark achieves scalable high performance.
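The simplest instance of the construction, a 2-D multipartitioning onto p processors, can be sketched as follows (a special case for illustration, not the paper's general d-dimensional algorithm):

```python
# Cut the array into a p x p grid of tiles and assign tile (i, j) to
# processor (i + j) mod p. The assignment is a latin square: every
# processor owns exactly one tile in each row and each column, so a
# line sweep along either dimension keeps all processors busy.

def multipartition_2d(p):
    return [[(i + j) % p for j in range(p)] for i in range(p)]

tiles = multipartition_2d(4)
for row in tiles:
    print(row)

# Balanced in both sweep directions:
for i in range(4):
    assert sorted(tiles[i]) == [0, 1, 2, 3]                       # row i
    assert sorted(tiles[j][i] for j in range(4)) == [0, 1, 2, 3]  # col i
```

The generalized scheme in the paper extends this latin-square idea to latin hyper-rectangles for d dimensions and arbitrary processor counts.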
Robert Fowler, Alan Cox, Sameh Elnikety, and Willy Zwaenepoel.
Using performance reflection in systems software.
In Proceedings of USENIX Workshop on Hot Topics in Operating Systems (HOTOS
IX), Lihue, HI, March 2003. Extended abstract.
[ bib ]
|||Robert Fowler, John Mellor-Crummey, Guohua Jin, and Apan Qasem. A source-to-source loop transformation tool (extended poster abstract). In Proceedings of the Los Alamos Computer Science Institute 3rd Annual Symposium, Santa Fe, NM, October 2002. Published on CD-ROM. [ bib ]|
John Mellor-Crummey, Vikram Adve, Bradley Broom, Daniel
Chavarría-Miranda, Robert Fowler, Guohua Jin, Ken Kennedy, and Qing Yi.
Advanced optimization strategies in the Rice dHPF compiler.
Concurrency: Practice and Experience, 14(8 & 9):741--768, 2002.
[ bib ]
To a large extent, today's commercially available HPF compilers have failed to deliver scalable parallel performance for a broad spectrum of applications because of insufficiently powerful compiler analysis and optimization. Substantial restructuring and hand-optimization can be required to achieve acceptable performance with an HPF port of an existing Fortran application, even for regular data-parallel applications. A key goal of the Rice dHPF compiler project has been to develop optimization techniques that enable a wide range of existing scientific applications to be ported easily to efficient HPF with minimal restructuring. This paper describes the challenges to effective parallelization presented by complex (but regular) data-parallel applications, and then describes how the novel analysis and optimization technologies in the dHPF compiler address these challenges effectively, without major rewriting of the applications. We illustrate the techniques by describing their use for parallelizing the NAS SP and BT benchmarks. The dHPF compiler generates multipartitioned parallelizations of these codes that are approaching the scalability and efficiency of sophisticated hand-coded parallelizations.
Daniel Chavarría-Miranda, Alain Darte, Robert Fowler, and John Mellor-Crummey.
Generalized multipartitioning for multi-dimensional arrays.
In Proceedings of the International Parallel and Distributed
Processing Symposium, Fort Lauderdale, FL, April 2002.
Selected as a Best Paper.
[ bib ]
Multipartitioning is a strategy for parallelizing computations that require solving 1D recurrences along each dimension of a multi-dimensional array. Previous techniques for multipartitioning yield efficient parallelizations over 3D domains only when the number of processors is a perfect square. This paper considers the general problem of computing multipartitionings for d-dimensional data volumes on an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning onto all of the processors for this general case. Finally, we describe how we extended the Rice dHPF compiler for High Performance Fortran to generate code that exploits generalized multipartitioning and show that the compiler's generated code for the NAS SP computational fluid dynamics benchmark achieves scalable high performance.
John Mellor-Crummey, Robert Fowler, Gabriel Marin, and Nathan Tallent. HPCView: a tool for top-down analysis of node performance. The Journal of Supercomputing, 23:81--104, 2002. Extended version. Special Issue of selected papers from the 2001 Los Alamos Computer Science Institute Symposium. [ bib ]
K. Kennedy, B. Broom, K. Cooper, J. Dongarra, R. Fowler, D. Gannon,
L. Johnsson, J. Mellor-Crummey, and L. Torczon.
Telescoping languages: A strategy for automatic generation of
scientific problem-solving systems from annotated libraries.
Journal of Parallel and Distributed Computing, 61:1803--1826, 2001.
[ bib ]
As machines and programs have become more complex, the process of programming applications that can exploit the power of high-performance systems has become more difficult and correspondingly more labor-intensive. This has substantially widened the software gap---the discrepancy between the need for new software and the aggregate capacity of the workforce to produce it. This problem has been compounded by the slow growth of programming productivity, especially for high-performance programs, over the past two decades. One way to bridge this gap is to make it possible for end users to develop programs in high-level domain-specific programming systems. In the past, a major impediment to the acceptance of such systems has been the poor performance of the resulting applications. To address this problem, we are developing a new compiler-based infrastructure, called TeleGen, that will make it practical to construct efficient domain-specific high-level languages from annotated component libraries. These languages are called telescoping languages, because they can be nested within one another. For programs written in telescoping languages, high performance and reasonable compilation times can be achieved by exhaustively analyzing the component libraries in advance to produce a language processor that recognizes and optimizes library operations as primitives in the language. The key to making this strategy practical is to keep compile times low by generating a custom compiler with extensive built-in knowledge of the underlying libraries. The goal is to achieve compile times that are linearly proportional to the size of the program presented by the user, rather than to the aggregate size of that program plus the base libraries.
Guohua Jin, John Mellor-Crummey, and Robert Fowler.
Increasing temporal locality with skewing and recursive blocking.
In Proceedings of Supercomputing 2001, Denver, CO, November 2001.
[ bib ]
Effective memory hierarchy utilization is critical for high performance on modern single- and multi- processor systems. The key to performance on such systems is substantial temporal reuse. This paper focuses on strategies for improving temporal reuse in large-scale scientific codes that use iterative methods. We propose a coordinated approach for improving multi-level memory hierarchy utilization in such codes. We describe prismatic time skewing, a strategy for increasing temporal reuse in loop nests by applying multi-level time skewing. Novel aspects of this work include multi-dimensional skewing, handling for carried data dependences without additional storage, bi-directional skewing for handling periodic boundary conditions, and an interprocedural analysis and transformation strategy. We combine prismatic skewing with recursive blocking to boost reuse at all levels in a memory hierarchy. A preliminary evaluation of our techniques showed performance improvements ranging from 243% to a factor of 15 compared to the original code. Comparisons with time-skewing methods in the literature on these benchmarks show that the performance using our techniques is better by 12% to 133%. With an interprocedural application of our techniques, we were able to reduce total primary cache misses of a large application code by 23% to 27% and secondary cache misses by 45% to 119%.
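Loop skewing, one ingredient of the approach above, can be shown on a toy recurrence. The sketch below is a generic illustration of skewing (not the paper's prismatic time-skewing algorithm): a 2-D recurrence with dependences on (i-1, j) and (i, j-1) is rewritten so that all iterations on an anti-diagonal w = i + j are mutually independent, and could therefore run in parallel or be blocked for reuse.

```python
# Generic loop-skewing illustration. Each update reads its north and
# west neighbors, so iterations within a row or column are serialized,
# but all iterations with the same anti-diagonal index w = i + j are
# independent of one another.

def sweep_rowmajor(n):
    """Reference version: plain row-major loop nest."""
    a = [[1.0] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a

def sweep_skewed(n):
    """Skewed version: outer loop walks anti-diagonals w = i + j."""
    a = [[1.0] * n for _ in range(n)]
    for w in range(2, 2 * n - 1):
        # All (i, j) with i + j == w are independent of each other.
        for i in range(max(1, w - n + 1), min(n, w)):
            j = w - i
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a

assert sweep_rowmajor(6) == sweep_skewed(6)
```

The paper's contribution layers multi-dimensional and bi-directional skewing, plus recursive blocking, on top of this basic reordering idea.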
Daniel Chavarría-Miranda, Alain Darte, Robert Fowler, and John Mellor-Crummey.
On efficient parallelization of line-sweep computations.
Technical Report RR2001-45, Laboratoire de l'Informatique du
Parallélisme, École Normale Supérieure de Lyon, November 2001.
[ bib ]
Multipartitioning is a strategy for partitioning multi-dimensional arrays among a collection of processors so that line-sweep computations can be performed efficiently. With multipartitioning, computations that require solving 1-D recurrences along each dimension of a multidimensional array can be parallelized effectively. Previous techniques for multipartitioning yield efficient parallelizations over 3D domains only when the number of processors is a perfect square. This paper considers the general problem of computing optimal multipartitionings for d-dimensional data volumes on an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning for this general case, which enables efficient parallelizations of line-sweep computations under arbitrary conditions. Finally, we describe a prototype implementation of generalized multipartitioning in the Rice dHPF compiler and performance results obtains when using it to parallelize a line-sweep computation for different numbers of processors.
John Mellor-Crummey, Robert Fowler, and Gabriel Marin.
HPCView: a tool for top-down analysis of node performance.
In Proceedings of the Los Alamos Computer Science Institute
Second Annual Symposium, Santa Fe, NM, October 2001.
Distributed on CD-ROM.
[ bib ]
Although it is increasingly difficult for large scientific programs to attain a significant fraction of peak performance on systems based on microprocessors with substantial instruction level parallelism and with deep memory hierarchies, performance analysis and tuning tools are still not used on a day-to-day basis by algorithm and application designers. We present HPCView - a toolkit for combining multiple sets of program profile data, correlating the data with source code, and generating a database that can be analyzed portably and collaboratively with commodity Web browsers. We argue that HPCView addresses many of the issues that have limited the usability and the utility of most existing tools. We originally built HPCView to facilitate our own work on data layout and optimizing compilers. Now, in addition to daily use within our group, HPCView is being used by several code development teams in DoD and DoE laboratories as well as at NCSA.
John Mellor-Crummey, Robert Fowler, and David Whalley.
Tools for application-oriented performance tuning.
In Proceedings of the International Conference on Supercomputing
(ICS2001), pages 154--165, Sorrento, Italy, June 2001.
[ bib ]
Application performance tuning is a complex process that requires assembling various types of information and correlating it with source code to pinpoint the causes of performance bottlenecks. Existing performance tools don't adequately support this process in one or more dimensions. We discuss some of the critical utility and usability issues for application-level performance analysis tools in the context of two performance tools, MHSim and HPCView, that we built to support our own work on data layout and optimizing compilers. MHSim is a memory hierarchy simulator that produces source-level information not otherwise available about memory hierarchy utilization and the causes of cache conflicts. HPCView is a tool that combines data from arbitrary sets of instrumentation sources and correlates it with program source code. Both tools report their results in scope-hierarchy views of the corresponding source code and produce their output as HTML databases that can be analyzed portably and collaboratively using a commodity browser. In addition to daily use within our group, the tools are being used successfully by several code development teams in DoD and DoE laboratories.
John Mellor-Crummey, Robert Fowler, and David Whalley. On producing useful information for analyzing and tuning applications. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 2001), pages 332--333, Cambridge, Mass., June 2001. (poster). [ bib | .ps.gz ]
Daniel Chavarría-Miranda, Alain Darte, Robert J. Fowler, and John Mellor-Crummey.
On efficient parallelization of line-sweep computations.
In Proceedings of Compilers for Parallel Computers (CPC2001),
Edinburgh, Scotland, June 2001.
Also available at
[ bib ]
Multipartitioning is a strategy for partitioning multi-dimensional arrays among a collection of processors so that line-sweep computations can be performed efficiently. The principal property of a multipartitioned array is that for a line sweep along any array dimension, all processors have the same number of tiles to compute at each step in the sweep. This property results in full, balanced parallelism. A secondary benefit of multipartitionings is that they induce only coarse-grain communication. Previously, computing a d-dimensional multipartitioning required that p^(1/(d-1)) be integral, where p is the number of processors. Here, we describe an algorithm to compute a d-dimensional multipartitioning of an array of ρ dimensions for an arbitrary number of processors, for any d, 2 <= d <= ρ. When using a multipartitioning to parallelize a line sweep computation, the best partitioning is the one that exploits all of the processors and has the smallest communication volume. To compute the best multipartitioning of a ρ-dimensional array, we describe a cost model for selecting d, the dimensionality of the best partitioning, and the number of cuts along each partitioned dimension. In practice, our technique will choose a 3-dimensional multipartitioning for a 3-dimensional line-sweep computation, except when p is a prime; previously, a 3-dimensional multipartitioning could be applied only when sqrt(p) is integral. We describe an implementation of multipartitioning in the Rice dHPF compiler and performance results obtained to parallelize a line sweep computation on a range of different numbers of processors.
Bradley Broom, Rob Fowler, and Ken Kennedy.
KelpIO: A telescope-ready domain-specific I/O library for
irregular block-structured applications.
In Proceedings of the 2001 IEEE International Symposium on
Cluster Computing and the Grid, pages 148--155, Brisbane, Australia, May
2001.
Best Paper Award.
[ bib ]
To ameliorate the need to spend significant programmer time modifying parallel programs to achieve high-performance, while maintaining compact, comprehensible source codes, this paper advocates the use of telescoping languages technology to automatically apply, during the normal compilation process, high-level performance enhancing transformations to applications using a high-level domain-specific I/O library. We believe that this approach will be more acceptable to application developers than new language extensions, but will be just as amenable to optimization by advanced compilers, effectively making it a domain-specific language extension for I/O. The paper describes a domain-specific I/O library for irregular block-structured applications based on the KeLP library, describes high-level transformations of the library primitives for improving performance, and describes how a high-level domain-specific optimizer for applying these transformations could be constructed using the telescoping languages framework.
Bradley Broom, Daniel Chavarría-Miranda, Guohua Jin, Rob Fowler, Ken Kennedy, and John Mellor-Crummey. Overpartitioning with the Rice dHPF compiler. In Proceedings of the 4th Annual HPF Users Group Meeting, Tokyo, Japan, October 2000. (extended abstract). [ bib | .ps.gz | .pdf ]
Kai Zhang, John Mellor-Crummey, and Robert Fowler.
Compilation and runtime optimizations for software distributed shared memory.
In Languages, Compilers and Run-Time Systems for Scalable
Computers, 5th International Workshop, volume 1915 of Lecture Notes in
Computer Science, pages 182--191, Rochester, New York, May 2000.
[ bib ]
We present two novel optimizations for compiling High Performance Fortran (HPF) to page-based software distributed shared memory systems (SDSM). One technique, compiler-managed restricted consistency, uses compiler-derived knowledge to delay the application of memory consistency operations to data that is provably not shared in the current synchronization interval, thus reducing false sharing. (False sharing occurs when two or more processors each accesses mutually disjoint sets of data elements in the same block.) The other technique, compiler-managed shared buffers, when combined with the previous optimization, eliminates fragmentation. (Fragmentation occurs when an entire block of data is communicated to transport only a small fraction of its content.) Together, the two techniques permit compiler-generated code to efficiently apply multi-dimensional computation partitioning and wavefront parallelism to execute efficiently on SDSM systems.
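False sharing, the effect the first optimization targets, can be modelled with a toy page-ownership counter. This is a hypothetical model for illustration, not the paper's SDSM protocol: when two processors interleave writes to disjoint words of the same page, page ownership bounces on every write, while a block-partitioned access pattern, the kind of situation a compiler can prove safe, causes no transfers at all.

```python
# Toy model of false sharing in a page-based SDSM (illustrative only).
# A write to a page last written by a different processor is counted as
# a page transfer, even when the two processors touch disjoint words.

PAGE = 8  # words per page (assumed for this sketch)

def page_transfers(writes, page_words=PAGE):
    """Count ownership bounces for a trace of (processor, index) writes."""
    owner = {}
    transfers = 0
    for proc, index in writes:
        page = index // page_words
        if owner.get(page, proc) != proc:
            transfers += 1
        owner[page] = proc
    return transfers

# Procs 0 and 1 interleave writes to adjacent words: ownership bounces.
interleaved = [(i % 2, i) for i in range(32)]
# The same 32 writes, block-partitioned so each proc owns whole pages.
partitioned = [(0, i) for i in range(16)] + [(1, i) for i in range(16, 32)]

print(page_transfers(interleaved))   # many transfers despite no true sharing
print(page_transfers(partitioned))   # none
```

Restricting consistency operations to data that may actually be shared, as the abstract describes, removes exactly the bouncing that the interleaved trace exhibits.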
P. T. Koch, R. J. Fowler, and E. B. Jul. Message-driven relaxed consistency in a software distributed shared memory. In Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI), pages 75--86, Monterey, California, November 1994. [ bib | .ps.gz ]
R. J. Fowler. Architectural convergence and the granularity of objects in distributed systems. In R. Guerraoui, O. Nierstrasz, and M. Riveill, editors, Proceedings of the ECOOP '93 Workshop on Object-Based Distributed Programming, volume 791 of Lecture Notes in Computer Science, pages 36--49. Springer-Verlag, July 1994. [ bib | .ps.gz ]
R. J. Fowler and L. I. Kontothanassis. Mercury: Object-affinity scheduling and continuation passing on multiprocessors. In Proceedings of PARLE '94 (Parallel Architectures and Languages Europe), volume 817 of Lecture Notes in Computer Science, pages 661--676, Athens, Greece, July 1994. Springer-Verlag. [ bib | .ps.gz ]
P. T. Koch and R. J. Fowler. Carlsberg: A distributed execution environment providing coherent shared memory and integrated message passing. In B. Magnusson, G. Hedin, and S. Minör, editors, Proceedings of the Nordic Workshop on Programming Environment Research (NWPER'94), pages 279--294, June 1994. Available as Lund Institute of Technology Report LU-CS-TR:94-127. [ bib | .ps.gz ]
J. E. Veenstra and R. J. Fowler. The prospects for on-line hybrid coherency protocols on bus-based multiprocessors. Technical Report TR-490, University of Rochester, Computer Science Department, March 1994. [ bib ]
J. E. Veenstra and R. J. Fowler. MINT: A front end for efficient sequential simulation of multiprocessor memory hierarchies. In Proceedings of the Second International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'94), pages 201--207, Durham, North Carolina, January 1994. [ bib | .ps.gz ]
J. E. Veenstra and R. J. Fowler. MINT tutorial and user manual. Technical Report TR-452, University of Rochester, Department of Computer Science, November 1993. [ bib ]
I. Macarie and R. J. Fowler. A closed-form approximation to an MVA multiprocessor model and a new component of communication overhead. In Proceedings of ROCYCS '93, the Romanian Symposium on Computer Science, pages 335--347, Iasi, Romania, November 1993. [ bib | .ps.gz ]
A. Cox and R. Fowler. Adaptive cache coherency for detecting migratory shared data. In Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, California, May 1993. [ bib | .ps.gz ]
J.E. Veenstra and R.J. Fowler. A performance evaluation of optimal hybrid cache coherency protocols. In 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 149--160, September 1992. [ bib | .ps.gz ]
R. J. Fowler and L. I. Kontothanassis. Supporting user-level exception handling on a multiprocessor micro-kernel: Experiences with PLATINUM. In Proceedings of SEDMS-III, pages 217--232, Newport Beach, CA, March 1992. [ bib | .ps.gz ]
R.J. Fowler and L.I. Kontothanassis. Improving processor and cache locality in fine-grain parallel computations using object-affinity scheduling and continuation passing. Technical Report TR-411, University of Rochester, Department of Computer Science, 1992. [ bib | .ps.gz ]
W.J. Bolosky, M.L. Scott, R.P. Fitzgerald, R.J. Fowler, and A.L. Cox. NUMA policies and their relation to memory architecture. In Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 212--221, April 1991. [ bib ]
Alan L. Cox, Robert J. Fowler, and Jack E. Veenstra. Interprocessor invocation on a NUMA multiprocessor. Technical Report TR-356, Computer Science Department, University of Rochester, October 1990. [ bib ]
Thomas J. LeBlanc, John M. Mellor-Crummey, and Robert J. Fowler. Analyzing parallel program executions using multiple views. Journal of Parallel and Distributed Computing, 9:203--217, 1990. [ bib ]
A.L. Cox and R.J. Fowler. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 32--44, Litchfield Park, AZ, December 1989. [ bib ]
Robert J. Fowler and Alan L. Cox. An overview of PLATINUM: A platform for investigating non-uniform memory. Technical Report TR-262, Computer Science Department, University of Rochester, November 1988. [ bib ]
Robert J. Fowler, Thomas J. LeBlanc, and John M. Mellor-Crummey. An integrated approach to parallel program debugging and performance analysis on large-scale multiprocessors. In Proceedings of the SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, pages 74--182, Madison, Wisconsin, May 1988. ACM. [ bib ]
Robert J. Fowler. The complexity of using forwarding addresses for decentralized object finding. In Proceedings of the Fifth Annual ACM Symposium on Principles of Distributed Computing, pages 108--120, Calgary, Alberta, Canada, August 1986. [ bib ]
Robert J. Fowler. Decentralized Object Finding Using Forwarding Addresses. PhD thesis, University of Washington, Seattle, Washington, December 1985. (Department of Computer Science Technical Report TR85-12-1). [ bib ]
Robert J. Fowler. The analysis of a simple distributed resource finding protocol. Technical Report TR84-08-02, Department of Computer Science, University of Washington, Seattle, August 1984. [ bib ]
N.A. Lynch, M.J. Fischer, and R.J. Fowler. A simple and efficient Byzantine generals algorithm. In Proceedings of the Second Symposium on Reliability in Distributed Software and Database Systems, pages 46--52, Pittsburgh, PA, July 1982. [ bib ]
D. Dolev, M.J. Fischer, R.J. Fowler, N.A. Lynch, and H.R. Strong. An efficient algorithm for Byzantine agreement without authentication. Information and Control, 52:257--274, March 1982. [ bib ]
R.J. Fowler, A.B. Strubel, P.A. Thiemans, S.C. Vestal, M.J. Fischer, T.H. Kehl, and E.D. Lazowska. The CSL switch: A micro-computer controlled multi-computer front end. The Journal of Digital Systems, 6(3/4, Summer/Fall):265--278, 1982. [ bib ]
E. Lazowska, H. Levy, G. Almes, M. Fischer, R. Fowler, and S. Vestal. The architecture of the Eden system. In Proceedings of the Eighth Symposium on Operating Systems Principles, pages 148--159, Pacific Grove, California, December 1981. [ bib ]
R.J. Fowler, M.S. Paterson, and S.L. Tanimoto. Optimal packing and covering in the plane are NP-complete. Information Processing Letters, 12(3):133--137, June 1981. [ bib ]
S.L. Tanimoto and R.J. Fowler. Covering image subsets with patches. In Proceedings of the International Conference on Pattern Recognition, pages 835--839, Miami Beach, Fla., December 1980. [ bib ]
R. J. Fowler, M.S. Paterson, and S.L. Tanimoto. The complexity of packing and covering in the plane and related intersection graph problems. Technical Report 80-05-02, Dept. of Computer Science, University of Washington, May 1980. [ bib ]
R.J. Fowler and J.J. Little. Automatic extraction of irregular network digital terrain models. In Proceedings of SIGGRAPH '79, pages 199--207, August 1979. Also in Computer Graphics, 13(2), August 1979. [ bib ]
T.K. Peucker, R.J. Fowler, J.J. Little, and D.M. Mark. The triangulated irregular network. In Proceedings of the American Society of Photogrammetry Digital Terrain Model Symposium, St. Louis, Mo., May 1978. [ bib ]
R.J. Fowler. Approaches to multi-dimensional searching. In Proceedings of the First Intl. Advanced Study Symposium on Topological Data Structures for Geographic Information Systems, October 1977. Republished in The Harvard Papers on Geographic Information Systems, Vol. 4. [ bib ]
This file was generated by bibtex2html 1.98.