Issues in Compiler Research
Position Paper for NSF Workshop on
New Challenges and Directions for Systems Research
St. Louis, MO, July 31 - August 1, 1997
Siddhartha Chatterjee
Department of Computer Science
The University of North Carolina at Chapel Hill
Introduction
The rapid development of computer and communication systems and the
growth of innovative applications provide new challenges and
opportunities to the research area of programming languages,
compilers, and runtime systems. This research area, being at the
nexus of systems research, will be critical to the development of
applications that exploit high levels of abstraction and functionality
and achieve high levels of performance.
I see the following as key problems in this area of research.
- Management of deep memory hierarchies.
- Aggressive optimization of large applications in the presence of
abstraction boundaries, protection boundaries, and separately
compiled modules.
- Use of domain-specific information to optimize the transmission,
storage, and interpretation of object code and multimedia data.
The following sections describe these items in greater detail.
Memory hierarchy management
The increasing disparity between CPU and memory speeds over the past
two decades has resulted in the use of multi-level memory hierarchies
that attempt to exploit locality of reference in programs and hide
memory latencies. The performance of many of today's complex
applications is determined by the characteristics of the memory
subsystem rather than by the operation count of the program, and there
is often a dramatic difference between the peak advertised CPU
performance and the observed performance of a given application. The
widening gap between processor and memory performance is likely to
continue, resulting in even deeper memory hierarchies.
An integrated solution to the memory hierarchy management problem must
deal with resource allocation and management issues at all levels of
the hierarchy: good interprocedural allocation of machine registers,
spatial and temporal locality in multiple levels of data cache, code
layout in the instruction cache, virtual-to-physical address mappings
in the translation lookaside buffer, page-level locality in main
memory, I/O buffers in secondary storage, and local/remote memory and
memory consistency issues in distributed shared memory systems. The
state of practice is that users end up restructuring programs manually
to improve memory performance, and that the tools available for this
task are limited to specific subproblems or to subclasses of programs.
The lack of automatic tools for memory hierarchy management has
obvious negative consequences for the readability, portability, and
maintainability of code. A typical example of this phenomenon is seen
in high-performance mathematical library codes such as ScaLAPACK,
where much of the complexity of a typical routine is independent of
the numerical computing issues and arises instead from trying to do a
good job of load balancing and memory management.
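As a concrete illustration of the manual restructuring involved,
consider tiling a matrix multiplication for the data cache. The
following C fragment is a minimal sketch, not a tuned implementation;
the tile size TB stands in for a machine-dependent parameter that
would have to be chosen per platform.

    /* Naive triple loop: rows of a and columns of b are streamed
       through the cache once per element of c, with little reuse. */
    void matmul(int n, double *a, double *b, double *c)
    {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

    /* The same computation tiled so that TB-by-TB blocks are reused
       while they remain cache-resident.  TB is an assumed,
       machine-dependent tuning parameter. */
    #define TB 32
    void matmul_tiled(int n, double *a, double *b, double *c)
    {
        int i0, j0, k0, i, j, k;
        for (i0 = 0; i0 < n; i0 += TB)
          for (j0 = 0; j0 < n; j0 += TB)
            for (k0 = 0; k0 < n; k0 += TB)
              for (i = i0; i < i0 + TB && i < n; i++)
                for (j = j0; j < j0 + TB && j < n; j++)
                  for (k = k0; k < k0 + TB && k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

The tiled version computes exactly the same result, but its loop
structure no longer resembles the mathematics; multiplied across an
entire library, this is the complexity referred to above.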
The state of the art in tools for this problem varies widely.
- Global and interprocedural register allocation techniques are
mature, and are available in many research and commercial
compilers.
- Automatic loop tiling techniques for data cache locality[10] can
handle a limited class of programs (typically, polyhedral loop
nests in scientific codes accessing array data with affine index
expressions), and do not cleanly handle multiple levels of caches.
- Code layout techniques for improved instruction cache behavior are
mostly ad hoc and yield only modest performance improvements.
- Paging algorithms for virtual memory are largely demand-driven,
and typically do not incorporate any information that compiler
analysis might reveal. Mechanisms exist for pinning down pages,
but the use of these mechanisms is largely left to the user.
- A large body of work exists in the area of out-of-core algorithms,
but this community has been largely theoretically inclined and
interacts very little with other areas of systems research.
(Cormen[5] and Choudhary[8] are exceptions to this general trend.)
- Intelligent prefetching of pages in a global memory environment[4]
is an active area of operating systems research, but proceeds in a
vacuum, as current techniques for predicting page reference
patterns are very rudimentary.
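To make the last two points concrete, the following fragment sketches
how compiler analysis might inform the pager through an advisory
interface such as the BSD/POSIX madvise call. The analysis itself and
the placement of the call are assumptions; this is an illustration,
not a description of any existing compiler.

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hypothetical compiler-inserted advice: static analysis has
       established that a[] is about to be traversed sequentially,
       so an advisory call is emitted ahead of the loop, letting the
       pager prefetch instead of faulting on demand.  (Real systems
       typically require the address to be page-aligned.) */
    double sum_sequential(const double *a, size_t n)
    {
        double s = 0.0;
        size_t i;

        madvise((void *)a, n * sizeof(double), MADV_SEQUENTIAL);

        for (i = 0; i < n; i++)
            s += a[i];
        return s;
    }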
Good memory locality is a prerequisite for high performance. The
fundamental challenges in this problem are twofold:
- To develop a mixture of static and dynamic tools to understand and
improve the data locality pattern of a larger class of programs.
See Pauca et al.[12] for initial attempts in this direction.
- To develop an integrated framework that reveals the inclusion
properties among successive levels of the memory hierarchy and
exploits these properties to develop unified solutions that scale
smoothly across many orders of magnitude of memory size and access
latency. See Bilmes et al.[1] and Kodukula et al.[9] for initial
attempts in this direction.
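The second challenge is illustrated by divide-and-conquer
formulations in the spirit of Pauca et al.[12]: recursive subdivision
eventually produces a working set that fits in each level of the
hierarchy in turn, without naming any cache parameter explicitly. The
following C sketch of a recursive matrix transpose shows the idea;
the leaf cutoff of 16 is an arbitrary tuning assumption.

    /* Transpose the nr-by-nc submatrix of A whose top-left corner is
       (ri, ci), writing into B; both are n-by-n and row-major.  The
       longer axis is split at each step, so subproblems stay roughly
       square and become resident at every level of the hierarchy in
       turn. */
    static void transpose_rec(int ri, int ci, int nr, int nc,
                              const double *A, double *B, int n)
    {
        if (nr <= 16 && nc <= 16) {
            int i, j;
            for (i = ri; i < ri + nr; i++)
                for (j = ci; j < ci + nc; j++)
                    B[j*n + i] = A[i*n + j];    /* B = A^T */
        } else if (nr >= nc) {
            transpose_rec(ri, ci, nr/2, nc, A, B, n);
            transpose_rec(ri + nr/2, ci, nr - nr/2, nc, A, B, n);
        } else {
            transpose_rec(ri, ci, nr, nc/2, A, B, n);
            transpose_rec(ri, ci + nc/2, nr, nc - nc/2, A, B, n);
        }
    }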
Progress in this area will depend on collaboration among several
research areas of computing systems.
- Programming languages to allow users to convey domain-specific
knowledge about locality at a high level and to design
memory-friendly and latency-tolerant programs.
- Compilers and static analysis tools to identify locality patterns,
and to jointly restructure code and data to improve locality.
- Operating systems to provide mechanisms for thread scheduling and
data prefetching, with appropriate hooks to incorporate results of
static analysis.
- Computer architecture to provide features that support latency
tolerance and memory prefetching, to provide feedback to the
application about its memory access patterns, and to allow the
development of customized memory consistency protocols.
Performance in the face of abstraction
The growth of problem solving environments and object-oriented
programming techniques means that programs are increasingly being
written as collections of modules rather than as monolithic entities.
Application programs are often little more than the glue that binds
together calls to library routines (e.g., OpenGL in graphics
programming; the Java API class libraries; and numerical libraries
such as LAPACK). A similar layering exists in operating systems,
although the reason there is often protection rather than abstraction.
Such layering and modularity are indispensable for controlling
complexity and ensuring the safety of applications, but pose a
difficult challenge for delivering high performance from such
applications[2].
The following characteristics of this problem make it both challenging
and profitable.
- Individual components can be very small. Many object methods are
very simple, perhaps a dozen or so instructions. Existing studies
show that the characteristics of object-oriented programs differ
significantly from those of traditional procedural programs[3].
- Separate compilation is indispensable for developing large
software systems. A program may not even exist in its entirety
until execution time. This is further complicated by issues such
as dynamic linking, proprietary third-party libraries, and
untrusted components in a network environment.
- Different components of an application can be written in different
languages, making cross-language interoperability a key issue.
- Opportunities for information propagation exist across relatively
long call chains, enabling optimizations such as the removal of
redundant safety and argument validation checks or aggressive
interprocedural register allocation.
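The last point can be made concrete with a small, deliberately
contrived C sketch; the vec abstraction and all names in it are
hypothetical. The safety check inside the component is redundant at
this call site, but only an analysis that crosses the abstraction
boundary can prove it.

    #include <stdlib.h>

    /* A safe-vector component: every access is bounds-checked. */
    typedef struct { double *data; int len; } vec;

    double vec_get(const vec *v, int i)
    {
        if (i < 0 || i >= v->len)   /* safety check */
            abort();
        return v->data[i];
    }

    /* Client code: the loop bounds already guarantee 0 <= i < v->len,
       so after inlining across the module boundary the check in
       vec_get is provably redundant on every iteration and can be
       eliminated. */
    double vec_sum(const vec *v)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < v->len; i++)
            s += vec_get(v, i);
        return s;
    }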
Breakthroughs in this problem would have widespread impact in diverse
areas, ranging from faster processing of communication protocol stacks
to speeding up the verification process for Java bytecodes.
Progress in this problem needs to be made at different levels.
- The organization of separately compiled modules needs to be
rethought. In particular, separately compiled modules need to
provide more information about optimization opportunities,
in the form of program-level invariants and assertions as well as
information about resource usage.
- Separately compiled modules must support specialization (cloning)
of routines based on conditions existing at call sites, as
sketched after this list. This could be supported by maintaining
additional representations of the program that could be used to
generate such specialized versions on demand.
- The techniques of proof-carrying code[11] might be adapted to
specify optimization opportunities in addition to safety
properties.
- Visual programming systems can allow the user to specify code
properties relevant to optimization that might be difficult to
recognize automatically.
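As an illustration of the specialization point above, the following
hypothetical C sketch shows the kind of clone a compiler might
generate on demand for call sites where an argument is known to be
constant; the routine and the chosen constant are assumptions made
for the example.

    /* General library routine: copies n elements at arbitrary stride. */
    void strided_copy(double *dst, const double *src, int n, int stride)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i * stride];
    }

    /* Clone specialized for call sites where stride == 1: the
       multiplication disappears, and the unit-stride loop becomes a
       candidate for further optimization such as unrolling or
       substitution of a block-copy primitive. */
    void strided_copy_s1(double *dst, const double *src, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i];
    }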
There is a need to develop consensus and standards within the compiler
community about the kinds of capabilities that separately compiled
modules must support. The benefits of standards for representing such
information are clear.
Domain-specific compression
With the spread of network computing, there is a growing need for
efficient encodings of multimedia data and object code in order to
conserve network resources during transmission, disk resources during
storage, and processor resources during interpretation. In this
section I will address a restricted form of this issue as it applies
to code (such as Java class files) that a client may download from a
server.
Text compression techniques can be applied naively to object code
files, but more leverage can be gained by exploiting special
characteristics of instruction and data streams. Ernst et al.[7] have
demonstrated greater improvements by exploiting semantic information
in instruction streams, but this work has barely scratched the
surface of what is possible in this area.
Many research opportunities exist here, all with potentially high
payoff.
- The coding approaches examined so far are ad hoc. It remains to
be seen what further gains are possible using advanced statistical
coding techniques such as those based on Markov models.
- As a possible alternative to JIT compilation, one could adaptively
identify "superoperators" (common sequences of instructions,
possibly parameterized) in the program, and use them in the
encoding process. The client side could dynamically generate
native code or otherwise optimize the implementation of
superoperators to speed up the interpretation process.
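The superoperator idea can be sketched in a few lines of C. The
bytecode instruction set, the opcode values, and the particular fused
sequence below are all hypothetical.

    /* A toy stack interpreter with one superoperator.  Suppose
       profiling shows the sequence LOAD; LOAD; ADD to be common:
       the encoder assigns it a single opcode, shrinking the
       transmitted code and saving two dispatches per occurrence. */
    enum { OP_LOAD, OP_ADD, OP_LOAD_LOAD_ADD, OP_HALT };

    double run(const unsigned char *code, const double *env)
    {
        double stack[64];
        int sp = 0;
        for (;;) {
            switch (*code++) {
            case OP_LOAD:                /* push env[slot] */
                stack[sp++] = env[*code++];
                break;
            case OP_ADD:                 /* pop two, push their sum */
                sp--;
                stack[sp-1] += stack[sp];
                break;
            case OP_LOAD_LOAD_ADD: {     /* fused superoperator */
                double x = env[*code++];
                double y = env[*code++];
                stack[sp++] = x + y;
                break;
            }
            case OP_HALT:
                return stack[sp-1];
            }
        }
    }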
Conclusions
Compilers and related technologies have a key role to play in the next
generation of systems and applications. A good example of this is the
work of Eide et al.[6], in which fairly standard compiler
optimizations were applied to RPC stub generation to demonstrate
impressive gains in end-to-end performance. Similar opportunities
undoubtedly exist in other application areas. The compiler community
needs to be driven by the problems faced by such new application
domains.
Bibliography
- J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel, "Optimizing
Matrix Multiply using PhiPAC: A Portable, High-Performance, ANSI
C Coding Methodology", Proceedings of International Conference on
Supercomputing, Vienna, Austria, July 1997. To appear.
- J. C. Brustoloni, "Effects of Buffering Semantics on I/O
Performance", Proceedings of OSDI'96, USENIX, Seattle, WA,
pp. 277-291, October 1996.
- B. Calder, D. Grunwald, and B. Zorn, "Quantifying Behavioral
Differences Between C and C++ Programs". Journal of Programming
Languages 2(4), pp. 313-351, 1994.
- P. Cao, E. W. Felten, A. R. Karlin, and K. Li, "A Study of
Integrated Prefetching and Caching Strategies", Proceedings of
ACM SIGMETRICS'95, pp. 188-197, May 1995.
- T. H. Cormen and M. Hirschl, "Early Experiences in Evaluating the
Parallel Disk Model with the ViC* Implementation". To appear in
Parallel Computing. Also Dartmouth College Computer Science
Technical Report PCS-TR96-293, August 1996.
- E. Eide, K. Frei, B. Ford, J. Lepreau, and G. Lindstrom, "Flick:
A Flexible, Optimizing IDL Compiler", Proceedings of PLDI'97,
pp. 44-56, June 1997.
- J. Ernst, W. Evans, C. W. Fraser, S. Lucco, and T. A. Proebsting,
"Code Compression", Proceedings of PLDI'97, pp. 358-365, June
1997.
- M. Kandemir, J. Ramanujam, and A. Choudhary, "Improving the
Performance of Out-of-Core Computations", Proceedings of the
International Conference on Parallel Processing, Bloomingdale, IL,
August 1997. To appear.
- I. Kodukula, N. Ahmed, and K. Pingali, "Data-Centric Multi-Level
Blocking", Proceedings of PLDI'97, pp. 346-357, June 1997.
- M. S. Lam, E. E. Rothberg, and M. E. Wolf, "The Cache Performance
and Optimization of Blocked Algorithms", Proceedings of ASPLOS
IV, pp. 63-74, April 1991.
- G. C. Necula and P. Lee, "Proof-Carrying Code", Technical Report
CMU-CS-96-165, Carnegie Mellon University, September 1996.
- V. P. Pauca, X. Sun, S. Chatterjee, and A. R. Lebeck,
"Architecture-efficient Strassen's Matrix Multiplication: A Case
Study of Divide-and-Conquer Algorithms", Proceedings of the
International Linear Algebra Society (ILAS) Symposium on Fast
Algorithms for Control, Signals and Image Processing, Winnipeg,
Manitoba, Canada, June 1997.