Simple GC-Safe Compilation Hans-J. Boehm (boehm@xerox.com) Xerox PARC 8/13/91 It is increasingly attractive to compile programming languages requiring garbage collection into a standard intermediate language. This allows readily available translators for the intermediate language to be used in targeting an increasing variety of architectures, and thus allows an implementation that is immediately portable across a variety of architectures. The intermediate language most commonly used for this purpose is the C programming language. There are translators to C from C++, Modula-3, Eiffel, Sather, Scheme, Common Lisp, and others. In several cases, these are the only currently viable implementations of the respective language. The reason for this is no doubt that C compilers target more different architectures than translators for other candidate languages. Furthermore every major hardware manufacturer provides a C compiler, which can thus take advantage of all instruction timing information known to the hardware manufacturer. This makes it difficult for third party compilers for other languages to provide comparable performance. The following proposal also applies to other intermediate languages. However, due to the dominance of C as an intermediate language, we cast the proposal in terms of C. We are careful to state the proposal such that it is also applicable to hand-written C, both because this is useful in itself, and because any other proposal would be unrealistic. In order to facilitate the discussion, we refer to the original source language as L. A fundamental difficulty with using standard C in conjunction with a garbage collector is that the C compiler is unlikely to cooperate with the collector in any major way. It cannot be expected to generate tables describing data structure layout to the garbage collector, or to adhere to conventions that clearly identify pointers. In fact, it has no way to generate such information, since it is often difficult to generate or write C programs with reliable type information in the C source. Most C-generating compilers appear to emit misleading C type information. However, the C compiler is free to choose it's own data structure layout, at least for function activation records, thus making it essentially impossible for the L-to-C compiler to emit such information. A number of techniques have been proposed to eliminate the need for compiler generated data layout information. Edelson proposes a method based on explicit root registration [Edelson 90]. Most other approaches rely on conservatively traversing at least the program stack and registers, treating all bit patterns that correspond to object pointers as though they were objects pointers. Currently all approaches exhibit at least one of the following problems: 1. They force all local variables that may be root pointers to heap allocated data structures into memory, and out of machine registers. On modern architectures this is likely to be expensive, and the cost is not proportional to the amount of allocation. Edelson's collector and conservative collectors relying on C language ``volatile'' declarations exhibit this problem. 2. They fail to work correctly in a multi-threaded environment in which collection may occur asynchronously with a client process. This generally applies to fully copying collectors that use a C-like intermediate language, since it is impossible to ensure that the generated code will not maintain unrecognizable copies of pointers, for example by copying a registered root into a machine register. 3. They may collect reachable objects (or relocate them without correctly updating pointers) if the C compiler performs optimizations that disguise pointers. This may happen in spite of the fact that transformations performed by the C compiler are normally considered legal. The rest of this paper addresses solutions to the third problem for collectors that do not exhibit the other two. These include the collectors described in [Bartlett 88], [BoehmWeiser 88], [BoehmDemersShenker 91], and reference count collectors such as the one described in [Rovner 84]. Those in [BoehmWeiser 88] and [BoehmDemersShenker 91] are suitable if interoperability with hand-written C code is an issue. We now describe the problem of unsafe compiler optimizations in more detail, and present a proposed solution. The solution consists of both a definition of ``garbage-collector-safe'' C-code, and of ``garbage-collector-safe'' compiler optimization. The latter ensures that the safety properties required of the input program are preserved in the object program. The goal of this proposal is to guarantee safe garbage collection of efficient code, while requiring a minimum of modification to compiler back ends. We would have preferred to require no modification. But only an adaptation of Edelson's root registration approach to not-fully-copying collection seems likely to make this feasible. However, we are unhappy with its likely cost in code quality and interoperability with hand-written C code. We are not addressing situations in which it might be necessary to make extensive modifications to the compiler back end for other reasons, for example to provide exact pointer identification for object persistence. This is a minimalist proposal to provide for safe garbage collection. Problem Description Currently C compiler optimizations are often unsafe in the presence of garbage collection. For example, on a SPARC, the loop in the code fragment: int *a, *b; int i, sum; a = (int *)malloc(100000 * sizeof (int)); b = (int *)malloc(100000 * sizeof (int)); ... for (i = 0; i < 100000; i++) { sum += a[i] + b[i]; } /* a, b, and i dead here */ could profitably be compiled to something like diff = b-a; /* diff reuses b's register, which is now dead. */ for(aptr = a; aptr < a +100000; a++) { sum += *aptr + *(aptr+diff); } This is profitable on a SPARC because the aptr+diff addition is free, since it becomes part of a doubly indexed load instruction. Thus this stunt saves an increment instruction in the loop, and no shift operations are necessary. The result is that b does not appear accessible to a garbage collection occurring inside the loop. (In a single-threaded system, we would need a function call inside the loop to trigger the collection.) SPARC compilers actually do not yet produce the above code. However, simpler versions of the problem do appear in other contexts. It is possible to contrive examples that fail with existing SPARC compilers. For a simpler example, consider the function f in: struct s { char space[35000]; struct s * next; }; struct s * f(x) struct s * x; { return(x -> next); } On an IBM RS/6000, this results in the following code (extracted from a compiler produced listing): AIU r3=r3,1 L r3=SHADOW$(r3,-30536) BA lr The first instruction adds 1 to the upper half of the argument. If a garbage collection is triggered at this point, e.g. by a concurrent thread, the only reference to x consists of x + 65536 stored in r3. The following load instruction then supplies an offset of -30536 to compensate for the initial overshoot. In general, there is nothing in the C language standard that guarantees that a pointer to an object be recognizable until the object is dereferenced, a property that is clearly required by a garbage collector. The same comment generally applies to compiler intermediate languages that were designed for use in compilers that do not expect a garbage collector as part of the run-time system. Proposal The value returned by an allocation function (e.g. malloc) is a base pointer. A base pointer is not necessarily the base address of an object, but the garbage collector must be able to easily convert it to one. (E.g. many standard malloc implementations store bookkeeping information at the beginning of an object, and return a displaced pointer. A Scheme implementation might add one to a pointer to distinguish it from a 31-bit integer.) We may perform arithmetic or logical operations on a base pointer to derive another value that may still be used to dereference the original object. If a value derived from a base pointer is still convertible to the base address of the object by the garbage collector, then we still consider the result to be a base pointer. If not, then it is a derived pointer. Any value computed from a derived pointer and that may still be used to access the object is also a derived pointer. (This is true even if its value becomes identical to a base pointer to the object. This aspect of the definition is unimportant to the client, but simplifies the task of GC-proofing existing compilers.) The idea behind this proposal is that the object code for every routine should guarantee that whenever it constructs a derived pointer to an object, it also keeps a corresponding base pointer from which the derived pointer was generated. The lifetime of the base pointer must be at least that of the derived pointer. In order to make it feasible for a conventional compiler to maintain this property, we require that no C function either returns, or stores a derived pointer it created. (It created the derived pointer if it computed it by logical or arithmetic operations from another derived or base pointer.) Here we take ``storing a pointer'' to include assigning it to a variable located in the data or bss segment, or storing it in the heap. We refer to this condition as constraint I. We expect the C programmer (or C program generator) to produce code that satisfies constraint I, as well as the following two constraints: (II) No derived pointer shall be assigned to an automatic variable with larger scope than all of the variables containing base pointers from which it was derived. (III) No derived pointer shall be passed to a function, if that function may store the derived pointer. Note that the definition of base and derived pointers is dependent on a particular collector. However, given constraint I, a compiler may conservatively assume that all values returned by functions or passed as parameters are (nonpointers or) base pointers. Any values computed from such pointers should be treated both as derived pointers and also as a potential base pointer for further derived pointers. (Derived pointers may be passed as parameters. But it is safe to treat them as base pointers inside the function, since the caller is required to maintain a corresponding base pointer.) Unfortunately, the notion of ``pointer'' is not well-defined in C. The most useful notion is probably that any pointer type, or an integer large enough to hold a pointer, is considered to be a pointer. We propose a compiler option (GCSAFE) that guarantees the following: the optimizer does not introduce derived pointers whose lifetime exceeds that of the corresponding base pointers, and that the optimizer does not remove as a dead variables the last base pointer to an object, while derived pointers to the same object may still be live. More precisely, this option guarantees that the object code satisfies the (low level analog) of the above three conditions, provided the C source code does. In addition, this option guarantees two properties that are satisfied by most compilers on modern architectures in any case: 1. All pointers are represented by contiguous storage aligned on a (sizeof (long)) boundary. (On most modern machines this should be done for performance or correctness reasons in any case. For historical reasons, it's not true on Sun 3s.) 2. A pointer assignment will cause a write operation to the memory cell containing the pointer. If xis an uninitialized pointer variable, it is safe to compile an assignment of y to x as (using C notation) if (x != y ) x = y Though there are presumably no compilers that do this, the resulting code would break a generational collector such as the one in [BoehmDemersShenker91]. Note that it is usually possible for a compiler to verify that a source program satisfies the above constraints. There are two exceptions. Certain pointers that appear to the C compiler to be derived may actually be base pointers. This may happen if the collector has a very general notion of base pointer. Second, it may not be known whether a function stores a derived pointer passed as an argument. Both can be expected to rarely cause problems, and can be handled by an overly cautious compiler that recognizes pragmas that allow the programmer to declare that a parameter is not saved, or that a variable in fact contains a base pointer. Implementation and Consequences Compilation with GCSAFE should have a minimal effect on performance for machines with a large register set. Certain variables must be maintained longer than otherwise necessary. Thus some additional register pressure is introduced, but no additional operations (except for register spills) are needed. Note that most loop indices and the like can easily be determined to not point to the heap, and are thus unaffected. The restriction on functions makes it possible to do some reasonably precise analysis even with separate compilation. Parameters may still have to be kept live longer than would otherwise be necessary. Certain programming styles become impossible in this environment. Functions returning pointers to the middle of strings are the most common example. For example, the C library functions index and rindex should not be used with a garbage collector. However, variants that return the index of the desired character, rather than a pointer to it, are perfectly safe. A reasonable implementation strategy for the compiler writer is to annotate dereferences in the intermediate code with a set of base pointer locations, one for each possible derived pointer that might be dereferenced. Thus given the following code segment a := malloc( ... ); d := a; c := f(); if ( ... ) { b := a + 4; } else { b := c + 8; } x := *b the final dereference of b would be annotated with the pair of base pointer locations (a, c). Equivalently d could be used instead of a, since it is also a copy of the base pointer from which b was derived. Whenever more than one base pointer is available, one of those known to definitely be a base pointer, and still in scope at the point of reference is chosen. If no such base pointer is available, all possible base pointers are added as annotations (and a warning may be issued). Procedure calls involving a derived pointer as an argument must be treated as dereferences. Annotations on dereferencing operations are viewed as uses of the base pointers for purposes of flow analysis, especially for dead variable analysis. They do not otherwise appear in the generated code. Since the lifetime of derived pointers can only end at dereferences of the derived pointer, many instruction reorderings and other low level code optimizations can be performed without any analysis of pointer lifetimes. Preliminary experiments indicate that it is possible to fill most delay slots in SPARC assembly code using only trivially safe instruction interchanges, together with some transformations that don't affect the order in which operations are executed. References [Bartlett 88] Bartlett, Joel F., Compacting Garbage Collection with Ambiguous Roots, DEC WRL Research report 88/2, Digital Equipment Corporation Western Research Laboratory, February, 1988. [BoehmWeiser 88] Boehm, Hans-J., and M. Weiser, ``Garbage Collection in an Uncooperative Environment'', Software Practice and Experience 18, 9 (September 1988), pp. 807-820. [BoehmDemersShenker 91] Boehm, Hans-J., Alan J. Demers, Scott Shenker, ``Mostly Parallel Garbage Collection'', ACM SIGPLAN 91 Conference on Programming Language Design and Implementation, SIGPLAN Notices 26, 6 (June 1991), pp. 157-164. [Edelson 90] Edelson, ``Dynamic Storage Reclamation in C++'', University of California at Santa Cruz technical report UCSC-CRL-90-19, June 1990. [Rovner 84] Rovner, Paul, ``On Adding Garbage Collection and Runtime Types to a Strongly-Typed, Statically Checked, Concurrent Language'', Report CSL-84-7, Xerox Palo Alto Research Center.