Marc Shapiro
26 July 1997
This paper deals with sharing objects in large-scale distributed
systems.
It is based on experience, both in building such systems (notably
SOS and PerDiS), and in
working with industrial application developers.
To provide sharing over the network, while improving performance and availability, objects are replicated and cached. To provide sharing over time, objects persist on disk. Replication and persistence are not just copying: they must preserve sharing (``object identity'').
Different replication policies exist; for instance, full replication vs. caching vs. no replication (stubs). Coherence algorithms differ, e.g. pessimistic in well-connected environments vs. optimistic policies for disconnected operation. Some replication is manual: users copy files, administrators restore from backup. Not all replicas are identical copies of each other; for instance, when copying an object across a security boundary you might mask out some fields. I am making two points here: (i) do not assume that replication policy is static, and (ii) do not assume that replication is performed by code inside the object.
For the user, the granule of interest is much larger, say the whole building, or maybe a storey. In system terms, this translates to a container for a set of related objects. I will call these containers clusters hereafter. Think of a cluster as a file or a disquette, also acting as an independent sub-heap.
The cluster is the right level for naming (the user's entry point into the graph) and for setting policies. For instance, if I set the protection attributes for my cluster to non-world-readable, then it's just as if it was stored on a ejected disquette. Even if there are pointers from another, readable document into this one, you can't access it.
The cluster is also the right level for setting replication policies: since users access documents, then replicate documents, not the individual component objects. This is not only easy to grasp for the end-user, it also makes life much simpler for the application programmer. The application programmer just worries about programming the application-related functionality of objects, and storing them into clusters. He doesn't need to worry about putting replication code (and all the related stuff, such as concurrency control, version management, commitment, etc.) into each individual object. Instead, the principle of separation of concerns is applied: replication code is written by the systems programmer, and attached to the systems-level entity, the cluster. Application objects can evolve independently from policy objects, and policy objects can be re-used in different applications.
Clusters help with performance and scalability. It is much more efficient to do I/O and remote copies of a set of objects that will be used together. Clusters break up the system into components that are small enough to be manageable, but not so small that the management overhead becomes unbearable. Each cluster is garbage-collected independently of others.
To summarize:
A policy is an object. It may be stored either inside or outside the cluster it manages. Similarly for class objects.
A further step would be to provide a hierarchical nesting of clusters. This would provide a number of benefits, particularly a clean and efficient way of doing multi-level concurrency control. It raises some design problems, which I have not thought out: for instance can two sub-clusters within a cluster can grow without limit, or how policies at different levels of the hierarchy interact.
The first problem is often solved by serializing the whole object graph that A refers to. This approach does not preserve sharing semantics. If object C is made persistent, which also points to B, then when A and B are read back into memory, there will be two independent copies of C. The second problem, as far as I know, is currently not taken into account.
A much more natural model is persistence by reachability, where any object reachable from a root of persistence (an entry point into the persistent object graph) becomes persistent. This is the job of a garbage collector. Distributed GC for a replicated store must take into account replication and concurrency control. The PerDiS project has shown that such a GC is indeed feasible.
Application developers should be able to use the remote invocation layer directly if the shared-state facilities are insufficient for their purpose. For the kinds of applications I described, I believe direct access to remote invocation will hardly ever be necessary. Remote invocation is harder to use than shared state; in particular, active replication (replicating invocations) poses some tricky problems.
The PDS architecture proposed here is similar to PerDiS.
In SOS, replication is intimately tied in with remote references and remote invocation. The general scenario is the following. A process acquires an object reference as a result of some remote invocation. Before the first invocation, the system causes the reference to be bound. This means that the server is upcalled at a special invocation point; it returns a proxy object that is then instantiated in the client's address space.
Thus, replication policy is decided at bind time by the server. The simplest policy (and the default) is that the server installs a stub that does remote invocations on the server. An alternative is to install a replica or a cache of the server's state.
The binding up-call is a powerful concept that makes SOS infinitely configurable (but of course raises a lot of security problems which we did not address at the time). However, the object model makes it very hard to use. The server can install a single proxy, of a class that must be known at the time the server is compiled; this forces the programmer to make policy decisions much too early in the design process. We tried to use inheritence to separate out the intrinsic semantics of the object from the replication policy and to re-use replication code, but this effort failed because of the limitations of inheritence.
After the demise of SOS, we started a research project in which the system provides an extensible library of replication policy objects. The application programmer can compose his application-specific, replication-independent objects, with a replication policy object. All replication objects have the same interface, therefore it is possible to select a policy at run time.
SOS supported explicit persistence on output and automatic on input. The programmer explicitly asks to copy an object to disk to make it persistent; SOS copies object state onto disk using the same mechanism as for remote copy. When following a reference to a persistent object, object faulting loads it into memory automatically.
Object faulting is a big win for programmers.
In contrast, explicit persistence is too complex even for very
simple-minded applications.
Figuring out how much of the object graph to copy, maintaining
sharing (object identity), and removing unreferenced persistent
objects, are a nightmare.
We conclude from this experience that the only useable model is
persistence by reachability.
PerDiS
PerDiS is an Esprit
Long-Term Research project to build a Persistent Distributed
Store (PDS) for cooperative CAD applications.
The application area is the building and construction industry.
The requirements are ease of use, efficient data access, security,
fault tolerance, and large-scale operation.
We predict that the usage pattern will be such that there is
little write contention.
A PDS is a big, persistent, shared, distributed memory. Conceptually, an application process maps in the memory, allocates objects in it, assigns and navigates pointers; acessibility is restricted only by access rights. Any reachable object becomes persistent; unreachable objects are deleted by the distributed garbage collector. Class information is stored in the store just like any other object (there is a hidden pointer from any object to its class).
A cluster is an arbitrary set of objects; a program allocating an object indicates in which cluster it is to be stored. The cluster, not the object, is the granule presented to the casual user, much like a file in Unix: a cluster has a name, protection attributes, etc. The cluster is governed by objects that dictate the policies applied to all the data objects in that cluster, such as allocation and deallocation algorithms, concurrency control and swizzling algorithms; in a future version there should be per-cluster replication policies too.
Conceptually, each application process gets its own working copy of the memory to modify locally. Updates are accepted at commit time only if the user running the process has the appropriate access rights, and the updates do not conflict with another transaction.
The current version of PerDiS is written in C++ and supports C++ applications. Three different proof-of-concept implementations exist; we are currently implemening a PerDiS platform which will be used by real application developers, to be made public at the end of 1997. It is layered upon a remote invocation mechanism called SSP Chains. SSP Chains are cheap and fault tolerant. They support object migration.