Shared State in a Large-Scale Distributed System

Marc Shapiro
26 July 1997

This paper deals with sharing objects in large-scale distributed systems. It is based on experience, both in building such systems (notably SOS and PerDiS), and in working with industrial application developers.

Shared State

The distributed system is large graph of complex objects, linked by references. An object has state and behavior. Any object can point to any other object, but effective access to an object is limited by access and concurrency control. The dominant access mode is searching for an entry point into the graph, then navigating the references from there. This kind of architecture is appropriate for interactive and group-oriented tasks, such as cooperative computer-aided engineering (ignoring real-time conferencing-like tools), and viewing the artifacts.

To provide sharing over the network, while improving performance and availability, objects are replicated and cached. To provide sharing over time, objects persist on disk. Replication and persistence are not just copying: they must preserve sharing (``object identity'').

Different replication policies exist; for instance, full replication vs. caching vs. no replication (stubs). Coherence algorithms differ, e.g. pessimistic in well-connected environments vs. optimistic policies for disconnected operation. Some replication is manual: users copy files, administrators restore from backup. Not all replicas are identical copies of each other; for instance, when copying an object across a security boundary you might mask out some fields. I am making two points here: (i) do not assume that replication policy is static, and (ii) do not assume that replication is performed by code inside the object.

Granularity

To set up the application functionality, the application programmer deals with fine-grain objects. For instance a architect's CAD tool deals with walls (containing x, y, z coordinates, plus maximum load, fire resistance, coating, etc); setting a pointer from one wall to a rectangle creates a doorway; pointing to another parallelepiped makes the wall rest on a floor, etc.

For the user, the granule of interest is much larger, say the whole building, or maybe a storey. In system terms, this translates to a container for a set of related objects. I will call these containers clusters hereafter. Think of a cluster as a file or a disquette, also acting as an independent sub-heap.

The cluster is the right level for naming (the user's entry point into the graph) and for setting policies. For instance, if I set the protection attributes for my cluster to non-world-readable, then it's just as if it was stored on a ejected disquette. Even if there are pointers from another, readable document into this one, you can't access it.

The cluster is also the right level for setting replication policies: since users access documents, then replicate documents, not the individual component objects. This is not only easy to grasp for the end-user, it also makes life much simpler for the application programmer. The application programmer just worries about programming the application-related functionality of objects, and storing them into clusters. He doesn't need to worry about putting replication code (and all the related stuff, such as concurrency control, version management, commitment, etc.) into each individual object. Instead, the principle of separation of concerns is applied: replication code is written by the systems programmer, and attached to the systems-level entity, the cluster. Application objects can evolve independently from policy objects, and policy objects can be re-used in different applications.

Clusters help with performance and scalability. It is much more efficient to do I/O and remote copies of a set of objects that will be used together. Clusters break up the system into components that are small enough to be manageable, but not so small that the management overhead becomes unbearable. Each cluster is garbage-collected independently of others.

To summarize:

A most important policy is how to convert between the in-memory representation and the external one. A reasonable default is write out raw binary; on input, pointers are swizzled. An alternative policy is to query the objects in the cluster, for instance to avoid writing data that can be reconstructed (``transient'' data). Users could write new policies, for instance in order to connect seamlessly to a legacy database or file representation.

A policy is an object. It may be stored either inside or outside the cluster it manages. Similarly for class objects.

A further step would be to provide a hierarchical nesting of clusters. This would provide a number of benefits, particularly a clean and efficient way of doing multi-level concurrency control. It raises some design problems, which I have not thought out: for instance can two sub-clusters within a cluster can grow without limit, or how policies at different levels of the hierarchy interact.

Persistence

Persistence is an integral part of sharing, which has been all too much neglected by distributed systems researchers. The prevailing approach is manual persistence, where the user or the programmer explictly tells the system to store a particular object on disk or to read it back. (Deciding what bits are written to disk for a given object is an orthogonal issue, which might be one of a cluster's policies.) This approach is terribly inconvenient and error-prone, for much the same reasons as manual memory management. There are two problems. (i) If object A refers to object B and A is made persistent, then B must be persistent too, otherwise A may experience a run-time error next time it is read in. (ii) If the reference from A to B is the only one, and A is deleted, then B should be deleted too.

The first problem is often solved by serializing the whole object graph that A refers to. This approach does not preserve sharing semantics. If object C is made persistent, which also points to B, then when A and B are read back into memory, there will be two independent copies of C. The second problem, as far as I know, is currently not taken into account.

A much more natural model is persistence by reachability, where any object reachable from a root of persistence (an entry point into the persistent object graph) becomes persistent. This is the job of a garbage collector. Distributed GC for a replicated store must take into account replication and concurrency control. The PerDiS project has shown that such a GC is indeed feasible.

A Persistent Distributed Store

I call the above architecture a Persistent Distributed Store (PDS). The PDS can be implemented as a two-layered system. The top layer provides the application developer with a simple and easy-to-use model: shared state and passive replication (i.e. replicating the internal state of the objects). The bottom layer supports the remote invocation of policy objects.

Application developers should be able to use the remote invocation layer directly if the shared-state facilities are insufficient for their purpose. For the kinds of applications I described, I believe direct access to remote invocation will hardly ever be necessary. Remote invocation is harder to use than shared state; in particular, active replication (replicating invocations) poses some tricky problems.

The PDS architecture proposed here is similar to PerDiS.

Appendix

SOS

SOS was an early (1984) distributed object system for C++ sporting dynamic loading of objects and classes, run-time type checking (safe casts), remote invocation, and persistence.

In SOS, replication is intimately tied in with remote references and remote invocation. The general scenario is the following. A process acquires an object reference as a result of some remote invocation. Before the first invocation, the system causes the reference to be bound. This means that the server is upcalled at a special invocation point; it returns a proxy object that is then instantiated in the client's address space.

Thus, replication policy is decided at bind time by the server. The simplest policy (and the default) is that the server installs a stub that does remote invocations on the server. An alternative is to install a replica or a cache of the server's state.

The binding up-call is a powerful concept that makes SOS infinitely configurable (but of course raises a lot of security problems which we did not address at the time). However, the object model makes it very hard to use. The server can install a single proxy, of a class that must be known at the time the server is compiled; this forces the programmer to make policy decisions much too early in the design process. We tried to use inheritence to separate out the intrinsic semantics of the object from the replication policy and to re-use replication code, but this effort failed because of the limitations of inheritence.

After the demise of SOS, we started a research project in which the system provides an extensible library of replication policy objects. The application programmer can compose his application-specific, replication-independent objects, with a replication policy object. All replication objects have the same interface, therefore it is possible to select a policy at run time.

SOS supported explicit persistence on output and automatic on input. The programmer explicitly asks to copy an object to disk to make it persistent; SOS copies object state onto disk using the same mechanism as for remote copy. When following a reference to a persistent object, object faulting loads it into memory automatically.

Object faulting is a big win for programmers. In contrast, explicit persistence is too complex even for very simple-minded applications. Figuring out how much of the object graph to copy, maintaining sharing (object identity), and removing unreferenced persistent objects, are a nightmare. We conclude from this experience that the only useable model is persistence by reachability.

PerDiS

PerDiS is an Esprit Long-Term Research project to build a Persistent Distributed Store (PDS) for cooperative CAD applications. The application area is the building and construction industry. The requirements are ease of use, efficient data access, security, fault tolerance, and large-scale operation. We predict that the usage pattern will be such that there is little write contention.

A PDS is a big, persistent, shared, distributed memory. Conceptually, an application process maps in the memory, allocates objects in it, assigns and navigates pointers; acessibility is restricted only by access rights. Any reachable object becomes persistent; unreachable objects are deleted by the distributed garbage collector. Class information is stored in the store just like any other object (there is a hidden pointer from any object to its class).

A cluster is an arbitrary set of objects; a program allocating an object indicates in which cluster it is to be stored. The cluster, not the object, is the granule presented to the casual user, much like a file in Unix: a cluster has a name, protection attributes, etc. The cluster is governed by objects that dictate the policies applied to all the data objects in that cluster, such as allocation and deallocation algorithms, concurrency control and swizzling algorithms; in a future version there should be per-cluster replication policies too.

Conceptually, each application process gets its own working copy of the memory to modify locally. Updates are accepted at commit time only if the user running the process has the appropriate access rights, and the updates do not conflict with another transaction.

The current version of PerDiS is written in C++ and supports C++ applications. Three different proof-of-concept implementations exist; we are currently implemening a PerDiS platform which will be used by real application developers, to be made public at the end of 1997. It is layered upon a remote invocation mechanism called SSP Chains. SSP Chains are cheap and fault tolerant. They support object migration.


Marc Shapiro, INRIA Rocquencourt
Last modified: Sat Jul 26 16:22:24 EDT 1997


Return to: Table of Contents