
RMI cleanup -- Because a Java application with native methods may
support multiple platforms, it helps to organize multiple shared
libraries into sub-directories by platform, such as
<app-dir>/lib/<platform>


Performance/robustness issues
  o demand read RMI server should use a producer/consumer queue so that the number of worker threads is bounded
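One way to bound the workers is a blocking queue drained by a fixed pool, sketched below with java.util.concurrent (class and method names here are hypothetical, not the existing RMI server's):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a bounded producer/consumer queue for the demand-read RMI
// server: incoming requests are queued, and a fixed pool of worker
// threads drains the queue, so the thread count stays bounded.
// "DemandReadServer" is a hypothetical name.
class DemandReadServer {
    private final BlockingQueue<Runnable> requests;
    private static final int NUM_WORKERS = 4;

    DemandReadServer(int queueCapacity) {
        requests = new ArrayBlockingQueue<Runnable>(queueCapacity);
        for (int i = 0; i < NUM_WORKERS; i++) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            requests.take().run(); // block until work arrives
                        }
                    } catch (InterruptedException e) {
                        // worker shut down
                    }
                }
            });
            t.setDaemon(true);
            t.setName("demand-read-worker-" + i); // named threads aid debugging
            t.start();
        }
    }

    // Called from the RMI dispatch thread; blocks (applying backpressure)
    // if the queue is full, rather than spawning a thread per request.
    void submit(Runnable request) throws InterruptedException {
        requests.put(request);
    }
}
```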

Consider: would system be simplified by getting rid of "arbitrary byte range" and just doing full block read/write?


Bug report -- RandomAccessState fails test 18 (out of memory
error) (passes WRITE-8 for 1000 x 100K files; passes WRITE-2 for 100
x 1MB files); works for 50MB (50 x 1MB) at the default memory size (so
maybe that's all we can expect -- works when the working set is
smaller than the heap size and flaky when larger...)


  o per-writer-log should be able to put bodies on disk (so that bound writes don't overload log)


  o remote demand read needs an additional argument saying what to do on a miss: NACK | BLOCK | FORWARD (forward needs to deal with loops)



TBD: update schedule for between now and SOSP
  o "finish" OSDI paper
      -- clean up system
      -- design issues: 
             SDIMS-FS design -- hole filling
      -- design, run, debug, re-run, graph experiments

Here is a draft list of missing features and TBDs. We need to (a) nail
down how long different features would take to add and what they would
buy us, and (b) prioritize work based on what experiments each feature
enables, how important it is to the 'completeness' of the whole
system, and how much effort/risk it requires. Also, add new issues as
they come up.



After the deadline thoughts and issues

  (0) DONE -- publish extended tech report with full set of experiments
  and appendices


  (1) Clean up LocalStore -- see notes/2004/7/20b.txt
      o Fix things broken by RandomAccessState
            o subscribeUPQ(..., IS, VV) needs to scan local store
            o replace "chain" logic for filtering redundant sends
              with push metadata + receiver-fetch of body
      o MDD: Before integrating the new RandomAccessState w/ DataStore
        we need to add the conflict detection/resolution logic because
        it is complex enough to worry me -- maybe go back to
        block-based storage? Maybe do something simpler for
        range-based (e.g., no re-grouping once things split,
        on the assumption that conflicts are rare...). Add a
        conflictStore class/DB?

      o Re-write per-object state - rip out tentative state, rip out csn, 
        page metadata to disk
        o Check the code for cloneIntersectInvaltargetgetChop... very
          carefully. I'm not convinced by it. Why does the bound
          version return (unbound) PIs? Why does the bound version use 
          accept stamps and the PI version use VV's? Does the right
          thing happen in the single writer logs if we receive
	  a bound inval and then some partially overlapping other stuff?
      o Clean up locking model 
      o re-write AllInterestSets (auto-split on imprecise inval to
        minimize imprecise)
      o add *persistent* log to Log (perhaps use BerkeleyDB?)
      o handle demand read to imprecise interest set
      o we need to separate out and precisely define the notions of
        (1) an "interest set" that a node is tracking as
        precise/imprecise; (2) "target" of an invalidation; (3)
        "precise set" of a subscription. Right now we call all of them
        "interest set" and cause confusion. Need to talk through
        design of "fall through" case, of "demand miss" of imprecise
        interest set; of "hole filling"; etc.  Need a written
        statement of what data structures a node maintains locally and
        how it handles various requests.
      o nonblocking reads -- fix that
      o imprecise read interface
      o move initial prefetching out of localstore into application
      o make sure we have solid end-to-end automated tests of
        basic functionality
      o Thou Shalt Not turn off Env.Warn and Env.PerformanceWarn.
        The performanceWarn and Warn calls should all be designed
        to print rarely (e.g., if(!warned){Env.warn(...); warned = true;})
	

  (2) Checkpoint/recovery
      o Local checkpoint/recover
      o Garbage collect log, remote send/receive checkpoint

  (3) Global check/cleanup -- exception handling, event logging, etc.


      o code review -- grep "catch" and make sure every exception
        is handled in some sensible way (not silently dropped!)

      o add self-test of distributed delete code -- create on node A,
        (eventually) successful read on node B, delete on node A,
        NoSuchObjectException on node B

      o Env.RemoteAssert() should throw a remote inconsistency
        exception to force us to handle that case (or perhaps defer
	this for now)

      o Sync requests should include in their body the name of the
        node that is to provide the sync (not just the node that is
        requesting the sync). And replies should also include the node
        that generated the sync reply (rather than requiring the
        receiver to get this info from the stream). This is in keeping
        with our notion that all messages should be self-describing.

      o The "headers" that we send at the start of inval streams (and
        others?) should be made into an object (e.g.,
        InvalStreamHeader = {magic, sender, startVV, ...}) rather than
        just being a series of individual objects that we need to
        remember to send in the right order. This would be
        cleaner. Also, it would make it easier to do things like
        support optional arguments, new timeouts, etc.
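A minimal sketch of what such a header object might look like; only the magic/sender/startVV fields come from the note above, and the field types are illustrative String/long stand-ins for the real NodeId/VV classes:

```java
import java.io.Serializable;

// Sketch of the proposed stream header object: one serializable record
// replaces the series of individually sent values, so field order is
// fixed by the class and optional fields are easy to add later.
class InvalStreamHeader implements Serializable {
    static final long MAGIC = 0x494E564CL; // sanity check on the stream

    final long magic;
    final String sender;      // NodeId of the sending node (stand-in)
    final String startVV;     // starting version vector (stand-in)
    final long timeoutMillis; // example of an optional argument

    InvalStreamHeader(String sender, String startVV, long timeoutMillis) {
        this.magic = MAGIC;
        this.sender = sender;
        this.startVV = startVV;
        this.timeoutMillis = timeoutMillis;
    }

    // Receiver-side sanity check after deserialization.
    boolean isValid() {
        return magic == MAGIC;
    }
}
```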

      o the experimental control harness that we use to run
        experiments -- which takes commands from stdin and writes
        output to stdout -- should be decoupled from URANode. It
        should be an application that uses Core's public interface as
        a library interface. Also, it might make sense to structure
        things so that the stdin/stdout/stderr of the Core are
        decoupled from the stdin/stdout of this application program.
	
      o Remove CSN from invals (main issue may be cleaning up
        UpdateLog)

      o Clean up Env.printdebug stuff -- provide some sort of 
        conditional control so we can turn on/off specific
	subsets of debugging messages


  (4) Performance tuning
     o FIRST: design key experiments and prioritize this list
       based on what is needed for experiments
     o delay parameters on inval channels
     o avoid resending same objects on demand channels
     o out-of-sequence batching of invalidations (as per paper -- 
       additional delay) [see below]
     o verify size of metadata is near optimal for broad range
       of workloads; fix if needed
     o delay on applying invals to local (for bodies to arrive or for 
       precise to arrive)
     o Incoming update bodies should be buffered in a priority queue
       sorted by acceptStamp so that we can apply lower-numbered ones
       ASAP w/o them waiting behind higher-numbered ones (a la Amol's
       system.)
     o maybe: add vector updates to avoid sending same stuff on inval channels
       (simple: bidirectional case; more complex -- send "my new 
       startVV" across)
     o RandomAccessState::scanForUpdates prefetch logic should
       be moved to application level so that subscriber can scan
       and prefetch subdirectory of interest rather than
       requiring scan of all objects in system (see comment in that
       function for more detail.)
     o delay applying bound invals
     o separate demand and prefetch body channels; 
     o tcp-nice
     o SDIMS timeout -- reset on network reconfiguration [see below]
     o before applying inval to localstore, interesect operation on log
       to refine incoming imprecise inval based on past knowledge [see below]
     o add hooks on remote read miss to support hierarchical 
       caching [see below]
     o read liveness optimization [see below]
     o scalability of interest sets/precise sets to support coherence-mode
       clients that subscribe to lots of objects rather than a few
       big subdirectories [see below]
     o Use the java.nio functionality to essentially use select()
       instead of lots of threads for our sockets
     o Use the java.nio.channels.FileChannel (and socket channels) to
       do direct transfer of data between channels (to avoid copies)
       -- we should never have to touch the body bytes. (Think this
       through a lot before doing it: (a) is it worth the trouble to
       get rid of a copy? (b) how would locking work?)
     o Combine small local writes into a large write (before I
       send it to anyone else) 
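For the FileChannel item above, a sketch of the zero-copy path using the standard transferTo() call, which lets the kernel move bytes from file to socket without the JVM touching the body ("BodySender"/"sendBody" are hypothetical names):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Sketch of direct channel-to-channel body transfer: the body bytes
// never pass through a Java buffer we allocate.
class BodySender {
    static long sendBody(String path, WritableByteChannel socket)
            throws IOException {
        FileInputStream in = new FileInputStream(path);
        try {
            FileChannel file = in.getChannel();
            long sent = 0, size = file.size();
            while (sent < size) {
                // transferTo may move fewer bytes than asked; loop until done
                sent += file.transferTo(sent, size - sent, socket);
            }
            return sent;
        } finally {
            in.close();
        }
    }
}
```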

  (5) NFS server
     o Think carefully through directory operations (e.g., unlink or
       rename) 
     o Add API to bind several updates into a transaction
     o Local file system interface; directories, metadata: owner,
       permissions, last modified time, ...
     o security/access control

  (6) Conflict detection and logging; interface for accessing 
      conflicting writes

      create "pendingBody" priority queues
         o Currently, BodyRecvWorker ends up applying bodies in FIFO
           order. The fix: all BodyRecvWorker threads should insert
           the BodyMsg's they pull off the network into a *shared*
           set of priority queues, one per NodeId. This set of
           priority queues should be notified whenever an element of
           currentVV is advanced (e.g., pending.notify(nodeId,
           localClock)), and worker thread(s) should pull causally
           legal events out of these shared queues and call
           core.applyBody with them. May want multiple worker threads
           since DataStore::applyBody can block.
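A minimal sketch of that shared pendingBody structure, with NodeId reduced to String and BodyMsg to a stub (illustrative names, not the real classes):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of the shared pendingBody queues: one priority queue per
// sending node, ordered by acceptStamp, so a body is released only
// once currentVV for that node has advanced past its stamp.
class PendingBodies {
    static class BodyMsg {
        final String nodeId;
        final long acceptStamp;
        BodyMsg(String nodeId, long acceptStamp) {
            this.nodeId = nodeId;
            this.acceptStamp = acceptStamp;
        }
    }

    private final Map<String, PriorityQueue<BodyMsg>> queues =
        new HashMap<String, PriorityQueue<BodyMsg>>();

    // BodyRecvWorker threads insert here instead of applying in FIFO order.
    synchronized void insert(BodyMsg m) {
        PriorityQueue<BodyMsg> q = queues.get(m.nodeId);
        if (q == null) {
            q = new PriorityQueue<BodyMsg>(11,
                new java.util.Comparator<BodyMsg>() {
                    public int compare(BodyMsg a, BodyMsg b) {
                        return Long.compare(a.acceptStamp, b.acceptStamp);
                    }
                });
            queues.put(m.nodeId, q);
        }
        q.add(m);
    }

    // Called when currentVV[nodeId] advances to localClock; returns the
    // next causally legal body, or null. A worker thread would loop on
    // this and hand each result to core.applyBody().
    synchronized BodyMsg notifyAdvance(String nodeId, long localClock) {
        PriorityQueue<BodyMsg> q = queues.get(nodeId);
        if (q != null && !q.isEmpty() && q.peek().acceptStamp <= localClock) {
            return q.poll();
        }
        return null;
    }
}
```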

         o DataStore::applyOverlappingAtStartTime and
           DataStore::applyBody need to enqueue bound invals
           or bodies that are applied to imprecise interest
           sets into this queue so they can be applied later.
           (So in addition to per-writer priority queues for
           delaying application of a body until currentVV includes
           the body, we also need per-IS priority queues for
           delaying application of a body until we think the
           IS is precise; this could be treated as a hint
           to simplify locking.)

         o See Mike's notes for a simple way to update
           RandomAccessState to handle conflict detection/resolution
	


  (7) Additional garbage collection, performance tuning
      o downgrade/merge idle interest sets
      o cache replacement
      o WorkQueues should use a dynamic # of threads -- in their
        constructor, hand them a max # of threads and a WorkerFactory
        that can produce worker threads. Then, if work sits
        in the queue for "too long" or if the queue gets "too big",
        create an additional thread.
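Note that java.util.concurrent.ThreadPoolExecutor already provides much of this (a max thread count, a ThreadFactory, reclamation of idle threads), though it grows the pool whenever all workers are busy rather than on a "waiting too long / queue too big" policy; a sketch under that caveat:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of a dynamic WorkQueue built on ThreadPoolExecutor.
class DynamicWorkQueue {
    static ThreadPoolExecutor create(int maxThreads, ThreadFactory factory) {
        // A SynchronousQueue hands each task directly to a thread, so the
        // pool grows (up to maxThreads) whenever current workers are busy.
        return new ThreadPoolExecutor(
            1,                    // threads to keep around when idle
            maxThreads,           // the max # handed to the constructor
            60, TimeUnit.SECONDS, // reclaim threads idle "too long"
            new SynchronousQueue<Runnable>(),
            factory);             // the WorkerFactory analogue
    }
}
```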
      o Add garbage collection of "dead object" records
        (see mike's notes/2004/8/2d.txt) 
      o Env.remoteAssert should throw an exception that
        we handle appropriately in the code

  (8) Future
      o PRACTI
        o Applications
        o Arbitrary consistency: TACT, big red button
        o Limiting version vector size (not O(# nodes))
      o Volume quorums + inval + leases
      o callbacks
      ------
      o security
      o See "research issues" below

Also: Design and execute more experiments for final version.

-----------------------------------------------------
-----------------------------------------------------

ISSUE:

  o On Tuesday, there was a change to the logic of inval iterator to
    "fix" the split/join case (in particular, the issue that delta
    doesn't end up precise for b; it is not clear this is a "bug" but
    it would be nice to fix as per footnote in paper.) From Arun's
    description, I am worried that using gi.start as start time rather
    than startVV (???not really sure what you are doing???) could
    violate causality (argument is that startVV represents "no missing
    invals".) I think the current pseudo-code in paper is correct
    (though it still has the undesirable property that b ends up
    imprecise at delta;)

     the fix for that, I think, is to intersect gi with log and get
     back more precise gi(s) before applying gi to InvalSetStatus


  o The interface to DataStore (read, readBody) seems badly broken.
    Does a read of data that hasn't arrived block, return null,
    or throw ObjNotFoundException? Need to re-write this interface
    and everything that uses it to make responsibilities for
    corner cases more clear. And I really don't like the Mailbox --
    fix the synchronization logic. Overall, I suspect that DataStore
    is going to need a rewrite from scratch...

  o regardless of the bound/unbound argument, the Write object in the
    tentative local store keeps a copy of the body in memory

Minor TBDs
  o SubscribeBodyWorker -- make sure we
    informOutgoingBodyStreamTerminated in all termination cases
    (including "normal" return)

  o Add code to detect/garbage collect redundant body or inval
  connections/workers

  o Note that if a bodyrecvworker thread dies and tells the local
    controller of its death, there is no guarantee that the remote
    controller will notice (due to isolation of the network connection
    from the thread). Perhaps, if the thread dies due to an exception,
    it should closeStreamAndSocket()?

 o Heartbeat output stream currently "swallows" IO exceptions; 
   need to hand them back to writer

 o DataStore.java applyOverlappingAtStartTime silently eats
   causalOrderException.
   DataStore applyOverlappingAtStartTime repeats essentially the
   same code for the gi.isPrecise() and gi.isBound() cases -- clean up

 o Add sensible names to all created threads to aid debugging
   under debugger

  o clarify terminology and type system to distinguish: target of an
    inval, interest set of a node, precise-set of an inval
    subscription

  o Fallback if TCP-Nice/TCP-LP not available. (1) for clarity make
  the C code that turns on the TCP-Nice sock options use a constant
  rather than a magic number ("14" or "15") for TCP-Nice or TCP-LP;
  (2) need some way to determine if the kernel supports TCP-Nice or
  LP - maybe getsockopt(fd, TCP_NICE, ...) could return
  TCP_NICE_SUPPORTED_AND_ON = 1948571 or TCP_NICE_SUPPORTED_AND_OFF =
  294752 (e.g., numbers unlikely to be returned by chance), (3) the
  C/Java interface needs a way to communicate whether low-priority
  sends are supported, (4) the Java code needs to fall back on rate
  limiting or app-level Nice if low priority sends are not supported
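A sketch of the Java-side fallback decision for item (4); the probe method is a hypothetical stand-in for the native getsockopt call described above, and only the two sentinel constants come from the text:

```java
// Sketch of the fallback logic for low-priority sends. The native
// probe is stubbed out here; the real one would call
// getsockopt(fd, TCP_NICE, ...) through the C/Java interface.
class NicePriority {
    static final int TCP_NICE_SUPPORTED_AND_ON = 1948571;
    static final int TCP_NICE_SUPPORTED_AND_OFF = 294752;

    // Hypothetical native probe: returns one of the sentinel values
    // above if the kernel supports TCP-Nice/TCP-LP, anything else if not.
    static int probeNice(int fd) {
        return -1; // stub: pretend kernel support is absent
    }

    // Decide whether to use kernel low-priority sends or fall back on
    // rate limiting / application-level Nice.
    static boolean kernelNiceAvailable(int fd) {
        int r = probeNice(fd);
        return r == TCP_NICE_SUPPORTED_AND_ON
            || r == TCP_NICE_SUPPORTED_AND_OFF;
    }
}
```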

Requirements for "core" experiments (defined in ../../2004.23b.txt)
------------------------------------------------------------------

  o Merging of imprecise invalidations on inval subscriptions
    (e.g., {/a/a1, /a/a2} v. {/a/*})
    (deferred for initial unit testing; need it for "real"
    experiments)

  o 2004.5.21 18:20 CDT -- 4 features we may need to add to
    make main experiments work:
    (1) delay_bound parameter for invalidation streams
    (2) delay_precise parameter for invalidation streams
    (3) separate stream pools for demand read replies and prefetch streams
        (and prefetch streams on TCP-nice)
    (4) update outgoing invalIterator VV for node alpha when
        we receive incoming inval from node alpha


Additional desirable experiments
 
  o measure how well T_imprecise and T_precise timeouts reduce 
    inval cost for real workloads


  o redo first "bandwidth" experiments with self-tuning TCP-nice as
  available bandwidth varies (?)

  o write combining to avoid full replication

 

Requirements for system to fully implement design described in text of paper
----------------------------------------------------------------------------

  (It may be possible to defer some of these, though it would not be
  desirable to have to do so... Figuring out which of these will be
  needed for "More ambitious" experiments may help prioritize.)

  o for conflict detection, our implementation needs to carry around
    acceptstamp of target of write

  o add "coherent (not consistent) read" see ../../2004/4/26b.txt
    (disable lpvv/cvv check; maintain per-object status even for imprecise) 

  o Local Garbage collection and log recovery
    o writing committed data to disk in data store
    o log: omitVV/log garbage collection

  o Distributed Checkpoint recovery: 
  
    Motivation: (1) Optimization to speed read misses to imprecise
    interest set I. Suppose that a node alpha syncs with node beta for
    interest set I, and suppose that alpha's currentVV and last
    precise VV for I are cVV_alpha and lpVV_alpha, respectively. If
    lpVV_alpha is really old (e.g., "0"), beta could send alpha a
    checkpoint of all of beta's metadata for I rather than sending all
    invalidations for I. The advantage is that the former costs
    O(#objects in I) while the latter costs O(#writes per object in I
    * # objects in I) (also, in terms of local overhead, the former
    requires us to scan through all objects in the interest set while
    the latter requires us to scan through the entire log.)  (2) We
    also need to handle the case where beta has done garbage
    collection on the log.

    A challenge is making sure that alpha's log remains in a "legal"
    state (e.g., if we sent the checkpoint and updated alpha's
    DataStore w/o updating its update log, then when alpha does a
    local write, there would be a causally-illegal "gap" in its
    log. Need to avoid this.) 

    A node MAY choose to send a checkpoint rather than a log for any
    reason; typically a node would do this if (a) it has omitVV{s} >
    requestedStartVV{s} for some element s in the version vector or
    (b) it notices that sending the checkpoint will be cheaper than
    sending the full log.

    Algorithm: 

      * a checkpoint for interest set I contains beta's (the sender's)
        lpVV, cVV for the interest set, the sender's currentAccept and
        lastAccept and unresolvedConflicts for each object in the
        interest set;

      * alpha's (the receiver's) log treats this as an imprecise inval
        with startVV_checkpoint = endVV_checkpoint = cVV_beta and, if
        necessary, generates a gap-filling imprecise inval with
        startVV_gap = currentVV_alpha, endVV_gap = startVV_checkpoint,
        and target_gap = *.

      * After applying the gap-filling inval to the log, alpha
        can safely replace its ISStatus for I with the isStatus in the
        checkpoint and replace its per-object state for objects in I
        with the per-object state from the checkpoint *except* that
        alpha must set currentVV of I to be max(cVV in checkpoint, cVV
        of any other interest set at alpha); this last bit ensures that
        the view remains causally consistent even if we have already
        read something that comes later than this checkpoint

      * Notice that checkpoint recovery leaves us in a bit of
        dilemma. If some node whose currentVV exceeds ours sends us a
        really new checkpoint, then we also get with this checkpoint a
        rather ugly gap-filling imprecise invalidation that 
        potentially makes the entire universe of data imprecise for
        the receiver of the checkpoint. On the other hand, if some
        node whose currentVV is older than ours sends us a checkpoint,
        then we recover the state to some point, but we don't make it
        current so we still can't read from the updated interest set
        (until we subscribe to and see invalidations from lpVV to
        cVV). I think this is OK -- in the former case, the other
        interest sets should quickly "catch up" (by the split/join
        argument/example) and in the latter case, I need to subscribe
        to invals for this IS to stay current -- no surprise
        there. (See also read-liveness optimization)
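The receiver-side steps above can be sketched with version vectors reduced to a Map from node to clock (illustrative names, not the system's real VV/ISStatus classes):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the receiver-side checkpoint bookkeeping: detect whether a
// gap-filling imprecise inval is needed, and compute the elementwise
// max used to set currentVV of the interest set.
class CheckpointApply {
    // Elementwise max of two version vectors.
    static Map<String, Long> maxVV(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> r = new HashMap<String, Long>(a);
        for (Map.Entry<String, Long> e : b.entrySet()) {
            Long cur = r.get(e.getKey());
            if (cur == null || cur < e.getValue()) {
                r.put(e.getKey(), e.getValue());
            }
        }
        return r;
    }

    // True iff the receiver's currentVV is missing events covered by the
    // checkpoint, i.e. a gap-filling imprecise inval (startVV = current,
    // endVV = checkpoint cVV, target = *) must be logged first.
    static boolean needsGapInval(Map<String, Long> currentVV,
                                 Map<String, Long> checkpointCVV) {
        for (Map.Entry<String, Long> e : checkpointCVV.entrySet()) {
            Long cur = currentVV.get(e.getKey());
            if (cur == null || cur < e.getValue()) {
                return true;
            }
        }
        return false;
    }
}
```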

  o conflict detection + logging; interface to read conflicting
    versions; interface to be notified of conflicts; interface to
    delete a resolved conflict; deletion-of-conflict notifications
    need to be propagated around in logs.

  o conflict resolution for directories

  o Think carefully through directory operations (e.g., unlink or
    rename) 
  o Add API to bind several updates into a transaction
  o Local file system interface; directories, metadata: owner,
    permissions, last modified time, ...

  o Local per-object state should include a "bound" field; bound
    objects are ineligible for cache replacement; also need a protocol
    to transfer this bound bit from one node to another (is 2PC
    needed?). In fact, I suspect that there need to be 3 states --
    bound/bound, bound/unbound, unbound/unbound -- bound in checkpoint
    (can't discard) v. bound in log (must forward as bound); if I send
    someone a checkpoint they can treat bound/unbound as unbound/unbound,
    but they need to treat bound/bound as bound/bound until they get an
    unbind message

Requirements for "real" SDIMS controller system
-----------------------------------------------

  o See SDIMSController.java for detailed comments on what is required. 

  o NFS interface

  o Bind/unbind algorithm -- I am worried that the current algorithm
    has a problem -- bound invalidations could get sent too far too
    soon. In steady state, we expect inval subscriptions to be "caught
    up", so as soon as a bound inval arrives, we will send it on to
    next node... How do we control propagation so that bound
    invalidations make it to the specified "stability" nodes first? 
    I see 2 options. One is to add a delay on sending (e.g., when I
    subscribe for invals, I can supply a "delayBoundInvals" parameter
    on how long to delay sending me bound invalidations). The other is
    to split the bound invalidations protocol to first send a "bound
    invalidation" that doesn't include the body and then to require a
    node to fetch the body before it may apply the inval to its local
    state; the first is probably simpler to implement, but it feels
    more like a hack and is probably less flexible.

    Answer: I think adding the "delayBoundInval" parameter to inval
    subscriptions makes sense. It is simple to integrate into design
    (changes localized to Log::GetNext() method). And the "non-hack"
    justification is something like "for a bound inval, one can defer
    work by waiting for a moment...this param says how long you should
    be willing to wait."

    Cost: 1/2 day to code and test (?)

    Benefit: Ensuring reliability is a fundamental problem if we
    separate invalidations from updates. I think we need this
    enhancement to credibly claim to have a solid solution to this
    problem (e.g., a solution that can simulate pangea or
    flush-dirty-to-server-before-cooperative-cache-forwarding, or ...) 


Clean-up/simplify maintenance
-----------------------------
  o Decide whether to completely rip out commit logic or to carefully
    integrate commit logic so that we can configure the system with
    commit activated or deactivated -- see ../../2004/4/26.txt

    Rip out option:
    o Get rid of commits in GeneralInv, UpdateLog, RMI, socket
    Add option
    o Clean "option" for commits in  UpdateLog (garbage collection?) 
    o commit logic at primary 
    o commit logic in localStore

  o Why is SDIMSController::informReceiveInval() doing a
    readBody()????


Performance enhancements
------------------------

  o For SDIMS timeout threads, we loop with longer and longer timeouts
    in the case of a disconnected network; add something s.t. when SDIMS
    notices a major network reconfiguration, we retry all pending
    timeouts immediately rather than waiting

  o Before applying incoming inval to lpVV and currentVV for ISStatus,
    apply to log, which does the intersect operations on what is already
    in log to refine the incoming imprecise invalidations based on 
    past knowledge? (So that Delta in the split join example ends up being 
    precise for b as well?)

  o When a remote read encounters invalid data, we currently just
    return (with a hint to the caller that the data probably won't be
    coming from us and do you want to retry?) We should also support
    controllers that want the callee to get the data and send it to
    the original caller (e.g., hierarchical caching). We could do this
    by notifying the controller of a remote read miss. The controller
    can then ignore it (caller's responsibility to find someone else)
    or issue a read request of its own and, when the data arrives
    locally, issue another read request to itself to send the data to
    the original caller.

  o Read liveness optimization. There is a potential issue with read
    liveness -- the simple logic for a read miss is to wait until the
    interest set is precise and a valid body is present:

      while (!(isStatus.isPrecise(obj.getInterestSet())
               && objMetadata{obj}.hasValidBody))
        wait();

    but what if by the time a body arrives, time has marched forward
    either (a) invalidating this object or (b) making the IS imprecise
    again. I think that when data arrives, we can return it *if*
    (a) lpVV >= the cVV that was current when the read was issued and if
    (b) the body that arrives is at least as new as the cVV that was
    current when the read was issued. The argument is that we can
    declare that the read "happened" at any moment in time between the
    cVV when the read was issued and now. So, when a new body arrives,
    we can use this body to satisfy a pending read that was issued
    with cVV_pendingRead if
      lpVV_IS >= cVV_pendingRead && body.acceptStamp >= localStore.obj.acceptStamp
??
Jiandan:
It might work. Suppose there's another read of IS2 blocking with the same
cVV (i.e., issued at the same time). At this time, it's possible
that a body1 for IS2 and a body2 for IS arrive when IS.lpVV > IS2.lpVV >= cVV_pendingRead
and body2.acceptStamp >= localStore.obj2.acceptStamp;
then both reads can return according to this optimization.
It might violate causality, for example: there's a body0 s.t.
IS.lpVV, IS2.lpVV < body1 < body0 < body2.
It won't happen, because when body2 arrives, body1 should have been replaced by body0.

Needs more careful thought.

  o tcp-nice for update stream (need to separate update stream from
    demand inval stream in that case). Also, let "demand requests"
    include a "priority" bit to distinguish "demand" from "prefetch"
    requests (perhaps change the name of the object? or is it too
    late?)

  o filtering of redundant body sends; another possibility is to send
  <objId, nodeId, priority> tuples and have the receiver manage the
  priority queue (and issue "low priority" requests for specific
  bodies as things reach the head of the queue)

  o Out-of-sequence batching of invalidations. At present, inval
    iterator batches together invals to objects outside of the
    interest set and sends the current batch when either a timeout
    occurs (to ensure global progress) or when an inval that
    intersects the interest set arrives (to avoid delaying stuff we do
    care about). As a result, we might expect to see a pattern of
    alternating precise/imprecise invalidations to data we
    care/don't care about.

    Proposal: Add a second "delay" parameter D2 and delay an inval that
    we do care about by as much as D2. Thus the state of an inval
    iterator will be one imprecise invalidation and zero or more
    buffered precise invalidations; when delay D2 expires, send all of
    this on. Now, the imprecise invalidation is sent before the
    precise ones in a set but may "span" multiple precise ones as
    well. In the scenario above, we might now see a pattern
    IPPPPPIPPPPP

    Cost: 1/2 day to code and test (?)

    Benefit: A single, reasonably elegant method to incorporate
    several optimizations. One is "batch synchronization" -- if nodes
    are operating in "batch mode" and sending "old stuff" in the log
    rather than streaming invalidations as they arrive, this
    optimization lets us reduce the cost of sending the imprecise
    invals by sending one "big one" instead of a bunch of little
    ones. Similarly, suppose a node is doing a "demand resync" for an
    interest set that used to be imprecise but that now needs to be
    precise -- this approach lets us do the minimal encoding. 

    SEE CHECKPOINT DISCUSSION BELOW WHICH, I THINK, SUBSUMES SOME OF
    THIS OPTIMIZATION. Additional tweak: suppose that within D2 we see
    k precise invalidations for the same object; we should be able to
    batch them together to have a start time at the start time of the
    first of these invals and an end time at the end time of the last
    of these invals. We should be able to tweak the logic so that the
    receiver still treats these as precise -- the receiver would need
    to "lock" the data store until the end time (e.g., apply the batch
    invalidate transactionally). Notice that the receiver's log
    retains a causal view of data that it can pass on to others.
    Can/should a receiver of such a message always assume this is
    what is happening and grab the lock? (I am a bit worried about
    liveness -- we "know" it is OK to grab the lock in this case in
    batch mode, but in streaming mode we could get stuck if the
    channel drops out from under us. I think that to take advantage
    of this optimization, the receiver needs to buffer everything up
    to the end time and not grab the lock and start applying this
    transaction until we know we can finish the transaction.) The
    point of all of this is to make resynchronization of a
    subdirectory cost O(size of subdirectory) rather than O(# of
    writes to subdirectory) -- essentially we naturally fall back on
    a checkpoint style of resynchronization...
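The D2 proposal above can be sketched as a small batching state machine ("Inval" is a minimal stand-in class; the actual timer and wire format are omitted):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the D2 batching state: the iterator holds at most one
// (merged) imprecise invalidation plus buffered precise ones; a flush,
// triggered when delay D2 expires, emits the imprecise inval before
// the precise batch, yielding the I P P P P P pattern described above.
class D2Batcher {
    static class Inval {
        final String target;
        final boolean precise;
        Inval(String target, boolean precise) {
            this.target = target;
            this.precise = precise;
        }
    }

    private Inval imprecise;                          // at most one, merged
    private final List<Inval> precise = new ArrayList<Inval>();

    void add(Inval inv) {
        if (inv.precise) {
            precise.add(inv);                         // buffer up to delay D2
        } else if (imprecise == null) {
            imprecise = inv;
        } else {
            // merge targets into one (more vaguely targeted) imprecise inval
            imprecise = new Inval(imprecise.target + "," + inv.target, false);
        }
    }

    // Called when D2 expires: imprecise first, then the precise batch.
    List<Inval> flush() {
        List<Inval> out = new ArrayList<Inval>();
        if (imprecise != null) {
            out.add(imprecise);
        }
        out.addAll(precise);
        imprecise = null;
        precise.clear();
        return out;
    }
}
```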



  o To support per-object coherence/consistency: make invalIterator,
    updateQueue, and interestSetStatus all scale to large numbers of
    interest sets (e.g., for per-object subscription/interest set
    tracking). Mainly, what that will probably mean is representing
    the set of interestSets as a tree rather than a linked list to
    allow us to do matching in O(log N) rather than O(N) time.  See
    ../../2004/4/23q.txt

  o Improve encoding of interest set to more efficiently support
    subscribing to a long list of individual objects as opposed to a few
    coarse-grained subdirectories; this would help efficiently support
    coherence-mode clients; also -- add a feature to incrementally
    add/remove elements from an invalidation (InvalIterator) or update
    (UpdateQueue) subscription. See 2004.4.23b.txt.



  o When an inval arrives *from* node X, update any inval iterator of
    streams we are sending to X to advance its currentVV before
    applying inval to my local log; now that I know X has seen
    everything up to this inval, I know I don't need to resend this
    inval to it!

    Benefit: clean way to support a bunch of interesting topologies
    where inval streams are bidirectional (e.g. *most* topologies, I
    think) 
 
    Cost: 1/2 day to code and test (?)
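A minimal sketch of the bookkeeping for this optimization, with the version vector reduced to a per-writer clock map (illustrative names only):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the bidirectional-stream optimization: when an inval
// arrives *from* node X carrying writer w's acceptStamp s, the
// outgoing iterator to X advances its record of what X has seen, so
// those invals are never resent to X.
class OutgoingIterator {
    // What we believe the peer has already seen: writer -> clock.
    private final Map<String, Long> peerVV = new HashMap<String, Long>();

    // Called before applying an incoming inval from this peer to our log.
    void noteIncoming(String writer, long acceptStamp) {
        Long cur = peerVV.get(writer);
        if (cur == null || cur < acceptStamp) {
            peerVV.put(writer, acceptStamp);
        }
    }

    // Skip sending invals the peer has already seen.
    boolean shouldSend(String writer, long acceptStamp) {
        Long cur = peerVV.get(writer);
        return cur == null || acceptStamp > cur;
    }
}
```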


  o TACT layer?

  o Big red button
     o Quorum version with volume leases

  o May want to add a "batch mode" option on subscriptions indicating
    that we don't want to block for new invalidations; as soon as you
    finish getting through what is now in your log, close the
    connection.  (Notice how this dovetails nicely with the "D2 delay"
    parameter for batching with no special-case gunk for batch
    mode...)

  o Carefully considered cost/benefit algorithm for when to merge
    imprecise invalidation targets into a more general (more vague,
    more cheaply encoded) target (e.g., {/a/a1, /a/a2} v. {/a/*})

  o Improve efficiency of InvalRecvWorker for determining when it can
    applyNonOverlappingAtEndTime().  Need to wait to apply an end time
    inval until we know no start time in the stream could possibly
    come before this end time. Currently, the simple approach is to
    scan the entire list of pending end time VVs each time a new inval
    arrives and compare each pending end time VV with the currentVV
    that will be in effect after applying the next start time inval
    --> O(# pending end
    time invals * size of VV). New idea: each element of currentVV is
    a list of pending end-time invals s.t.  element e is on
    list{s} where s is the lowest-numbered vv element s.t. endTime{s}
    > currentVV{s}; when a new element arrives, scan down the lists and
    hang it off the first list that is preventing issuing of that
    element. When a new start time arrives, update each element of
    currentVV that is changed, and for each element that is changed,
    scan down the list -- for any element that no longer belongs on
    this list, scan down the lists to put it on the next list that it
    belongs on; if it belongs on no list, apply it.  Each endTime
    requires work O(# elements in its VV) from the time it arrives until
    the time it is applied; each startTime requires work O(# elements
    in its VV) to update currentVV.
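    The per-element pending lists described above can be sketched like
    this (a simplified model with hypothetical names; VVs are plain
    long arrays): an end time hangs on the first element still holding
    it back, and advancing one element only re-places the end times
    hung on that element.

```java
import java.util.*;

// Hedged sketch of the proposed bookkeeping: waiting.get(s) holds the
// end-time VVs blocked on element s (the first s with endTime[s] >
// currentVV[s]); applying a start-time inval advances one element and
// re-places only that element's list.
class EndTimeScheduler {
    final long[] currentVV;
    final List<List<long[]>> waiting;
    final List<long[]> applied = new ArrayList<>();

    EndTimeScheduler(int n) {
        currentVV = new long[n];
        waiting = new ArrayList<>();
        for (int i = 0; i < n; i++) waiting.add(new LinkedList<>());
    }

    // Hang endTime on the first element still holding it back, or apply it.
    void place(long[] endTime) {
        for (int s = 0; s < currentVV.length; s++) {
            if (endTime[s] > currentVV[s]) { waiting.get(s).add(endTime); return; }
        }
        applied.add(endTime);
    }

    // A start-time inval advances one element of currentVV; only the
    // end times waiting on that element need re-placing.
    void advance(int s, long ts) {
        currentVV[s] = Math.max(currentVV[s], ts);
        List<long[]> retry = new ArrayList<>(waiting.get(s));
        waiting.get(s).clear();
        for (long[] et : retry) place(et);
    }
}

public class EndTimeDemo {
    public static void main(String[] args) {
        EndTimeScheduler sched = new EndTimeScheduler(2);
        sched.place(new long[]{3, 1});            // blocked on element 0
        sched.advance(0, 3);                      // now blocked on element 1
        System.out.println(sched.applied.size()); // still pending
        sched.advance(1, 1);
        System.out.println(sched.applied.size()); // released
    }
}
```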

  o Need to think about scalability of the version vector. One idea:
    See "The hash history approach for reconciling mutual
    inconsistency" by Kang, Wilensky, and Kubiatowicz [haven't read it
    yet so I don't know if it will help]. Another idea: can we
    add/remove ids from an interest set's version vector on the fly (a
    la Bayou's adding and removing servers on the fly?) Essentially,
    one could insert a "remove X from VV for interest set I" into the
    log that instructs anyone receiving the write that it is OK to
    stop tracking node X for interest set I; this "write" would get
    serialized in the log according to the normal rules and if X later
    does a write to I, it gets added back into the VV starting at the
    time that it does the write. I don't know if this would really
    work, but it just might, and if it does, then the version vector
    length for an interest set is proportional to the number of
    "active writers" in that interest set rather than to the total
    number of nodes in the system... can this be made to work?

  o Perhaps add a "background thread" for prefetching that scans local
    state for things to push (instead of just inserting new writes
    into upq)

  o Incoming update bodies should be buffered in a priority queue
    sorted by acceptStamp so that we can apply lower-numbered ones
    ASAP w/o them waiting behind higher-numbered ones (a la Amol's
    system).
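    This one is essentially a standard heap; a sketch (the Body type
    and fields are hypothetical stand-ins for the real message class):

```java
import java.util.*;

// Hedged sketch: buffer incoming update bodies in a heap ordered by
// acceptStamp so a low-numbered body is drained as soon as it is
// eligible, never stuck behind a higher-numbered arrival.
class Body {
    final long acceptStamp; final String objId;
    Body(long acceptStamp, String objId) {
        this.acceptStamp = acceptStamp; this.objId = objId;
    }
}

public class BodyBufferDemo {
    public static void main(String[] args) {
        PriorityQueue<Body> pending =
            new PriorityQueue<>(Comparator.comparingLong(b -> b.acceptStamp));
        // Bodies arrive out of order over the network...
        pending.add(new Body(7, "/a/a2"));
        pending.add(new Body(3, "/a/a1"));
        pending.add(new Body(5, "/b/b1"));
        // ...but are applied lowest acceptStamp first.
        StringBuilder order = new StringBuilder();
        while (!pending.isEmpty())
            order.append(pending.poll().acceptStamp).append(' ');
        System.out.println(order.toString().trim());
    }
}
```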

  o Should have the option to delay applying incoming inval messages
    until the corresponding update body has arrived (subject to a
    timeout) (a la Amol's system)

  o WorkQueues should use a dynamic # of threads -- in their
    constructor hand them a max # of threads and a WorkerFactory
    that can produce worker threads. Then, if work sits
    in the queue for "too long" or if the queue gets "too big",
    create an additional thread. Also, kill off threads
    if the workload falls off.
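    Note that java.util.concurrent's ThreadPoolExecutor already gives
    most of this: it grows toward maximumPoolSize when the hand-off
    backs up and reaps idle threads after keepAliveTime, and the
    WorkerFactory idea maps onto its ThreadFactory. A sketch of
    delegating to it (parameters are illustrative, not tuned):

```java
import java.util.concurrent.*;

// Hedged sketch: a WorkQueue backed by ThreadPoolExecutor, which
// grows under load and kills idle threads, as the TODO item asks.
public class DynamicWorkQueueDemo {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1,                          // core threads kept alive
            4,                          // max threads under load
            30, TimeUnit.SECONDS,       // idle threads die after 30s
            new SynchronousQueue<>());  // hand-off: grow rather than queue
        CountDownLatch done = new CountDownLatch(3);
        for (int i = 0; i < 3; i++) pool.execute(done::countDown);
        done.await();                   // wait for all tasks to run
        pool.shutdown();
        System.out.println("completed");
    }
}
```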


Feature requests

  o Add "Notifications" (a la WinFS) -- provide an API to let an application
  ask to be notified whenever object foo changes. From http://msdn.microsoft.com/data/winfs/default.aspx?pull=/library/en-us/dnintlong/html/longhornch04.asp

    "The WinFS Notification Service uses the concepts of short-term
    and long-term subscriptions. A short-term subscription lasts until
    an application cancels the subscription or the application
    exits. A long-term subscription survives application
    restarts. WinFS API watchers are a set of classes that allow
    applications to be selectively notified of changes in the WinFS
    store and provide state information that can be persisted by the
    application to support suspend/resume scenarios.

    "The Watcher class can notify your application of changes to
    different aspects of WinFS objects, including the following:

    "* Item changes
    "* Embedded item changes
    "* Item extension changes
    "* Relationship changes

    "When a watcher raises an event, it sends watcher state data with
    the event notification. Your application can store this state data
    for later retrieval. Subsequently, you can use this watcher state
    data to indicate to WinFS that you want to receive events for all
    changes that occurred after the state was generated."


  o Access control

  o security on log exchange -- each update signed; "chain" updates
    together to prevent log omission
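    The chaining idea can be sketched as a hash chain over log entries
    (signatures over each digest are elided; names and entry format
    are hypothetical): omitting an update from a log exchange breaks
    verification at the first gap.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Hedged sketch: each log entry's digest covers the previous entry's
// digest, so a receiver can detect log omission by replaying the chain.
public class HashChainDemo {
    static byte[] digest(byte[] prev, String update) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(prev);                                     // chain link
        md.update(update.getBytes(StandardCharsets.UTF_8));  // entry content
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        String[] log = { "write /a/a1", "write /a/a2", "delete /b" };
        byte[] h = new byte[0];
        List<byte[]> chain = new ArrayList<>();
        for (String u : log) { h = digest(h, u); chain.add(h); }

        // A receiver handed the log with "write /a/a2" omitted cannot
        // reproduce the final digest.
        byte[] v = new byte[0];
        for (String u : new String[]{ "write /a/a1", "delete /b" })
            v = digest(v, u);
        System.out.println(MessageDigest.isEqual(v, chain.get(2)));
    }
}
```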

  o Less name-centric implementation
    o Store chunks based on content hashes to get LBFS advantages
    o Use bloom filters for interest sets, invalidation sets, etc.


Research issues

  o See 2005.2.1q.txt -- PRACTI -- notes on Yu and Vahdat conit consistency TACT

      TACT consistency gives up availability.

      Research question: Is there any way to bound or adapt order
      error and temporal error (and arithmetic error) to be highly
      available (e.g., volume-quorums?) This could be a nice SIGMOD/VLDB
      paper?
      (Need to look at the SOSP paper where they look at availability
      v. consistency to get their answer...). See 2005/2/1q for more
      details and thoughts

  o Implement the last 3 years of SOSP/OSDI/NSDI file system papers on
    PRACTI to show "we're a toolkit"


 


--------------------------------
$Log: TODO.txt,v $
Revision 1.41  2006/10/16 05:44:27  zjiandan
Fixed DataStore::applyCheckpoint large lock problem (refer to mike's 2006.10.12.txt), moved IncommingConnection unit tests to junit.

Revision 1.40  2005/06/01 16:00:50  dahlin
(1) Got JNI/Nice to compile for cygwin, (2) put junit framework in place

Revision 1.39  2005/05/24 16:21:02  dahlin
It now compiles under cygwin

Revision 1.38  2005/03/25 13:49:50  dahlin
*** empty log message ***

Revision 1.37  2005/03/16 21:51:17  dahlin
RandomAccessState test 18 works with 50MB not with 100MB

Revision 1.36  2005/03/16 21:35:25  dahlin
Added berekelyDB 1.7.1; added RandomAccessState test 18 stress test memory that fails

Revision 1.35  2005/02/11 14:55:22  dahlin
Moved master high-level priority list to update-osdi-to-sosp/experiments.tex

Revision 1.34  2005/02/10 19:25:57  dahlin
Added global TODO list

Revision 1.33  2005/02/03 17:56:03  dahlin
Added some future research items

Revision 1.32  2005/01/27 15:26:14  dahlin
Added idea for less name-centric implementation

Revision 1.31  2005/01/20 15:04:17  dahlin
Added feature request list

Revision 1.30  2004/10/22 20:46:55  dahlin
Replaced TentativeState with RandomAccessState in DataStore; got rid of 'chain' in BodyMsg; all self-tests pass EXCEPT (1) get compile-time error in rmic and (2) ./runSDIMSControllerTest fails [related to (1)?]

Revision 1.29  2004/10/07 13:51:40  dahlin
RandomAccessState passes self tests; about to make a clean-up pass.

Revision 1.28  2004/09/17 20:50:58  dahlin
RandomAccessState -- added deletes and tests 7-8 work

Revision 1.27  2004/09/08 22:43:20  dahlin
Updated RandomAccessState to be more comprehensible; but at present it fails test 2 (endless loop)

Revision 1.26  2004/08/18 22:44:44  dahlin
Made BoundInval subclass of PreciseInval; RandomAccessState passes 2 self tests

Revision 1.25  2004/07/28 14:27:35  dahlin
Added sanity checks for immutable objects

Revision 1.24  2004/07/26 20:03:39  dahlin
Fixed typos from windows checkin so it will compile under Linux

Revision 1.23  2004/07/22 19:09:16  dahlin
Draft summer TODO list -- organized by related items

Revision 1.22  2004/07/21 22:43:33  dahlin
*** empty log message ***

Revision 1.21  2004/07/21 20:02:59  dahlin
Reorganized main items

Revision 1.20  2004/07/21 18:43:46  dahlin
Updating plans for summer

Revision 1.19  2004/07/14 19:43:03  dahlin
Updates after reading some of the code...

Revision 1.18  2004/05/26 04:11:39  dahlin
*** empty log message ***

Revision 1.17  2004/05/26 04:08:34  dahlin
*** empty log message ***

Revision 1.16  2004/05/21 22:53:36  dahlin
made list of features we may need for experiments in TODO.txt

Revision 1.15  2004/05/21 19:22:09  dahlin
Clarification of bound field for per-object state to support sending checkpoints across the network

Revision 1.14  2004/05/21 00:20:10  dahlin
Some ideas to enhance prefetching

Revision 1.13  2004/05/20 21:21:07  dahlin
Added TBD item -- delay applying invals until bodies arrive

Revision 1.12  2004/05/20 20:54:10  dahlin
Added TBD item -- support for hierarchical read misses by notifying the controller of remote read miss

Revision 1.11  2004/05/20 15:51:45  dahlin
Added TBD to make sync requests and replies self-describing

Revision 1.10  2004/05/18 03:48:59  dahlin
Bound bit for local state

Revision 1.9  2004/05/18 03:12:12  dahlin
Buffering incoming bodies in priority queue

Revision 1.8  2004/05/04 21:11:00  dahlin
Updates from discussion -- Lei, Jiandan, Amol, Mike

Revision 1.7  2004/05/04 13:45:56  dahlin
Feature request: atomic update of multiple objects

Revision 1.6  2004/05/02 17:23:54  dahlin
Complete first draft of SDIMSController.java design -- includes high-level TBD but no pseudo-code

Revision 1.5  2004/05/02 16:06:28  dahlin
Add idea for making current data not just new writes eligible for prefetching/pushing

Revision 1.4  2004/04/29 21:48:19  dahlin
Added wacky idea for reducing size of version vectors on a per-interest set basis

Revision 1.3  2004/04/29 14:02:30  dahlin
Idea to improve efficiency of InvalRecvWorker for endTime matching; also tried to fix ^M glitch in file...

Revision 1.2  2004/04/27 21:48:45  dahlin
Add: merging of imprecise invalidations for update subscription; Add: need to think through directory operations

Revision 1.1  2004/04/27 21:31:54  dahlin
Added TODO list


--------------------------------
