2009.3.23b.txt -- Robbert van Renesse talk -- consensus without pain

I think I can teach undergraduates non-blocking consensus (or get close...)


Suppose we want consensus 
  -- each node can vote red or blue
  -- eventually system DECIDES red or blue
  -- nontriviality -- DECIDES red or DECIDES blue is reachable
  -- stability -- once DECIDES X always DECIDES X

System model 
  -- nodes can crash 
  -- asynchronous

   (so 2 types of "crash" -- reboot = crash/recover = "slow"
       hardware fault = crash and never recover
    )

desired fault tolerance property
        -- any number of nodes can crash and system remains safe
        -- up to f nodes can crash and system is eventually live


(0) 2 phase commit can get stuck

    remember -- require all correct processes to agree GLOBAL_ABORT
    or GLOBAL_COMMIT (including coordinator and all participants)

    After participant sends its vote, it needs to wait for coordinator 
    to say "ABORT" or "COMMIT"

    If timeout, can poll other participants
         (a) Anyone voted ABORT or has not yet voted (--> vote ABORT) --> GLOBAL_ABORT
         (b) Anyone received GLOBAL_ABORT or GLOBAL_COMMIT --> obvious
         (c) else?
               Coordinator might have crashed before making decision,
               -> conceivable to devise a protocol that makes a decision
               and gets all participants to agree

               Coordinator might still be running and decided either
               GLOBAL_ABORT or GLOBAL_COMMIT but network is slow -->
               any decision participants make could conflict and violate safety!
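
    The timeout cases above can be sketched as a small function. This is
    only an illustration: the state names and resolve_after_timeout are
    made up for these notes, and the input is assumed to be the states
    reported by the participants that answered the poll.

```python
from typing import Optional, Sequence

# Hypothetical state names for a polled participant (not from the talk).
NOT_VOTED = "NOT_VOTED"
VOTED_ABORT = "VOTED_ABORT"
VOTED_COMMIT = "VOTED_COMMIT"
GLOBAL_ABORT = "GLOBAL_ABORT"
GLOBAL_COMMIT = "GLOBAL_COMMIT"


def resolve_after_timeout(peer_states: Sequence[str]) -> Optional[str]:
    """Return the global outcome if the polled states determine it, else None."""
    # Case (b): someone already received the coordinator's decision.
    for state in peer_states:
        if state in (GLOBAL_ABORT, GLOBAL_COMMIT):
            return state
    # Case (a): an ABORT vote, or a participant that has not voted yet
    # (and will now vote ABORT), rules out GLOBAL_COMMIT.
    if any(state in (VOTED_ABORT, NOT_VOTED) for state in peer_states):
        return GLOBAL_ABORT
    # Case (c): everyone reachable voted COMMIT and nobody knows the outcome.
    # The coordinator may have decided either way, so we are stuck.
    return None
```

    A None result is exactly the blocking case: no safe local decision exists.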


(1) Simplify problem statement to make issue more clear

    Simpler case -- no coordinator. Anyone can just look at
    participants' state and run a function on it to decide RED or BLUE

     3 process case; majority rule; tolerate 1 crash (f=1)

     R B R --> RED
     R B B --> BLUE
     R B ? --> if last node is slow, then could be RED or BLUE
                    if last node is crashed, then can decide either (leader flip a coin)
                    leader can't tell which case has occurred
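
     A minimal sketch of the majority-rule function for the 3 cases
     above. The name decide and the use of None for "have not heard
     from this node" are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def decide(votes: Sequence[Optional[str]]) -> Optional[str]:
    """Return the color held by a strict majority of all n nodes, else None.

    A None entry means we have not heard from that node (slow or crashed;
    the caller cannot tell which).
    """
    n = len(votes)
    counts = Counter(v for v in votes if v is not None)
    for color, count in counts.items():
        if count > n // 2:
            return color
    return None  # outcome depends on the missing vote(s)
```

     decide(["R", "B", None]) returns None: the known votes alone do not
     determine the result, which is the "R B ?" row.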

    (2PC is a variation on this with a different voting rule, and process
    2 above is the leader?) Notice that if you try to "fix" 2PC by
    choosing a new leader, that new leader runs into exactly this
    problem -- if it doesn't hear from the old leader, how can it decide?

    ... plus, need to agree on who the new leader is (or you might
    have two new leaders making two different decisions...)


(2) Need to add "pre-decide" phase. Can't go directly from 
    local vote to decision

     7-line consensus algorithm
     need 3f+1 = 4 nodes to tolerate f = 1 crash

     e_a = 0  // election number
     c_a = RED or BLUE  // vote this election
     while (1)
          e_a = e_a + 1
          broadcast <e_a, c_a> to all
          VOTES_a = receive <e_a, _> from 2f+1
          c_a = MAJORITY(VOTES_a)  // becomes my vote in the next election


     --> eventually system reaches consensus on RED or BLUE
     (key: can spin for a long time, but once a majority emerges
     it is stable -- once 2f+1 nodes hold the same color, any 2f+1
     VOTES contain at least f+1 of that color, so all future VOTES
     will include that majority)

      round   node 0   node 1   node 2   node 3   result
        1       R        B        R        B      UNDECIDED
        2       B        R        B        -      LEANING BLUE [3 is slow]
        3       B        B        B        -      DECIDED BLUE
        4     CRASH      B        B        B

       Note -- can get unlucky and loop forever (but there are engineering
       strategies to avoid this)

      round   node 0   node 1   node 2   node 3   result
        1       R        B        R        B      UNDECIDED
        2       B        R        B        R      UNDECIDED
        3       R        B        R        B      UNDECIDED
        4       B        R        B        R      UNDECIDED
        5       B        R        B        R      UNDECIDED
        6       B        R        B        R      UNDECIDED
        ...
        99      B        R        B        -      LEANING BLUE (3 is slow)
       100      B        B        B        -      DECIDED BLUE

      (Notice that state 99 is only leaning. If next event is node 3
      recovers and chooses R, then we transition to undecided; if next
      event is nodes 0, 1, 2 exchange votes for round 100, we
      transition to DECIDED BLUE)
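
      A toy simulation of the 7-line algorithm, as a sketch only. It
      assumes lock-step rounds, no crashes, and that each node hears a
      random 2f+1 of the votes broadcast in a round; real asynchrony is
      messier. All function names here are made up for illustration.

```python
import random
from collections import Counter
from typing import List, Tuple


def majority(votes: List[str]) -> str:
    """Most common color; 2f+1 votes over two colors can never tie."""
    return Counter(votes).most_common(1)[0][0]


def run_round(colors: List[str], f: int, rng: random.Random) -> List[str]:
    """One election: each node hears some 2f+1 votes and adopts their majority."""
    quorum = 2 * f + 1
    return [majority(rng.sample(colors, quorum)) for _ in colors]


def is_decided(colors: List[str], f: int) -> bool:
    """True once at least 2f+1 nodes hold the same color (the stable state)."""
    return Counter(colors).most_common(1)[0][1] >= 2 * f + 1


def simulate(colors: List[str], f: int = 1, seed: int = 0,
             max_rounds: int = 1000) -> Tuple[List[str], int]:
    """Run elections until decided, or give up after max_rounds."""
    rng = random.Random(seed)
    rounds = 0
    while not is_decided(colors, f) and rounds < max_rounds:
        colors = run_round(colors, f, rng)
        rounds += 1
    return colors, rounds
```

      Starting run_round from ["B", "B", "B", "R"] with f = 1 always
      yields all B, whichever 3 votes each node happens to hear: any 3
      of those 4 votes include at least 2 B.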


(3) Do we really need 3f+1 for majority rule?

     no -- have a leader

     (also can help avoid -- but not eliminate possibility of -- endless loop.
     Endless loop always possible in async system ... can minimize probability
     with small assumptions)

     in round i, leader is node (i % n)

     follower in round i waits for a proposal from the leader and votes
     for the most recent proposal received, or votes ABSTAIN on timeout

     if I vote X in round i, then everyone votes X or ABSTAINS in round i

     in round i, tell leader my most recent non-abstain vote and its round

     leader receives state from f+1 in round i-1, and (1) if any node
     reports a most-recent-vote <round, X>, it recommends the X from the
     most recent round reported (note all non-ABSTAIN votes for a given
     round match), (2) if all ABSTAIN, it can choose any X

     --> once a round succeeds in having f+1 participants vote X,
     then leader for any subsequent round that reads from f+1 will see
     at least one X as most recent vote and propose X --> X is now stable
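
     The leader's rule can be sketched as below. The representation is
     an assumption: each report is the sender's most recent non-ABSTAIN
     vote as a (round, value) pair, or None if it has only abstained.
     choose_proposal and its default parameter are illustrative names.

```python
from typing import Optional, Sequence, Tuple

# (round, value) of a node's most recent non-ABSTAIN vote, or None.
Report = Optional[Tuple[int, str]]


def choose_proposal(reports: Sequence[Report], f: int, default: str) -> str:
    """Pick the value a leader proposes after hearing from f+1 nodes."""
    if len(reports) < f + 1:
        raise ValueError("need state from at least f+1 nodes")
    voted = [r for r in reports if r is not None]
    if not voted:
        return default  # rule (2): all ABSTAIN, free to choose any X
    # Rule (1): take the value from the most recent round reported.
    return max(voted, key=lambda r: r[0])[1]
```

     If f+1 nodes voted X in some round, any f+1 reports overlap that
     set in at least one node (n = 2f+1 here), which is why X sticks.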
     

     round   node 0        node 1        node 2
     0       ABSTAIN (B)   ABSTAIN (B)   ABSTAIN (R)   // ABSTAIN but recommend B or R

     1       PROPOSE B                                 (recv state from 0 and 2; can propose anything)
             VOTE B        ABSTAIN       ABSTAIN       nodes 1 and 2 time out
     2                     PROPOSE R                   (recv state from 1 and 2; can propose anything, f+1 abstained)
             ABSTAIN       VOTE R        ABSTAIN       nodes 0 and 2 time out
     3                                   PROPOSE B     (recv state from 0 and 2; can propose anything, f+1 abstained)
             VOTE B                      VOTE B        B IS COMMITTED
     4       PROPOSE B
             VOTE B                      VOTE B        B IS COMMITTED
     5                     PROPOSE B
             VOTE B        VOTE B                      B IS COMMITTED
     6                                   TIMEOUT

     7       PROPOSE B
             VOTE B        VOTE B                      B IS COMMITTED

     ...

    Rightmost column is "global view." Can participants tell if we are
    committed? Not quite.

    BUT, if nodes send their votes in each round to a LEARNER (or
    LEARNERs), and a LEARNER ever learns that f+1 voted for same thing
    in same round,  then LEARNER knows that is the committed value.

    --> I can stop PROPOSING once I know COMMIT happened; if the
    network is "well behaved" for long enough, reasonable to expect
    that everyone eventually stops.
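
    A minimal sketch of a LEARNER, assuming nodes forward only their
    non-ABSTAIN votes as (node, round, value). The class and method
    names are illustrative, not from the talk.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


class Learner:
    """Learns the committed value once f+1 nodes vote alike in one round."""

    def __init__(self, f: int) -> None:
        self.f = f
        # (round, value) -> set of node ids that cast that vote
        self.voters: Dict[Tuple[int, str], Set[int]] = defaultdict(set)
        self.committed: Optional[str] = None

    def on_vote(self, node: int, round_no: int, value: str) -> Optional[str]:
        """Record one vote; return the committed value once it is known."""
        self.voters[(round_no, value)].add(node)  # a set ignores duplicates
        if self.committed is None and len(self.voters[(round_no, value)]) >= self.f + 1:
            self.committed = value
        return self.committed
```

    Votes from different rounds are never added together, so one B in
    round 1 plus one B in round 3 does not count as a commit.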


(4) Do things change if we go to "unanimous to commit; any objection abort" transaction?

   No. "ABSTAIN" really is abstain (even if I recommend COMMIT or
   ABORT.) Leader in round 1 perfectly welcome to have rule "If during
   round 0, I see n ABSTAIN(prefer COMMIT), then I PROPOSE COMMIT, but
   if I see anything else (any ABSTAIN prefer ABORT or TIMEOUT) then I
   propose ABORT" and for leaders in all other rounds to have rule "If
   I see f+1 ABSTAIN prefer X, I propose ABORT"
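
   The two leader rules can be sketched as follows. Assumptions for
   illustration: round-0 preferences arrive as "COMMIT"/"ABORT" with
   None for a timeout, and a later leader that does see a real vote
   follows it as in section (3). Function names are made up.

```python
from typing import Optional, Sequence, Tuple


def round1_proposal(preferences: Sequence[Optional[str]], n: int) -> str:
    """Round-1 leader: COMMIT only if all n nodes abstained preferring COMMIT."""
    heard_all = len(preferences) == n
    if heard_all and all(p == "COMMIT" for p in preferences):
        return "COMMIT"
    return "ABORT"  # any ABORT preference or any timeout


def later_round_proposal(last_votes: Sequence[Optional[Tuple[int, str]]]) -> str:
    """Later leader: follow the most recent real vote, else propose ABORT."""
    voted = [v for v in last_votes if v is not None]
    if voted:
        return max(voted, key=lambda v: v[0])[1]
    # Only ABSTAINs seen: f+1 preferences cannot prove unanimity, so ABORT.
    return "ABORT"
```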
 

(5) The full Paxos/BFT protocol

    want to execute a series of commands in same order at replicas (state machine replication)

    basic idea: send commands to current leader, current leader
    preproposes order, and we all agree. Keep current leader until
    timeout, then "view change" to new leader.

    commands ordered by <view, seqNum>

    within a view (1 leader) -- Castro picture -- 3 phases of communication -- PREPROPOSE, PROPOSE, COMMIT [repeat]

    participant -- if timeout, stop in current view, wait for NEW VIEW message; if I am next leader, send NEW VIEW message

    VIEW CHANGE -- must agree on what happened in all prior views (so no committed transactions lost)
         PREPROPOSE-VIEW  PROPOSE-VIEW  COMMIT-VIEW

         leader gathers from followers list of all [PROPOSED/COMMITTED?] from prior views [note that
              followers stop participating in earlier views when they send list for view v]
         get list from f+1 followers; include in new view proposal all PROPOSED seen by any of the f+1
         sends list with PREPROPOSE-VIEW

         if timeout before view v COMMITS, then repeat with new leader for view v+1
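
    A rough sketch of how a new leader might build the list it sends
    with PREPROPOSE-VIEW. Assumptions, not from the talk: each follower
    list maps seqNum to (view, command), and if two followers report
    different entries for one seqNum the entry from the higher view is
    kept. merge_for_new_view is an illustrative name.

```python
from typing import Dict, Sequence, Tuple

Entry = Tuple[int, str]  # (view in which it was PROPOSED, command)


def merge_for_new_view(lists: Sequence[Dict[int, Entry]], f: int) -> Dict[int, Entry]:
    """Union of everything PROPOSED that any of the f+1 followers saw."""
    if len(lists) < f + 1:
        raise ValueError("need lists from at least f+1 followers")
    merged: Dict[int, Entry] = {}
    for follower_list in lists:
        for seq_num, entry in follower_list.items():
            # Assumed tie-break: prefer the entry from the higher view.
            if seq_num not in merged or entry[0] > merged[seq_num][0]:
                merged[seq_num] = entry
    return merged
```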