2009.3.23b.txt -- Robbert van Renesse talk -- consensus without pain

I think I can teach undergraduates non-blocking consensus (or get close...)

Suppose we want consensus
  -- each node can vote red or blue
  -- eventually system DECIDES red or blue
  -- nontriviality -- DECIDES red or DECIDES blue is reachable
  -- stability -- once DECIDES X always DECIDES X

System model
  -- nodes can crash
  -- asynchronous
     (so 2 types of "crash" --
        reboot = crash/recover = "slow"
        hardware fault = crash and never recover)

Desired fault tolerance property
  -- any number of nodes can crash and system remains safe
  -- up to f nodes can crash and system is eventually live

(0) 2 phase commit can get stuck

remember -- require all correct processes to agree GLOBAL_ABORT or GLOBAL_COMMIT (including coordinator and all participants)

After participant sends its vote, it needs to wait for coordinator to say "ABORT" or "COMMIT"

If timeout, can poll other participants
  (a) Anyone voted ABORT or has not yet voted (--> vote ABORT) --> GLOBAL_ABORT
  (b) Anyone received GLOBAL_ABORT or GLOBAL_COMMIT --> obvious
  (c) else?
      Coordinator might have crashed before making decision
        --> conceivable to devise a protocol that makes a decision and gets all participants to agree
      Coordinator might still be running and decided either GLOBAL_ABORT or GLOBAL_COMMIT but network is slow
        --> any decision participants make could conflict and violate safety!

(1) Simplify problem statement to make issue more clear

Simpler case -- no coordinator. Anyone can just look at participants' state and run a function on it to decide RED or BLUE

3 process case; majority rule; tolerate 1 crash (f=1)
   R B R --> RED
   R B B --> BLUE
   R B ? --> if last node is slow, then could be RED or BLUE
             if last node is crashed, then can decide either (leader flip a coin)
             leader can't tell which case has occurred

(2PC is a variation on this with a different voting rule, and process 2 above is the leader?)
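Sketch of the "R B ?" case in Python (my own illustration, not from the talk; `safe_decision` is a made-up name). It decides only when the votes seen so far already form a majority of all n nodes, so a slow node's vote could not flip the result. With one vote missing out of three and the other two split, it has to refuse to decide, which is exactly where the observer gets stuck.

```python
from collections import Counter
from typing import Optional, Sequence


def safe_decision(votes: Sequence[Optional[str]]) -> Optional[str]:
    """Return a colour only if the missing votes cannot change the majority.

    votes holds one entry per node; None means "no vote seen yet"
    (the node may be slow or crashed, and we cannot tell which).
    """
    n = len(votes)
    counts = Counter(v for v in votes if v is not None)
    for colour, count in counts.items():
        # A strict majority of ALL n nodes is safe whatever the rest vote.
        if count > n // 2:
            return colour
    return None  # undetermined: deciding now could conflict with a slow node


print(safe_decision(["R", "B", "R"]))   # all votes in
print(safe_decision(["R", "B", None]))  # last node slow or crashed
```

The second call is the stuck case: flipping a coin is only safe if the last node really crashed, and the function has no way to know that.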
Notice that if you try to "fix" 2PC by choosing a new leader, that new leader runs into exactly this problem -- if it doesn't hear from old leader, how can it decide ... plus, need to agree on who the new leader is (or you might have two new leaders making two different decisions...)

(2) Need to add "pre-decide" phase. Can't go directly from local vote to decision

7-line consensus algorithm
need 3f+1 = 4 nodes to tolerate f = 1 crash

   e_a = 0                        // election number
   c_a = RED or BLUE              // vote this election
   while (1)
      e_a = e_a + 1
      broadcast (e_a, c_a) to all
      VOTES_a = receive from 2f+1
      c_a = MAJORITY(VOTES_a)     // becomes my vote in the next election

--> eventually system reaches consensus on RED or BLUE
(key: can spin for a long time, but once a majority emerges it is stable -- all future VOTES will include majority)

   round   node 0   node 1   node 2   node 3   result
   1       R        B        R        B        UNDECIDED
   2       B        R        B        -        LEANING BLUE [3 is slow]
   3       B        B        B        -        DECIDED BLUE
   4       CRASH    B        B        B

Note -- can get unlucky and loop forever (but engineering strategies to avoid this)

   round   node 0   node 1   node 2   node 3   result
   1       R        B        R        B        UNDECIDED
   2       B        R        B        R        UNDECIDED
   3       R        B        R        B        UNDECIDED
   4       B        R        B        R        UNDECIDED
   5       B        R        B        R        UNDECIDED
   6       B        R        B        R        UNDECIDED
   ...
   99      B        R        B        -        LEANING BLUE (3 is slow)
   100     B        B        B        -        DECIDED BLUE

(Notice that state 99 is only leaning. If next event is node 3 recovers and chooses R, then we transition to undecided; if next event is nodes 0, 1, 2 exchange votes for round 100, we transition to DECIDED BLUE)

(3) Do we really need 3f+1 for majority rule?

no -- have a leader
(also can help avoid -- but not eliminate possibility of -- endless loop. Endless loop always possible in async system ...
can minimize probability with small assumptions)

in round i, leader is node (i % n)

follower for round i waits for most recent proposal received from a leader or votes ABSTAIN on timeout
   if I vote X in round i, then everyone votes X or ABSTAINS in round i
   in round i, tell leader my most recent non-abstain vote and its round

leader receives state from f+1 in round i-1, and
   (1) if any most-recent-vote is non-ABSTAIN, then recommends X from most recent round reported (note all non-ABSTAIN for a given round match),
   (2) if all ABSTAIN, choose any X

--> once a round succeeds in having f+1 participants vote X, then leader for any subsequent round that reads from f+1 will see at least one X as most recent vote and propose X
--> X is now stable

   round
   0    ABSTAIN (B)   ABSTAIN (B)   ABSTAIN (R)     // ABSTAIN but recommend B or R
   1    PROPOSE B (recv state from 0 and 2; can propose anything)
        VOTE B        ABSTAIN       ABSTAIN         nodes 1 and 2 time out
   2    PROPOSE R (recv state from 1 and 2; can propose anything, f+1 abstained)
        ABSTAIN       VOTE R        ABSTAIN         nodes 0 and 2 time out
   3    PROPOSE B (recv state from 0 and 2; can propose anything, f+1 abstained)
        VOTE B   VOTE B                             B IS COMMITTED
   4    PROPOSE B   VOTE B   VOTE B                 B IS COMMITTED
   5    PROPOSE B   VOTE B   VOTE B                 B IS COMMITTED
   6    TIMEOUT
   7    PROPOSE B   VOTE B   VOTE B                 B IS COMMITTED
   ...

Rightmost column is "global view." Can participants tell if we are committed? Not quite. BUT, if nodes send their votes in each round to a LEARNER (or LEARNERs), and a LEARNER ever learns that f+1 voted for same thing in same round, then LEARNER knows that is the committed value.
--> I can stop PROPOSING once I know COMMIT happened; if the network is "well behaved" for long enough, reasonable to expect that everyone eventually stops.

(4) Do things change if we go to "unanimous to commit; any objection abort" transaction?

No. "ABSTAIN" really is abstain (even if I recommend COMMIT or ABORT.)
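To see why nothing changes, here is a sketch of the leader's choice rule from (3) in Python (my own illustration; `leader_choice`, `Report` and the `default` argument are made-up names, not from the talk). Each of the f+1 state reports is either None (the node has only ever abstained) or the (round, value) of its most recent non-abstain vote. Any preference attached to an ABSTAIN never enters the rule; it could at most influence `default`.

```python
from typing import Optional, Sequence, Tuple

# (round, value) of a node's most recent non-abstain vote, or None if it
# has only ever abstained. The record shape is an assumption.
Report = Optional[Tuple[int, str]]


def leader_choice(reports: Sequence[Report], default: str) -> str:
    """Pick the value a leader should propose from f+1 state reports."""
    voted = [r for r in reports if r is not None]
    if not voted:
        # Case (2): everyone abstained, so the leader may choose anything.
        return default
    # Case (1): take the value from the most recent round reported.
    # All non-abstain votes within one round match, so ties are harmless.
    return max(voted, key=lambda r: r[0])[1]


print(leader_choice([None, None], default="R"))
print(leader_choice([(1, "B"), (2, "R")], default="B"))
```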
Leader in round 1 perfectly welcome to have rule "If during round 0, I see n ABSTAIN (prefer COMMIT), then I PROPOSE COMMIT, but if I see anything else (any ABSTAIN prefer ABORT or TIMEOUT) then I propose ABORT" and for leaders in all other rounds to have rule "If I see f+1 ABSTAIN prefer X, I propose ABORT"

(5) The full Paxos/BFT protocol

want to execute a series of commands in same order at replicas (state machine replication)

basic idea: send commands to current leader, current leader preproposes order, and we all agree. Keep current leader until timeout, then "view change" to new leader.

commands ordered within a view (1 leader)

-- Castro picture --
3 phases of communication -- PREPROPOSE, PROPOSE, COMMIT [repeat]

participant -- if timeout, stop in current view, wait for NEW VIEW message; if I am next leader, send NEW VIEW message

VIEW CHANGE -- must agree on what happened in all prior views (so no committed transactions lost)
   PREPROPOSE-VIEW   PROPOSE-VIEW   COMMIT-VIEW
   leader gathers from followers list of all [PROPOSED/COMMITTED?] from prior views
   [note that followers stop participating in earlier views when they send list for view v]
   get list from f+1 followers; include in new view proposal all PROPOSED seen by any of the f+1
   sends list with PREPROPOSE-VIEW
   if timeout before view v COMMITS, then repeat with new leader for view v+1
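Sketch of the "get list from f+1 followers; include all PROPOSED seen by any of the f+1" step in Python (my own illustration; `new_view_proposal` and the `Entry` shape of (view, sequence number, command) are assumptions, the notes do not say what a list entry looks like). It is just a union over the follower lists and ignores how conflicting entries for the same slot would be resolved.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical record shape: (view, sequence number, command).
Entry = Tuple[int, int, str]


def new_view_proposal(follower_lists: Iterable[Iterable[Entry]], f: int) -> List[Entry]:
    """Union of everything PROPOSED that any of the f+1 followers reported."""
    lists = [list(entries) for entries in follower_lists]
    if len(lists) < f + 1:
        # With fewer than f+1 lists a committed entry could be missed.
        raise ValueError("need lists from at least f+1 followers")
    seen: Set[Entry] = set()
    for entries in lists:
        seen.update(entries)
    # Sort by (view, sequence number, command) for a deterministic proposal.
    return sorted(seen)


print(new_view_proposal([[(0, 1, "a")], [(0, 1, "a"), (0, 2, "b")]], f=1))
```

Reading from f+1 is what makes this safe under the notes' assumptions: anything that committed was seen by f+1 nodes, so at least one of the f+1 reporting followers has it on its list.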