The goal of this assignment is to introduce you to consensus algorithms in general and Paxos in particular. You will implement the Paxos replication algorithm within the context of a distributed systems simulator.
The provided code implements a simulator for the environment, including servers, clients, and the network. The simulator framework can be found in class dssim_t, and the network code in class Net. Servers and clients are represented by the paxserver and paxclient classes; check their interfaces in paxserver.h. Some basic types are defined there as well.
Clients generate work, and servers respond to messages. The message handlers have the same names as the corresponding message types; the message types themselves are defined in paxmsg.h.
Implementations of the client's handlers are provided for your reference. You are to implement the servers' handlers in paxos_exec.cpp; this does NOT include view change.
The Paxos log is implemented in class Paxlog. You can treat it as persistent; in general, don't worry about persistence in this lab. Paxlog::trim_front can be used to erase the executed entries. To determine whether an entry is the next one to execute, traverse the log using the Paxlog iterator interface described below.
A server maintains a paxobj to represent the underlying state machine. A request is a function that takes the paxobj, changes its state, and returns a string. A server logs the requests and executes log entries when they become executable; it also records the last result for each client, which is sent back asynchronously.
To compile, run make pax. Use ./pax -h to see how to configure it.
For this lab, the only file you may modify is paxos_exec.cpp. During our testing, ALL OTHER files will be overwritten, so any changes you make to them will be lost. Please modify only paxos_exec.cpp, and check yourself before you hand in the lab.
View change is not implemented in this lab, but the servers need to form an initial view when the simulation starts. We implemented a "fake initial view change" for this purpose.
This lab is based on the paper "Paxos Made Practical" by David Mazières. Please read that paper carefully. Below we note several deviations from it.
In the paper, execute_res is defined as a union message type; in this lab we explicitly use two separate message types instead.
In replicate_arg there is a committed field, which specifies a viewstamp below which the server has executed all requests and sent their results back to clients. Note that requests must be executed in successive viewstamp order: within the same view, a request with timestamp ts can only be executed after the request with timestamp ts - 1.
The primary sends an accept message carrying the vc_state.latest_seen viewstamp when it has no unexecuted entries in the Paxos log after receiving a replicate_res. Accept messages are therefore sent quite infrequently, but they do allow all replicas to end the simulation with the same state.
Viewstamps advance in order (see viewstamp_t::successor). Backups execute all requests with viewstamps less than or equal to the committed viewstamp in the replicate_arg. The quite ugly iterator interface to Paxlog (e.g., Paxlog::end) is provided for you to traverse the log and determine which entries can be executed.
There are two basic notions of time: one is client work; the other is the "ticks" counted off by the simulated system. On each tick, the simulator allows each node to check for messages and run any relevant handlers. A node gets to receive a few messages per tick, because in real life message-processing latency is much shorter than network delay.
All events in the simulation are controlled by a central random number generator. By changing the seed of this generator, you can change the order of events in your simulation, which can be quite valuable for debugging. See ./pax -h for the relevant option.
There are three things that keep the simulation running. First, clients that still have work to generate (see the options to pax). Second, we don't want to end the simulation while a node has a message in its input queue; it should process that message. Finally, in order not to "strand" any replicas that might not be in the current view, the simulator continues until all replicas have emptied their logs.
Life in a distributed system is tense. You are constantly worried that you have lost touch with your brethren, or perhaps it only looks like that because they have lost touch with you.
Because of the FLP impossibility result, liveness in distributed systems can be compromised. The simulator has a system of messages and thresholds for timeouts. Given any setting for these parameters, you can force the system into a "tough" state, where it will do something like initiate a view change or possibly just stop responding. Don't worry, you aren't responsible for making your system work across every possible parameter setting (which has been proved impossible). However, in general, if the system initiates a view change, that can mean that one of your nodes isn't attending to its work. For example, if the primary never forwards replicate requests, the replicas will initiate a view change. You don't need to handle the view change, but its existence might indicate a problem with your code.
Another sensitive pair of parameters is the client timeout and the number of clients. If you increase the number of clients enormously while keeping the same client timeout, the servers will get overwhelmed and the clients will start resending requests that are still in progress. This leads to system overload, with the simulator's queues filling up and the simulator becoming unresponsive while occupying more and more memory. There is no way to completely eliminate these problems in a distributed system.
Message delay and reordering are controlled by the --delay and --shuffle flags. Both take a value from 0 to 100, interpreted as the percentage of timesteps on which a delay or shuffle can occur. You should probably start your work by specifying --delay 0 and --shuffle 0, which guarantees in-order delivery. A delay means that even though there is a message at the head of a node's queue, the node may not see that message in that tick. A shuffle means all messages in a node's incoming queue will be randomly permuted.
Just to be totally clear: to be correct, your code should NOT assume FIFO delivery. We only suggest turning on FIFO delivery as a way to make progress when you start your implementation. Of course, feel free to use the defaults, which are a 10% probability of delay and 20% for shuffle.
High values of delay make it probable that, on a given tick, a node processes zero messages. If this continues long enough, you will hit a timeout, so don't bother with very large values of delay. You should be able to stress all interesting interleavings with shuffle == 100 and delay at, for example, 80. Don't start messing with heartbeat messages unless you want to enter a world of pain.
Hand in your code and write a few paragraphs about how you handle the important cases for your replication protocol. Also describe the most interesting bug you found in your own code.
Please report how much time you spent on the lab.