CS 372 Spring 2011: LabADisk

CS372 Project ADisk - An Atomic Disk

Due: 5:59:59 PM April 22 2011

Assignment Goals

To learn about key storage system concepts: logging and transactions
To provide the basis for project FS, where you will build a reliable file system.

Overview of Project

In projects ADisk and FS, you will construct a user-level library that presents the abstraction of a reliable file system called RFS. In order to manage the complexity, you will implement this system in several phases, each of which presents a successively higher-level abstraction. You will be given the abstraction of a raw disk interface. On top of this you will build

An atomic disk

A reliable multi-level tree

A reliable flat file system

A reliable directory-based file system (RFS)

This project, deals only with step 1 above. It is very important to do an excellent job on this part of the project, or you will have significant problems in the next part.

The Assignment

Before you begin the assignment, grab the following code: lab-FS.tar
This tar archive contains the source code to the raw disk interface as well as files defining the other interfaces listed below. To extract the files from the archive, use the following command.
tar -xvf lab-FS.tar
A directory called lab-FS/ will be created, and your files will be extracted into it.

Part 0: Understand the supplied low-level disk system

RFS will be implemented using an 8-Mbyte file with a "device driver" that treats this storage space as a contiguous array of raw disk sectors, each of size Disk.SECTOR_SIZE bytes. This driver is similar to Linux's block device interface. The disk sectors are numbered from 0 to Disk.NUM_OF_SECTORS- 1 inclusive. No partial read or write to a disk sector is possible. This important restriction has several implications on the design of the file system. For example, updating a single byte in a disk sector requires that the sector be read in memory, modified, and then written back to disk. Similarly, reading a single byte in a disk sector requires that the sector be read entirely from disk though only one byte is to be read.

Disk

A Disk provides an asynchronous interface: requests are issued and later a callback provides the results of the action. When you create a Disk, you register an object that will receive callbacks:

public Disk(DiskCallback callback) throws FileNotFoundException

The callback is an object that will receive notifications when operations complete (see below).

Once a Disk is created, the following method initiates a read or write request:

public void startRequest(int operation, int tag, int sectorNum, byte b[]) throws IllegalArgumentException, IOException

operation is Disk.READ or Disk.WRITE

tag is an identifier the caller may user to match replies with requests. Disk does not interpret tag, it just passes it through with the request so that the callback corresponding to the request has the same tag. Note that the value DiskResult.RESERVED_TAG is reserved for error signaling.

sectorNum identifies the sector to access

b[] is a byte array from which to write or into which to read (depending on the operation).

Warning: You must not re-use b[] until the operation is complete.

Warning: You are not guaranteed that reads or writes will complete in the order they are issued. In fact, the thread inside of Disk will process pending requests in a random-ish order.

You can constrain the order of operations to disk by inserting barriers with the following call:

public void addBarrier() throws IOException

All writes issued before a barrier will be completed before any writes issued after the barrier. Reads, however, are not constrained by a barrier. Reads issued after a barrier may complete before or after writes issued before a barrier.

Note that although you could put a barrier before/after each write, for good performance in practice (and full credit for the lab), you should use barriers judiciously. Whenever you use a barrer, your code should include a comment explaining exactly why you need one. Be brief and precise; if the reader cannot easily figure out what your comment means, it is not useful.

For testing/debugging, you can also arrange for your disk to crash with some probability at some point in the future.

public setFailProb() See source file for details.

DiskCallback, DiskResult, and DiskUnit

A DiskCallback defines an interface that must be implemented by an object that receives completion notices in the form of a DiskResult when a read or write request finishes.

public void requestDone(DiskResult result)

So, when the operation you started with Disk::startRequest() completes, DiskCallback::requestDone() will be called. DiskUnit gives some examples of how this works.

Part 1: Build an atomic disk

Your task is to design, implement, and thoroughly test an atomic disk. The atomic disk allows a series of writes to different sectors to be made and then atomically applied -- either all the writes occur or none of them do. The atomic disk presents a similar interface to the disk, except (a) each read and write includes a transaction ID and (b) there are additional calls to begin, commit, abort, and apply transactions. You will implement this atomicity by providing an redo log on disk This redo log will consume ADisk.REDO_LOG_SECTORS sectors of your disk, so users of the ADisk will see a smaller disk (n o larger than Disk.NUM_OF_SECTORS - ADisk.REDO_LOG_SECTORS sectors; possibly a few sectors smaller than that if you reserve sectors for other disk metadata such as a start-of-log pointer.).

In particular, an ADisk implements the following interface:

// Allocate an ADisk that stores its data using
// a Disk.
//
// If format is true, wipe the current disk
// and initialize data structures for an empty
// disk.
//
// Otherwise, initialize internal state, read the log,
// redo any committed transactions
public ADisk(boolean format)

// Return the total number of data sectors that
// can be used *not including space reseved for
// the log or other data sructures*. This
// number will be smaller than Disk.NUM_OF_SECTORS.
public int getNSectors()

// Begin a new transaction and return a transaction ID
public TransID beginTransaction()

// Store the update in an in-memory buffer associates with
// the specified transaction, but don't write it to the log yet.
public void writeSector(TransID tid, int sectorNum, byte buffer[])
    throws IllegalArgumentException, IndexOutOfBoundsException

// Read the disk sector numbered sectorNum and place
// the result in buffer. Note: the result of a read of a
// sector must reflect the results of all previously
// committed writes as well as any uncommitted writes
// from the transaction tid. The read must not
// reflect any writes from other active transactions
// or writes from aborted transactions.
void readSector(TransID tid, int sectorNum, byte buffer[])
     throws IOException, IllegalArgumentException, IndexOutOfBoundsException

// Write all of the transaction's updates to the log.
// Then, put a barrier in the write queue.
// Then, write "commit" to the log
// Then, wait until the "commit" is safely on disk.
// Ensure that eventually,
// the updates within the transaction are
// writen to their specified sectors, but don't wait for
// the writeback to finish before returning
// e.g., mark the transaction data structure as
// "committed" and move it to the writeback queue.
//
// Hint: Things will probably be easier if
// you make sure that commit i+1 cannot land
// on disk before commit i is on disk.
// Barriers may help here.
void commitTransaction(TransID tid)
    throws IOException, IllegalArgumentException

// Free up the resources for this transaction without
// committing any of the writes.
void abortTransaction(TransID tid)
    throws IllegalArgumentException

Asynchronous IO

Writes to the log within a transaction must be asynchronous -- (1) write() must return without waiting for its write to go to disk. (2) When a transaction commits, you must send the writes to the log, issue a barrier request, and then send the commit record to the log. (3) Only after the commit (and all the writes in the transaction it commits, of course) is (are) safely in the log, can the commit() call return. (4) Only after the commit has completed, you should arrange for the updates to get written to their final locations on disk. (5) The commit() call should return before these write-backs (to the final locations on disk) complete; it may return before these write-backs are even issued (e.g., another thread may issue those writes.)

Hint: If two transactions update the same sector, you need to make sure that the write from the transaction that commits later happens after the write from the earlier-committed transaction. A good way to do this might be to have a producer/consumer queue for committed transactions' writes. Then, commit() puts a set of writes into the queue, and a single write-back thread pulls a transaction's writes from the queue, asynchronously issues them all, waits for them all to complete, and then moves to do the write-back for the next transaction's writes.

Hint: The log will have a sequence of writes for a transaction followed by a commit record for that transaction (or no commit record if the system crashed before the commit made it to disk.) The problem is: How do you know if the sector you are reading is a "commit" or just a sector update that happens to look like a "commit"? You also need to know which sectors were updates by the transaction's writes. So, you probably want to preceed a transaction's writes in the log with a description of the transaction's writes: how many are there, what sector is updated by each write, etc. Thus, it may be useful if your log looks like: [transaction_metadata [sectors]* transaction_commit]*

Concurrency

Note that your implementation must use locks and condition variables to enforce concurrency control.

One shared data structure you are likely to have will be an object to represent each in-progress transaction. Multiple threads may concurrently attempt to read/write/commit/abort the same transaction. Your code must synchronize such concurrent access.
Multiple in-progress transactions may try to update the same sector. A read by transaction T of sector S must return the data from the latest write to sector S by transaction T, if any. If transaction T has not updated sector S, then a read must return the data written by last write of sector S by the last-committed transaction T' that wrote sector S.

Hint: If you have a per-transaction set of updates and a write-back queue of committed transactions' writes, then a read could first check it's transaction, then check the write-back queue, and then, if necessary, read from disk. When you read from the write-back queue, make sure you enforce the "last commit wins" rule.

Hint: It is logically possible to have two transactions committing "at the same time", so that, for example, both t1 and t2 issue their writes to the log at the same time (to different locations in the log, obviously). Then, t1 and t2 could both issue their commits. This "optimization" is not required. In fact, it is a probably a bad idea for at least two reasons. First, it will complicate recovery (after a crash transaction i in the log may not have a commit in the log, but transaction i+1 or i+7 may). Second, to the extent that the updates get performed in a way that takes advantage of this extra concurrency, performance would be worse not better (due to extra rotational delays between writing a transaction's data and its commit.) The point of this whole hint is: You probably want to ensure that only one thread/transaction is doing a commit at a time. Put a barrier in an appropriate place.
Recovery

When your system starts, you need to make sure that all updates in the log from committed transactions (1) are immediately visible to reads and (2) are eventually written to their final locations on disk. So, your recovery code must read data from the logs, update in memory data structures, update disk, or both. Make sure you follow the ordering rule above (if two transactions update the same sector, the update from the transaction that committed last is the one that must "win".) Notice that transactions may not commit in the same order that they begin, so the tid you assign when a transaction starts cannot be used to order updates on recovery.

Hint: One option is to have special-purpose code that reads each transaction from the log, does that transaction's write-back, waits for the write-back to complete, and then goes on to the next committed transaction. But there may be a simpler way: you probably already have code that applies a list of transactions in a specified order. And that makes sure that reads "see" these writes while they are still pending. Can you re-use that code?

Circular log

Your on-disk log is fixed size, and you should treat it as a circular log with new updates added at the head and the tail pointing to the oldest update from a committed transaction that has not been (or may not have been) written back to its final location on disk. So, every write to a transaction moves the head forward. And, when a write-back completes, the tail moves forward. Obviously, you must make sure the head does not pass the tail.

If a transaction tries to commit when there is not room to send its writes to the log, the commit call should block until the transaction's updates and commit can be written to the log. (If a single transaction has too many writes to fit in the log, even if the log were otherwise empty, you should throw an exception either on the culprit write or on the commit.)

On recovery, ideally the recovery thread would read the log from the tail to the head so that it can commit exactly the writes that have not been written back yet. Note, however, that it is safe to start before the tail -- it is no problem to re-write-back a write that has already been written back (as long as ordering constraints are followed.) So, you don't need to have a perfect on-disk head or tail pointer. Still, it may simplify things if you know about where the tail is. One option is to store a "log-start" pointer in a well-known sector on disk. The log-start must always be at or before the tail and the head may not cross the log-start.

Unit tests

As your career progresses, you will find that writing simple sanity-check tests for each object you write will save you enormous amounts of time and make you a much more productive programmer. So, we should not have to require you to write any specific tests because you should already be planning to do so. But, thorough testing of this part of the project is so important to the rest of the project, that we're going to require it.

Write a program called ADiskUnit that executes a series of unit tests for ADisk. Designing such tests is a skill that deserves practice, so we do not provide a full list of what you should test. You might start with simple reads and writes within a transaction. Then look at reads and writes across transactions that commit or abort. Be sure to test garbage collection and crash recovery. You should have some tests with a single thread and some with more. Should probably have some tests for the simple case when the disk doesn't crash and some for the case where it does. Etc. We strongly recommend writing the tests as you add each piece of functionality rather than trying to write a bunch of tests as an afterthought once you have finished the project.

The TA will also have a set of unit tests (other than yours) to use for grading your code. So, passing your tests is not necessarily enough. But, if you design good tests, then passing your tests makes it much more likely that you will pass the TA's tests.

Additional comments

Your tests must be self-checking. That is, after each test, you should output a simple statement like "Test Succeeds" or "Test Fails." (e.g., you should not require a human to read through 100 lines of output to see if the read on line 92 matches the write on line 7; instead, your unit test program should remember the earlier write it did and compare the read result to it.)

Internal Data Structures and Approach

Your ADisk should have the following internal structures and types:

An ActiveTransactionList keeps a list of Transactions that have been started but not yet committed or aborted.
ADisk.beginTransaction() adds a transaction to this list. ADisk.readSector() checks an active transaction to see if the desired sector has been written within that transaction. ADisk.writeSector() adds an update to an active transaction. ADisk.commitTransaction() and ADisk.abortTransaction() remove a transaction from the active transaction list.
A WriteBackList keeps a list of transactions that have committed but that have not yet been written back.
ADisk.commitTransaction() should move a Transaction from the ActiveTransactionList to the WriteBackList. (But note that it should wait until the commit is in the log before putting the transaction on this list.)
Notice that all of the transactions on the write back list are committed but not yet on disk. This means that ADisk.readSector() needs to check this list to see if the sector being read has an update here. It might even have multiple updates here, in which case the read should return the last sector committed.
A writeback thread should write the updates from the oldest transaction on this list; then remove that transaction from this list; then repeat.
A simple way to do recovery should be to take each committed transaction you find in the log and put it on this list. Then writeback can just happen in its normal way.
A Transaction keeps the status of a transaction (INPROGRESS, COMMITTED, ABORTED), a list of writes for the transaction, and other information about the transaction.
ADisk.writeSector() adds a write to a transaction; ADisk.readSector() checks to see if the desired sector has been written within the transaction and, if so, returns the corresponding value. ADisk.commit() commits a transaction and ADisk.abort() aborts a transaction, which, among other things, means that any read or write that has not completed and that tries to use this transaction will return an error.
A Transaction can be written to the log. If a transaction is committed, Transaction.getSectorsForLog() returns an array of sectors to be written to the log. This array includes one or sector containing header data, stating which Disk sectors this transaction updates, then some sectors of updates, then a commit sector.
During recovery, sectors from the log can be read to produce a transaction. Transaction.parseHeader() takes an array of bytes corresponding to a disk sector that contains a transaction header and identifies how many sectors from the log to read to get the full transaction; then Transaction.parseLogBytes() takes that many sectors of data and tries to produce a committed Transaction (if the transaction from the log was not committed, an exception is thrown instead).
Don't forget to synchronize access to a Transaction in case multiple threads are simultaneously working on the same transaction. So you'll want locks. You'll also want to handle the case where a transaction gets committed or aborted and then a read or write tries to make use of it.
You'll need to keep track of what parts of the log are in use so that you know where to write the next committed transaction and so that you'll know when you need to wait for garbage collection before writing the next committed transaction. During recovery, you'll need to figure out where to start reading from. LogStatus will help coordinate access to the log.
You'll need to handle the asynchrony of the disk and have some way to track when sets of pending requests get done. One way to do this is to have a simple CallbackTracker object that stores results as they arrive, indexed by tag and that lets a requestor wait until a particular tag has a result.

Milestones

Break this project into a bunch of small, managable, testable steps. Here is one sequence. At each step, write and run some unit tests to make sure things are working as expected. Your README should describe the steps you took and the unit tests at each step.

Full pseudocode. I've sketched some basic data structures and outlined how they work, but you should not start building until you have figured out how they will all fit together.
Go top down. For the ADisk, list its member variables. Then, for each public method of ADisk, write the pseudo-code for that public method in terms of operations on ADisk's member variables.
Now you know the interface each of these internal classes needs. Write down those public methods for each of these classes (it's OK to change them from what I provided, to add new arguments, or to add new methods.) Then, for each of these classes, write down its member variables and the pseudocode for its methods.
Make sure you understand how all the pieces will fit together before you start writing code.
At this stage, list the tests you will run as unit tests for each of these classes.
Single Transaction works: addWrite(), checkRead(), commit(), and abort() all do appropriate things to the data structure. Can iterate through updates of a committed transaction using getNUpdatedSectors() and getUpdateI(). Can get an array of bytes using getSectorForLog() and reconstitute an equivalent transaction using parseHeader() and parseLogBytes() on that array. Can remember and recall log sectors.
WriteBackList works: can add committed Transactions and remove them in order; checkRead returns the appropriate write from the latest committed transaction (if there is one).
ActiveTransactionList works
Writes to log work (no writeback or recovery). Public ADisk interface works for a limited number of writes (then you get stuck because of no writeback).
Recovery works (no writeback)
Writeback works.
Writeback and recovery works.
All done.

Other hints

It is extremely important that you get each part of the project completely working before you move on to the next part. Bugs here could manifest in very strange behavior that will be difficult to diagnose in higher levels of the code. A general principle in engineering complex systems is "go from working system to working system." That is, build core functionality, debug it, then add a small subset of functionality and test that. It is much easier to find a bug by testing a 100-line module 20 minutes after you write it than to find the same bug by testing the entire 2000 line system after a week of writing code. If you try to write all your code first and test it later, you are very likely to fail this project. If you thoroughly test each module and each interface as you write it, each module may seem to take 10% longer, but the overall project will take much less time. It is also much more likely to work. Thorough testing of each phase of the project is a requirement of this project.
Begin this part of the project by writing down your main data structures, and then writing pseudo-code for each of the above functions in terms of methods on these data structures. Given the number of threads and objects, think about how to structure your modules for safe locking that avoids deadlock. A picture may help you here.
You need to be able to take an object in memory, store it to disk as an array of bytes, and later reconstruct the object by reading an array of bytes from disk and converting those bytes to an object. Java provides a serialization interface for accomplishing this type of things. See, for example http://java.sun.com/developer/technicalArticles/Programming/serialization/. Note that this example writes directly to a FileOutputStream, but you may instead want to write to a ByteArrayOutputStream and then write the ByteArray to a specific offset in your file. Similarly, you may want to read a byte array from a specific offset of your file and then use that byte array via a ByteArrayInputStream.

What to Turn In

All of your implementations must adhere to (e.g., must not change) ADisk's public interfaces specified above. Also, you may not modify the Disk interface in any way. You may add additional public methods to ADisk; we don't think you will need to do so (except perhaps to add testing interfaces). You can change the interfaces to the other "internal" objects/classes. Electronically turn in (1) your well commented and elegant source code and (2) a file called README.

Your README file should include 4 sections:

Section 1: Administrative: your names (both names in your 2-person team), the number of slip days that you have used on this project and so far. Note that for a two-person team, the number of slip days available to the team is the minimum of the number of slip days available to each member of the team.
Section 2: A list of all source files in the directory with a 1-2 line description of each
Section 3: 1 short paragraph, describing the high-level design of part 1. 1 short paragraph explaining the organization of locks and threads in your architecture. 1 short paragraph explaining how recovery works.
Section 4: A discussion of your testing strategy. Outline the programs you used (at a high level), what each one tests, and the results of those tests. (More detailed low-level comments should be in the programs, themselves.)

Logistics

The following guidelines should help smooth the process of delivering your project. You can help us a great deal by observing the following:

After you finish your work, please use the turnin utility to submit your work.

Usage:

turnin --submit taid cs372-labF your_files

You must include a Makefile to compile the program when "make" is typed and execute your tests when "make unit" is typed.
Do not include class files in your submission!! (Or core dumps!!!) (Or your disk data file!) (e.g., run "make clean" before turnin.)
The project will be graded on the public Linux cluster (run 'cshosts publinux' to get a list) Portability should not be a major issue if you develop on a different platform. But, if you chose to develop on a different platform, porting and testing on Linux by the deadline is your responsibility. The statement "it worked on my other machine" will not be considered in the grading process in any way.
You are required to adhere to the multi-threaded coding standards/rules discussed in class and described in the hand out.
Code will be evaluated based on its correctness, clarity, and elegance. Strive for simplicity. Think before you code.

Grading

80% Code

Remember that your code must be clear and easy for a human to read. Also remember that the tests we provide are for your convenience as a starting point. You should test more thoroughly. Just passing those tests is not a guarantee that you will get a good grade.

20% Documentation, testing, and analysis

Discussions of design and testing strategy.

Start early, we mean it!!!