Milawa on Jitawa

Magnus Myreen has developed and formally verified (in HOL4) a lisp implementation, Jitawa, that can run a new version of Milawa's trusted core. More recently, he has proved the soundness of Milawa's kernel and connected it to the verification of Jitawa, effectively proving that Milawa is sound all the way down to the machine code.

This page is about how to run Milawa on Jitawa. See Magnus' page instead for all the details about the Jitawa implementation and its verification.

Running Milawa on Jitawa

You probably need a machine with at least 32 GB of memory to try the tests, and 64 GB of memory is probably needed for SBCL at least. You probably also need 40-50 GB of free disk space.

Installing Lisp Systems

You'll want Jitawa and perhaps other Lisps to compare it against:

Milawa on Jitawa Source Code

Milawa's kernel was cleaned up and simplified to make it more suitable for Jitawa.

To actually run the core you can use these scripts. The syntax is always milawa2-[lisp].sh

You may want to edit these scripts to control how much space is used for stacks and heaps, etc., depending on your computer.

Getting the Milawa Core and Events Files

BOZO I should stop distributing tarballs and instead explain how to build the proofs.

The new trusted core can run on both Jitawa and Common Lisp implementations. This is nice because it means we can measure Jitawa's performance against real Lisp systems. Here are some old benchmarking results from around April 2011.

Test Jitawa CCL SBCL 3m 11s elapsed
188s user
3s system
3m 57s elapsed
209s user
13s system
4m 32s elapsed
255s user
18s system 1 hr 39 min elapsed
5,934s user
10.8s system
1 hr 22 min elapsed
4,850s user
18.8s system
2 hr 18 min elapsed
8,205s user
82s system 128 hr, 55 min elapsed
463,946s user
42.6s system
16 hr 13 min elapsed
58,110s user
51s system
22 hr 28 min elapsed
80,586s user
309s system

These results are taken from "identical" machines with Intel(R) Xeon(R) CPU E5450 @ 3.00GHz (8 cores) with 64 GB of memory, running 64-bit Linux. I've tried to make sure the machines are relatively free of other processes while the benchmarks were being run.

Efforts are made to make the Common Lisp systems perform well. I tell them to optimize for fastest execution speed and suggest that they inline the Milawa primitives (a full 20% speedup for CCL on the during my beta2 testing). I also configure their garbage collectors to allocate much more than usual between collections, which dramatically improves performance. In other words, Jitawa isn't just being compared against "out of the box" settings for CCL and SBCL, but rather more optimal settings.

Just what are these benchmarks?

Once you get all the above stuff, you can use the Makefile in the bench directory to conduct the tests. See the comments at the top of the Makefile for usage notes.

Hacks and Warts

BOZO did we fix these?

Thoughts about Performance

Jitawa was performing really close to CCL without inlining. Some thoughts as to why Jitawa's performance is pretty good despite not having any compiler optimizations or even register allocation:

Old Status Updates

These are only perhaps of some historic interest.

Update: Apr 4, 2011

We're benchmarking a new version with improved performance debugging capabilities (it prints basic information about heap usage, number of stored constants, timing information, etc.) and comparing the full run against snapshots to see what's going on. Magnus has also improved the handling of stored constants to be more efficient, but this is looking less likely as the culprit.

Update: Mar 28, 2011 -- It finished!

The full run on Jitawa finally finished tonight after 128 hours, 55 minutes, 28 seconds. This is about 8x worse than CCL, 5.8x worse than SBCL. This is somewhat disappointing given the incredibly competitive performance that Jitawa shows for the early events (but is also really cool anyway.)

Magnus and I are following up on this. It seems like Jitawa's performance really should be much better than this. Te early indications were that Jitawa was almost as fast as CCL, and moreover the tail end of the proof checking seemed to drag on and on even though a snapshot that was doing the same work finished reasonably quickly. Moreover, events that really should have been lightning fast at the end seemed to take a long time to process. So, it seems that there might be something that is leading to a progressive slowdown over time. We're currently speculating as to what the causes might be and looking into ways to get more useful debugging information out of the system.

Some other information from time --verbose:

It doesn't look like the new run I started up with a bigger heap size is doing any better, so it may be something unrelated to garbage collection, etc.

Update: Mar 27, 2011, 2:10 PM

Looking good -- the full run is up to 14049 and is heading toward the home stretch of easy proofs; maybe it will finish in a couple of hours.

Update: Mar 27, 2011

1:00 PM. Well, the ITP review deadline looms closer. The Jitawa run seems to be through the largest proof, and is up to event 13979, but it is still going. Things should be pretty fast from here on out: there are several proofs in the 13,000s that establish the correctness of Milawa's rewriter that are the real killers; after these are through the remaining events are a lot smaller and faster.

On the other hand, I'm really confused by the performance we're seeing. Based on the "resumed snapshot" performance, we should have been finished more than two days ago.

The following is my hunch, but at the moment it is completely unsubstantiated. I can imagine that, because of the leading proofs that were skipped, the snapshot might have "gotten lucky" and ended up GC'ing at different times than the full run. It's possible that the full run is getting bogged down by trying to GC in situations where most of the available memory is used up by live objects, and so each GC is (1) relatively unproductive since few objects are dead and can be freed, and (2) very costly since copying so many live objects requires a lot of work. We've sometimes seen this kind of behavior in certain regression tests at Centaur, and it can be truly deadly for performance.

On the other hand, while this theory could plausibly explain why it seemed to take much longer to process event 13613, it seems a little unlikely that later events would happen to run into the same situation. It's too bad that we had to disable the GC statistics printing, since those messages would have given us a good clue to start from.

In case this is the problem, I've started up a fresh, full run on another machine, and told Jitawa that it can use 2 billion conses instead of 1.5 billion. This run is still in the early stages (on event 5726 at the moment). But my hope is that it will finish much faster than the first run. Even if this isn't the case, it should be interesting to see the relative performance of the two runs.

UPDATE: Mar 25, 2011

It's 10:00 PM and it's still running. It spent forever on event 13613 (more than a whole day, which is odd since the snapshot seemed to get through it faster than that), but it looks like it's finally gotten through that. If it performs anything like the snapshot it should finish tomorrow. It's too bad we can't see how much time it's spending in GC. It would also be good to implement a (time ...) function that can report on how long it takes to do something, eventually.

UPDATE: Mar 24, 2011

As of 9:40 AM the program is still running. It's currently processing event 13613, which is in the middle of the biggest stretch of proofs. It previously segfaulted before getting this far, so it looks like the problem is fixed. But, it also seems to be running much more slowly than I'd expected during this strech of proofs, so I'm not sure what the new ETA ought to be. I no longer expect it to finish today.

At 8:00 PM it's still running; the snapshot I started from event 13561 on another computer finished after 27 hours, 38 minutes, so barring unforseen weirdness the real run should finish by tomorrow at noon or so.

UPDATE: Mar 23, 2011

Magnus traced down the segfault to our C routine for printing GC statistics. To work around this problem, he has just sent me a new executable that doesn't print these statistics. This version solves the segfault for the small example we came up with.

The new version is now working through the proofs. As of 2PM Central it is on event 5726, and we are hopeful that it will finish successfully. My best guess for its completion time would be around 10:00 tomorrow morning.

Meanwhile Magnus is trying to figure out how the bad interaction occurred. There is probably a mismatch between what the C compiler is expecting and what the Jitawa assembly code is doing when it calls this function.

UPDATE: Mar 22, 2011

We managed to come up with a very, very small example that leads to the segfault quickly. This should help with debugging. Meanwhile Magnus has done a lot to validate the model.

UPDATE: Mar 21, 2011

At this point, we think it is most likely that some bugs were introduced when the 32-bit x86 model was ported and extended to 64-bit x86. Magnus has actually already found and fixed a few bugs in the decoder, and has sent me updated versions of Jitawa to try. Unfortunately this hasn't fixed the problem.

In the past few days, Magnus has been developing testing programs to validate the model against real x86 execution. We hope that this will reveal what is amiss while also giving us more confidence in the model.

In another effort to be able to reproduce the bug more quickly, we developed an experimental Milawa version that can make snapshots of its progress. (This is quite barbaric and is certainly not something we want to ask people to trust: to make a snapshot we just print out the Milawa state; to resume we actually use a perl script to parse the output file and extract the printed state.)

When we use a snapshot to advance right up to the event before the segfault, the bug disappears for that event as we resume. In this way we can make additional progress, but the segfaults become much more frequent in the 13,500's and so I haven't managed to get all the way through the events in this way (and of course the bug casts doubt on the whole checking process, anyway).