Milawa on Jitawa
Magnus Myreen has
developed and formally verified (in HOL4) a Lisp implementation
that can run a new version of Milawa's trusted core. More recently,
he has proved the soundness of Milawa's kernel and connected it to
the verification of Jitawa, effectively proving that Milawa is sound
all the way down to the machine code.
This page is about how to run Milawa on Jitawa. See Magnus'
page instead for all the details about the Jitawa implementation
and its verification.
Running Milawa on Jitawa
You probably need a machine with at least 32 GB of memory to try the tests,
and probably 64 GB for SBCL in particular. You also probably need 40-50 GB of
free disk space.
Installing Lisp Systems
You'll want Jitawa and perhaps other Lisps to compare it against:
- Get jitawa,
build it, and add jitawa to your path
- Install CCL and add
ccl to your path (optional, for performance comparison)
- Install SBCL and add
sbcl to your path (optional, for performance comparison)
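Once everything is installed, a quick sanity check can confirm which of the three executables are actually on your PATH (a sketch; the `have` helper is just a hypothetical convenience, not part of any of these distributions):

```shell
#!/bin/sh
# Check which Lisp executables are on PATH; only jitawa is strictly
# required, ccl and sbcl are optional performance comparisons.
have() { command -v "$1" >/dev/null 2>&1; }

for lisp in jitawa ccl sbcl; do
  if have "$lisp"; then
    echo "found: $lisp"
  else
    echo "missing: $lisp"
  fi
done
```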
Milawa on Jitawa Source Code
Milawa's kernel was cleaned up and simplified to make it more
suitable for Jitawa.
- core.lisp -- new definitions for the
trusted core, used for both Jitawa and Common Lisp.
BOZO: need to merge with Magnus; this is missing his fixes.
- cl-prelude.lisp -- setup of the
Milawa package and primitives, for Common Lisp only (not needed for Jitawa)
- cl-run.lisp -- file reading stuff
for Common Lisp only (not needed for Jitawa)
To actually run the core you can use these scripts. The syntax is
always milawa2-[lisp].sh file.events.
You may want to edit these scripts to control how much space is used
for stacks and heaps, etc., depending on your computer.
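The distributed scripts aren't reproduced here, but a minimal wrapper in this style might look like the following (a hypothetical sketch; the real milawa2-[lisp].sh scripts likely also set stack and heap sizes, as noted above):

```shell
#!/bin/sh
# Hypothetical sketch of a milawa2-[lisp].sh style wrapper.
# Usage: milawa2-jitawa.sh file.events
valid_events_arg() {
  # expect exactly one argument naming an existing events file
  [ "$#" -eq 1 ] && [ -f "$1" ]
}

if valid_events_arg "$@"; then
  echo "checking $1 with jitawa..."
  # cat core.lisp "$1" | jitawa   # actual run, commented out in the sketch
else
  echo "usage: $0 file.events" >&2
fi
```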
Getting the Milawa Core and Events Files
BOZO I should stop distributing tarballs and instead explain how to
build the proofs.
- Download milawa2-bench.tar.gz and extract it
in a new directory, say bench
- Download the events files (542 MB) and extract
them in the same bench directory
(this file says 'beta2', but the events are the same for 'beta3')
The new trusted core can run on both Jitawa and Common Lisp implementations. This is nice
because it means we can measure Jitawa's performance against real Lisp
systems. Here are some old benchmarking results from around April
2011 (elapsed wall-clock times):

                   Jitawa           CCL              SBCL
    1000.events    3m 11s           3m 57s           4m 32s
    level2.events  1 hr 39 min      1 hr 22 min      2 hr 18 min
    full.events    128 hr 55 min    16 hr 13 min     22 hr 28 min
These results are taken from "identical" machines with Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
(8 cores) with 64 GB of memory, running 64-bit Linux. I tried to make sure the machines were
relatively free of other processes while the benchmarks were running.
Efforts are made to make the Common Lisp systems perform well. I tell them to optimize for fastest
execution speed and suggest that they inline the Milawa primitives (a full 20% speedup for CCL on the
full.events during my beta2 testing). I also configure their garbage collectors to allocate much
more than usual between collections, which dramatically improves performance. In other words, Jitawa
isn't just being compared against "out of the box" settings for CCL and SBCL, but against more
optimized configurations.
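For concreteness, this kind of tuning might look like the following (a sketch with illustrative values, not the exact settings used for the published numbers; check your installed versions' documentation for the flags):

```shell
# SBCL: ask for speed over safety, and enlarge the heap so GC runs far
# less often (--dynamic-space-size is in megabytes by default).
sbcl --dynamic-space-size 50000 \
     --eval '(declaim (optimize (speed 3) (safety 0)))' \
     --load cl-prelude.lisp --load cl-run.lisp

# CCL: the GC threshold is set from Lisp rather than the command line,
# e.g. (ccl:set-lisp-heap-gc-threshold (expt 2 32)) followed by
# (ccl:use-lisp-heap-gc-threshold).
```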
Just what are these benchmarks?
- 100.events is just the first 100 events from the self-verification proof.
This actually isn't very useful as a performance test because it's too short and really just
ends up measuring the startup times of the different Lisps. But it's nice to have a short
test to run after making changes to the scripts, to make sure things are basically working.
- 1000.events is the first 1000 events. This is a little longer and more
intense. This is possibly useful for getting a sense of how long things will take,
but may be somewhat dominated by startup costs. (File reading is optimized in all cases
for the full.events test, so on shorter tests we end up allocating far more memory than
we really need for the abbreviation tables and acceptable-object checks).
- level2.events is around 4500 events and goes all the way through the first
extended proof checker and a switch event. This is a pretty comprehensive workout for
Milawa's core, but isn't too demanding on the Lisp system.
- full.events is the full self-verification proof.
Once you get all the above stuff, you can use the Makefile in the
bench directory to conduct the tests. See the comments at the top of
the Makefile for usage notes.
Hacks and Warts
BOZO did we fix these?
- The comments in core.lisp are currently being removed by the shell script
for running Jitawa. It sounds like Magnus is going to eventually add
support for them in Jitawa so that this won't be necessary, but this is low priority.
- core.lisp has temporary code for automatically quoting "bare" T and NIL
uses in definitions. I think Magnus is going to change Jitawa so that this
will not be necessary, which will be good because it's really nasty. Note
that the Common Lisp systems are currently being forced to also do this
quoting (since that way I can load the same core.lisp file no matter what
implementation is being used). But this rewriting is only applied to
definitions, so it should be pretty minimal no matter what system is being
- Input is kind of a mess. Right now, the Common Lisp version is using a reader
similar to the original Milawa core. But for the Jitawa version I am just
literally using 'cat' and 'echo' to construct an input file by appending
the Milawa code onto the events file, which is really gross. It seems like
it would be nicer to build the Milawa code into Jitawa in some
way, and then maybe have its own command loop call a READ function to
get the next event. But this is longer-term stuff.
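A sketch of the comment-stripping and input construction just described (assumed details; note that naive `sed` stripping would also mangle any semicolon inside a string literal, which the core's comments conveniently avoid):

```shell
#!/bin/sh
# Strip ;-comments from the core, append the events file, and send the
# combined stream to jitawa on stdin (sketch of the current hack).
strip_comments() { sed 's/;.*$//' "$1"; }

build_input() {
  # $1 = core file, $2 = events file; combined stream goes to stdout
  strip_comments "$1"
  cat "$2"
}
# Typical use:
#   build_input core.lisp full.events | jitawa
```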
Thoughts about Performance
Jitawa was performing really close to CCL (without inlining). This performance is
pretty good for Jitawa, considering that it lacks any compiler optimizations or even
register allocation.
Old Status Updates
These are only perhaps of some historic interest.
Update: Apr 4, 2011
We're benchmarking a new version with improved performance debugging capabilities (it prints
basic information about heap usage, number of stored constants, timing information, etc.) and comparing
the full run against snapshots to see what's going on. Magnus has also improved the handling of stored
constants to be more efficient, but this is looking less likely as the culprit.
Update: Mar 28, 2011 -- It finished!
The full run on Jitawa finally finished tonight after 128 hours, 55 minutes, 28 seconds. This is
about 8x worse than CCL and 5.8x worse than SBCL. This is somewhat disappointing given the incredibly
competitive performance that Jitawa shows for the early events (but it is also really cool anyway).
Magnus and I are following up on this. It seems like Jitawa's performance really should be much
better than this. The early indications were that Jitawa was almost as fast as CCL, and moreover the
tail end of the proof checking seemed to drag on and on even though a snapshot that was doing the same
work finished reasonably quickly. Moreover, events that really should have been lightning fast at the
end seemed to take a long time to process. So, it seems that there might be something that is leading
to a progressive slowdown over time. We're currently speculating as to what the causes might be and
looking into ways to get more useful debugging information out of the system.
Some other information from time --verbose:
- Major (requiring I/O) page faults: 32
- Minor (reclaiming a frame) page faults: 6,648,693
- Voluntary context switches: 154,961
- Involuntary context switches: 1,508,401
- Swaps, file system inputs/outputs, socket messages sent/received, signals delivered,
average shared text size, average unshared data size, average stack size, average total
size, maximum/average resident set size: 0
- Page size: 4096
It doesn't look like the new run I started up with a bigger heap size is doing any better, so it
may be something unrelated to garbage collection, etc.
Update: Mar 27, 2011, 2:10 PM
Looking good -- the full run is up to event 14049 and is heading toward the home stretch of easy proofs;
maybe it will finish in a couple of hours.
Update: Mar 27, 2011
1:00 PM. Well, the ITP review deadline looms closer. The Jitawa run seems to be through the largest proof,
and is up to event 13979, but it is still going. Things should be pretty fast from here on
out: the real killers are several proofs in the 13,000s that establish the correctness of Milawa's
rewriter; once these are through, the remaining events are a lot smaller and faster.
On the other hand, I'm really confused by the performance we're seeing. Based on the "resumed
snapshot" performance, we should have been finished more than two days ago.
The following is my hunch, but at the moment it is completely unsubstantiated. I can imagine that,
because of the leading proofs that were skipped, the snapshot might have "gotten lucky" and ended
up GC'ing at different times than the full run. It's possible that the full run is getting bogged down
by trying to GC in situations where most of the available memory is used up by live objects, and so each
GC is (1) relatively unproductive since few objects are dead and can be freed, and (2) very costly since
copying so many live objects requires a lot of work. We've sometimes seen this kind of behavior in
certain regression tests at Centaur, and it can be truly deadly for performance.
On the other hand, while this theory could plausibly explain why it seemed to take much longer to
process event 13613, it seems a little unlikely that later events would happen to run into the same
situation. It's too bad that we had to disable the GC statistics printing, since those messages
would have given us a good clue to start from.
In case this is the problem, I've started up a fresh, full run on another machine, and told Jitawa
that it can use 2 billion conses instead of 1.5 billion. This run is still in the early stages (on
event 5726 at the moment). But my hope is that it will finish much faster than the first run. Even if
this isn't the case, it should be interesting to see the relative performance of the two runs.
UPDATE: Mar 25, 2011
It's 10:00 PM and it's still running. It spent forever on event 13613 (more than a
whole day, which is odd since the snapshot seemed to get through it faster than that), but it
looks like it's finally gotten through that. If it performs anything like the snapshot it
should finish tomorrow. It's too bad we can't see how much time it's spending in GC. It would
also be good to implement a (time ...) function that can report how long it takes to do something.
UPDATE: Mar 24, 2011
As of 9:40 AM the program is still running. It's currently processing event 13613, which
is in the middle of the biggest stretch of proofs. It previously segfaulted before getting
this far, so it looks like the problem is fixed. But, it also seems to be running much more
slowly than I'd expected during this stretch of proofs, so I'm not sure what the new ETA ought
to be. I no longer expect it to finish today.
At 8:00 PM it's still running; the snapshot I started from event 13561 on another computer
finished after 27 hours, 38 minutes, so barring unforeseen weirdness the real run should finish
by tomorrow at noon or so.
UPDATE: Mar 23, 2011
Magnus traced the segfault to our C routine for printing GC statistics. To work
around this problem, he has just sent me a new executable that doesn't print these
statistics. This version solves the segfault for the small example we came up with.
The new version is now working through the proofs. As of 2PM Central it is on event 5726,
and we are hopeful that it will finish successfully. My best guess for its completion time
would be around 10:00 tomorrow morning.
Meanwhile Magnus is trying to figure out how the bad interaction occurred. There is
probably a mismatch between what the C compiler is expecting and what the Jitawa assembly
code is doing when it calls this function.
UPDATE: Mar 22, 2011
We managed to come up with a very, very small example that leads
to the segfault quickly. This should help with debugging. Meanwhile
Magnus has done a lot to validate the model.
UPDATE: Mar 21, 2011
At this point, we think it is most likely that some bugs were
introduced when the 32-bit x86 model was ported and extended to 64-bit
x86. Magnus has actually already found and fixed a few bugs in the
decoder, and has sent me updated versions of Jitawa to try.
Unfortunately this hasn't fixed the problem.
In the past few days, Magnus has been developing testing programs to
validate the model against real x86 execution. We hope that this will
reveal what is amiss while also giving us more confidence in the model.
In another effort to be able to reproduce the bug more quickly, we
developed an experimental Milawa version that can make snapshots of
its progress. (This is quite barbaric and is certainly not something
we want to ask people to trust: to make a snapshot we just print out
the Milawa state; to resume we actually use a perl script to parse the
output file and extract the printed state.)
When we use a snapshot to advance right up to the event before the
segfault, the bug disappears for that event as we resume. In this way
we can make additional progress, but the segfaults become much more
frequent in the 13,500's and so I haven't managed to get all the way
through the events in this way (and of course the bug casts doubt on
the whole checking process, anyway).