Distributed-builds
(Advanced) how to distribute ACL2 book building over a cluster
of machines.
Warning: getting a cluster set up and running smoothly is a
significant undertaking. Aside from hardware costs, it can take real
effort to install and administer the system, and you will need to learn how to
use the queuing system effectively. You'll probably also need to be ready to
do some scripting to work around dumb problems. Think of this topic as
a collection of hints that may help you, not a complete guide to setting up a
cluster.
At Centaur, cert.pl is successfully used within a Rocks cluster environment,
using the open-source queuing system Torque
and the Maui
scheduler. This clustering software allows PBS scripts to be submitted as
jobs. cert.pl has a few features designed to support this kind of cluster.
Support for PBS directives
For one, cert.pl writes out a PBS script for each book it is going to
certify. These scripts look like ordinary shell scripts (so they work fine for
use in non-cluster environments), but they contain special comments that the
clustering software understands.
These comments allow you to say, e.g., how much memory a job is expected
to take, so that if a job exceeds its allotted memory, the clustering
software may choose to kill it. The clustering software also uses this memory
limit to ensure that when it allocates a job to a machine, the machine will
have enough physical memory to run the job.
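For illustration, a generated script might look something like the following.
This is a hypothetical sketch, not cert.pl's actual output; the particular
directives, resource names, and file names shown are placeholders.

```shell
#!/bin/sh
# Hypothetical sketch of a generated certification script. The #PBS
# lines are ordinary comments to the shell, so the script still runs
# fine locally, but the clustering software reads them as resource
# directives when the script is submitted as a job.
#PBS -l pmem=4gb
#PBS -l walltime=10:00:00
# Certify the book, capturing all output (paths are placeholders):
acl2 < foo.lisp.commands > foo.cert.out 2>&1
```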
This is really very useful. If you let a machine start swapping into the
gigabytes, at worst you will need to physically reset it, because it dies a
special kind of horrible death where its load average is 50 and you can't even
"kill" anything. In a slightly better case, you may run into the Linux
overcommit and OOM killer features, which are also really awful. My favorite
article on the topic, from back before we had the cluster and were running into
this frequently, is here.
At any rate, when cert.pl writes out the scripts to certify books, it
includes some PBS directives that say how much memory the book is expected to
take. This is done by a stupid heuristic: we search for set-max-mem lines;
if no such line is found, we say the book will take 4 GB, and otherwise we
reserve roughly 2-3 GB more than the set-max-mem line calls for. This extra
padding is because set-max-mem only affects the heap and doesn't account for
the stacks (we typically build a CCL image with large stacks, as explained
in centaur/ccl-config.lsp), and also because set-max-mem is best thought
of as a soft cap, anyway.
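Such a heuristic might be sketched in shell as follows. This is hypothetical
and is not cert.pl's actual code; the 3 GB padding constant and the exact
pattern matched are illustrative only.

```shell
#!/bin/sh
# Hypothetical sketch of the memory heuristic: look for a line like
#   (set-max-mem (* 6 (expt 2 30)))
# in the book given as $1. Default to 4 GB when no such line exists;
# otherwise pad the declared cap by 3 GB to leave room for the large
# stacks and for set-max-mem's soft-cap overruns.
book="$1"
n=$(grep -o 'set-max-mem (\* [0-9][0-9]*' "$book" 2>/dev/null \
      | grep -o '[0-9][0-9]*$' | head -n 1)
if [ -z "$n" ]; then
  echo "4gb"
else
  echo "$((n + 3))gb"
fi
```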
Support for a Queuing System
Besides this support for PBS directives, cert.pl also consults an
environment variable $STARTJOB. If this variable isn't set, we default it
to your current $SHELL. When we run ACL2 jobs, we basically use:
$STARTJOB -c "acl2 < certify-commands &> foo.cert.out"
So, given a suitable startjob command, cert.pl can automatically
distribute the jobs to your cluster. A suitable command is one that:
- Accepts the -c syntax or (without -c) accepts a script.
- Waits for the job to finish.
- Exits "transparently", i.e., with the exit code of the job.
A suitable startjob command does not need to support any input/output
redirection; we embed that into the command itself.
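The contract above is easy to satisfy. As a trivial, hypothetical example, the
following would be a valid (non-distributing) startjob; a real cluster version
would instead wrap something like a queue-submission command.

```shell
#!/bin/sh
# A trivial, hypothetical "startjob" that satisfies cert.pl's contract
# without doing any distribution: it accepts the same -c syntax as a
# shell, waits for the command to finish, and exits "transparently"
# with the command's own exit code.
exec /bin/sh "$@"
```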
Support for NFS Lag
We originally found that our builds would often "fail" due to the
following scenario:
- Head node: Makefile submits book A to the queue.
- Compute node: Certifies book A successfully.
- Head node: startjob returns control to the Makefile.
- Head node: Makefile runs ls A.cert to check success.
- Head node: ls fails because NFS isn't up to date.
- Make thinks there's been a problem and dies.
- Moments later A.cert shows up.
To avoid this, cert.pl now has special support for NFS lag. We now use
exit codes instead of files to determine success. In cases where the exit code
says the job completed successfully, we wait until A.cert becomes visible
to the head node before returning control to the Makefile.
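The waiting step can be sketched as a simple polling loop. This is a
hypothetical illustration; cert.pl's actual implementation and timeout may
differ.

```shell
#!/bin/sh
# Hypothetical sketch of the NFS-lag workaround: the compute job has
# already reported success via its exit code, so poll until the .cert
# file (given as $1) becomes visible on the head node, up to a timeout.
cert="$1"
waited=0
while [ ! -e "$cert" ] && [ "$waited" -lt 120 ]; do
  sleep 1
  waited=$((waited + 1))
done
if [ ! -e "$cert" ]; then
  echo "error: $cert still not visible after ${waited}s" >&2
  exit 1
fi
```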