
    Distributed-builds

    (Advanced) how to distribute ACL2 book building over a cluster of machines.

    Warning: getting a cluster set up and running smoothly is a significant undertaking. Aside from hardware costs, it may take significant energy to install and administer the system, and you will need to learn how to use the queuing system effectively. You'll probably also need to be ready to do some scripting to work around dumb problems. Think of this topic as a few hints that may help you, not a usable guide to setting up a cluster.

    At Centaur, cert.pl is successfully used within a Rocks cluster environment, using the open-source Torque queuing system and the Maui scheduler. This clustering software allows jobs to be submitted as PBS scripts. cert.pl has certain features to support this cluster.

    Support for PBS directives

    For one, cert.pl writes out a PBS script for each book it is going to certify. These scripts look like ordinary shell scripts (so they work fine for use in non-cluster environments), but they contain special comments that the clustering software understands.
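For illustration, such a generated script might look something like the following. This is only a hypothetical sketch, not cert.pl's actual output; the particular directive values (pmem, walltime, job name) are made up for the example:

```shell
#!/bin/sh
# Hypothetical sketch of a generated certification script.
#
# The "#PBS" lines are ordinary comments as far as the shell is
# concerned, but the clustering software reads them as directives:
# pmem requests per-process physical memory, walltime bounds the
# run time, and -N names the job in the queue.
#PBS -l pmem=4gb
#PBS -l walltime=10:00:00
#PBS -N foo.cert

# The body is just the ordinary certification command:
acl2 < certify-commands > foo.cert.out 2>&1
```

Because the directives are comments, running the script directly on an ordinary machine simply executes the certification command and ignores them.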

    These comments allow you to say, e.g., how much memory a job is going to take, so that if a job exceeds its allotted memory, the clustering software may choose to kill it. The clustering software also uses this memory limit to ensure that when it allocates a job to a machine, the machine will have enough physical memory to run the job.

    This is really very useful. If you let a machine start swapping into the gigabytes, at worst you will need to physically reset it, because it dies a special kind of horrible death where its load average is 50 and you can't even "kill" anything. In a slightly better case, you may run into the Linux overcommit and OOM killer features, which are also really awful. My favorite article on the topic, from back before we had the cluster and were running into this frequently, is here.

    At any rate, when cert.pl writes out the scripts to certify books, it includes some PBS commands that say how much memory the book is expected to take. This is done by a stupid heuristic: we search for set-max-mem lines; if no such line is found, we say the book will take 4 GB, and otherwise we reserve (I think) 2-3 GB more than the set-max-mem line calls for. This extra padding is partly because set-max-mem only affects the heap and doesn't account for the stacks (we typically build a CCL image with large stacks, as explained in centaur/ccl-config.lsp), and partly because set-max-mem is best thought of as a soft cap anyway.
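That kind of heuristic can be sketched as a small shell function. This is not cert.pl's actual code, just an illustration under the assumption that set-max-mem forms look like (set-max-mem (* N (expt 2 30))), i.e., N gigabytes:

```shell
# guess_mem BOOK.lisp -- print a rough memory reservation for a book.
# Illustrative sketch only, not cert.pl's real heuristic.
guess_mem() {
  book=$1
  default_gb=4    # default when no set-max-mem line is found
  padding_gb=3    # padding for stacks, soft-cap slop, etc.

  # Pull the first number following "set-max-mem (* ", if any.
  n=$(grep -o 'set-max-mem (\* [0-9]*' "$book" 2>/dev/null \
        | head -1 | grep -o '[0-9]*$')

  if [ -z "$n" ]; then
    echo "${default_gb}gb"
  else
    echo "$((n + padding_gb))gb"
  fi
}
```

For example, a book containing (set-max-mem (* 6 (expt 2 30))) would get a 9gb reservation, while a book with no such line would get the 4gb default.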

    Support for a Queuing System

    Besides this support for PBS directives, cert.pl also consults an environment variable, $STARTJOB. If this variable isn't set, it defaults to your current $SHELL. When we run ACL2 jobs, we basically use:

    $STARTJOB -c "acl2 < certify-commands &> foo.cert.out"

    So, given a suitable startjob command, cert.pl can automatically distribute the jobs to your cluster. A suitable command is one that:

    • Accepts the -c syntax or (without -c) accepts a script.
    • Waits for the job to finish.
    • Exits "transparently", i.e., with the exit code of the job.

    A suitable startjob command does not need to support any input/output redirection; we embed that into the command itself.
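As a trivial baseline, the following wrapper satisfies all three requirements by just running the job locally; a real cluster version would instead submit the job to the queue and block until it completes. It is written as a shell function here for illustration; in practice you would install it as a standalone script and point $STARTJOB at it:

```shell
# startjob -- minimal stand-in for a cluster submission command.
# Handles both "startjob -c 'command string'" and "startjob script.sh",
# because /bin/sh accepts both forms.  Since the shell's exit status is
# the function's exit status, the wrapper exits "transparently" with
# the job's own exit code.
startjob() {
  /bin/sh "$@"
}
```

For example, startjob -c "exit 3" yields exit status 3, which is exactly what Make needs to see to detect a failed certification.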

    Support for NFS Lag

    We originally found that our builds would often "fail" due to the following scenario:

    • Head node: Makefile submits book A to the queue.
    • Compute node: Certifies book A successfully.
    • Head node: startjob returns control to the Makefile.
    • Head node: Makefile runs ls A.cert to check success.
    • Head node: ls fails because NFS isn't up to date.
    • Make thinks there's been a problem and dies.
    • Moments later A.cert shows up.

    To avoid this, cert.pl now has special support for NFS lag. We now use exit codes instead of files to determine success. In cases where the exit code says the job completed successfully, we wait until A.cert becomes visible to the head node before returning control to the Makefile.
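The waiting step can be sketched as a simple polling loop. This is an illustration only, not cert.pl's actual code, and the retry count and sleep interval are made-up values:

```shell
# wait_for_file FILE [TRIES] -- poll until FILE becomes visible
# (e.g., after NFS catches up), giving up after TRIES attempts.
# Returns 0 once the file appears, 1 on timeout.
wait_for_file() {
  file=$1
  tries=${2:-100}   # hypothetical default: 100 attempts
  while [ "$tries" -gt 0 ]; do
    if [ -e "$file" ]; then
      return 0
    fi
    # Some shells/systems lack fractional sleep; fall back to 1 second.
    sleep 0.1 2>/dev/null || sleep 1
    tries=$((tries - 1))
  done
  return 1
}
```

In the scenario above, the head node would call something like wait_for_file A.cert after a successful exit code, so that Make only proceeds once the certificate is actually visible.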