PARALLELISM-TUTORIAL

A tutorial on how to use the parallelism library.
Major Section:  PARALLELISM

In this topic we introduce the ACL2 parallelism primitives using the example of a doubly-recursive Fibonacci function, whose basic definition is given below. See parallelism for a very high-level summary of the parallelism capability described here, and see parallelism-build for how to build an executable image that supports parallel evaluation. Here, we assume that such an executable is being used.

Serial Fibonacci

(defun fib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (+ (fib (- x 1)) (fib (- x 2))))))
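
For example, under this definition the tenth Fibonacci number is 55:

  (fib 10) ; returns 55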

Introducing Pargs

A simple way to introduce parallelism into this function definition is to wrap the addition expression in a call of pargs; the arguments of the addition will then be computed in parallel whenever resources are available. The result is a very similar, and thus intuitive, function definition. Note that we replaced + by binary-+, since pargs expects a function call, not a macro call.

(defun pfib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (pargs (binary-+ (pfib (- x 1))
                            (pfib (- x 2)))))))

Introducing the Granularity Problem

After you submit the above two versions of the Fibonacci function, test them with the following forms.

(time$ (fib 10))
(time$ (pfib 10))

Now increase the argument by increments of 5 to 10 until you find your curiosity satisfied or your patience wearing thin. You can interrupt evaluation if necessary and return to the ACL2 loop. You will immediately notice that introducing parallelism has not increased execution speed, at least not by much.

First, consider the computation of (pfib 4). Assuming resources are available, (pfib 4) will create a thread for computing (pfib 3) and another thread for (pfib 2). It is easy to imagine that setting up each thread takes much longer than the entire computation of (fib 4).

Second, we must realize that if we have two threads available for computing (fib 10), then the evaluation of (fib 8) will probably complete before the evaluation of (fib 9). Once (fib 8) finishes, parallelism resources will become available for the next recursive call made on behalf of (fib 9). If, for example, that call is (fib 3), we will waste many cycles just handing work to the thread that performs this relatively small computation. We need a way to ensure that parallelism resources are used only on problems of a ``large'' size. The problem of ensuring that only ``large'' problems are spawned is known as the ``granularity problem.''

In summary: We want to tell ACL2 that it can evaluate the arguments of pargs in parallel only when the parameter of pfib is greater than some threshold. Our tests using OpenMCL have suggested that 27 is a reasonable threshold.

Explicit Programming for the Granularity Problem

One way to avoid the granularity problem is to duplicate code as follows.

(defun pfib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (if (> x 27) ; the granularity test
               (pargs (binary-+ (pfib (- x 1))
                                (pfib (- x 2))))
             (binary-+ (pfib (- x 1))
                       (pfib (- x 2)))))))

Duplicating code is fundamentally a bad design principle, because it can double the work of future maintenance. A ``granularity form'' avoids such duplication. It is an expression

  (declare (granularity <form>))

that allows threads to be spawned only when the evaluation of <form> yields a non-nil value. It may be placed inside any call of a parallelism primitive, in a position documented separately for each primitive. Here is a definition of pfib that uses this feature in a call of the parallelism primitive pargs.

(defun pfib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (pargs
            (declare (granularity (> x 27)))
            (binary-+ (pfib (- x 1))
                      (pfib (- x 2)))))))

Test each version as follows (or substitute your own natural number for 33).

  (time$ (fib 33))
  (time$ (pfib 33))
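
If you would like to experiment with thresholds other than 27, one possible variant (a sketch of ours, using the hypothetical name pfib-threshold) passes the threshold as an extra argument and uses it in the granularity form:

(defun pfib-threshold (x threshold)
  (declare (xargs :guard (and (natp x) (natp threshold))))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (pargs
            (declare (granularity (> x threshold)))
            (binary-+ (pfib-threshold (- x 1) threshold)
                      (pfib-threshold (- x 2) threshold))))))

For example:

  (time$ (pfib-threshold 33 20))
  (time$ (pfib-threshold 33 30))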

Another Granularity Issue Related to Thread Limitations

Our implementation of parallelism primitives has the property that once a thread is assigned a computation, that assignment stays in effect until the computation is complete. In particular, if a thread encounters a parallelism primitive that spawns child threads, the parent thread stays assigned, waiting until the child computations complete before it can continue its own computation. In the meantime, the parent thread reduces the number of additional threads that Lisp can provide by 1, rather than being reassigned to do other work.

How can this lack of reassignment affect the user? Consider, for example, the application of a recursive function to a long list. Imagine that this function is written so that the body contains a recursive call, for example as (pargs (cons (process (car x)) (recur (cdr x)))); a concrete sketch appears below. Each such pargs call that spawns child work must wait for its children, one of which is the work of evaluating (recur (cdr x)), to complete. There is an ACL2 limit on how much pending work can be in the system, which limits the number of pargs calls that can result in parallel evaluation. For example, if the ACL2 limit is k and each call of pargs actually spawns threads for evaluating its arguments, then after k cdrs there will be no further parallel evaluation.
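
Here is a minimal sketch of such a function; the names process-leaf and process-list (playing the roles of process and recur above) are introduced only for this illustration, and we combine the per-element work and the recursive call with cons, since pargs expects a single function call.

(defun process-leaf (e)
  (declare (xargs :guard t))
  (len e)) ; hypothetical per-element work

(defun process-list (x)
  (declare (xargs :guard t))
  (if (atom x)
      nil
    ;; Each pargs call that actually parallelizes keeps the current
    ;; thread waiting until both child computations complete.
    (pargs (cons (process-leaf (car x))
                 (process-list (cdr x))))))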

Possible solutions may include reworking of algorithms (for example to be tree-based rather than list-based) or using appropriate granularity forms. We hope that future implementations will allow thread ``re-deployment'' in order to mitigate this problem.
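
As a rough illustration of the second suggestion, one can thread a depth parameter through the traversal sketched above and use it in a granularity form, so that only a bounded number of calls near the front of the list are candidates for parallel evaluation, and hence only a bounded number of parent threads can be left waiting. The name process-list-depth and the bound of 4 are ours, chosen only for illustration.

(defun process-list-depth (x depth)
  (declare (xargs :guard (natp depth)))
  (if (atom x)
      nil
    (pargs
     (declare (granularity (< depth 4))) ; parallelize only near the front of the list
     (cons (process-leaf (car x))
           (process-list-depth (cdr x) (1+ depth))))))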

Introducing Plet

We can use a let binding to compute the recursive calls of fib and then add the bound variables together, as follows.

(defun fib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (let ((a (fib (- x 1)))
                 (b (fib (- x 2))))
             (+ a b)))))

By using the parallelism primitive plet, we can introduce parallelism much as we did using pargs, with an optional granularity form, as follows.

(defun pfib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (plet 
            (declare (granularity (> x 27)))
            ((a (pfib (- x 1)))
             (b (pfib (- x 2))))
            (+ a b)))))

Notice that this time, we were able to use + rather than being forced to use binary-+. Unlike pargs, which expects a function call (not a macro call), plet allows macros at the top level of its body.
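
As with the pargs version, you can compare the serial and parallel definitions by timing them (again substituting your own natural number for 33):

  (time$ (fib 33))
  (time$ (pfib 33))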

Introducing Pand (By Way of a Tree Validation Example)

Consider ``genetic trees'' that contain leaves of DNA elements, in the sense that each tip is one of the symbols A, G, C, or T. First we define the function valid-tip, which recognizes whether a tip is one of these symbols.

(defun valid-tip (tip)
  (declare (xargs :guard t))
  (or (eq tip 'A)
      (eq tip 'G)
      (eq tip 'C)
      (eq tip 'T)))

Now we define a function that traverses the tree, checking that every tip is valid.

(defun valid-tree-serial (tree)
  (declare (xargs :guard t))
  (if (atom tree)
      (valid-tip tree)
    (and (valid-tree-serial (car tree))
         (valid-tree-serial (cdr tree)))))
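
For example, on small hand-built trees:

  (valid-tree-serial '((A . G) . (C . T)))  ; returns T
  (valid-tree-serial '((A . G) . (C . 23))) ; returns NIL, since 23 is not a valid tip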

We also define a parallel version.

(defun valid-tree-parallel (tree)
  (declare (xargs :guard t))
  (if (atom tree)
      (valid-tip tree)
    (pand (valid-tree-parallel (car tree))
          (valid-tree-parallel (cdr tree)))))

Before we can time the functions, we need to create test trees. We have found that test trees need to be approximately 1 gigabyte in size before we clearly see speedup, and we make them asymmetric to demonstrate the ability of our implementation to adapt to asymmetric data. We can create the trees by executing the following forms.

(defun make-valid-binary-tree (x)
  (declare (xargs :mode :program))
  (if (< x 0)
      (cons (cons 'C 'G) (cons 'A 'T))
    (cons (make-valid-binary-tree (- x 2)) ; 2 to make asymmetric
          (make-valid-binary-tree (- x 1)))))

(defconst *valid-tree* (make-valid-binary-tree 30))  ; takes a while

(defconst *invalid-tree* (cons *valid-tree* nil))    ; nil is an invalid tip

We can time our functions with the forms:

(time$ (valid-tree-serial *valid-tree*))
(time$ (valid-tree-parallel *valid-tree*))

Unfortunately, the serial version runs faster than the parallel version; however, there is more to this story.

Demonstrating Early Termination with an Invalid Tree

Now observe this magic:

(time$ (valid-tree-serial   *invalid-tree*))
(time$ (valid-tree-parallel *invalid-tree*))

The serial version took a while, but the parallel version finished almost immediately. The test for validation was split into testing the car and the cdr of the *invalid-tree* root, and since the cdr was equal to nil, its test returned immediately. This immediate return quickly interrupted the computation associated with the car, and the result was returned.

Granularity with Pand

We can also define a parallel version with a granularity form:

(defun valid-tree-parallel (tree depth)
  (declare (xargs :guard (natp depth)))
  (if (atom tree)
      (valid-tip tree)
    (pand 
     (declare (granularity (< depth 5)))
     (valid-tree-parallel (car tree) (1+ depth))
     (valid-tree-parallel (cdr tree) (1+ depth)))))

We can test this version by executing forms like the following (note the additional depth argument). You will probably find some speedup on a machine with several cores available, but more speedup can likely be obtained by using a more expensive test on the leaves in place of valid-tip.

(time$ (valid-tree-serial   *valid-tree*))
(time$ (valid-tree-parallel *valid-tree* 0))

Introducing Por

Por can be used in the same way as pand, but with early termination occurring when an argument evaluates to a non-nil value, in which case the value returned is t. Finally, por also supports the use of a granularity form.
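
Here is one possible use of por, sketched in the style of the pand examples above; the function name contains-tip-parallel and the depth bound of 5 are ours, introduced only for illustration. Because of por's early termination, finding the desired tip in one subtree can cut short the search of the other.

(defun contains-tip-parallel (tip tree depth)
  (declare (xargs :guard (natp depth)))
  (if (atom tree)
      (equal tip tree)
    (por
     (declare (granularity (< depth 5)))
     (contains-tip-parallel tip (car tree) (1+ depth))
     (contains-tip-parallel tip (cdr tree) (1+ depth)))))

  (contains-tip-parallel 'A *valid-tree* 0) ; returns T, since A occurs as a tip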