Major Section: PARALLELISM
In this topic we introduce the ACL2 parallelism primitives using the example of a doubly-recursive Fibonacci function, whose basic definition is as follows. A very high-level summary of the parallelism capability described here appears in a separate documentation topic; see parallelism-build for how to build an executable image that supports parallel evaluation. Here, we assume that such an executable is being used.
(defun fib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (+ (fib (- x 1))
              (fib (- x 2))))))
A simple way to introduce parallelism into this function definition is to wrap the addition expression with a call of pargs; then the arguments to the addition will be computed in parallel whenever resources are available. As a result, we end up with a very similar and thus intuitive function definition. Note that we replaced + with binary-+, since pargs expects a function call, not a macro call.

(defun pfib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (pargs (binary-+ (pfib (- x 1))
                            (pfib (- x 2)))))))
Introducing the Granularity Problem
After you submit the above two versions of the Fibonacci function, test them with the following forms.
(time$ (fib 10))
(time$ (pfib 10))

Now increase the argument by increments of 5 to 10 until you find your curiosity satisfied or your patience wearing thin. You can interrupt evaluation if necessary and return to the ACL2 loop. You will immediately notice that introducing parallelism has not increased execution speed, at least not by much. There are two reasons for this.
First, consider the computation of (pfib 4). Assuming resources are available, (pfib 4) will create a thread for computing (pfib 3) and another thread for (pfib 2). It is easy to imagine that setting up each thread takes much longer than the entire computation of (fib 4).
Second, we must realize that if we have two threads available for computing
(fib 10), then the evaluation of
(fib 8) will probably complete
before the evaluation of
(fib 9). Once
(fib 8) finishes, parallelism
resources will become available for the next recursive call made on behalf of
(fib 9). If for example that call is
(fib 3), we will waste a lot of
cycles just handing work to the thread that does this relatively small
computation. We need a way to ensure that parallelism resources are only
used on problems of a "large" size. Ensuring that only "large" problems
are spawned is called the ``granularity problem.''
In summary: we want to tell ACL2 that it can evaluate the arguments of pargs in parallel only when the parameter of pfib is greater than some threshold. Our tests using OpenMCL have suggested that 27 is a reasonable threshold.
Explicit Programming for the Granularity Problem
One way to avoid the granularity problem is to duplicate code as follows.
(defun pfib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (if (> x 27) ; the granularity test
               (pargs (binary-+ (pfib (- x 1))
                                (pfib (- x 2))))
             (binary-+ (pfib (- x 1))
                       (pfib (- x 2)))))))

Duplicating code is fundamentally a bad design principle, because it can double the work of future maintenance. A ``granularity form'' is an expression

(declare (granularity <form>))

that allows threads to be spawned (without duplicating code) whenever the evaluation of <form> results in a non-nil value. It may be placed inside any call of a parallelism primitive, in a position documented separately for each primitive. Here is a definition of pfib using this feature for a call of the parallelism primitive pargs.

(defun pfib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (pargs (declare (granularity (> x 27)))
                  (binary-+ (pfib (- x 1))
                            (pfib (- x 2)))))))
Test each version as follows (or substitute your own natural number for 33).
(time$ (fib 33))
(time$ (pfib 33))
Another Granularity Issue Related to Thread Limitations
Our implementation of parallelism primitives has the property that once a thread is assigned a computation, that assignment stays in effect until the computation is complete. In particular, if a thread encounters a parallelism primitive that spawns child threads, the parent thread stays assigned, waiting until the child computations complete before it can continue its own computation. In the meantime, the parent thread reduces the number of additional threads that Lisp can provide by 1, rather than being reassigned to do other work.
How can this lack of reassignment affect the user? Consider, for example, the application of a recursive function to a long list. Imagine that this function is written so that the body contains a recursive call, for example (pargs (process (car x)) (recur (cdr x))). Each such pargs call that spawns child work must wait for its children, one of which is the work for (recur (cdr x)), to complete. There is an ACL2 limit on how much pending work can be in the system, limiting the number of pargs calls that can result in parallel evaluation. For example, if the ACL2 limit is k and each call of pargs actually spawns threads for evaluating its arguments, then after k cdrs there will be no further parallel evaluation.
Possible solutions may include reworking of algorithms (for example to be tree-based rather than list-based) or using appropriate granularity forms. We hope that future implementations will allow thread ``re-deployment'' in order to mitigate this problem.
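To make the algorithm-reworking idea concrete, here is a hypothetical divide-and-conquer sketch (not from the original text; the function names process and process-list-parallel are invented for illustration, and granularity tuning is omitted). By splitting the list in half instead of recurring down its cdr, the chain of pargs calls waiting on children grows with the logarithm of the list length rather than with the length itself.

```lisp
(defun process (e)
  ;; Hypothetical stand-in for some expensive per-element computation.
  (declare (xargs :guard t))
  e)

(defun process-list-parallel (x)
  ;; Sketch: process the two halves of the list in parallel.  We use
  ;; binary-append because pargs, as noted above, expects a function
  ;; call rather than a macro call.  Program mode avoids guard proofs
  ;; in this illustration.
  (declare (xargs :mode :program))
  (cond ((atom x) nil)
        ((atom (cdr x)) (list (process (car x))))
        (t (let ((n (floor (len x) 2)))
             (pargs (binary-append (process-list-parallel (take n x))
                                   (process-list-parallel (nthcdr n x))))))))
```

With this shape, a list of length 2^k gives rise to only about k levels of parents waiting on children, instead of 2^k as in the cdr-recursive version.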
Plet (By Way of the Fibonacci Example)
We can use a let binding to compute the recursive calls of fib, and then add the bound variables together, as follows.
(defun fib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (let ((a (fib (- x 1)))
                 (b (fib (- x 2))))
             (+ a b)))))
By using the parallelism primitive plet, we can introduce parallelism much as we did using pargs, with an optional granularity form, as follows.

(defun pfib (x)
  (declare (xargs :guard (natp x)))
  (cond ((or (zp x) (<= x 0)) 0)
        ((= x 1) 1)
        (t (plet (declare (granularity (> x 27)))
                 ((a (pfib (- x 1)))
                  (b (pfib (- x 2))))
             (+ a b)))))

Notice that this time we were able to use + rather than binary-+: while pargs expects a function call (not a macro call), plet allows macros at the top level of its body.
Pand (By Way of a Tree Validation Example)
Consider ``genetic trees'' whose leaves are DNA elements, in the sense that each tip is one of the symbols A, G, C, or T. First we define the function valid-tip, which recognizes whether a tip is one of these symbols.
(defun valid-tip (tip)
  (declare (xargs :guard t))
  (or (eq tip 'A)
      (eq tip 'G)
      (eq tip 'C)
      (eq tip 'T)))
Now we define a function that traverses the tree, checking that every tip is valid.
(defun valid-tree-serial (tree)
  (declare (xargs :guard t))
  (if (atom tree)
      (valid-tip tree)
    (and (valid-tree-serial (car tree))
         (valid-tree-serial (cdr tree)))))
We also define a parallel version.
(defun valid-tree-parallel (tree)
  (declare (xargs :guard t))
  (if (atom tree)
      (valid-tip tree)
    (pand (valid-tree-parallel (car tree))
          (valid-tree-parallel (cdr tree)))))
Before we can time the functions, we need to create test trees. We have found that test trees need to be approximately 1 gigabyte before we clearly see speedup, and we make them asymmetric to demonstrate the ability of our implementation to adapt to asymmetric data. We can create the trees with the execution of the following forms.
(defun make-valid-binary-tree (x)
  (declare (xargs :mode :program))
  (if (< x 0)
      (cons (cons 'C 'G) (cons 'A 'T))
    (cons (make-valid-binary-tree (- x 2)) ; 2 to make asymmetric
          (make-valid-binary-tree (- x 1)))))

(defconst *valid-tree* (make-valid-binary-tree 30)) ; takes awhile
(defconst *invalid-tree* (cons *valid-tree* nil)) ; nil is invalid tip
We can time our functions with the forms:
(time$ (valid-tree-serial *valid-tree*))
(time$ (valid-tree-parallel *valid-tree*))

Unfortunately, the serial version runs faster than the parallel version; however, there is more to this story.
Demonstrating Early Termination with an Invalid Tree
Now observe this magic:
(time$ (valid-tree-serial *invalid-tree*))
(time$ (valid-tree-parallel *invalid-tree*))

The serial version took awhile, while the parallel version finished almost immediately. The validation test was split into testing the car and the cdr of *invalid-tree*, and since the cdr was equal to nil, its test returned immediately. This immediate return quickly interrupted the computation associated with the car, and the result was returned.
We can also define a parallel version with a granularity form:
(defun valid-tree-parallel (tree depth)
  (declare (xargs :guard (natp depth)))
  (if (atom tree)
      (valid-tip tree)
    (pand (declare (granularity (< depth 5)))
          (valid-tree-parallel (car tree) (1+ depth))
          (valid-tree-parallel (cdr tree) (1+ depth)))))
We can test this version by executing the following forms. You will probably find some speedup on a machine with several cores available, but more speedup can likely be obtained with an expensive test on the leaves in place of valid-tip.

(time$ (valid-tree-serial *valid-tree*))
(time$ (valid-tree-parallel *valid-tree* 0))
Por can be used in the same way as pand, but with early termination occurring when an argument evaluates to a non-nil value, in which case the value returned is t. Por also supports the use of a granularity form.
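As an illustration (a sketch not taken from the original text; the name tree-has-tip is invented here), por can search a genetic tree for a tip equal to a given symbol, with the first successful branch interrupting its siblings:

```lisp
(defun tree-has-tip (tree sym)
  ;; Sketch: returns t if some tip of tree is eq to sym, else nil.
  ;; Early termination lets one successful branch interrupt the rest.
  (declare (xargs :guard (symbolp sym)))
  (if (atom tree)
      (eq tree sym)
    (por (tree-has-tip (car tree) sym)
         (tree-has-tip (cdr tree) sym))))
```

For example, (tree-has-tip *valid-tree* 'A) should return t quickly, while a search for a symbol not in the tree must traverse the whole tree before returning nil.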