6. A first example of stepwise program composition

The little examples dealt with so far are not representative for the programs that have to be made: they are several orders of magnitude too small. A trained programmer ''sees'' them at a glance, he can think about them without pencil and paper. They are of the size of a paragraph, while we have to deal with programs of the size of a page, a chapter or a book and eventually —to quote A. Perlis— with programs that no longer fit into a single programmer's skull! The composition of such very large programs falls outside the scope of this little monograph, but the very least thing we can do is to show the reader how to organize his thoughts when composing, say, page-size programs. If in the following the reader thinks that I am too careful, he should bear the chapter-size programs in mind. (If he is still unconvinced he should study a single page program made by a messy programmer; he will then discover that even a single page has room enough for a disgusting and intellectually unhealthy among of unmastered complexity!)

Here we go. Consider sequences composed of 1's, 2's and 3's which contain only pairs of different adjoining non-empty subsequences. Examples of such good sequences are

1
12
12312
3212321
.

Examples of bad sequences are

11
12323131
32132132132
.

In all probability the list of good sequences is infinite. The problem is now: given that there is at least one good sequence of length 100 (i.e. consisting of 100 digits), make a program generating the beginning of this list of good sequences in alphabetical order up to and including the first sequence of length 100. (Alphabetical order under the convention that the 1 precedes the 2 and the 2 precedes the 3; the criterion can be translated into a numerical one by putting the decimal point in front of the sequence and then interpreting the sequence as a decimal fraction. With that convention alphabetical order is the order of increasing magnitude.)

I have used this example extensively in oral examinations. After some initial skirmishes, most students discovered for themselves

  1. that a bad sequence could never be made into a good one by extending it, i.e. all good sequences are either a one-digit sequence or a one-digit extension of a good sequence
  2. if sequence B is a good one-digit extension of sequence A, sequence A precedes sequence B in the alphabetical order, i.e. a good sequence is followed by all its possible extensions
  3. the alphabetical order requires that the good sequence A will first be followed by its extensions starting with a 1 (if any), then by those starting with a 2 (if any), then by those starting with a 3 (if any).

These observations lead to the following rule:

a good sequence should be printed and extended with a 1 as the next trial sequence; from a bad sequence, terminal 3's (if any) should be removed and the final digit (now ≠3) should be increased by 1, giving the next trial sequence.

The beginning of the list to be generated is:

1
12
121
1213
12131
121312
1213121
1213123
.......

by searching the following list of trial sequences (omitting the ones marked by *)


*


*
*


*


*
*
*
*
1
11
12
121
1211
1212
1213
12131
121311
121312
1213121
12131211
12131212
12131213
1213122
1213123
.......

Many of them suggested a program of the following structure.

program 1:

SET SEQUENCE TO ONE AND LENGTH TO ONE;
repeat if GOOD then begin PRINT; EXTEND WITH ONE end
               else ADJUST
until length > 100

in which the primitive ''EXTEND WITH ONE'' extends the given sequence with a 1 and the primitive ''ADJUST'' increases the last digit by 1 after removal of terminal 3's, if any. (For the operation ''ADJUST'' to be defined, the sequence remaining after removal of terminal 3's must not be empty; this follows from the fact that the list to be produced is known to contain a sequence of length = 100.)

A number of objections can be raised against a program made along the lines sketched. One objection is that at the beginning of the execution of the repeatable statement the length will be ≤ 100, and furthermore we know that the operation ''ADJUST'' will never increase the length; nevertheless each adjustment is followed in time by a test on the length, and the first objection is therefore that these tests are superfluous. A more serious objection is to be found in the tortuous reasoning required to establish that the end condition is all right. Instead of stopping when for the first time a solution of length 100 has been printed, it will stop when the first trial sequence of length > 100 has been generated. It is clear that the above program will never produce a solution larger than 100 because such a long trial sequence will never be subjected to the test ''GOOD''. To show, however, that it will stop after the production of the first solution of length = 100 is much harder.

A much nicer program is based upon the observation that we can regard the empty sequence as a virtual solution which does not need to be printed.

program 2:

SET SEQUENCE EMPTY AND LENGTH TO ZERO;
repeat EXTEND WITH ONE;
       while non GOOD do ADJUST;
until length = 100

The objections raised are no longer valid. The true reason, however, why the above program is so much more attractive, is to be found in the observation that we can mentally combine the first two statements of the repeatable statement. The above version is a refinement of the more abstract

program 3:

SET SEQUENCE EMPTY AND LENGTH TO ZERO;
repeat TRANSFORM SEQUENCE INTO NEXT SOLUTION;
       PRINT
until length = 100.

(Note. In programs 1, 2 and 3 the outer repetition could also have been controlled by a while clause.)

Observe that here, in program 3, we have a level of description from which the trial sequences have disappeared! It is a level of description which can be understood in terms of solutions only. By distinguishing, i.e. by mentally isolating the operator ''TRANSFORM SEQUENCE INTO NEXT SOLUTION'' and postponing its refinement, we have separated the task of formulating the correct criterion for termination from how the transition from one solution to the next will be performed via a number of trial sequences which may be rejected. Remembering the limited size of the programmer's skull, this separation is a vital achievement, as it enables us to deal with one thing at a time.

To show that all this is not just idle playing with words we shall proceed from program 3 as our starting point, refining from there onwards. By way of surprise we shall arrive at a refinement different from program 2, again without essentially changing the algorithm. (Although I had used the example extensively in examinations, the next version only occurred to me when writing this very text! This is just to show that such abstract programs are vital stepping stones in our process of constructive reasoning!)

To find this refinement we take a step backwards, asking ourselves what enabled us to make the transition from program 1 to program 2. It was the introduction of the empty sequence as ''virtual solution''. In program 1, the first solution was given, while the others were generated; in program 2 and 3, all solutions to be printed are generated by the same operator ''TRANSFORM SEQUENCE INTO NEXT SOLUTION''.

When refining this operator the trial sequences have to be generated, and in program 2 we find that the criterion ''GOOD'' has to be applied to trial sequences generated in two different ways, either by ''EXTEND WITH ONE'' or by ''ADJUST''. Can we clean up our refinement by insisting that all trial sequences to be tested are generated by the same operator? Yes we can, by slightly changing the extension operator and slightly generalizing the operate ''ADJUST'', as in the following refinement.

TRANSFORM SEQUENCE INTO NEXT SOLUTION:
      EXTEND WITH ZERO;
      repeat ADJUST until GOOD.

Here ''GOOD'' stands for a rather complicated function; an alternative form uses the boolean variable ''good'' and leaves us with the task of refining the operator ''SET GOOD''.

TRANSFORM SEQUENCE INTO NEXT SOLUTION:
      boolean good;
      EXTEND WITH ZERO;
      repeat ADJUST; SET GOOD until good.

(Note: In the above refinement the repetition is not to be controlled by a while-clause. Why?)

Now the time has come to make a decision on the representation of ''sequence''. It has a property ''length'', now satisfying the inequalities 0 ≤ length ≤ 100 and is an ordered sequence of length digits. An appropriate vehicle for representing this sequence is (part of) a linear array of integer variables. We suggest declaring an integer array d[1:100], such that at any moment the sequence will be represented by

d[1], d[2], ... , d[length].

We would like to point out that this is a well-motivated decision. An alternative representation would have been

d[101 - length], d[102 - length], ... , d[100]

but with the latter convention all operations changing the length of the sequence would imply moving the values upwards or downwards, whereas with the suggested representation, the values being kept can ''stay where they are''. When the chosen convention is made to apply to all sequences manipulated (i.e. to solutions as well as to trial sequences) the following four refinements are fairly obvious. (As far as they are concerned, the chosen representation is certainly adequate.)

SET SEQUENCE TO EMPTY AND LENGTH TO ZERO:
      length:= 0;

EXTEND WITH ZERO:
      length:= length + 1; d[length]:= 0;

ADJUST:
      while d[length] = 3 do length:= length - 1;
      d[length]:= d[length] + 1;

PRINT:
      i:= 0; repeat i:= i + 1; printdigit(d[i]) until i = length; newline

where we have assumed the availability of the primitives ''printdigit'' and ''newline'' for the transition to the beginning of the next line for output. The only refinement which can still cause headaches is the operator ''SET GOOD''.

To investigate an arbitrary sequence is indeed a gruesome task, but it becomes much easier if we exploit the circumstances that the only sequences to be subjected to the test are trial sequences, and each trial sequence is a one-digit extension of an (earlier) good sequence. As a result it can only violate the condition if its terminal element is included in one of the subsequences, i.e. it has to be rejected as bad if there exists an m (satisfying 0 < 2 * mlength) such that the pair of adjoining ''msequences''

d[length - 2 * m + 1], ... , d[length - m] and
d[length - m + 1], ... , d[length]

are equal. Assuming the availability of the operator needed to compare the msequences of this pair (for arbitrary, given m), out first refinement of ''SET GOOD'' is

SET GOOD:
      integer m;
      good:= true; m:= 1;
      while 2 * m
length and good do
      begin GIVE GOOD THE MEANING THAT SEQUENCES DIFFER;
            m:= m + 1;
      end

or (probably better)

SET GOOD:
      integer m, mbound;
      good:= true; mbound:= length div 2; m:= 1;
      while m
mbound and good do
      begin GIVE GOOD THE MEANING THAT SEQUENCES DIFFER;
            m:= m + 1;
      end

Here the operator div is the integer divide, rounding off the quotient to the nearest integer towards zero. The double condition for continuing the repetition expresses that the investigation can be stopped as soon as an equal pair has been found, as this is sufficient to establish its being bad. We have seen this construction at the end of the previous section.

Question. An alternative form would have been

integer m;
good:= true; m:= length div 2;
while m > 0 and good do
begin GIVE GOOD THE MEANING THAT SEQUENCES DIFFER;
     m:= m - 1;
end

Why do we propose to do the investigation in the order of increasing m?

Finally we refine the comparison of two msequences.

GIVE GOOD THE MEANING THAT THE SEQUENCES DIFFER:
      integer firstend, k;
      firstend:= length - m; k:= 0;
      repeat good:= (d[length - k]
d[firstend - k]);
             k:= k + 1;
      until k = m or good

again expressing that the comparison of the two msequences can be terminated as soon as it has been established that they differ somewhere.

Collecting the declarations and inserting the refinements —keeping their names as labels for explicative purposes— we arrive at the complete program as shown on the next page [i.e., immediately below]. The successive levels of detail have been indicated by systematic use of indentation.

*             *
*

begin   integer array d[1:100]; boolean good; integer length, i, m, mbound, k, firstend;
      SET SEQUENCE EMPTY AND LENGTH TO ZERO:
            length:= 0;
      repeat TRANSFORM SEQUENCE INTO NEXT SOLUTION:
                   EXTEND WITH ZERO:
                         length:= length + 1; d[length]:= 0;
                   repeat ADJUST:
                                while d[length] = 3 do length:= length - 1;
                                      d[length]:= d[length] + 1;
                          SET GOOD:
                                good:= true; m:= 1; mbound:= length div 2;
                                while m
mbound and good do
                                begin GIVE GOOD THE MEANING THAT THE MSEQUENCES DIFFER:
                                            firstend:= length - m; k:= 0;
                                            repeat good:= (d[length - k]
d[firstend - k]);
                                                   k:= k + 1;
                                            until k = m or good;
                                      m:= m + 1
                                end
                   until good;
             PRINT:
                    i:= 0; repeat i:= i + 1; printdigit(d[i]) until i = length; newline
      until length = 100
end *             *
*

Exercise.

Given a linear array of 36 positions, make a program generating all ways (if any) in which these positions can be filled with zeros and ones (one digit per position), such that the 32 quintuples of five adjoining positions represent the 32 different patterns of five binary digits, restricting ourselves to sequences starting with five zeros. C. Ligtmans has shown that any such solution must end with four zeros. His argument is as follows. Each solution must start with 000001..., because the pattern 00000 may occur only once. Somewhere in the sequence the pattern 10000 must occur once; as this pattern can only be followed by a 0 or a 1, its ''following'' quintuple must be either 00000 or 00001, presented already by the first two quintuples. As a result the pattern 10000 cannot occur in the interior of the linear sequence and therefore it must be the rightmost pattern. From Ligtmans' observation it follows that we can close the ring, making the last four zeros overlap with the four initial zeros. The different patterns are then arranged in a cycle of 32 positions.

The patterns have to be generated in alphabetical order.

Discussion and some hints:

I take for granted that, given a sequence of 36 binary digits, the boolean function stating whether this sequence represents a solution is computable, and that we could write an algorithm computing it. In principle we could write a program generating all 36-digit sequences with five leading zeros in alphabetical order and subjecting all these sequences to the test just mentioned, thereby selecting those satisfying the test. This gives a very unrealistic program and we shall not pursue it; we only remark that generating the trial sequences in alphabetical order will ensure that the solutions, when found, will be found in alphabetical order as well.

The program to be made could be regarded as a derivation from the ridiculous one sketched above, viz. by the introduction of some significant short cuts. At present we shall not stress this relation any further.

Instead of generating all 36-digit sequences and selecting from this set, we aim at generating only a much smaller set which is guaranteed to contain all solutions. Let us define as ''length of sequence'' the number of quintuples it contains (i.e. length = number of digits - 4). Let us call a sequence ''acceptable'' if no two different quintuples in it present the same digit pattern. With these definitions the solutions are a subset of the set of acceptable sequences, viz. those with length = 32.

We do not know whether there are any solutions at all but we do know that the set of acceptable sequences is non-empty (e.g. ''00000''); we do not have a ready-made criterion to recognize ''the last solution'' when we encounter it; in our set of acceptable sequences, however, we can designate a virtual last one (viz. ''00001''*); when that one is encountered we know that all acceptable sequences with five leading zeroes have been scanned and that no further solutions will be found.

Summarizing, we know if the set of acceptable sequences:

  1. it is non-empty and finite
  2. we know a first member (''00000'')
  3. we know a virtual last member (''00001''*)
  4. we can transform an acceptable sequence into the next acceptable sequence
  5. solutions are all acceptable sequences satisfying the further condition length = 32
  6. no extension of a sequence that is not acceptable will be acceptable.
     The last property makes this problem mathematically speaking very similar to the previous one.

Hint. The test for acceptability can be speeded up considerably by tabulating which quintuplets are to be found in the sequence.

Remark. This problem is difficult and it will take you many hours to produce a beautiful program. But these hours will be extremely well-spent.

[Note added to this transcription at the suggestion of Max Kohler: for the "virtual last" sequence read "10000".]

Back to top

Next chapter: 7. The shortest spanning subtree of a graph