; (ld "fast.lisp" :ld-pre-eval-print t) ; (certify-book "fast") ; A Proof of the Correctness of the Boyer-Moore Fast String Searching Algorithm ; by J Strother Moore and Matt Martinez ; May 2008 ; --------------------------------------------------------------------------- ; Summmary ; In this book we define the Boyer-Moore fast string searching algorithm and ; prove it correct. The original Boyer-Moore algorithm is described in ; "A Fast String Searching Algorithm," R.S. Boyer and J S. Moore, Communications ; of the Association for Computing Machinery, 20(10), 1977, pp. 762-772. ; In that version, two arrays were used, called delta1 and delta2. The skip ; was the maximum of those two arrays. Here we deal with an improvement of the ; original algorithm in which only one array, called delta, is used and it ; insures a skip that is never shorter than the maximum of delta1 and delta2 ; and is often greater than either. A key fact about the original algorithm is ; that preprocessing can be done in linear time in the length, k, of the ; pattern plus linear time in the alphabet size, a. We do not worry about ; preprocessing time, but the naive way of setting up our delta has complexity ; a*k^2. ; Roughly speaking the algorithm has three main parts: preprocessing, the loop ; which skips ahead by the contents of an array, and a wrapper which enters the ; loop with the right arguments, including the array set up by the ; preprocessing. The loop is not, in general, total -- that is, it does not ; terminate for arbitrary values of the array. Therefore, we cannot define the ; loop with the general array as an argument. Instead, we define the loop to ; skip ahead by the value computed by a function, delta, of the character just ; read, an index into the pattern, and the pattern. We prove this loop is ; total and admit it; then we define the wrapper that calls it appropriately. ; The loop is called ``fast-loop'' and the wrapper is called is called ``fast''. ; At no point in the execution of the algorithm is an array used. But we define ; the preprocessing anyway and we prove that indexing into the 2-dimensional array ; it computes returns the appropriate value of delta. ; To prove correctness we define the naive string searching algorithm, which we ; name ``correct'', and prove that ``fast'' is equivalent to ``correct''. ; These pieces -- the preprocessing to produce an array, a total recursive ; function that implements the algorithm in terms of the individually computed ; values of delta, and the proof of the correctness of the algorithm -- can be ; used to prove that code is correct. We do that in another book. ; --------------------------------------------------------------------------- ; Preamble ; This book relies on a collection of utility lemmas about firstn, nthcdr, and other ; basic list processing functions. (in-package "ACL2") (include-book "utilities") ; --------------------------------------------------------------------------- ; The Obviously Correct String Searching Algorithm ; We start by defining the obviously correct string searching algorithm. The ; fundamental notion is that of an ``exact match,'' the equality of ; corresponding characters from the two strings at a given alignment. (defun xmatch (pat j txt i) (declare (xargs :measure (nfix (- (length pat) j)))) (cond ((not (natp j)) nil) ((>= j (length pat)) t) ((>= i (length txt)) nil) ((equal (char pat j) (char txt i)) (xmatch pat (+ 1 j) txt (+ 1 i))) (t nil))) ; The correct algorithm just looks for the first xmatch of pat in txt, by ; trying them all starting at 0. (defun correct-loop (pat txt i) (declare (xargs :measure (nfix (- (length txt) i)))) (cond ((not (natp i)) nil) ((>= i (length txt)) nil) ((xmatch pat 0 txt i) i) (t (correct-loop pat txt (+ 1 i))))) (defun correct (pat txt) (correct-loop pat txt 0)) ; One might think the empty pat can always be found at location 0 of txt. ; However, this algorithm always return either a legal index into txt (where ; the first match with pat occurs) or else it returns nil. Since 0 is not a ; legal index into the empty txt, the algorithm returns nil when both pat and ; txt are empty. ; --------------------------------------------------------------------------- ; The Boyer-Moore Algorithm -- Preprocessing ; We assume the reader is familiar with the Boyer-Moore algorithm. In this ; section we show the preprocessing. ; We follow the convention that if a variable's name ends in the character "~" ; the variable ranges over lists. We explain later. The reader might want to ; pronounce "pat~" as "pat prime". We cannot use pat' as a variable in ACL2. ; First we define the amount, delta, by which you can skip. It is defined in ; terms of the first "pseudo match" you find for dt~ in pat~, where dt~ is the ; list of characters discovered in the txt before the algorithm encountered the ; unequal characters that caused it to skip ahead. "Pmatch" stands for "pseudo ; match" and allows dt~ to "fall off" the left end of pat~ and still match, in ; the sense that (A B C D E) pseudo matches (C D E F G). ; Note that pseudo-xmatch is defined on lists, not strings. We discuss this ; issue in more detail below. But every string has a unique list counterpart ; and some functions are more naturally defined with lists. Mathematically, ; the notion that one string, pat of length n, matches at location i in another ; string txt is really just an equality: the substring of length n starting at ; location i in txt IS pat. We didn't implement xmatch that way because it ; would require constructing the various substrings as objects, which is ; inefficient. It is more efficient to just compare corresponding characters. ; But the code we're about to write is used only in the preprocessing. This is ; only meant to specify what the array contains, not necessarily how it is ; computed. So here it is more natural to use the "mathematical" notion of ; match. But if we're doing that -- if we're in the business of constructing ; substrings of strings -- it is easier to deal with lists. (defun pmatch (dt~ pat~ j) (declare (xargs :guard (and (true-listp dt~) (true-listp pat~) (integerp j)))) (if (< j 0) (equal (firstn (len (nthcdr (- j) dt~)) pat~) (nthcdr (- j) dt~)) (equal (firstn (len dt~) (nthcdr j pat~)) dt~))) ; ``X marks the spot.'' The function x computes the right most position (to ; the left of j) at which dt~ occurs in pat~. For example, ; 0 1 2 3 4 5 6 7 8 ; (x '(a b c) '(a b c a b c x b c) 6) = 3 ; If we have just discovered dt in txt upon reading txt[i], we want to slide ; pat so that the character at x in pat is aligned with the one at i in txt. (defun x (dt~ pat~ j) (declare (xargs :measure (+ 1 (nfix (+ (len dt~) j))) :guard (and (true-listp dt~) (true-listp pat~) (integerp j)))) (cond ((or (not (integerp j)) (not (true-listp dt~))) 0) ((pmatch dt~ pat~ j) j) (t (x dt~ pat~ (- j 1))))) ; This computes how much to increment i to align the discovered text with the ; pat at x and then shift i further to align with the end of pat in that new ; alignment. (defun delta (v j pat) (declare (xargs :guard (and (characterp v) (natp j) (stringp pat)))) (let* ((pat~ (coerce pat 'list)) (dt~ (cons v (nthcdr (+ j 1) pat~)))) (+ (- (len pat~) 1) (- (x dt~ pat~ (- j 1)))))) ; Now we set up an "array", really a list of lists. There will be 256 rows, ; one for each of the ASCII codes. Each row will have n entries, where n is ; the length of pat. At row c and column j we will store (delta c' j pat), ; where c' is the character corresponding to ASCII code c. ; Note: The "j" in the code below runs between 1 and (length pat) and we store in ; column j-1. (defun setup-delta1-row (c j pat row) (if (zp j) row (setup-delta1-row c (- j 1) pat (cons (delta (code-char c) (- j 1) pat) row)))) (defun setup-delta (c pat ans) (if (zp c) ans (setup-delta (- c 1) pat (cons (setup-delta1-row (- c 1) (length pat) pat nil) ans)))) (defun preprocess (pat) (setup-delta 256 pat nil)) (defun index2 (array c j) (nth j (nth c array))) ; For example, consider the pattern ; j = 01234567 ; pat = EABCDABC ; Note that the substring ABC occurs in it twice, once starting at j=1 and ; again starting at j=5. In its first occurrence, it is preceded by E and in ; its second it is preceded by D. ; Here is the 2-dimensional array computed by preprocess. Most of the rows are ; identical because most of the ASCII characters do not occur in the pattern. ; j = 0 1 2 3 4 5 6 7 ; ASCII code (char) ; ; for row ; ((15 14 13 12 11 10 9 8) ; 0 ; (15 14 13 12 11 10 9 8) ; 1 ; ... ; ... ; (15 14 13 12 11 10 9 8) ; 63 (?) ; (15 14 13 12 11 10 9 8) ; 64 (@) ; (15 14 13 12 11 6 9 2) ; 65 (A) ; (15 14 13 12 11 10 5 1) ; 66 (B) ; (15 14 13 12 11 10 9 4) ; 67 (C) ; (15 14 13 12 11 10 9 3) ; 68 (D) ; (15 14 13 12 7 10 9 7) ; 69 (E) ; (15 14 13 12 11 10 9 8) ; 70 (F) ; ... ; ... ; (15 14 13 12 11 10 9 8) ; 255 ; ) ; Below we show several examples of the use of the array. Suppose pat and txt ; are aligned as shown and we are reading the character at txt position i. ; Suppose we read an F (ASCII code 70). This is suggested in the display ; below. ; j = 01234567 ; pat = EABCDABC ; txt = xxxxxxxxxxxxFxxxxxxxxxxxxx ; i = | ; Then we index the array with (index2 array 70 7), since the corresponding ; character in pat is at j=7. The array's contents is 8, which means we can ; skip ahead by 8 (i.e., increment i by 8) to produce this alignment: ; pat = EABCDABC ; txt = xxxxxxxxxxxxFxxxxxxxxxxxx ; i = | ; Suppose instead we are in this alignment: ; j = 01234567 ; pat = EABCDABC ; txt = xxxxxxxxxEABCxxxxxxxxxxxxx ; i = | ; That is, we have matched the ABC at the end of pat with the corresponding ; characters of txt and now read txt at i and get an E (ASCII 69). Then ; (index array 69 4) = 7, so we increment i by 7 and get: ; pat = EABCDABC ; txt = xxxxxxxxxEABCxxxxxxxxxxxxx ; i = | ; This shifts the pattern forward to align the discovered text, "EABC", with ; its right-most occurrence in pat. ; --------------------------------------------------------------------------- ; A Total Recursive Function that Performs the Boyer-Moore Search ; Fast will be defined in terms of fast-loop, which is like boyer-moore-loop ; but (a) checks for all the preconditions we assume and (b) uses delta ; directly instead of an array. To prove that fast-loop terminates, we need ; some theorems: (defthm integerp-delta (implies (integerp j) (integerp (delta v j pat))) :rule-classes :type-prescription) (defthm x-mono (implies (and (true-listp dt~) (integerp j)) (<= (x dt~ pat~ j) j)) :rule-classes :linear) ; This is the key one. It shows that delta is big enough to make sure the ; pattern monotonically shifts to the right. (defthm delta-very-positive (implies (and (stringp pat) (natp j) (< j (length pat)) (not (equal v (char pat j)))) (<= (- (length pat) j) (delta v j pat))) :rule-classes :linear) ; (The truth is we do not need integerp-delta and delta-very-positive until we ; get to the penultimate lemma about fast-loop and correct-loop at the string ; level. If we did, we'd have to disable delta here since it is non-recursive. ; For now, we just leave delta enabled and prove these two results as needed ; from the defun of delta and x-mono. But when we disable delta to stay at the ; string level, we'll need those two.) ; To prove that a function terminates we show a measure that decreases as the ; function recurs. Our measure is a lexicographic combination of two natural ; numbers. The first measures how far the right-hand of the pattern is from ; the right-hand end of the text. The second is how far down the pattern we've ; matched. When we shift the pattern, the first number decreases. When we ; leave the pattern in place and decrement j as we match right-to-left, the ; second number decreases. (defun measure (pat j txt i) (declare (xargs :guard (and (stringp pat) (integerp j) (<= -1 j) (stringp txt) (integerp i) (<= j i)))) (llist (- (+ (length txt) (length pat)) ; first component (+ i (- (length pat) j) -1)) (+ j 1))) ; second component ; --------------------------------------------------------------------------- ; Aside: This is a hack to print some trace output when the algorithm runs. ; Note that it prints only for short pat and txt. That is because we want the ; printing to fit neatly on a line. When the strings get "too long" it runs ; anyway, but it doesn't print trace output. (defun show-config (pat j txt i) (declare (xargs :guard (and (stringp pat) (integerp j) (<= -1 j) (stringp txt) (integerp i) (<= j i)))) (cond ((> (+ (length pat) (length txt)) 55) nil) (t (cw "~%~ pat:~t4~s0~t5j=~x1~%~ txt: ~s2~t5i=~x3~%~ ~t6|~t5m=~x7~%" pat ; 0 j ; 1 txt ; 2 i ; 3 (+ 5 (- i j)) ; 4 tab position for pat 61 ; 5 tab position for j= and i= (+ 5 i) ; 6 tab position for | marker (measure pat j txt i) ; 7 measure that is supposed to decrease )))) ; --------------------------------------------------------------------------- ; Here is the runnable, terminating, version of the Boyer-Moore loop. We'll ; prove that in a moment. First we admit it and prove that it terminates. ; Then we'll prove it is equal to boyer-moore-loop under certain conditions. (defun fast-loop (pat j txt i) (declare (xargs :measure (measure pat j txt i) :well-founded-relation l< :guard (and (stringp pat) (integerp j) (<= -1 j) (< j (length pat)) (stringp txt) (integerp i) (<= j i)))) (prog2$ (show-config pat j txt i) (cond ((not (and (stringp pat) (integerp j) (stringp txt) (integerp i) (<= -1 j) (< j (length pat)) (<= j i))) nil) ((< j 0) (+ 1 i)) ((<= (length txt) i) nil) ((equal (char pat j) (char txt i)) (fast-loop pat (- j 1) txt (- i 1))) (t (fast-loop pat (- (length pat) 1) txt (+ i (delta (char txt i) j pat))))))) ; Here is the Boyer-Moore algorithm, which doesn't actually precompute all ; possible deltas but just computes the ones it needs, every time it needs one. (defun fast (pat txt) (declare (xargs :guard (and (stringp pat) (stringp txt)))) (if (equal pat "") (if (equal txt "") nil 0) (fast-loop pat (- (length pat) 1) txt (- (length pat) 1)))) ; This function terminates. All the subroutines in it were proved to ; terminate. ; --------------------------------------------------------------------------- ; Proving That the Preprocessing is Correct ; Now we prove that preprocessing is correct and causes the index2 expression ; used in boyer-moore-loop to produce delta when array is (preprocess pat). (defthm car-nthcdr-setup-delta1-row (implies (natp maxpat) (equal (car (nthcdr j (setup-delta1-row c maxpat pat ans))) (if (natp j) (if (< j maxpat) (delta (code-char c) j pat) (nth (- j maxpat) ans)) (if (zp maxpat) (car ans) (delta (code-char c) 0 pat)))))) (defthm car-nthcdr-setup-delta (implies (natp maxchar) (equal (car (nthcdr c (setup-delta maxchar pat ans))) (if (natp c) (if (< c maxchar) (setup-delta1-row c (length pat) pat nil) (nth (- c maxchar) ans)) (if (zp maxchar) (car ans) (setup-delta1-row 0 (length pat) pat nil)))))) (defthm preprocess-correct (implies (and (stringp pat) (characterp v) (natp j) (< j (length pat))) (equal (index2 (preprocess pat) (char-code v) j) (delta v j pat)))) (in-theory (disable preprocess index2)) ; --------------------------------------------------------------------------- ; Our Basic Proof Strategy ; For any theorem about strings there is a corresponding theorem about ; true-lists of characters. In fact, because we almost never use the fact that ; there are only a finite number of characters, for any theorem about strings ; there is almost always a corresponding theorem about true-lists. ; Of particular interest is the fact that if you have a string pat then you can ; convert it to a unique true-list of characters with (coerce pat 'list). Many ; string processing functions are defined in terms of their list processing ; counterparts. For example, if pat is a string, then (length pat) is (len ; (coerce pat 'list)) and the jth character of pat, i.e., (char pat j), is ; defined to be (nth j (coerce pat 'list)). So, a theorem about a stringp pat ; that talks about length and char might be proved by proving the corresponding ; theorem about a true-listp pat~ that talks about len and nth. Instantiating ; the latter, replacing pat~ with (coerce pat 'list), proves the former. ; In this file, the variables pat and txt range over strings and the variables ; pat~ and txt~ range over lists. ; We call len the "list level counterpart" of length. Similarly, pat~ is the ; "list level counterpart" of pat. ; There are two "great ideas" in our proof strategy. ; The first is that xmatch has a list level counterpart that can be expressed ; entirely with equality and the familiar functions firstn and nthcdr. Firstn ; returns the first n elements of a list and nthcdr returns all but the first ; n elements of a list. ; Let pat and txt be strings and let pat~ and txt~ be their list level ; counterparts. Let n be the length of pat (i.e., the len of pat~). Then the ; list level counterpart of (xmatch pat j txt i) is ; ; (equal (firstn n ; (nthcdr i txt')) ; (nthcdr j pat')) ; That is the xmatch holds iff the first n elements of the sublist of txt' ; starting at i is equal to the sublist of pat' starting at j. This "trades" ; xmatch for an equality between two list expressions in terms of firstn and ; nthcdr. ; Here is the formal statement of this first great strategy. (defthm xmatch-trade (implies (and (stringp pat) (stringp txt) (natp j) (natp i)) (equal (xmatch pat j txt i) (equal (firstn (len (nthcdr j (coerce pat 'list))) (nthcdr i (coerce txt 'list))) (nthcdr j (coerce pat 'list)))))) ; As long as this lemma is active, xmatch expressions will be traded for firstn ; and nthcdr equalities. We will disable this lemma when we no longer want ; that to happen. ; The second great idea is that we can prove things about firstn and nthcdr ; by what ACL2 calls "destructor elimination." See the lemma ; firstn-nthcdr-elim in utilities, which states that ; (implies (and (natp n) ; (< n (len x))) ; (equal (append (firstn n x) (nthcdr n x)) x)). ; The strategy is as follows. ; Suppose you have a list x and some index n into it. Suppose you ; you want to prove something about (firstn n x) and/or (nthcdr n x), ; such as: ; (implies (and (true-listp x) ; (natp n) ; (< n (len x))) ; (p n (firstn n x) (nthcdr n x) x)) ; Then you can prove instead: ; (implies (and (true-listp a) ; (< 0 (len b)) ; (true-listp b)) ; (p (len a) a b (append a b))) ; because if you prove the latter, you can instantiate it, replacing a by ; (firstn n x) and b by (nthcdr n x) and use facts in the utilities book to ; recover the former. The key fact is, of course, firstn-nthcdr-elim. ; This strategy "trades" firstn and nthcdr for append! The utilities book not ; only implements this strategy but contains a lot of theorems to simplify ; firstn, nthcdr, and append expressions. In the interests of canonicalizing ; everything, utilities also rewrites (nth i x) to (car (nthcdr i x)). ; --------------------------------------------------------------------------- ; The Proof Plan ; We want to prove that fast is correct, i.e., ; (defthm fast-is-correct ; (implies (and (stringp pat) ; (stringp txt)) ; (equal (fast pat txt) ; (correct pat txt)))) ; from which it will follow that boyer-moore is correct. Below we sketch the ; proof. ; The plan has some special markers in it (*** 1 ***), (*** 2 ***), etc. ; These markers help you locate the corresponding steps in the ACL2 proof in ; the next section. ; Proof Plan ; To prove fast-is-correct (*** 1 ***), we need to prove that the two loops are ; equivalent, i.e., that ; (equal (fast-loop pat j txt i) ; (correct-loop pat txt (- i j))) ; under suitable hypotheses. The main hypothesis is that pat starting at j+1 ; matches txt starting at i+1. The actual theorem we prove is called ; fast-loop-is-correct-loop below (*** 2 ***). ; The difference expression above reflects the fact that the j and i in the ; fast-loop term point to the right end of the current alignment while the ; index in correct-loop points to the left end. ; To prove this we'll do an induction to unwind fast-loop. There will be two ; inductive cases, one where we're backing up because the corresponding ; characters were identical and one where we're skipping forward by delta. We ; discuss the skipping forward case. ; The inductive step in that case will look something like this: ; (implies (and ... ; (not (equal (char pat j) (char txt i))) ; [hyp 1] ; ... ; (equal (fast-loop pat j' txt i') ; [hyp 2] ; (correct-loop pat txt (- i' j'))) ; ...) ; (equal (fast-loop pat j txt i) ; [concl] ; (correct-loop pat txt (- i j)))) ; where [hyp 1] specifies the case we're in (we've found mismatched chars) and ; [hyp 2] is the inductive hypothesis, where we've assumed the formula for j ; replaced by j' and i replaced by i'. ; The values of j' and i' are just those that are used in the recursive ; call of fast-loop in this case: ; j' = (- (length pat) 1) ; i' = (+ i (delta (char txt i) j pat)) ; The conclusion, [concl], equates fast-loop to correct-loop. But the ; fast-loop call will expand under the [hyp 1] case analysis and become ; (fast-loop pat j' txt i'). Note that this is the term [hyp2] mentions. ; Then we will use the induction hypothesis, [hyp2], and produce a new ; conclusion of the form: ; (equal (correct-loop pat txt (- i' j')) ; (correct-loop pat txt (- i j))) ; So the proof about fast-loop is EASY if we can prove this lemma about ; correct-loop! The actual lemma is called crux below (*** 3 ***). If we ; substitute in the values of j' and i' and do a little arithmetic we see the ; main idea: ; [crux]: ; (equal (correct-loop pat txt ; (+ 1 i ; (- (length pat)) ; (delta (char txt i) j pat))) ; (correct-loop pat txt ; (- i j))) ; Of course, crux has all sorts of hypotheses, the main two being that (a) pat ; starting at j+1 matches txt starting at i+1 and (b) the characters at pat[j] ; and txt[i] are different. ; Think of the 3rd argument of correct-loop being the index of a proposed ; alignment of pat and txt. Correct-loop just tests whether there is a match ; there and if not, moves the proposed alignment down by 1. Crux tells us that ; we can shift down by a larger amount than 1 without missing a match. ; That's the crux of the Boyer-Moore algorithm. ; The main lesson is that we can do the proof of the equivalence of fast-loop ; and correct-loop with one standard induction on fast-loop, if we can prove ; that correct-loop allows the same big skips that fast-loop does. This will ; take some clever reasoning about matches, mismatched characters, and delta. ; We'd like to do that reasoning at the list level, not the string level. ; So to prove crux (which is about strings) we will jump to its list level ; counterpart, which is called crux~ (*** 4 ***). But to even state the lemma ; at the list level we need a recursive function on lists that is the ; list-level correspondent of the string-level correct-loop. ; We call that function correct-loop~ (*** 5 ***) and we prove that it is the ; list-level counterpart of correct-loop in (*** 6 ***). If you look at [crux] ; and think about how to jump into the list world, we'll replace correct-loop ; by correct-loop~, pat by pat~ txt by txt~ and length by len. But we also ; have to deal with the delta and char expressions, which operate on strings. ; Fortunately, char just becomes nth after we coerce txt to a list and delta ; can be expanded to be expressed in terms of x, which deals with lists ; already. ; So with those transformations we get: ; [crux~]: ; (equal ; (correct-loop~ pat~ txt~ ; (+ i ; (- (x ; (cons (car (nthcdr i txt~)) (nthcdr j (cdr pat~))) ; pat~ ; (+ -1 j))))) ; (correct-loop~ pat~ txt~ ; (+ i (- j)))) ; This may look ugly but it's exactly the list-level counterpart of the crux of ; the algorithm. ; So how do you prove that correct-loop~ allows jumps to alignment x? ; The answer is really simple: ; First, correct-loop~ allows jumps by ANY amount -- provided no matches are ; skipped! To state this we need to invent a list-level predicate that says ; there are no matches between here and there. That predicate is called clear ; (*** 7 ***) and the basic idea is formalized as clear-implies-skip (*** 8 ; ***). ; Second, we have to prove that there are no matches in the region jumped over ; when we shift the pattern to the x alignment. That is proved in clear-x ; (*** 9 ***). ; We cannot find a good way ``code'' these lemmas, clear-implies-skip and ; clear-x, that will allow ACL2 to use them automatically as rewrite rules and ; prove crux~. So we prove crux~ by an explicit hint that instantiates the two ; lemmas. ; "Q.E.D." -- But this is just the plan! ; --------------------------------------------------------------------------- ; The ACL2 Proof ; Below we carry out the proof plan. The numbers appear in different order ; because ACL2 proofs are presented bottom up and our proof was more or less ; top down. We don't provide many comments because the numbering and plan ; pretty much say all there is to say. (defun correct-loop~ (pat~ txt~ i) ; (*** 5 ***) (declare (xargs :measure (nfix (- (len txt~) i)))) (cond ((not (natp i)) nil) ((>= i (len txt~)) nil) ((equal (firstn (len pat~) (nthcdr i txt~)) pat~) i) (t (correct-loop~ pat~ txt~ (+ 1 i))))) (defthm correct-loop-trade ; (*** 6 ***) (implies (and (stringp pat) (stringp txt)) (equal (correct-loop pat txt i) (correct-loop~ (coerce pat 'list) (coerce txt 'list) i)))) (defun clear (pat~ txt~ k n) ; (*** 7 ***) (declare (xargs :measure (nfix n))) (cond ((zp n) t) ((equal (firstn (len pat~) (nthcdr k txt~)) pat~) nil) (t (clear pat~ txt~ (+ k 1) (- n 1))))) ; Next is one of the two lemmas in our decomposition of crux~. It is not ; stated as a rewrite rule because in actual fact when the lemma is used k, ; below, becomes (- i j) and the (+ k n) becomes a difference expression. (defthm clear-implies-skip ; (*** 8 ***) (implies (and (natp k) (< k (len txt~)) (natp n) (clear pat~ txt~ k n)) (equal (correct-loop~ pat~ txt~ (+ k n)) (correct-loop~ pat~ txt~ k))) :rule-classes nil) ; The next lemma, clear-x, is the other half of our decomposition. The trick ; to stating it is how one writes dt, the discovered text. Technically, it is ; txt[i] consed onto the tail end of pat starting at j+1. But this (a) hides ; it relation to txt -- dt is in fact just a certain substring of txt starting ; at i -- and (b) is a function of j. But to prove the theorem below we must ; induct on j. We cannot tolerate dt being a function of the induction ; variable. So we have to generalize the theorem and not deal with the actual ; dt but with some arbitrary substring of txt of d characters. But this ; introduction of d and the separation of dt from pat (now it is just a piece ; of txt) makes this lemma unuseful as a rewrite rule in crux~. (defthm clear-x ; (*** 9 ***) (implies (and (true-listp pat~) (true-listp txt~) (consp pat~) (natp i) (natp d) (<= (+ i d) (len txt~)) (integerp j) (natp (- i j)) (natp (+ j d)) (<= (+ j d) (len pat~))) (clear pat~ txt~ (- i j) (- j (x (firstn d (nthcdr i txt~)) pat~ j)))) :rule-classes nil) (defthm crux~ ; (*** 4 ***) (implies (and (true-listp pat~) (true-listp txt~) (integerp j) (<= -1 j) (integerp i) (<= -1 i) (<= 0 j) (< i (len txt~)) (not (equal (nth j pat~) (nth i txt~))) (consp pat~) (<= 0 (len pat~)) (< j (len pat~)) (<= j i) (equal (firstn (len (nthcdr (+ 1 j) pat~)) (nthcdr (+ 1 i) txt~)) (nthcdr (+ 1 j) pat~))) (equal (correct-loop~ pat~ txt~ (+ i (- (x (cons (car (nthcdr i txt~)) (cdr (nthcdr j pat~))) pat~ (+ -1 j))))) (correct-loop~ pat~ txt~ (+ i (- j))))) ; Note: Recall the plan: You can skip over any distance if you know there are ; no matches in the region and we know there are no matches in the region ; scanned by x. We could not find a way to express the two lemmas ; (numbers 8 and 9 above) to make ACL2 just use them automatically. So we ; literally provide ACL2 with the required instantiations of them below. :hints (("Goal" :use ((:instance clear-x (d (- (len pat~) j))) (:instance clear-implies-skip (pat~ pat~) (txt~ txt~) (k (+ i (- j))) (n (+ j (- (x (cons (nth i txt~) (nthcdr j (cdr pat~))) pat~ (+ -1 j)))))))))) (defthm crux ; (*** 3 ***) (implies (and (stringp pat) (stringp txt) (integerp j) (<= -1 j) (integerp i) (<= -1 i) (<= 0 j) (< i (length txt)) (not (equal (char pat j) (char txt i))) (not (equal pat "")) (< j (length pat)) (<= j i) (xmatch pat (+ 1 j) txt (+ 1 i))) (equal (correct-loop pat txt (+ 1 i (- (length pat)) (delta (char txt i) j pat))) (correct-loop pat txt (+ i (- j)))))) ; Note: The plan calls for us to ascend to the string level now and prove that ; fast-loop is correct-loop by a routine fast-loop induction. The plan left ; out two things that come up in trying to carry out that proof. ; This first lemma, below, says that (xmatch pat j txt i) always succeeds if j ; is (length pat). Recall that the equivalence of fast-loop and correct-loop ; has the hypothesis that pat starting at j+1 matches txt starting at i+1. But ; when we slide the pattern down, j becomes (- (length pat) 1). That is, the ; "pre-matched" part of the pattern becomes empty. This theorem lets us detach ; the induction hypothesis in that case. (defthm empty-xmatch (implies (and (stringp pat) (stringp txt) (natp i)) (xmatch pat (length pat) txt i))) ; The next lemma says that correct-loop fails if the pat overhangs the right ; end of the txt. The fast loop stops under this condition, but the correct ; loop keeps going until the left end of the pat is past the end of txt. This ; lemma tells us correct-loop could stop early. (defthm early-termination (implies (and (natp k) (stringp pat) (stringp txt) (<= (length txt) (+ k (- (length pat) 1)))) (not (correct-loop pat txt k)))) ; Now we want to do everything else at the string level. So we disable the ; rules that are driving us down to the list level. (in-theory (disable xmatch-trade correct-loop-trade delta length char)) ; So now we're at the string level again and we're almost done! Here we do a routine ; fast-loop induction and appeal to crux: (defthm fast-loop-is-correct-loop ; (*** 2 ***) (implies (and (stringp pat) (integerp j) (stringp txt) (integerp i) (<= -1 j) (< j (length pat)) (<= j i) (not (equal pat "")) (xmatch pat (+ j 1) txt (+ i 1))) (equal (fast-loop pat j txt i) (correct-loop pat txt (- i j))))) ; And now we're done. (defthm fast-is-correct ; (*** 1 ***) (implies (and (stringp pat) (stringp txt)) (equal (fast pat txt) (correct pat txt)))) ; Q.E.D.