• Top
    • Documentation
    • Books
    • Boolean-reasoning
    • Projects
    • Debugging
    • Std
    • Proof-automation
    • Macro-libraries
    • ACL2
      • Theories
      • Rule-classes
      • Proof-builder
      • Recursion-and-induction
      • Hons-and-memoization
      • Events
      • Parallelism
      • History
      • Programming
      • Operational-semantics
      • Real
      • Start-here
      • Debugging
      • Miscellaneous
      • Output-controls
      • Macros
      • Interfacing-tools
        • Io
        • Defttag
        • Sys-call
        • Save-exec
        • Quicklisp
        • Std/io
        • Oslib
        • Bridge
        • Clex
          • Example-lexer
            • Token-p
            • Lex-punctuation
            • Lex-id/keyword
            • Lex-string
            • Lex-whitespace
            • Lex-comment
            • Lex1
            • Lex-main
            • Lex*
            • Tokenlist-p
            • Letter-char-p
            • Idtail-char-p
            • Whitespace-char-p
            • Number-char-p
            • Tokentype-p
            • Lex*-exec
            • Newline-string
          • Sin
          • Matching-functions
          • Def-sin-progress
        • Tshell
        • Unsound-eval
        • Hacker
        • ACL2s-interface
        • Startup-banner
        • Command-line
    • Interfacing-tools
      • Io
      • Defttag
      • Sys-call
      • Save-exec
      • Quicklisp
      • Std/io
      • Oslib
      • Bridge
      • Clex
        • Example-lexer
          • Token-p
          • Lex-punctuation
          • Lex-id/keyword
          • Lex-string
          • Lex-whitespace
          • Lex-comment
          • Lex1
          • Lex-main
          • Lex*
          • Tokenlist-p
          • Letter-char-p
          • Idtail-char-p
          • Whitespace-char-p
          • Number-char-p
          • Tokentype-p
          • Lex*-exec
          • Newline-string
        • Sin
        • Matching-functions
        • Def-sin-progress
      • Tshell
      • Unsound-eval
      • Hacker
      • ACL2s-interface
      • Startup-banner
      • Command-line
    • Hardware-verification
    • Software-verification
    • Math
    • Testing-utilities
  • Clex

Example-lexer

A very basic lexer written using some CLEX utilities.

We implement a simple lexer for the following, contrived language:

// supporting definitions:

   Letter ::= 'A' | ... | 'Z'
            | 'a' | ... | 'z'

   Number ::= '0' | ... | '9'

// top-level tokens:

  Whitespace ::= (Space | Newline | Tab)*

  Punctuation ::= ';'
                | '+'  | '-'  | '*' | '/'
                | '++' | '--'

  Keyword ::= 'void' | 'int' | 'float'

  Identifier ::= Letter ( Letter | Number | '_' )*    // except keywords

  Comment ::= '//' [anything but newline]* (Newline | EOF)

Our lexer produces simple token-p structures that have a type and some text.

The main lexer loop is lex*. As basic correctness properties, we prove that it returns a valid tokenlist-p and that, on success, we can flatten out the tokens it produces to recreate the original input stream:

Theorem: tokenlist-all-text-of-lex*

(defthm tokenlist-all-text-of-lex*
  (b* (((mv okp tokens ?new-sin) (lex* sin)))
    (implies okp
             (equal (tokenlist-all-text tokens)
                    (implode (strin-left sin))))))

This seems pretty good. It isn't a total correctness statement—for that, we might also want to know something like: if there exists any valid way to tokenize the input, then we will find it. Or we might want to know: there is always exactly one unique way to validly tokenize a list. These seem harder to state and prove.

Subtopics

Token-p
Representation of a single token.
Lex-punctuation
Match punctuation characters to create a :punctuation token.
Lex-id/keyword
Match identifier characters to create an :id or :keyword token, as appropriate.
Lex-string
Complete wrapper: lex an ACL2 string.
Lex-whitespace
Match whitespace to create a :whitespace token.
Lex-comment
Match comment characters to create an :comment token.
Lex1
Lex a single token from the input stream.
Lex-main
Wrapper: lex the entire input stream or fail with a good error message.
Lex*
Main lexer loop: completely tokenize the entire input stream.
Tokenlist-p
(tokenlist-p x) recognizes lists where every element satisfies token-p.
Letter-char-p
Recognize upper- and lower-case letters.
Idtail-char-p
Recogize characters that are okay in the "tails" of identifiers: letters, numbers, and underscores.
Whitespace-char-p
Recognize basic whitespace.
Number-char-p
Recognize digits 0-9.
Tokentype-p
Valid types for tokens.
Lex*-exec
Tail-recursive version of lex* (to prevent stack overflows).
Newline-string
A string with just the newline character.