Lex-lexeme

Lex a lexeme.

Signature
(lex-lexeme parstate) → (mv erp lexeme? span new-parstate)
Arguments: parstate — Guard (parstatep parstate).
Returns: lexeme? — Type (lexeme-optionp lexeme?); span — Type (spanp span); new-parstate — Type (parstatep new-parstate), given (parstatep parstate).

This is the top-level lexing function. It returns the next lexeme found in the parser state, or nil if we reached the end of the file; an error is returned if lexing fails.

First we get the next character, propagating errors. If there is no next character, we return nil for no lexeme, with the span whose start and end positions are both the position just past the end of the file. Otherwise, we do a case analysis on that next character.

If the next character is white space, we return a white-space lexeme. No other lexeme starts with a white-space character, so this is the only possibility.
If the next character is a letter, it could start an identifier or keyword, but it could also start a character constant or string literal. Specifically, if the letter is u, U, or L, it could be a prefix of a character constant or string literal. We must try this possibility before trying an identifier or keyword, because we always need to lex the longest possible sequence of characters [C17:6.4/4]: if we tried identifiers or keywords first, we would, for example, erroneously lex the character constant u'a' as the identifier u followed by the unprefixed character constant 'a'. According to the grammar, an identifier is also an enumeration constant, so the lexing of an identifier is always ambiguous; we always consider it as an identifier (not an enumeration constant), but we can reclassify it as an enumeration constant during type checking (outside the lexer and parser).
If the next character is u, and there are no subsequent characters, we lex it as an identifier. If the following character is a single quote, we attempt to lex a character constant with the appropriate prefix; if the following character is a double quote, we attempt to lex a string literal with the appropriate prefix. These are the only two real possibilities in these two cases. Strictly speaking, if the lexing of the character constant or string literal fails, we should lex u as an identifier and then continue lexing, but at that point the only possibility would be an unprefixed character constant or string literal, which would fail again; so we can fail sooner without loss. If the character immediately following u is 8, then we need to look at the character after that. If there is none, we lex the identifier u8. If there is one and it is a double quote, then we attempt to lex a string literal with the appropriate prefix, which again is the only possibility, and again we can immediately fail if this fails. If the character after u8 is not a double quote, we put back that character and 8, and we lex u... as an identifier or keyword. Also, if the character after u is not any of the ones mentioned above, we put it back and we lex u... as an identifier or keyword.
If the next character is U or L, we proceed similarly to the case of u, but things are simpler because there is no 8 to handle.
If the next character is any other letter or an underscore, it must start an identifier or keyword. This is the only possibility, since we have already tried a prefixed character constant or string literal.
If the next character is a digit, it must start an integer or floating constant. This is the only possibility.
If the next character is ., it may start a decimal floating constant, or it could be the punctuator ., or it could start the punctuator .... So we examine the following characters. If there is none, we have the punctuator .. If the following character is a digit, this must start a decimal floating constant. If the following character is another ., and there is a further . after it, we have the punctuator .... In all other cases, we just have the punctuator ., and we put back the additional character(s) read, since they may be starting a different lexeme.
If the next character is a single quote, it must start an unprefixed character constant.
If the next character is a double quote, it must start an unprefixed string literal.
If the next character is /, it could start a comment, or the punctuator /=, or it could be just the punctuator /. We examine the following character. If there is none, we have the punctuator /. If the following character is *, it must be a block comment. If the following character is /, it must be a line comment. If the following character is =, it must be the punctuator /=. If the following character is none of the above, we just have the punctuator /.
The remaining cases are for punctuators. Some punctuators are prefixes of others, so we must first try to lex the longer ones, using code similar to that for the other lexemes explained above. Other punctuators are not prefixes of any other punctuator, so they can be decided immediately.
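
The look-ahead decisions described above can be pictured in isolation from the parser state. The following is a minimal sketch, not the parser's code: classify-after-u, classify-after-dot and punctuator-after-gt are hypothetical helpers, and each takes the list of characters that follow the first character (assumed to be a true list of characters) instead of calling read-char and unread-char. Each function only names the outcome that the case analysis selects; it does not lex the rest of the lexeme.

```lisp
; Sketch only. These helpers are hypothetical and are not part of the parser.
; NEXT-CHARS is the list of characters that follow the first character.

(defun classify-after-u (next-chars)
  ;; The first character was lowercase u.
  (cond ((atom next-chars) :identifier)                ; input ends after u
        ((eql (car next-chars) #\') :u-character-constant)
        ((eql (car next-chars) #\") :u-string-literal)
        ((eql (car next-chars) #\8)
         (cond ((atom (cdr next-chars)) :identifier)   ; input ends after u8
               ((eql (cadr next-chars) #\") :u8-string-literal)
               (t :identifier-or-keyword)))            ; put back two characters
        (t :identifier-or-keyword)))                   ; put back one character

(defun classify-after-dot (next-chars)
  ;; The first character was a dot.
  (cond ((atom next-chars) :dot)
        ((digit-char-p (car next-chars)) :decimal-floating-constant)
        ((and (eql (car next-chars) #\.)
              (consp (cdr next-chars))
              (eql (cadr next-chars) #\.))
         :ellipsis)
        (t :dot)))                                     ; put back what was read

(defun punctuator-after-gt (next-chars)
  ;; The first character was >. The longest punctuator wins.
  (cond ((atom next-chars) ">")
        ((eql (car next-chars) #\=) ">=")
        ((eql (car next-chars) #\>)
         (if (and (consp (cdr next-chars))
                  (eql (cadr next-chars) #\=))
             ">>="
           ">>"))
        (t ">")))

; u'a' is a prefixed character constant, not the identifier u followed by 'a'.
(assert-event (equal (classify-after-u (coerce "'a'" 'list))
                     :u-character-constant))
; u8x is an identifier or keyword: both x and 8 are put back.
(assert-event (equal (classify-after-u (coerce "8x" 'list))
                     :identifier-or-keyword))
; Three dots form one punctuator, but two dots and a letter start with a single dot.
(assert-event (equal (classify-after-dot (coerce ".." 'list)) :ellipsis))
(assert-event (equal (classify-after-dot (coerce ".x" 'list)) :dot))
(assert-event (equal (punctuator-after-gt (coerce ">=" 'list)) ">>="))
```

Working on an explicit list makes the amount of look-ahead visible: at most two characters beyond the first are inspected in these cases, which matches the one or two calls of unread-char in the corresponding branches of the definition.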

Definitions and Theorems

Function: lex-lexeme

(defun lex-lexeme (parstate)
 (declare (xargs :stobjs (parstate)))
 (declare (xargs :guard (parstatep parstate)))
 (let ((__function__ 'lex-lexeme))
  (declare (ignorable __function__))
  (b* (((reterr) nil (irr-span) parstate)
       ((erp char first-pos parstate)
        (read-char parstate))
       ((unless char)
        (retok nil
               (make-span :start first-pos
                          :end first-pos)
               parstate)))
   (cond
    ((or (= char 32)
         (and (<= 9 char) (<= char 12)))
     (retok (lexeme-whitespace)
            (make-span :start first-pos
                       :end first-pos)
            parstate))
    ((= char (char-code #\u))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
      (cond
       ((not char2)
        (retok (lexeme-token (token-ident (ident "u")))
               (make-span :start first-pos
                          :end first-pos)
               parstate))
       ((= char2 (char-code #\'))
        (lex-character-constant (cprefix-locase-u)
                                first-pos parstate))
       ((= char2 (char-code #\"))
        (lex-string-literal (eprefix-locase-u)
                            first-pos parstate))
       ((= char2 (char-code #\8))
        (b* (((erp char3 & parstate)
              (read-char parstate)))
         (cond
          ((not char3)
           (retok (lexeme-token (token-ident (ident "u8")))
                  (make-span :start first-pos :end pos2)
                  parstate))
          ((= char3 (char-code #\"))
           (lex-string-literal (eprefix-locase-u8)
                               first-pos parstate))
          (t (b* ((parstate (unread-char parstate))
                  (parstate (unread-char parstate)))
               (lex-identifier/keyword char first-pos parstate))))))
       (t (b* ((parstate (unread-char parstate)))
            (lex-identifier/keyword char first-pos parstate))))))
    ((= char (char-code #\U))
     (b* (((erp char2 & parstate)
           (read-char parstate)))
      (cond
          ((not char2)
           (retok (lexeme-token (token-ident (ident "U")))
                  (make-span :start first-pos
                             :end first-pos)
                  parstate))
          ((= char2 (char-code #\'))
           (lex-character-constant (cprefix-upcase-u)
                                   first-pos parstate))
          ((= char2 (char-code #\"))
           (lex-string-literal (eprefix-upcase-u)
                               first-pos parstate))
          (t (b* ((parstate (unread-char parstate)))
               (lex-identifier/keyword char first-pos parstate))))))
    ((= char (char-code #\L))
     (b* (((erp char2 & parstate)
           (read-char parstate)))
      (cond
          ((not char2)
           (retok (lexeme-token (token-ident (ident "L")))
                  (make-span :start first-pos
                             :end first-pos)
                  parstate))
          ((= char2 (char-code #\'))
           (lex-character-constant (cprefix-upcase-l)
                                   first-pos parstate))
          ((= char2 (char-code #\"))
           (lex-string-literal (eprefix-upcase-l)
                               first-pos parstate))
          (t (b* ((parstate (unread-char parstate)))
               (lex-identifier/keyword char first-pos parstate))))))
    ((or (and (<= (char-code #\A) char)
              (<= char (char-code #\Z)))
         (and (<= (char-code #\a) char)
              (<= char (char-code #\z)))
         (= char (char-code #\_)))
     (lex-identifier/keyword char first-pos parstate))
    ((and (<= (char-code #\0) char)
          (<= char (char-code #\9)))
     (b* (((erp const last-pos parstate)
           (lex-iconst/fconst (code-char char)
                              first-pos parstate)))
       (retok (lexeme-token (token-const const))
              (make-span :start first-pos
                         :end last-pos)
              parstate)))
    ((= char (char-code #\.))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
      (cond
          ((not char2)
           (retok (lexeme-token (token-punctuator "."))
                  (make-span :start first-pos
                             :end first-pos)
                  parstate))
          ((and (<= (char-code #\0) char2)
                (<= char2 (char-code #\9)))
           (b* (((erp const last-pos parstate)
                 (lex-dec-fconst (code-char char2)
                                 pos2 parstate)))
             (retok (lexeme-token (token-const const))
                    (make-span :start first-pos
                               :end last-pos)
                    parstate)))
          ((= char2 (char-code #\.))
           (b* (((erp char3 pos3 parstate)
                 (read-char parstate)))
             (cond ((not char3)
                    (b* ((parstate (unread-char parstate)))
                      (retok (lexeme-token (token-punctuator "."))
                             (make-span :start first-pos
                                        :end first-pos)
                             parstate)))
                   ((= char3 (char-code #\.))
                    (retok (lexeme-token (token-punctuator "..."))
                           (make-span :start first-pos :end pos3)
                           parstate))
                   (t (b* ((parstate (unread-char parstate))
                           (parstate (unread-char parstate)))
                        (retok (lexeme-token (token-punctuator "."))
                               (make-span :start first-pos
                                          :end first-pos)
                               parstate))))))
          (t (b* ((parstate (unread-char parstate)))
               (retok (lexeme-token (token-punctuator "."))
                      (make-span :start first-pos
                                 :end first-pos)
                      parstate))))))
    ((= char (char-code #\'))
     (lex-character-constant nil first-pos parstate))
    ((= char (char-code #\"))
     (lex-string-literal nil first-pos parstate))
    ((= char (char-code #\/))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "/"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\*))
              (lex-block-comment first-pos parstate))
             ((= char2 (char-code #\/))
              (lex-line-comment first-pos parstate))
             ((= char2 (char-code #\=))
              (retok (lexeme-token (token-punctuator "/="))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "/"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((or (= char (char-code #\[))
         (= char (char-code #\]))
         (= char (char-code #\())
         (= char (char-code #\)))
         (= char (char-code #\{))
         (= char (char-code #\}))
         (= char (char-code #\~))
         (= char (char-code #\?))
         (= char (char-code #\,))
         (= char (char-code #\;)))
     (retok
      (lexeme-token
         (token-punctuator (acl2::implode (list (code-char char)))))
      (make-span :start first-pos
                 :end first-pos)
      parstate))
    ((= char (char-code #\*))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "*"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\=))
              (retok (lexeme-token (token-punctuator "*="))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "*"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\^))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "^"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\=))
              (retok (lexeme-token (token-punctuator "^="))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "^"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\!))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "!"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\=))
              (retok (lexeme-token (token-punctuator "!="))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "!"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\=))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "="))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\=))
              (retok (lexeme-token (token-punctuator "=="))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "="))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\:))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator ":"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\>))
              (retok (lexeme-token (token-punctuator ":>"))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator ":"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\#))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "#"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\#))
              (retok (lexeme-token (token-punctuator "##"))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "#"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\&))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "&"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\&))
              (retok (lexeme-token (token-punctuator "&&"))
                     (make-span :start first-pos :end pos2)
                     parstate))
             ((= char2 (char-code #\=))
              (retok (lexeme-token (token-punctuator "&="))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "&"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\|))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "|"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\|))
              (retok (lexeme-token (token-punctuator "||"))
                     (make-span :start first-pos :end pos2)
                     parstate))
             ((= char2 (char-code #\=))
              (retok (lexeme-token (token-punctuator "|="))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "|"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\+))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "+"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\+))
              (retok (lexeme-token (token-punctuator "++"))
                     (make-span :start first-pos :end pos2)
                     parstate))
             ((= char2 (char-code #\=))
              (retok (lexeme-token (token-punctuator "+="))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "+"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\-))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
       (cond ((not char2)
              (retok (lexeme-token (token-punctuator "-"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))
             ((= char2 (char-code #\>))
              (retok (lexeme-token (token-punctuator "->"))
                     (make-span :start first-pos :end pos2)
                     parstate))
             ((= char2 (char-code #\-))
              (retok (lexeme-token (token-punctuator "--"))
                     (make-span :start first-pos :end pos2)
                     parstate))
             ((= char2 (char-code #\=))
              (retok (lexeme-token (token-punctuator "-="))
                     (make-span :start first-pos :end pos2)
                     parstate))
             (t (b* ((parstate (unread-char parstate)))
                  (retok (lexeme-token (token-punctuator "-"))
                         (make-span :start first-pos
                                    :end first-pos)
                         parstate))))))
    ((= char (char-code #\>))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
      (cond
         ((not char2)
          (retok (lexeme-token (token-punctuator ">"))
                 (make-span :start first-pos
                            :end first-pos)
                 parstate))
         ((= char2 (char-code #\>))
          (b* (((erp char3 pos3 parstate)
                (read-char parstate)))
            (cond ((not char3)
                   (retok (lexeme-token (token-punctuator ">>"))
                          (make-span :start first-pos :end pos2)
                          parstate))
                  ((= char3 (char-code #\=))
                   (retok (lexeme-token (token-punctuator ">>="))
                          (make-span :start first-pos :end pos3)
                          parstate))
                  (t (b* ((parstate (unread-char parstate)))
                       (retok (lexeme-token (token-punctuator ">>"))
                              (make-span :start first-pos :end pos2)
                              parstate))))))
         ((= char2 (char-code #\=))
          (retok (lexeme-token (token-punctuator ">="))
                 (make-span :start first-pos :end pos2)
                 parstate))
         (t (b* ((parstate (unread-char parstate)))
              (retok (lexeme-token (token-punctuator ">"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))))))
    ((= char (char-code #\%))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
      (cond
       ((not char2)
        (retok (lexeme-token (token-punctuator "%"))
               (make-span :start first-pos
                          :end first-pos)
               parstate))
       ((= char2 (char-code #\=))
        (retok (lexeme-token (token-punctuator "%="))
               (make-span :start first-pos :end pos2)
               parstate))
       ((= char2 (char-code #\:))
        (b* (((erp char3 & parstate)
              (read-char parstate)))
         (cond
          ((not char3)
           (retok (lexeme-token (token-punctuator "%:"))
                  (make-span :start first-pos :end pos2)
                  parstate))
          ((= char3 (char-code #\%))
           (b* (((erp char4 pos4 parstate)
                 (read-char parstate)))
            (cond ((not char4)
                   (b* ((parstate (unread-char parstate)))
                     (retok (lexeme-token (token-punctuator "%:"))
                            (make-span :start first-pos :end pos2)
                            parstate)))
                  ((= char4 (char-code #\:))
                   (retok (lexeme-token (token-punctuator "%:%:"))
                          (make-span :start first-pos :end pos4)
                          parstate))
                  (t (b* ((parstate (unread-char parstate))
                          (parstate (unread-char parstate)))
                       (retok (lexeme-token (token-punctuator "%:"))
                              (make-span :start first-pos :end pos2)
                              parstate))))))
          (t (b* ((parstate (unread-char parstate)))
               (retok (lexeme-token (token-punctuator "%:"))
                      (make-span :start first-pos :end pos2)
                      parstate))))))
       (t (b* ((parstate (unread-char parstate)))
            (retok (lexeme-token (token-punctuator "%"))
                   (make-span :start first-pos
                              :end first-pos)
                   parstate))))))
    ((= char (char-code #\<))
     (b* (((erp char2 pos2 parstate)
           (read-char parstate)))
      (cond
         ((not char2)
          (retok (lexeme-token (token-punctuator "<"))
                 (make-span :start first-pos
                            :end first-pos)
                 parstate))
         ((= char2 (char-code #\<))
          (b* (((erp char3 pos3 parstate)
                (read-char parstate)))
            (cond ((not char3)
                   (retok (lexeme-token (token-punctuator "<<"))
                          (make-span :start first-pos :end pos2)
                          parstate))
                  ((= char3 (char-code #\=))
                   (retok (lexeme-token (token-punctuator "<<="))
                          (make-span :start first-pos :end pos3)
                          parstate))
                  (t (b* ((parstate (unread-char parstate)))
                       (retok (lexeme-token (token-punctuator "<<"))
                              (make-span :start first-pos :end pos2)
                              parstate))))))
         ((= char2 (char-code #\=))
          (retok (lexeme-token (token-punctuator "<="))
                 (make-span :start first-pos :end pos2)
                 parstate))
         ((= char2 (char-code #\:))
          (retok (lexeme-token (token-punctuator "<:"))
                 (make-span :start first-pos :end pos2)
                 parstate))
         ((= char2 (char-code #\%))
          (retok (lexeme-token (token-punctuator "<%"))
                 (make-span :start first-pos :end pos2)
                 parstate))
         (t (b* ((parstate (unread-char parstate)))
              (retok (lexeme-token (token-punctuator "<"))
                     (make-span :start first-pos
                                :end first-pos)
                     parstate))))))
    (t
     (reterr-msg
      :where (position-to-msg first-pos)
      :expected
      "a white-space character ~
                               (space, ~
                               new-line, ~
                               horizontal tab, ~
                               vertical tab, ~
                               form feed) ~
                               or a letter ~
                               or a digit ~
                               or an underscore ~
                               or a round parenthesis ~
                               or a square bracket ~
                               or a curly brace ~
                               or an angle bracket ~
                               or a dot ~
                               or a comma ~
                               or a colon ~
                               or a semicolon ~
                               or a plus ~
                               or a minus ~
                               or a star ~
                               or a slash ~
                               or a percent ~
                               or a tilde ~
                               or an equal sign ~
                               or an exclamation mark ~
                               or a question mark ~
                               or a vertical bar ~
                               or a caret ~
                               or hash"
      :found (char-to-msg char)))))))

Theorem: lexeme-optionp-of-lex-lexeme.lexeme?

(defthm lexeme-optionp-of-lex-lexeme.lexeme?
  (b* (((mv acl2::?erp ?lexeme? ?span ?new-parstate)
        (lex-lexeme parstate)))
    (lexeme-optionp lexeme?))
  :rule-classes :rewrite)

Theorem: spanp-of-lex-lexeme.span

(defthm spanp-of-lex-lexeme.span
  (b* (((mv acl2::?erp ?lexeme? ?span ?new-parstate)
        (lex-lexeme parstate)))
    (spanp span))
  :rule-classes :rewrite)

Theorem: parstatep-of-lex-lexeme.new-parstate

(defthm parstatep-of-lex-lexeme.new-parstate
  (implies (parstatep parstate)
           (b* (((mv acl2::?erp ?lexeme? ?span ?new-parstate)
                 (lex-lexeme parstate)))
             (parstatep new-parstate)))
  :rule-classes :rewrite)

Theorem: parsize-of-lex-lexeme-uncond

(defthm parsize-of-lex-lexeme-uncond
  (b* (((mv acl2::?erp ?lexeme? ?span ?new-parstate)
        (lex-lexeme parstate)))
    (<= (parsize new-parstate)
        (parsize parstate)))
  :rule-classes :linear)

Theorem: parsize-of-lex-lexeme-cond

(defthm parsize-of-lex-lexeme-cond
  (b* (((mv acl2::?erp ?lexeme? ?span ?new-parstate)
        (lex-lexeme parstate)))
    (implies (and (not erp) lexeme?)
             (<= (parsize new-parstate)
                 (1- (parsize parstate)))))
  :rule-classes :linear)
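
One common use of linear rules of this shape is to justify the termination of functions that call the lexer repeatedly; whether and how callers of lex-lexeme rely on them is not shown on this page. The following is a minimal sketch of the idea on a toy model, under stated assumptions: toy-lex and toy-lex-all are hypothetical, a list of items stands in for the parser state, and len stands in for parsize.

```lisp
; Sketch only: TOY-LEX and TOY-LEX-ALL are hypothetical stand-ins, with a list
; of items in place of the parser state and LEN in place of PARSIZE.

(defun toy-lex (items)
  ;; Returns (mv lexeme? remaining): the first item, or nil at the end of input.
  (if (consp items)
      (mv (car items) (cdr items))
    (mv nil items)))

; Analogue of parsize-of-lex-lexeme-cond: a returned lexeme consumed input.
(defthm len-of-toy-lex-cond
  (mv-let (lexeme? remaining)
    (toy-lex items)
    (implies lexeme?
             (<= (len remaining) (1- (len items)))))
  :rule-classes :linear)

; Hide the definition, so that only the linear rule is available below.
(in-theory (disable toy-lex))

(defun toy-lex-all (items)
  ;; Collects lexemes until TOY-LEX returns nil.
  (declare (xargs :measure (len items)))
  (mv-let (lexeme? remaining)
    (toy-lex items)
    (if lexeme?
        (cons lexeme? (toy-lex-all remaining))
      nil)))

(assert-event (equal (toy-lex-all '(a b c)) '(a b c)))
```

The recursive call in toy-lex-all is governed by the test that a lexeme was returned, which is exactly the hypothesis of the linear rule, so the measure is known to decrease without opening the definition of toy-lex.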