Search-engine friendly clone of the ACL2 documentation.

Parser

Read-char

Read a character.

Signature
(read-char parstate) → (mv erp char? pos new-parstate)
Arguments: parstate — Guard (parstatep parstate).
Returns: char? — Type (nat-optionp char?).; pos — Type (positionp pos).; new-parstate — Type (parstatep new-parstate), given (parstatep parstate).

Besides the character, we also return its position. No character is returned if we are at the end of the file; in this case, the returned file position is the one of the non-existent character just past the end of the file (i.e. the position of a character if we added it to the end of the file).

First we check whether the sequence of unread characters is not empty. If it is not empty, we move a character from that sequence to the sequence of read characters, which amounts to incrementing chars-read and decrementing chars-unread, with no change to the array; see parstate. We also need to check that the size stobj component is not 0, which should be an invariant but it is not explicated in the stobj, because we also decrement that stobj component, to maintain the invariant. We read the character (and its position) from the array so we can return it, but we need to check that the index (i.e. chars-read) is within the bounds of the array. Maybe we could optimize this check away with suitable invariants, presumably via abstract stobjs; this could also let us optimize away the extra check on size mentioned above.

If the sequence of unread characters is empty, we need to read the next character from the bytes. If there are no more bytes, we have reached the end of file. We return nil, for no character, and we leave the parser state unchanged.

Otherwise, we read the first byte, which is removed from the parser state. Since we support Unicode, we perform UTF-8 decoding, which may involve reading additional bytes.

Looking at the rules in the ABNF grammar for basic and extended characters, we see that the codes of the three ASCII non-new-line extended characters (namely dollar, at sign, and backquote) fill gaps in the ASCII codes of the basic characters, so that the codes 9, 11, 12, and 32-126 are all valid ASCII characters, which are thus returned, incrementing the position by one column. If instead the byte is the ASCII code 10 for line feed, we increment the position by one line. If instead the byte is the ASCII code 13 for carriage return, there are two cases based on whether the immediately following byte exists and is the ASCII code 10 for line feed: if it does, we consume two bytes instead of one, but we return just a line feed, since we only really need one new-line character (also in line with [C17:5.2.1/3]); if it does not, we just consume the carriage return, but return a line feed, again for normalizing the new-line character. In both cases, we increment the position by one line, which also resets the column to 0 (see position-inc-line).

Note that horizontal tab, vertical tab, and form feed just increment the column number by 1 and leave the line number unchanged, like most other characters. This may not match the visual appearance, but the parser has no way to know how many columns a horizontal tab takes, or how many lines a vertical tab or form feed takes. So, at least for now, we just treat these as most other characters.

We exclude most ASCII control characters, except for the basic ones and for the new-line ones, since there should be little need to use those in C code. Furthermore, some are dangerous, particularly backspace, since it may make the code look different from what it is, similarly to Trojan Source in Unicode.

If the first byte is 128 or more, it must start a non-ASCII Unicode character. There are three cases to consider. It is easy to find references to UTF-8 encoding/decoding, for instance this Wikipedia page.

First case: If the first byte has the form 110xxxyy, it must be followed by a second byte of the form 10yyzzzz, which together encode xxxyyyyzzzz, which covers the range from 0 to 7FFh. We return an error if there is no second byte. We return an error if the encoded value is below 80h. If all these checks pass, the code covers the character range from U+80 to U+7FF.

Second case: If the first byte has the form 1110xxxx, it must be followed by a second byte of the form 10yyyyzz and by a third byte of the form 10zzwwww, which together encode xxxxyyyyzzzzwwww, which covers the range from 0 to FFFFh. We return an error if there is no second or third byte. We return an error if the encoded value is below 800h. If all these checks pass, the code covers the character range from U+800 to U+FFFF, but we return an error if the character is between U+202A and U+202E or between U+2066 and U+2069; see the safe-nonascii rule in our ABNF grammar.

Third case: If the first byte has the form 11110xyy, it must be followed by a second byte of the form 10yyzzzz and by a third byte of the form 10wwwwuu and by a fourth byte of the form 10uuvvvv, which together encode xyyyyzzzzwwwwuuuuvvvv, which covers the range from 0 to 1FFFFFh. We return an error if there is no second or third or fourth byte. We return an error if the encoded value is below 10000h or above 10FFFFh. If all these checks pass, the code covers the character range from U+10000 to U+1FFFFF.

If the first byte read has any other value, either it is an invalid UTF-8 encoding (e.g. 111...) or it is an ASCII character that we do not accept (e.g. 00000000). We return an error in this case.

Note that all the non-ASCII characters that we accept just increment the column number by 1 and leave the line number unchanged. This may not be appropriate for certain Unicode characters, but for now we treat them in this simplified way.

Definitions and Theorems

Function: read-char

(defun read-char (parstate)
 (declare (xargs :stobjs (parstate)))
 (declare (xargs :guard (parstatep parstate)))
 (let ((__function__ 'read-char))
  (declare (ignorable __function__))
  (b*
   (((reterr) nil (irr-position) parstate)
    (parstate.bytes (parstate->bytes parstate))
    (parstate.position (parstate->position parstate))
    (parstate.chars-read (parstate->chars-read parstate))
    (parstate.chars-unread (parstate->chars-unread parstate))
    (parstate.size (parstate->size parstate))
    ((when (and (> parstate.chars-unread 0)
                (> parstate.size 0)))
     (b*
      (((unless (< parstate.chars-read
                   (parstate->chars-length parstate)))
        (raise "Internal error: index ~x0 out of bound ~x1."
               parstate.chars-read
               (parstate->chars-length parstate))
        (reterr t))
       (char+pos (parstate->char parstate.chars-read parstate))
       (parstate
           (update-parstate->chars-unread (1- parstate.chars-unread)
                                          parstate))
       (parstate
            (update-parstate->chars-read (1+ parstate.chars-read)
                                         parstate))
       (parstate (update-parstate->size (1- parstate.size)
                                        parstate)))
      (retok (char+position->char char+pos)
             (char+position->position char+pos)
             parstate)))
    ((unless (and (consp parstate.bytes)
                  (> parstate.size 0)))
     (retok nil parstate.position parstate))
    (byte (car parstate.bytes))
    (bytes (cdr parstate.bytes))
    ((when (or (= byte 9)
               (= byte 11)
               (= byte 12)
               (and (<= 32 byte) (<= byte 126))))
     (b* ((parstate (update-parstate->bytes bytes parstate))
          (parstate (update-parstate->position
                         (position-inc-column 1 parstate.position)
                         parstate))
          ((unless (< parstate.chars-read
                      (parstate->chars-length parstate)))
           (raise "Internal error: index ~x0 out of bound ~x1."
                  parstate.chars-read
                  (parstate->chars-length parstate))
           (reterr t))
          (parstate
               (update-parstate->char
                    parstate.chars-read
                    (make-char+position :char byte
                                        :position parstate.position)
                    parstate))
          (parstate
               (update-parstate->chars-read (1+ parstate.chars-read)
                                            parstate))
          (parstate (update-parstate->size (1- parstate.size)
                                           parstate)))
       (retok byte parstate.position parstate)))
    ((when (= byte 10))
     (b* ((parstate (update-parstate->bytes bytes parstate))
          (parstate (update-parstate->position
                         (position-inc-line 1 parstate.position)
                         parstate))
          ((unless (< parstate.chars-read
                      (parstate->chars-length parstate)))
           (raise "Internal error: index ~x0 out of bound ~x1."
                  parstate.chars-read
                  (parstate->chars-length parstate))
           (reterr t))
          (parstate
               (update-parstate->char
                    parstate.chars-read
                    (make-char+position :char 10
                                        :position parstate.position)
                    parstate))
          (parstate
               (update-parstate->chars-read (1+ parstate.chars-read)
                                            parstate))
          (parstate (update-parstate->size (1- parstate.size)
                                           parstate)))
       (retok 10 parstate.position parstate)))
    ((when (= byte 13))
     (b* (((mv bytes count)
           (if (and (consp bytes)
                    (> parstate.size 1)
                    (= (car bytes) 10))
               (mv (cdr bytes) 2)
             (mv bytes 1)))
          (parstate (update-parstate->bytes bytes parstate))
          (parstate (update-parstate->position
                         (position-inc-line 1 parstate.position)
                         parstate))
          ((unless (< parstate.chars-read
                      (parstate->chars-length parstate)))
           (raise "Internal error: index ~x0 out of bound ~x1."
                  parstate.chars-read
                  (parstate->chars-length parstate))
           (reterr t))
          (parstate
               (update-parstate->char
                    parstate.chars-read
                    (make-char+position :char 10
                                        :position parstate.position)
                    parstate))
          (parstate
               (update-parstate->chars-read (1+ parstate.chars-read)
                                            parstate))
          (parstate (update-parstate->size (- parstate.size count)
                                           parstate)))
       (retok 10 parstate.position parstate)))
    ((when (= (logand byte 224) 192))
     (b*
      (((unless (and (consp bytes) (> parstate.size 1)))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 110... ~
                                          (i.e. between 192 and 223) ~
                                          of a two-byte UTF-8 encoding"
          byte)
         :found "end of file"))
       (byte2 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte2 192) 128))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 110... ~
                                          (i.e. between 192 and 223) ~
                                          of a two-byte UTF-8 encoding"
          byte)
         :found (msg "the byte ~x0" byte2)))
       (code (+ (ash (logand byte 31) 6)
                (logand byte2 63)))
       ((when (< code 128))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "a value between 80h and 7FFh ~
                                          UTF-8-encoded in the two bytes ~
                                          (~x0 ~x1)"
          byte byte2)
         :found (msg "the value ~x0" code)))
       (parstate (update-parstate->bytes bytes parstate))
       (parstate (update-parstate->position
                      (position-inc-column 1 parstate.position)
                      parstate))
       ((unless (< parstate.chars-read
                   (parstate->chars-length parstate)))
        (raise "Internal error: index ~x0 out of bound ~x1."
               parstate.chars-read
               (parstate->chars-length parstate))
        (reterr t))
       (parstate
            (update-parstate->char
                 parstate.chars-read
                 (make-char+position :char code
                                     :position parstate.position)
                 parstate))
       (parstate
            (update-parstate->chars-read (1+ parstate.chars-read)
                                         parstate))
       (parstate (update-parstate->size (- parstate.size 2)
                                        parstate)))
      (retok code parstate.position parstate)))
    ((when (= (logand byte 240) 224))
     (b*
      (((unless (and (consp bytes) (> parstate.size 1)))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 1110... ~
                                          (i.e. between 224 to 239) ~
                                          of a three-byte UTF-8 encoding"
          byte)
         :found "end of file"))
       (byte2 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte2 192) 128))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 1110... ~
                                          (i.e. between 224 and 239) ~
                                          of a three-byte UTF-8 encoding"
          byte)
         :found (msg "the byte ~x0" byte2)))
       ((unless (and (consp bytes) (> parstate.size 2)))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 1110... ~
                                          (i.e. between 224 to 239) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a three-byte UTF-8 encoding"
          byte byte2)
         :found "end of file"))
       (byte3 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte3 192) 128))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 1110... ~
                                          (i.e. between 224 and 239) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a three-byte UTF-8 encoding"
          byte byte2)
         :found (msg "the byte ~x0" byte3)))
       (code (+ (ash (logand byte 15) 12)
                (ash (logand byte2 63) 6)
                (logand byte3 63)))
       ((when (< code 2048))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "a value between 800h and FFFFh ~
                                          UTF-8-encoded in the three bytes ~
                                          (~x0 ~x1 ~x2)"
          byte byte2 byte3)
         :found (msg "the value ~x0" code)))
       ((when (or (and (<= 8234 code) (<= code 8238))
                  (and (<= 8294 code) (<= code 8297))))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         "a Unicode character with code ~
                                     in the range 9-13 or 32-126 ~
                                     or 128-8233 or 8239-8293 or ~
                                     or 8298-55295 or 57344-1114111"
         :found (char-to-msg code)))
       (parstate (update-parstate->bytes bytes parstate))
       (parstate (update-parstate->position
                      (position-inc-column 1 parstate.position)
                      parstate))
       ((unless (< parstate.chars-read
                   (parstate->chars-length parstate)))
        (raise "Internal error: index ~x0 out of bound ~x1."
               parstate.chars-read
               (parstate->chars-length parstate))
        (reterr t))
       (parstate
            (update-parstate->char
                 parstate.chars-read
                 (make-char+position :char code
                                     :position parstate.position)
                 parstate))
       (parstate
            (update-parstate->chars-read (1+ parstate.chars-read)
                                         parstate))
       (parstate (update-parstate->size (- parstate.size 3)
                                        parstate)))
      (retok code parstate.position parstate)))
    ((when (= (logand 248 byte) 240))
     (b*
      (((unless (and (consp bytes) (> parstate.size 1)))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 to 247) ~
                                          of a four-byte UTF-8 encoding"
          byte)
         :found "end of file"))
       (byte2 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte2 192) 128))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 and 247) ~
                                          of a four-byte UTF-8 encoding"
          byte)
         :found (msg "the byte ~x0" byte2)))
       ((unless (and (consp bytes) (> parstate.size 2)))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 to 247) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a four-byte UTF-8 encoding"
          byte byte2)
         :found "end of file"))
       (byte3 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte3 192) 128))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 and 247) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a four-byte UTF-8 encoding"
          byte byte2)
         :found (msg "the byte ~x0" byte3)))
       ((unless (and (consp bytes) (> parstate.size 3)))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 to 247) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          and the third byte ~x2 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a four-byte UTF-8 encoding"
          byte byte2 byte3)
         :found "end of file"))
       (byte4 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte4 192) 128))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 and 247) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          and the third byte ~x2 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a four-byte UTF-8 encoding"
          byte byte2 byte3)
         :found (msg "the byte ~x0" byte4)))
       (code (+ (ash (logand byte 7) 18)
                (ash (logand byte2 63) 12)
                (ash (logand byte3 63) 6)
                (logand byte4 63)))
       ((when (or (< code 65536) (> code 1114111)))
        (reterr-msg
         :where (position-to-msg parstate.position)
         :expected
         (msg
          "a value between 10000h and 10FFFFh ~
                                          UTF-8-encoded in the four bytes ~
                                          (~x0 ~x1 ~x2 ~x3)"
          byte byte2 byte3 byte4)
         :found (msg "the value ~x0" code)))
       (parstate (update-parstate->bytes bytes parstate))
       (parstate (update-parstate->position
                      (position-inc-column 1 parstate.position)
                      parstate))
       ((unless (< parstate.chars-read
                   (parstate->chars-length parstate)))
        (raise "Internal error: index ~x0 out of bound ~x1."
               parstate.chars-read
               (parstate->chars-length parstate))
        (reterr t))
       (parstate
            (update-parstate->char
                 parstate.chars-read
                 (make-char+position :char code
                                     :position parstate.position)
                 parstate))
       (parstate
            (update-parstate->chars-read (1+ parstate.chars-read)
                                         parstate))
       (parstate (update-parstate->size (- parstate.size 4)
                                        parstate)))
      (retok code parstate.position parstate))))
   (reterr-msg
        :where (position-to-msg parstate.position)
        :expected "a byte in the range 9-13 or 32-126 or 192-223"
        :found (msg "the byte ~x0" byte)))))

Theorem: nat-optionp-of-read-char.char?

(defthm nat-optionp-of-read-char.char?
  (b* (((mv acl2::?erp
            ?char? acl2::?pos ?new-parstate)
        (read-char parstate)))
    (nat-optionp char?))
  :rule-classes :rewrite)

Theorem: positionp-of-read-char.pos

(defthm positionp-of-read-char.pos
  (b* (((mv acl2::?erp
            ?char? acl2::?pos ?new-parstate)
        (read-char parstate)))
    (positionp pos))
  :rule-classes :rewrite)

Theorem: parstatep-of-read-char.new-parstate

(defthm parstatep-of-read-char.new-parstate
  (implies (parstatep parstate)
           (b* (((mv acl2::?erp
                     ?char? acl2::?pos ?new-parstate)
                 (read-char parstate)))
             (parstatep new-parstate)))
  :rule-classes :rewrite)

Theorem: parsize-of-read-char-uncond

(defthm parsize-of-read-char-uncond
  (b* (((mv acl2::?erp
            ?char? acl2::?pos ?new-parstate)
        (read-char parstate)))
    (<= (parsize new-parstate)
        (parsize parstate)))
  :rule-classes :linear)

Theorem: parsize-of-read-char-cond

(defthm parsize-of-read-char-cond
  (b* (((mv acl2::?erp
            ?char? acl2::?pos ?new-parstate)
        (read-char parstate)))
    (implies (and (not erp) char?)
             (<= (parsize new-parstate)
                 (1- (parsize parstate)))))
  :rule-classes :linear)