• Top
    • Documentation
    • Books
    • Boolean-reasoning
    • Projects
    • Debugging
    • Std
    • Proof-automation
    • Macro-libraries
    • ACL2
    • Interfacing-tools
    • Hardware-verification
    • Software-verification
      • Kestrel-books
        • Crypto-hdwallet
        • Apt
        • Error-checking
        • Fty-extensions
        • Isar
        • Kestrel-utilities
        • Set
        • Soft
        • C
        • Bv
        • Imp-language
        • Event-macros
        • Java
          • Atj
          • Aij
          • Language
            • Syntax
              • Grammar
              • Unicode-escapes
                • Uniescape-parse-constraints-p
                • Even-backslashes-tree-constraints-p
                • Len-of-string-of-prefix-of-unicode-input-character-trees
                • Uniescape-parse
                • Nonzero-uletters-after-p
                • Abs-unicode-escape
                • Even-backslashes-before-p
                • Uniescape-tree-constraints-p
                • Uniescape-process
                • Uniescape-candidate-valid-p
                • Uniescape-candidate-p
                • Abs-raw-input-character
                • Abs-hex-digit
                • Abs-unicode-input-character
                • Some-uniescape-candidate-invalid-p
                • Uniescapep
                • Abs-unicode-input-character-list
                • Uniescape-parse-p
                • Eligible-backslash-p
                • Unicode-input-character-tree-is-escape-p
              • Unicode-input-char
              • Escape-sequence
              • Identifiers
              • Primitive-types
              • Reference-types
              • Keywords
              • Unicode-characters
              • Integer-literals
              • String-literals
              • Octal-digits
              • Hexadecimal-digits
              • Decimal-digits
              • Binary-digits
              • Character-literals
              • Null-literal
              • Floating-point-literals
              • Boolean-literals
              • Package-names
              • Literals
            • Semantics
        • Bitcoin
        • Ethereum
        • Yul
        • Zcash
        • ACL2-programming-language
        • Prime-fields
        • Json
        • Syntheto
        • File-io-light
        • Cryptography
        • Number-theory
        • Lists-light
        • Axe
        • Builtins
        • Solidity
        • Helpers
        • Htclient
        • Typed-lists-light
        • Arithmetic-light
      • X86isa
      • Axe
      • Execloader
    • Math
    • Testing-utilities
  • Syntax

Unicode-escapes

First Java lexical translation step: Unicode escapes [JLS14:3.3].

A Java program is a finite sequence of Unicode characters satisfying a number of constraints. The first set of constraints is described in [JLS14:3.3], which also describes how a sequence of Unicode characters satisfying these constraints is turned into another sequence of Unicode characters, which is then subjected to further constraint checks and associated transformations.

The grammar rules in [JLS14:3.3] express part of this first set of constraints, but they are ambiguous if taken alone: one could always choose the raw-input-character alternative for unicode-input-character, without recognizing any unicode-escape, and therefore leave the original Unicode sequence unchanged. The English text in [JLS14:3.3] provides additional constraints.

One additional constraint in this first set is that, in order for a Unicode escape to be recognized (i.e. in order for unicode-escape to be chosen), the backslash that starts the escape must be preceded by an even number of backslashes. Without this constraint, it would be impossible, for example, to have the Java literal string "The Unicode escape \\u0020 is the ASCII space character.", where the double backslash is a (non-Unicode) escape sequence for a single backslash [JLS14:3.10.6]: without the constraint, this Java literal string would be equivalent to "The Unicode escape \ is the ASCII space character.". [JLS14:3.3] introduces the notion of `eligible' backslash as one preceded by an even number of backslashes (possibly none). In the example string above, the backslash just before the u is not eligible, because it is preceded by an odd number of backslashes.

For each eligible backslash, there are two cases: either the eligible backslash is followed by one or more u letters, or it is not. In the second case, there is no Unicode escape, because the grammar requires the presence of one or more u letters; the backslash must be presumably the start of a (non-Unicode) escape sequence [JLS14:3.10.6], e.g. the backslash in the Java string literal "A line.\nAnother line." is eligible, but there is no Unicode escape.

If instead an eligible backslash is followed by one or more u letters, there are two sub-cases: either the last u is followed by four hexadecimal digits, or it is not. In the second sub-case, the original sequence of Unicode characters is not a valid Java program; none of the (non-Unicode) escape sequences [JSL:3.10.6] has u following the backslash. In the first sub-case, we have a possible Unicode escape according to the grammar, and it must be recognized as such, thus removing the inherent grammatical ambiguity.

We formalize the processing of Unicode escapes via a function from lists of Unicode characters to lists of Unicode characters that recognizes the Unicode escapes according to the above rules and turns them into the corresponding single Unicode characters. This function is constructed as the composition of (i) a parser from lists of Unicode characters to lists of unicode-input-character trees and (ii) an abstractor from lists of unicode-input-character trees to lists of Unicode characters.

The parser always succeeds, even if there is an eligible backslash followed by one of more u letters but with the last u not followed by four hexadecimal digits (in which case, as noted above, the processing of Unicode escapes fails). In this case, the parser just leaves the characters as they are, without recognizing any Unicode escape (since there is not one). This makes the parser slightly more general, and perhaps useful in other circumstances. Presumably [JLS14] prescribes an error in this situation (as opposed to simply leaving the characters unchanged) because the resulting Unicode character sequence could never be a valid Java program anyhow, and so parsing can stop immediately instead of stopping later anyhow. Nonetheless, when we compose the parser with the abstractor (see above), we perform the check for that situation, and return an error if the situation occurs, as prescribed by [JLS14]. In other words, our formalization of Unicode escape processing is faithful to [JLS14], but its parser component is a bit more general.

Subtopics

Uniescape-parse-constraints-p
All the input/output constraints for the Unicode escape parser.
Even-backslashes-tree-constraints-p
Necessary condition for parsed trees to be Unicode escapes.
Len-of-string-of-prefix-of-unicode-input-character-trees
A theorem about the length of the string of a (proper) prefix of a list of unicode-input-character trees.
Uniescape-parse
Parse the Unicode escapes in a list of Unicode characters.
Nonzero-uletters-after-p
Check if, in a list of Unicode characters, the character at position pos is followed by a non-zero number of `u' letters.
Abs-unicode-escape
Abstract a unicode-escape tree to a Unicode character.
Even-backslashes-before-p
Check if, in a list of Unicode characters, the character at position pos is preceded by an even number of backslashes.
Uniescape-tree-constraints-p
Sufficient condition for parsed trees to be Unicode escapes.
Uniescape-process
Perform Java's first lexical translation step.
Uniescape-candidate-valid-p
Check if a candidate Unicode escape is valid.
Uniescape-candidate-p
Check if, in a list of Unicode characters, a Unicode escape may start at position pos.
Abs-raw-input-character
Abstract a raw-input-character tree to a Unicode character.
Abs-hex-digit
Abstract a hex-digit tree to a natural number below 16.
Abs-unicode-input-character
Abstract a unicode-input-character tree to a Unicode character.
Some-uniescape-candidate-invalid-p
Check if a list of Unicode characters includes an invalid Unicode escape candidate.
Uniescapep
Check if, in a list of Unicode characters, there is a Unicode escape at position pos.
Abs-unicode-input-character-list
Abstract a list of unicode-input-character trees to a list of Unicode characters.
Uniescape-parse-p
Check if a list of Unicode characters can be parsed into a list of trees.
Eligible-backslash-p
Check if, in a list of Unicode characters, the character at position pos is an eligible backslash.
Unicode-input-character-tree-is-escape-p
Check if a unicode-input-character tree is a Unicode escape.