Grammar of the Java language [JLS].
[JLS] presents the grammar of Java using the notation in [JLS:2.4],
which is similar to EBNF (Extended Backus-Naur Form).
But since we currently have a verified ABNF grammar parser, we use ABNF (Augmented Backus-Naur Form) to formalize the Java grammar.
The ABNF grammar of Java is in the files
lexical-grammar.abnf and syntactic-grammar.abnf
in this directory;
this splitting corresponds to [JLS:2.2] and [JLS:2.3].
Note that these files, according to ABNF,
must have their lines terminated by carriage-return and line-feed pairs:
see the notes here for details about this.
ABNF is a little different from EBNF.
A difference is that EBNF has a construct for syntactic exception
(e.g. consonant = letter - vowel),
while ABNF does not.
The notation in [JLS:2.4]
has a syntactic exception construct (but not)
that corresponds to EBNF's - construct.
However, the Java grammar alone is ambiguous anyhow,
and we need extra-grammatical predicates to formalize Java syntax:
we use those to formalize the syntactic exceptions in the Java grammar,
since ABNF does not capture them.
While ABNF lacks syntactic exceptions,
it has constructs that are not in EBNF or in the notation in [JLS:2.4],
which actually allow us to capture more constraints in the grammar,
or the same constraints slightly more concisely.
In particular, ABNF has value range alternatives,
which allow us to define RawInputCharacter [JLS:3.3]
without using informal prose.
ABNF also has case-insensitive string terminal notations,
which allow us to list the letters just once
in the definition of HexDigit [JLS:3.3],
and similarly to express certain rules more concisely.
Besides the rule for RawInputCharacter mentioned above,
the Java grammar in [JLS] includes two other rules
defined entirely using informal prose,
namely the rules for JavaLetter and JavaLetterOrDigit.
Even though ABNF includes prose notation,
we prefer to avoid its use altogether,
and instead define these two nonterminals
as synonyms of RawInputCharacter,
delegating a more precise definition to extra-grammatical predicates
(which is what the prose notation does anyways).
It is obvious that the Java letters and numbers defined by those two rules
are Unicode characters,
since their prose definitions are `any Unicode character that ...',
and the prose definition of RawUnicodeCharacter is
`any Unicode character'.
Avoiding prose notation in our ABNF grammar of Java
lets us prove properties such as the fact that
all the terminal strings generated by the grammar
consist of Unicode characters;
this would not be possible with prose notations,
which are completely unconstrained.
The Java grammar in [JLS] uses camelcase nonterminals.
Since rule names (i.e. nonterminals) are case-insensitive in ABNF,
we systematically turn those camelcase nonterminals
into dash-separated lowercase nonterminals.
In the grammar files,
we use ABNF comments for the but not syntactic exceptions
in the Java grammar in [JLS];
this is just for documentation,
because, as noted above, we formalize these syntactic exceptions
via extra-grammatical predicates.
We also use ABNF comments for the prose notations in [JLS],
which, as explained above, we express as non-prose notations in ABNF.
We also use ABNF comments to separate the files into sections,
and to reference the parts of [JLS] where the the rules appear.
The ABNF notation is documented in RFC 5234 and RFC 7405. The correspondence with the notation in [JLS] should be clear.
- Recognize terminated ABNF trees whose root is the given rule name,
for the Java grammar.
- The Java grammar, in ABNF.
- Lift abnf-tree-with-root-p to lists.
- The Java syntactic grammar, in ABNF.
- The Java lexical grammar, in ABNF.