Parser
A parser of C into our abstract syntax.
We provide a parser to turn C code into
the abstract syntax defined in abstract-syntax.
The parser is based on our C concrete syntax formulation
in concrete-syntax.
In line with the rationale for our abstract syntax,
the parser preserves much of the information from the concrete syntax.
Currently the parser handles all C code constructs after preprocessing;
our parser does not do any preprocessing.
We plan to extend our abstract syntax with some preprocessing constructs,
and accordingly extend our parser to recognize and preserve those.
We may also develop our own C preprocessor in the future.
Parsing C, even after preprocessing, is notoriously complicated.
There are syntactic ambiguities stemming from the fact that
an identifier may be an expression or a type name.
This is often addressed by performing
some static semantic analysis during parsing,
in order to tell apart identifier expressions and identifier type names.
Our parser instead parses the ambiguous constructs
into explicit representations of ambiguous constructs:
see abstract-syntax.
Our approach avoids the static semantic analysis during parsing,
at the cost of more complicated parsing logic,
but we prefer the cleaner separation of concerns.
The current implementation of our parser
does not capture all ambiguous constructs yet.
It is possible that our parser may reject some valid C code.
However, we plan to cover all ambiguous constructs soon.
Our parser uses recursive descent,
both for lexing and for parsing proper.
The parser is closely based on the ABNF grammar in grammar,
which should be consulted alongside the parser code.
Since that grammar is left-recursive,
we perform the usual left recursion elimination.
Although currently lexing should be context-independent
(i.e. it should be possible to lex the code and then parse it),
our parser is written so that lexing is called on the fly.
This makes it possible to accommodate context-dependent lexing,
which may be needed as we add support for some preprocessing constructs.
Our parser uses error-value tuples to handle errors; see that documentation page.
The current parser is amenable to
returning more informative error messages,
but we have already put some effort into doing that.
This parser is currently not verified, for expediency.
We plan to go back and work on verifying, or synthesizing,
components of this parser, and ideally eventually the whole parser.
This work will be based on our ABNF library and tools. Even better, we may investigate generating the parser automatically
from the grammar with suitable additional information;
The aforementioned ABNF library already has some parsing generation tools,
but they are fairly simple and preliminary,
so they would need significant extensions.
The parser is amenable to some optimizations.
For now we have favored simplicity and regularity,
but if performance turns out to be important,
we can optimize the implementation in some respects.
Even better, we could investigate applying optimizing transformations
to the current parser implementation,
or perhaps to a simpler and higher-level implementation;
this could be part of the idea of generating the parser automatically,
mentioned above.
Subtopics
- Parse-exprs/decls/stmts
- Parse expressions, declarations, statements, and related entities.
- Parstate
- Fixtype of parser states.
- Check-full-ppnumber
- Check that the numerical constant just read
is a full preprocessing number.
- Read-char
- Read a character.
- Lex-oct-iconst-/-dec-fconst
- Lex-lexeme
- Lex a lexeme.
- Lex-identifier/keyword
- Lex an identifier or keyword.
- Parse-external-declaration
- Parse an external declaration.
- Parse-cast-expression
- Parse a cast expression.
- Lex-isuffix-if-present
- Lex an integer suffix, if present.
- Parse-postfix-expression
- Parse a postfix expression.
- Lex-hex-iconst/fconst
- Lex a hexadecimal integer or floating constant.
- Lex-dec-iconst/fconst
- Lex a decimal integer or floating constant.
- Lex-block-comment
- Lex a block comment.
- Lex-escape-sequence
- Lex an escape sequence.
- Token
- Fixtype of tokens.
- Read-token
- Read a token.
- Lex-*-hexadecimal-digit
- Lex zero or more hexadecimal digits, as many as available.
- Lex-*-c-char
- Lex zero or more characters and escape sequences
in a character constant.
- Lex-*-s-char
- Lex zero or more characters and escape sequences
in a string literal.
- Parse-specifier/qualifier
- Parse a specifier or qualifier.
- Lex-*-digit
- Lex zero or more (decimal) digits, as many as available.
- Lex-line-comment
- Lex a line comment.
- Char-to-msg
- Represent an optional character as a message.
- Parstate$
- Parse-pointer
- Parse a pointer.
- Parse-declaration-specifier
- Parse a declaration specifier.
- Lex-iconst/fconst
- Lex an integer or floating constant.
- Parse-expression-or-type-name
- Parse an expression or a type name.
- Parse-external-declaration-list
- Parse a list of one or more external declarations.
- Parse-declaration-specifiers
- Parse a list of one or more declaration specifiers.
- Lex-dec-expo-if-present
- Lex a decimal exponent, if present.
- Parse-asm-name-specifier
- Parse an assembler name specifier.
- Lex-character-constant
- Lex a character constant.
- Parse-primary-expression
- Parse a primary expression.
- Parse-expression-rest
- Parse the rest of an expression.
- Lex-dec-fconst
- Lex a decimal floating constant.
- Token-unary-expression-start-p
- Check if an optional token may start a unary expression.
- Parse-declarator-or-abstract-declarator
- Parse a declarator or an abstract declarator.
- Parse-asm-goto-labels
- Parse zero or more assembler goto labels.
- Parse-asm-clobbers
- Parse zero or more assembler clobbers, separated by commas.
- Lex-stringlit
- Lex a string literal.
- Lex-non-octal-digit
- Lex a non-octal digit.
- Position
- Fixtype of positions.
- Lexeme
- Fixtype of lexemes.
- Lex-fsuffix-if-present
- Lex a floating suffix, if present.
- Init-parstate
- Initialize the parser state.
- Parse-type-qualifier-list
- Parse a list of one or more type qualifiers.
- Lex-hexadecimal-digit
- Lex a hexadecimal digit.
- Parse-translation-unit
- Parse a translation unit.
- Token-postfix-expression-rest-start-p
- Check if an optional token may start
the rest of a postfix expression.
- Parse-?-asm-name-specifier
- Parse an optional assembler name specifier.
- Parse-postfix-expression-rest
- Parse the rest of a postfix expression.
- Parse-expression
- Parse an expression.
- Lex-sign-if-present
- Lex a sign, if present.
- Lex-dec-expo
- Lex a decimal exponent.
- Lex-bin-expo
- Lex a binary exponent.
- Parse-*-stringlit
- Parse a list of zero or more string literals.
- Parse-statement
- Parse a statement.
- Parse-fileset
- Parse a file set.
- Make-expr-unary-with-preinc/predec-ops
- Apply to an expression
all the pre-increment and pre-decrement operators in a list.
- Token-type-specifier-keyword-p
- Check if an optional token is a type specifier
that consists of a single keyword.
- Token-option
- Fixtype of optional tokens.
- Lexeme-option
- Fixtype of optional lexemes.
- Parse-*-attribute-specifier
- Parse zero or more attribute specifiers.
- Parse-initializer-list
- Parse a list of one or more initializers.
- Parse-file
- Parse (the data bytes of) a file.
- Unread-token
- Unread a token.
- Parse-array/function-declarator
- Parse an array or function declarator.
- Span
- Fixtype of spans.
- Read-stringlit
- Read a string literal token.
- Lex-hex-quad
- Lex a quadruple of hexadecimal digits.
- Unread-char
- Unread a character.
- Token-type-qualifier-p
- Check if an optional token is a type qualifier.
- Read-identifier
- Read an identifier token.
- Parse-*-asm-qualifier
- Parse zero or more assembler qualifiers.
- Parse-attribute-name
- Parse an attribute name.
- Parse-argument-expressions
- Parse zero or more argument expressions.
- Make-expr-cast/add-or-cast/sub-ambig
- Create an ambiguous cast expression based on
a token that is an additive operator.
- Read-punctuator
- Read a specific punctuator token.
- Parse-assignment-expression
- Parse an assignment expression.
- Parse-asm-clobber
- Parse an assembler clobber.
- Token-struct-declaration-start-p
- Check if an optional token may start a structure declaration.
- Token-specifier/qualifier-start-p
- Check if an optional token may start a specifier or qualifier.
- Token-primary-expression-start-p
- Check if an optional token may start a primary expression.
- Token-function-specifier-p
- Check if an optional token is a function specifier.
- Reterr-msg
- Return an error consisting of a message
with information about what was expected and what was found where.
- Read-keyword
- Read a specific keyword token.
- Parse-*-increment/decrement
- Parse zero or more increment and decrement operators.
- Parse-direct-abstract-declarator
- Parse a direct abstract declarator.
- Parse-declaration-or-statement
- Parse a declaration or a statement.
- Char+position
- Fixtype of pairs each consisting of a character and a position.
- Update-parstate->tokens-unread
- Update-parstate->tokens-read-len
- Update-parstate->chars-unread
- Token+span
- Fixtype of pairs each consisting of a token and a span.
- Token-expression-start-p
- Check if an optional token may start an expression.
- Backtrack-checkpoint
- Backtrack to the latest checkpoint.
- Update-parstate->tokens-read
- Update-parstate->checkpoints
- Update-parstate->chars-read
- Token-to-msg
- Represent a token as a message.
- Update-parstate->position
- Parse-parameter-declaration
- Parse a parameter declaration.
- Parse-argument-expressions-rest
- Parse the rest of one or more argument expressions.
- Parse-declaration
- Parse a declaration.
- Clear-checkpoint
- Clear the latest checkpoint.
- Update-parstate->bytes
- Parse-unary-expression
- Parse a unary expression.
- Parse-generic-associations-rest
- Parse zero or more reamaining generic associations.
- Update-parstate->size
- Update-parstate->gcc
- Token-to-type-specifier-keyword
- Map a token that is a type specifier consisting of a single keyword
to the corresponding type specifier.
- Token-declaration-specifier-start-p
- Check if an optional token may start a declaration specifier.
- Parsize
- Size of the parsing state.
- Parse-direct-abstract-declarator-rest
- Parse the rest of a direct abstract declartor.
- Parse-designator-list
- Parse a designator list.
- Unread-tokens
- Unread a specified number of tokens.
- Unread-chars
- Unread a specified number of characters.
- Token-designation?-initializer-start-p
- Check if an optional token may start
an initializer optionally preceded by a designation.
- Token-abstract-declarator-start-p
- Check if an optional token may start an abstract declarator.
- Record-checkpoint
- Record a checkpoint for possible backtracking.
- Parse-?-asm-output-operands
- Parse zero or more assembler output operands, separated by commas.
- Parse-?-asm-input-operands
- Parse zero or more assembler input operands, separated by commas.
- Parse-struct-declaration
- Parse a structure declaration.
- Parse-specifier-qualifier-list
- Parse a list of one or more specifiers and qualifiers.
- Parse-parameter-declaration-list
- Parse a list of one or more parameter declarations.
- Parse-constant-expression
- Parse a constant expression.
- Parse-conditional-expression
- Parse a conditional expression.
- Token-type-specifier-start-p
- Check if an optional token may start a type specifier.
- Parse-static-assert-declaration
- Parse a static assert declaration.
- Parse-fileset-loop
- Parse-direct-declarator
- Parse a direct declarator.
- Parse-attribute-parameters
- Parse attribute parameters.
- Token-unary-operator-p
- Check if an optional token is a unary operator.
- Token-to-unary-operator
- Map a token that is a unary operator
to the corresponding unary operator.
- Token-to-type-qualifier
- Map a token that is a type qualifier
to the correspoding type qualifier.
- Token-storage-class-specifier-p
- Check if an optional token is a storage class specifier.
- Token-punctuatorp
- Check if a token is a given punctuator.
- Token-direct-abstract-declarator-start-p
- Check if an optional token may start a direct abstract declarator.
- Token-declarator-start-p
- Check if an optional token may start a declarator.
- Parse-direct-declarator-rest
- Parse the rest of a direct declarator.
- Token-to-storage-class-specifier
- Map a token that is a storage class specifier
to the correspoding storage class specifier.
- Token-to-assignment-operator
- Map a token that is an assignment operator
to the corresponding assignment operator.
- Token-to-asm-qualifier
- Map a token that is an assembler qualifier
to the corresponding assembler qualifier.
- Token-struct-declarator-start-p
- Check if an optional token may start a structure declarator.
- Token-keywordp
- Check if a token is a given keyword.
- Token-initializer-start-p
- Check if an optional token may start an initializer.
- Token-direct-declarator-start-p
- Check if an optional token may start a direct declarator.
- Token-assignment-operator-p
- Check if an optional token is an assignment operator.
- Parstate->tokens-unread
- Parse-struct-or-union-specifier
- Parse or structure or union specifier.
- Parse-enumerator-list
- Parse a list of one or more enumerators.
- Parse-designation?-initializer
- Parse an initializer with an optional designation.
- Parse-block-item
- Parse a block item.
- Token-type-name-start-p
- Check if an optional token may start a type name.
- Token-to-function-specifier
- Map a token that is a function specifier
to the corresponding function specifier.
- Token-preinc/predec-operator-p
- Check if an optional token is
a preincrement or predecrement operator.
- Token-multiplicative-operator-p
- Check if an optional token is a multiplicative operator.
- Token-designator-start-p
- Check if an optional token may start a designator.
- Token-designation-start-p
- Check if an optional token may start a designation.
- Parstate->tokens-read-len
- Parstate->tokens-read
- Parstate->checkpoints
- Parstate->chars-unread
- Parstate->chars-read
- Parstate->bytes
- Parse-initializer
- Parse an initializer.
- Parse-generic-association
- Parse a generic association.
- Parse-declaration-list
- Parse a list of one or more declarations.
- Parse-attribute-specifier
- Parse an attribute specifier.
- Parse-asm-output-operands
- Parse one or more assembler output operands, separated by commas.
- Token-to-relational-operator
- Map a token that is a relational operator
to the corresponding relational operator.
- Token-to-preinc/predec-operator
- Map a token that is a preincrement or predecrement operator
to the corresponding preincrement or predecrement operator.
- Token-to-multiplicative-operator
- Map a token that is a multiplicative operator
to the corresponding additive operator.
- Token-relational-operator-p
- Check if an optional token is a relational operator.
- Token-equality-operator-p
- Check if an optional token is an equality operator.
- Token-asm-qualifier-p
- Check if an optional token is an assembler qualifier.
- Token-additive-operator-p
- Check if an optional token is an additive operator.
- Span-join
- Join two spans.
- Parstate->size
- Parstate->position
- Parse-compound-literal
- Parse a compound literal.
- Parse-asm-input-operands
- Parse one or more assembler input operands, separated by commas.
- Token-to-equality-operator
- Map a token that is an equality operator
to the corresponding equality operator.
- Token-to-additive-operator
- Map a token that is an additive operator
to the corresponding additive operator.
- Token-shift-operator-p
- Check if an optional token is a shift operator.
- Position-inc-line
- Increment a position by a number of lines.
- Position-inc-column
- Increment a position by a number of columns.
- Parstate->gcc
- Parse-type-name
- Parse a type name.
- Parse-struct-declarator-list
- Parse a list of one or more structure declarator.
- Parse-struct-declaration-list
- Parse a list of one or more structure declarations.
- Parse-relational-expression-rest
- Parse the rest of a relational expression.
- Parse-multiplicative-expression-rest
- Parse the rest of a multiplicative expression.
- Parse-logical-or-expression-rest
- Parse the rest of a logical disjunction expression.
- Parse-logical-and-expression-rest
- Parse the rest of a logical conjunction expression.
- Parse-inclusive-or-expression-rest
- Parse the rest of an inclusive disjunction expression.
- Parse-exclusive-or-expression-rest
- Parse the rest of an exclusive disjunction expression.
- Parse-equality-expression-rest
- Parse the rest of an equality expression.
- Parse-array/function-abstract-declarator
- Parse an array or function abstract declarator.
- Parse-additive-expression-rest
- Parse the rest of an additive expression.
- Token-to-shift-operator
- Map a token that is a shift operator
to the corresponding shift operator.
- Parse-struct-declarator
- Parse a structure declarator.
- Parse-shift-expression-rest
- Parse the rest of a shift expression.
- Parse-member-designor
- Parse a member designator.
- Parse-init-declarator-list
- Parse a list of one or more initializer declarators.
- Parse-init-declarator
- Parse an initializer declarator.
- Parse-and-expression-rest
- Parse the rest of a conjunction expression.
- Parse-alignment-specifier
- Parse an alignment specifier.
- Position-to-msg
- Represent a position as a message.
- Parse-shift-expression
- Parse a shift expression.
- Parse-relational-expression
- Parse a relational expression.
- Parse-multiplicative-expression
- Parse a multiplicative expression.
- Parse-logical-or-expression
- Parse a logical disjunction expression.
- Parse-logical-and-expression
- Parse a logical conjunction expression.
- Parse-inclusive-or-expression
- Parse an inclusive disjunction expression.
- Parse-exclusive-or-expression
- Parse an exclusive disjunction expression.
- Parse-equality-expression
- Parse an equality expression.
- Parse-enum-specifier
- Parse an enumeration specifier.
- Parse-block-item-list
- Parse a list of one or more block items.
- Parse-attribute-list
- Parse a list of one or more attributes, separated by commas.
- Parse-and-expression
- Parse a conjunction expression.
- Parse-additive-expression
- Parse an additive expression.
- Parse-abstract-declarator
- Parse an abstract declarator.
- Char+position-list
- Fixtype of lists of
pairs each consisting of a character and a position.
- Token+span-list
- Fixtype of lists of pairs each consisting of a token and a span.
- Span-to-msg
- Represent a span as a message.
- Position-init
- Initial position in a file.
- Parstate-fix
- Parse-member-designor-rest
- Parse the rest of a member designator.
- Parse-designator
- Parse a designator.
- Parse-declarator
- Parse a declarator.
- Parse-attribute
- Parse an attribute.
- Irr-token
- An irrelevant token.
- Irr-span
- An irrelevant span.
- Irr-position
- An irrelevant position.
- Irr-lexeme
- An irrelevant lexeme.
- Token-list
- Fixtype of lists of tokens.
- To-parstate$
- Parse-enumerator
- Parse an enumerator.
- Parse-asm-output-operand
- Parse an assembler output operand.
- Parse-asm-input-operand
- Parse an assembler input operand.