Project contact: Keshav Pingali
Project description:
The Grammar Flow Graph (GFG) is a graphical representation of
context-free grammars (CFGs) that plays the same role for CFGs
that non-deterministic finite-state automata (NFA) play for
regular grammars. Just as parsing problems for regular grammars
can be formulated as path problems in the corresponding NFA,
parsing problems for a CFG can be formulated as path problems in
the corresponding GFG. The GFG and its use for parsing are
described in the tech report by Pingali and Bilardi [1].
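To make the graph concrete, here is a minimal sketch of how a GFG might be built from a grammar, based on one reading of the construction in [1]: each nonterminal gets a start node and an end node, each dotted item of a production gets a node, terminals give labeled scan edges, and nonterminals give an epsilon call edge into the callee's start node plus an epsilon return edge out of its end node. The function name build_gfg, the tuple encoding of nodes, and the dictionary encoding of the grammar are assumptions made for illustration and do not come from the report.

```python
def build_gfg(grammar):
    """Build GFG edges for a grammar given as
    {nonterminal: [tuple_of_symbols, ...]}.

    Returns a list of (source, label, target) edges, where label is a
    terminal for scan edges and None for epsilon edges.
    Node encoding (hypothetical, chosen for this sketch):
      ("start", A), ("end", A), ("item", A, rhs, dot)
    """
    edges = []
    for lhs, productions in grammar.items():
        for rhs in productions:
            def item(dot, lhs=lhs, rhs=rhs):
                return ("item", lhs, rhs, dot)

            # Entry edge: start node of lhs to the item with the dot at 0.
            edges.append((("start", lhs), None, item(0)))
            for dot, sym in enumerate(rhs):
                if sym in grammar:
                    # Nonterminal: call edge in, return edge out.
                    edges.append((item(dot), None, ("start", sym)))
                    edges.append((("end", sym), None, item(dot + 1)))
                else:
                    # Terminal: scan edge labeled with the terminal.
                    edges.append((item(dot), sym, item(dot + 1)))
            # Exit edge: fully dotted item to the end node of lhs.
            edges.append((item(len(rhs)), None, ("end", lhs)))
    return edges


# Example grammar: S -> a S b | (empty)
example_edges = build_gfg({"S": [("a", "S", "b"), ()]})
```

Not every path from the start node to the end node of the start symbol corresponds to a derivation: call and return edges have to be matched, much like procedure calls and returns, so a parser built on this graph needs to track that pairing.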
One of the algorithms described using the GFG is Earley's
algorithm, which can parse any (possibly ambiguous) general
context-free grammar. While Earley's algorithm is almost 50
years old, it is quite difficult to understand in the standard
presentations. The GFG makes it much easier to understand this
algorithm.
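For orientation, the following is a minimal sketch of a naive Earley recognizer in the conventional chart-based style, not the GFG-based formulation of [1]. It only answers whether the input is in the language and builds no parse trees. The function name earley_recognize and the grammar encoding (a dictionary from nonterminals to lists of right-hand-side tuples) are assumptions made for this example.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if tokens can be derived from start.

    grammar: {nonterminal: [tuple_of_symbols, ...]}; any symbol that is
    not a key of grammar is treated as a terminal.
    Chart items are (lhs, rhs, dot, origin).
    """
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}

    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            new_items = []
            if dot == len(rhs):
                # Completer: advance every item that was waiting for lhs
                # at the position where this item started.
                new_items = [(l, r, d + 1, o)
                             for (l, r, d, o) in chart[origin]
                             if d < len(r) and r[d] == lhs]
            elif rhs[dot] in grammar:
                # Predictor: start every production of the next nonterminal.
                sym = rhs[dot]
                new_items = [(sym, r, 0, i) for r in grammar[sym]]
                # If sym was already completed at this position (it derived
                # the empty string), advance over it right away.
                if any((sym, r, len(r), i) in chart[i] for r in grammar[sym]):
                    new_items.append((lhs, rhs, dot + 1, origin))
            elif i < n and tokens[i] == rhs[dot]:
                # Scanner: consume the matching terminal.
                chart[i + 1].add((lhs, rhs, dot + 1, origin))
            for item in new_items:
                if item not in chart[i]:
                    chart[i].add(item)
                    worklist.append(item)

    return any((start, rhs, len(rhs), 0) in chart[n]
               for rhs in grammar[start])


# Ambiguous example grammar: E -> E + E | a
expr_grammar = {"E": [("E", "+", "E"), ("a",)]}
accepted = earley_recognize(expr_grammar, "E", list("a+a+a"))
rejected = earley_recognize(expr_grammar, "E", list("a+"))
```

The sketch accepts left-recursive and ambiguous grammars and handles empty productions through the extra check in the predictor. It makes no attempt at efficiency.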
Recently, a number of researchers have developed an alternative
to Earley's algorithm for parsing general context-free grammars,
called parsing with derivatives. A paper on this work was
published in PLDI 2016 [2].
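The underlying idea is easiest to see on regular expressions, where the derivative of a language with respect to a character c is the set of suffixes of its strings that begin with c. The sketch below is a matcher for regular expressions only; it is not the algorithm of [2], which extends the idea to context-free grammars and needs additional machinery (laziness, memoization and fixed points) to cope with recursion. All class and function names here are assumptions made for illustration.

```python
from dataclasses import dataclass


class Re:
    """Base class for regular-expression nodes."""


@dataclass(frozen=True)
class Empty(Re):      # matches no string at all
    pass


@dataclass(frozen=True)
class Eps(Re):        # matches only the empty string
    pass


@dataclass(frozen=True)
class Chr(Re):        # matches the single character c
    c: str


@dataclass(frozen=True)
class Alt(Re):        # left | right
    left: Re
    right: Re


@dataclass(frozen=True)
class Seq(Re):        # first followed by second
    first: Re
    second: Re


@dataclass(frozen=True)
class Star(Re):       # zero or more repetitions of inner
    inner: Re


def nullable(r):
    """True if r matches the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Seq):
        return nullable(r.first) and nullable(r.second)
    return False  # Empty and Chr


def derive(r, c):
    """Derivative of r with respect to the character c."""
    if isinstance(r, Chr):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(derive(r.left, c), derive(r.right, c))
    if isinstance(r, Seq):
        head = Seq(derive(r.first, c), r.second)
        # If first can be empty, c may also be consumed by second.
        return Alt(head, derive(r.second, c)) if nullable(r.first) else head
    if isinstance(r, Star):
        return Seq(derive(r.inner, c), r)
    return Empty()  # Empty and Eps have no non-empty strings


def matches(r, s):
    """Take one derivative per input character, then test nullability."""
    for c in s:
        r = derive(r, c)
    return nullable(r)


# Example: (ab)*
ab_star = Star(Seq(Chr("a"), Chr("b")))
```

Here matches(ab_star, "abab") holds and matches(ab_star, "aba") does not. The derived expressions grow with every step because nothing is simplified, which hints at why an efficient implementation takes real care.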
The proposed project has two options and you can choose one.
Option 1:
(1) Canonical parsing algorithms such as those for SLL(k),
LL(k), SLR(k) and LR(k) grammars can be formulated in a unified
way using the GFG. Formulate parsing with derivatives also in
terms of the GFG. This will permit us to understand what is new
and different about this parsing strategy.
(2) Earley's algorithm runs in O(n^3) time and O(n^2) space,
where n is the number of terminals in the input string. Parsing
with derivatives is also claimed to be cubic in running time.
Implement efficient versions of both algorithms and study their
relative efficiency in practice (naive versions of both
algorithms can be implemented in Python in about 200 lines of
code each). Ideally, your implementations of both algorithms
would use the GFG so we can have a common representation of the
grammar, but this depends on being able to solve (1). What are
the pros and cons of each approach?
Option 2:
(1) In [1], a number of
standard parsing algorithms - Earley for general context-free
grammars, SLL(k) parsing, and SLR(k) parsing - are described
using inference rules. The paper also has handwritten proofs
of correctness of these algorithms. The goal of this project
is to mechanically verify these inference rules and prove the
correctness of this formulation of standard parsing
algorithms. You are free to use ACL2, Coq or some other
theorem prover.
This would be a good project to do if you are taking William
Cook's PL course this semester.
Project deliverables and timelines:
- (Option 1): A GFG description of parsing with derivatives,
and highly optimized implementations of Earley and parsing
with derivatives. An experimental study of the pros and cons
of each approach, using real-world inputs.
  - (Nov 1): A clear description of parsing with derivatives.
  - (Nov 8): A go/no-go decision about implementing parsing
    with derivatives using the GFG.
  - (Dec 6): Highly optimized implementations of both
    algorithms, and results of the experimental study.
  - (Dec 6): Project report and code.
- (Option 2): A mechanically verified proof of correctness
of the inference rules for Earley, SLL(k) and SLR(k) grammars
given in [1]. If you find a bug in these rules, fix the
inference rules.
  - (Nov 1): Decision about which algorithms you will verify
    and which theorem prover you will use.
  - (Dec 6): Mechanically verified proofs of those
    algorithms.
  - (Dec 6): Project report and code.
Papers:
- [1] Parsing with pictures. Keshav Pingali and Gianfranco
  Bilardi. UTCS tech report TR-2012.
- [2] On the complexity and performance of parsing with
  derivatives. Michael Adams, Celeste Hollenbeck, and Matt
  Might. PLDI 2016.