Parsing algorithms for general context-free grammars

Project contact: Keshav Pingali

Project description:

The Grammar Flow Graph (GFG) is a graphical representation of context-free grammars (CFGs) that plays the same role for CFGs that non-deterministic finite-state automata (NFA) play for regular grammars. Just as parsing problems for regular grammars can be formulated as path problems in the corresponding NFA, parsing problems for a CFG can be formulated as path problems in the corresponding GFG. The GFG and its use for parsing are described in the tech report by Pingali and Bilardi [1].
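To make the regular-grammar half of this analogy concrete, here is a minimal sketch of recognition as a path problem in an NFA: the word is accepted if some path from the start state to an accepting state spells it out. The transition-table encoding, the `EPS` label, and the example automaton for `(ab)*` are assumptions made for illustration; they are not taken from [1], and the sketch does not construct a GFG.

```python
from typing import Dict, Iterable, Set, Tuple

EPS = ""  # label used for epsilon transitions (an assumption of this sketch)

# Transition relation: (state, label) -> set of successor states.
Delta = Dict[Tuple[int, str], Set[int]]


def eps_closure(states: Iterable[int], delta: Delta) -> Set[int]:
    """All states reachable from `states` using only epsilon edges."""
    seen = set(states)
    stack = list(seen)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def nfa_accepts(delta: Delta, start: int, accepting: Set[int], word: str) -> bool:
    """True if some path from `start` to an accepting state is labeled by `word`."""
    current = eps_closure({start}, delta)
    for ch in word:
        moved: Set[int] = set()
        for s in current:
            moved |= delta.get((s, ch), set())
        current = eps_closure(moved, delta)
    return bool(current & accepting)


# Hypothetical NFA for (ab)*: state 0 is both the start and the accepting state.
EXAMPLE_DELTA: Delta = {(0, "a"): {1}, (1, "b"): {2}, (2, EPS): {0}}
```

For example, `nfa_accepts(EXAMPLE_DELTA, 0, {0}, "abab")` follows the path 0, 1, 2, 0, 1, 2, 0 and accepts, while `"aba"` ends in state 1 and is rejected.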

One of the algorithms described using the GFG is Earley's algorithm, which can parse any (possibly ambiguous) general context-free grammar. While Earley's algorithm is almost 50 years old, it is quite difficult to understand in the standard presentations. The GFG makes it much easier to understand this algorithm.
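For reference, a naive Earley recognizer in the standard item-set presentation (not the GFG formulation of [1]) can be sketched as follows. The grammar encoding, the function name `earley_recognize`, and the fixpoint loop used to handle empty productions are choices made for this sketch; it only answers yes or no, builds no parse trees, and makes no attempt at the optimizations an efficient implementation would need.

```python
from typing import Dict, List, Sequence, Set, Tuple

# A grammar maps each nonterminal to a list of right-hand sides (tuples of symbols).
Grammar = Dict[str, List[Tuple[str, ...]]]
# An Earley item: (lhs, rhs, dot position, index of the set where the item started).
Item = Tuple[str, Tuple[str, ...], int, int]


def earley_recognize(grammar: Grammar, start: str, tokens: Sequence[str]) -> bool:
    n = len(tokens)
    sets: List[Set[Item]] = [set() for _ in range(n + 1)]
    sets[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        changed = True
        while changed:  # naive fixpoint; also copes with empty productions
            changed = False
            for lhs, rhs, dot, origin in list(sets[i]):
                if dot < len(rhs) and rhs[dot] in grammar:
                    # Predict: start every production of the nonterminal after the dot.
                    new = {(rhs[dot], alt, 0, i) for alt in grammar[rhs[dot]]}
                elif dot == len(rhs):
                    # Complete: advance items in the origin set that were waiting for lhs.
                    new = {
                        (l2, r2, d2 + 1, o2)
                        for (l2, r2, d2, o2) in sets[origin]
                        if d2 < len(r2) and r2[d2] == lhs
                    }
                else:
                    continue  # dot is before a terminal; handled by the scan step
                if not new <= sets[i]:
                    sets[i] |= new
                    changed = True
        if i < n:
            # Scan: move the dot over the next input token.
            sets[i + 1] = {
                (lhs, rhs, dot + 1, origin)
                for (lhs, rhs, dot, origin) in sets[i]
                if dot < len(rhs) and rhs[dot] not in grammar and rhs[dot] == tokens[i]
            }
    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for (lhs, rhs, dot, origin) in sets[n]
    )


# Hypothetical test grammars: an ambiguous expression grammar and balanced parentheses.
AMBIGUOUS: Grammar = {"E": [("E", "+", "E"), ("a",)]}
BALANCED: Grammar = {"S": [(), ("(", "S", ")", "S")]}
```

With these grammars, `earley_recognize(AMBIGUOUS, "E", "a + a + a".split())` accepts even though the input has more than one parse, and `earley_recognize(BALANCED, "S", list("(()"))` rejects.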

Recently, a number of researchers have developed an alternative to Earley's algorithm for parsing general context-free grammars, called *parsing with derivatives*. A paper on this work was published in PLDI 2016 [2].
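Parsing with derivatives builds on Brzozowski's derivative of a regular expression: the derivative of a language with respect to a character c is the set of suffixes of its words that begin with c. The sketch below covers only that classical regular-expression case, as background. It is not the algorithm of [2]; extending the idea to recursive context-free grammars needs additional machinery such as laziness, memoization, and fixed points, all of which are omitted here. The class and function names are assumptions of this sketch.

```python
from dataclasses import dataclass


class Regex:
    """Base class for regular-expression nodes."""


@dataclass(frozen=True)
class Empty(Regex):  # matches no string at all
    pass


@dataclass(frozen=True)
class Eps(Regex):  # matches only the empty string
    pass


@dataclass(frozen=True)
class Char(Regex):
    c: str


@dataclass(frozen=True)
class Alt(Regex):
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Seq(Regex):
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Star(Regex):
    inner: Regex


def nullable(r: Regex) -> bool:
    """True if r matches the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Seq):
        return nullable(r.left) and nullable(r.right)
    return False  # Empty and Char


def derive(r: Regex, c: str) -> Regex:
    """Brzozowski derivative of r with respect to the character c."""
    if isinstance(r, Char):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(derive(r.left, c), derive(r.right, c))
    if isinstance(r, Seq):
        first = Seq(derive(r.left, c), r.right)
        # If the left part can be empty, c may also be consumed by the right part.
        return Alt(first, derive(r.right, c)) if nullable(r.left) else first
    if isinstance(r, Star):
        return Seq(derive(r.inner, c), r)
    return Empty()  # Empty and Eps have no non-empty words


def matches(r: Regex, word: str) -> bool:
    """Take one derivative per input character, then test for nullability."""
    for c in word:
        r = derive(r, c)
    return nullable(r)


AB_STAR = Star(Seq(Char("a"), Char("b")))  # hypothetical example: (ab)*
```

Here `matches(AB_STAR, "abab")` holds and `matches(AB_STAR, "aba")` does not. The derivatives are not simplified, so the expression grows with each character consumed.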

The proposed project has two options and you can choose one.

**Option 1:**

(1) Canonical parsing algorithms such as those for SLL(k), LL(k), SLR(k) and LR(k) grammars can be formulated in a unified way using the GFG. Formulate parsing with derivatives also in terms of the GFG. This will permit us to understand what is new and different about this parsing strategy.

(2) Earley's algorithm runs in O(n^3) time and O(n^2) space, where n is the number of terminals in the input string. Parsing with derivatives is also claimed to be cubic in running time. Implement efficient versions of both algorithms and study their relative efficiency in practice (naive versions of both algorithms can be implemented in Python in about 200 lines of code each). Ideally, your implementations of both algorithms would use the GFG so we can have a common representation of the grammar, but this depends on being able to solve (1). What are the pros and cons of each approach?
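One possible starting point for the experimental comparison is a small timing harness that runs a recognizer on inputs of increasing length and estimates the growth exponent from a log-log fit. This is only a sketch: the function names `time_recognizer` and `fitted_exponent` are hypothetical, the recognizer and input generator are supplied by the caller, and a serious study would also need real-world inputs and more careful measurement.

```python
import math
import time
from typing import Callable, List, Sequence, Tuple


def time_recognizer(
    recognize: Callable[[Sequence[str]], bool],
    make_input: Callable[[int], Sequence[str]],
    sizes: Sequence[int],
    repeats: int = 3,
) -> List[Tuple[int, float]]:
    """Return (n, best wall-clock seconds over `repeats` runs) for each input size n."""
    results = []
    for n in sizes:
        tokens = make_input(n)
        best = math.inf
        for _ in range(repeats):
            t0 = time.perf_counter()
            recognize(tokens)
            best = min(best, time.perf_counter() - t0)
        results.append((n, best))
    return results


def fitted_exponent(samples: Sequence[Tuple[int, float]]) -> float:
    """Least-squares slope of log(time) against log(n).

    A slope near 3 is consistent with cubic growth over the measured range.
    Timings are clamped to 1 ns so that log() is always defined.
    """
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(max(t, 1e-9)) for _, t in samples]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

Passing the same `make_input` and `sizes` to both implementations keeps the comparison on equal footing. The fitted exponent describes only the inputs measured and does not establish an asymptotic bound.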

**Option 2:**

(1) In [1], a number of standard parsing algorithms - Earley for general context-free grammars, SLL(k) parsing, and SLR(k) parsing - are described using inference rules. The paper also has handwritten proofs of correctness of these algorithms. The goal of this project is to mechanically verify these inference rules and prove the correctness of this formulation of standard parsing algorithms. You are free to use ACL2, Coq or some other theorem prover.

This would be a good project to do if you are taking William Cook's PL course this semester.

**Project deliverables and timelines:**

- (Option 1): A GFG description of parsing with derivatives, and highly optimized implementations of Earley and parsing with derivatives. An experimental study of the pros and cons of each approach, using real-world inputs.

  - (Nov 1): A clear description of parsing with derivatives.
  - (Nov 8): A go/no go decision about implementing parsing with derivatives using the GFG.
  - (Dec 6): Highly optimized implementation of both algorithms, and results of experimental study.
  - (Dec 6): Project report and code.

- (Option 2): A mechanically verified proof of correctness of the inference rules for Earley, SLL(k) and SLR(k) parsing given in [1]. If you find a bug in these rules, fix the inference rules.

  - (Nov 1): Decision about which algorithms you will verify and which theorem prover you will use.
  - (Dec 6): Mechanically verified proofs of those algorithms.
  - (Dec 6): Project report and code.

Papers:

- [1] Parsing with pictures. Keshav Pingali and Gianfranco Bilardi, UTCS tech report TR-2012.
- [2] On the complexity and performance of parsing with derivatives. Michael Adams, Celeste Hollenbeck, and Matt Might, PLDI 2016.