CS380C: Assignment 1

Recursive-Descent Parser and SaM Code Generator

Assigned: Monday, September 7th

Due: Sunday, September 20th at 11:59 PM



Background

The first assignment involves implementing a compiler that takes a program written in a simple C-like language called Bali (described below) as input and generates code for a stack machine named SaM (described below). The compiler will use a recursive-descent parser. The assignment is intended to help you better understand recursive-descent parsing and get a feel for what it means to implement a compiler.

To learn more about the SaM stack machine and parsing, please refer to the following lecture material:

  1. Stack Machines: SaM
  2. Parsing
  3. Recursive-descent parsing and code generation

1. Building the Bali compiler

Create a handwritten recursive-descent parser and SaM code generator for the Bali language. For lexical analysis, you will use the SamTokenizer. The compiler should take a file containing a Bali program as input and produce another file containing the SaM program that executes the Bali program.

1.1 Grammar

The following is the grammar specification of the Bali language.

PROGRAM    -> METH_DECL*

METH_DECL  -> TYPE ID '(' FORMALS? ')' BODY
FORMALS    -> TYPE ID (',' TYPE ID)*
TYPE       -> int

BODY       -> '{' VAR_DECL*  STMT* '}'
VAR_DECL   -> TYPE ID ('=' EXP)? (',' ID ('=' EXP)?)* ';'

STMT       -> ASSIGN ';'
          | return EXP ';'
          | if '(' EXP ')' STMT else STMT
          | while '(' EXP ')' STMT
          | break ';'
          | BLOCK
          | ';'

BLOCK      -> '{' STMT* '}'
ASSIGN     -> LOCATION '=' EXP
LOCATION   -> ID
METHOD     -> ID

EXP        -> LOCATION
          | LITERAL
          | METHOD '(' ACTUALS? ')'
          | '('EXP '+' EXP')'
          | '('EXP '-' EXP')'
          | '('EXP '*' EXP')'
          | '('EXP '/' EXP')'
          | '('EXP '&' EXP')'
          | '('EXP '|' EXP')'
          | '('EXP '<' EXP')'
          | '('EXP '>' EXP')'
          | '('EXP '=' EXP')'
          | '(''-' EXP')'
          | '(''!' EXP')'
          | '(' EXP ')'

ACTUALS    -> EXP (',' EXP)*

LITERAL    -> INT | true | false

INT        -> [0-9]+
ID         -> [a-zA-Z] ( [a-zA-Z] | [0-9] | '_' )*


1.2 Template

You can use the template below to get started. The template shows various ways the SaM tokenizer can be used and contains methods to help you get started on the project. It also contains many TODOs that you will need to implement. It starts by visiting all the methods in the program using the getMethod() method, which contains the logic to accept a valid method declaration. The getExp() method accepts a valid expression and is mostly left blank. In this method, the implementation should ensure that the following invariant is always maintained: after the code generated for an expression has executed, the result of that expression is on top of the stack.

NOTE: The template is provided to give you some initial idea on your implementation and you're free to modify the template or develop in your own style.

package assignment1;

import edu.cornell.cs.sam.io.SamTokenizer;
import edu.cornell.cs.sam.io.Tokenizer;
import edu.cornell.cs.sam.io.Tokenizer.TokenType;

public class BaliCompiler {

    static String compiler(String fileName) {
        // returns SaM code for program in file
        try {
            SamTokenizer f = new SamTokenizer(fileName);
            String pgm = getProgram(f);
            return pgm;
        } catch (Exception e) {
            System.out.println("Fatal error: could not compile program");
            return "STOP\n";
        }
    }

    static String getProgram(SamTokenizer f) {
        try {
            String pgm = "";
            while (f.peekAtKind() != TokenType.EOF)
                pgm += getMethod(f);
            return pgm;
        } catch (Exception e) {
            System.out.println("Fatal error: could not compile program");
            return "STOP\n";
        }
    }

    static String getMethod(SamTokenizer f) {
        // TODO: add code to convert a method declaration to SaM code.
        // Since the only data type is an int, you can safely check for int
        // in the tokenizer.
        // TODO: add appropriate exception handlers to generate useful error msgs.
        f.check("int"); // must match at beginning
        String methodName = f.getWord();
        f.check("("); // must be an opening parenthesis
        String formals = getFormals(f);
        f.check(")"); // must be a closing parenthesis
        // You would need to read in formals if any
        // And then have calls to getDeclarations and getStatements.
        return null;
    }

    static String getExp(SamTokenizer f) {
        // TODO implement this
        switch (f.peekAtKind()) {
        case INTEGER: // E -> integer
            return "PUSHIMM " + f.getInt() + "\n";
        case OPERATOR: {
            // TODO: handle operators; until then this case falls through to the error below
        }
        default:
            return "ERROR\n";
        }
    }

    static String getFormals(SamTokenizer f) {
        // TODO implement this.
        return null;
    }
}
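
The invariant for getExp() can be illustrated with a small stand-alone sketch. It does not use SamTokenizer; instead it relies on a hypothetical tokenize() helper, and it covers only integer literals and the four parenthesized arithmetic forms of EXP. The class name ExpSketch and all of its methods are made up for illustration, and the opcode names ADD, SUB, TIMES, and DIV are assumed to match the SaM instruction set, so please check them against the SaM design document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; the real assignment reads tokens via SamTokenizer.
class ExpSketch {
    private final List<String> tokens;
    private int pos = 0;

    ExpSketch(List<String> tokens) { this.tokens = tokens; }

    // Groups digit runs into one token and treats every other
    // non-blank character as a one-character token.
    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isDigit(c)) {
                int j = i;
                while (j < src.length() && Character.isDigit(src.charAt(j))) j++;
                out.add(src.substring(i, j));
                i = j;
            } else {
                out.add(String.valueOf(c));
                i++;
            }
        }
        return out;
    }

    private String next() {
        if (pos >= tokens.size()) throw new IllegalStateException("unexpected end of input");
        return tokens.get(pos++);
    }

    private void expect(String s) {
        String t = next();
        if (!t.equals(s)) throw new IllegalStateException("expected '" + s + "' but found '" + t + "'");
    }

    // EXP -> INT | '(' EXP op EXP ')'   (a subset of the Bali grammar)
    // The returned code leaves exactly one value, the result, on top of the stack.
    String exp() {
        String t = next();
        if (Character.isDigit(t.charAt(0))) return "PUSHIMM " + t + "\n";
        if (!t.equals("(")) throw new IllegalStateException("unexpected token '" + t + "'");
        String left = exp();   // code that pushes the left operand
        String op = next();
        String right = exp();  // code that pushes the right operand
        expect(")");
        // The operator pops both operands and pushes the result.
        return left + right + opcode(op) + "\n";
    }

    // Opcode names are assumed to match the SaM instruction set.
    private static String opcode(String op) {
        switch (op) {
            case "+": return "ADD";
            case "-": return "SUB";
            case "*": return "TIMES";
            case "/": return "DIV";
            default: throw new IllegalStateException("unknown operator '" + op + "'");
        }
    }

    static String compileExp(String src) {
        return new ExpSketch(tokenize(src)).exp();
    }
}
```

For the input (1 + (2 * 3)), this sketch emits PUSHIMM 1, PUSHIMM 2, PUSHIMM 3, TIMES, ADD, one instruction per line. Because each sub-expression leaves its result on top of the stack, the code for a binary expression is simply the code for its operands followed by the operator.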

1.3 Additional information regarding implementation

Here are some additional assertions regarding the grammar and the language.

  1. There will always be a main method in the input program. The program starts executing from the main method. The main method does not take any arguments.

  2. There will always be a return statement at the end of a method in the input program.

  3. Comments in the input program are automatically handled by the SamTokenizer. The tokenizer discards characters between // and the end of the line (including //) and you do not need to worry about it.

  4. There is no overloading of methods.

  5. Methods can be defined either before or after the corresponding method calls. You will need a symbol table for each method. One approach would be to have a separate class for the symbol table (using hash tables or any other approach). A symbol table object would be created inside your getMethod() method and initialized by the getDeclarations() method call. Once initialized, it would be passed to almost all other method invocations inside getMethod() to make sure each rule has the appropriate information. Each method would have its own symbol table.

  6. A break statement must be lexically nested within one or more loops, and when it is executed, it terminates the execution of the innermost loop in which it is nested. Please take care of illegal break statements.
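
The per-method symbol table from item 5 could look roughly like the sketch below. The class name SymbolTable and its methods are hypothetical. The sketch assumes one particular frame layout, with parameters at negative offsets from the frame base register and locals starting at offset 2; that layout is an assumption, so check it against the calling convention from the lectures. Rejecting duplicate declarations is also an assumption, since the grammar does not say how to treat them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-method symbol table: maps each variable name to a frame offset.
class SymbolTable {
    // Assumed layout: offsets 0 and 1 hold bookkeeping, so locals start at 2.
    private static final int FIRST_LOCAL_OFFSET = 2;

    private final Map<String, Integer> offsets = new HashMap<>();
    private int nextLocal = FIRST_LOCAL_OFFSET;

    // With n formals, the i-th formal (0-based) is assumed to live at offset i - n.
    SymbolTable(List<String> formals) {
        int n = formals.size();
        for (int i = 0; i < n; i++) {
            declare(formals.get(i), i - n);
        }
    }

    // Registers a local variable at the next free positive offset.
    void addLocal(String name) {
        declare(name, nextLocal);
        nextLocal++;
    }

    // Returns the offset of a declared name, or fails with a readable message.
    int lookup(String name) {
        Integer offset = offsets.get(name);
        if (offset == null) {
            throw new IllegalStateException("undeclared variable: " + name);
        }
        return offset;
    }

    // Number of locals declared so far; useful for reserving stack space.
    int localCount() {
        return nextLocal - FIRST_LOCAL_OFFSET;
    }

    private void declare(String name, int offset) {
        if (offsets.putIfAbsent(name, offset) != null) {
            throw new IllegalStateException("duplicate declaration: " + name);
        }
    }
}
```

With a table like this, the code for reading or assigning a LOCATION only needs the offset returned by lookup(), and an undeclared identifier surfaces as an error at the point of use.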

If a program does not satisfy the grammar above or does not satisfy the textual description of the language, your compiler should print a short, informative error message and/or exit with a non-zero exit status. If you have any questions, you can post on Piazza.
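
One way to detect illegal break statements is to keep a stack of end-of-loop labels while parsing. The sketch below is only an illustration: the class LoopContext, its methods, and the label naming scheme are hypothetical, and the emitted "JUMP label" text assumes SaM's label-based jump syntax, which you should confirm in the SaM design document.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper that tracks the loops enclosing the statement being parsed.
class LoopContext {
    private final Deque<String> endLabels = new ArrayDeque<>();
    private int counter = 0;

    // Call when parsing of a while statement begins.
    // Returns the label the code generator should place right after the loop.
    String enterLoop() {
        String label = "loopEnd" + counter;
        counter++;
        endLabels.push(label);
        return label;
    }

    // Call when parsing of the while statement is finished.
    void exitLoop() {
        endLabels.pop();
    }

    // Code for 'break;': jump past the innermost enclosing loop.
    // Fails with a readable message when there is no enclosing loop.
    String breakCode() {
        if (endLabels.isEmpty()) {
            throw new IllegalStateException("break statement outside of a loop");
        }
        return "JUMP " + endLabels.peek() + "\n";
    }
}
```

The exception raised for a stray break can be caught at the top level of the compiler and turned into an error message and a non-zero exit status. If one LoopContext is created per method, the label counter would need to be shared across methods so that labels stay unique within the generated program.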

1.4 Logistics

Make sure that your compiler is in the Java class assignment1.BaliCompiler. Your compiler should take two command-line arguments. The first argument is an input file containing a Bali program. The second argument is an output file that will contain your generated SaM code.
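
The command-line handling described above might be structured as in the following sketch. The class name CompilerDriver and the run() method are hypothetical, and compile() is only a placeholder standing in for your real compiler entry point (for example, the compiler() method of the template); it is not part of the SaM library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical driver; in the assignment this logic would live in assignment1.BaliCompiler.
class CompilerDriver {

    // Placeholder for the real compiler: returns a fixed SaM program.
    static String compile(String inputFile) {
        return "PUSHIMM 0\nSTOP\n";
    }

    // Returns the exit status: 0 on success, 1 on any error.
    // A main method would simply call System.exit(run(args)).
    static int run(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java -jar compiler.jar <input.bali> <output.sam>");
            return 1;
        }
        try {
            String sam = compile(args[0]);
            Files.write(Paths.get(args[1]), sam.getBytes(StandardCharsets.UTF_8));
            return 0;
        } catch (IOException | RuntimeException e) {
            System.err.println("Error: " + e.getMessage());
            return 1;
        }
    }
}
```

Returning the status from run() instead of calling System.exit() directly keeps the logic easy to test, and routing every failure through one place makes it straightforward to print a short message and exit with a non-zero status.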

1.5 Testing

The compiler will be tested using both public and private test cases. The public test cases are available in the resources section of this page. After the deadline, your submission will be tested on a few private test cases. The private test cases will be made available to the students after the deadline.

1.6 Resources

  1. SaM-2.6.2.jar
    SaM library containing the lexer. Compile your code with this jar. If you are using an IDE, you will need to add this jar file as an external library in your project.
  2. SaM-2.6.2.src.zip
    (Optional) Source code of the SaM API for easy reference. If you are using an IDE, load this file as a source attachment to easily access the library functions.
  3. SaMAPI-2.6.2.zip
    (Optional) HTML documentation of the SaM API. The documentation provides information on the usage of specific library functions.
  4. SaM design document
    (Optional) Design document describing how the SaM interpreter works. It covers how to use the SaM UI, the various options for input/output, the SaM ISA, and so on.
  5. public_testcases.zip
    A few Bali programs to test the correctness of the compiler. More details on each test case, such as its description and expected output, are included in the zip file.

Tips:
Eclipse is a popular IDE for Java programming, and you can use it for your project. We recommend adding the source attachment in addition to the jar file so that you can easily navigate the SaM library.

2. Submission instructions

Submit (to Canvas) the following:

  1. compiler.jar
    A runnable .jar file of your project. Please refer to online resources to learn how to create a runnable .jar file for your Java project. Make sure that your .jar file can be executed using the command line mentioned in the evaluation section below.
  2. source.zip
    A .zip file containing all your source files, including any libraries you may have used. Ideally, we should be able to re-create the .jar file from the source, if needed.

3. Evaluation

The following sequence of commands will be used to evaluate your submission on both public and private testcases.

  1. Compiling Bali program:
    java -jar compiler.jar test1.bali output.sam
    The above command should read the Bali program in the file test1.bali, generate the SaM program, and write it to the file output.sam.

  2. Running the SaM program:
    java -cp SaM-2.6.2.jar edu.cornell.cs.sam.ui.SamText output.sam
    The above command reads the SaM program in the file output.sam and runs it in the SaM interpreter. The output of the program, i.e., its exit status, is displayed on the terminal. From the SaM interpreter's perspective, the exit status is the element on top of the stack when the interpreter executes the STOP instruction. From the Bali program's perspective, the exit status is the return value of the main() method. The exit status is used to evaluate the correctness of your compiler. An example output of this command is shown below:

    Program assembled.
    Program loaded. Executing.
    ==========================
    Exit Status: 30

    If there is an error in the SaM program, an error message will be displayed instead of the exit status.

Important:

  1. The public testcases provided are not exhaustive and only cover simple scenarios. We highly recommend testing your compiler by creating more testcases and running the commands above.
  2. Before submitting your .jar file, please execute the above commands and make sure the exit status matches the expected output on all testcases. This check will ease the grading process.

4. Grading

Each testcase is assigned a difficulty level: easy, medium, or hard. Points are assigned to each testcase based on its difficulty level: easy, medium, and hard testcases are worth 3, 5, and 7 points each, respectively. The points assigned to each testcase are stated at the top of the testcase.

There are 10 easy, 5 medium, and 5 hard testcases. The maximum score for successfully passing all testcases is 90 points (= 10 * 3 + 5 * 5 + 5 * 7). 10 points are assigned for code quality and packaging. If you do not submit the assignment according to the submission guidelines, you will lose these 10 points. The total for this assignment is 100 points.