CS380C: Assignment 1

Recursive-descent parser and SaM code generator

 

 

Background

The first assignment involves implementing a compiler that takes a simple C-like language called Bali (described below) as input and generates code for a stack machine named SaM (described below). The compiler will use a recursive-descent parser. The assignment is intended to help you better understand recursive-descent parsing and to give you a feel for what it means to implement a compiler.

To learn more about the SaM stack machine and parsing, please refer to the following lecture material:

  1. Stack Machines: SaM
  2. Parsing
  3. Recursive-descent parsing and code generation

1. Building the Bali compiler

Create a handwritten recursive-descent parser and SaM code generator for the Bali language. For lexical analysis, you will use the SamTokenizer. The compiler should take a file containing a Bali program as input and produce another file containing the SaM program that executes the Bali program.

1.1 Grammar

The following is the grammar specification of the Bali language. In the grammar specification, all lower-case symbols denote a literal value. These literals are keywords (reserved words) and cannot be used as identifiers for variables or methods. Non-alphanumeric characters surrounded by single quotes denote literals consisting of only those non-alphanumeric characters. For example, in BLOCK -> '{' STMT* '}', '{' and '}' are terminals. Upper-case symbols are non-terminals. Additionally, several symbols in the grammar have special meaning: ‘*’ means zero or more occurrences, ‘?’ means zero or one occurrence, ‘|’ separates alternatives, and ‘[ ]’ is the character class construction operator. These are the usual regular-expression operators used in extended grammar notation. Finally, parentheses are used to group sequences of symbols together.

A Bali program is a sequence of zero or more method declarations. ‘int’ is the only type in this language. Each method declaration has a return type, zero or more formal arguments, and a method body. The method body consists of zero or more variable declarations followed by a sequence of statements. Unlike in C programs, all variables must be declared at the beginning of the method body. Additionally, variables can be initialized during declaration. Each statement in the method body can be an assignment statement, a conditional statement (if-else), a while loop, a return statement, a break statement, a block, or a null statement. These statements have the usual meaning. Most statements contain one or more expressions; the break and null statements contain none. Expressions can be arithmetic operations, boolean operations, or function calls. Expressions are fully parenthesized to avoid problems with associativity and precedence.

The literals true and false have the values 1 and 0 respectively. For expressions used in conditions, any non-zero value is true and the value zero is false.

Comments in the input program are automatically handled by the SamTokenizer. The tokenizer discards characters between ‘//’ and the end of the line (including ‘//’) and you do not need to worry about it.

PROGRAM    -> METH_DECL*

METH_DECL  -> TYPE ID '(' FORMALS? ')' BODY
FORMALS    -> TYPE ID (',' TYPE ID)*
TYPE       -> int

BODY       -> '{' VAR_DECL*  STMT* '}'
VAR_DECL   -> TYPE ID ('=' EXP)? (',' ID ('=' EXP)?)* ';'

STMT       -> ASSIGN ';'
          | return EXP ';'
          | if '(' EXP ')' STMT else STMT
          | while '(' EXP ')' STMT
          | break ';'
          | BLOCK
          | ';'

BLOCK      -> '{' STMT* '}'
ASSIGN     -> LOCATION '=' EXP
LOCATION   -> ID
METHOD     -> ID

EXP        -> LOCATION
          | LITERAL
          | METHOD '(' ACTUALS? ')'
          | '('EXP '+' EXP')'
          | '('EXP '-' EXP')'
          | '('EXP '*' EXP')'
          | '('EXP '/' EXP')'
          | '('EXP '&' EXP')'
          | '('EXP '|' EXP')'
          | '('EXP '<' EXP')'
          | '('EXP '>' EXP')'
          | '('EXP '=' EXP')'
          | '(''-' EXP')'
          | '(''!' EXP')'
          | '(' EXP ')'

ACTUALS    -> EXP (',' EXP)*

LITERAL    -> INT | true | false

INT        -> [1-9] [0-9]*
ID         -> [a-zA-Z] ( [a-zA-Z] | [0-9] | '_' )*
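The INT and ID rules and the reserved-word restriction can be checked mechanically. The following is a minimal Java sketch, assuming you want a helper of your own; the class BaliLexemes and its method names are hypothetical and are not part of the SaM library or the template. The keyword set is simply the list of lower-case literals that appear in the grammar above.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the SaM library or the assignment template.
final class BaliLexemes {
    // Reserved words: the lower-case literals that appear in the grammar.
    static final Set<String> KEYWORDS = Set.of(
            "int", "return", "if", "else", "while", "break", "true", "false");

    // INT -> [1-9] [0-9]*
    static final Pattern INT = Pattern.compile("[1-9][0-9]*");

    // ID -> [a-zA-Z] ( [a-zA-Z] | [0-9] | '_' )*
    static final Pattern ID = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*");

    static boolean isInt(String s) {
        return INT.matcher(s).matches();
    }

    // A word is usable as an identifier only if it matches ID and is not reserved.
    static boolean isIdentifier(String s) {
        return ID.matcher(s).matches() && !KEYWORDS.contains(s);
    }
}
```

Taken literally, the INT rule rejects 0 and numbers with leading zeros. How the tokenizer itself classifies such input is something to confirm against the SaM API documentation rather than this sketch.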

1.2 Template

You can use the template below to get started. The template shows various ways the SaM tokenizer can be used. It also contains methods to help you get started on the project. The template contains many TODOs that you will need to implement. It starts by visiting all the methods in the program using the getMethod() function. The getMethod() function contains logic to accept a valid method declaration. The getExp() method accepts a valid expression and is mostly left blank. In getExp(), the implementation should ensure that the following invariant is always maintained: the code generated for every expression leaves the result of that expression on the top of the stack.

package assignment1;

import edu.cornell.cs.sam.io.SamTokenizer;
import edu.cornell.cs.sam.io.Tokenizer;
import edu.cornell.cs.sam.io.Tokenizer.TokenType;

public class BaliCompiler {

    static String compiler(String fileName) {
        // returns SaM code for program in file
        try {
            SamTokenizer f = new SamTokenizer(fileName);
            String pgm = getProgram(f);
            return pgm;
        } catch (Exception e) {
            System.out.println("Fatal error: could not compile program");
            return "STOP\n";
        }
    }

    static String getProgram(SamTokenizer f) {
        try {
            String pgm = "";
            while (f.peekAtKind() != TokenType.EOF)
                pgm += getMethod(f);
            return pgm;
        } catch (Exception e) {
            System.out.println("Fatal error: could not compile program");
            return "STOP\n";
        }
    }

    static String getMethod(SamTokenizer f) {
        // TODO: add code to convert a method declaration to SaM code.
        // Since the only data type is an int, you can safely check for int
        // in the tokenizer.
        // TODO: add appropriate exception handlers to generate useful error msgs.
        f.check("int"); // must match at beginning
        String methodName = f.getString();
        f.check("("); // must be an opening parenthesis
        String formals = getFormals(f);
        f.check(")"); // must be a closing parenthesis
        // You would need to read in formals if any
        // And then have calls to getDeclarations and getStatements.
        return null;
    }

    static String getExp(SamTokenizer f) {
        // TODO implement this
        switch (f.peekAtKind()) {
        case INTEGER: // E -> integer
            return "PUSHIMM " + f.getInt() + "\n";
        case OPERATOR: {
        }
        default:
            return "ERROR\n";
        }
    }

    static String getFormals(SamTokenizer f) {
        // TODO implement this.
        return null;
    }
}
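The stack invariant described for getExp() can be illustrated without the SaM library. The sketch below is a stand-alone approximation, not the template's implementation: it scans a plain string instead of using SamTokenizer, handles only integer literals, '(' EXP ')', and the four arithmetic operators, and does not check for trailing input. The class ExpSketch and its methods are hypothetical. The opcode names ADD, SUB, TIMES, and DIV are the SaM arithmetic instructions as I understand them; please confirm them against the SaM design document.

```java
// Hypothetical stand-alone sketch; the real compiler would read tokens from SamTokenizer.
final class ExpSketch {
    private final String src;
    private int pos = 0;

    ExpSketch(String src) { this.src = src; }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    // Returns the next non-blank character without consuming it.
    private char peek() {
        skipSpaces();
        if (pos >= src.length()) throw new IllegalStateException("unexpected end of input");
        return src.charAt(pos);
    }

    private void expect(char c) {
        if (peek() != c) throw new IllegalStateException("expected '" + c + "' at position " + pos);
        pos++;
    }

    // Returns SaM code that leaves the value of the expression on top of the stack.
    String exp() {
        char c = peek();
        if (Character.isDigit(c)) {               // EXP -> INT
            int start = pos;
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
            return "PUSHIMM " + src.substring(start, pos) + "\n";
        }
        expect('(');
        String left = exp();
        if (peek() == ')') {                      // EXP -> '(' EXP ')'
            pos++;
            return left;
        }
        String op = opcode(peek());               // EXP -> '(' EXP op EXP ')'
        pos++;
        String right = exp();
        expect(')');
        // Each recursive call pushed one value; the operator pops both and
        // pushes one result, so the invariant still holds on return.
        return left + right + op + "\n";
    }

    private static String opcode(char c) {
        switch (c) {
            case '+': return "ADD";
            case '-': return "SUB";
            case '*': return "TIMES";
            case '/': return "DIV";
            default: throw new IllegalStateException("unsupported operator: " + c);
        }
    }

    static String compile(String src) { return new ExpSketch(src).exp(); }
}
```

For example, ExpSketch.compile("((1 + 2) * 3)") produces code that pushes 1 and 2, adds them, pushes 3, and multiplies, leaving a single value on the stack.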

1.3 Additional information regarding implementation

Here are some additional points regarding the grammar and the language.

  1. There will always be a main method in the input program. The program starts executing from the main method.

  2. The main method does not take any arguments.

  3. There is no overloading of methods.

  4. You will need a symbol table for each method. One approach is to have a separate class for the symbol table (using a hash table or any other data structure). A symbol table object would be created inside your getMethod() method and initialized by the getDeclarations() method call. Once initialized, it would be passed to almost all other method invocations inside getMethod() so that each grammar rule has the information it needs. Each method has its own symbol table.

  5. A break statement must be lexically nested within one or more loops. When it is executed, it terminates the execution of the innermost loop in which it is nested. Your compiler must detect illegal break statements, i.e., those that are not nested inside any loop.
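The per-method symbol table from item 4 could look like the following minimal sketch. The class SymbolTable and its methods are hypothetical names. The frame layout is an assumption, not something the assignment mandates: formals are given negative offsets just below the frame base, and locals are numbered upward from an offset the caller chooses. Rejecting duplicate names is also an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-method symbol table; names and layout are illustrative only.
final class SymbolTable {
    private final Map<String, Integer> offsets = new HashMap<>();
    private int nextLocal;

    // Assumed layout: with n formals, formal i gets offset i - n (so the last
    // formal is at -1); locals start at firstLocalOffset and grow upward.
    SymbolTable(String[] formals, int firstLocalOffset) {
        for (int i = 0; i < formals.length; i++) {
            declare(formals[i], i - formals.length);
        }
        nextLocal = firstLocalOffset;
    }

    private void declare(String name, int offset) {
        // Assumption: redeclaring a name within one method is an error.
        if (offsets.putIfAbsent(name, offset) != null) {
            throw new IllegalArgumentException("duplicate declaration: " + name);
        }
    }

    // Records a local variable at the next free offset.
    void addLocal(String name) {
        declare(name, nextLocal);
        nextLocal++;
    }

    // Looks up the frame offset of a variable, failing on undeclared names.
    int offsetOf(String name) {
        Integer off = offsets.get(name);
        if (off == null) {
            throw new IllegalArgumentException("undeclared variable: " + name);
        }
        return off;
    }
}
```

A fresh SymbolTable would be created for each method, so names declared in one method are never visible in another.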

If a program does not satisfy the grammar above or does not satisfy the textual description of the language, your compiler should print a short, informative error message and/or exit with a non-zero exit status.
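One way to handle the break rule and report illegal uses is to keep a stack of loop-exit labels while parsing. The sketch below is an assumption about how you might structure this, not a required design: LoopContext, its methods, and the label naming scheme are all hypothetical. It relies on SaM's JUMP instruction taking a label operand, which you should confirm in the SaM design document.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for tracking which loop a break statement belongs to.
final class LoopContext {
    private final Deque<String> exitLabels = new ArrayDeque<>();
    private int counter = 0;

    // Call when the parser starts a while statement. Returns the label that
    // the generated code should place immediately after the loop.
    String enterLoop() {
        String label = "loopEnd" + counter;
        counter++;
        exitLabels.push(label);
        return label;
    }

    // Call once the loop body has been compiled.
    void exitLoop() {
        exitLabels.pop();
    }

    // Code for "break ;": jump to the end of the innermost enclosing loop,
    // or fail if the break is not inside any loop.
    String compileBreak() {
        if (exitLabels.isEmpty()) {
            throw new IllegalStateException("break statement outside of a loop");
        }
        return "JUMP " + exitLabels.peek() + "\n";
    }
}
```

The caller would catch the exception at the top level, print the message, and exit with a non-zero status.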

1.4 Logistics

Make sure that your compiler is in the Java class assignment1.BaliCompiler. Your compiler should take two command-line arguments. The first argument is an input file containing a Bali program. The second argument is an output file that will contain your generated SaM code.
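The two-argument command-line interface might be wired up roughly as follows. This is a sketch under assumptions: DriverSketch and compileFile are hypothetical stand-ins, and in your project the equivalent main method would live in assignment1.BaliCompiler and call the template's compiler(String) method instead of the stub used here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical driver; the real entry point belongs in assignment1.BaliCompiler.
final class DriverSketch {
    // Stand-in for BaliCompiler.compiler(fileName); it ignores its input and
    // always returns a trivial SaM program.
    static String compileFile(String inputFile) {
        return "PUSHIMM 0\nSTOP\n";
    }

    // Returns the process exit status: 0 on success, non-zero on failure.
    static int run(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java -jar compiler.jar <input.bali> <output.sam>");
            return 1;
        }
        try {
            String samCode = compileFile(args[0]);
            Files.write(Paths.get(args[1]), samCode.getBytes(StandardCharsets.UTF_8));
            return 0;
        } catch (IOException e) {
            System.err.println("could not write " + args[1] + ": " + e.getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        System.exit(run(args));
    }
}
```

Keeping the logic in run() and calling System.exit only from main makes the exit status easy to test.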

1.5 Testing

The compiler will be tested using both public and private test cases. The public test cases are available in the resources section of this page. After the deadline, your submission will be tested on a few private test cases. The private test cases will be made available to the students after the deadline.

1.6 Resources

  1. SaM design document
    Design document describing how the SaM interpreter works. It covers how to use the SaM UI, the various options for input/output, the SaM ISA, and so on.
  2. SaM-2.6.2.jar
    SaM library containing the lexer. Use this jar while compiling your code.
  3. SaM-2.6.2.src.zip
    Source code of the SaM API for easy reference. If you are using an IDE, load this file as a source attachment to easily access the library functions.
  4. SaMAPI-2.6.2.zip
    HTML documentation of the SaM API. The documentation provides information on the usage of specific library functions.
  5. public_testcases.zip
    A few Bali programs to test the correctness of the compiler. More details on each test case, such as its description and expected output, are present in the zip file.

A few tips:
Eclipse is a popular IDE for Java programming. You can use it for your project. If you use Eclipse, you will need to add the SaM jar file as an external library in your project so that you can compile your program. Additionally, I suggest you add the source attachment to the jar file to easily navigate the SaM library.

2. Submission instructions

Submit (in Canvas) the following:

  1. compiler.jar
    A runnable .jar file of your project. Please refer to online resources to learn how to create a runnable .jar file for your Java project. Make sure that your .jar file can be executed using the command line mentioned in the evaluation section below.
  2. source.zip
    A .zip file containing all your source files, including any libraries you may have used. Ideally, I should be able to re-create the .jar file from the source, if needed.

3. Evaluation

I will be using the following sequence of commands to evaluate your submission on both public and private testcases.

  1. Compiling Bali program:
    java -jar compiler.jar test1.bali output.sam
    The above command should read the Bali program in the test1.bali file, generate the SaM program, and write it to the output.sam file.

  2. Running the SaM program:
    java -cp SaM-2.6.2.jar edu.cornell.cs.sam.ui.SamText output.sam
    The above command reads the SaM program in the output.sam file and runs it in the SaM interpreter. The output of the program, i.e., its exit status, is displayed on the terminal. From the SaM interpreter’s perspective, the exit status is the element on the top of the stack when the interpreter executes the STOP instruction. From the Bali program’s perspective, the exit status is the return value of the main() method. The exit status is used to evaluate the correctness of your compiler. An example output of this command is shown below:

    Program assembled.
    Program loaded. Executing.
    ==========================
    Exit Status: 30
    

    If there is an error in the SaM program, an error message will be displayed instead of the exit status.

Important:

  1. The public testcases provided are not exhaustive and only cover simple scenarios. I highly recommend testing your compiler by creating more testcases and evaluating them using the above commands.
  2. Before submitting your .jar file, please execute the above commands and make sure the exit status matches the expected output on all testcases. This check will ease the grading process.
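If you script your own testcases, the exit status can be pulled out of the interpreter's output text. The following is a small sketch that assumes the output contains a line of the form "Exit Status: 30", as in the example above, with an integer value; ExitStatusParser is a hypothetical helper and not part of the SaM distribution.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for self-testing; not part of the SaM distribution.
final class ExitStatusParser {
    // Matches the "Exit Status: <integer>" line shown in the example output.
    private static final Pattern EXIT = Pattern.compile("Exit Status:\\s*(-?\\d+)");

    // Returns the exit status if the output contains one, or empty otherwise
    // (for instance when the interpreter printed an error message instead).
    static OptionalInt parse(String samTextOutput) {
        Matcher m = EXIT.matcher(samTextOutput);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }
}
```

Comparing the parsed value against each testcase's expected output gives a quick pass/fail check before submission.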

4. Grading

Each testcase is assigned a difficulty level: easy, medium, or hard. Points are assigned to each testcase based on its difficulty level: 3, 5, and 7 points for each easy, medium, and hard testcase, respectively. The points assigned to each testcase are listed at the top of the testcase.

There are 10 easy, 5 medium, and 5 hard testcases. The maximum score for successfully passing all testcases is 90 points (= 10 * 3 + 5 * 5 + 5 * 7). 10 points are assigned for code quality and packaging. If you do not submit the assignment according to the submission guidelines, you will lose 10 points. The total for this assignment is 100 points.