The first assignment involves implementing a compiler that takes a simple C-like language called Bali (described below) as input and generates code for a stack machine named SaM (described below). The compiler will use a recursive-descent parser. The assignment is intended to help you better understand recursive-descent parsing and to get a feel for what it means to implement a compiler.
To know more about the SaM stack machine and parsing, please refer to the following lecture material:
Create a handwritten recursive-descent parser and SaM code generator for the Bali language. For lexical analysis, you will use the SamTokenizer. The compiler should take a file containing a Bali program as input and produce another file containing the SaM program that executes the Bali program.
The following is the grammar specification of the Bali language. In the grammar specification, all lower-case symbols denote a literal value. These literals are keywords (reserved words) and cannot be used as identifiers for variables or methods. Non-alphanumeric characters surrounded by single quotes denote literals consisting of only the non-alphanumeric characters. For example, in BLOCK -> '{' STMT* '}', the symbols '{' and '}' are terminals. Upper-case symbols are non-terminals. Additionally, several symbols in the grammar have special meaning. ‘*’ means zero or more occurrences, ‘?’ means zero or one occurrence, and ‘[ ]’ is the character class construction operator. These are the standard symbols used in context-free grammars. Finally, parentheses are used to group sequences of symbols together.
A Bali program is a sequence of zero or more method declarations. ‘int’ is the only type in this language. Each method declaration has a return type, zero or more formal arguments, and a method body. The method body consists of zero or more variable declarations followed by a sequence of statements. Unlike C programs, all variables must be declared at the beginning of the method body. Additionally, variables can be initialized during declaration. Each statement in the method body can be an assignment statement, a conditional statement (if-else), a while loop, a return statement, a break statement, a block, or a null statement. These statements have the usual meaning. Most statements contain one or more expressions (break and the null statement contain none). Expressions can be arithmetic operations, boolean operations, or function calls. Expressions are fully parenthesized to avoid problems with associativity and precedence.
The literals true and false have the values 1 and 0 respectively. For expressions used in conditions, any non-zero value is true and the value zero is false.
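Because any non-zero value counts as true, a conditional can branch directly on whatever value the condition leaves on the stack. The following is a minimal sketch of that idea, not the required implementation. The class IfSketch, the method genIf, and the label naming scheme are hypothetical, and the sketch assumes that SaM's JUMPC pops the top of the stack and jumps when the value is non-zero; please confirm this against the SaM documentation.

```java
// Sketch only: IfSketch, genIf, and the label names are hypothetical.
class IfSketch {
    private static int nextLabel = 0;

    // Returns a fresh label name so that nested statements never collide.
    static String newLabel() {
        return "L" + (nextLabel++);
    }

    // condCode must leave exactly one value on the stack.
    // thenCode and elseCode are already-generated SaM code for the two branches.
    static String genIf(String condCode, String thenCode, String elseCode) {
        String thenLabel = newLabel();
        String endLabel = newLabel();
        return condCode
             + "JUMPC " + thenLabel + "\n"  // assumed: pops the value, jumps if non-zero
             + elseCode                     // falls through here when the value is zero
             + "JUMP " + endLabel + "\n"
             + thenLabel + ":\n"
             + thenCode
             + endLabel + ":\n";
    }

    public static void main(String[] args) {
        // Models: if (7) x = 1; else x = 0;  with x assumed to live at offset 2.
        System.out.print(genIf("PUSHIMM 7\n",
                               "PUSHIMM 1\nSTOREOFF 2\n",
                               "PUSHIMM 0\nSTOREOFF 2\n"));
    }
}
```

Placing the else branch on the fall-through path means no explicit comparison with zero is needed.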
Comments in the input program are automatically handled by the SamTokenizer. The tokenizer discards characters between ‘//’ and the end of the line (including ‘//’) and you do not need to worry about it.
PROGRAM -> METH_DECL*
METH_DECL -> TYPE ID '(' FORMALS? ')' BODY
FORMALS -> TYPE ID (',' TYPE ID)*
TYPE -> int
BODY -> '{' VAR_DECL* STMT* '}'
VAR_DECL -> TYPE ID ('=' EXP)? (',' ID ('=' EXP)?)* ';'
STMT -> ASSIGN ';'
      | return EXP ';'
      | if '(' EXP ')' STMT else STMT
      | while '(' EXP ')' STMT
      | break ';'
      | BLOCK
      | ';'
BLOCK -> '{' STMT* '}'
ASSIGN -> LOCATION '=' EXP
LOCATION -> ID
METHOD -> ID
EXP -> LOCATION
     | LITERAL
     | METHOD '(' ACTUALS? ')'
     | '(' EXP '+' EXP ')'
     | '(' EXP '-' EXP ')'
     | '(' EXP '*' EXP ')'
     | '(' EXP '/' EXP ')'
     | '(' EXP '&' EXP ')'
     | '(' EXP '|' EXP ')'
     | '(' EXP '<' EXP ')'
     | '(' EXP '>' EXP ')'
     | '(' EXP '=' EXP ')'
     | '(' '-' EXP ')'
     | '(' '!' EXP ')'
     | '(' EXP ')'
ACTUALS -> EXP (',' EXP)*
LITERAL -> INT | true | false
INT -> [1-9] [0-9]*
ID -> [a-zA-Z] ( [a-zA-Z] | [0-9] | '_' )*
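In a recursive-descent parser, the grammar operators map onto ordinary control flow: ‘?’ becomes a single lookahead test and ‘*’ becomes a loop. The sketch below illustrates this for the rule FORMALS -> TYPE ID (',' TYPE ID)* and its use in '(' FORMALS? ')'. It is only an illustration: the class FormalsSketch and its list of pre-split tokens are stand-ins for the real tokenizer, and identifiers are not validated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: FormalsSketch and its token list stand in for the real tokenizer.
class FormalsSketch {
    private final List<String> tokens;
    private int pos = 0;

    FormalsSketch(String... tokens) {
        this.tokens = Arrays.asList(tokens);
    }

    // Looks at the current token without consuming it.
    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "<EOF>";
    }

    // Consumes and returns the current token.
    private String next() {
        String t = peek();
        pos++;
        return t;
    }

    // Consumes the current token and fails if it is not the expected literal.
    private void expect(String literal) {
        String t = next();
        if (!t.equals(literal)) {
            throw new IllegalStateException(
                "expected '" + literal + "' but found '" + t + "'");
        }
    }

    // '(' FORMALS? ')' : the '?' becomes a single lookahead test.
    List<String> parseParameterList() {
        expect("(");
        List<String> names = new ArrayList<>();
        if (peek().equals("int")) {
            names = parseFormals();
        }
        expect(")");
        return names;
    }

    // FORMALS -> TYPE ID (',' TYPE ID)* : the '*' becomes a while loop.
    private List<String> parseFormals() {
        List<String> names = new ArrayList<>();
        expect("int");
        names.add(next());
        while (peek().equals(",")) {
            next(); // consume ','
            expect("int");
            names.add(next());
        }
        return names;
    }

    public static void main(String[] args) {
        FormalsSketch p = new FormalsSketch("(", "int", "a", ",", "int", "b", ")");
        System.out.println(p.parseParameterList()); // prints [a, b]
    }
}
```

Each non-terminal gets its own method, and the method body mirrors the right-hand side of the rule from left to right.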
You can use the template below to get started. The template shows various ways the SaM tokenizer can be used. It also contains methods to help you get started on the project, along with many TODOs that you will need to implement. It starts by visiting all the methods in the program using the getMethod() function. The getMethod() function contains logic to accept a valid method declaration. The getExp() method accepts a valid expression and is mostly left blank. In this function, the implementation should ensure that the following invariant is always maintained: the result of every expression is present on the top of the stack.
package assignment1;

import edu.cornell.cs.sam.io.SamTokenizer;
import edu.cornell.cs.sam.io.Tokenizer;
import edu.cornell.cs.sam.io.Tokenizer.TokenType;

public class BaliCompiler {

    static String compiler(String fileName) {
        // returns SaM code for program in file
        try {
            SamTokenizer f = new SamTokenizer(fileName);
            String pgm = getProgram(f);
            return pgm;
        } catch (Exception e) {
            System.out.println("Fatal error: could not compile program");
            return "STOP\n";
        }
    }

    static String getProgram(SamTokenizer f) {
        try {
            String pgm = "";
            while (f.peekAtKind() != TokenType.EOF)
                pgm += getMethod(f);
            return pgm;
        } catch (Exception e) {
            System.out.println("Fatal error: could not compile program");
            return "STOP\n";
        }
    }

    static String getMethod(SamTokenizer f) {
        // TODO: add code to convert a method declaration to SaM code.
        // Since the only data type is an int, you can safely check for int
        // in the tokenizer.
        // TODO: add appropriate exception handlers to generate useful error msgs.
        f.check("int"); // must match at beginning
        String methodName = f.getString();
        f.check("("); // must be an opening parenthesis
        String formals = getFormals(f);
        f.check(")"); // must be a closing parenthesis
        // You would need to read in formals if any
        // And then have calls to getDeclarations and getStatements.
        return null;
    }

    static String getExp(SamTokenizer f) {
        // TODO implement this
        switch (f.peekAtKind()) {
            case INTEGER: // E -> integer
                return "PUSHIMM " + f.getInt() + "\n";
            case OPERATOR: {
            }
            default:
                return "ERROR\n";
        }
    }

    static String getFormals(SamTokenizer f) {
        // TODO implement this.
        return null;
    }
}
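The invariant that every expression leaves its result on top of the stack makes binary operators straightforward: generate code for the left operand, then the right operand, then emit the operator. The sketch below illustrates this under stated assumptions. ExpSketch is a hypothetical class that uses its own tiny regex-based tokenizer instead of SamTokenizer (so it runs without the SaM jar), silently skips characters it does not recognize, and covers only integer literals, '(' EXP ')', and the four arithmetic operators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: ExpSketch is a stand-in for getExp() and does not use SamTokenizer.
class ExpSketch {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    ExpSketch(String source) {
        // Tokens are integer literals, parentheses, and + - * /.
        Matcher m = Pattern.compile("\\d+|[()+\\-*/]").matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
    }

    private String next() {
        if (pos >= tokens.size()) {
            throw new IllegalStateException("unexpected end of input");
        }
        return tokens.get(pos++);
    }

    // Every path returns code whose net effect is to push exactly one value.
    String genExp() {
        String t = next();
        if (Character.isDigit(t.charAt(0))) {
            return "PUSHIMM " + t + "\n";          // EXP -> LITERAL
        }
        if (!t.equals("(")) {
            throw new IllegalStateException("unexpected token " + t);
        }
        String left = genExp();
        String op = next();
        if (op.equals(")")) {
            return left;                           // EXP -> '(' EXP ')'
        }
        String right = genExp();
        if (!next().equals(")")) {
            throw new IllegalStateException("expected ')'");
        }
        // Both operands are on the stack; the operator replaces them with the result.
        return left + right + opcode(op) + "\n";
    }

    private static String opcode(String op) {
        switch (op) {
            case "+": return "ADD";
            case "-": return "SUB";
            case "*": return "TIMES";
            case "/": return "DIV";
            default: throw new IllegalStateException("unknown operator " + op);
        }
    }

    public static void main(String[] args) {
        System.out.print(new ExpSketch("((1 + 2) * 3)").genExp());
    }
}
```

Because the language is fully parenthesized, the parser never needs to consider precedence: the token after the left operand decides which rule applies.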
Here are some additional points regarding the grammar and the language.
There will always be a main method in the input program. The program starts executing from the main method.
The main method does not take any arguments.
There is no overloading of methods.
You will need a symbol table for each method. One approach is to have a separate class for the symbol table (using a hash table or any other approach). A symbol table object would be created inside your getMethod() method and initialized by the getDeclarations() method call. Once initialized, it would be passed to almost all other method calls inside getMethod() so that each rule has the information it needs. Each method would have its own symbol table.
A break statement must be lexically nested within one or more loops, and when it is executed, it terminates the execution of the innermost loop in which it is nested. Make sure your compiler handles illegal break statements (those that appear outside any loop).
If a program does not satisfy the grammar above or does not satisfy the textual description of the language, your compiler should print a short, informative error message and/or exit with a non-zero exit status.
Make sure that your compiler is in the Java class assignment1.BaliCompiler. Your compiler should take two command-line arguments. The first argument is an input file containing a Bali program. The second argument is an output file that will contain your generated SaM code.
The compiler will be tested using both public and private test cases. The public test cases are available in the resources section of this page. After the deadline, your submission will be tested on a few private test cases. The private test cases will be made available to the students after the deadline.
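Two of the points above, the per-method symbol table and the rule for break statements, can be sketched together as one per-method context object. This is one possible design, not the required one. The class MethodContext and all of its method names are hypothetical, and the frame layout is an assumption: formals sit just below the frame base register (FBR), the saved FBR and return address occupy offsets 0 and 1, and locals start at offset 2.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: MethodContext and its offset layout are assumptions.
class MethodContext {
    private final Map<String, Integer> offsets = new HashMap<>();
    private final Deque<String> loopEndLabels = new ArrayDeque<>();
    private int nextLocalOffset = 2; // assumed: FBR+0 saved FBR, FBR+1 return address

    MethodContext(List<String> formals) {
        // Assumed layout: with n formals, formal i (0-based) sits at FBR - n + i.
        int n = formals.size();
        for (int i = 0; i < n; i++) {
            declare(formals.get(i), i - n);
        }
    }

    private void declare(String name, int offset) {
        if (offsets.containsKey(name)) {
            throw new IllegalStateException("duplicate declaration of " + name);
        }
        offsets.put(name, offset);
    }

    void declareLocal(String name) {
        declare(name, nextLocalOffset);
        nextLocalOffset++;
    }

    int offsetOf(String name) {
        Integer offset = offsets.get(name);
        if (offset == null) {
            throw new IllegalStateException("undeclared variable " + name);
        }
        return offset;
    }

    // Called when the parser starts and finishes compiling a while loop body.
    void enterLoop(String endLabel) { loopEndLabels.push(endLabel); }
    void exitLoop() { loopEndLabels.pop(); }

    // A break compiles to a jump to the end of the innermost enclosing loop.
    String genBreak() {
        if (loopEndLabels.isEmpty()) {
            throw new IllegalStateException("break outside of a loop");
        }
        return "JUMP " + loopEndLabels.peek() + "\n";
    }

    public static void main(String[] args) {
        MethodContext ctx = new MethodContext(Arrays.asList("a", "b"));
        ctx.declareLocal("x");
        // prints -2 -1 2
        System.out.println(ctx.offsetOf("a") + " " + ctx.offsetOf("b") + " " + ctx.offsetOf("x"));
        ctx.enterLoop("L1");
        System.out.print(ctx.genBreak()); // prints JUMP L1
        ctx.exitLoop();
    }
}
```

Keeping the loop labels on a stack means nested loops work without extra bookkeeping, and an empty stack is exactly the illegal-break case.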
A few tips:
Eclipse is a popular IDE for Java programming. You can use it for your project. If you will be using Eclipse, you will need to add the SaM jar file as an external library in your project so that you can compile your program. Additionally, I suggest you add the source attachment to the jar file to easily navigate the SaM library.
Submit (in Canvas) the following:
I will be using the following sequence of commands to evaluate your submission on both public and private test cases.
Compiling Bali program:
java -jar compiler.jar test1.bali output.sam
The above command should read the Bali program in the test1.bali file, generate the SaM program, and write it to the output.sam file.
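A command-line entry point for this usage could look roughly like the sketch below. It is an assumption-laden outline rather than the expected solution: DriverSketch, run, and compileSource are hypothetical names, compileSource is a placeholder that emits a fixed program, and the specific non-zero status values are arbitrary choices.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch only: DriverSketch, run, and compileSource are hypothetical names.
class DriverSketch {
    // Placeholder for the real compiler; it only rejects empty input.
    static String compileSource(String baliSource) {
        if (baliSource.trim().isEmpty()) {
            throw new IllegalArgumentException("empty program: no main method");
        }
        return "PUSHIMM 0\nSTOP\n";
    }

    // Returns the process exit status so the logic can be tested
    // without terminating the JVM.
    static int run(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java -jar compiler.jar <input.bali> <output.sam>");
            return 2;
        }
        try {
            String source = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
            String sam = compileSource(source);
            Files.write(Paths.get(args[1]), sam.getBytes(StandardCharsets.UTF_8));
            return 0;
        } catch (IOException e) {
            System.err.println("I/O error: " + e.getMessage());
            return 1;
        } catch (RuntimeException e) {
            System.err.println("Compile error: " + e.getMessage());
            return 1;
        }
    }

    // In the real compiler, main would simply forward the status:
    //   public static void main(String[] args) { System.exit(run(args)); }
}
```

Separating run from main keeps the exit-status decision in one place and makes the error paths easy to exercise.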
Running the SaM program:
java -cp SaM-2.6.2.jar edu.cornell.cs.sam.ui.SamText output.sam
The above command reads the SaM program in the output.sam file and executes the SaM interpreter. The output of the program, i.e., its exit status, will be displayed on the terminal. From the SaM interpreter’s perspective, the exit status is the element on the top of the stack when the interpreter executes the STOP instruction. From the Bali program’s perspective, the exit status is the return value of the main() method. The exit status is used to evaluate the correctness of your compiler. An example output of this command is shown below:
Program assembled.
Program loaded. Executing.
==========================
Exit Status: 30
If there is an error in the SaM program, an error message will be displayed instead of the exit status.
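Since the exit status is whatever sits on top of the stack at STOP, the generated program needs to arrange for main's return value to end up there. The sketch below shows one way to do that, assuming a common SaM calling convention (a return-value slot pushed by the caller, then LINK, JSR, and POPFBR). The class BootstrapSketch and its methods are hypothetical, the convention is not mandated by the assignment, and the instruction semantics in the comments should be checked against the SaM documentation.

```java
// Sketch only: BootstrapSketch assumes one possible calling convention.
class BootstrapSketch {
    // Code placed at the start of the output so that main's return value
    // is the only value left on the stack when STOP runs.
    static String genBootstrap() {
        return "PUSHIMM 0\n"  // slot that main's return value will overwrite
             + "LINK\n"       // assumed: saves the current FBR and sets a new frame
             + "JSR main\n"   // assumed: pushes the return address, jumps to main
             + "POPFBR\n"     // assumed: restores the saved FBR
             + "STOP\n";      // top of stack is now the exit status
    }

    // Return sequence for a method, assuming the value to return is already
    // on top of the stack and the locals sit directly beneath it.
    static String genReturn(int numFormals, int numLocals) {
        return "STOREOFF " + (-(numFormals + 1)) + "\n" // write into the return slot
             + "ADDSP " + (-numLocals) + "\n"           // discard the locals
             + "JUMPIND\n";                             // assumed: pops the address and jumps
    }

    public static void main(String[] args) {
        System.out.print(genBootstrap());
        System.out.print("main:\n");
        System.out.print("PUSHIMM 30\n");   // models: return 30;
        System.out.print(genReturn(0, 0));
    }
}
```

Under these assumptions, the caller of any other method would follow the same pattern as the bootstrap and then remove the arguments it pushed.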
Important:
Each test case is assigned a difficulty level: easy, medium, or hard. Points are assigned to each test case based on its difficulty level: 3, 5, and 7 points for each easy, medium, and hard test case, respectively. The points assigned to each test case are listed at the top of the test case.
There are 10 easy, 5 medium, and 5 hard test cases. The maximum number of points for successfully passing all test cases is 90 (= 10 * 3 + 5 * 5 + 5 * 7). 10 points are assigned for code quality and packaging. If you do not submit the assignment according to the submission guidelines, you will lose these 10 points. The total for this assignment is 100 points.