Homework 1: Writing a 3-address-code to C translator

Course: CS 380C: Advanced Compiler Techniques (Fall 2007)
Instructor: Keshav Pingali
Assigned: Tuesday, September 4, 2007
Due: Thursday, September 13, 2007, 9:30 AM

Objective
---------
The goal of this project is to write a translator from 3-address format to C.
The purpose of this assignment is to

1. Get you familiar with intermediate formats used to represent programs
   within the compiler (in this particular case, our 3-address format).
2. Write a translator that can be used to test the correctness of code
   transformations that you will perform in future assignments.

Project Description
-------------------
'csc' is a simple compiler that compiles a subset of C to 3-address format.
Your goal is to write a translator to convert from this 3-address format back
to C. Note that the generated C code will not resemble the original C code.

Our High-Level Language (a subset of C)
---------------------------------------
This section describes the high-level language that we will write programs in.
While an EBNF grammar specification is provided below, it is by no means a
full specification. We hope to provide enough information to help you write
programs and understand the 3-address format output generated by the
"C-subset compiler" (csc).

Here are some features of our high-level language and compiler:

* Has a void datatype.
* Has only one integer data type 'long' with a size of 8 bytes.
* Does not support floating point.
* Supports structs and arrays.
* Has no pointers.
* Supports functions and recursion.
* Functions return void (i.e., nothing). The way to communicate the result of
  a function is using global variables.
* Parameters of functions can only be of type 'long'. That is, you cannot
  pass structs, arrays, etc.
* There are only two namespaces/scopes -- a global scope and a local scope.
* Does not have gotos. Has structured programming constructs, 'if' and 'while'.
* Arithmetic and comparison operators are +, -, *, /, %, <, <=, ==, !=, >=, >
* The compiler supports only a single source file and has no pre-processor.

The EBNF grammar for this language is given below.

    Factor = Designator | Number | "(" Expression ")".
    Term = Factor {("*" | "/" | "%") Factor}.
    SimpleExpr = ["+" | "-"] Term {("+" | "-") Term}.
    EqualityExpr = SimpleExpr [("<" | "<=" | ">" | ">=") SimpleExpr].
    Expression = EqualityExpr [("==" | "!=") EqualityExpr].
    ConstExpression = Expression.
    FieldList = VariableDeclaration {VariableDeclaration}.
    StructType = "struct" Ident ["{" FieldList "}"].
    Type = Ident | StructType.
    IdentArray = Ident {"[" ConstExpression "]"}.
    IdentList = IdentArray {"," IdentArray}.
    VariableDeclaration = Type IdentList ";".
    ConstantDeclaration = "const" Type Ident "=" ConstExpression ";".
    Designator = Ident {("." Ident) | ("[" Expression "]")}.
    Assignment = Designator "=" Expression ";".
    ExpList = Expression {"," Expression}.
    ProcedureCall = Ident "(" [ExpList] ")" ";".
    IfStatement = "if" "(" Expression ")" "{" StatementSequence "}"
                  ["else" "{" StatementSequence "}"].
    WhileStatement = "while" "(" Expression ")" "{" StatementSequence "}".
    Statement = [Assignment | ProcedureCall | IfStatement | WhileStatement].
    StatementSequence = {Statement}.
    FPSection = Type IdentArray.
    FormalParameters = FPSection {"," FPSection}.
    ProcedureHeading = Ident "(" [FormalParameters] ")".
    ProcedureBody = {ConstantDeclaration | VariableDeclaration}
                    StatementSequence.
    ProcedureDeclaration = "void" ProcedureHeading "{" ProcedureBody "}".
    Program = {ConstantDeclaration | VariableDeclaration}
              ProcedureDeclaration {ProcedureDeclaration}.

Number and Ident are terminal symbols. Program is the start symbol.

The 3-Address Intermediate Format
---------------------------------
The 3-address format is the output of the frontend of the compiler. You need
to understand this format, since you will be working with it in the current
and future assignments.
Before taking a look at the specification of the 3-address format, let us
look at a simple program and its corresponding representation.

    long array[34];
    struct Point {
      long x;
      long y;
    } p;

    void function(long a, long b)
    {
      long local_array[3];
      if (array[2] > a) {
        array[3] = b + p.y;
      }
      p.x = local_array[0];
    }

    instr 1: nop
    instr 2: enter 24
    instr 3: mul 2 8
    instr 4: add array_base#32496 GP
    instr 5: add (4) (3)
    instr 6: load (5)
    instr 7: cmple (6) a#24
    instr 8: blbs (7) [17]
    instr 9: mul 3 8
    instr 10: add array_base#32496 GP
    instr 11: add (10) (9)
    instr 12: add p_base#32480 GP
    instr 13: add (12) y_offset#8
    instr 14: load (13)
    instr 15: add b#16 (14)
    instr 16: store (15) (11)
    instr 17: add p_base#32480 GP
    instr 18: add (17) x_offset#0
    instr 19: mul 0 8
    instr 20: add local_array_base#-24 FP
    instr 21: add (20) (19)
    instr 22: load (21)
    instr 23: store (22) (18)
    instr 24: ret 16
    instr 25: nop

The 3-address format uses a simple RISC instruction set, and assumes an
infinite number of virtual registers. The operands to these instructions may
be one of the following:

* GP (global pointer): A pointer to the beginning of the global address
  space.
* FP (frame pointer): A pointer to the beginning of the frame of the current
  function.
* Constants: For example, 24
* Address offsets: E.g., array_base#32496 represents the starting address of
  variable "array" as an offset relative to the GP or FP. The name of the
  data item is merely given for clarity. What really matters is the offset
  32496.
* Field offsets: E.g., y_offset#8 represents the offset of a field (in this
  case, "y") of a structure. Similar to the example above, what matters is
  the offset 8. The name is simply given to make the output easier for a
  human to read.
* Local variables (scalars): E.g., a#24 represents a local variable and its
  offset within the stack frame. The offset should be ignored for now (it may
  be used later during register allocation). We are really only interested in
  the name of the variable 'a'.
  The variable can be assumed to be register allocated in virtual register
  'a'.
* Register name: E.g., (13) stands for virtual register r13. This is the
  register implicitly written to by instruction 13.
* Code label: E.g., [17] represents a code location to jump to (in this
  example, instruction 17).

The instruction set contains the following instructions:

* Arithmetic instructions: The opcodes add, sub, mul, div, mod, neg, cmpeq,
  cmple, and cmplt perform arithmetic operations on their operand(s) and
  write the result to a virtual register. For example,

    instr 13: add (12) y_offset#8

  stands for:

    r13 = r12 + 8

* Branch instructions: The opcodes are br, blbc, blbs, call. br stands for
  unconditional branch. blbc and blbs denote a branch if the register is
  cleared or set, respectively. The call is an unconditional "jump and link"
  instruction used for function calls. For example,

    instr 8: blbs (7) [17]

  stands for:

    if (r7 != 0) goto instr_17;

* Data movement instructions: load and store instructions read from and
  write to memory. The move instruction copies virtual registers. For
  example,

    instr 14: load (13)
    instr 15: move i#-8 j#-16
    instr 16: store (15) (11)

  represents:

    r14 = *(r13)
    j = i
    *(r11) = r15

* I/O instructions: read, write, wrl are pseudo opcodes that are used for
  input and output. read reads an integer from stdin, write prints an integer
  to stdout, and wrl prints a newline to stdout.
* The param instruction (used in conjunction with the call instruction)
  pushes its operand onto the stack, to be used by a function that is about
  to be called.

    instr 60: param (59)
    instr 62: param from#40
    instr 63: call [23]

  represents

    call function_23(r59, from);

* The enter instruction denotes the beginning of a function. Its operand
  specifies the amount of space in bytes to be allocated on the stack frame
  for local variables of that function.
* The ret instruction denotes a function return.
  Its operand specifies the space for formal parameters that needs to be
  removed from the stack. The following function

    void function(long p1, long p2, long p3)
    {
      long x;
      long y;
    }

  is compiled into

    instr 1: nop
    instr 2: enter 16
    instr 3: ret 24
    instr 4: nop

  This denotes the 16 bytes of storage for local variables x and y as well as
  the 24 bytes to be removed from the stack corresponding to formal
  parameters p1, p2 and p3.
* The entrypc instruction denotes the beginning of the 'main' function.
* The opcode nop does nothing.

Function Call ABI
-----------------
The addresses/offsets for the global variables, local variables, and formal
parameters have been calculated taking into account the following ABI. In
your implementation/translation you are free to follow any ABI. For instance,
you may choose to explicitly manage a stack to handle function calls, or use
C's call-return mechanism to directly handle function calls in the 3-address
code. Nevertheless, we mention here how the stack frame is organized to help
you understand what the various offsets stand for.

Structures and arrays are laid out from lower to higher addresses. Since our
only basic datatype occupies 8 bytes, all integers, structs and arrays are
8-byte aligned.

Global variables are laid out starting at address 32768 and grow downwards
towards zero. This implies that the maximum space for global variables is
32768 bytes. In your implementation, it would suffice to declare 32768 bytes
of storage to hold all globals. In the example given above, the global
variables are 'array[34]' and 'p'. Their starting addresses, represented as
offsets from GP, are array_base = 32496 = (32768 - (34*8)), and
p_base = 32480 = (32496 - 16), respectively.

The stack grows from higher to lower addresses. The stack frame for a
function looks like:

    Low      +---------------------+
    addr.    |                     |
             +---------------------+
        -32  |         ...         |
             +---------------------+ <---+
        -24  |      3rd local      |     |
             +---------------------+     |
        -16  |      2nd local      |     |
             +---------------------+     |
        - 8  |      1st local      |     |
             +---------------------+     |  Stack
    FP ----> |         FP          |     |  frame
             +---------------------+     |
          8  |         LNK         |     |
             +---------------------+     |
         16  |      2nd param      |     |
             +---------------------+     |
         24  |      1st param      |     |
             +---------------------+ <---+
             |                     |
             +---------------------+  previous frame.
    High     |                     |
    addr.    +---------------------+

Accordingly, in the example above, 'local_array' and the formal parameters
'a' and 'b' have the following offsets from FP:

    local_array_base: -24
    a:                 24
    b:                 16

3-Address to C Translator Implementation
----------------------------------------
Now that we have set the background by explaining our source language and
3-address format, let us move on to the actual project. You are expected to
write a translator that takes the 3-address code as input and generates C
code as output. The tarball

    http://www.cs.utexas.edu/users/pingali/CS380C/2007fa/assignments/assignment1/c-subset-compiler.tar.gz

contains the source for 'csc' and some example programs. The examples
directory also contains a script 'check.sh' to check if your translator works
correctly.

You are free to implement your translator in any programming language you are
comfortable with. The translator should compile (if necessary) and run on
UTCS Linux machines. You can test your translator with the example programs.
In addition, please provide one non-trivial source program of your own.

Turning In Your Assignment
--------------------------
Your assignment should contain the following.

1. A single tar.gz file named hw1.tar.gz, which, when extracted, creates
   directory hw1.
2. The hw1 directory can contain sub-directories.
3. The hw1 directory should contain the following files:
   a. example.c - Your non-trivial test source program.
   b. compile.sh - A script that compiles your source code, if needed. If you
      are using an interpreted language, this script can be empty.
   c. run.sh - A script that invokes your translator. Your translator should
      read 3-address code as input from stdin and print the generated C code
      to stdout.
   d. README - Please write your name and UTEID here.

The hw1 directory already exists with these files in the tarball you
downloaded.

Turn in your assignment by running the following commands on a UTCS Linux
machine.

    $ # Go to the parent directory of the hw1 directory.
    $ tar -zcvf hw1.tar.gz hw1
    $ turnin --submit suriya cs380c-hw1 hw1.tar.gz
    $ turnin --list suriya cs380c-hw1
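
As a possible starting point for the translator, the operand kinds listed in
the 3-address format section could be mapped to C expressions roughly as in
the sketch below. This is only one way to do it and not part of the handout:
the function name translate_operand and the naming scheme (rN for virtual
registers, instr_N for labels, C variables named GP and FP) are assumptions
made for illustration. The sketch also assumes that address and field offsets
can be recognized by the "_base" and "_offset" suffixes seen in the examples,
so a scalar whose own name ends in "_base" or "_offset" would be misread.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: rewrite one 3-address operand as a C expression.
 * Returns 0 on success, -1 if the result does not fit in out[0..n-1].
 * The output names (rN, instr_N, GP, FP) are assumptions of this sketch.
 */
int translate_operand(const char *op, char *out, size_t n)
{
    size_t len = strlen(op);
    const char *hash = strchr(op, '#');
    int written;

    if (len >= 3 && op[0] == '(' && op[len - 1] == ')') {
        /* (13) -> r13: the virtual register written by instruction 13 */
        written = snprintf(out, n, "r%.*s", (int)(len - 2), op + 1);
    } else if (len >= 3 && op[0] == '[' && op[len - 1] == ']') {
        /* [17] -> instr_17: a label that a goto can target */
        written = snprintf(out, n, "instr_%.*s", (int)(len - 2), op + 1);
    } else if (hash != NULL) {
        size_t name_len = (size_t)(hash - op);
        /* Assumption: offsets are marked by a _base or _offset suffix */
        int is_offset =
            (name_len >= 5 && strncmp(hash - 5, "_base", 5) == 0) ||
            (name_len >= 7 && strncmp(hash - 7, "_offset", 7) == 0);
        if (is_offset) {
            /* array_base#32496 -> 32496, y_offset#8 -> 8 */
            written = snprintf(out, n, "%s", hash + 1);
        } else {
            /* a#24 -> a: a scalar kept in a C variable of the same name */
            written = snprintf(out, n, "%.*s", (int)name_len, op);
        }
    } else {
        /* GP, FP and integer constants are emitted unchanged */
        written = snprintf(out, n, "%s", op);
    }
    return (written >= 0 && (size_t)written < n) ? 0 : -1;
}
```

With this mapping, an instruction such as "instr 13: add (12) y_offset#8"
could be emitted as "r13 = r12 + 8;", matching the example given for the
arithmetic instructions. The generated C code would still need to declare the
rN variables, the scalar locals, and GP and FP, and to cast addresses to
pointers for load and store.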