Homework 1: Writing a 3-address-code to C translator

Course:     CS 380C: Advanced Compiler Techniques (Fall 2007)
Instructor: Keshav Pingali
Assigned:   Tuesday, September 4, 2007
Due:        Thursday, September 13, 2007, 9:30 AM

Objective
---------
The goal of this project is to write a translator from 3-address format to
C. The purpose of this assignment is to
  1. Familiarize you with the intermediate formats used to represent
     programs within a compiler (in this case, our 3-address format).
  2. Give you a translator that can be used to test the correctness of the
     code transformations that you will perform in future assignments.

Project Description
-------------------
'csc' is a simple compiler that compiles a subset of C to 3-address format.
Your goal is to write a translator to convert from this 3-address format
back to C.  Note that the generated C code will not resemble the original C
code.

Our High-Level Language (a subset of C)
---------------------------------------
This section describes the high-level language that we will write programs
in. While an EBNF grammar specification is provided below, it is by no
means a full specification. We hope to provide enough information to help
you write programs and understand the 3-address format output generated by
the "C-subset compiler" (csc).

Here are some features of our high-level language and compiler:

 * Has a void datatype.
 * Has only one integer data type 'long' with a size of 8 bytes.
 * Does not support floating point.
 * Supports structs and arrays.
 * Has no pointers.
 * Supports functions and recursion.
 * Functions return void (i.e., nothing). A function communicates its
   result through global variables.
 * Parameters of functions can only be of type 'long'. That is, you cannot
   pass structs, arrays, etc.
 * There are only two namespaces/scopes -- a global scope and a local
   scope.
 * Does not have gotos. Has structured programming constructs, 'if' and
   'while'.
 * Arithmetic and comparison operators are +, -, *, /, %, <, <=, ==, !=,
   >=, and >.
 * The compiler supports only a single source file and has no
   pre-processor.

The EBNF grammar for this language is given below.

    Factor = Designator | Number | "(" Expression ")".
    Term = Factor {("*" | "/" | "%") Factor}.
    SimpleExpr = ["+" | "-"] Term {("+" | "-") Term}.
    EqualityExpr = SimpleExpr [("<" | "<=" | ">" | ">=") SimpleExpr].
    Expression = EqualityExpr [("==" | "!=") EqualityExpr].
    ConstExpression = Expression.

    FieldList = VariableDeclaration {VariableDeclaration}.
    StructType = "struct" Ident ["{" FieldList "}"].
    Type = Ident | StructType.
    IdentArray = Ident {"[" ConstExpression "]"}.
    IdentList = IdentArray {"," IdentArray}.
    VariableDeclaration = Type IdentList ";".
    ConstantDeclaration = "const" Type Ident "=" ConstExpression ";".

    Designator = Ident {("." Ident) | ("[" Expression "]")}.
    Assignment = Designator "=" Expression ";".
    ExpList = Expression {"," Expression}.
    ProcedureCall = Ident "(" [ExpList] ")" ";".
    IfStatement = "if" "(" Expression ")" "{" StatementSequence "}" 
                   ["else" "{" StatementSequence "}"].
    WhileStatement = "while" "(" Expression ")" "{" StatementSequence "}".
    Statement = [Assignment | ProcedureCall | IfStatement | WhileStatement].
    StatementSequence = {Statement}.

    FPSection = Type IdentArray.
    FormalParameters = FPSection {"," FPSection}.
    ProcedureHeading = Ident "(" [FormalParameters] ")".
    ProcedureBody = {ConstantDeclaration | VariableDeclaration} StatementSequence.
    ProcedureDeclaration = "void" ProcedureHeading "{" ProcedureBody "}".

    Program = {ConstantDeclaration | VariableDeclaration} ProcedureDeclaration {ProcedureDeclaration}.

    Number and Ident are terminal symbols. Program is the start symbol.
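
As an illustration of the features above, here is a small program that should
fit the grammar. It is only a sketch: the names gcd and gcd_result are made up
for this example, and the comments are for the reader (the grammar does not
say whether csc accepts comments). Because functions return void, the result
is passed back through a global variable.

```c
/* Holds the result of the most recent call to gcd. */
long gcd_result;

/* Computes the greatest common divisor of a and b (Euclid's algorithm)
   and leaves it in the global gcd_result. */
void gcd(long a, long b) {
    long t;
    while (b != 0) {
        t = a % b;
        a = b;
        b = t;
    }
    gcd_result = a;
}
```

A caller would invoke gcd(48, 18); and then read gcd_result.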

The 3-Address Intermediate Format
---------------------------------
The 3-address format is the output of the frontend of the compiler. You need
to understand this format, since you will be working with this format in the
current and future assignments. Before taking a look at the specification of
the 3-address format, let us look at a simple program and its corresponding
representation.

    long array[34];

    struct Point {
        long x;
        long y;
    } p;

    void function(long a, long b) {
        long local_array[3];
        if (array[2] > a) {
            array[3] = b + p.y;
        }
        p.x = local_array[0];
    }

    instr 1: nop
    instr 2: enter 24
    instr 3: mul 2 8
    instr 4: add array_base#32496 GP
    instr 5: add (4) (3)
    instr 6: load (5)
    instr 7: cmple (6) a#24
    instr 8: blbs (7) [17]
    instr 9: mul 3 8
    instr 10: add array_base#32496 GP
    instr 11: add (10) (9)
    instr 12: add p_base#32480 GP
    instr 13: add (12) y_offset#8
    instr 14: load (13)
    instr 15: add b#16 (14)
    instr 16: store (15) (11)
    instr 17: add p_base#32480 GP
    instr 18: add (17) x_offset#0
    instr 19: mul 0 8
    instr 20: add local_array_base#-24 FP
    instr 21: add (20) (19)
    instr 22: load (21)
    instr 23: store (22) (18)
    instr 24: ret 16
    instr 25: nop

The 3-address format uses a simple RISC instruction set, and assumes an
infinite number of virtual registers. The operands to these instructions
may be one of the following:
  * GP (global pointer): A pointer to the beginning of the global address
    space.
  * FP (frame pointer): A pointer to the beginning of the frame of the
    current function.
  * Constants: For example, 24
  * Address offsets: E.g., array_base#32496 represents the starting
    address of variable "array" as an offset relative to the GP or FP.  The
    name of the data item is merely given for clarity. What really matters
    is the offset 32496.
  * Field offsets: E.g., y_offset#8 represents the offset of a
    field (in this case, "y") of a structure. Similar to the example above,
    what matters is the offset 8. The name is simply given to make the
    output easier for a human to read.
  * Local variables (scalars): E.g., a#24 represents a local variable and
    its offset within the stack frame. The offset should be ignored for now
    (it may be used later during register allocation). We are really only
    interested in the name of the variable 'a'. The variable can be assumed
    to be register allocated in virtual register 'a'.
  * Register name: E.g., (13) stands for virtual register r13. This
    is the register implicitly written to by instruction 13.
  * Code label: E.g., [17] represents a code location to jump to (in this
    example, instruction 17).
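
The operand kinds above can be told apart from the text of the token alone.
The following is a minimal sketch of one way to do that in C. The enum and
function names are hypothetical, and telling address offsets, field offsets,
and local variables apart by the "_base" and "_offset" suffixes is an
assumption based on the example listing (a local whose own name ends in one
of these suffixes would be misclassified).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Operand kinds from the list above (names are hypothetical). */
enum operand_kind {
    OP_GP, OP_FP, OP_CONST, OP_ADDR_OFFSET, OP_FIELD_OFFSET,
    OP_LOCAL, OP_REGISTER, OP_LABEL, OP_UNKNOWN
};

/* Returns 1 if the first len characters of s end with suffix. */
static int ends_with(const char *s, size_t len, const char *suffix) {
    size_t n = strlen(suffix);
    return len >= n && strncmp(s + len - n, suffix, n) == 0;
}

/* Classifies one operand token. The numeric part (constant, offset,
   register number, or target instruction) is stored in *value. */
enum operand_kind classify_operand(const char *tok, long *value) {
    const char *hash;
    *value = 0;
    if (strcmp(tok, "GP") == 0) return OP_GP;
    if (strcmp(tok, "FP") == 0) return OP_FP;
    if (tok[0] == '(') {                 /* (13) -> virtual register r13 */
        *value = strtol(tok + 1, NULL, 10);
        return OP_REGISTER;
    }
    if (tok[0] == '[') {                 /* [17] -> jump to instruction 17 */
        *value = strtol(tok + 1, NULL, 10);
        return OP_LABEL;
    }
    hash = strchr(tok, '#');
    if (hash != NULL) {                  /* name#number */
        size_t name_len = (size_t)(hash - tok);
        *value = strtol(hash + 1, NULL, 10);
        if (ends_with(tok, name_len, "_base")) return OP_ADDR_OFFSET;
        if (ends_with(tok, name_len, "_offset")) return OP_FIELD_OFFSET;
        return OP_LOCAL;
    }
    if (isdigit((unsigned char)tok[0]) || tok[0] == '-') {
        *value = strtol(tok, NULL, 10);
        return OP_CONST;
    }
    return OP_UNKNOWN;
}
```

For a local variable such as a#24, a translator would typically keep the
name and ignore the stored offset, as described above.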

The instruction set contains the following instructions:

  * Arithmetic instructions: The opcodes add, sub, mul, div, mod, neg,
    cmpeq, cmple, and cmplt perform arithmetic operations on their
    operand(s) and write the result to a virtual register. For example,

         instr 13: add (12) y_offset#8

    stands for:

         r13 = r12 + 8

  * Branch instructions: The opcodes are br, blbc, blbs, and call. br is an
    unconditional branch. blbc and blbs branch if the register operand is
    cleared or set, respectively. call is an unconditional "jump and
    link" instruction used for function calls. For example,

         instr 8: blbs (7) [17]

    stands for:

         if (r7 != 0) goto instr_17;

   * Data movement instructions: load and store instructions read from and
     write to memory. The move instruction copies virtual registers. For
     example,

         instr 14: load (13)
         instr 15: move i#-8 j#-16
         instr 16: store (15) (11)

     represents:

         r14 = *(r13)
         j = i
         *(r11) = r15

   * I/O instructions: read, write, wrl are pseudo opcodes that are used
     for input and output. read reads an integer from stdin, write prints
     an integer to stdout, and wrl prints a newline to stdout.

   * The param instruction, used in conjunction with the call instruction,
     pushes its operand onto the stack as an argument for the function
     that is about to be called.

         instr 60: param (59)
         instr 62: param from#40
         instr 63: call [23]

     represents

         function_23(r59, from);

   * The enter instruction denotes the beginning of a function. Its operand
     specifies the amount of space in bytes to be allocated on the stack
     frame for local variables of that function.

   * The ret instruction denotes a function return. Its operand specifies
     the space for formal parameters that needs to be removed from the
     stack.  The following function

         void function(long p1, long p2, long p3) {
             long x;
             long y;
         }

     is compiled into

         instr 1: nop
         instr 2: enter 16
         instr 3: ret 24
         instr 4: nop

     This denotes the 16 bytes of storage for local variables x and y as
     well as the 24 bytes to be removed from the stack corresponding to
     formal parameters p1, p2 and p3.

   * The entrypc instruction denotes the beginning of the 'main' function.

   * The opcode nop does nothing.
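
To tie the instructions together, here is a hand-written sketch of what
generated C for the example function at the start of this section could look
like. It is one possible shape, not the required output. The names mem, MEM,
word, and function_2 are hypothetical, parameters are passed as ordinary C
arguments instead of on an explicit stack, and 'long long' stands in for the
8-byte 'long' of the source language.

```c
typedef long long word;              /* 8-byte integer (assumption) */

#define GLOBAL_BYTES 32768
#define STACK_BYTES  8192

/* One flat memory: globals in [0, 32768), the stack above them. */
static word mem[(GLOBAL_BYTES + STACK_BYTES) / 8];
#define MEM(addr) (*(word *)((char *)mem + (addr)))

static word GP = 0;                           /* base of the globals */
static word SP = GLOBAL_BYTES + STACK_BYTES;  /* stack grows downwards */
static word FP;

void function_2(word a, word b) {
    word r3, r4, r5, r6, r7, r9, r10, r11, r12, r13, r14, r15;
    word r17, r18, r19, r20, r21, r22;
    word saved_fp = FP, saved_sp = SP;

    FP = SP;                      /* instr 2: enter 24 */
    SP = FP - 24;
    r3 = 2 * 8;                   /* instr 3 */
    r4 = 32496 + GP;              /* instr 4: array_base */
    r5 = r4 + r3;                 /* instr 5 */
    r6 = MEM(r5);                 /* instr 6: load array[2] */
    r7 = (r6 <= a);               /* instr 7: cmple */
    if (r7 != 0) goto instr_17;   /* instr 8: blbs */
    r9 = 3 * 8;                   /* instr 9 */
    r10 = 32496 + GP;             /* instr 10 */
    r11 = r10 + r9;               /* instr 11: address of array[3] */
    r12 = 32480 + GP;             /* instr 12: p_base */
    r13 = r12 + 8;                /* instr 13: y_offset */
    r14 = MEM(r13);               /* instr 14: load p.y */
    r15 = b + r14;                /* instr 15 */
    MEM(r11) = r15;               /* instr 16: store */
instr_17:
    r17 = 32480 + GP;             /* instr 17: p_base */
    r18 = r17 + 0;                /* instr 18: x_offset */
    r19 = 0 * 8;                  /* instr 19 */
    r20 = -24 + FP;               /* instr 20: local_array_base */
    r21 = r20 + r19;              /* instr 21 */
    r22 = MEM(r21);               /* instr 22: load local_array[0] */
    MEM(r18) = r22;               /* instr 23: store to p.x */
    SP = saved_sp;                /* instr 24: ret 16 */
    FP = saved_fp;
}
```

Each virtual register becomes a C variable, each branch target becomes a
label, and load and store become reads and writes of the flat memory. In this
sketch the operands of enter and ret only matter for the explicit stack
pointer, since C itself takes care of the parameters.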

Function Call ABI
-----------------
The addresses/offsets for the global variables, local variables, and formal
parameters have been calculated taking into account the following ABI. In
your implementation/translation you are free to follow any ABI. For
instance, you may choose to explicitly manage a stack to handle function
calls, or use C's call-return mechanism to directly handle function calls
in the 3-address code.  Nevertheless, we mention here how the stack frame
is organized to help you understand what the various offsets stand for.

Structures and arrays are laid out from lower to higher addresses.  Since
our only basic datatype occupies 8 bytes, all integers, structs and arrays
are 8-byte aligned.

Global variables are laid out starting at address 32768 and growing
downwards towards zero. This implies that the maximum space for global
variables is 32768 bytes. In your implementation, it would suffice to
declare 32768 bytes of storage to hold all globals. In the example given
above, the global variables are 'array[34]' and 'p'. Their starting
addresses, represented as offsets from GP, are

       array_base = 32496 = (32768 - (34*8)), and
       p_base     = 32480 = (32496 - 16), respectively.
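
The arithmetic above amounts to a downward-growing allocator. A minimal
sketch, with hypothetical names (alloc_global, next_global) and assuming every
size is already a multiple of 8 bytes:

```c
#define GLOBAL_SPACE 32768   /* bytes available for global variables */

/* Offset (from GP) of the most recently placed global. */
static long next_global = GLOBAL_SPACE;

/* Reserves size_bytes for the next global variable and returns its
   starting offset from GP. Globals are placed in declaration order,
   each one directly below the previous one. */
long alloc_global(long size_bytes) {
    next_global -= size_bytes;
    return next_global;
}
```

Calling alloc_global(34 * 8) for 'array' and then alloc_global(16) for 'p'
reproduces the two offsets shown above.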

The stack grows from higher to lower addresses. The stack frame for a
function looks like:

  Low        +---------------------+
  addr.      |                     |
             +---------------------+
   -32       |       ...           |
             +---------------------+ <---+
   -24       |     3rd local       |     |
             +---------------------+     |
   -16       |     2nd local       |     |
             +---------------------+     |
   - 8       |     1st local       |     |
             +---------------------+     |   Stack
   FP ---->  |        FP           |     |   frame
             +---------------------+     |
     8       |        LNK          |     |
             +---------------------+     |
    16       |     2nd param       |     |
             +---------------------+     |
    24       |     1st param       |     |
             +---------------------+ <---+
             |                     |
             +---------------------+   previous frame.
  High       |                     |
  addr.      +---------------------+

Accordingly, in the example above, 'local_array' and the formal parameters
'a' and 'b' have the following offsets from FP:

     local_array_base: -24
     a:                 24
     b:                 16
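
The offsets in the diagram can be written as two small formulas. The sketch
below uses hypothetical function names. The diagram only shows a function
with two parameters, so the general parameter formula is an assumption
extrapolated from it.

```c
#define WORD 8   /* size in bytes of every slot in the frame */

/* Offset from FP of the i-th formal parameter (1-based) of a function
   with n parameters. The last parameter sits just above the saved FP
   and LNK slots, at offset 16. */
long param_offset(long i, long n) {
    return 2 * WORD + (n - i) * WORD;
}

/* Offset from FP of a local variable, given the number of bytes taken
   by the locals declared before it and its own size in bytes. */
long local_offset(long bytes_before, long size_bytes) {
    return -(bytes_before + size_bytes);
}
```

For the example, param_offset(1, 2) and param_offset(2, 2) give the offsets
of 'a' and 'b', and local_offset(0, 3 * WORD) gives local_array_base.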

3-Address to C Translator Implementation
----------------------------------------
Now that we have set the background by explaining our source language and
3-address format, let us move on to the actual project. You are expected to
write a translator that takes the 3-address code as input and generates
C-code as output. The tarball
  http://www.cs.utexas.edu/users/pingali/CS380C/2007fa/assignments/assignment1/c-subset-compiler.tar.gz
contains the source for 'csc' and some example programs. The examples
directory also contains a script 'check.sh' to check if your translator
works correctly.

You are free to implement your translator in any programming language you
are comfortable with. The translator should compile (if necessary) and run
on UTCS Linux machines.
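
If you choose C, a first step is to split each input line into its parts.
Judging from the example listing, a line has the form
"instr N: opcode [operand [operand]]". The sketch below relies on that
assumption. The struct and function names are hypothetical, as are the buffer
sizes.

```c
#include <stdio.h>

#define OPCODE_MAX  16
#define OPERAND_MAX 64

/* One parsed line of 3-address code. */
struct instr {
    int  num;                  /* instruction number */
    char opcode[OPCODE_MAX];   /* e.g. "add", "blbs" */
    char op1[OPERAND_MAX];     /* first operand, or "" */
    char op2[OPERAND_MAX];     /* second operand, or "" */
    int  noperands;            /* 0, 1, or 2 */
};

/* Parses one line into *out. Returns 1 on success and 0 if the line
   does not look like an instruction. */
int parse_instr(const char *line, struct instr *out) {
    int fields;
    out->op1[0] = '\0';
    out->op2[0] = '\0';
    /* sscanf returns the number of fields it managed to fill. */
    fields = sscanf(line, " instr %d: %15s %63s %63s",
                    &out->num, out->opcode, out->op1, out->op2);
    if (fields < 2) {
        return 0;
    }
    out->noperands = fields - 2;
    return 1;
}
```

A translator could call parse_instr on every line read from stdin with fgets
and then emit one C statement per instruction.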

You can test your translator with the example programs. In addition,
please provide one non-trivial source program of your own.

Turning In Your Assignment
--------------------------
Your assignment should contain the following.
  1. A single tar.gz file named hw1.tar.gz, which, when extracted, creates
     directory hw1.
  2. The hw1 directory can contain sub-directories.
  3. The hw1 directory should contain the following files:
     a. example.c  - Your non-trivial test source program.
     b. compile.sh - A script that compiles your source code, if needed. If
                     you are using an interpreted language, this script can
                     be empty.
     c. run.sh     - A script that invokes your translator. Your translator
                     should read 3-address code as input from stdin and
                     print the generated C code to stdout.
     d. README     - Please write your name and UTEID here.
The hw1 directory already exists with these files in the tarball you
downloaded.

Turn in your assignment by running the following commands on a UTCS Linux
machine.
   $ # Go to the parent directory of the hw1 directory.
   $ tar -zcvf hw1.tar.gz hw1
   $ turnin --submit suriya cs380c-hw1 hw1.tar.gz
   $ turnin --list suriya cs380c-hw1