Instruction Scheduling for Emerging Communication-Exposed Architectures

Ramadass Nagarajan  Doug Burger  Kathryn S. McKinley  Calvin Lin  Stephen W. Keckler
Department of Computer Sciences, University of Texas at Austin

ABSTRACT
Technology trends present new challenges to instruction schedulers and processor architectures. Although increasing transistor counts will enable numerous execution units on a single chip, decreasing wire transmission speeds will cause on-chip latencies to increase to tens of cycles. Conventional architectures, including static VLIWs and dynamic superscalars, and their instruction schedulers are not capable of meeting these challenges. This paper proposes a new instruction scheduling algorithm for emerging wire-dominated architectures that features critical-path instruction selection, placement of instructions to minimize communication distances, and load balancing across the distributed execution units. We evaluate the algorithm on the Grid Processor Architecture (GPA), which supports a hybrid execution model that requires static instruction placement but allows dynamic execution. The combination of this novel scheduling algorithm with the GPA results in the highest demonstrated instruction-level parallelism to date, sustaining over 8 instructions per cycle on a 64-wide issue machine.

Categories and Subject Descriptors
C.0 [Computer Systems Organization]: General—Hardware/software interfaces

General Terms
Design, Performance, Algorithms, Measurements

Keywords
instruction scheduling, ILP

1. INTRODUCTION

The two dominant architectural models, VLIW and dynamic superscalar, take extreme views on the role of static instruction scheduling. The VLIW model relies on the compiler to schedule independent operations to each wide instruction, and it requires guarantees from the compiler that no dependences will be violated. The dynamic superscalar model instead gives the hardware the freedom to execute any instruction on any appropriate ALU, as long as it obeys the original program dependences. Thus, the VLIW model depends entirely on the static scheduler, while the dynamic superscalar pushes much of the complexity of scheduling into the hardware.

While instruction scheduling is well-understood in these existing contexts, the role of instruction schedulers promises to change, as two important technology trends change the future of microprocessor design. First, poor clock scaling will result in wider-issue processors. Second, shrinking devices will result in slow on-chip wires that cause variable multi-cycle on-chip access latencies. Before discussing the impact of these changes on instruction schedulers, we first describe their implications for the superscalar and VLIW models.

The trend towards wider-issue machines, or greater numbers of ALUs, causes problems for both architectural models. Wider-issue is infeasible for superscalar architectures because of the quadratic growth in hardware complexity that occurs when the issue width increases. While VLIW hardware does extend to wider issue, VLIW machines are unable to exploit more ALUs for two reasons. First, the compiler can seldom find enough parallelism to schedule them explicitly because of aliases and other static unknowns, and second, the machine stalls when any of the instructions in the previous VLIW instruction are not complete.

The trend towards larger on-chip latencies also affects the two architectural models. As wires grow smaller with each technology generation, their delay increases. This trend is causing multicycle latencies to appear throughout the processor, such as between ALUs and between ALUs and registers, thus challenging the scalability of each execution model. Clustered VLIW schedulers can model these latencies, but the larger latencies place increased demands on the scheduler, which must try to find additional parallelism to hide them. Wire delays also complicate the task of the superscalar hardware scheduler, and will quickly make it intractable in hardware.

Thus, technology trends are creating a scheduling problem that is too complex for either the compiler or the hardware to solve alone. These trends demand a cooperative approach in which the compiler exposes parallelism and schedules for locality, and the hardware tolerates dynamic latencies and exploits additional parallelism at runtime. While neither the VLIW nor the dynamic superscalar model supports this type of cooperation, there are many possible design points between these two extremes. In this paper, we evaluate one of these design points and study its impact on instruction scheduling.

This paper presents a scheduling algorithm that uses a critical-path node listing, reprioritizes instructions after placement, and explicitly models latencies and parallelism between ALUs. We show that with a Static Placement, Dynamic Issue (SPDI) execution model, these scheduling heuristics can greatly reduce the communication-induced critical execution paths among the ALUs. We show that our machine and scheduler improve instructions per cycle (IPC) by up to a factor of 10 compared with a VLIW and its scheduler, by exploiting both the schedule and runtime adaptability. Our scheduler uses an extensible architectural model and evaluation functions that we use to tease apart the effects and interactions of six compiler heuristics. We also evaluate a variety of interconnection topologies and latencies for the SPDI processor. For instance, we show that a tightly packed schedule with more exposed latencies executes much faster than a more loosely packed schedule. We explore the portability of our schedules to smaller numbers of processors to demonstrate the relatively low performance penalty of binary compatibility on this hardware. By finding the right balance between flexibility for the scheduler and complexity in the hardware, we sustain significantly higher instruction level parallelism (ILP) than previously reported in the literature.

The remainder of this paper is organized as follows. Section 2 describes the technology trends that will affect future architecture and scheduler combinations, along with the related work that addresses these emerging challenges. Section 3 describes Grid Processor Architectures, which we use as a case study to evaluate schedulers for future SPDI architectures. Section 4 describes our base top-down greedy scheduling algorithm and the instruction placement optimizations required to reduce distance and latency. Section 5 analyzes the interaction between the scheduler and inter-ALU interconnection network. Section 6 incorporates considerations of different routing latencies and placement constraints into the scheduler. Finally, Section 7 studies the sensitivity of the scheduler’s performance to exact knowledge of the routing latencies, and proposes a strategy for generating an abstract schedule that is not bound to a single particular implementation. We summarize and conclude in Section 8.

2. BUILDING AND SCHEDULING SCALABLE ARCHITECTURES

In this section, we discuss the role of static instruction scheduling for future scalable architectures. We first describe trends in architecture. We then describe how out-of-order superscalar processors [29, 15] tolerate imprecision in the scheduler, but their architectural complexity prevents them from scaling. We then explain how VLIW [7, 25] and partitioned VLIW processors [16, 23] are architecturally scalable, but rely too heavily on perfection in the scheduler to achieve their promise. Section 3 describes a scalable architecture, which benefits from schedules that exploit parallelism and minimize latencies between ALUs, but tolerates imperfection from the scheduler. We thus hit a sweet spot in the architecture and scheduler co-design space that exploits good static schedules, but mitigates the latencies the compiler cannot predict.

2.1 Technology Trends

Clock Rates: The bulk of microprocessor performance improvements over the past twelve years have come from clock speed increases, at the rate of 40% per year. Clock speeds will continue to increase, but at a slower rate as microprocessors start to hit technology limits [27]. Despite much effort from the research community, the number of instructions executed per cycle on conventional architectures is dropping, not increasing, as a consequence of the faster clock speeds [1, 24]. This drop requires new architectural mechanisms, including multiple ALUs that communicate asynchronously and directly.

Wire Delay: Another consequence of higher clock rates and technology limits is wire delay. In previous microprocessor generations, the processor could reach the entire chip in a fraction of a cycle. However, as wire cross section shrinks, the wires become slower. As a result, future chips will require up to tens of cycles just to route a signal across the chip [1, 12]. To achieve high throughput, the scheduler must consider the latencies between ALUs and shared resources as a new first order constraint. For example, the scheduler should try to place dependent instructions on the same ALU (e.g., 0 cycle latency) or an adjacent ALU (e.g., 1 cycle) since non-adjacent ALUs will incur more latency (e.g., 2 to 20 cycles). The microarchitecture will also need to mitigate the effects of non-uniform delays throughout the processor.

The two current approaches taken to tolerate wire delays are deeper pipelines, as in the Pentium IV, in which two of the twenty pipeline stages are used solely for routing signals along wires, and clustering, as in the Alpha 21264, which maintains two copies of the register file, broadcasting results between them, to reduce register file size and permit partitioning in the processing core. Unfortunately, partitioning the core further causes large drops in instruction-level parallelism (ILP) [2, 3, 6].

2.2 Scalable Architecture Design

To sustain performance improvements, designers must build architectures that achieve high ILP. To motivate our approach, we discuss the problems of scaling the two major classes of ILP machines: dynamic superscalar processors, which issue instructions out of order with respect to the compiler schedule; and VLIW architectures, which obey the strict static schedule dictated by the compiler.

Dynamic Superscalar Processors: Recent papers show that clock speed increases are forcing superscalar processors to their limit, since pipelines can be made only so deep before overall performance starts to drop [1, 24]. Superscalar architectures scale poorly to wide-issue machines because of the quadratic growth (with the issue width) in comparators that check for data dependences, and bypass networks that route results from ALU outputs to consumer instructions. Despite previous predictions in the literature, a 4-wide machine is the widest that has been built to date.

Dynamic superscalar processors select ready instructions and issue them in any order. A wide instruction fetch unit and a large instruction window are in theory all that are needed to exploit available ILP. In practice, the scheduler needs to move high and variable latency instructions as early as possible in the basic block [14, 19] so that the architecture can use dynamically discovered ILP to hide the latencies. Superscalar processors also need the compiler to produce large basic blocks uninterrupted by control flow, and so far, even with techniques such as unrolling and inlining, compilers have not delivered. The typical scheduler for a pipelined architecture uses a greedy approach [10, 22] based on the critical path through the instruction dependence graph (a DAG). It models fixed instruction delays, resource constraints, and other hazards, but typically does not model dynamic events such as cache miss latencies or branch misprediction [26]. For an n-wide machine, the scheduler tries to produce a schedule in which n independent instructions issue every cycle [5]. Balanced scheduling [14] improves over classic scheduling by computing the amount of ILP available in the DAG, and using it to hide the high latency instructions. One advantage of a superscalar architecture is that it can compensate if the compiler does not get the exact order of the instructions right [19].

VLIW: VLIW architectures, conversely, rely on the compiler, not the hardware, to discover and schedule ILP [7, 9, 13, 17, 25]. The compiler guarantees that all instructions placed in a VLIW long instruction word are independent of one another and that their operands are ready to read upon issue. The classic VLIW scheduler also takes a greedy approach [8, 9, 17]. It exploits parallelism by building a ready set in which all the instructions can issue in parallel. It then fills the current long instruction word. If there is a choice, it selects the instructions on the critical path first. Software pipelining [17] and similar algorithms focus on loops. They try to find a steady-state pattern, across loop iterations, that fills all of the issue slots of the minimum number of VLIW instructions.

Although the hardware complexity scales linearly with issue width, the problem facing VLIW architectures is that, despite advanced techniques such as predication [20], trace scheduling [8], and treegion formation [11], it is difficult for the compiler to find enough instructions to pack into wide instruction words at compile time. Worse, unpredictable latencies such as cache misses force the entire machine to stall. The hardware thus scales, but VLIW schedulers have proven incapable of finding enough ILP to outperform superscalar processors.

These problems are thus pushing the two hardware/software approaches towards each other, and there is a wide range of possible solutions. We next describe our choice in the middle.

3. GRID PROCESSOR ARCHITECTURES

The Grid Processor is an emerging architecture that can be considered a hybrid between statically scheduled (VLIW) and dynamically issued (superscalar) architectures [21]. Figure 1 shows the components of a 4×4 grid processor, composed of 16 instruction execution units connected via a thin operand routing network. The instruction cache, register file, and data cache are placed around the perimeter of the ALU array. Each ALU includes an integer unit, a floating point unit, an operand router, and an instruction buffer (storage) for multiple instructions and their operands. Unlike a queue, instructions in these buffers may execute in any order, but an instruction may not execute until all of its operands arrive. While the diagram shows the routing network as a 2-dimensional mesh, the actual topology depends on both hardware and software constraints.

The grid processor is a static placement, dynamic issue processor (SPDI) in which a scheduler statically assigns instructions to ALUs and instruction buffers, but the hardware issues the instructions in dataflow order. The grid processor compiler uses hyperblock generation techniques [20] to create large monolithic blocks of instructions, and then independently schedules each block to the grid. Once a hyperblock has been mapped to the grid, the hardware reads its input registers from the register file and injects them into the grid. Upon arrival at an ALU, these values trigger instructions to fire, which on completion then distribute their results through the operand network to other ALUs. Instructions that produce block outputs write their values back to the register file. The hardware transmits temporary values that are only live within a block directly from producer to consumer, without writing them back to the register file. This strategy helps decouple register allocation from instruction scheduling, but the scheduler may be required to replicate a reused value and schedule the instructions that distribute it to multiple consuming instructions. Address computations for load and store instructions execute within the grid but transmit the addresses (and data values for stores) to the data cache banks. The cache banks send the loaded values back into the grid via the operand network.
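The dataflow-order firing rule described above can be illustrated with a toy model. This is only a sketch, not the GPA hardware: the function name `fire_in_dataflow_order`, the dictionary encoding of a block, and the latencies are assumptions made for illustration, and the model ignores operand routing delays and ALU contention.

```python
def fire_in_dataflow_order(block, inputs):
    """Return {instruction: issue_cycle} under idealized dataflow issue.

    block:  {name: (operand_names, latency)}  (hypothetical encoding)
    inputs: block inputs injected from the register file at cycle 0
    """
    ready_at = {name: 0 for name in inputs}  # cycle at which each value exists
    fired = {}
    pending = dict(block)
    while pending:
        progressed = False
        for name, (operands, latency) in list(pending.items()):
            # An instruction may fire only once all of its operands have arrived.
            if all(op in ready_at for op in operands):
                issue = max((ready_at[op] for op in operands), default=0)
                fired[name] = issue
                ready_at[name] = issue + latency
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError("block has a cycle or a missing input")
    return fired


block = {
    "add": (["r1", "r2"], 1),
    "mul": (["add", "r3"], 3),
    "sub": (["r1", "r3"], 1),   # independent of add/mul, so it fires at once
    "st":  (["mul", "sub"], 1),
}
schedule = fire_in_dataflow_order(block, inputs=["r1", "r2", "r3"])
print(schedule)
```

Note that the listing order of the instructions does not determine when they fire; only operand arrival does, which is what lets the hardware tolerate latencies the static placement did not anticipate.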

Each ALU contains a fixed number of instruction buffer slots. In the grid processor, the corresponding slots across ALUs are collectively called a frame. Thus a 4×4 grid processor with 128 instruction buffer entries at each ALU has 128 frames of 16 instructions each. A subset of contiguous frames constitutes an architecture frame, which is exposed to the compiler for scheduling and placement of a hyperblock. Consequently, dividing 128 frames into 8 architecture frames composed of 16 physical frames would allow the scheduler to map a total of 256 instructions at once to the ALU array. The grid processor hardware uses adjacent architecture frames to speculatively map and execute subsequent hyperblocks concurrently with the non-speculatively executing hyperblock. This technique is very important to good performance, as we illustrate in Section 6. The number of instructions within a single architecture frame represents the size of the instruction window available to the static scheduler. The number of instructions spanning the non-speculative and the speculative frames corresponds to the size of the dynamic scheduling window. In superscalar processors, this window is centralized and relatively small (80-100 instructions), while in the grid processor this window is distributed and can be quite large (thousands of instructions).
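The frame arithmetic in this paragraph can be checked with a few lines of code. The helper name `frame_layout` and its parameter names are hypothetical; the numbers are the 4×4, 128-slot configuration from the text.

```python
def frame_layout(rows, cols, slots_per_alu, frames_per_arch_frame):
    """Return (architecture_frames, static_window, dynamic_window)."""
    num_alus = rows * cols
    physical_frames = slots_per_alu      # one frame per buffer-slot index
    instrs_per_frame = num_alus          # a frame holds one slot on every ALU
    arch_frames = physical_frames // frames_per_arch_frame
    static_window = frames_per_arch_frame * instrs_per_frame
    dynamic_window = physical_frames * instrs_per_frame
    return arch_frames, static_window, dynamic_window


layout = frame_layout(rows=4, cols=4, slots_per_alu=128, frames_per_arch_frame=16)
print(layout)  # (8, 256, 2048)
```

The static window (256 instructions) is what the scheduler sees for one hyperblock, while the dynamic window (2048 instructions here) spans all speculative and non-speculative frames.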

The principal issues for the design of the instruction scheduler for a SPDI architecture such as the grid processor include the following:

- **Physical locality**: In the grid processor, the physical distances between the ALUs, the register file, and the cache banks represent latency. Maximizing performance of this architecture requires the scheduler to place instructions to minimize the communication latencies between dependent instructions and between instructions and the register file and cache banks.

- **Operand network topology**: Maintaining a simple topology and fast routing in the operand network is at odds with a schedule that minimizes routing distance and number of hops in the network. Designing the routing network and static scheduler in concert will balance router speed with connectivity that the scheduler can easily exploit.

- **Fixed frame space**: Since the number of instruction slots (frames) is fixed, the hardware and software must balance frame size with the number of speculative frames. Larger architecture frames may enable a better schedule since the scheduler has more degrees of freedom in placing instructions. However, Section 6 shows that more tightly packed frames increase the number of speculative frames and result in better performance. Thus, the challenge for the scheduler is to create efficient schedules in the smallest number of frames it can.

4. CRITICAL PATH SCHEDULING

In this section, we describe our scheduling algorithm for the Grid Processor. Structurally, our algorithm resembles a greedy VLIW scheduler. It takes as input a group of instructions and a description of the processor model, including communication latencies between different structures. It outputs an assignment of instructions to ALUs. We first describe a simple extension of a VLIW scheduler, which will serve as our baseline GPA scheduler. We then augment the algorithm with several heuristics that improve performance for a given processor configuration.

4.1 Basic GPA Scheduling Algorithm

A classic VLIW scheduler computes the initial root set of ready instructions. It chooses an instruction \( i \) based on its critical path, and puts the instruction in a VLIW instruction. The scheduler packs as many ready instructions into the current VLIW instruction as it can. After it schedules an instruction \( i \), it adds to the ready set any of \( i \)'s children whose ancestors have already been scheduled. While a VLIW scheduler assigns each instruction to an ALU and a time slot, a scheduler in the SPDI model assigns each instruction to an ALU without specifying a time slot. The algorithm of such a scheduler is shown below.

```plaintext
S = top_down_greedy_sort(hyperblock H);
foreach instruction i in sorted list S {
  R = find_legal_reservation_stations(i);
  if |R| = 0, Frames++, Reschedule();
  E = sort_reservation_stations(R);
  Slot(i) = prioritize(E);
}
```

This scheduler first produces an instruction list prioritized by critical path height. We use static instruction latencies and assume no cache misses exist when computing the critical path heights. For the unscheduled instruction \( i \) with the highest priority, the scheduler computes the set of legal reservation stations \( R \). A reservation station \( rs \) specifies an ALU and one of the slots in the instruction window associated with the ALU. As we will describe in Section 5, the exact interconnection topology defines the set of legal reservation stations. For the most general interconnection framework, the mesh (see Figure 2(d)), all open slots are legal. Other topologies restrict this set to only those reservation stations that are reachable from the locations where the parents of \( i \) have been scheduled.

If no legal reservation station is available, the scheduler adds to the pool of total reservation stations by increasing the number of frames; it then attempts to reschedule the entire block. If several reservation stations are available, the scheduler chooses the one that is closest to all parents. In particular, the selected slot is one that minimizes the Score:

$$\text{Score}(rs) = \max_{p}\,\{\text{CompleteTime}(p) + \text{Distance}[rs_p, rs]\}$$

Here \( p \) ranges over the parents of \( i \), \( rs_p \) is the reservation station at which \( p \) was placed, \( CompleteTime(p) \) is the expected time at which \( p \) will produce its result, and \( Distance[rs_p, rs] \) is the number of communication hops required to route \( p \)'s result from \( rs_p \) to \( rs \). \( Score(rs) \) is simply the earliest time at which \( i \) can issue at \( rs \). Ties are broken by choosing a slot that is closer to the data caches. We use this algorithm in Section 5 to evaluate the effect of different topological optimizations. We describe a number of additional heuristics to improve our scheduling decisions in the following subsection. We use the mesh topology because it is the simplest, and our results show that it provides the best performance.
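A minimal sketch of this placement rule follows. It assumes a mesh in which the hop count between two ALUs is their Manhattan distance, one cycle per hop, and data caches to the right of the last column; the function names `manhattan`, `score`, and `pick_station` are hypothetical and do not come from the scheduler itself.

```python
def manhattan(a, b):
    """Hop count between two (row, col) ALU coordinates on a mesh (assumed)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def score(candidate, parents):
    """Earliest cycle at which an instruction could issue at `candidate`.

    parents: list of ((row, col), complete_time) for already placed parents
    """
    return max((done + manhattan(loc, candidate) for loc, done in parents),
               default=0)


def pick_station(free_alus, parents, num_cols):
    # Lowest score wins; ties go to the ALU nearest the data caches, which
    # this sketch assumes sit beside the rightmost column.
    return min(free_alus,
               key=lambda alu: (score(alu, parents), num_cols - 1 - alu[1]))


parents = [((0, 3), 2), ((1, 1), 1)]
free = [(0, 0), (1, 2), (1, 3), (3, 3)]
best = pick_station(free, parents, num_cols=4)
print(best, score(best, parents))  # (1, 3) 3
```

In this example the candidate at (1, 3) is one hop from the first parent and two hops from the second, so both operands arrive at cycle 3, earlier than at any other free ALU.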

4.2 Scheduler Optimizations

We describe three kinds of optimizations that try to balance the twin objectives of maximizing parallelism—scheduling independent instructions on different ALUs—and minimizing communication—scheduling consumers physically close to producers.

- **Locality-Aware Optimizations**: minimize communication latencies along all dataflow paths. In particular, the scheduler attempts to schedule load instructions and dependents of loads closer to data caches. In addition, it attempts to schedule instructions that produce register outputs closer to the register files.

- **Contention Optimizations**: maximize instruction-level parallelism. The scheduler attempts to schedule independent instructions on different ALUs.

- **Ordering Optimizations**: expose critical paths in the program. The scheduler gives priority to all instructions on the critical path, and it updates critical path information after each step.

The augmented algorithm is as follows:

```plaintext
Frames = ceil(|H|/num_alus);
S = top_down_criticality_sort(hyperblock H);
foreach instruction i in sorted list S {
  R = find_legal_reservation_stations(i);
  foreach rs in R {
    IssueTime(rs) = ReadyTime(rs)+Contention(rs)
    CompleteTime(i,rs) = IssueTime(rs)+Latency(i)
    Score(rs) = IssueTime(rs)+Lookahead(i)*weight
  }
  E = sort_reservation_stations(R);
  Slot(i) = prioritize(E);
}
```

We first set the number of scheduling frames at the minimum required number. For example, a block of 150 instructions would require 3 frames on an 8x8 array of ALUs. We then obtain the initial list of instructions based on the critical path metric. According to this metric, the instructions are sorted by the maximum depth of any descendant in the dataflow graph. For example, if there are only two dataflow chains in a block, \( A \rightarrow B \rightarrow C \rightarrow D \), and \( E \rightarrow F \rightarrow G \), the criticality metric would sort these instructions as \( A, B, C, D, E, F, G \), whereas classical greedy sorters would choose the order \( A, B, E, C, F, D, G \). The advantage of criticality-based sorting is that every instruction on the critical path will be scheduled first, minimizing communication latencies along the critical path.
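The two orderings in this example can be reproduced with a short sketch. The key functions are one plausible reading of the text, not the authors' implementation: the greedy order sorts by height (longest path to a leaf), and the criticality order sorts by the depth of the deepest descendant, breaking ties by the instruction's own depth. The name `make_orders` is hypothetical.

```python
import math
from functools import lru_cache


def make_orders(children):
    """Return (greedy_order, criticality_order) for a DAG {node: [children]}."""
    parents = {n: [] for n in children}
    for n, kids in children.items():
        for k in kids:
            parents[k].append(n)

    @lru_cache(maxsize=None)
    def depth(n):      # 1-based distance from a root
        return 1 + max((depth(p) for p in parents[n]), default=0)

    @lru_cache(maxsize=None)
    def height(n):     # longest path from n down to a leaf
        return 1 + max((height(k) for k in children[n]), default=0)

    @lru_cache(maxsize=None)
    def deepest(n):    # depth of the deepest descendant of n (or of n itself)
        return max([depth(n)] + [deepest(k) for k in children[n]])

    nodes = sorted(children)
    greedy = sorted(nodes, key=lambda n: -height(n))
    critical = sorted(nodes, key=lambda n: (-deepest(n), depth(n)))
    return greedy, critical


chains = {"A": ["B"], "B": ["C"], "C": ["D"], "D": [],
          "E": ["F"], "F": ["G"], "G": []}
greedy, critical = make_orders(chains)
print("".join(greedy))    # ABECFDG
print("".join(critical))  # ABCDEFG

# Minimum frame count for a 150-instruction block on an 8x8 array.
frames = math.ceil(150 / (8 * 8))
print(frames)  # 3
```

Both orders keep every parent ahead of its children; they differ only in whether the longer chain is placed in full before the shorter one is started.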

Next, we compute the score for every reservation station $rs$. We incorporate all locality- and contention-aware optimizations in this score. Loads and consumers of loads are placed close to the data caches by augmenting the dataflow graph with a pseudo memory instruction and fixing the placement of this instruction. For example, a dependence edge in the DFG, $A \rightarrow B$, where $A$ is a load instruction, is changed to $A \rightarrow M \rightarrow B$, where $M$ has a fixed schedule at a position one hop away from the rightmost column of the grid.
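The graph rewrite described above can be sketched as follows. The function name `add_memory_nodes`, the `M_<load>` naming, the use of one pseudo node per load, and the caller-supplied `mem_position` are all assumptions made for illustration; the text only states that the pseudo instruction has a fixed placement near the cache side of the grid.

```python
def add_memory_nodes(edges, loads, mem_position):
    """Rewrite each edge A -> B whose producer A is a load as A -> M -> B.

    edges:        list of (producer, consumer) pairs in the DFG
    loads:        set of instruction names that are loads
    mem_position: fixed placement given to every pseudo memory node (assumed)
    Returns (new_edges, pinned), where pinned maps pseudo nodes to positions.
    """
    new_edges = []
    pinned = {}
    for src, dst in edges:
        if src in loads:
            mem = f"M_{src}"                  # hypothetical pseudo-node name
            if mem not in pinned:
                pinned[mem] = mem_position    # fixed, not chosen by the scheduler
                new_edges.append((src, mem))
            new_edges.append((mem, dst))
        else:
            new_edges.append((src, dst))
    return new_edges, pinned


edges = [("ld", "add"), ("ld", "mul"), ("add", "st")]
new_edges, pinned = add_memory_nodes(edges, loads={"ld"}, mem_position=(0, 8))
print(new_edges)
print(pinned)
```

Because the pseudo node is pinned, the ordinary distance-based score then pulls both the load and its consumers toward the caches without any special case in the scoring code.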

Since the hardware can issue at most one instruction at every ALU in a given cycle, instructions expected to be ready at the same time are placed on different nodes. The augmented scheduler performs this function by keeping track of estimated busy times of each ALU. $\text{ReadyTime}(rs)$ is the same as the score computed in the basic algorithm described in Section 4.1. The term $\text{Contention}(rs)$ denotes any expected additional delay cycles at the ALU due to contention.
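One way to track estimated busy times is sketched below. It assumes each ALU records the set of cycles at which previously placed instructions are expected to issue, and that a new instruction slips to the next free cycle; the function name `issue_time` and this representation are hypothetical.

```python
def issue_time(ready_time, busy_cycles):
    """Return (issue_cycle, contention) for an ALU issuing one instruction per cycle.

    ready_time:  cycle at which all operands are expected to have arrived
    busy_cycles: cycles at which this ALU is already expected to issue
    """
    cycle = ready_time
    while cycle in busy_cycles:
        cycle += 1          # slip past cycles already claimed at this ALU
    return cycle, cycle - ready_time


busy = {(0, 0): {3, 4}, (0, 1): set()}
latency = 1
for alu, cycles in busy.items():
    issue, contention = issue_time(3, cycles)
    print(alu, "issue", issue, "contention", contention,
          "complete", issue + latency)
```

Here an instruction ready at cycle 3 would wait two cycles on the busy ALU but issue immediately on the idle one, so the contention term steers it to the idle ALU when the communication distances are comparable.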

The greedy algorithm as described thus far computes scores based only on past history, namely the schedule of prior placed instructions. However, such an algorithm may perform poorly if instructions that produce register outputs are scheduled far away from the register file, because a block cannot be committed until all register outputs have been committed. We avoid this problem by incorporating a lookahead factor into the score.

$$\text{Lookahead}(i) = \frac{\text{distance to child}}{\text{candidate row}} + \frac{\text{candidate row}}{\text{distance to child}}$$

For dataflow chains that lead to a register output, this metric attempts to simultaneously choose slots further away from the registers for instructions early in the dataflow chain and to choose slots closer to the registers for instructions later in the dataflow chain.

The algorithm then selects the reservation station with the lowest score. The remaining unscheduled instructions are re-sorted if any critical paths have changed, and the whole procedure repeats until all instructions have been scheduled.

5. EVALUATION OF TOPOLOGICAL OPTIMIZATIONS

In an SPDI model, the scheduler places instructions at ALUs (nodes) based upon the node’s availability and the location of the resources with which it must communicate. An interconnection network with rich topology, such as a crossbar, enables the scheduler to minimize the latency due to hops in the network. However, a fully connected network requires each router to have a multitude of ports, which reduces router speed. While other restrictions on instruction placement may reduce the number of bits required to encode an instruction, this section examines the effect of routing topology on the scheduler’s ability to exploit concurrency.

5.1 Topology and Scheduling Trade-offs

Figure 2 shows the four topologies that we consider. We adapted the scheduling algorithm previously described for each topology described below. In every case, the scheduler begins placing instructions in the first frame of the upper right-hand node in the ALU grid, since it is close to both the caches (on the right) and the register files (on the top of the grid).

**GPA-M**: The GPA-M interconnect, shown in Figure 2a, bears the most resemblance to a VLIW architecture, in which VLIW instructions are essentially mapped to rows, with the following instruction mapped to the succeeding row. In each row, all instructions belonging to the same frame must be independent, and dependent instructions are placed in lower rows. To simplify the router, each node is connected to only the three nodes directly below it in the next row (below, below-left, and below-right). When a VLIW instruction is mapped to the bottom row, the following VLIW instruction is mapped to the top row in the next frame. Express channels made of fat wires route results from the last row to the top row at high speed. The schedule for GPA-M is effectively an unrolled VLIW schedule, in which the VLIW words are converted from the time dimension to being unrolled along the rows and frames.

The scheduler takes the greedily-sorted list and places the independent instructions in the rightmost slot of the earliest packet possible. The scheduler searches each row in succession to see if all dependences have been satisfied and if the node is reachable from its parents. If no nodes are available in any of the rows, the scheduler tries to place the instruction in the ensuing frame. To minimize latency, the scheduler seeks to place dependent instructions in adjacent rows and columns. As many frames are allocated as necessary to schedule the entire hyperblock.

**GPA-MZ**: The GPA-MZ interconnect, shown in Figure 2b, enhances GPA-M by permitting a producing instruction to forward its result to a consumer mapped to the same ALU. This strategy allows routing delays between adjacent rows to be eliminated for many producer-consumer pairs, at the cost of an additional bypass path in the hardware. In the scheduling algorithm, the class of legal reservation stations is extended to include the node on which an instruction’s parent resides, as long as that node is reachable from all of the instruction’s other parents. The earliest reservation station remains the highest logical row in the schedule. For example, if an instruction $i$ had parents placed in row 3 and row 4, then the best position would be a different frame on the same node as the parent in row 4, assuming it was reachable from the parent in row 3.

To determine the number of frames required for the schedule, the scheduler sets the initial frame count to be the minimum required to hold all of the instructions in the hyperblock. For exam-
corporates many VLIW-specific optimizations such as control flow
sensitive VLIW optimizations. The Trimaran compiler is based on the
In addition to the usual set of classic optimizations, Trimaran in-
ting a consumer to be placed anywhere, regardless of the location
Illinois Impact compiler [4]. It produces code in the intermediate

5.2 Evaluation Methodology
The execution substrate is an 8x8 array of ALUs, which can also be
primary caches, and a 2MB level-two cache.

GPA-Tbar: The GPA-Tbar topology, shown in Figure 2c, has
the same number of router ports as GPA-MZ (including the Z-
dimension bypass), but instead connects to the nodes right, left,
and down from the producing node. This topology permits dependence
chains to be routed horizontally and along the frames before
being routed downward, potentially increasing the utilization of the
array for blocks that have long, narrow DFGs. Since dependent in-
structions can be placed within the same row, the highest row is not
necessarily the best choice. For example, placing a consumer one
hop down from its parent in the next row results in fewer hops than
placing it three hops to the left of its parent. The best reservation
station is obtained using the algorithm described in 4.

**GPA-Mesh**: The GPA-Mesh topology, shown in Figure 2d, augments GPA-Tbar with an “upward” router link at each node, allowing each node to send an operand to any of its nearest Manhattan neighbors. This topology is the most flexible of those we evaluate, allowing a consumer to be placed anywhere, regardless of the location of its parents. The advantage of this organization is that hyperblocks can be packed into the minimal number of frames because the scheduler never fails to find a legal assignment. The routers in GPA-Mesh require an additional port and are likely to run slightly slower than those in GPA-Tbar. The criteria for selecting a node from the legal candidate nodes are identical to those in the GPA-Tbar configuration.
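The difference between the GPA-Tbar and GPA-Mesh links can be illustrated with a small hop-count sketch. It is not the scheduler's actual cost function: the function name `hops`, the `(row, column)` node encoding with row 0 at the top, and the absence of wrap-around links are all assumptions made for illustration. A consumer on the producer's own node costs zero hops, standing in for the Z-dimension bypass.

```python
from typing import Optional, Tuple

Node = Tuple[int, int]  # (row, column); row 0 is assumed to be the top row


def hops(topology: str, producer: Node, consumer: Node) -> Optional[int]:
    """Return the router hops from producer to consumer, or None if unreachable."""
    d_row = consumer[0] - producer[0]
    d_col = abs(consumer[1] - producer[1])
    if topology == "mesh":
        # Links in all four directions: plain Manhattan distance.
        return abs(d_row) + d_col
    if topology == "tbar":
        # Links go left, right, and down only, so rows above are unreachable.
        if d_row < 0:
            return None
        return d_row + d_col
    raise ValueError(f"unknown topology: {topology}")
```

With a parent at `(3, 4)`, `hops("tbar", (3, 4), (4, 4))` is 1 while `hops("tbar", (3, 4), (3, 1))` is 3, which mirrors the one-hop-down versus three-hops-left example above. A consumer at `(2, 4)` is unreachable under `"tbar"` but one hop away under `"mesh"`.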

To compare performance across different execution models, substrates, and schedules, we measured instructions per cycle (IPC) using a custom, detailed, cycle-accurate simulator. This tool models both VLIW and GPA microarchitectures, including realistic latencies through the processor, and simulates instruction fetching, branch prediction, the cache hierarchy, contention for ALUs, register file accesses, and branch mispredictions. We assumed 64KB primary caches and a 2MB level-two cache.

In this study, we use a subset of the SPEC2000 [28] and Mediabench [18] benchmark suites. The Trimaran front end currently compiles only C benchmarks, so we converted a number of the SPECFP benchmarks to C, and we present results for all of the SPEC benchmarks that the Trimaran tools compiled successfully.

5.3 Scheduler vs. Topology Results
These experiments use the basic scheduler without the placement heuristics, which are not applicable to a VLIW model. We add these to the scheduler in the next section. Figure 3 shows the performance of these four instruction placement schemes, broken down into integer benchmarks on the left-hand graph and floating-point
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Base</th>
<th>C</th>
<th>CR</th>
<th>CRA</th>
<th>CRAL</th>
<th>CRAI</th>
<th>RALO</th>
</tr>
</thead>
<tbody>
<tr>
<td>adpcm</td>
<td>1.76</td>
<td>1.70</td>
<td>1.71</td>
<td>1.66</td>
<td>1.84</td>
<td>1.80</td>
<td>1.77</td>
</tr>
<tr>
<td>ammp</td>
<td>7.07</td>
<td>7.08</td>
<td>7.24</td>
<td>7.09</td>
<td>7.33</td>
<td>7.32</td>
<td>7.19</td>
</tr>
<tr>
<td>bzip2</td>
<td>5.63</td>
<td>5.44</td>
<td>5.64</td>
<td>5.75</td>
<td>5.68</td>
<td>5.96</td>
<td>6.03</td>
</tr>
<tr>
<td>compr</td>
<td>3.95</td>
<td>3.97</td>
<td>4.01</td>
<td>4.00</td>
<td>4.22</td>
<td>4.24</td>
<td>4.26</td>
</tr>
<tr>
<td>dft</td>
<td>1.89</td>
<td>1.92</td>
<td>1.92</td>
<td>1.95</td>
<td>2.01</td>
<td>1.96</td>
<td>2.02</td>
</tr>
<tr>
<td>equake</td>
<td>2.93</td>
<td>3.02</td>
<td>2.95</td>
<td>2.95</td>
<td>2.94</td>
<td>3.18</td>
<td>2.97</td>
</tr>
<tr>
<td>gzip</td>
<td>2.38</td>
<td>2.41</td>
<td>2.43</td>
<td>2.43</td>
<td>2.55</td>
<td>2.57</td>
<td>2.68</td>
</tr>
<tr>
<td>m88ksim</td>
<td>7.11</td>
<td>6.96</td>
<td>7.58</td>
<td>7.76</td>
<td>8.30</td>
<td>7.76</td>
<td>7.57</td>
</tr>
<tr>
<td>mcf</td>
<td>1.01</td>
<td>1.12</td>
<td>1.12</td>
<td>1.10</td>
<td>1.20</td>
<td>1.21</td>
<td>1.05</td>
</tr>
<tr>
<td>mgrid</td>
<td>12.68</td>
<td>11.87</td>
<td>11.91</td>
<td>11.82</td>
<td>17.91</td>
<td>17.94</td>
<td>15.05</td>
</tr>
<tr>
<td>mpeg2</td>
<td>5.88</td>
<td>5.91</td>
<td>5.82</td>
<td>6.02</td>
<td>6.26</td>
<td>6.46</td>
<td>6.23</td>
</tr>
<tr>
<td>parser</td>
<td>1.56</td>
<td>1.56</td>
<td>1.58</td>
<td>1.58</td>
<td>1.66</td>
<td>1.66</td>
<td>1.69</td>
</tr>
<tr>
<td>swim</td>
<td>12.26</td>
<td>21.21</td>
<td>20.24</td>
<td>22.22</td>
<td>19.69</td>
<td>18.74</td>
<td>18.70</td>
</tr>
<tr>
<td>tomcatv</td>
<td>15.92</td>
<td>14.38</td>
<td>17.83</td>
<td>18.16</td>
<td>16.83</td>
<td>18.26</td>
<td>18.40</td>
</tr>
<tr>
<td>twolf</td>
<td>2.66</td>
<td>2.55</td>
<td>2.56</td>
<td>2.54</td>
<td>2.47</td>
<td>2.70</td>
<td>2.76</td>
</tr>
<tr>
<td>vortex</td>
<td>5.97</td>
<td>6.11</td>
<td>5.74</td>
<td>5.73</td>
<td>6.70</td>
<td>6.63</td>
<td>7.06</td>
</tr>
<tr>
<td><strong>MEAN</strong></td>
<td><strong>6.90</strong></td>
<td><strong>7.28</strong></td>
<td><strong>7.58</strong></td>
<td><strong>7.81</strong></td>
<td><strong>8.05</strong></td>
<td><strong>8.08</strong></td>
<td><strong>7.88</strong></td>
</tr>
</tbody>
</table>

Table 1: Performance improvements from scheduler optimizations.

Finally, the results show additional performance gains from the GPA-Mesh configuration. While upward routing does not decrease communication latencies, it does permit a schedule to fit completely in the minimal number of frames. This capability permits the densest possible mapping of instructions to frames, allowing more speculative hyperblocks to be mapped onto the grid. These results indicate that the advantages of a richer network topology, which gives the scheduler more flexibility, likely outweigh the disadvantages of more router ports. GPA-Mesh shows an average performance improvement of 20% over the less flexible, VLIW-like GPA-M organization.

6. SCHEDULER OPTIMIZATIONS

In this section, we evaluate the scheduler optimizations described in Section 4 for the GPA-Mesh configuration. Recall that those optimizations try to better match the static schedules to the actual latencies experienced by the critical path at run-time.

6.1 Evaluation of Schedule Optimizations

Table 1 shows the performance results of the optimizations in different combinations. The first column, labeled Base, refers to our baseline GPA scheduler, which is similar to a greedy VLIW scheduler. Successive columns correspond to a scheduler with one or more of the following additions to the baseline scheduler: C = order by critical path, R = recompute critical paths after each placement, A = model contention at each ALU, L = schedule loads and consumers of loads closer to data caches, and O = use lookahead to schedule instructions that produce register outputs closer to the register file.

The first set of experiments examines the effect of instruction priority in the scheduling algorithm by comparing the greedy ordering (Base), the criticality ordering (C), and criticality with recomputation of the critical path during scheduling (CR). In the absence of other optimizations, instruction priority is not a significant factor: greedy outperforms the critical-path order in three benchmarks, while the critical-path order provides significantly higher performance on another three, improving performance by nearly 80% in swim. However, recomputing the critical paths improves criticality ordering, yielding the best performance on 12 of the 19 benchmarks and averaging a 9% improvement over baseline and a 4% improvement over the critical-path ordering.
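As a rough illustration of criticality ordering (C), the sketch below ranks instructions by the longest latency-weighted path from each instruction to the end of its dataflow graph. This is a generic list-scheduling priority, not necessarily the exact one used by the scheduler; the function name `critical_path_order` and the `children` and `latency` dictionaries are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List


def critical_path_order(children: Dict[str, List[str]],
                        latency: Dict[str, int]) -> List[str]:
    """Order instructions by the longest latency-weighted path to any leaf."""

    @lru_cache(maxsize=None)
    def height(node: str) -> int:
        # An instruction's height is its own latency plus its tallest child.
        below = max((height(c) for c in children.get(node, [])), default=0)
        return latency[node] + below

    # Tallest first; ties are broken by name so the order is deterministic.
    return sorted(latency, key=lambda n: (-height(n), n))


# Hypothetical four-instruction dataflow graph: ld -> add -> st and mul -> st.
children = {"ld": ["add"], "add": ["st"], "mul": ["st"], "st": []}
latency = {"ld": 3, "add": 1, "mul": 2, "st": 1}
print(critical_path_order(children, latency))  # ['ld', 'mul', 'add', 'st']
```

Under the CR variant, these heights would be recomputed after each placement, so that the communication latencies implied by the placements made so far feed back into the priorities.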

Augmenting the scheduler with a contention model to improve load balance across the ALUs (CRA) tends to improve performance. Recall that this optimization explicitly attempts to migrate independent instructions to different ALUs. While this optimization is not important for integer programs, which exhibit little parallelism, it affects performance significantly on some floating-point benchmarks. Averaged across the entire benchmark suite, this optimization improves performance by 2.5% over critical-path recomputation (CR).

Finally, we apply the locality-aware optimizations that consider the placement of load instructions (CRAL) and instructions that write the register file (CRALO). As seen in Table 1, optimizing for loads and consumers of loads consistently provides better performance, with large gains on 6 benchmarks, indicating the importance of such optimizations. Optimizing the placement of instructions that write the register file, on the other hand, does not have much effect on performance in the presence of the load placement optimization.

6.2 Code Density Optimizations

In this section, we evaluate the effect of using different numbers of frames to schedule a block. Using the minimum number of frames allows the scheduler to pack the instructions densely. Because the hardware provides only a finite number of reservation stations at each ALU, a dense packing enables several speculative blocks to be mapped and executed, providing a large window of instructions from which to extract instruction-level parallelism. Dense schedules also have the benefit of good instruction memory performance. However, providing more frames to a block creates more opportunities to schedule critical-path instructions on the same ALU, thus minimizing communication latencies along the critical path.

Table 2 explores this trade-off. The first row shows the IPC when the scheduler is allowed only the minimum number of frames for each block it schedules. The second line in the same row shows the average number of frames used by the blocks that occur during execution. For example, the benchmark *equake* exhibits an IPC of 3.18 when using dense schedules, which use 1.5 frames per block on average. The second row shows the corresponding results when the scheduler uses twice the minimum number of frames, and the last row corresponds to the case in which the scheduler is free to use as many frames as the length of the longest dataflow chain.
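The three frame budgets compared here can be sketched as follows. The function name `frame_budget` and the policy labels are hypothetical, and the sketch assumes that one frame holds one instruction per ALU, so the minimum is the instruction count divided by the number of ALUs, rounded up. Clamping the longest-chain budget to at least the minimum is a further assumption so that the block always fits.

```python
import math


def frame_budget(num_instructions: int, num_alus: int,
                 policy: str, longest_chain: int) -> int:
    """Frames granted to one block under the three policies compared in the text."""
    # Dense packing: one reservation station per ALU per frame (assumed).
    minimum = math.ceil(num_instructions / num_alus)
    if policy == "min":
        return minimum
    if policy == "double":
        return 2 * minimum
    if policy == "infinite":
        # Assumption: never fewer than the minimum, so the block always fits.
        return max(minimum, longest_chain)
    raise ValueError(f"unknown policy: {policy}")
```

For a hypothetical 100-instruction block with a longest dataflow chain of 12 instructions on a 64-ALU array, the budgets are 2, 4, and 12 frames respectively.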

As can be seen, using dense schedules provides superior performance on most benchmarks. On a small set of benchmarks (*gzip*, *compress*), performance improves marginally with sparse schedules. On these benchmarks, we observed low branch prediction rates, and hence the available reservation stations were not fully utilized during execution. The sparse schedules benefited from using those unused reservation stations, minimizing the communication along the critical path. When we use still sparser schedules (Infinite), performance deteriorates further, a fact exacerbated in some floating-point benchmarks that are inherently latency-tolerant.

7. SCHEDULING FOR COMPATIBILITY

A major drawback of traditional VLIW architectures, and SPSI machines in general, is a lack of object code compatibility across generations. If the issue width of the hardware or the run-time latencies assumed by the compiler change, the binaries must be rescheduled. Since static-issue architectures do not dynamically check instruction dependences or change instruction placements, the compiler must make static assumptions about the topology and latencies to ensure correct execution.

Conversely, SPDI architectures do not enforce a rigorous static schedule and still produce correct program execution even if the dynamic latencies differ from those used at compile time. In this section, we examine the sensitivity of the performance of an SPDI machine to exact knowledge of dynamic communication latencies. Our results show that while the best schedules are usually produced when the static and dynamic latencies match, the performance degradation when they mismatch is typically less than 10%. We also describe a method of dynamically mapping a schedule created for a large SPDI machine onto a machine with fewer ALUs. We evaluate the performance degradation resulting from a mismatch in issue width and show that a single schedule may run effectively across machines with different issue widths, requiring no translation or recompilation.

7.1 Sensitivity to Wire Latencies

We evaluated the sensitivity of the scheduler to specific wire delays by scheduling for a fixed delay and then simulating that schedule on an implementation with different delays. We then compared those results to our original results, in which the scheduler knows the communication delays precisely. Table 3 shows the performance, measured in IPC, of programs scheduled for an 8x8 GPA-Mesh configuration, with two rows for each of four single-hop communication latencies ranging from 0.5 to 3 cycles. The top row in each pair is the absolute IPC when each benchmark is scheduled and executed with matching static and dynamic latencies. The second row shows the difference in IPC when the program is scheduled for a one-cycle latency but run using the varying latencies. Only one row is shown for the one-cycle hop latency, since in that case the static and dynamic latencies always match.

Negative numbers in ΔIPC indicate that performance degrades when the static and dynamic latencies mismatch, while positive
<table>
<thead>
<tr>
<th>scheduled</th>
<th>adpcm</th>
<th>ammp</th>
<th>art</th>
<th>bzip2</th>
<th>compress</th>
<th>dct</th>
<th>equake</th>
<th>gzip</th>
<th>hydro2d</th>
<th>m88ksim</th>
</tr>
<tr>
<th>latencies</th>
<th>IPC (%)</th>
<th>IPC (%)</th>
<th>IPC (%)</th>
<th>IPC (%)</th>
<th>IPC (%)</th>
<th>IPC (%)</th>
<th>IPC (%)</th>
<th>IPC (%)</th>
<th>IPC (%)</th>
<th>IPC (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.5 cycles</td>
<td>1.80</td>
<td>0.73</td>
<td>1.59</td>
<td>0.13</td>
<td>1.96</td>
<td>20.87</td>
<td>3.09</td>
<td>0.00</td>
<td>9.47</td>
<td>7.76</td>
</tr>
<tr>
<td>1.0 cycle</td>
<td>1.38</td>
<td>0.27</td>
<td>1.52</td>
<td>13.50</td>
<td>2.70</td>
<td>0.00</td>
<td>6.91</td>
<td>5.74</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2.0 cycles</td>
<td>0.93</td>
<td>3.85</td>
<td>1.95</td>
<td>10.73</td>
<td>1.94</td>
<td>1.27</td>
<td>4.41</td>
<td>3.56</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3.0 cycles</td>
<td>0.66</td>
<td>3.02</td>
<td>2.66</td>
<td>0.75</td>
<td>6.93</td>
<td>1.35</td>
<td>0.84</td>
<td>3.22</td>
<td>2.50</td>
<td></td>
</tr>
</tbody>
</table>

Table 3: Sensitivity of the SPDI schedule to wire dynamic routing latencies.

numbers indicate performance improvements. For example, the IPC of dct scheduled and run with 2-cycle hop latencies is 10.73. When it is scheduled for 1-cycle hop latencies and run with 2-cycle latencies, the IPC drops to 10.5, a loss of 2.1%. The overall results show that if the wires are faster than those for which the code was scheduled (0.5 cycles), the performance is virtually the same. If the wires are slower (2 or 3 cycles per hop), performance drops by only 2–3%. Some benchmarks, such as bzip2 and mgrid, are extremely sensitive to the schedule, and their performance actually improves when the wires are slower than those assumed by the scheduler. This somewhat surprising result demonstrates the sensitivity of run-time performance to compile-time assumptions about the hardware, and we are continuing to investigate its causes.
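The dct figure can be checked by hand, assuming that ΔIPC is the relative change with respect to the run whose static and dynamic latencies match (the text does not state the formula explicitly, so this definition is an assumption):

$$
\Delta\mathrm{IPC} = \frac{\mathrm{IPC}_{\text{mismatched}} - \mathrm{IPC}_{\text{matched}}}{\mathrm{IPC}_{\text{matched}}} \times 100\% = \frac{10.5 - 10.73}{10.73} \times 100\% \approx -2.1\%
$$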

7.2 Sensitivity to Issue Width

SPDI architectures can achieve cross-generation compatibility by further virtualizing the nodes in the ALU array. Since the GPA-Mesh topology is completely connected, instructions may be assigned anywhere and the schedule will still produce the correct result. Virtualization is achieved by scheduling the code for a large array (for example, the 64-wide 8x8 GPA used in this section), which can then be run on a smaller array by dynamically mapping the instructions from multiple nodes in the larger virtual array to a single node in the smaller physical array. This virtualization strategy requires that enough frames be available on the physical array to store all of the instructions from the schedule of the virtual array. For example, an instruction block consuming 2 frames in an 8x8 array may consist of up to 128 instructions. Mapping this block onto a 4x4 array, which holds 16 instructions per frame, would require 8 frames of storage on the smaller array. The mapping function can be performed entirely in hardware by interpreting the instruction placement addresses, specified as coordinates in the X, Y, and Z dimensions of the virtual grid, differently on grids of different sizes. As an example, an address of <1,7,1> representing row 1, column 7, frame 1 on the 8x8 virtual grid can be mapped to <0,3,7> on the 4x4 array by reinterpreting the same binary address, <001,111,1>, as <00,11,111>.

We evaluated the performance losses incurred by running programs scheduled for a larger 8x8 array on smaller arrays. Table 4 shows the performance degradation as compared to codes explicitly scheduled for the smaller arrays. The rows labeled “IPC” show the raw instructions per clock when the scheduler knows the exact topology of the grid. The rows labeled ΔIPC show the change in performance when running the 8x8 schedule on each smaller grid size indicated. Results show that performance drops an average of 5% when running an 8x8 schedule on an 8x4 array (8 rows, 4 columns), 17% when running on a 4x4 array, and 22% when running on a 4x2 array. However, those performance drops are negligible when compared to the performance gains that can be achieved on programs with substantial parallelism by migrating to larger grid dimensions. Thus a compatibility path can be provided by scheduling all codes for large arrays, initially running them on small arrays, and achieving performance improvements by incrementally migrating to larger arrays until reaching the grid size for which the applications were originally scheduled.
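The address reinterpretation described in this section can be sketched in a few lines. The helper names `pack`, `unpack`, and `remap` are hypothetical, as are the field widths: 3 row bits, 3 column bits, and 1 frame bit for the 8x8 virtual grid, and 2, 2, and 3 bits for the 4x4 physical grid. These widths were chosen because they reproduce the <1,7,1> to <0,3,7> example; the actual hardware encoding may differ.

```python
from typing import Tuple

Fields = Tuple[int, int, int]  # (row, column, frame)


def pack(fields: Fields, widths: Fields) -> int:
    """Concatenate (row, column, frame) into one integer, row in the high bits."""
    value = 0
    for field, width in zip(fields, widths):
        if not 0 <= field < (1 << width):
            raise ValueError(f"field {field} does not fit in {width} bits")
        value = (value << width) | field
    return value


def unpack(value: int, widths: Fields) -> Fields:
    """Split an integer back into (row, column, frame) using the given widths."""
    fields = []
    for width in reversed(widths):
        # Peel fields off the low end: frame first, then column, then row.
        fields.append(value & ((1 << width) - 1))
        value >>= width
    row, column, frame = reversed(fields)
    return row, column, frame


def remap(address: Fields, virtual_widths: Fields,
          physical_widths: Fields) -> Fields:
    """Reinterpret a virtual-grid address on a smaller physical grid."""
    if sum(virtual_widths) != sum(physical_widths):
        raise ValueError("both layouts must use the same number of bits")
    return unpack(pack(address, virtual_widths), physical_widths)


print(remap((1, 7, 1), (3, 3, 1), (2, 2, 3)))  # (0, 3, 7)
```

No bits are moved or recomputed: the same 7-bit string is simply split at different boundaries. The two-frame block on the 8x8 grid therefore occupies all eight frames of the 4x4 grid, matching the storage requirement worked out above.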

8. CONCLUSIONS

Conventional architectures sit at opposite ends of the spectrum with regard to their demands on the scheduler. While superscalar architectures can improve some schedules through dynamic scheduling hardware and can see some benefit from good instruction schedulers, performance is ultimately constrained by the limited instruction window size. At the other end of the spectrum, VLIW architectures demand that the compiler place every instruction and schedule every latency. Such demands are unrealistic in the face of uncertain memory latencies and statically uncertain aliases. A hybrid approach that allows the scheduler to place instructions for good locality while also allowing the hardware to dynamically execute the instructions (overlapping instruction latencies and other unknown latencies) can produce better performance. Such approaches will become even more important as technology trends make communication more critical due to increased wire delays.

We have implemented and evaluated a scheduler for such an emerging architecture. Because the hardware dynamically executes the instructions, the scheduler is freed from the burden of precise scheduling constraints. Instead, its job is to expose the concurrency in the instruction stream and place the instructions to minimize communication overheads. Our scheduling algorithm is able to achieve a tightly packed schedule using a minimum number of instruction slots, while still minimizing latency and balancing the load across the ALUs, thus eliminating hot spots where too many independent instructions have been placed.

We have evaluated our scheduler on a 64-issue processor and examined the interplay between the hardware constraints and the scheduler’s capabilities. We show that the freedom provided by a mesh interconnect topology allows the scheduler to expose more concurrency while minimizing the number of hops, resulting in a 27% performance improvement over more restrictive topologies.

We demonstrate that accounting for distances, not just between ALUs, but also to the register file and cache banks, is critical for performance. An algorithm to estimate instruction execution times was necessary to improve load balancing and to help place independent instructions on different nodes. Finally, we show that instruction criticality is important and that iteratively updating the estimated critical path during the instruction placement process provides a 10% boost in performance over a single priority listing. Combining the strengths of static scheduling with the advantages of dynamic issue will be critical to achieve performance in emerging wire-dominated technologies.

9. REFERENCES


[23] Y. Qian, S. Carr, and P. Sweany. Optimizing loop

<table>
<thead>
<tr>
<th>grid dimensions</th>
<th>adpcm</th>
<th>ammp</th>
<th>art</th>
<th>bzip2</th>
<th>compress</th>
<th>dct</th>
<th>equake</th>
<th>gzip</th>
<th>hydro2d</th>
<th>m88ksim</th>
</tr>
</thead>
<tbody>
<tr>
<td>IPC - 8x8</td>
<td>1.80</td>
<td>7.32</td>
<td>3.96</td>
<td>4.24</td>
<td>1.96</td>
<td>20.87</td>
<td>3.18</td>
<td>2.37</td>
<td>9.47</td>
<td>7.76</td>
</tr>
<tr>
<td>IPC - 8x4</td>
<td>1.81</td>
<td>6.67</td>
<td>5.77</td>
<td>4.26</td>
<td>1.91</td>
<td>17.81</td>
<td>3.04</td>
<td>2.68</td>
<td>9.46</td>
<td>7.66</td>
</tr>
<tr>
<td>IPC - 4x4</td>
<td>1.89</td>
<td>5.88</td>
<td>5.32</td>
<td>4.26</td>
<td>2.11</td>
<td>12.31</td>
<td>2.64</td>
<td>2.62</td>
<td>7.30</td>
<td>6.89</td>
</tr>
<tr>
<td>IPC - 4x2</td>
<td>1.71</td>
<td>3.89</td>
<td>4.18</td>
<td>3.48</td>
<td>1.98</td>
<td>6.60</td>
<td>2.00</td>
<td>2.15</td>
<td>4.01</td>
<td>4.80</td>
</tr>
<tr>
<td>ΔIPC - 8x4 (%)</td>
<td>1.6</td>
<td>-2.4</td>
<td>-6.1</td>
<td>-0.9</td>
<td>-7.3</td>
<td>-2.2</td>
<td>-8.35</td>
<td>-0.37</td>
<td>-5.3</td>
<td>-5.1</td>
</tr>
<tr>
<td>ΔIPC - 4x4 (%)</td>
<td>-1.6</td>
<td>-11.2</td>
<td>-5.6</td>
<td>-8.7</td>
<td>-1.4</td>
<td>-18.9</td>
<td>-10.2</td>
<td>-6.5</td>
<td>-11.2</td>
<td>-18.4</td>
</tr>
<tr>
<td>ΔIPC - 4x2 (%)</td>
<td>-7.6</td>
<td>-10.8</td>
<td>-8.9</td>
<td>-17.5</td>
<td>-13.1</td>
<td>-8.0</td>
<td>-18.5</td>
<td>-17.7</td>
<td>-14.9</td>
<td>-27.7</td>
</tr>
</tbody>
</table>

Table 4: Sensitivity to issue width and topology


