Quarterly Status Report

Performance Modeling

An Environment For End-to-End Performance Design of

Large-Scale Parallel Adaptive Computer/Communications Systems

for the period May 1st, 1999 to July 31st, 1999,

Contract N66001-97-C-8533

CDRL A001

 

1.0 Purpose of Report

This status report is the quarterly contract deliverable (CDRL A001), which summarizes the effort expended by the University of Texas, Austin team in support of Performance Modeling on Contract N66001-97-C-8533.

2.0 Project Members

University of Texas (prime): 1,065 hours

Sub-contractor (Purdue): 80 hours

Sub-contractor (UT-El Paso): 936 hours

Sub-contractor (UCLA): 456 hours

Sub-contractor (Rice): 433 hours

Sub-contractor (Wisconsin): 0 hours

Sub-contractor (Los Alamos): 0 hours

3.0 Project Description (last modified 07/97)

3.1 Objective

The goals of this project are: (1) to develop a comprehensive environment (POEMS) for end-to-end performance analysis of large, heterogeneous, adaptive, parallel/distributed computer and communication systems, and (2) to demonstrate the use of the environment in analyzing and improving the performance of defense-critical parallel and distributed systems.

3.2 Approach

The POEMS project combines innovations from a number of domains (communication, data mediation, parallel programming, performance modeling, software engineering, and CAD/CAE) to realize these goals. First, we will develop a specification language based on a general model of parallel computation, with specializations for representing workload, hardware, and software. To enable direct use of programs as workload specifications, compilation environments such as dHPF will be adapted to generate executable models of parallel computation at specified levels of abstraction.

Second, we will experimentally and incrementally develop and validate scalable models. This will involve using multi-scale models, multi-paradigm models, and parallel model execution in complementary ways. Multi-scale models will allow different components of a system to be modeled at varying levels of detail via adaptive module interfaces, supported by the specification language. Multi-paradigm models will allow an analyst to use the modeling paradigm (analytical, simulation, or the software or hardware system itself) that is most appropriate to the goals of the performance study. Integration of an associative model of communication with data mediation methods to provide adaptive component interfaces will allow us to compose disparate models in a common modeling framework. To handle computationally expensive simulations of critical subsystems in a complex system, we will incorporate parallel simulation technology based on the Maisie language.

Third, we will provide a library of models, at multiple levels of granularity, for modeling scalable systems like those envisaged under the DOE ASCI program, and for modeling complex adaptive systems like those envisaged under the GloMo and Quorum programs.

Finally, we will provide a knowledge base of performance data that can be used to predict the performance properties of standard algorithms as a function of architectural characteristics.

4.0 Performance Against Plan

4.1 Spending - Spending has caught up with the plan. All of the subcontracts except LANL's are in place. After this quarter, the spending rate for the project will run at about the planned rate.

4.2 Task Completion - A summary of the completion status of each task in the SOW is given below. Because several participants are involved in most tasks, the completion estimates for tasks in progress carry some uncertainty. Assessments of task completion by the participating institutions are given in their individual progress reports.

Task 1 - 95% Complete - Methodology development is an iterative process: one develops a version of the methodology, applies it, and revises it according to the success attained in the application. Evaluation of the methodology is in progress with the analysis of the performance of Sweep3D on the SP2 family of architectures. Closure will come with the completion of Task 7, when validation of the methodology on the first end-to-end performance model has been completed.

Task 2 - Complete

Task 3 - 95% Complete - Specification languages for all three domains have been proposed and are in various states of completion.

Task 4 - 85% Complete - Task graphs can now be developed for most HPF programs and work on MPI programs is well underway.

Task 5 - 85% Complete - The compiler for the specification language is well into development. Use of the compilation methods developed for the CODE parallel programming system at UT-Austin has accelerated this task.

Task 6 - 65% Complete - The initial library of components has been specified and instantiation has begun. (See the progress reports from UTEP and Wisconsin for details.)

Task 7 - 50% Complete - Subtask or Phase 1 of this task is about 50% complete. (See the progress reports from UCLA and Wisconsin for details.)

Task 8 - 55% Complete

Task 9 - Task 9 has been partitioned into seven subtasks. Subtasks 9.1, 9.2, and 9.3 are complete. Subtask 9.4 is 50% complete, Subtask 9.5 is 35% complete, and Subtask 9.6 is 20% complete. Subtask 9.7 was initiated this quarter.

Task 10 - 0% Complete

Task 11 - 0% Complete

5.0 Major Accomplishments to Date

    1. Project Management

       a. Long-Term Workplan

       POEMS has generated the framework for end-to-end performance modeling and has developed initial versions of several major components. This year has been designated the "Year of Integration." The long-term goal for this year (1999/2000) is integration of the POEMS components into the framework. This will enable POEMS to spend the bulk of the third year of the project applying the environment to further example systems.

    2. Technical Accomplishments

a. Knowledge Base

* Pythia system made ready to accept data from performance testing and modeling.

* Started work on Task 9.7.

* Ifestos methodology validated on data from two case studies previously published and analyzed by a completely different methodology.

b. Tool Interfacing and Integration

Integration of Compiler-Generated Task Execution Times into MPI-Sim

Rice and UCLA have been collaborating to develop hybrid models that integrate compiler-derived descriptions of task execution times, expressed as analytical formulas combined with measurement, into the MPI-Sim simulator. For example, if a task consists of a loop with bounds from 0 to n-1, the task execution time is modeled as n * (measured execution time of the code segment inside the loop). This integration facilitates simulation of systems with thousands of processors and of the realistic problem sizes expected for such large systems. In the previous quarter we manually modified a few MPI programs to evaluate the benefits of this approach, with very promising results.
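As a simple illustration of this scaling rule, the following C sketch (hypothetical code, not taken from the project) estimates a task's time by measuring the loop body once on a small run and multiplying by the iteration count:

    #include <stdio.h>

    /* Hypothetical illustration of the hybrid timing rule described
       above: measure the loop body on a small run, then scale. */

    static double measured_body_time = 1.25e-6; /* s/iteration, small run */

    /* Analytical estimate for a task that is a loop from 0 to n-1. */
    double estimate_task_time(long n)
    {
        return (double)n * measured_body_time;
    }

    int main(void)
    {
        /* Scale the small-run measurement up to a large problem size. */
        printf("estimated time for n = 1e8 iterations: %g s\n",
               estimate_task_time(100000000L));
        return 0;
    }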

The MPI-Sim simulator was extended to accept the analytical descriptions of the task execution times and to incorporate them into the estimation of total execution time. Communication tasks are still simulated in detail by MPI-Sim.

The first step in developing the hybrid model is to derive a task graph from the application source code. The task graph serves two purposes: first, to expose the tasks so that their execution times can be measured; second, to support derivation of analytical models of task execution time.

Once the tasks are exposed, the application is run and the task execution times are measured. For the purpose of large-scale simulation, the measurements are performed on a small problem size and a small number of processors.
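The following C sketch suggests one minimal way such a task graph node could be represented; the type and field names are assumptions for illustration, not the dHPF compiler's actual data structures:

    #include <stdio.h>

    /* Hypothetical task graph node: one field per purpose named above
       (measured time vs. symbolic analytical model). */
    typedef enum { TASK_COMPUTE, TASK_SEND, TASK_RECV } task_kind;

    typedef struct task_node {
        int               id;         /* unique task identifier           */
        task_kind         kind;       /* computation or communication     */
        double            measured_s; /* time measured on a small run (s) */
        const char       *symbolic;   /* analytical model, e.g. "n*t"     */
        struct task_node *next;       /* precedence edge (kept single
                                         here to stay minimal)            */
    } task_node;

    int main(void)
    {
        task_node send = { 1, TASK_SEND,    3.0e-5, "o+(k-1)G+L+o", NULL  };
        task_node comp = { 0, TASK_COMPUTE, 1.2e-2, "n*t_body",     &send };

        for (task_node *t = &comp; t != NULL; t = t->next)
            printf("task %d: measured %g s, model %s\n",
                   t->id, t->measured_s, t->symbolic);
        return 0;
    }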

To generate these task graphs, the Rice dHPF compiler has been extended as follows:

(1) The compiler uses program slicing to identify those subsets of the computational tasks whose *results do not affect the performance* of the program; we call these "redundant computations" (redundant from the viewpoint of program performance).

(2) The compiler computes analytical (but symbolic) estimates for the execution time of these redundant computations.

(3) The compiler modifies the generated message-passing program to replace the redundant computations with calls to a special MPI-Sim function, passing in the symbolic performance estimate as a parameter (see the sketch following this list).

(4) The compiler also generates a second version of the message-passing program with instrumentation inserted to measure parameter values for the symbolic task performance estimates. This version is executed to measure these parameter values, which are then provided as a separate input to the simulator.

(These compiler extensions directly exploit information in the static task graph, which was developed as an artifact of Rice University's effort on Task 4 of this project.)
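The fragment below is a hypothetical sketch of the transformation in step (3). MPISIM_ElapseTime is an assumed name for the special MPI-Sim function, which this report does not name; here it is stubbed out so the sketch runs standalone:

    #include <stdio.h>

    /* Stand-in for the special MPI-Sim function of step (3); the real
       simulator would advance its clock by the supplied estimate. */
    static double simulated_clock = 0.0;
    static void MPISIM_ElapseTime(double s) { simulated_clock += s; }

    int main(void)
    {
        long   n      = 1000000L; /* measured parameter value (step 4) */
        double t_body = 2.0e-8;   /* per-iteration time, small run     */

        /* Original redundant computation (results do not affect
           performance):
               for (i = 0; i < n; i++) a[i] = b[i] * c[i];
           The compiler replaces it with one call carrying the symbolic
           estimate n * t_body: */
        MPISIM_ElapseTime((double)n * t_body);

        printf("simulated clock advanced to %g s\n", simulated_clock);
        return 0;
    }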

Integration of MPI-Sim and SimpleScalar

UTEP finalized the interface between MPI and SimpleScalar, tested it, and distributed a document describing this effort and the use of this tool; the document is entitled "SimpleScalar 3.0a Modifications to Run Under MPI".

The interfacing of MPI and SimpleScalar has made it possible to run an MPI program, in particular the MPI version of Sweep3D, on multiple instantiations of SimpleScalar that communicate using MPI commands. The referenced document can serve as a template for interfacing SimpleScalar to the POEMS platform.
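A minimal MPI skeleton of this pattern is sketched below; run_simulated_step is a hypothetical stand-in for driving one SimpleScalar instance and is not part of the SimpleScalar API (the actual integration is described in the referenced document):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical stand-in for advancing one simulator instance. */
    static double run_simulated_step(int rank) { return (double)rank; }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank drives one simulator instance... */
        double local = run_simulated_step(rank), sum = 0.0;

        /* ...and the instances exchange data via ordinary MPI calls. */
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("combined result from %d instances: %g\n", size, sum);
        MPI_Finalize();
        return 0;
    }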

Interfacing of Task Graph and POEMS Specification Language Models

Rice and UT-Austin are working to interface the static and dynamic task graphs generated by dHPF to the CODE environment, by mapping the task graphs to the POEMS Specification Language (PSL). The Rice investigators (Vikram Adve and Rizos Sakellariou) visited UT Austin for a day on May 17, 1999. The major results of the meeting were a resolution of the key technical issues to be faced in interfacing the two systems and a work plan to achieve this goal. Subsequently, Rizos Sakellariou provided UT Austin with a detailed example of a task graph for an example MPI program that illustrates the features required to map to PSL.

c. Methodology Definition

The WOSP and TSE papers define and illustrate the methodology.

d. Specification Languages

A complete example in the POEMS Specification Language was given in the TSE paper.

e. Model Development and Validation

Hardware Domain Component Library

performance counters. If accurate, these counters could be used for modeling purposes.

Sweep3D Models

6.0 Artifacts Developed

Artifacts include technical papers, software and models.

    1. Technical Papers

a. "Compiler-Supported Simulation of Very Large Parallel Applications,"

Vikram Adve, Rajive Bagrodia, Ewa Deelman, Thomas Phan, and

Rizos Sakellariou, To appear in the Proceedings of the ACM/IEEE SC99

Conference on High Performance Networking and Computing."

b. "Analytic Evaluation of Shared Memory Architectures with Heterogeneous Applications" D. Eager, D. Sorin and M. Vernon. (Submitted to HPCA 2000.

7.0 Issues

    1. Open Issues with no Plan for Resolution

       None

    2. Open Issues with Plan for Resolution

       a. Integration of Performance Data into the Knowledge System

       How can the insertion of performance data into the knowledge system be automated (at least partially)? This should not be a formidable technical problem, but there are many details to be coordinated among the project participants.

       b. Integration of the PARSEC Runtime into MPI-Sim (UCLA)

       We are considering porting the MPI-Sim simulator to the PARSEC runtime simulation system. This will allow us to utilize the latest synchronization algorithms present in PARSEC.

       c. Integration of LogGP with AMVA

       Integration of the LogGP specification of application synchronization structure with the AMVA model of the Origin 2000-like memory system will enable more precise end-to-end modeling of the Splash benchmarks.
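For background, LogGP characterizes communication by latency L, per-message overhead o, gap g, per-byte gap G, and processor count P. The C sketch below evaluates the standard LogGP time for a k-byte message; the parameter values are placeholders, not measured project data:

    #include <stdio.h>

    /* Standard LogGP parameters (placeholder values, not measurements). */
    typedef struct {
        double L;  /* network latency (s)               */
        double o;  /* per-message send/receive overhead */
        double g;  /* gap between consecutive messages  */
        double G;  /* gap per byte for long messages    */
        int    P;  /* number of processors              */
    } loggp;

    /* Standard LogGP end-to-end time for one k-byte message:
       send overhead + per-byte gap + latency + receive overhead. */
    double loggp_msg_time(const loggp *m, long k)
    {
        return m->o + (double)(k - 1) * m->G + m->L + m->o;
    }

    int main(void)
    {
        loggp m = { 5e-6, 2e-6, 4e-6, 1e-8, 64 };
        printf("1 KB message: %g s\n", loggp_msg_time(&m, 1024L));
        return 0;
    }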

    3. Issues Resolved

None

8.0 Near-term Plan

"Near-term" refers to the next one or two quarters.

  a. Knowledge-Based System

linear algebra solvers.

  b. Models and Model Evaluation

which will address feedback from UCLA and LANL and will include the specification of the LLNL-SP/2 Power604e and a description of SimpleScalar.

of the block of work of Sweep3D.

performance counters.

  c. Interfacing and Integration

9.0 Completed Travel

Portland, Oregon. She is Student Volunteers Chair. This trip was not paid for by funds from this grant.

1999, Washington, DC. This trip was not paid for by funds from this grant.

UT-El Paso, Rice University, and UCLA

10.0 Equipment

None Acquired

11.0 Summary of Activity

11.1 Work Focus:

The two foci for continuing work for the 1999/2000 year are integration of tools and component model library development.

  a. Knowledge Base

  b. Integration and Interfacing

  c. Models and Model Evaluation

in particular, implementation of the processor/memory subsystem component model/simulator of the LLNL-SP/2 Power604e and the SGI O2K MIPS R10000.

performance analysis.

11.2 Significant Events

literature.

from group into the Ifestos framework

May 4-6, 1999. The titles of the two papers were "Performance Prediction of Large Parallel Applications Using Parallel Simulations" and "Predictive Analysis of a Wavefront Application Using LogGP."


FINANCIAL INFORMATION:

Contract #: N66001-97-C-8533

Contract Period of Performance: 7/24/97-7/23/00

Ceiling Value: $1,839,517

Reporting Period: 5/1/99-7/31/99

Actual Vouchered (all costs to be reported as fully burdened; do not report overhead, G&A, and fee separately):

Current Period

                                  Hours        Cost ($)
Prime Contractor
  Labor                           1,065       44,440.25
  ODCs                                        35,821.90
Sub-contractor 1 (Purdue)            80       12,414.90
Sub-contractor 2 (UT-El Paso)       936       20,086.47
Sub-contractor 3 (UCLA)             456       23,260.53
Sub-contractor 4 (Rice)             433       22,533.81
Sub-contractor 5 (Wisconsin)          0          216.07
Sub-contractor 6 (Los Alamos)         0            0.00
TOTAL                             2,970      158,773.93

Cumulative to date:

                                  Hours        Cost ($)
Prime Contractor
  Labor                           8,225      254,165.26
  ODCs                                       311,962.02
Sub-contractor 1 (Purdue)         1,208       95,232.81
Sub-contractor 2 (UT-El Paso)     3,346      148,415.06
Sub-contractor 3 (UCLA)           2,676      135,309.47
Sub-contractor 4 (Rice)           2,848      164,482.46
Sub-contractor 5 (Wisconsin)      1,658       86,058.40
Sub-contractor 6 (Los Alamos)         0            0.00
TOTAL                            19,961    1,195,625.48