12/22/1997 08:56 6195533348 NRAD D4101 PAGE 13

Quarterly Status Report

Performance Modeling

An Environment For End-to-End Performance Design of

Large Scale parallel Adaptive Computer/Communications Systems

for the period 25 January 1998 -April 24th, 1998

Contract N66001-97-C-8533

CDRL A001

1.0 Purpose of Report

This status report is the quarterly contract deliverable (CDRL A001),which summarizes the effort expended by the University of Texas, Austin team in support of Performance Modeling on Contract N66001-97-C-8533.

2. Project Members

University of Texas, spent: 960 hours

sub-contractor (Purdue), spent: 295 hours

sub-contractor (UT-El Paso), spent: 0 hours

sub-contractor (UCLA), spent: 0 hours

sub-contractor (Rice), spent: 0 hours

sub-contractor (Wisconsin), spent: 0 hours

sub-contractor (Los Alamos), spent: 0 hours

3.0 Project Description (last modified 07/97)

3.1 Objective

The goals of this project are: (1) to develop a comprehensive environment (POEMS) for end-to-end performance analysis of large, heterogeneous, adaptive, parallel/distributed computer and communication systems, and (2) to demonstrate the use of the environment in analyzing and improving the performance of defense-critical scaleable systems.

3.2 Approach

The project will combine innovations from a number of domains (communication, data mediation, parallel programming, performance modeling, software engineering, and CAD/CAE) to realize the goals. First, we will develop a specification language based on a general model of parallel computation with specializations to representation of workload, hardware and software. To enable direct use of programs as workload specifications, compilation environments such as dHPF will be adapted to generate executable models of parallel computation at specified levels of abstraction.

Second, we will experimentally and incrementally develop and validate scaleable models. This will involve using multi-scale models, multi-paradigm models, and parallel model execution in complementary ways. Multi-scale models will allow different components of a system to be modeled at varying levels of detail via the use of adaptive module interfaces, supported by the specification language. Multi-paradigm models will allow an analyst to use the modeling paradigm -- analytical, simulation, or the software or hardware system itself -- that is most appropriate with respect to the goals of the performance study. Integration of an associative model of communication with data mediation methods to provide adaptive component interfaces will allow us to compose disparate models in a common modeling framework. To handle computationally expensive simulations of critical subsystems in a complex system, we will incorporate parallel simulation technology based on the Maisie language.

Third, we will provide a library of models, at multiple levels of granularity, for modeling scaleable systems like those envisaged under the DOE ASCI program, and for modeling complex adaptive systems like those envisaged under the GloMo and Quorum

programs.

Finally, we will provide a knowledge base of performance data that can be used to predict the performance properties of standard algorithms as a function of architectural characteristics.

4.0 Performance Against Plan

4.1 Spending - Spending was under plan during the quarter October 25th, 1997 - January 24, 1998 because of delays in completing processing of subcontracts. The subcontract with the University of Wisconsin was still not in force as of January 25th, 1998. Some spending for the quarter ending January 25th, 1998 will appear in the quarter ending April 24th, 1998.due to these delays. After the quarter ending April 24th, 1998 spending will begin to catch up to plan.

4.2 Task Completion - A summary of the completion status of each task in the SOW is given following. Because several participants are involved in most tasks the assessment of completion for tasks in progress have some uncertainty in the estimates of completion. Assessments of task completions by participating institutions are given in the progress reports from each institution.

Task 1 - 65% Complete - Methodology development is an iterative process. One develops a version of the methodology, applies it and revises the methodology according to the success attained In the application. A draft specification of the metodology has been prepared and portions applied and evaluated. Closure will come with completion of Task 7 when validation of the methodology on the first end-to-end performance model has been completed.

Task 2 - Complete

Task 3 - 85% Complete - Specification languages for all three domains have been proposed and are in various states of completion.

Task 4 - 0% Complete - This task has not been started because of the delay in hiring of the postdoctoral fellow for Rice who will execute this task.

Task 5 - 50% Complete - The compiler for the specification language is well into development. Use of the compilation methods developed for the CODE parallel programming system at UT-Austin has accelerated this task.

Task 6 - 35% Complete - The initial library of components has been specified and instantiation has begun. (See the progress reports from UTEP and Wisconsin for details.)

Task 7 - 20% Complete - Subtask or Phase 1 of this task is about 30% complete. (See the progress reports from UCLA and Wisconsin for details.)

Task 8 - 15% Complete - This task has been carried only through conceptual design.

Task 9 — Task 9 has been partitioned into two subtasks. The subtasks are defined in the Project Plan. Subtask 9.1 is 65% complete and Subtask 9.2 is 45% complete.

Task 10 - 0% Complete

Task 11 - 0% Complete

5.0 Major Accomplishments to Date

5.1 Project Management

a) Long Term Workplan

The complete plan and schedule of accomplishment for the POEMS project has been completed.

5.2 Technical Accomplishments

a) The initial version of the performance data base was brought up. (Purdue)

b) The representation for specification of application level components was completed. (Rice)

c) Validation of the MPI-SIM model of Sweep3D on the IBM SP2 was completed. (See the report from UCLA for details.)

d) A new draft of the specifications for the hardware components was completed. (UTEP)

e) A LoPC application level model of Sweep3D was developed. (Wisconsin)

6.0 Artifacts Developed During the Past Quarter

a) Technical Paper - An overview paper on the POEMS project was prepared and submitted to the Workshop on Software Perfomance Evaluation which will be held in Santa Fe, New Mexico in October, 1998. This paper, "Poems: End-to-end Performance Design of Large Parallel Adaptive Computational Systems" can be found on the POEMS project web page:

http://www.cs.utexas.edu/users/poems.

7.0 Issues
7.1 Open issues with no plan, as yet, for resolution:

a) Derivation of parameters for characterization of analytically soluble memory models which will enable analysis and prediction of the impacts of memory hierarchies on processor performance. This remains as an open issue from the last quarterly progress report.

b) Resolution of performance differences between Fortran and C versions of the same program.

7.2 Open issues with plan for resolution:

a) Degree to which AMVA models can capture details of component behavior. This remains as an open issue from the last quarterly progress report.

b) Extension of execution driven simulation for MPI programs to cluster architectures such as the newest SP2s.

c) Interfaces across operating system and hardware modeling domains.

d) Experiment design for completing the experimental evaluation of Sweep3D. A clean definition of experiment goals remains to be specified.

7.3 Issues resolved:

a) Interfacing of component models which utilize analytic methods of evaluation with component models which are evaluated by discrete event simulation.

b) Scalability of Sweep3d application with respect number of processors on the IBM SP2. Architecture

More detail on each of these issues and the resolutions obtained can be found in the reports from the participating institutions which are appended.

8.0 Near-term Plan

The goals and status of the near-term plan (ending on August 31^st, 1998) have been reassessed with the completion of the project plan. Division of tasks 6 and 7 into subtasks gives a clearer picture of project status. The goals for August 31^st, 1998 will be largely met as shown in schedules in the full project plan. In particular, subdividing model development and model validation for the IBM SP2 and SGI Origin enables well-defined and unambiguous goals for August 31^st, 1998 completion.

9.0 Completed Travel

a) Project Review Meeting — All participants attended the March 6, 1998 project review in Austin. A project meeting was held on March 7, 1998 following the project review.

b) SES/Workbench Class - Dr. Oliver attended an SES Workbench class in Austin, Texas. Workbench is one of the possible POEMS demonstration platforms.

c) IPCCC '98 - Drs. Oliver and Teller (UT-El Paso) attended the 1988 IEEE International Performance, Computing, and Communications Conference (IPCCC '98) at which Dr. Oliver presented the following two papers that are published in the conference proceedings -- the acceptance of these papers was reported in the first quarterly report:

"Accurate Measurement of System Call Service Times for Trace-Driven Simulation of Memory Hierarchy Designs" by Richard Oliver, Ward P. McGregor, and Patricia J. Teller

"Generating Dynamically Scheduled Memory Address Traces" by Richard Oliver and Patricia J. Teller.

10.0 Equipment and Description

While no equipment was purchased or acquired directly by the POEMS project the facilities of the POEMS group at UTEP were substantially upgraded during this quarter.

At UT-El Paso a considerable amount of effort during the first and second

quarters resulted in an upgrade of research lab and computing facilities that

were put in place during the third quarter.

The Systems and Software Engineering Affinity Lab (SSEAL) became functional

during this reporting period. The lab includes office space and

computer facilities (including a SGI Origin 2000 multiprocessor system, a

RAID array, an automatic back-up system, an RS6000 workstation, several

SGI workstations, several PCs, and several SUN workstations (on order) for

students working on the POEMS project as well as other projects. In response

to a proposal of Drs. Teller and Gates (a colleague in the Department of

Computer Science) the VPAA of UTEP funded the renovation and furnishing of

the lab, which was allocated as part of a cost-share associated with

an NSF MII grant.

With the help of the Dean of the Engineering College and Intel Corporation,

the computer facilities for Drs. Teller and Oliver include two SUN Ultra1s,

two HP printers, a Pentium-Pro workstation, and a laptop (on order).

The above facilities do not include those supplied by the department.

11.0 Summary of Activity

11.1 Work Focus:

The main foci for collective activities for January 25^th, 1998 to April 24^th, 1998 have been:

a) Preparation of the paper for the Workshop on Software Performance.

b) Development of the complete project plan for the POEMS project.

c) Validation of the Sweep3D application on the IBM SP2.

d) Development of the Specification Language and the Methodology.

Each participating institution has been working on their responsibilities for tasks 1, 2, 3, 4, 6 and 7.

11.2 Significant Events:

a) Application Specification — First drafts of all of the elements of the specification language for the application level have been completed.

b) Initiation of Task 4 — The arrival of Dr. Rizos Sakellariou at Rice has enabled initiation of work on Task 4.

c) The Task Graph model has been extended to cover message passing system.

d) Completion of initial scalability studies of Sweep3D on the IBM SP2. Sweep3D scales well until about 256 processors. The rate of speedup declines until about 1024 processors. There is no improvement in performance beyond 1024 processors.

e) Simulators — The Armadillo PowerPC instruction level simulator including ILP has been brought into the project in addition to the RSIM R10000 simulator from Rice.

FINANCIAL INFORMATION:

Contract #: N66001-97-C-8533

Contract Period of Performance: 7/24/97-7/23/00

Ceiling Value: $1,832,417

Reporting Period: 10/25/97 — 01/24/98

Actual Vouchered (all costs to be reported as fully burdened, do not report

overhead, GA and fee separately):
Current Period

Prime Contractor Hours Cost

Labor 960 18,488.00

ODC's 32,165.16

Sub-contractor 1 (Purdue) 276 21,309.80

Sub-contractor 2 (UT-El Paso) 0 0

Sub-contractor 3 (UCLA) 0 0

Sub-contractor 4 (Rice) 0 0

Sub-contractor 5 (Wisconsin) 0 0

Sub-contractor 6 (Los Alamos) 0 0

TOTAL: 1236 71,962.96

Cumulative to date:

Prime Contractor Hours Cost

Labor 1,760 46,862.00

ODC's 64,692.96

Sub-contractor 1 (Purdue) 276 21,309.80

Sub-contractor 2 (UT-El Paso) 0 0

Sub-contractor 3 (UCLA) 0 0

Sub-contractor 4 (Rice 0 0

Sub-contractor 5 (Wisconsin) 0 0

Sub-contractor 6 (Los Alamos) 0 0

TOTAL: 2,036 132,864.76

Contract #: N66001-97-C-8533

Contract Period of Performance: 7/24/97-7/23/00

Ceiling Value: $1,832,417

Notes:

There follows individual progress reports for each of the participating institutions.

Purdue University
Accomplishments

Our efforts this quarter has been concentrated on SubTasks 9.1 and SubTasks 9.2 to develop the performance data base for SP-2 needed to generate the rules of KB (1). A population of PDE applications has been determined and a set of PELLPACK solvers were applying for their simulation on SP-2. In addition a set of values of the associated parameters was determined so that the performance data to be generated will capture the behavior of the architecture and application. In addition a large set of performance were collected by utilizing the automatic facility of the PKDB system. The difficulty encountered was able to generate large meshes since the "foreign" tools used were failing for many of geometries (i.e. PDE domains) selected. This is the most tedious part of the effort, since in many instances we do not have complete control of the public codes we use. This effort will be continued in the next quarter together with the generation of the KB (1) rules. We have completed the first draft of two papers related to this effort that we hope to submit by the end of August.

Progress Summary And Plan

The following table summarizes the progress and plans for the work in Quarter 3 (Jan 25 — April 25).

Effort

SubTask 9.1: 65%

SubTask 9.2 45%

Finances

Hours Cost

sub-contractor (Purdue) 19 753.04

Current and Future Research Plans

We will continue working towards the completion of subtasks 9.1 and 9.2. In addition we hope to be ready to generate some KB (1) rules (subtask 9.3).

Rice University

Accomplishments

In the last quarter, we completed work on the design of the application representation (which was begun in the preceding quarter), and continued work on the task-graph-based analytical model for Sweep3D.

The design of the application representation is described in a document which was finalized in this quarter. The document describes the key components of the application representation and the attributes and operations of each of the components. This document, together with the corresponding documents for the operating system and hardware domains, should provide a basis for the development of the POEMS environment.

The task-graph based analytical model has been extended to model explicit message-passing programs (in addition to shared-memory programs which were the previous focus of the model). The model uses a standard approach, namely training-sets, to capture the costs of primitive communication operations on a given architecture and configuration, but explicitly models the behavior of higher-level communication patterns based on those primitives. The model is now able to provide preliminary performance predictions for the Sweep3D application, using the dynamic task graphs as input. We are continuing work on measuring model input parameters and validating the model for an IBM SP-2 system.

Current and Future Research Plan

The tasks planned for the current quarter are as follows:

1. We are continuing work on the analytical model described earlier.

The focus of work in this quarter is to complete the parameterization of the model for the Sweep3D application, and validate the model against measured performance on an IBM SP-2. This model will provide an application-level component model for POEMS, in particular, to model wavefront-based application components. Sweep3D is a specific 3-dimensional instance of this class.

2. We are initiating work on Task 4, namely, the compiler support for

generating task graphs for POEMS. As reported previously, Rizos Sakellariou has joined the POEMS effort at Rice, beginning on May 20. Our first goal is to develop a prototype of dHPF that is able to synthesize static task graphs for an High Performance Fortran (HPF) program parallelized using dHPF.

Completion level for each task at 6 months after project initiation --------------------------------------------------------------------

Planned Date for Nominal target Level achieved

Initiation/Completion level at 9 months at 9 months

Task 1 0/24 months 35% 35%

Task 3 0/ 9 months 100% 100%

Task 4 6/30 months 10% 0%

Task 6 3/18 months 40% 40%

Task 7 3/18 months 0% 0%

Task 8 9/18 months 0% 0%

At this point, we believe the Rice components of tasks 1, 3 and 6 are progressing on schedule. Our work on task 4 has just begun in the current quarter, since Rizos Sakellariou has joined the project. Finally, as reported earlier, we expect the primary work on Task 7 to begin in the second half of the planned period, namely in the current quarter.

UCLA

Accomplishments (By task in which UCLA is involved)

Task 6: Component Model and Validation

c) Models of parallel I/O systems

d) Models of high performance networks

We are adding support for simulation of clusters of SMPs, where each

node of a massively parallel machine is composed of several

processors sharing the same address space. The inter-node

communications are faster than off-node communications. In our

model, the data is copied out of the user space of the sender to a

shared buffer. That buffer is visible to the receiver, which copies

the data out of the buffer into its own user-space. The current

model does not include contention for the buffer. We assume, that

there is a shared buffer for each pair of communicating

processes. Since the number of processor on the same node is small,

and most MPI applications use a one-to-one process-to-processor

mapping, this assumption is reasonable. This heterogeneous aspect

of the architecture will be captured by MPI-SIM.

Future Goals:

1. Demonstrate the performance of the modeled SMP clusters on a

synthetic benchmark. It is hard to demonstrate the performance

difference on SWEEP3D, because that application is not sensitive to

latency variations for latencies close to that of an IBM SP.

2. We are still seeking applications that can stress the I/O

simulation subsystem.

Task 7: Validation of POEMS by End-to-End Performance Experiment.

MPI-SIM has been validated on the IBM SP. We have used MPI-SIM to

predict the performance of SWEEP3D as the number of processors is

increased. We have predicted the performance of SWEEP3D for three

problem sizes: 50, 100 and 150 cubed. A blocking factor of 10 was

used in the K dimension and 3 for the angles. For large problems,

the study showed that the application's performance scales well as

the number of processors is increased to almost 1600, although the

relative improvement in performance drops beyond 256 processors.

For the smaller problem size with $50^3$ elements, the performance

appears to peak at about 1024 processors and subsequently gets

worse.

Future Goals:

1. Port MPI-SIM and SWEEP3D to the SGI Origin. Validate the

simulated MPI version of SWEEP3D on the Origin.

2. Conduct further scalability experiments for SWEEP3D (both on the

IBM SP and the SGI Origin).

3. Characterize the performance of MPI-SIM simulating SWEEP3D in

terms of speedup, which demonstrates the parallel performance of the

simulator; and slowdown, which shows the overhead that the simulator

incurs.

University of Texas at Austin

5 . Accomplishments (by task)

5.1 January 25^th, 1998 — April 24^th, 1998

Task 1 — Integration of associative objects into the data flow model of parallel computation was completed. Proposal for interface between simulation-evaluated components and analytically evaluated components has been drafted and circulated.

Completion (of Texas responsibilities) - 85%

Task 2 — Complete — Report Submitted

Completion (of Texas Responsibility) — 100%

Task 3 — First Draft of Specification Language Definition and Users Manual about 60% completed.

Completion of Texas Responsibility — 60%

Task 5 — Compiler for specification language has been begun. Initial manual compilation of simple examples to use as test cases has been completed.

Completion (of Texas responsibility) — 50%

5.2 Current and Future Research

Task 1 — Integration of methodology documents on application, operating system/runtime system and hardware component levels. Specification of the process implementing the elements of the methodology. Completion of document drafting.

Task 3 — Completion of integration of the sections on applications, operating/runtime systems and hardware components. Specification of inheritance and composition in terms of associative objects. Completion of the manuals.

Task 5 — Version 1 of the compiler for the specification language to be completed in the third and fourth quarters.

6. Artifacts — none

7.0 Issues
7.1 Open issues with no plan, as yet, for resolution:

none

7.2 Open issues with plan for resolution:

none

7.3 Issues resolved:

none

9. Travel Completed

none

10. Equipment

none

11. Significant Events — none

Artifacts — none

University of Texas at El Paso

Accomplishments (By task in which UTEP is involved)

4. Performance Against Plan

____________________________

Task 1

______

Methodology Definition

______________________

Domain interfaces

_________________

Discussion and feedback were contributed w.r.t. the second version of a

document that describes the proposed design of the application domain,

which was completed this quarter. In particular, UTEP contributed

w.r.t. defining the interfaces between the application domain and

the operating system/runtime and hardware domains. Subsequent related

discussions with RICE were focused on implementation details. These

discussions will result in an addendum to the document, which will

be completed during the fourth quarter.

Task 6

______

Component and Models for the Hardware Domain

____________________________________________

A second draft of a the POEMS hardware domain component model specification

was provided to the POEMS researchers. This version included the definition

of the transport that interfaces with the interconnection network and the

concept of task execution descriptions. The document is in the process of

being revised yet again. The next version will include more detailed

specification of processor and memory internal communication, as well as

missing specifications.

RSIM, a MIPS R10000 simulator, was ported to the UTEP-POEMS environment,

running on a Sun Ultra1. During the next quarter sweep.single will drive

the R10000 simulation. The capabilities of RSIM are under study.

The coherence protocol of the SGI Origin 2K is under study.

Simulation engines

__________________

Investigation of the potential differences between performance data

generated by different types of instruction-level processor simulators

resulted in a commitment by POEMS to support the simulation of

processors with instruction-level parallelism (ILP). This decision

led to the definition of two new tasks for UTEP: (1) a qualitative examination

of existing simulators, and (2) the porting and evaluation of the capabilities

and design of RSIM from Rice University and SIMOS (for MIPS R10000/SGI Origin

2000 simulation) and Armadillo from UT-Austin (for PowerPC/IBM SP/2 simulation).The first task is on-going. With respect to the second task, RSIM

has been ported to the UTEP-POEMS computing environment and evaluation

is underway. The evaluation of these simulation engines will determine the

difficulties involved in interfacing them with separately defined memory

components.

IBM SP/2 Sweep3D task execution times

_____________________________________

The Sweep3D code is to be evaluated at the task level (corresponding to the

node level of a task graph). In support of this level of modeling, the UTEP

Poets are in the process of developing a methodology to accurately measure

the execution times for each of the tasks executed during a run of Sweep3D.

Within the parallel Sweep3D code, a task corresponds to a defined "block of

work" which is the computational component between communications points.

This "block of work" is the code approximately between lines 322 and 512 of

the file sweep.f. It can also be described as the "idiag" loop of Sweep3D.

As per discussions with LANL the following set of parameters have been selected

to drive the experiments. The processor decomposition will be varied as (2x2,

3x3, 4x4, ...) up to the limits that the host machine (or time) impose. The

grid size will be varied as (5x5, 10x10, 15x15, ...) also up to the limits

that the host machine (or time) impose. The k-block size will be varied from

1 to n/2 for each size of the grid. This definition of the experimental space

is being verified with LANL.

Issues

_________

Open issues with plan for resolution:

_____________________________________

Sweep3D processor configurations and grid sizes to study

________________________________________________________

In the process of designing a methodology for measuring Sweep3D execution

time for various problem sizes on IBM SP/2 and SGI Origin 2000 multiprocessors,

the definition of the experimental space was raised. As per discussions with

LANL the following set of parameters have been selected to drive the

experiments. The processor decomposition will be varied as (2x2, 3x3, 4x4, ...)

up to the limits that the host machine (or time) impose. The grid size will be

varied as (5x5, 10x10, 15x15, ...) also up to the limits that the host machine

(or time) impose. The k-block size will be varied from 1 to n/2 for each size

of the grid. This definition of the experimental space will be finalized with

the help of LANL.

Interfaces among domains

________________________

There is need for a definition of how a processor/memory subsystem

will interface with the interconnection network and how the application

domain and operating/system domain will interface with a processor/memory

subsystem. These implementation issues have been discussed and that

discussion will appear in an addendum to the document that describes

the proposed design of the application domain.

C vs. Fortran

_____________

Currently different POEMS members are working with different versions

of Sweep3D, namely a Fortran version and a C version. Comparison of

performance results that are based on these two different versions can

be questionable since preliminary examinations of these codes show

significant differences.

Completed Travel

____________________

Dr. Oliver attended an SES Workbench class in Austin, Texas. Workbench

is one of the possible POEMS demonstration platforms.

Drs. Oliver and Teller (UT-El Paso) attended the POEMS meeting at UT-Austin.

Drs. Oliver and Teller (UT-El Paso) attended the 1988 IEEE International

Performance, Computing, and Communications Conference (IPCCC '98) at which

Dr. Oliver presented the following two papers that are published in the

conference proceedings -- the acceptance of these papers was reported in the

first quarterly report:

"Accurate Measurement of System Call Service Times for Trace-Driven

Simulation of Memory Hierarchy Designs" by Richard Oliver, Ward P. McGregor,

and Patricia J. Teller

"Generating Dynamically Scheduled Memory Address Traces" by Richard Oliver

and Patricia J. Teller.

Equipment and Description

______________________________