Quarterly Status Report

Performance Modeling

An Environment For End-to-End Performance Design of

Large-Scale Parallel Adaptive Computer/Communications Systems

for the period May 1st, 1999 to July 31st, 1999,

Contract N66001-97-C-8533

CDRL A001

 

1.0 Purpose of Report

This status report is the quarterly contract deliverable (CDRL A001), which summarizes the effort expended by the University of Texas, Austin team in support of Performance Modeling on Contract N66001-97-C-8533.

2.0 Project Members

University of Texas (prime): 1,065 hours

Sub-contractor (Purdue): 80 hours

Sub-contractor (UT-El Paso): 936 hours

Sub-contractor (UCLA): 456 hours

Sub-contractor (Rice): 433 hours

Sub-contractor (Wisconsin): 0 hours

Sub-contractor (Los Alamos): 0 hours

3.0 Project Description (last modified 07/97)

3.1 Objective

The goals of this project are: (1) to develop a comprehensive environment (POEMS) for end-to-end performance analysis of large, heterogeneous, adaptive, parallel/distributed computer and communication systems, and (2) to demonstrate the use of the environment in analyzing and improving the performance of defense-critical parallel and distributed systems.

3.2 Approach

The POEMS project combines innovations from a number of domains (communication, data mediation, parallel programming, performance modeling, software engineering, and CAD/CAE) to realize these goals. First, we will develop a specification language based on a general model of parallel computation, with specializations for representing workload, hardware, and software. To enable direct use of programs as workload specifications, compilation environments such as dHPF will be adapted to generate executable models of parallel computation at specified levels of abstraction.

Second, we will experimentally and incrementally develop and validate scalable models. This will involve using multi-scale models, multi-paradigm models, and parallel model execution in complementary ways. Multi-scale models will allow different components of a system to be modeled at varying levels of detail via adaptive module interfaces, supported by the specification language. Multi-paradigm models will allow an analyst to use the modeling paradigm (analytical, simulation, or the software or hardware system itself) that is most appropriate to the goals of the performance study. Integration of an associative model of communication with data mediation methods to provide adaptive component interfaces will allow us to compose disparate models in a common modeling framework. To handle computationally expensive simulations of critical subsystems in a complex system, we will incorporate parallel simulation technology based on the Maisie language.

Third, we will provide a library of models, at multiple levels of granularity, for modeling scalable systems like those envisaged under the DOE ASCI program, and for modeling complex adaptive systems like those envisaged under the GloMo and Quorum programs.

Finally, we will provide a knowledge base of performance data that can be used to predict the performance properties of standard algorithms as a function of architectural characteristics.

4.0 Performance Against Plan

4.1 Spending - Spending has caught up with the plan. All of the subcontracts except LANL's are in place. After this quarter, the spending rate for the project will run at about the planned rate.

4.2 Task Completion - A summary of the completion status of each task in the SOW is given below. Because several participants are involved in most tasks, the completion estimates for tasks in progress carry some uncertainty. Assessments of task completion by the participating institutions are given in their individual progress reports.

Task 1 - 95% Complete - Methodology development is an iterative process: one develops a version of the methodology, applies it, and revises it according to the success attained in the application. Evaluation of the methodology is in progress with the analysis of the performance of Sweep3D on the SP2 family of architectures. Closure will come with the completion of Task 7, when validation of the methodology on the first end-to-end performance model has been completed.

Task 2 - Complete

Task 3 - 95% Complete - Specification languages for all three domains have been proposed and are in various states of completion.

Task 4 - 85% Complete - Task graphs can now be developed for most HPF programs and work on MPI programs is well underway.

Task 5 - 85% Complete - The compiler for the specification language is well into development. Use of the compilation methods developed for the CODE parallel programming system at UT-Austin has accelerated this task.

Task 6 - 65% Complete - The initial library of components has been specified and instantiation has begun. (See the progress reports from UTEP and Wisconsin for details.)

Task 7 - 50% Complete - Subtask or Phase 1 of this task is about 50% complete. (See the progress reports from UCLA and Wisconsin for details.)

Task 8 - 55% Complete

Task 9 - Task 9 has been partitioned into seven subtasks. Subtasks 9.1, 9.2, and 9.3 are complete. Subtask 9.4 is 50% complete, Subtask 9.5 is 35% complete, and Subtask 9.6 is 20% complete. Subtask 9.7 was initiated this quarter.

Task 10 - 0% Complete

Task 11 - 0% Complete

5.0 Major Accomplishments to Date

    1. Project Management

       a. Long-Term Workplan

       POEMS has generated the framework for end-to-end performance modeling and has developed initial versions of several major components. This year has been designated the "Year of Integration." The long-term goal for this year (1999/2000) is integration of the POEMS components into the framework. This will enable POEMS to spend the bulk of the third year of the project applying the environment to further example systems.

    2. Technical Accomplishments

a. Knowledge Base

* Pythia system made ready to accept data from performance testing and modeling.

* Started work on Task 9.7.

* Ifestos methodology validated on data from two case studies previously published and analyzed by a completely different methodology.

b. Tool Interfacing and Integration

Integration of Compiler-Generated Task Execution Times into MPI-Sim

Rice and UCLA have been collaborating to develop hybrid models that integrate compiler-derived descriptions of task execution times, expressed as analytical formulas combined with measurement, into the MPI-Sim simulator. For example, if a task consists of a loop with bounds from 0 to n-1, the task execution time is modeled as n * (measured execution time of the code segment inside the loop). This integration facilitates simulation of systems with thousands of processors and of the realistic problem sizes expected for such large systems. In the previous quarter we manually modified a few MPI programs to evaluate the benefits of this approach, with very promising results.
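As a simple illustration of this scaling rule, the following C sketch (hypothetical code, not taken from the project) estimates a task's time by measuring the loop body once on a small run and multiplying by the iteration count:

    #include <stdio.h>

    /* Hypothetical illustration of the hybrid timing rule described
       above: measure the loop body on a small run, then scale. */

    static double measured_body_time = 1.25e-6; /* s/iteration, small run */

    /* Analytical estimate for a task that is a loop from 0 to n-1. */
    double estimate_task_time(long n)
    {
        return (double)n * measured_body_time;
    }

    int main(void)
    {
        /* Scale the small-run measurement up to a large problem size. */
        printf("estimated time for n = 1e8 iterations: %g s\n",
               estimate_task_time(100000000L));
        return 0;
    }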

The MPI-Sim simulator was extended to accept the analytical descriptions of the task execution times and to incorporate them into the estimation of total execution time. Communication tasks are still simulated in detail by MPI-Sim.

The first step in developing the hybrid model is to derive a task graph from the application source code. The task graph serves two purposes: first, to expose the tasks so that their execution times can be measured; second, to support derivation of analytical models of task execution time.

Once the tasks are exposed, the application is run and the task execution times are measured. For the purpose of large-scale simulation, the measurements are performed on a small problem size and a small number of processors.
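The following C sketch suggests one minimal way such a task graph node could be represented; the type and field names are assumptions for illustration, not the dHPF compiler's actual data structures:

    #include <stdio.h>

    /* Hypothetical task graph node: one field per purpose named above
       (measured time vs. symbolic analytical model). */
    typedef enum { TASK_COMPUTE, TASK_SEND, TASK_RECV } task_kind;

    typedef struct task_node {
        int               id;         /* unique task identifier           */
        task_kind         kind;       /* computation or communication     */
        double            measured_s; /* time measured on a small run (s) */
        const char       *symbolic;   /* analytical model, e.g. "n*t"     */
        struct task_node *next;       /* precedence edge (kept single
                                         here to stay minimal)            */
    } task_node;

    int main(void)
    {
        task_node send = { 1, TASK_SEND,    3.0e-5, "o+(k-1)G+L+o", NULL  };
        task_node comp = { 0, TASK_COMPUTE, 1.2e-2, "n*t_body",     &send };

        for (task_node *t = &comp; t != NULL; t = t->next)
            printf("task %d: measured %g s, model %s\n",
                   t->id, t->measured_s, t->symbolic);
        return 0;
    }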

To generate these task graphs, the Rice dHPF compiler has been extended as follows:

(1) The compiler uses program slicing to identify those subsets of the computational tasks whose *results do not affect the performance* of the program; we call these "redundant computations" (redundant from the viewpoint of program performance).

(2) The compiler computes analytical (but symbolic) estimates for the execution time of these redundant computations.

(3) The compiler modifies the generated message-passing program to replace the redundant computations with calls to a special MPI-Sim function, passing in the symbolic performance estimate as a parameter (see the sketch following this list).

(4) The compiler also generates a second version of the message-passing program with instrumentation inserted to measure parameter values for the symbolic task performance estimates. This version is executed to measure these parameter values, which are then provided as a separate input to the simulator.

(These compiler extensions directly exploit information in the static task graph, which was developed as an artifact of Rice University's effort on Task 4 of this project.)
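The fragment below is a hypothetical sketch of the transformation in step (3). MPISIM_ElapseTime is an assumed name for the special MPI-Sim function, which this report does not name; here it is stubbed out so the sketch runs standalone:

    #include <stdio.h>

    /* Stand-in for the special MPI-Sim function of step (3); the real
       simulator would advance its clock by the supplied estimate. */
    static double simulated_clock = 0.0;
    static void MPISIM_ElapseTime(double s) { simulated_clock += s; }

    int main(void)
    {
        long   n      = 1000000L; /* measured parameter value (step 4) */
        double t_body = 2.0e-8;   /* per-iteration time, small run     */

        /* Original redundant computation (results do not affect
           performance):
               for (i = 0; i < n; i++) a[i] = b[i] * c[i];
           The compiler replaces it with one call carrying the symbolic
           estimate n * t_body: */
        MPISIM_ElapseTime((double)n * t_body);

        printf("simulated clock advanced to %g s\n", simulated_clock);
        return 0;
    }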

Integration of MPI-Sim and SimpleScalar

UTEP finalized the interface between MPI and SimpleScalar, tested it, and distributed a document describing this effort and the use of this tool; the document is entitled "SimpleScalar 3.0a Modifications to Run Under MPI".

The interfacing of MPI and SimpleScalar has made it possible to run an MPI program, in particular the MPI version of Sweep3D, on multiple instantiations of SimpleScalar that communicate using MPI commands. The referenced document can serve as a template for interfacing SimpleScalar to the POEMS platform.
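A minimal MPI skeleton of this pattern is sketched below; run_simulated_step is a hypothetical stand-in for driving one SimpleScalar instance and is not part of the SimpleScalar API (the actual integration is described in the referenced document):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical stand-in for advancing one simulator instance. */
    static double run_simulated_step(int rank) { return (double)rank; }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank drives one simulator instance... */
        double local = run_simulated_step(rank), sum = 0.0;

        /* ...and the instances exchange data via ordinary MPI calls. */
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("combined result from %d instances: %g\n", size, sum);
        MPI_Finalize();
        return 0;
    }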

Interfacing of Task Graph and POEMS Specification Language Models

Rice and UT-Austin are working to interface the static and dynamic task graphs generated by dHPF to the CODE environment, by mapping the task graphs to the POEMS Specification Language (PSL). The Rice investigators (Vikram Adve and Rizos Sakellariou) visited UT Austin for a day on May 17, 1999. The major results of the meeting were a resolution of the key technical issues to be faced in interfacing the two systems and a work plan to achieve this goal. Subsequently, Rizos Sakellariou provided UT Austin with a detailed example of a task graph for an example MPI program that illustrates the features required to map to PSL.

c. Methodology Definition

The WOSP and TSE papers define and illustrate the methodology.

d. Specification Languages

A complete example in the POEMS Specification Language was given in the TSE paper.

e. Model Development and Validation

Hardware Domain Component Library

performance counters. If accurate, these counters could be used for modeling purposes.

Sweep3D Models

6.0 Artifacts Developed

Artifacts include technical papers, software and models.

    1. Technical Papers

a. "Compiler-Supported Simulation of Very Large Parallel Applications,"

Vikram Adve, Rajive Bagrodia, Ewa Deelman, Thomas Phan, and

Rizos Sakellariou, To appear in the Proceedings of the ACM/IEEE SC99

Conference on High Performance Networking and Computing."

b. "Analytic Evaluation of Shared Memory Architectures with Heterogeneous Applications" D. Eager, D. Sorin and M. Vernon. (Submitted to HPCA 2000.

7.0 Issues

    1. Open Issues with no Plan for Resolution

       None

    2. Open Issues with Plan for Resolution

       a. Integration of Performance Data into the Knowledge System

       How can the insertion of performance data into the knowledge system be automated (at least partially)? This should not be a formidable technical problem, but there are many details to be coordinated among the project participants.

       b. Integration of the PARSEC Runtime into MPI-Sim (UCLA)

       We are considering porting the MPI-Sim simulator to the PARSEC runtime simulation system. This will allow us to utilize the latest synchronization algorithms present in PARSEC.

       c. Integration of LogGP with AMVA

       Integration of the LogGP specification of application synchronization structure with the AMVA model of the Origin 2000-like memory system will enable more precise end-to-end modeling of the Splash benchmarks.
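For background, LogGP characterizes communication by latency L, per-message overhead o, gap g, per-byte gap G, and processor count P. The C sketch below evaluates the standard LogGP time for a k-byte message; the parameter values are placeholders, not measured project data:

    #include <stdio.h>

    /* Standard LogGP parameters (placeholder values, not measurements). */
    typedef struct {
        double L;  /* network latency (s)               */
        double o;  /* per-message send/receive overhead */
        double g;  /* gap between consecutive messages  */
        double G;  /* gap per byte for long messages    */
        int    P;  /* number of processors              */
    } loggp;

    /* Standard LogGP end-to-end time for one k-byte message:
       send overhead + per-byte gap + latency + receive overhead. */
    double loggp_msg_time(const loggp *m, long k)
    {
        return m->o + (double)(k - 1) * m->G + m->L + m->o;
    }

    int main(void)
    {
        loggp m = { 5e-6, 2e-6, 4e-6, 1e-8, 64 };
        printf("1 KB message: %g s\n", loggp_msg_time(&m, 1024L));
        return 0;
    }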

    3. Issues Resolved

None

8.0 Near-term Plan

"Near-term" refers to the next one or two quarters.

  a. Knowledge-Based System

linear algebra solvers.

  b. Models and Model Evaluation

which will address feedback from UCLA and LANL and will include the specification of the LLNL-SP/2 Power604e and a description of SimpleScalar.

of the block of work of Sweep3D.

performance counters.

  c. Interfacing and Integration

9.0 Completed Travel

Portland, Oregon. She is Student Volunteers Chair. This trip was not paid for by funds from this grant.

1999, Washington, DC. This trip was not paid for by funds from this grant.

UT-El Paso, Rice University, and UCLA

10.0 Equipment

None Acquired

11.0 Summary of Activity

11.1 Work Focus:

The two foci for continuing work for the 1999/2000 year are integration of tools and component model library development.

  a. Knowledge Base

  b. Integration and Interfacing

  c. Models and Model Evaluation

in particular, implementation of the processor/memory subsystem component model/simulator of the LLNL-SP/2 Power604e and the SGI O2K MIPS R10000.

performance analysis.

11.2 Significant Events

literature.

from group into the Ifestos framework

May 4-6, 1999. The titles of the two papers were "Performance Prediction of Large Parallel Applications Using Parallel Simulations" and "Predictive Analysis of a Wavefront Application Using LogGP."


FINANCIAL INFORMATION:

Contract #: N66001-97-C-8533

Contract Period of Performance: 7/24/97-7/23/00

Ceiling Value: $1,839,517

Reporting Period: 5/1/99-7/31/99

Actual Vouchered (all costs to be reported as fully burdened; do not report overhead, G&A, and fee separately):

Current Period

                                  Hours        Cost ($)
Prime Contractor
  Labor                           1,065       44,440.25
  ODCs                                        35,821.90
Sub-contractor 1 (Purdue)            80       12,414.90
Sub-contractor 2 (UT-El Paso)       936       20,086.47
Sub-contractor 3 (UCLA)             456       23,260.53
Sub-contractor 4 (Rice)             433       22,533.81
Sub-contractor 5 (Wisconsin)          0          216.07
Sub-contractor 6 (Los Alamos)         0            0.00
TOTAL                             2,970      158,773.93

Cumulative to date:

                                  Hours        Cost ($)
Prime Contractor
  Labor                           8,225      254,165.26
  ODCs                                       311,962.02
Sub-contractor 1 (Purdue)         1,208       95,232.81
Sub-contractor 2 (UT-El Paso)     3,346      148,415.06
Sub-contractor 3 (UCLA)           2,676      135,309.47
Sub-contractor 4 (Rice)           2,848      164,482.46
Sub-contractor 5 (Wisconsin)      1,658       86,058.40
Sub-contractor 6 (Los Alamos)         0            0.00
TOTAL                            19,961    1,195,625.48