Quarterly Status Report

Performance Modeling

An Environment For End-to-End Performance Design of

Large Scale Parallel Adaptive Computer/Communications Systems

for the period February 1st, 1999 to April 30th, 1999

Contract N66001-97-C-8533

CDRL A001

 

1.0 Purpose of Report

This status report is the quarterly contract deliverable (CDRL A001), which summarizes the effort expended by the University of Texas, Austin team in support of Performance Modeling on Contract N66001-97-C-8533.

2.0 Project Members

University of Texas at Austin (prime): 1,200 hours

Sub-contractor (Purdue): 128 hours

Sub-contractor (UT-El Paso): 472 hours

Sub-contractor (UCLA): 710 hours

Sub-contractor (Rice): 644 hours

Sub-contractor (Wisconsin): 300 hours

Sub-contractor (Los Alamos): 0 hours

3.0 Project Description (last modified 07/97)

3.1 Objective

The goals of this project are: (1) to develop a comprehensive environment (POEMS) for end-to-end performance analysis of large, heterogeneous, adaptive, parallel/distributed computer and communication systems, and (2) to demonstrate the use of the environment in analyzing and improving the performance of defense-critical parallel and distributed systems.

3.2 Approach

The project combines innovations from a number of domains (communication, data mediation, parallel programming, performance modeling, software engineering, and CAD/CAE) to realize these goals. First, we will develop a specification language based on a general model of parallel computation, with specializations for representing workload, hardware, and software. To enable direct use of programs as workload specifications, compilation environments such as dHPF will be adapted to generate executable models of parallel computation at specified levels of abstraction.

Second, we will experimentally and incrementally develop and validate scalable models. This will involve using multi-scale models, multi-paradigm models, and parallel model execution in complementary ways. Multi-scale models will allow different components of a system to be modeled at varying levels of detail via the use of adaptive module interfaces, supported by the specification language. Multi-paradigm models will allow an analyst to use the modeling paradigm (analytical, simulation, or the software or hardware system itself) that is most appropriate with respect to the goals of the performance study. Integration of an associative model of communication with data mediation methods to provide adaptive component interfaces will allow us to compose disparate models in a common modeling framework. To handle computationally expensive simulations of critical subsystems in a complex system, we will incorporate parallel simulation technology based on the Maisie language.
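To make the multi-paradigm interface idea concrete, the following minimal Python sketch (purely illustrative; the names Component, AnalyticCpu, and SimulatedCpu are ours, not POEMS interfaces) shows how an analytic model and a simulation-based model of the same component can be swapped behind a common evaluation interface:

    # Illustrative sketch of a multi-paradigm component interface: a component
    # answers performance queries, and whether it does so analytically or by a
    # (stand-in) simulation is hidden behind the common evaluate() method.
    from abc import ABC, abstractmethod

    class Component(ABC):
        @abstractmethod
        def evaluate(self, workload):
            """Return predicted service time in seconds for a workload dict."""

    class AnalyticCpu(Component):
        """Coarse analytic model: fixed execution rate, no contention."""
        def __init__(self, flops_per_sec):
            self.rate = flops_per_sec

        def evaluate(self, workload):
            return workload["flops"] / self.rate

    class SimulatedCpu(Component):
        """Detailed model: stand-in for a call into a cycle-level simulator."""
        def evaluate(self, workload):
            # A real framework would invoke a simulator here; we fake a mildly
            # contention-aware estimate instead.
            base = workload["flops"] / 1.0e9
            return base * (1.0 + 0.1 * workload.get("contention", 0))

    def end_to_end_time(components, workload):
        """Compose disparate component models over a serial pipeline."""
        return sum(c.evaluate(workload) for c in components)

    # Fidelity can be changed per component without touching the others:
    model = [AnalyticCpu(2.0e9), SimulatedCpu()]
    print(end_to_end_time(model, {"flops": 1.0e9, "contention": 2}))

The point of the sketch is the composition step: because both fidelities satisfy the same interface, an analyst can dial individual components up or down in detail without restructuring the end-to-end model.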

Third, we will provide a library of models, at multiple levels of granularity, for modeling scalable systems like those envisaged under the DOE ASCI program, and for modeling complex adaptive systems like those envisaged under the GloMo and Quorum programs.

Finally, we will provide a knowledge base of performance data that can be used to predict the performance properties of standard algorithms as a function of architectural characteristics.

4.0 Performance Against Plan

4.1 Spending – Spending has caught up with plan. All of the subcontracts except for LANL are in place. The spending rate for the project will, after this quarter, run at about the planned rate.

4.2 Task Completion - A summary of the completion status of each task in the SOW is given below. Because several participants are involved in most tasks, the completion estimates for tasks in progress carry some uncertainty. Assessments of task completion by the participating institutions are given in their individual progress reports.

Task 1 - 95% Complete - Methodology development is an iterative process: one develops a version of the methodology, applies it, and revises it according to the success attained in the application. Evaluation of the methodology is in progress with the analysis of the performance of Sweep3D on the SP2 family of architectures. Closure will come with the completion of Task 7, when the methodology will have been validated on the first end-to-end performance model.

Task 2 - Complete

Task 3 - 90% Complete - Specification languages for all three domains have been proposed and are in various states of completion.

Task 4 - 75% Complete - Task graphs can now be developed for most HPF programs and work on MPI programs is well underway.

Task 5 - 75% Complete - The compiler for the specification language is well into development. Use of the compilation methods developed for the CODE parallel programming system at UT-Austin has accelerated this task.

Task 6 - 55% Complete - The initial library of components has been specified and instantiation has begun. (See the progress reports from UTEP and Wisconsin for details.)

Task 7 - 40% Complete - Subtask or Phase 1 of this task is about 50% complete. (See the progress reports from UCLA and Wisconsin for details.)

Task 8 - 55% Complete

Task 9 – Task 9 has been partitioned into seven subtasks. Subtasks 9.1, 9.2, and 9.3 are complete. Subtask 9.4 is 40% complete, 9.5 is 25% complete, and 9.6 is 10% complete. Subtask 9.7 has not yet been initiated.

Task 10 - 0% Complete

Task 11 - 0% Complete

5.0 Major Accomplishments to Date

5.1 Project Management

a. Long Term Workplan

POEMS has generated the framework for end-to-end performance modeling and has developed initial versions of several major components. This year has been designated the "Year of Integration": the long-term goal for this year is integration of the POEMS components into the framework, which will enable the project to spend the bulk of its third year applying the environment to further example systems.

5.2 Technical Accomplishments

a. Knowledge Base

* Completed Task 9.3

* Started work on Task 9.6

b. Models, Model Evaluation and Modeling

* Integration of Task Graph Models of Applications and MPI-SIM - Rice and UCLA have begun a collaboration to combine the task graph representations developed at Rice with MPI-SIM, the execution-driven simulator for MPI programs developed at UCLA. We have developed practical, automatic techniques that exploit compiler support to enable simulation of systems with thousands of processors and of the realistic problem sizes expected on such large systems. The key idea behind our approach is that the information in the static task graph allows us to avoid executing large portions of the sequential computations; instead, these are modeled as simple delays estimated by a combination of compiler analysis and direct measurement. A toy sketch of this idea appears below.

We evaluated the accuracy and benefits of our technique on manually modified MPI programs, including Sweep3D, the SP program from the NAS benchmarks, and a synthetic application with a variety of communication patterns. For the Sweep3D code, for example, the errors we observed in our approximate simulation are typically 10-20%, and at most 37%. The total memory usage of the simplified simulator is 2-3 orders of magnitude less than that of the original simulator, and the simulation time is about a factor of five less. The impact of these improvements is described briefly under 'Significant Events.'
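A minimal sketch of the delay-replacement idea, assuming a toy task-graph representation (the graph, node kinds, and delay values below are illustrative, not the actual Rice/UCLA data structures): compute nodes carry compiler- or measurement-estimated delays and are never executed, and the predicted time is the critical-path length over the graph. In the real system, "comm" nodes would instead trigger detailed MPI-SIM communication events.

    # Toy sketch: sequential computations appear only as delay nodes, so the
    # simulator never executes them; predicted completion time is the length
    # of the critical path through the task graph.
    from collections import defaultdict

    # node -> (kind, delay_seconds, successors); values are illustrative
    GRAPH = {
        "compute_a": ("compute", 5.0, ["send_ab"]),
        "send_ab":   ("comm",    0.5, ["compute_b"]),
        "compute_b": ("compute", 3.0, []),
    }

    def predicted_time(graph):
        """Critical-path length of the DAG = predicted completion time."""
        indeg = defaultdict(int)
        for _node, (_kind, _delay, succs) in graph.items():
            for s in succs:
                indeg[s] += 1
        ready = [n for n in graph if indeg[n] == 0]
        finish = {}
        while ready:
            n = ready.pop()
            _kind, delay, succs = graph[n]
            start = max((finish[p] for p, (_, _, ss) in graph.items()
                         if n in ss), default=0.0)
            finish[n] = start + delay
            for s in succs:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
        return max(finish.values())

    print(predicted_time(GRAPH))  # 8.5 = 5.0 + 0.5 + 3.0 for this chain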

* Validation of MPI-SIM on the new testbed, the distributed shared memory SGI Origin 2000, has been a major activity. We have executed and simulated a variety of applications. In addition to Sweep3D, we have added two scientific programs and a synthetic application to our core experimental set. The programs BT and SP from the NAS Parallel Benchmarks 2 suite are two real-world applications designed to solve linear equations, and each has been run with varying problem sizes and numbers of parallel processors. Our own synthetic benchmark, SAMPLE, is designed to execute computation- and communication-intensive segments of code at varying ratios of computation-to-communication time and with different message-passing patterns (a hypothetical sketch of such a kernel appears below). These diverse applications have allowed us to validate MPI-SIM on a wide spectrum of programs.

MPI-SIM has been found to be accurate (within 5%) in predicting the performance of Sweep3D and the NAS benchmarks. MPI-SIM also proved accurate in predicting the performance of SAMPLE: even for a large communication-to-computation ratio of 1:2 and a wavefront communication pattern, the prediction error is below 10%, and it is around 5% for smaller ratios. Similar results were obtained for the other communication patterns.
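Because the SAMPLE code itself is not reproduced in this report, the following hypothetical mpi4py sketch illustrates the kind of kernel described above. The RATIO knob, payload size, step count, and ring message pattern are our illustrative choices, not SAMPLE's actual parameters:

    # Hypothetical SAMPLE-style synthetic kernel: alternate compute phases
    # with message passing, with a tunable computation-to-communication
    # ratio.  Run with, e.g.:  mpirun -np 4 python sample_sketch.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    RATIO = 2          # compute repetitions per communication step
    MSG_BYTES = 4096   # message payload size
    STEPS = 100

    sendbuf = np.zeros(MSG_BYTES, dtype=np.uint8)
    recvbuf = np.empty_like(sendbuf)
    work = np.random.rand(100_000)

    t0 = MPI.Wtime()
    for _ in range(STEPS):
        for _ in range(RATIO):              # computation-intensive segment
            work = np.sqrt(work * work + 1.0)
        right = (rank + 1) % size           # ring pattern; a wavefront or
        left = (rank - 1) % size            # other patterns would differ here
        comm.Sendrecv(sendbuf, dest=right, recvbuf=recvbuf, source=left)
    elapsed = MPI.Wtime() - t0

    if rank == 0:
        print("%d ranks, ratio %d: %.3f s" % (size, RATIO, elapsed))

Varying RATIO and the message pattern while predicting elapsed time with a simulator is exactly the kind of controlled validation experiment described above.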

* UTEP completed testing of MPI_SS. A draft of the documentation is also complete.

* Vernon revised the paper submitted to PPoPP '99, "Predictive Analysis of a Wavefront Application Using LogGP", for publication in the conference.

* The POEMS group jointly revised and extended the WOSP '98 POEMS paper for submission to a special issue of the IEEE Transactions on Software Engineering. The paper revisions prompted refinements in the overall modeling methodology as well as refinements in the role that hybrid models can play in the methodology.

c. Frameworks, Specification Languages and Compilers

* Design for the mapping of the output of the task graph compiler to the POEMS Specification Language was initiated.

6.0 Artifacts Developed

6.1 Technical Papers

The POEMS project had two papers accepted for the important Principles and Practice of Parallel Programming symposium.

[1] "A Data Mining Environment for Modeling the Performance of Scientific Software", E.N. Houstis, V.S. Verykios, A.C. Catlin, N. Ramakrishnan, and J.R. Rice. Submitted to KDD-99: Intl. Conf. on Knowledge Discovery and Data Mining.

[2] "A Data Mining Environment for Modeling the Performance of Scientific Software". Accepted for publication in the book "Problem Solving Environments for Computational Science", E. Houstis, S. Gallopoulos, J. Rice, and R. Bramley (editors), IEEE Press (to appear).

[3] "PYTHIA-II: A Knowledge/Data Base System for Testing and Recommending Scientific Software", E.N. Houstis, V.S. Verykios, A.C. Catlin, N. Ramakrishnan, and J.R. Rice. Submitted to ACM Trans. Math. Software.

[4] "POEMS – End to End Performance Models for Dynamic Parallel and Distributed Systems", J.C. Browne. In Proceedings of the Seventh Symposium on the Frontiers of Massively Parallel Computing (Annapolis, MD, February 21-25, 1999), pp. 160-163.

[5] "Compiler-Supported Simulation of Very Large Parallel Applications", Vikram Adve, Rajive Bagrodia, Ewa Deelman, Thomas Phan, and Rizos Sakellariou. Submitted to Supercomputing '99.

[6] "Performance Prediction of Large Parallel Applications Using Parallel Simulations", Rajive Bagrodia, Ewa Deelman, and Thomas Phan. To appear in the Proceedings of the ACM SIGPLAN 1999 Symposium on Principles and Practice of Parallel Programming, Atlanta, Georgia, May 4-6, 1999.

[7] "POEMS: End-to-end Performance Design of Large Parallel Adaptive Computational Systems", V. Adve et al. Submitted to IEEE Transactions on Software Engineering.

[8] "Predictive Analysis of a Wavefront Application Using LogGP", M. Vernon and D. Sunderam-Stukel. To appear in the Proceedings of the ACM SIGPLAN 1999 Symposium on Principles and Practice of Parallel Programming, Atlanta, Georgia, May 4-6, 1999.

7.0 Issues

7.1 Open Issues with no Plan for Resolution

a. Integration of MPI-SIM and SimpleScalar - We are still examining the possibility of integrating MPI-SIM with the SimpleScalar processor and memory simulator used by UTEP. The interface between the two simulators needs to be defined and developed.

b. Fortran and SimpleScalar - Simulating Fortran programs within the SimpleScalar Tool Set.

7.2 Open Issues with Plan for Resolution

a. Entry of Performance Data - How to get performance data from other sites inserted into the database.

b. Integration of LogGP and AMVA Models - How to integrate the LogGP specification of application synchronization structure with the AMVA model of the Origin 2000-like memory for more precise end-to-end modeling of the Splash benchmarks (see the LogGP sketch after this list).

c. Interface Definition - Definition of the interfaces across application, operating system/runtime environment, and hardware modeling domains.

d. Benchmarking - Understanding the output of the narrow spectrum benchmarks being used for calibration.
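Regarding issue b above, the building block such an integration composes along the application's synchronization structure is the standard LogGP cost of delivering one k-byte message (parameters: L, network latency; o, per-message send/receive overhead; G, gap per byte). This is the textbook LogGP form, not a formula taken from the project's papers:

    T_msg(k) = o + (k - 1) * G + L + o

One overhead term o is paid at the sender and one at the receiver. Under our reading of the issue, a natural integration point would be for the AMVA memory model to supply contention-adjusted values for terms such as L.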

7.3 Issues Resolved

a. MPI-SIM on the SGI Origin - The validation of MPI-SIM on the Origin 2000 has been completed.

8.0 Near-term Plan

The near term plan focuses on completing components of POEMS and getting them ready for integration.

a. Knowledge-Based System

* PYTHIA II system: complete the enhanced system.

* Performance database. Complete creation of a large set of performance data for linear algebra solvers. Complete evaluation of the knowledge inference methodology for linear algebra solvers.

b. Models and Model Evaluation

* Rice and UCLA are continuing work on the integration of task graph generation and MPI-SIM. Rice is extending the dHPF compiler infrastructure to automatically synthesize simplified MPI programs that capture exactly the computation and communication code that must be explicitly simulated. This is based directly on the information in the task graph, plus an additional analysis using program slicing to isolate those portions of the computation that need to be simulated (because they affect the communication behavior or the sequential computation times); a toy sketch of this selection step follows.
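A minimal sketch of that selection step, assuming a toy statement-level dependence graph (the statement names and edges below are illustrative, not dHPF output): a backward walk from the communication statements marks everything they depend on, and unmarked computation can then be summarized as delays.

    # Toy backward slice over a statement dependence graph: keep (simulate)
    # only statements that communication depends on; the rest can be replaced
    # by compiler- or measurement-estimated delays.

    # stmt -> statements it depends on (illustrative)
    DEPENDS_ON = {
        "recv_halo":   [],
        "compute_i":   ["recv_halo"],
        "pack_msg":    ["compute_i"],
        "send_halo":   ["pack_msg"],
        "diagnostics": ["compute_i"],  # affects nothing that is communicated
    }
    COMM_STMTS = {"send_halo", "recv_halo"}

    def must_simulate(depends_on, comm_stmts):
        """Backward reachability from the communication statements."""
        keep, stack = set(), list(comm_stmts)
        while stack:
            s = stack.pop()
            if s in keep:
                continue
            keep.add(s)
            stack.extend(depends_on.get(s, []))
        return keep

    keep = must_simulate(DEPENDS_ON, COMM_STMTS)
    print("simulate:", sorted(keep))
    print("replace with delays:", sorted(set(DEPENDS_ON) - keep))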

* Rice and UT-Austin have begun a collaboration to interface the static and dynamic task graphs generated by dHPF to the POEMS Specification Language.

* Distribute final version of documentation of MPI_SS. Collaborate with UCLA to interface SimpleScalar with MPI-SIM.

* Distribute a new version of the "Hardware Domain Component Model Specification" document, which will address feedback from the UCLA and LANL Poets and will include the specification of the LLNL-SP/2 Power604e and a description of SimpleScalar.

* Validate the SimpleScalar simulations of the Power604e and MIPS R10000 by comparing simulated execution times with actual execution times for the block of work.

* Using SimpleScalar, investigate the performance of Sweep3D on next-generation architectures. This work is in collaboration with the University of Wisconsin-Madison. The LogGP model will be used in association with SimpleScalar simulation results. This study will include the memory study for Sweep3D.

* Sweep3D CPU stall study in collaboration with LANL.

* Experiment with further hybrid analytic/simulation models of Sweep3D, including models with simulated alternative memory hierarchies (with the UCLA and UTEP teams).

c. Task Graph Generation

* Work will continue on Task 4. We aim to improve the functionality of the existing prototype.

* Interfacing the task graph to an execution-driven MPI simulator, MPI-SIM, to improve the efficiency of simulation of MPI programs.

d. Frameworks, Specification Language and Compilers

* Mapping of task graphs to the POEMS Specification Language.

* Revision and extension of the POEMS Specification Language.

* Revision and extension of the POEMS Specification Language Compiler.

9.0 Completed Travel

University of Texas at Austin

Browne attended the Frontiers of Massively Parallel Computing symposium in Annapolis, MD, on February 22-23, 1999, and presented a paper.

10.0 Equipment

None Acquired

11.0 Summary of Activity

The main foci for collective activities have been completion of the measurements and modeling for Sweep3D on the IBM SP2, and completion of POEMS components and preparation for integrating these components into the POEMS framework.

Each participating institution has been working on its responsibilities for Tasks 1, 2, 3, 4, 6, 7, and 9.

11.1 Work Focus

The foci for activities are broken out by topic.

a. Knowledge Base

* Develop the POEMS knowledge base system (renamed PYTHIA II)

* Expand the performance data set in the knowledge base

b. Models, Model Evaluation and Modeling

* Rice and UCLA collaborated to develop techniques to exploit application task graphs for efficient parallel simulation of MPI programs. We evaluated these integrated techniques using manually modified MPI programs and obtained substantial reductions in memory usage and simulation time for the Sweep3D benchmark, as described under 'Significant Events' below.

* Completion of the PPoPP '99 paper and creation of the new paper describing the POEMS methodology more completely.

* Hardware domain component model specification and implementation, in particular, implementation of the processor/memory subsystem component model/simulator of the LLNL-SP/2 Power604e and the SGI O2K MIPS R10000.

* Interfacing of processor/memory subsystem simulation and MPI-SIM.

* Research on next-generation processor/memory systems and microarchitecture performance analysis.

* Validation of MPI-SIM on the Origin 2000.

c. Task Graph Generation

* Continued development of task graph generation for HPF and MPI programs (Task 4), including interfacing the task graphs to MPI-SIM.

d. Framework – Methodology, Specification Language and Compiler

* Rice and UT-Austin initiated a collaboration to map task graphs to the POEMS specification language.

11.2 Significant Events

a. Knowledge Base

* Incorporation of a critical mass of linear algebra solver performance data within the PYTHIA II knowledge base.

b. Models, Model Evaluation and Modeling

* Rice and UCLA together demonstrated a 100-2000x reduction in memory usage and a 5x reduction in simulation time for the parallel simulation of MPI performance of the Sweep3D benchmark. For a realistic problem size, this allows us to simulate a 6,400-processor system, compared with a maximum of 400 processors without our optimizations. Furthermore, in some cases the simulation is actually faster than the real-time execution of the program. The error in predicted execution time caused by the optimizations is typically in the range of 10-20%, which should be acceptable for most purposes.

* Validated MPI-SIM on a set of benchmarks on the SGI Origin 2000.

 

 

FINANCIAL INFORMATION:

Contract #: N66001-97-C-8533

Contract Period of Performance: 7/24/97-7/23/00

Ceiling Value: $1,839,517

Reporting Period: 2/01/99-4/30/99

Actual Vouchered (all costs to be reported as fully burdened; do not report overhead, G&A, and fee separately):

Current Period

Prime Contractor                  Hours           Cost
  Labor                           1,200      32,147.00
  ODCs                                       28,865.16
Sub-contractor 1 (Purdue)           128      11,947.07
Sub-contractor 2 (UT-El Paso)       472      18,161.61
Sub-contractor 3 (UCLA)             710      35,704.08
Sub-contractor 4 (Rice)             644      32,942.44
Sub-contractor 5 (Wisconsin)        300       5,521.77
Sub-contractor 6 (Los Alamos)         0           0.00
TOTAL:                            3,454     165,289.13

Cumulative to date:

Prime Contractor                  Hours           Cost
  Labor                           7,160     209,725.01
  ODCs                                      276,140.12
Sub-contractor 1 (Purdue)         1,128      82,817.91
Sub-contractor 2 (UT-El Paso)     2,410     128,328.59
Sub-contractor 3 (UCLA)           2,220     112,048.94
Sub-contractor 4 (Rice)           2,415     141,948.65
Sub-contractor 5 (Wisconsin)      1,658      85,842.33
Sub-contractor 6 (Los Alamos)         0           0.00
TOTAL:                           16,991   1,036,851.55