Williams, Paul Leighton. "Learning Visual Scene Descriptions: An Approach to Symbol Grounding." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-01. December 2005. 36 pages.
Introducing a new architecture has never been a trouble-free subject for computer architects. Growing complexity and compatibility issues often hinder the progress of new architectures even though the technological trends demand rapid, wholesome changes. TRIPS, is one such novel architecture that targets the problems of wire delays, memory latencies, power consumption and saturated parallelism. TRIPS aims to be a scalable, malleable, dynamically adaptive and non-specialized architecture that supports diverse applications. This thesis analyzes the advantages and disadvantages of the TRIPS architecture and its prototype. Because this thesis is the first to evaluate the TRIPS architecture, it tries to establish a foundation based on which the evaluation and analysis of the TRIPS architecture can be carried out smoothly in the future. The goals of this thesis are three-fold. The thesis provides details of the tools developed to analyze the TRIPS architecture, reports the metrics that provide data on the effects of the features of the TRIPS architecture on the program output, and evaluates the efficiency of the TRIPS model by comparing it with the Alpha 21264 (RISC) machine for a set of common source programs.
Tran, Khoa D. "High Frequency Wave Propagation: a Convergent Approach, Part II: Implementation and Numerical Testing." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-02. December 7, 2005. 25 pages.
Numerical approximation of wave propagation is prevalent in many applications. When the wave frequency is high, this is a computationally challenging problem and is a study of interest. In this paper, I present the implementation and numerical testing of the convergent approach proposed by Bruno et al. and retheorized by Tran to solve problems of electromagnetic or acoustic scattering by convex obstacles. The problem is formulated as an integral equation. For the evaluation of the integral operators, a localized integration scheme is used. This, combined with a change of variables to resolve the problem of high slopes at the shadow boundaries, provides an efficient algorithm to solve the scattering problem for Helmholtz equation at high frequencies, with a computational cost that is independent of the frequency for a fixed accuracy.
Richardson, Clare. "Rapid, High Precision Control in Tightly Constrained Environments." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-03. December, 2005. 21 pages.
In autonomous robotics, environments in which there is little room for error in movement such as traveling through a doorway can be troublesome. Some control methods can only deal with these tasks by severely reducing the speed of the robot. However, combining both speed and precision, smooth movement around obstacles is possible. Many control methods focus on tasks ‰in relatively open environments such as traveling down corridors to test their speed and obstacle avoidance capabilities. I have chosen two such control methods: the Vector Field Histogram approach by Borenstein and Koren and the Dynamic Window approach by Fox, Burgard, and Thrun. I evaluated their ability in tightly-constrained test environments using Vulcan, a robotic wheelchair, and implemented modifications to improve their usability in such environments.
Chowdhury, Rezaul Alam and Vijaya Ramachandran. "The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework and Experimental Evaluation." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-04. March 29, 2006. 36 pages.
The cache-oblivious Gaussian Elimination Paradigm (GEP) was introduced by the authors in an earlier paper to obtain efficient cache-oblivious algorithms for several important problems that have algorithms with triply-nested loops similar to those that occur in Gaussian elimination. These include Gaussian elimination and LU-decomposition without pivoting, all-pairs shortest paths and matrix multiplication. In this paper, we prove several important properties of the cache-oblivious framework for GEP given earlier, which we denote by I-GEP. We build on these results to obtain C-GEP, a completely general cache-oblivious implementation of GEP that applies to any code in GEP form, and which has the same time and I/O bounds as the earlier algorithm, while using a modest amount of additional space. We present an experimental evaluation of the caching performance of I-GEP and C-GEP in relation to the traditional Gaussian elimination algorithm. Our experimental results indicate that I-GEP and C-GEP outperform GEP on inputs of reasonable size, with dramatic improvement in running time over GEP when the data is out of core. `Tiling', an important loop transformation technique employed by optimizing compilers in order to improve temporal locality in nested loops, is a cache-aware method that does not adapt to all levels in a multi-level memory hierarchy. The cache-oblivious GEP framework (either I-GEP or C-GEP) produces system-independent I/O-efficient code for triply nested loops of the form that appears in Gaussian elimination without pivoting, and is potentially applicable to being used by optimizing compilers for loop transformation.
Yu, Zeyun and Chandrajit Bajaj. The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-05. January 9, 2006. 12 pages.
We present computational approaches for analyzing/interpreting three-dimensional (3D) structures of large bio-molecular complexes at intermediate or moderate resolutions (5-20 angstroms). Two problems are addressed: one is the 3D alignment (matching) between segmented subunits and the other is the secondary structure identification within subunits. For the first problem, we propose a fast algorithm to spatially correlate two subunits. A number of applications are discussed and illustrated, including measuring similarities, averaging subunits, refining segmentation, and fitting atomic structures. For the second problem, we propose an efficient algorithm to detect the secondary structures, including both alpha-helices and beta-sheets, of a protein density map. The performance of our approaches is demonstrated on both experimentally reconstructed virus maps and computationally simulated protein density maps.
Siddavanahalli, Vinay K. and Chandrajit Bajaj. "Fast Error-bounded Surfaces and Derivatives Computation for Volumetric Particle Data." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-06. December 6, 2005. 16 pages.
Volumetric smooth particle data arise as atomic coordinates with electron density kernels for molecular structures, as well as fluid particle coordinates with a smoothing kernel in hydrodynamic flow simulations. In each case there is the need for efficiently computing approximations of relevant surfaces (molecular surfaces, material interfaces, shock waves, etc), along with surface and volume derivatives (normals, curvatures, etc.), from the irregularly spaced smooth particles. Additionally, molecular properties (charge density, polar potentials), as well as field variables from numerical simulations are often evaluated on these computed surfaces. In this paper we show how all the above problems can be reduced to a fast summation of irregularly spaced smooth kernel functions. For a scattered smooth particle system of M smooth kernels in R^3, where the Fourier coefficients have a decay of the type 1/{omega}^3, we present an O(M + n^3 \log n +N) time, Fourier based algorithm to compute N approximate, irregular samples of a level set surface and its derivatives within a relative L_2 error norm epsilon, where n is O(M^{1/3}epsilon^{1/3}). Specifically, a truncated Gaussian of the form e^{-bx^2} has the above decay, and n grows as sqrt{b}. In the case when the N output points are samples on a uniform grid, the back transform can be done exactly using a Fast Fourier transform algorithm, giving us an algorithm with O(M + n^3 \log n +N \log N) time complexity, where n is now approximately half its previously estimated value.
Jump, Maria and Kathryn S. McKinley. "Cork: Dynamic Memory Leak Detection for Java." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-07. January 25, 2006. 10 pages.
Despite all the benefits of garbage collection, memory leaks remain a problem for Java programs. A memory leak in Java occurs when a program inadvertently maintains references to objects that it no longer needs, preventing the garbage collector from reclaiming space. At best, leaks degrade performance. At worst, they cause programs to run out of memory and crash. Small continuous leaks in long-running programs are notoriously hard to find and can crash the program only after days or weeks of execution.We introduce Cork, a low-overhead, accurate technique for detecting memory leaks in Java programs. Cork identifies overall monotonic heap growth by piggybacking on the garbage collector. On each full-heap collection, Cork builds a summary type points-to graph annotated with type volumes. Cork identifies potentially leaking types that grow over multiple collections. Cork reports the slice in the type points-to graph that is growing (i.e., the data structure that points to the leaking type). We implement Cork in MMTk for Jikes RVM, where it adds an average overhead of 2.4\% for moderate heap sizes and 1.7\% for large heap sizes to SPECjvm98 and DaCapo benchmarks using a generational mark-sweep collector. Cork exactly identifies a single growing data structure in each of three popular benchmarks (fop, _202_jess, and SPECjbb2000). Due to the precision of Cork's report, we eliminated these leaks in _202_jess and SPECjbb2000, whereas their developers had not previously done so. Cork is the first tool to find leaks in Java with low enough overhead to consider using online.
Zarnani, Hormoz. "An Incremental Approach to Verifying the Intentional Naming System Framework and Extensions." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-08. December, 2005. 97 pages.
Lookup-Name is the name resolution algorithm for the Intentional Naming System (INS). We formalize this algorithm in the ACL2 logic and develop a specification for its correctness. We discuss certain difficulties that arise in our formalization of this algorithm which restrict our ability to prove its correctness. We describe a simpler version of the algorithm, specify it, and prove it correct. We discuss how the proof of the simple algorithm can be used as a guide to prove the original algorithm correct.
Changkyu, Kim, Simha Sethumadhavan, Nitya Ranganathan, Liu Haiming, Robert McDonald, Doug Burger, and Stephen W. Keckler. "Elastic Threads on Composable Processors." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-09. March 8, 2006. 18 pages.
Modern CMPs are designed to exploit both instruction-level parallelism within processors and thread-level parallelism across processors, but the balance between the granularity of each processor and the number of processors must be chosen at design time. In this paper, we propose a microarchitecture that allows this balance to be dynamically adjusted. The microarchitecture, which implements the TRIPS instruction set, consists of a large number of fine-grained, single-issue processor cores. By changing a set of OS-visible configuration registers, the system software can aggregate multiple cores to form larger, more powerful processors, depending on the needs of the available threads. For instance, a 64-core chip could be configured as 64 1-wide processors, 1 64-wide processor, or any combination in between. We quantify the area and performance overheads associated with providing the capability to compose larger processors out of multiple small ones, find the distinct ideal configuration for each of several applications, and show the additional benefits gained by explicit compiler support for specific configurations.
Bientinesi, Paolo and Robert van de Geijn. "Representing Dense Linear Algebra Algorithms: A Farewell to Indices." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-10. February 6, 2006. 25 pages.
We present a notation that allows a dense linear algebra algorithm to be represented in a way that is visually recognizable. The primary value of the notation is that it exposes subvectors and submatrices allowing the details of the algorithm to be the focus while hiding the intricate indices related to the arrays in which the vectors and matrices are stored. The applicability of the notation is illustrated through a succession of progressively complex case studies ranging from matrix-vector operations to the chasing of the bulge of the symmetric QR iteration. The notation facilitates comparing and contrasting different algorithms for the same operation as well as similar algorithms for different operations. Finally, we point out how algorithms represented with this notation can be directly translated into high-performance code.
Ha, Jungwoo, Hany Ramadan, Jason V. Davis, Christopher Rossbach, Indrajit Roy, and Emmett Witchel. "Navel: Automating Software Support by Classifying Program Behavior." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-11. March 22, 2006. 12 pages.
End-user software problems take too much time to resolve, in part due to unclear or ambiguous error messages. The quality of error messages embedded within software is unlikely to improve given the variety of contexts in which errors can occur, the programming complexity of sophisticated error reporting, and the modular structure of modern applications. While vendors supply documents, help systems and websites to support end users, it is still difficult for users to figure out how to resolve their problems. Navel improves error reporting by monitoring software execution and determining if a particular execution is an instance of a known error. As a program executes, Navel builds a compact abstraction of the program's behavior (a behavior profile) using control flow information. Navel classifies behavior profiles using a machine learning model trained on known errors by vendors, support organizations or other users, enabling them to better disseminate error workarounds by matching user behavior profiles with known problems. Navel provides a way for an average user to get solutions to software problems with less effort. A prototype implementation, Skepsis, demonstrates the efficacy of the Navel approach. Skepsis collects three behavior profiles based on program control flow: function counting, path profiling, and a new technique, call-tree profiling. We evaluate Skepsis on confusing error messages currently emitted by large, mature programs including the gcc compiler and Microsoft's Visual FoxPro database. Using call-tree profiling, Skepsis achieves an average classification accuracy of 97% across a range of nine benchmarks on two operating systems, while function counting and path profiling achieve average classification accuracies of 92% and 94% respectively.
Nethercote, N., K. S. McKinley, and D. Burger. "Self-Evaluating Compilation Applied to Loop Unrolling." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-12. February 13, 2006. 17 pages.
Well-engineered compilers use a carefully selected set of optimizations, heuristic optimization policies, and a phase ordering to produce good machine code. Designing a compiler with one heuristic per optimization that works well with other optimization phases is a challenging task. Although compiler designers evaluate the optimization heuristics and phase ordering before deployment, compilers typically do not statically evaluate nor refine the quality of their optimization decisions during a specific compilation. This paper identifies a class of optimizations for which the compiler can evaluate the effectiveness of its heuristics and phase interactions statically, and when necessary re-run optimization phases, using information from the evaluation phase to guide its heuristics. We call this approach self-evaluating compilation (SEC). This model avoids some of the difficulties of predicting phase interactions, and perfecting any one heuristic. The SEC model was motivated by loop unrolling and other optimizations for the TRIPS architecture. TRIPS has a limit on instructions that the compiler can place in an atomic execution unit (a TRIPS block), yet each block has a fixed minimum cost. The goal of loop unrolling (and other optimizations) is to produce as full a block as possible without exceeding the block size, since an unnecessary block with a small number of instruction degrades performance. Because unrolling enables downstream optimizations, it needs to occur well before code generation, but this position makes it impossible to predict the final number of instructions. However, eventually the compiler generates code and can, on a per-loop basis, determine if it unrolled too much or too little or just right. If need be, SEC unrolling then goes back and adjusts the unroll amount accordingly and reruns subsequent optimization phases. We demonstrate a prototype SEC unrolling implementation that automatically matches the best hand unrolled version for a set of microbenchmarks on the TRIPS architectural simulator. Although motivated by TRIPS compilation challenges, SEC is broadly applicable to helping solve compilation phase ordering and heuristic design for resource constraints such as register and code size limitations which can be measured statically and occur when compiling for embedded, VLIW, and partitioned hardware.
Arikan, Okan. "Compression of Motion Capture Databases." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-13. .
We present a lossy compression algorithm for large databases of motion capture data. We approximate short clips of motion using Bezier curves and clustered principal component analysis. This approximation has a smoothing effect on the motion. Contacts with the environment (such as foot strikes) have important detail that needs to be maintained. We compress these environmental contacts using a separate, JPEG like compression algorithm and ensure these contacts are maintained during decompression. Our method can compress 6 hours 34 minutes of human motion capture from 1080 MB data into 35.5 MB with little visible degradation. Compression and decompression is fast: our research implementation can decompress at about 1.2 milliseconds/frame, 8 times faster than real-time (for 120 frames per second animation). Our method also yields smaller compressed representation for the same error or produces smaller error for the same compressed size.
Jung, Eunjin (EJ), Ehab S. Elmallah, and Mohamed G. Gouda. "Optimal Dispersal of Certificate Chains." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-14. February 22, 2006. 33 pages.
We consider a network where users can issue certificates that identify the public keys of other users in the network. The issued certificates in a network constitute a set of certificate chains between users. A user u can obtain the public key of other user v from a certificate chain from u to v in the network. For the certificate chain from u to v, u is called the source of the chain and v is called the destination of the chain. Certificates in each chain are dispersed between the source and destination of the chain such that the following condition holds. If any user u needs to securely send messages to any other user v in the network, then u can use the certificates stored in u and v to obtain the public key of v (then u can use the public key of v to set up a shared key with v to securely send messages to v). The cost of dispersing certificates in a set of chains among the source and destination users in a network is measured by the total number of certificates that need to be stored in all users. A dispersal of a set of certificate chains in a network is optimal if no other dispersal of the same chain set has a strictly lower cost. In this paper, we show that the problem of computing optimal dispersal of a given chain set is NP-Complete. Thus, minimizing the average number of certificates stored in any node is NP-Complete. We identify three special classes of chain sets that are of practical interests and devise three polynomial-time algorithms that compute optimal dispersals for each class. We also present two polynomial-time extensions of these algorithms for more general classes of chain sets.
Edwards, H. Carter and Robert A. van de Geijn. "Application Interface to Parallel Dense Matrix Libraries: Just let me solve my problem!." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-15. February 20, 2006. 8 pages.
We focus on how applications that lead to large dense linear systems naturally build matrices. This allows us explain why traditional interfaces to dense linear algebra libraries for distributed memory architectures, which evolved from sequential linear algebra libraries, inherently do not support applications well. We review the application interface that has been supported by the Parallel Linear Algebra Package (PLAPACK) for almost a decade, which appears to support applications better. The lesson learned is that an application-centric interface can be easily defined, deminishing the obstacles that exist when using large distributed memory architectures.
Belaramani, Nalini, Mike Dahlin, Lei Gao, Amol Nayate, Arun Venkataramani, Praveen Yalagandula, and Jiandan Zheng. "PRACTI Replication." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-17. . 0 pages.
Jain, Navendu, Lisa Amini, Henrique Andrade, Richard King, Yoonho Park, Philipe Selo, and Chitra Venkatramani. "Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core.." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-18. . 0 pages.
Plaxton, Greg and Ned Dimitrov. "Buyer-Supplier Games: Core Characterization and Computation." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-19. April 7, 2006. 29 pages.
In a buyer-supplier game, a distinguished player, called the buyer, wishes to purchase some combinatorial object. A set of players, called suppliers, offer pieces of the object for sale. In this paper, we provide a transformation from most combinatorial minimization problems to buyer-supplier games; a characterization of the core of buyer-supplier games in the transferable and non-transferable utility cases; a polynomial time algorithm for optimization over the core of buyer-supplier games for which the underlying minimization problem is solvable in polynomial time; and an impossibility result showing that it is hard to distinguish core vectors if the base minimization problem is not solvable in polynomial time. We also introduce and study the concept of focus point price, which answers the question: If we are constrained to play in equilibrium, how much can we lose by playing the wrong equilibrium?
Bientinesi, Paolo, Brian Gunter, and Robert van de Geijn. "Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-20. April 8, 2006. 22 pages.
We present families of algorithms for operations related to the computation of the inverse of a Symmetric Positive Definite (SPD) matrix: Cholesky factorization, inversion of a triangular matrix, multiplication of a triangular matrix by its transpose, and one-sweep inversion of an SPD matrix. These algorithms are systematically derived and implemented via the Formal Linear Algebra Methodology Environment (FLAME), an approach for developing linear algebra algorithms. How different members of these families of algorithms are more or less suited for a given architecture is demonstrated via implementations for sequential, shared-memory, and distributed memory parallel architectures. Performance on various platforms is reported.
Stoll, Gordon, William R. Mark, Peter Djeu, Rui Wang, and Ikrima Elhassan. "Razor: An Architecture for Dynamic Multiresolution Ray Tracing." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-21. April 26, 2006. 15 pages.
Rendering systems organized around the ray tracing visibility algorithm provide a powerful and general tool for generating realistic images. These systems are being rapidly adopted for offline rendering tasks, and there is increasing interest in utilizing ray tracing for interactive rendering as well. Unfortunately, standard ray tracing systems suffer from several fundamental problems that limit their flexibility and performance, and until these issues are addressed ray tracing will have no hope of replacing Z-buffer systems for most interactive graphics applications. To realize the full potential of ray tracing, it is necessary to use variants such as distribution ray tracing and path tracing that can compute compelling visual effects: soft shadows, glossy reflections, ambient occlusion, and many others. Unfortunately, current distribution ray tracing systems are fundamentally inefficient. They have high overhead for rendering dynamic scenes, use excessively detailed geometry for secondary rays, perform redundant computations for shading and secondary rays, and have irregular data access and computation patterns that are a poor match for cost-effective hardware. We describe Razor, a new software architecture for a distribution ray tracer that addresses these issues. Razor supports watertight multiresolution geometry using a novel interpolation technique and a multiresolution kD-tree acceleration structure built on-demand each frame from a tightly integrated application scene graph. This dramatically reduces the cost of supporting dynamic scenes and improves data access and computation patterns for secondary rays. The architecture also decouples shading computations from visibility computations using a two-phase shading scheme. It uses existing best-practice techniques including bundling rays into SIMD packets for efficient computation and memory access. We present an experimental system that implements these techniques, although not in real time. We present results from this system demonstrating the effectiveness of its software architecture and algorithms.
Jain, Navendu, Dmitry Kit, Prince Mahajan, Praveen Yalagandula, Mike Dahlin, and Yin Zhang. "PRISM: Precision-aware Aggregation for Scalable Monitoring." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-22. .
Goto, Kazushige and Robert van de Geijn. "High-Performance Implementation of the Level-3 BLAS." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-23. May 5, 2006. 17 pages.
A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used matrix-matrix computations (the level-3 BLAS) is presented. Exceptional performance is demonstrated on various architectures.
Lopez-Herrejon, Roberto E. and Don Batory. "From Crosscutting Concerns to Product Lines: A Function Composition Approach." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-24. May 8, 2006. 10 pages.
Aspects offer sophisticated mechanisms to modularize crosscutting concerns. Asp ect Oriented Programming (AOP) has been successfully applied to many domains; however, its application to product line engineering has not been thoroughly explored. Features are increments in program functionality and are building blocks of software product lines. Work on Feature Oriented Programming (FOP) has shown that a crucial factor to synthesize product lines is composing features by function composition. In this paper we describe a way to emulate function composition using AspectJ for the synthesis of a non-trivial product line, present a general mechanism to support it and highlight its potential reuse benefits. Our study also profiles the role different aspect constructs play in the synthesis of product lines and offers venues of research on the use of aspects in product line implementations.
Clement, Allen, Jeff Napper, Lorenzo Alvisi, and Mike Dahlin. "BAR Games." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-25. May 30, 2006. 13 pages.
This paper describes a general methodology for simplifying the design and analysis of BAR protocols. BAR protocols allow the participation of Byzantine, Altruistic, and Rational players. Because BAR protocols tolerate both arbitrary behaviors by some nodes and selfish behavior by the rest, they are appropriate for service and applications spanning multiple administrative domains. We focus our attention on IC-BFT protocols that (a) guarantee a set of safety and liveness properties to all non-Byzantine nodes and (b) insure that rational nodes follow the protocol faithfully. We rely on existing techniques to show that safety and liveness are maintained when all non-Byzantine nodes follow the protocol. In order to show that rational nodes follow the protocol faithfully, we decomponse the BAR game corresponding to the problem specification into a combination of a $n$-player benefit game and a collection of $n$ 2-player cost games. We provide a set of sufficient properties to show that the protocol is a CBE in the benefit game and that rational nodes will thus follow the protocol faithfully. We present the first synchronous IC-BFT TRB protocol, basing our protocol design on these properties.
Li, Harry C., Allen Clement, Edmund L. Wong, Jeff Napper, Indrajit Roy, Lorenzo Alvisi, and Michael Dahlin. "BAR Gossip." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-26. June 29, 2006. 14 pages.
We present the first peer-to-peer data streaming application that guarantees predictable throughput and low latency in the BAR model, in which non-altruistic nodes can behave in a self-serving (rational) or even arbitrarily malicious (Byzantine) way. At the core of our solution is a BAR-tolerant version of gossip, a well-known technique for scalable and reliable data dissemination. In traditional gossip, data dissemination is performed with randomly selected partners; such non-determinism, we show, offers an excellent opening to rational nodes bent on gaming the system. BAR-tolerant gossip instead relies on a verifiable pseudo-random partner selection mechanism that eliminates non-determinism while maintaining the unpredictability and rapid convergence of traditional gossip. Our initial experience indicates that BAR Gossip is robust against up to 20% of nodes exhibiting Byzantine behavior and even against up to 40% of nodes colluding together against the remaining nodes. In either case, our BAR-tolerant vi deo streaming application provides over 95% convergence for broadcast updates.
Sra, Suvrit and Inderjit S. Dhillon. "Nonnegative Matrix Approximation: Algorithms and Applications." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-27. June 21, 2006. 36 pages.
Low dimensional data representations are crucial to numerous applications in machine learning, statistics, and signal processing. Nonnegative matrix approximation (NNMA) is a method for dimensionality reduction that respects the nonnegativity of the input data while constructing a low-dimensional approximation. NNMA has been used in a multitude of applications, though without commensurate theoretical development. In this paper we describe generic methods for minimizing generalized divergences between the input and its low rank approximant. Some of our general methods are even extensible to arbitrary convex penalties. Our methods yield efficient multiplicative iterative schemes for solving the proposed problems. We also consider interesting extensions such as the use of penalty functions, non-linear relationships via ``link'' functions, weighted errors, and multi-factor approximations. We present some experiments as an illustration of our algorithms. For completeness, this report also includes a brief literature survey of the various algorithms and the applications of NNMA.
Rozner, Eric, Yogita Mehta, Aditya Akella, and Lili Qiu. "Traffic-Aware Channel Assignment in Wireless LANs." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-28. September 7, 2006. 9 pages.
Plaxton, Greg C., Yu Sun, Mitul Tiwari, and Harrick Vin. "Reconfigurable Resource Scheduling with Variable Delay Bounds." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-29. June 10, 2006. 18 pages.
Certain emerging network applications involve dynamically allocating shared resources to a variety of services to provide QoS guarantees for each service. Motivated by such applications, we address the following online scheduling problem belonging to the recently introduced class of reconfigurable resource scheduling: unit jobs of different categories arrive over time and need to be completed within category-specific delay guarantees, or else they are dropped at a unit drop cost; processors can be reconfigured to process jobs of a certain category at a fixed reconfiguration cost; the goal is to minimize the total cost. We study this problem in the framework of competitive analysis. Through a novel combination of the EDF and LRU scheduling principles, we obtain an online algorithm that is constant competitive when given a constant factor advantage in the number of resources over an optimal offline algorithm.
Plaxton, Greg C., Mitul Tiwari, and Praveen Yalagandula. "Online Aggregation over Trees." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-30. June 5, 2006. 30 pages.
Consider a distributed network with nodes arranged in a tree and each node having a local value. We formulate an aggregation problem as the problem of aggregating values (e.g. summing values) from all nodes to the requesting nodes in the presence of writes. The goal is to minimize the total number of messages exchanged. The key challenges are to define a notion of ``acceptable'' aggregate values, and to design algorithms with good performance that are guaranteed to produce such values. We formalize the acceptability of aggregate values in terms of certain consistency guarantees. The aggregation problem admits a spectrum of solutions that trade off between consistency and performance. We propose a lease-based aggregation mechanism as a design point in this spectrum, and evaluate algorithms based on this mechanism in terms of consistency and performance. With regard to consistency, we generalize the definitions of strict and causal consistency, traditionally defined for distributed shared memory, for the aggregation problem. We show that any lease-based algorithm provides strict consistency in sequential executions, and causal consistency in concurrent executions. With regard to performance, we propose an online lease-based algorithm, and show that, for sequential executions, the algorithm is $\frac{5}{2}$-competitive against an optimal offline lease-based algorithm, and $5$-competitive against an optimal offline algorithm that provides strict consistency. The key highlight of the results is the design of an online algorithm that effectively reduces the analysis to reasoning about a pair of neighboring nodes.
Jung, Eunjin and Mohamed G. Gouda. "Rating Certificates." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-31. May 26, 2006. 10 pages.
We consider a system where each user has a public key and a private key. In this system, a certificate is a data item that is issued by one user u and contains the public key of another user v. A third user w that knows the public key of u can verify that this certificate has not been corrupted (by an adversary) since it was issued by u, and so can accept the public key in the certificate as the correct public key of v. User w can use this accepted public key of v in two ways. First, w can securely communicate with v. Second, w can obtain more public keys of other users, as it used the public key of u to obtain the public key of v. However, the safety of the second use is questionable if u, the issuer of the certificate, has concluded that it cannot trust v enough to accept a public key merely because v accepts it. To solve this problem, we propose that each certificate should have a ``rating". The rating of a certificate describes how much trust the issuer puts on the subject concerning key acceptance. We present two algorithms for computing the set of all users that can accept the given public key where all certificates have ratings. The first algorithm is simple, but its time complexity is O(n^3), where n is the number of users in the system. The second algorithm is more sophisticated, but its time complexity is O(e), where e is the number of certificates in the system. This algorithm meets the lower bound of the worst case complexity. We also discuss how to find an input to these two algorithms, and present two algorithms that compute an optimal set of certificates that are necessary for a user to accept the public key of another users.
Kitchin, David, William R. Cook, and Jayadev Misra. "Semantic Properties of Asynchronous Orc." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-32. January 23,2007. 38 pages.
Orc is a new language for task orchestration, a form of concurrent programming with applications in workflow, business process management, and web service orchestration. Orc provides constructs to orchestrate the concurrent invocation of services -- while managing time-outs, priorities, and failure of services or communication. In this paper, we show a trace-based semantic model for Orc, which induces a congruence on Orc programs and facilitates reasoning about them. Despite the simplicity of the language and its semantic model, Orc is able to express a variety of useful orchestration tasks. This technical report contains full proofs of the theorems.
Batory, Don and Sahil Thaker. "Towards Safe Composition of Product Lines." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-33. June 20, 2006. 10 pages.
Programs of a software product line can be synthesized by composing modules that implement features. Besides high-level domain constraints that govern the compatibility of features, there are also low-level implementation constraints: a feature module can reference elements that are defined in other feature modules. Safe composition is the guarantee that programs composed from feature modules are absent of references to undefined elements (such as classes, methods, and variables). We show how many properties of safe composition can be verified for AHEAD product lines using feature models and SAT solvers.
Nagarajan, Ramadass, Robert G. McDonald, Doug Burger, and Stephen W. Keckler. "Implementation of the Control Unit in the TRIPS Prototype Processor." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-34. June 26, 2006. 12 pages.
Future processor microarchitectures will feature distributed hardware components communicating using on-chip interconnection networks. Managing the common execution state and controlling the operations of different components are important design challenges for performing distributed computation on such architectures. This paper describes the fine-grained control mechanisms used in the distributed microarchitecture of the TRIPS prototype processor. A set of master-slave protocols driven from a centralized unit and implemented atop point-to-point networks controls the overall execution in the processor. The protocols are latency tolerant and support a back-end capable of executing up to 16 instructions in each cycle.
Jung, Eunjin. "Dispersability and Vulnerability Analysis of Certificate Systems." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-35. June 27, 2006. 127 pages.
A certificate is a way to distribute public keys of users in a distributed system. For example, in the current Internet, certificates are heavily used in SSL/TLS for securing e-commerce. In this thesis, we describe the three phases of a certificate, how a certificate is issued, used, and revoked/expired. In particular, we propose a new way of distributing certificates, called certificate dispersal. Certificate dispersal assigns certificates to users such that when a user u wants to securely communicate with another user v in a system, users u and v may find out the public key of user v based on the certificates stored in u or v. In other words, users u and v have no need to contact any other user in the system. We define dispersal in two environments, a certificate graph and a certificate chain set and the costs of dispersal. In the environment of certificate chain set, computing an optimal dispersal is NP-complete. However, we identify several classes of chain sets and certificate graphs for which optimal dispersal can be computed in polynomial-time. For each class we present an algorithm that computes an optimal dispersal. We also analyze the vulnerability of certificate systems. Any certificate system suffer from impersonation attacks when a private key of a user is revealed to an adversary. We define the metric called vulnerability that measures the scope of damage when some private keys are revealed, and show how different certificate systems have different vulnerabilities. These results can be used to design a good certificate system that satisfies system requirements of dispersal cost and vulnerability.
Maher, Bertram, Aaron Smith, Doug Burger, and Kathryn S. McKinley. "Merging Head and Tail Duplication for Convergent Hyperblock Formation." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-36. .
VLIW and EDGE (Explicit Data Graph Execution) architectures both rely on the compiler to form high-quality hyperblocks for good performance. Typical VLIW compilers perform loop unrolling first, then if-conversion, and follow with a final phase of scalar optimizations, such as predicate combining and common subexpression elimination. This approach limits the compiler's ability to exploit or correct interactions between these phases. EDGE architectures complicate this approach because they impose structural constraints on hyperblocks, such as instruction count and instruction composition. This paper introduces convergent hyperblock formation, an algorithm for forming hyperblocks that iteratively applies if-conversion and scalar optimizations until it converges on a hyperblock that is is as close as possible to the architectural constraints. This algorithm performs loop unrolling, loop peeling, and tail duplication as special cases of a general code duplication process to enable if-conversion. The key insight is that unrolling and peeling are just generalized cases of code duplication that enable if-conversion at the head of a loop, thus the term head duplication. Thus the algorithm integrates peeling and unrolling for if-conversion as part of its iterative process. Simulation results for an EDGE architecture show that convergent hyperblock formation improves code quality over different orderings of discrete-phase approaches, using both traditional VLIW heuristics and new heuristics tuned for this framework.
Aiyer, Amitanand S., Lorenzo Alvisi, and Rida A. Bazzi. "Byzantine and Multi-writer k-quorums." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-37. September 25, 2006. 20 pages.
Single-writer $k$-quorum protocols achieve high availability without incurring the risk of read operations returning arbitrarily stale values: in particular, they guarantee that, even in the presence of an adversarial scheduler, any read operation will return the value written by one of the last $k$ writes. In this paper, we expand our understanding of $k$-quorums in two directions: first, we present a single-writer $k$-quorum protocol that tolerates Byzantine server failures; second, we extend the single-writer $k$-quorum protocol to a multi-writer solution that applies to both the benign and Byzantine cases. For a system with $m$ writers, we prove a lower bound of $\big( (2m-1)(k-1) + 1 \big)$ on the staleness of any multi-writer protocol built over a single-writer $k$-quorum system and propose a multi-writer protocol that provides an almost matching staleness bound of $\big( (2m-1)(k-1) + m \big)$.
Li, Yi, Yin Zhang, Lili Qiu, and Simon S. Lam. "SmartTunnel: A Multipath Approach to Achieving Reliability in the Internet." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-38. July 30, 2006. 13 pages.
Reliability is critical to a variety of network applications. Unfortunately, due to lack of QoS support across ISP boundaries, it is difficult to achieve even two 9s (99%) reliability in today's Internet. In this paper, we propose SmartTunnel, an end-to-end approach to achieving reliability. A SmartTunnel is a logical point-to-point tunnel between two end points that spans multiple physical network paths. It achieves reliability by strategically allocating traffic onto multiple paths and performing FEC coding. Such an end-to-end approach requires no explicit QoS support from intermediate ISPs, and is therefore easy to deploy in today's Internet.To fully realize the potential of SmartTunnel, we analytically derive the near-optimal traffic allocation schemes that minimize loss rates.We extensively evaluate our approach using trace-driven simulations, ns-2 simulations, and experiments on PlanetLab. Our results clearly demonstrate that SmartTunnel is effective in achieving high reliability.
Sethumadhavan, Simha, Doug Burger, and Steve Keckler. "Partition the banks, not the functionality, of Large-Window Load-Store Queues." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-39. August 2006. 23 pages.
Designing scalable memory ordering hardware is one of the most important challenges for large-window, out-of-order processor design, due to its complexity, power, and its criticality for high performance. Recent research has aimed to partition the functionality of load/store queues (LSQ) into three components: ordering violations detection, value forwarding, and store buffering for commit, to avoid frequent access to large, associative, energy-inefficient structures. This approach adds microarchitectural complexity but has been shown to be effective. In this paper, we describe a family of energy-efficient distributed load/store queue designs that avoid the need for partitioning the LSQ functionality, but which achieve comparable energy efficiency with performance comparable to an ideal LSQ. These designs interleave LSQ banks based on cache-line addresses, and deal with the resultant overflow challenges by prioritizing older instructions and occasionally rejecting younger instructions using flow-control techniques. The experimental results show that on average, there is no performance degradation for address-interleaved LSQs that are undersized up to three-quarters of a 256 memory instruction window for both SPEC and EEMBC benchmarks. In addition, each LSQ access consumes only as much energy, on average, as a 8-entry, fully associative traditional LSQ.
Navratil, Paul Arthur and William R. Mark. "An Analysis of Ray Tracing Bandwidth Consumption." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-40. November 21, 2006. 4 pages.
The trend in chip-multi-processors for the next several years is for on-chip FLOPS to grow much faster than bandwidth to off-chip DRAM. This trend is likely to emerge as a performance bottleneck for future real-time ray tracing systems. In this paper, we assess the impact of this bottleneck by measuring the DRAM bandwidth requirements for several different ray tracing algorithms, each running on simulated architectures with a variety of cache sizes. We conclude that for current packet-tracing algorithms, bandwidth will not be a bottleneck for primary rays, but that it will be a bottleneck for soft shadow rays. This bottleneck is caused primarily by dramatically lower cache hit rates, rather than by an increase in total working set, which suggests that substantial reductions in memory bandwidth requirements would be possible by designing algorithms that do a better job of scheduling ray traversals in a coherent fashion for divergent secondary rays.
Huh, Jaehyuk and Doug Burger. "Scalable Subspace Snooping." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-41. August 2006. 18 pages.
Snooping tag bandwidth is one of the resources that limits the number of processors that can participate in a cache-coherent snooping system. In this paper, we evaluate a type of coherence protocol called subspace snooping, which decouples the snoop tag bandwidth from the address bus bandwidth. In subspace snooping, each processor snoops a set of logical channels, which are a subset of the total snoopable address busses in the system. Thus, each processor snoops a subset of the address space, reducing the number of tag matches required for a system of a given size. By dynamically assigning both processors and cache lines to channels, we support dynamic formation of subspaces, with the goal of having only sets of processors that share data snooping on each given channel. Subspace snooping aligns best with systems for which the address bus bandwidth greatly exceeds the snooping tag bandwidth. Snooping optical interconnects exhibit such characteristics, providing enormous transmission bandwidth, but which quickly become limited by snooping tag energy and bandwidth as the number of processors increases. Optical busses can be subdivided into logical channels using either wave-division or time-division multiplexing, making them good candidates for a subspace snooping implementation. We evaluate a range of subspace snooping protocols on six parallel scientific benchmarks, running on an execution-driven simulator. We show an average 50% reduction in total snoops for 64-processor systems with 8-32 channels, and that for regular applications, subspace snooping shows large performance improvements the ratio of processor bandwidth to snoop bandwidth grows large.
Quintana-Ort, Enrique S. and Robert A. van de Geijn. "Updating an LU factorization with Pivoting." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-42. September 7, 2006. 13 pages.
We show how to compute an LU factorization of a matrix when the factors of a leading principle submatrix are already known. The approach incorporates pivoting akin to partial pivoting, a strategy we call incremental pivoting. An implementation using the Formal Linear Algebra Methods Environment (FLAME) Application Programming Interface (API) is described. Experimental results demonstrate practical numerical stability and high performance on an Intel Itanium2 processor based server.
Roeder, Thomas, Kamen Yotov, Keshav Pingali, Fred Gustavson, and John Gunnels. "The Price of Cache Obliviousness." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-43. September 25, 2006.
Chan, Ernie, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. "Collective Communication: Theory, Practice, and Experience." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-44. September 26, 2006. 32 pages.
We discuss the design and high-performance implementation of collective communications operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Xeon/Pentium 4 (R) processor cluster are included.
Lopez-Herrejon, Roberto Erick. "Understanding Feature Modularity." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-45. September 28, 2006. 145 pages.
Features are increments in program functionality. Feature abstraction, the process of abstracting programs into their constituent features, is a relatively common yet informal practice in software design. It is common because it simplifies program understanding. It is also important for software product lines whose essence is the systematic and efficient creation of software products from a shared set of assets or features, where each product exhibits common functionality with other products but also has unique functionalities.Thus, it seems natural to modularize feature abstractions and use such modules as building blocks of programs and product lines. Unfortunately, conventional modularization approaches such as methods, classes and packages are not geared for supporting feature modules. They present two recurrent problems. First, a typical feature implementation is spread over several conventional modules. Second, features are usually more than source code artifacts as they can modularize many different program representations (makefiles, documentation, performance models).
An undesirable consequence is that developers must lower their abstractions from features to those provided by the underlying implementation languages, a process that is far from simple let alone amenable to significant automation. The conceptual gap created between feature abstractions and their modularization hinders program understanding and product line development. The root of the problem is the fact that feature modularity is not well understood and thus not well supported in conventional programming languages, modularization mechanisms, and design techniques.
In this dissertation, we explore language and modularity support for features founded on an algebraic model geared for program synthesis. Our model integrates ideas from collaboration-based designs, mixin layers, aspect oriented programming, multi-dimensional separation of concerns, and generative programming. We assess our model with an implementation of a non-trivial product line case study, and evaluate feature support in emerging modularization technologies.
Bientinesi, Paolo. "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-46. September 29, 2006. 149 pages.
We consider the problem of developing formally correct dense linear algebra libraries. The problem would be solved convincingly if, starting from the mathematical specification of a target operation, it were possible to generate, implement and analyze a family of correct algorithms that compute the operation. This thesis presents evidence that for a class of dense linear operations, systematic and mechanical development of algorithms is within reach. It describes and demonstrates an approach for deriving and implementing, systematically and even mechanically, proven correct algorithms. It also introduces a systematic procedure to analyze, in a modular fashion, numerical properties of the generated algorithms.
Tirmizi, Syed Hamid and Daniel Miranker. "OBO2OWL: Roundtrip between OBO and OWL." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-47. October 2, 2006. 16 pages.
Ontologies have traditionally been used in biomedicine for representing relationships among biological concepts, and as a result, large knowledge-bases like Gene Ontology (GO) have emerged. We believe use of Semantic Web technologies can allow better querying and collaboration of biomedical ontologies. As a migration path for biomedical ontologies, we have developed a mechanism for lossless roundtrip transformations between Open Biomedical Ontologies (OBO) format and OWL. We have methodically examined each of the constructs of OBO and mapped them to constructs in the Semantic Web stack. We have also enumerated constructs in each system that do not have simple syntactic equivalent in the other, important ones being GUIDs, various kinds of synonyms and subsets. We have implemented a tool that uses our transformation rules to translate OBO ontologies into OWL, and back, without loss of knowledge.
Lee, Dong-Young and Simon Lam. "Protocol Design for Dynamic Delaunay Triangulation." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-48. December 1, 2006. 31 pages.
Delaunay triangulation (DT) is a useful geometric structure for applications such as routing, clustering, broadcast, distributed virtual reality systems, and multiplayer on-line games. In this paper we investigate the design of join, leave, and maintenance protocols for a set of nodes to construct and maintain a distributed DT dynamically. (Conceptually nodes are points in a Euclidean space.) We define a distributed DT and present a necessary and sufficient condition for a distributed DT to be correct.This condition is used as a guide for protocol design. We present join and leave protocols as well as correctness proofs for serial joins and leaves. In addition, to handle concurrent joins and leaves as well as node failures, we present a maintenance protocol. An accuracy metric is defined for a distributed DT. Experimental results show that our join, leave and maintenance protocols are scalable, and they achieve high accuracy for systems under churn and with node failures.
To support applications of distributed DT, we present protocols for greedy routing, clustering, broadcast, and multicast within a radius. Each node in our greedy routing, broadcast and multicast protocols does not maintain any per-session state. We also discuss and prove correctness for the application protocols.
De Paula, Judah. "Converting RGB Images to LMS Cone Activations." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-49. October 25, 2006. 7 pages.
This article describes a process for converting color RGB (Red/Green/Blue) images to LMS (Long/Medium/Short) photoreceptor activations. In an RGB image each triple represents phosphor luminosity in a CRT (Cathode Ray Tube) screen. In an LMS triple each value simulates a retina photoreceptor activation upon viewing the RGB triple displayed on a CRT screen. The conversion process requires the phosphor emission functions for a CRT monitor and the sensitivity functions for the Long, Medium, and Short cone photoreceptors. To calculate the final photoreceptor activity, transform the RGB pixels with the CRT emission function, then multiply the result with the photoreceptor sensitivity vector.
Bafna, Sapna, Daniel Miranker, and Julian Humphries. "Schema Driven Assignment and Implementation of Life Science Identifiers (LSIDs)." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-50. October 19, 2006. 14 pages.
Life sciences identifier (LSIDs) is a formal global unique identifier standard intended to help rationalize the unique archival requirements of biological data. We describe an LSID implementation architecture such that data managed by a relational database management system may be integrated with the LSID protocol as an add-on layer. The approach requires a database administrator (DBA) to specify an export schema detailing the structure of the archived data, and a mapping of the existing database to that schema. This specification is accomplished through the equivalent of SQL view definitions. In effect, we define a domain specific language for implementing LSIDs. We describe the mapping of the view definition to an implementation as a set of database triggers and a fixed runtime library. This suggests a compiler for the domain specific language could be written that would reduce the implementation of LSIDs to the task of writing SQL view definitions.
Bafna, Sapna and Daniel Miranker. "LSID Specification Language for Relational Databases based on SQL Server 2000." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-51. October 19, 2006. 13 pages.
This document describes the specification of a domain specific language to introduce support for LSIDs in an existing archival database. The language constructs can be used to specify subsets of the existing relational schema that need to be exported via LSIDs.
Batory, Don. "Feature Interactions in Feature-Based Program Synthesis." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-52. October 18, 2006. 16 pages.
A feature is an increment in product functionality. Features can encapsulate different program representations, programs can be synthesized by feature composition, and programs can be declaratively specified using features. We explain in this paper how feature-based program synthesis works, how features interact, how interactions are controlled, and explore the relationship between features and aspects.
Batory, Don. "From Implementation to Theory in Product Synthesis." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-53. October 24, 2006.
Future software development will rely on product synthesis, i.e., the synthesis of code and non-code artifacts for a target component or application. Prior work on feature-based product synthesis can be understood and generalized using elementary ideas from category theory. Doing so reveals (a) practical and previously unrecognized properties that product synthesis tools must satisfy, and (b) non-obvious generalizations of current techniques that will guide future research efforts in automated product development.
Kim, Dongmin, Suvrit Sra, and Inderjit S. Dhillon. "A New Projected Quasi-Newton Approach for Solving Nonnegative Least Squares Problem." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-54. May 9, 2007. 16 pages.
Constrained least squares estimation lies at the heart of many applications in a wide array of fields such as statistics, psychometrics, and signal processing, among others. The simplest constraints that usually arise are those of non-negativity, and the associated least-squares problem is called Nonnegative least squares (NNLS). This report presents a new, efficient, and scalable Quasi-Newton-type method for solving the NNLS problem. Experiments demonstrate that our algorithm significantly outperforms well-known methods for solving NNLS, especially when the problem size becomes large. Other important benefits of our algorithm are ease of implementation and the ability to exploit sparsity in the input.
Lee, Byeongcheol, Kevin Resnick, Michael D. Bond, and Kathryn S. McKinley. "Correcting the Dynamic Call Graph Using Control-Flow Constraints." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-55. October 27, 2006. 22 pages.
To reason about whole-program behavior, dynamic optimizers and analysis tools collect a dynamic call graph using sampling. Previous approaches have not achieved high accuracy with low runtime overhead, and this problem is likely to become more challenging as object-oriented programmers increasingly compose complex programs.This paper demonstrates how to use static and dynamic control-flow graph (CFG) constraints to improve the accuracy of the dynamic call graph (DCG). We introduce the frequency dominator (FDOM) which is a novel CFG relation that extends the dominator relation to expose relative execution frequencies of basic blocks. We combine conservation of flow and dynamic CFG basic block profiles to further improve the accuracy of the DCG. Together these approaches add minimal overhead (1%) and achieve 85% accuracy compared to a perfect call graph for SPEC JVM98 and DaCapo benchmarks. Compared to sampling alone, accuracy improves by 12 to 36%. These results demonstrate that static and dynamic control-flow information offer accurate information for efficiently improving the DCG.
Siddavanahalli, Vinay and Chandrajit Bajaj. "An Adaptive Grid Based Method for Computing Molecular Surfaces and Properties." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-56. November 6, 2006. 19 pages.
We present an adaptive grid based algorithm to compute a family of relevant molecular surfaces. Molecular interfaces are important in simulations and visualization involving biomolecules. The Richards surface has traditionally been used as a good approximation to the surface, and defined as the surface formed by the inner facing part of a solvent probe atom rolling along the van der Waals surface of the molecule. Computing and representing this surface has traditionally involved complex geometrical data structures like alpha shapes. Adaptive and uniform trilinear grids are commonly used in various simulations involving interactions of molecules or computation of electrostatics and other energy terms. We make use of this grid directly to compute the Molecular Surface and properties like area, volume, curvatures, surface atoms and other surfaces. We compare geometrical and biochemical properties with other methods as a validation.
Siddavanahalli, Vinay and Chandrajit Bajaj. "F2Dock: A Fast and Fourier Based Error-Bounded Approach to Protein-Protein Docking." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-57. November 6, 2006. 23 pages.
The functions of proteins is often realized through their mutual interactions. Determining a relative transformation for a pair of proteins and their conformations which form a stable complex, reproducible in nature, is known as docking. It is an important step in drug design, structure determination and understanding function and structure relationships. We provide a model for rigid docking and errorbounded approximation algorithms to solve the model and predict docking sites. Translational search is sped up using the Fourier domain. Shape based interactions is shown to give good results for a large range of pairs of proteins.
Choi, Young-ri and Mohamed G. Gouda. "Stabilization Properties of Flood Protocols in Sensor Networks." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-58. November 9, 2006.
Greve, David A., Matt Kaufmann, Panagiotis Manolios, J Strother Moore, Sandip Ray, Jose Luis Ruiz-Reina, Rob Sumners, Daron Vroon, and Matthew Wilding. "Efficient Execution in an Automated Reasoning Environment." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-59. November 20, 2006. 53 pages.
We describe a method to permit the user of a mathematical logic to write elegant logical definitions while allowing sound and efficient execution. We focus on the ACL2 logic and automated reasoning environment. ACL2 is used by industrial researchers to describe microprocessor designs and other complicated digital systems. Properties of the designs can be formally established with the theorem prover. But because ACL2 is also a functional programming language, the formal models can be executed as simulation engines. We implement features that afford these dual applications, namely formal proof and execution on industrial test suites. In particular, the features allow the user to install, in a logically sound way, alternative executable counterparts for logically-defined functions. These alternatives are often much more efficient than the logically equivalent terms they replace. We discuss several applications of these features.
Lee, Gene Moo, Taehwan Choi, and Yin Zhang. "Designing an Incentive-Based Framework for Overlay Routing." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-60. January 23, 2007. 20 pages.
Overlay routing becomes popular as an incremental mechanism to improve the Internet routing. So far, overlay nodes are always assumed to be cooperated to each other. In this paper, we analyze the overlay routing in a new viewpoint, in which the overlay nodes act independently to maximize their own payoff. We use game theoretic approach to analyze the transit traffic forwarding, and realize that overlay nodes are not likely to cooperate each other in our new scenario.In order to stimulate the independent overlay nodes to cooperate each other, we design and propose an incentive-based framework. We introduce three possible systems and evaluate them analytically. Among the candidates, we use simulation to verify the feasibility of our proposed framework generalized punish-and-reward system. The performance gets closer to social optimum as we increase the number of punishments. In addition, the system shows tolerance against impatient players.
Lee, Gene Moo, Taehwan Choi, and Yin Zhang. "Improving the Interaction between Overlay Routing and Traffic Engineering." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-61. January 23, 2007. 23 pages.
Overlay routing has been successful as an incremental method to improve the current Internet routing by allowing users to select their logical routing. In the meantime, traffic engineering (TE) techniques are being used to reduce the network cost by adapting the physical routing in response to varying traffic patterns. An overlay is interested in the optimal routes for its own users, whereas TE is to optimize the whole network performance. Previous studies have shown that the conflict of objectives can cause huge network cost increases and oscillations to the network. In this paper, we improve the interaction between overlay routing and TE by modifying the objectives of both parties. For the overlay part, we propose TE-awareness which limits the selfishness by some bounds so that the action of overlay does not offensively affect TE.s optimization process. For the TE part, we suggest COPE as a strong candidate, which achieves close-to-optimal performance for predicted traffic matrices and handles unpredictabl e overlay traffic efficiently. With extensive simulation results, we show the proposed methods can significantly improve the interaction with lower network cost and smaller oscillation problems.
Bush, Kevin B., Mark Gebhart, Doug Burger, and Steve Keckler. "A Characterization of High Performance DSP Kernels on the TRIPS Architecture." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-62. November 28, 2006. 20 pages.
Diminishing performance gains in conventional architectures are fueling novel designs which more effectively extract parallelism and have the potential to change the nature of architectural bottlenecks. Consequently, workload characterization is of a growing importance in the design of modern high performance computing architectures. However, the accurate performance evaluation necessary for workload characterization can be prohibitively constrained by immature compilers. In this paper, we present a workload characterization for High Performance Digital Signal Processing (HP-DSP) applications on the TRIPS architecture. Included is a bottleneck analysis of this novel next-generation architecture and a discussion of our evaluation methodology. Using a combination of hand and machine optimization techniques we successfully characterize the workload of the TRIPS architecture on HP-DSP applications under the constraint of a developing compiler. This detailed performance characterization illustrates the potential of HP-DSP applications to successfully map to highly concurrent hardware and discusses bottlenecks unique to the TRIPS architecture.
Clement, Allen, Jeffrey Napper, Harry Li, Jean-Philippe Martin, Lorenzo Alvisi, and Michael Dahlin. "Theory of BAR Games Architecture." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-63. November 29, 2006. 18 pages.
Distributed systems that span multiple administrative domains require protocols that tolerate both Byzantine and selfish nodes. This paper offers a theory that can be used to analyze such protocols. The theory systematically extends traditional game theory solution concepts through an {\em ex ante} analysis that incorporates a rational player's awareness of the possible presence of Byzantine players in the player's utility function. We illustrate our approach by modeling synchronous Terminating Reliable Broadcast as a game. We show that Dolev and Strong's Byzantine TRB protocol with message authentication is not a Nash equilibrium and that rational deviations from it may lead to violation of the TRB safety properties. We present a new TRB protocol with the same asymptotic complexity of Dolev-Strong and prove it to be a Nash equilibrium. Finally, we prove that {\em (k-t)} robustness, a recently proposed solution concept for games with Byzantine and rational players, cannot yield an equilibrium in games, such as our TRB game, that model systems where any node may crash and communication is necessary and incurs cost.
Mericli, Tekin. "Color and Illumination Independent Hand Tracking and Gesture Recognition." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-64. December 4, 2006. 7 pages.
Recognition of human motion provides hints to understand human activities and gives opportunities to the development of a new human-computer interaction (HCI) interface. Hidden Markov Models (HMMs) are used for visual recognition of complex, structured hand gestures such as the ones found in a sign language, since they have proved their success in recognizing speech and handwriting. In this paper, we introduce a hand gesture recognition system to recognize gestures in real-time. Hand tracking is performed in two different ways. The first method is based on color segmentation and blob generation over the hand region, and the second method uses block matching and particle filtering algorithms to detect the moving hand which makes the system totally color and illumination independent. In both methods, extracted information is used as the input to the HMM based gesture recognizer.
Mericli, Tekin and Peng Bi. "JFun: Functional Programming in Java." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-65. December 4, 2006. 19 pages.
A function is a good way of specifying a computation since in each computation the result depends in a certain way on the parameters, and using functions makes a program modular and well-structured. In order to reduce the development effort and future programming costs caused by bugs and maintenance problems, writing well-structured and modular programs has become crucial as software becomes more and more complex. Since modularity is the key to successful programming, functional languages are vitally important to the real world. In this paper we try to show that writing programs in an object-oriented language by using functional programming concepts is possible. As examples, list and tree manipulation functions, numerical integration and differentiation, and alpha-beta heuristic which is an algorithm from Artificial Intelligence used in game-playing programs are implemented in Java programming language using functional programming concepts.
Martin, Jean-Philippe. "Byzantine Fault-Tolerance and Beyond." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-66. December 6, 2006. 291 pages.
Byzantine fault-tolerance techniques are useful because they tolerate arbitrary faults regardless of cause: bugs, hardware glitches, even hackers. These techniques have recently gained popularity after it was shown that they could be made practical.Most of the dissertation builds on Byzantine fault-tolerance (BFT) and extends it with new results for Byzantine fault-tolerance for both quorum systems and state machine replication. Our contributions include proving new lower bounds, finding new protocols that meet these bounds, and providing new functionality at lower cost through a new architecture for state machine replication.
The second part of the dissertation goes beyond Byzantine fault-tolerance. We show that BFT techniques are not sufficient for networks that span multiple administrative domains, propose the new BAR model to describe these environments, and show how to build BAR-Tolerant protocols through our example of a BAR-Tolerant terminating reliable broadcast protocol.
Chan, Ernie, Enrique S. Quintana-Orti, Gregorio Quintana-Orti, and Robert van de Geijn. "SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures." The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-06-67. December 18, 2006. 12 pages.
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, as those incorporated in LAPACK, cannot be easily modified to render high performance as well as scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data, and operations on these blocks become the fundamental units of computation, resulting in algorithms-by-blocks as opposed to the more traditional blocked algorithms. We show that this facilitates the adoption of techniques akin to dynamic scheduling and out-of-order execution usual in superscalar processors, which we name {\em SuperMatrix Out-of-Order scheduling}. Performance results on a 16 CPU Itanium2-based server are used to highlight opportunities and issues related to this new approach.
Questions to trcenter@cs.utexas.edu