Research Overview:

My dominant focus is now on heterogeneous databases. The umbrella project is Alamo: the Net as a Virtual Data Warehouse. The effort starts from a set of basic assumptions that differ from those of nearly all related projects. Each assumption leads to fundamental differences in system organization and in how the scientific contribution is qualified.

The architecture is all-encompassing. To maintain scientific integrity, not to mention for more practical reasons, less aggressive implementations of the architecture have been built and used to develop applications. These include the OBDIF project, an integration of order-of-battle data for use in military simulations, and the World-Wide Herbarium projects, performed by teams of students as term projects in my graduate database classes. An extension of the Alamo caching and prefetch mechanism is under development for inclusion in the MCC Infosleuth project.

Assumptions:

  1. We assume no cooperation from the component data sources other than that the data has been made available.
  2. Although the bandwidth of the internet may increase, connections to individual servers will continue to be unreliable and the wait time for an individual response to an internet query will continue to be unpredictable.
  3. Component data sources can be enumerated a priori; that is, crawling and discovery of interesting sites have already been accomplished.

 

Empowering the User:

A goal of the project is to develop tools that empower a user, or at least the user's organization, to federate data sources without the help of the data providers. The first assumption recognizes that most organizations are resource constrained and that ownership of information assets is usually jealously guarded. Thus, the user of the data is much more motivated to attend to the details of data integration than is a data provider. The case has been made that the provider of the data is uniquely placed to make authoritative statements concerning the structure and content of the data source (i.e., to define the meta-data). Nevertheless, a user of the data certainly has domain knowledge about the data of interest. An element of our approach is to exploit data mining techniques to augment the user's own knowledge of the data and to automate the discovery of the meta-level details needed for integration.

First Solution Processing:

The Alamo architecture directly addresses a model in which data is processed in streams. At each step in the evaluation of a query, the priority is to reduce the latency to the first few solutions. The goal is to serve the user a first screenful of results without waiting for any one data source, and certainly without waiting for the complete transmittal of data from a data source.
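To make the first-few-solutions priority concrete, the following is a minimal sketch in Python, not the Alamo implementation. It assumes each data source can be modeled as an asynchronous stream of rows; the names `source` and `first_solutions` and the artificial delays are hypothetical and exist only for illustration. Rows are taken in whatever order they arrive, and the remaining transfers are abandoned once the first k rows are in hand.

```python
import asyncio

_DONE = object()  # sentinel: one source has finished or been cancelled


async def source(name, rows, delay):
    """Hypothetical data source: delivers one row every `delay` seconds."""
    for row in rows:
        await asyncio.sleep(delay)  # stands in for unpredictable network latency
        yield (name, row)


async def first_solutions(sources, k):
    """Return the first k rows to arrive from any source, in arrival order."""
    queue = asyncio.Queue()

    async def pump(src):
        # Forward rows from one source into the shared queue as they arrive.
        try:
            async for item in src:
                queue.put_nowait(item)
        finally:
            queue.put_nowait(_DONE)

    tasks = [asyncio.create_task(pump(s)) for s in sources]
    results, live = [], len(tasks)
    try:
        # Stop as soon as k rows are available or every source is exhausted.
        while live and len(results) < k:
            item = await queue.get()
            if item is _DONE:
                live -= 1
            else:
                results.append(item)
    finally:
        # Do not wait for slow sources or for complete transmittal.
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return results


if __name__ == "__main__":
    fast = source("fast", [1, 2, 3, 4], 0.001)
    slow = source("slow", [10, 20], 0.2)
    print(asyncio.run(first_solutions([fast, slow], 3)))
```

In this sketch a slow or unresponsive source never delays the first screenful; it simply contributes nothing to it. A real system would also have to decide what to do with the cancelled transfers, for example by caching partial results, which this example does not attempt.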

The impact of optimizing for the first few solutions could not be larger. We have already determined that database systems do not even collect meaningful statistics about the data on which to base a cost function [BaMi95]. We are building a basis for join algorithms that are able to derive the benefits of semijoin reductions on the fly [SaMi96, Mietal96]. The parallelizing techniques for the incremental evaluation of rule-based programs effectively solve the synchronization problems that appear with the concurrent arrival of data from multiple internet sources [Kuetal9?].
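As a generic illustration of producing join results incrementally, the sketch below shows a symmetric (pipelined) hash join in Python. This is a well-known textbook technique, swapped in here for illustration; it is not claimed to be the algorithm of [SaMi96, Mietal96], and it does not perform semijoin reduction. The function name, the "L"/"R" side tags, and the key-extractor parameters are assumptions made for the example.

```python
from collections import defaultdict


def symmetric_hash_join(arrivals, key_left, key_right):
    """Join two streams incrementally.

    `arrivals` is an iterable of (side, row) pairs in arrival order, where
    side is "L" or "R". Each (left_row, right_row) match is yielded as soon
    as both of its rows have arrived, so neither input must be complete.
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    keys = {"L": key_left, "R": key_right}
    for side, row in arrivals:
        other = "R" if side == "L" else "L"
        k = keys[side](row)
        # Remember this row so that later arrivals on the other side can match it.
        tables[side][k].append(row)
        # Probe the rows already received from the other side.
        for match in tables[other].get(k, ()):
            yield (row, match) if side == "L" else (match, row)


if __name__ == "__main__":
    arrivals = [
        ("L", (1, "a")),
        ("R", (2, "x")),
        ("R", (1, "y")),
        ("L", (2, "b")),
    ]
    first_field = lambda r: r[0]
    for pair in symmetric_hash_join(arrivals, first_field, first_field):
        print(pair)
```

Because every arriving row is both stored and probed immediately, the first matches appear as soon as any two joinable rows have been received, regardless of which source delivered them first. The cost is that both hash tables grow with the input, a trade-off that a production design would have to manage.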