Data Mining Seminar Abstracts

Spring 2001

Mailing List


Dharmendra S. Modha
IBM Almaden Research Center
Wed, Apr 5 @ 4pm

Clustering Hypertext with Applications to Web Searching

Clustering separates unrelated documents and groups related documents, and is useful for discrimination, disambiguation, summarization, organization, and navigation of unstructured collections of hypertext documents. We propose a novel clustering algorithm that clusters hypertext documents using words (contained in the document), out-links (from the document), and in-links (to the document). The algorithm automatically determines the relative importance of words, out-links, and in-links for a given collection of hypertext documents. We annotate each cluster using six information nuggets: summary, breakthrough, review, keywords, citation, and reference. These nuggets constitute high-quality information resources that are representatives of the content of the clusters, and are extremely effective in compactly summarizing and navigating the collection of hypertext documents. We employ web searching as an application to illustrate our results. Anecdotally, when applied to the documents returned by AltaVista in responses to the query abduction, our algorithm separates documents about "alien abduction" from those about "child abduction."

This is joint work with W. Scott Spangler and will appear in the ACM Hypertext 2000 Conference.


Haym Hirsh
Computer Science Department
Rutgers University
Thurs, Aug 3 @ 3pm

Integrating Multiple Sources of Information in Text Classification

One popular class of supervised learning problems concerns the task of classifying items that are comprised primarily of text. Such problems can occur when assessing the relevance of a Web page to a given user, assigning topic labels to technical papers, and assessing the importance of a user's email. Most commonly this task is performed by extrapolating from a corpus of labeled textual training documents procedures for classifying further unlabeled documents. This talk presents an approach for corpus-based text classification based on WHIRL, a database system that augments traditional relational database technology with textual-similarity operations developed in the information retrieval community. Not only does the approach perform competitively on traditional text classification tasks, we show that it enables the incorporation of a range of sources of information into the classification process in a fairly robust and general fashion.


Jude Shavlik
Departments of Computer Sciences and Biostatistics & Medical Informatics
University of Wisconsin - Madison
Fri, Aug 4 @ 3pm

Using Diverse Evidence Sources to Uncover Coordinately Controlled Genes via Machine Learning and Dynamic Programming

Now that the complete genomes of numerous organisms have been determined, a key problem in computational molecular biology is uncovering the relationships that exist among the genes in each organism and the regulatory mechanisms that control their operation. We are developing computational methods for discovering such regulatory mechanisms and relationships. In our first project, we have developed a computational approach for predicting operons in the genomes of prokaryotic organisms. Operons are contiguous genes on a chromosome that are "turned on" or "shut off" as a unit. Our approach uses machine-learning methods to induce predictive models for this task from a rich variety of data types including DNA sequence data, gene-expression laboratory data, and handcrafted functional annotations associated with genes. We use multiple learned models that individually predict promoters, terminators, and operons themselves. A key part of our approach is a dynamic-programming method that uses our individual operon predictions to map every known and putative gene in a given genome into its most probable operon. We evaluate our approach using data from the E. coli K-12 genome.

This is joint work with Joseph Bockhorst, Mark Craven, Jeremy Glasner, and David Page. No prior knowledge about molecular biology will be assumed of the audience.

Peter Haas
IBM Almaden Research Center
Mon. Sept. 25 @ 1 pm

Techniques for Online Exploration of Large Object-Relational Databases

We review some recent techniques for exploring large object-relational databases in an interactive online manner. The idea is to provide continuously updated running estimates of the final query results to the user, along with an indication of the precision of the estimates. The user can then halt the query as soon as the answer is sufficiently precise---the time required to obtain an acceptable approximate answer can be faster by orders of magnitude than the time needed to completely process the query. We focus on methods for online processing of complex aggregation queries and also discuss methods for online visualization and online display of a large set of result records. We briefly relate our work to more traditional OLAP (Online Analytical Processing) techniques and to static sampling in databases.


Jaideep Srivastava
Department of Computer Science and Engineering
University of Minnesota

Fri. Sept. 29 @ 11am

Web Mining: A Challenging Frontier for Data Mining

The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. The Web is revolutionizing the way business interact with each other (B2B) and with each customer (B2C). It has introduced entirely new ways of doing commerce, including e.g. auctions and reverse auctions. It also made it imperative for organizations and companies to optimize their electronic business. Knowledge about the Web user is fundamental for the establishment of viable e-business solutions. Web mining is the application of data mining techniques to acquire this knowledge for e-business. Typical concerns in e-business include improved cross-sells, up-sells, personalized ads, targeted assortments, improved conversion rates, and measurements of the effectiveness of actions.

This talk will


Raghu Ramakrishnan
Professor and Vilas Associate
Department of Computer Sciences
, UW-Madison

Fri, Oct 13 @ 11 am

Exploratory Analysis of Large Datasets

Exploratory data analysis has a long and rich history, especially in the AI and Statistics communities. Recently, it has gained a lot of attention under the name "Data Mining", with the new twist being an increased emphasis on analyzing large datasets. In this talk, I will discuss applications that motivate data mining and consider some new challenges that arise as a consequence of considering large datasets.


Inderjit Dhillon
Department of Computer Sciences
University of Texas - Austin

To be announced



Ed George
Department of MSIS
University of Texas - Austin
February 2001

Bayesian Treed Modeling

When simple parametric models such as linear regression fail to adequately approximate a relationship across an entire set of data, an alternative may be to consider a partition of the data, and then use a separate simple model within each subset of the partition. Such an alternative is provided by a treed model which uses a binary tree to identify such a partition. However, treed models go further than conventional trees (eg CART, C4.5) by fitting models rather than a simple mean or proportion within each subset. In this talk, I will discuss a Bayesian approach for finding and fitting parametric treed models, in particular focusing on Bayesian treed regression. The potential of this approach is illustrated by a cross-validation comparison of predictive performance with neural nets, MARS, and conventional trees on simulated and real data sets. This is joint work with Hugh Chipman, University of Waterloo and Rob McCulloch, University of Chicago. A paper is available at

Oren Etzioni
Department of Computer Science and Engineering
University of Washington
Jan/Feb 2001

Adaptive Web Sites

Today's web sites are intricate but not intelligent; While web navigation is dynamic and idiosyncratic, all too often web sites are fossils cast in HTML. This talk describes our research on Adaptive Web Sites: sites that automatically improve their organization and presentation by learning from visitor access patterns. Adaptive web sites mine the data buried in web server logs to produce more easily navigable web sites.

To demonstrate the feasibility of adaptive web sites, I will describe the problem of index page synthesis and sketch a solution that relies on novel clustering and conceptual clustering techniques. Our preliminary experiments show that high-quality candidate index pages can be generated automatically, and that our techniques outperform existing methods (including the Apriori algorithm, K-means clustering, hierarchical agglomerative clustering, and COBWEB) in this domain.

Ray Mooney
Department of Computer Sciences
University of Texas - Austin
Fri, Oct 27 @ ?pm

Text Mining with Information Extraction

Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents. An IE system is therefore capable of transforming a corpus of unstructured texts or web pages into a structured database. Our previous work has focused on using machine learning methods to automatically construct IE systems from training sets of manually annotated documents. Our current research focuses on form of text mining that extracts a database from a document corpus using a learned IE system and then mines this database for interesting patterns using traditional rule-induction methods. The mined rules can in turn be used to improve the accuracy of information extraction. We will present encouraging results on applying this approach to a corpus of computer job postings from the Usenet newsgroup

Nick Street
Department of Management Sciences
The University of Iowa

Feature Selection in Unsupervised Learning via Multi-Objective Evolutionary Search

Feature subset selection is an important knowledge discovery task, because of the insight gained from determining relevant modeling variables and for the improved understandability, scalability, and perhaps accuracy of the resulting models. We explore the feature selection problem in unsupervised learning (clustering) where, unlike supervised learning, there is no single adequate measure of the quality of clusters built using a given feature subset. We frame the problem as a multi-objective optimization task by selecting a number of heuristic criteria to estimate model quality. The combinatorial search for good solutions is performed using ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions approximating the multi-dimensional Pareto front. We evaluate our approach with both synthetic and real-world data sets of moderate to high dimension. Our results show promise in finding Pareto-optimal solutions through which we can identify significant feature subsets and an appropriate number of clusters.

Jerome H. Friedman
Stanford University

Predictive Data Mining with Multiple Additive Regression Trees

Predicting future outcomes based on past observational data is a common application in data mining. The primary goal is usually predictive accuracy, with secondary goals being speed, ease of use, and interpretability of the resulting predictive model. New automated procedures for predictive data mining, based on "boosting" (CART) regression trees, are described. The goal is a class of fast "off-the-shelf" procedures for regression and classification that are competitive in accuracy with more customized approaches, while being fairly automatic to use (little tuning), and highly robust especially when applied to less than clean data. Tools are presented for interpreting and visualizing these multiple additive regression tree (MART) models.

Prabhakar Raghavan
Chief Scientist, Verity Inc.

Networks and Sub-Networks in the World-Wide Web

We discuss several structural properties of graphs arising from the world-wide web including the graph of hyperlinks, and the graph induced by connections between distributed search servents. We review a number of algorithmic investigations of the structure of these networks, and conclude by proposing stochastic models for the evolution of these networks.

Bio: Prabhakar Raghavan is Chief Scientist at Verity, and is also a consulting professor at Stanford University. Prior to Verity, he worked at the IBM Almaden Research Center, where his group worked on algorithms, web mining and search, complexity theory, security and cryptography, optimization and reasoning about knowledge. Prabhakar got his undergraduate degree in electrical engineering from IIT, Madras, and PhD in computer science from UC Berkeley.

Joydeep Ghosh
Electrical and Computer Engineering
University of Texas, Austin

Graphical Techniques for Efficient and Balanced Clustering of Very High-Dimensional Data

Segmentation or clustering of large data sets with thousands of dimensions is very challenging because of sparsity problems arising from the ``curse of dimensionality''. Recent graphical approaches provide novel and efficient solutions by working in similarity space rather than in the original vector space in which the data points were specified.
I shall describe one such approach based on constrained, weighted graph-partitioning, that has yielded remarkable results on real-life market basket analysis and on clustering web pages. Issues on which similarity space should be used, and how to evaluate and visualize the clusters will be addressed in the process.

Tom Mitchell
Carnegie Mellon University
and Whizbang! Labs

Machine Learning and Extracting Information from the Web

Today's search engines can retrieve and display over a billion web pages, but their use is limited by the fact that they don't analyze the content of these pages in any depth.
This talk will describe research that has resulted in systems that answer these kinds of questions by extracting detailed factual information automatically from millions of web pages. Our approach relies heavily on machine learning algorithms to train the system to find and extract targeted information. For example, in one case we trained our system to find and extract job postings from the web, resulting in the world's largest database of job openings (over 600,000 jobs, see This talk will describe machine learning algorithms for classifying and extracting information from web pages, including results of recent research on using unlabeled data and other kinds of information to improve learning accuracy.

Gautam Das
Microsoft Research

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

The popularity of data warehousing applications makes the problem of approximately answering aggregation queries important. Recently, researchers in this area have presented several sampling based schemes for approximate processing of aggregation queries. While most of these studies have acknowledged the importance of exploiting workload information in solving the problem, they still fall short in several important ways. First, many of them use ad-hoc schemes for picking samples of the data, sometimes resulting in unacceptable quality of answers even for the given workload. Second, most previous approaches ignore variance in the data distribution in determining the sample, which can also lead to poor quality. We present a comprehensive and practical solution that addresses all these problems. We use basic ideas from stratified sampling to formulate the problem as an optimization problem whose goal is to minimize the error for the given workload. We show how our solution can be implemented efficiently in a database system, and present results of extensive experiments on a commercial database management system that show the superior quality of our method compared to relevant previous work.

David Scott
Rice University

Nonparametric Clustering for Data Mining

The use of density estimation to find clusters in data is supplementing ad hoc hierarchical methodology. Examples include finding high-density regions, finding modes in a kernel density estimator, and the mode tree. Alternatively, a mixture model may be fit and the mixture components associated with individual clusters. Fitting a high-dimensional mixture model with many components is difficult to estimate in practice. Here, we survey mode and level set methods for finding clusters. We describe a new algorithm that estimates a subset of a mixture model. In particular, we demonstrate how to fit one component at a time and how the fits may be organized to reveal the complete clustering model.

Daniel Boley
University of Minnesota

A Scalable Hierarchical Algorithm for Unsupervised Clustering

Top-down hierarchical clustering can be done in a scalable way. Here we describe a scalable unsupervised clustering algorithm designed for large datasets from a variety of applications. The method constructs a tree of nested clusters top-down, where each cluster in the tree is split according to the leading principal direction. We use a fast principal direction solver to achieve a fast overall method. The algorithm can be applied to any dataset whose entries can be embedded in a high dimensional Euclidean space, and takes full advantage of any sparsity present in the data. We show the performance of the method on text document data, in terms of both scalability and quality of clusters. We demonstrate the versatility of the method in different domains by showing results from text documents, human cancer gene expression data, and astrophysical data. For that last domain, we use an out of core variant of the underlying method which is capable of efficiently clustering large datasets using only a relatively small memory partition.

Daniel Boley received his A.B. degree Summa Cum Laude in Mathematics and with Distinction in All Subjects from Cornell University in 1974, and his M.S. and Ph.D. degrees in Computer Science from Stanford University in 1976 and 1981, respectively. Since 1981, he has been on the faculty of the Department of Computer Science and Engineering at the University of Minnesota. He has had extended visiting positions at the Los Alamos Scientific Laboratory, the IBM Research Center in Zurich (Switzerland), the Australian National University in Canberra, Stanford University, and the University of Salerno (Italy). Dr. Boley is known for his past work on numerical linear algebra methods for control problems, parallel algorithms, iterative methods for matrix eigenproblems, error correction for floating point computations, inverse problems in linear algebra, as well as his more recent work on numerical methods in robot motion planning and unsupervised document categorization in data mining. He has been an editor for the SIAM Journal of Matrix Analysis and has chaired several technical symposia.

Last modified by
Updated 29 March, 2001 7:44 PM