Data Mining Seminar Abstracts

Mailing List


Soumen Chakrabarti
Indian Institute of Technology, Bombay
Fri, Aug 20, 2004

Is question answering an acquired skill?

We present a question answering (QA) system which learns how to detect and rank answer passages by analyzing questions and their answers (QA pairs) provided as training data. Our key technique is to recover, from the question, fragments of what might have been posed as a structured query, had a suitable schema been available. One fragment comprises _selectors_: tokens that are likely to appear (almost) unchanged in an answer passage. The other fragment contains question tokens which give clues about the _answer_type_, and are expected to be _replaced_ in the answer passage by tokens which _specialize_ or _instantiate_ the desired answer type. Selectors are like constants in where-clauses in relational queries, and answer types are like column names. We propose a simple conditional exponential model over a pair of feature vectors, one derived from the question and the other derived from a candidate passage. Features are extracted using a lexical network and surface context as in named entity extraction, except that there is no direct supervision available in the form of fixed entity types and their examples. We do not need any manually-designed question type system or supervised question classification. Using the exponential model, we filter candidate passages and see substantial improvement in the mean rank at which the first answer passage is found. With TREC QA data, our system achieves a mean reciprocal rank (MRR) that compares favorably with the best scores in recent years, and generalizes from one corpus to another.

This is joint work with Ganesh Ramakrishnan, Deepa Paranjpe, and Pushpak Bhattacharyya.

Bio: Dr. Soumen Chakrabarti received his B.Tech in Computer Science from the Indian Institute of Technology, Kharagpur, in 1991 and his M.S. and Ph.D. in Computer Science from the University of California, Berkeley, in 1992 and 1996, respectively. At Berkeley he worked on compilers and runtime systems for running scalable parallel scientific software on message passing multiprocessors. He was a Research Staff Member at IBM Almaden Research Center from 1996 to 1999, where he worked on the Web search project and led the Focused Crawling project. In 1999 he joined the Department of Computer Science and Engineering at the Indian Institute of Technology, Bombay, as an Assistant Professor, and has been an Associate Professor there since 2003. In Spring 2004 he is a Visiting Associate Professor at Carnegie Mellon. He has published in the WWW, SIGIR, SIGKDD, SIGMOD, VLDB, ICDE, SODA, STOC, SPAA and other conferences, as well as in Scientific American, IEEE Computer, VLDB and other journals. He holds eight US patents on Web-related inventions. He has served as vice-chair or program committee member for WWW, SIGIR, SIGKDD, VLDB, ICDE, SODA and other conferences, and as guest editor or editorial board member for the DMKD and TKDE journals. He is also the author of Mining the Web: Discovering Knowledge from Hypertext Data. His current research interests include question answering, Web analysis, monitoring and search, mining irregular and relational data, and textual data integration.


Arindam Banerjee
University of Texas at Austin
Fri, Apr 16, 2004

Bregman Clustering and Co-clustering

A wide variety of distortion measures such as squared Euclidean distance, relative entropy etc., have been used in the literature for parametric clustering. In recent years, several such distortion measures have also been used for co-clustering, i.e., simultaneous clustering of the rows and columns of two-dimensional data matrices. In this talk, we discuss a unification of a large class of clustering and co-clustering techniques using Bregman divergences.
The talk has two parts. In the first part, we propose and analyze parametric hard clustering with Bregman divergences. The proposed class of algorithms unifies centroid-based parametric clustering approaches, such as classical kmeans and information-theoretic clustering, which arise from special choices of the Bregman divergence. Further, we show an explicit bijection between Bregman divergences and exponential families that leads to a simple soft clustering algorithm for all Bregman divergences. In the second part, we discuss a general framework for co-clustering wherein loss functions corresponding to all Bregman divergences can be used and various conditional expectation based constraints can be considered, thereby giving rise to different parametric co-clustering models on a wide range of data matrices. The minimum Bregman information solution, a direct generalization of maximum entropy and least squares, plays a critical role in the analysis and leads to an elegant meta-algorithm guaranteed to achieve local optimality. Empirical evidence is provided to establish the generality and efficacy of the proposed framework.
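
The unification is easy to see in code. Below is a minimal sketch of Bregman hard clustering (function names and the deterministic farthest-first seeding are our own choices, not the authors'): the same loop performs classical kmeans or information-theoretic clustering depending only on which divergence is plugged in, because for every Bregman divergence the optimal cluster representative is the arithmetic mean.

```python
import numpy as np

def sqeuclid(p, q):
    # squared Euclidean distance: the Bregman divergence generated by
    # ||x||^2, for which the loop below is exactly classical kmeans
    return float(np.sum((p - q) ** 2))

def kl(p, q):
    # relative entropy: the Bregman divergence generated by negative
    # entropy (assumes strictly positive distributions), for which the
    # same loop gives information-theoretic clustering
    return float(np.sum(p * np.log(p / q)))

def bregman_hard_cluster(X, k, divergence, n_iter=50):
    # deterministic farthest-first seeding, just for the sketch
    centers = [X[0].astype(float)]
    for _ in range(1, k):
        gaps = [min(divergence(x, c) for c in centers) for x in X]
        centers.append(X[int(np.argmax(gaps))].astype(float))
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign each point to the nearest representative
        labels = np.array([int(np.argmin([divergence(x, c) for c in centers]))
                           for x in X])
        # for *every* Bregman divergence the optimal representative of a
        # cluster is its arithmetic mean -- the fact that unifies the family
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```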

This is joint work with Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra Modha.


Inderjit S. Dhillon
University of Texas at Austin
Fri, Mar 26, 2004

Information-Theoretic Clustering, Co-clustering and Matrix Approximations

Given a two-dimensional co-occurrence matrix, we consider the problem of co-clustering or simultaneous clustering of the rows and columns. Viewing the (scaled) co-occurrence matrix as an empirical joint probability distribution of two discrete random variables, we can pose the co-clustering problem as an optimization problem in information theory --- the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. It can be shown that this objective is identical to that of finding a "low-complexity" non-negative matrix approximation that is closest in relative entropy to the given matrix under clustering constraints. Based on this formulation, we present a co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. We apply the algorithm to some problems in text mining: distributional clustering of words applied to text classification, and simultaneous clustering of words and documents. Experimental results indicate that distributional word clustering leads to improved text classification when the training data is small in size, and co-clustering can overcome problems due to high-dimensionality and sparsity. Generalizations to varied distortion measures, such as Bregman divergences, will be discussed.
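
The objective can be made concrete in a few lines (helper names are ours, not the authors' code): aggregating a joint distribution over row and column clusters can only lose mutual information, and the optimal co-clustering is the partition that loses the least.

```python
import numpy as np

def mutual_information(P):
    # I(X;Y) for a joint distribution given as a nonnegative matrix summing to 1
    px = P.sum(axis=1, keepdims=True)
    py = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (px @ py)[mask])))

def coclustered_joint(P, row_labels, col_labels, k, l):
    # aggregate the joint distribution over k row clusters and l column clusters
    Q = np.zeros((k, l))
    for i in range(P.shape[0]):
        for j in range(P.shape[1]):
            Q[row_labels[i], col_labels[j]] += P[i, j]
    return Q
```

By the data processing inequality, mutual_information(Q) never exceeds mutual_information(P); the co-clustering search maximizes the former subject to the cluster-count constraints.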

This is joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu and D. Modha.


Daniel Boley
University of Minnesota
Fri, Mar 12, 2004

Training Support Vector Machines via Adaptive Clustering

We discuss a way to speed up the training process for support vector machines by means of clustering the training data and using representative points for the training process. An iterative procedure is proposed to refine the resulting support vector machine. This procedure eventually converges to the true support vector machine that would be obtained using all the original training data, but often the earlier approximate support vector machines will suffice. The approach is modular by allowing the user to plug in arbitrary methods for both the clustering and the support vector machine training. Using off-the-shelf methods for both tasks, we illustrate the approach with an artificial dataset and a "real" dataset.
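
A rough sketch of the modular outer loop, with stand-ins for both plug-in components: per-class chunk means play the role of the clustering step, and a margin perceptron stands in for the off-the-shelf SVM trainer (both are illustrative substitutes, not the methods used in the talk).

```python
import numpy as np

def train_linear(X, y, epochs=200, lr=0.1):
    # stand-in for the plug-in SVM trainer: a margin perceptron that
    # returns a separating hyperplane (w, b); labels are +/-1
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 1.0:
                w, b = w + lr * yi * xi, b + lr * yi
    return w, b

def adaptive_train(X, y, k=2, rounds=4):
    # stand-in for the plug-in clustering: per-class chunk means serve
    # as the representative points for the initial, cheap training run
    reps_X, reps_y = [], []
    for label in (-1, 1):
        pts = X[y == label]
        for chunk in np.array_split(np.argsort(pts[:, 0]), k):
            reps_X.append(pts[chunk].mean(axis=0))
            reps_y.append(label)
    reps_X, reps_y = np.array(reps_X), np.array(reps_y)
    # iterative refinement: train on representatives, then pull in the
    # original points closest to the current boundary and retrain
    for _ in range(rounds):
        w, b = train_linear(reps_X, reps_y)
        margins = y * (X @ w + b)
        near = np.argsort(margins)[:4]        # lowest-margin original points
        reps_X = np.vstack([reps_X, X[near]])
        reps_y = np.concatenate([reps_y, y[near]])
    return w, b
```

Only the points near the boundary (the support-vector candidates) ever get added back, which is why training on representatives can suffice early on.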

Bio: Dr. Daniel Boley received his A.B. degree Summa Cum Laude in Mathematics and with Distinction in All Subjects from Cornell University in 1974, and his M.S. and Ph.D. degrees in Computer Science from Stanford University in 1976 and 1981, respectively. Since 1981, he has been on the faculty of the Department of Computer Science and Engineering at the University of Minnesota. He has had extended visiting positions at the Los Alamos Scientific Laboratory, the IBM Research Center in Zurich (Switzerland), the Australian National University in Canberra, Stanford University, and the University of Salerno (Italy). Dr. Boley is known for his past work on numerical linear algebra methods for control problems, parallel algorithms, iterative methods for matrix eigenproblems, error correction for floating point computations, inverse problems in linear algebra, as well as his more recent work on numerical methods in robot motion planning and unsupervised document categorization in data mining. He has been an editor for the SIAM Journal of Matrix Analysis and has chaired several technical symposia.


Dharmendra S. Modha
IBM Almaden Research Center
Wed, Apr 5, 2000

Clustering Hypertext with Applications to Web Searching

Clustering separates unrelated documents and groups related documents, and is useful for discrimination, disambiguation, summarization, organization, and navigation of unstructured collections of hypertext documents. We propose a novel clustering algorithm that clusters hypertext documents using words (contained in the document), out-links (from the document), and in-links (to the document). The algorithm automatically determines the relative importance of words, out-links, and in-links for a given collection of hypertext documents. We annotate each cluster using six information nuggets: summary, breakthrough, review, keywords, citation, and reference. These nuggets constitute high-quality information resources that are representative of the content of the clusters, and are extremely effective in compactly summarizing and navigating the collection of hypertext documents. We employ web searching as an application to illustrate our results. Anecdotally, when applied to the documents returned by AltaVista in response to the query "abduction", our algorithm separates documents about "alien abduction" from those about "child abduction."

This is joint work with W. Scott Spangler and will appear at the ACM Hypertext 2000 conference.


Haym Hirsh
Computer Science Department
Rutgers University
Thurs, Aug 3, 2000

Integrating Multiple Sources of Information in Text Classification

One popular class of supervised learning problems concerns the task of classifying items that consist primarily of text. Such problems occur when assessing the relevance of a Web page to a given user, assigning topic labels to technical papers, and assessing the importance of a user's email. Most commonly this task is performed by learning, from a corpus of labeled training documents, procedures for classifying further unlabeled documents. This talk presents an approach for corpus-based text classification based on WHIRL, a database system that augments traditional relational database technology with textual-similarity operations developed in the information retrieval community. Not only does the approach perform competitively on traditional text classification tasks; we also show that it enables the incorporation of a range of sources of information into the classification process in a fairly robust and general fashion.


Jude Shavlik
Departments of Computer Sciences and Biostatistics & Medical Informatics
University of Wisconsin - Madison
Fri, Aug 4, 2000

Using Diverse Evidence Sources to Uncover Coordinately Controlled Genes via Machine Learning and Dynamic Programming

Now that the complete genomes of numerous organisms have been determined, a key problem in computational molecular biology is uncovering the relationships that exist among the genes in each organism and the regulatory mechanisms that control their operation. We are developing computational methods for discovering such regulatory mechanisms and relationships. In our first project, we have developed a computational approach for predicting operons in the genomes of prokaryotic organisms. Operons are contiguous genes on a chromosome that are "turned on" or "shut off" as a unit. Our approach uses machine-learning methods to induce predictive models for this task from a rich variety of data types including DNA sequence data, gene-expression laboratory data, and handcrafted functional annotations associated with genes. We use multiple learned models that individually predict promoters, terminators, and operons themselves. A key part of our approach is a dynamic-programming method that uses our individual operon predictions to map every known and putative gene in a given genome into its most probable operon. We evaluate our approach using data from the E. coli K-12 genome.

This is joint work with Joseph Bockhorst, Mark Craven, Jeremy Glasner, and David Page. No prior knowledge of molecular biology will be assumed of the audience.


Peter Haas
IBM Almaden Research Center
Mon, Sept 25, 2000

Techniques for Online Exploration of Large Object-Relational Databases

We review some recent techniques for exploring large object-relational databases in an interactive online manner. The idea is to provide continuously updated running estimates of the final query results to the user, along with an indication of the precision of the estimates. The user can then halt the query as soon as the answer is sufficiently precise---the time required to obtain an acceptable approximate answer can be faster by orders of magnitude than the time needed to completely process the query. We focus on methods for online processing of complex aggregation queries and also discuss methods for online visualization and online display of a large set of result records. We briefly relate our work to more traditional OLAP (Online Analytical Processing) techniques and to static sampling in databases.
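
For a single AVG query, the core idea can be sketched in a few lines (illustrative only; real systems must also randomize the scan order so the rows seen so far form a random sample): maintain a running mean and variance over the rows processed and halt once a CLT-based confidence interval is tight enough.

```python
import math

def online_avg(stream, target_halfwidth, z=1.96):
    # Welford's running mean/variance over the rows seen so far
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # CLT-based confidence interval; halt as soon as it is tight enough
        if n >= 30:
            halfwidth = z * math.sqrt(m2 / (n - 1) / n)
            if halfwidth <= target_halfwidth:
                return mean, halfwidth, n
    halfwidth = z * math.sqrt(m2 / (n - 1) / n) if n > 1 else float("inf")
    return mean, halfwidth, n
```

Because the half-width shrinks like 1/sqrt(n), an acceptable approximate answer often arrives after scanning a small fraction of the table.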


Jaideep Srivastava
Department of Computer Science and Engineering
University of Minnesota
Fri, Sept 29, 2000

Web Mining: A Challenging Frontier for Data Mining

The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. The Web is revolutionizing the way businesses interact with each other (B2B) and with customers (B2C). It has introduced entirely new ways of doing commerce, including auctions and reverse auctions, and it has made it imperative for organizations and companies to optimize their electronic business. Knowledge about the Web user is fundamental for the establishment of viable e-business solutions. Web mining is the application of data mining techniques to acquire this knowledge for e-business. Typical concerns in e-business include improved cross-sells, up-sells, personalized ads, targeted assortments, improved conversion rates, and measurements of the effectiveness of actions.

This talk will


Raghu Ramakrishnan
Professor and Vilas Associate
Department of Computer Sciences, UW-Madison
Fri, Oct 13, 2000

Exploratory Analysis of Large Datasets

Exploratory data analysis has a long and rich history, especially in the AI and Statistics communities. Recently, it has gained a lot of attention under the name "Data Mining", with the new twist being an increased emphasis on analyzing large datasets. In this talk, I will discuss applications that motivate data mining and consider some new challenges that arise as a consequence of considering large datasets.


Ray Mooney
Department of Computer Sciences
University of Texas - Austin
Fri, Oct 27, 2000

Text Mining with Information Extraction

Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents. An IE system is therefore capable of transforming a corpus of unstructured texts or web pages into a structured database. Our previous work has focused on using machine learning methods to automatically construct IE systems from training sets of manually annotated documents. Our current research focuses on a form of text mining that extracts a database from a document corpus using a learned IE system and then mines this database for interesting patterns using traditional rule-induction methods. The mined rules can in turn be used to improve the accuracy of information extraction. We will present encouraging results on applying this approach to a corpus of computer job postings from the Usenet newsgroup


Nick Street
Department of Management Sciences
The University of Iowa
Thurs, Nov 9, 2000

Feature Selection in Unsupervised Learning via Multi-Objective Evolutionary Search

Feature subset selection is an important knowledge discovery task, because of the insight gained from determining relevant modeling variables and for the improved understandability, scalability, and perhaps accuracy of the resulting models. We explore the feature selection problem in unsupervised learning (clustering) where, unlike supervised learning, there is no single adequate measure of the quality of clusters built using a given feature subset. We frame the problem as a multi-objective optimization task by selecting a number of heuristic criteria to estimate model quality. The combinatorial search for good solutions is performed using ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions approximating the multi-dimensional Pareto front. We evaluate our approach with both synthetic and real-world data sets of moderate to high dimension. Our results show promise in finding Pareto-optimal solutions through which we can identify significant feature subsets and an appropriate number of clusters.
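
The Pareto front that ELSA approximates can be made concrete with a small dominance filter (a hypothetical helper, assuming all objectives are to be maximized): a candidate survives unless some other candidate is at least as good on every criterion and strictly better on one.

```python
def pareto_front(solutions):
    # keep the non-dominated solutions; each solution is a tuple of
    # objective values, higher being better on every objective
    front = []
    for s in solutions:
        dominated = any(
            o != s and all(o[i] >= s[i] for i in range(len(s)))
            for o in solutions
        )
        if not dominated:
            front.append(s)
    return front
```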


Jerome H. Friedman
Stanford University
Fri, Nov 10, 2000

Predictive Data Mining with Multiple Additive Regression Trees

Predicting future outcomes based on past observational data is a common application in data mining. The primary goal is usually predictive accuracy, with secondary goals being speed, ease of use, and interpretability of the resulting predictive model. New automated procedures for predictive data mining, based on "boosting" (CART) regression trees, are described. The goal is a class of fast "off-the-shelf" procedures for regression and classification that are competitive in accuracy with more customized approaches, while being fairly automatic to use (little tuning), and highly robust especially when applied to less than clean data. Tools are presented for interpreting and visualizing these multiple additive regression tree (MART) models.
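
A toy version of the stagewise idea, using depth-one trees (stumps) on one-dimensional data under squared-error loss; MART itself uses deeper trees, other loss functions, and considerably more machinery, so treat this only as an illustration of "each new tree fits the current residuals, shrunk by a learning rate."

```python
import numpy as np

def fit_stump(x, r):
    # best depth-one regression tree (a single split) for residuals r on 1-D x
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        pl, pr = left.mean(), right.mean()
        sse = ((left - pl) ** 2).sum() + ((right - pr) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, pl, pr)
    _, s, pl, pr = best
    return lambda q, s=s, pl=pl, pr=pr: np.where(q <= s, pl, pr)

def boost(x, y, n_trees=100, lr=0.1):
    # stagewise additive modeling: each tree fits the residuals (the
    # negative gradient of squared error), shrunk by a small learning rate
    f0 = y.mean()
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        tree = fit_stump(x, y - pred)
        pred = pred + lr * tree(x)
        trees.append(tree)
    return lambda q: f0 + lr * sum(t(q) for t in trees)
```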


Prabhakar Raghavan
Chief Scientist, Verity Inc.
Fri, Nov 17, 2000

Networks and Sub-Networks in the World-Wide Web

We discuss several structural properties of graphs arising from the world-wide web including the graph of hyperlinks, and the graph induced by connections between distributed search servents. We review a number of algorithmic investigations of the structure of these networks, and conclude by proposing stochastic models for the evolution of these networks.

Bio: Dr. Prabhakar Raghavan is Chief Scientist at Verity, and is also a consulting professor at Stanford University. Prior to Verity, he worked at the IBM Almaden Research Center, where his group worked on algorithms, web mining and search, complexity theory, security and cryptography, optimization and reasoning about knowledge. Prabhakar received his undergraduate degree in electrical engineering from IIT Madras, and his PhD in computer science from UC Berkeley.


Tom Mitchell
Carnegie Mellon University
and Whizbang! Labs
Fri, Feb 09, 2001

Machine Learning and Extracting Information from the Web

Today's search engines can retrieve and display over a billion web pages, but their use is limited by the fact that they don't analyze the content of these pages in any depth.
This talk will describe research that has resulted in systems that answer content-based questions by extracting detailed factual information automatically from millions of web pages. Our approach relies heavily on machine learning algorithms to train the system to find and extract targeted information. For example, in one case we trained our system to find and extract job postings from the web, resulting in the world's largest database of job openings (over 600,000 jobs). The talk will describe machine learning algorithms for classifying and extracting information from web pages, including results of recent research on using unlabeled data and other kinds of information to improve learning accuracy.


Joydeep Ghosh
Electrical and Computer Engineering
University of Texas, Austin
Fri, Mar 23, 2001

Graphical Techniques for Efficient and Balanced Clustering of Very High-Dimensional Data

Segmentation or clustering of large data sets with thousands of dimensions is very challenging because of sparsity problems arising from the "curse of dimensionality". Recent graphical approaches provide novel and efficient solutions by working in similarity space rather than in the original vector space in which the data points were specified.
I shall describe one such approach, based on constrained, weighted graph partitioning, that has yielded remarkable results on real-life market basket analysis and on clustering web pages. Questions of which similarity space should be used, and how to evaluate and visualize the clusters, will be addressed along the way.


Gautam Das
Microsoft Research
Fri, Mar 30, 2001

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

The popularity of data warehousing applications makes the problem of approximately answering aggregation queries important. Recently, researchers in this area have presented several sampling based schemes for approximate processing of aggregation queries. While most of these studies have acknowledged the importance of exploiting workload information in solving the problem, they still fall short in several important ways. First, many of them use ad-hoc schemes for picking samples of the data, sometimes resulting in unacceptable quality of answers even for the given workload. Second, most previous approaches ignore variance in the data distribution in determining the sample, which can also lead to poor quality. We present a comprehensive and practical solution that addresses all these problems. We use basic ideas from stratified sampling to formulate the problem as an optimization problem whose goal is to minimize the error for the given workload. We show how our solution can be implemented efficiently in a database system, and present results of extensive experiments on a commercial database management system that show the superior quality of our method compared to relevant previous work.
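
The basic mechanics behind stratified designs can be sketched as follows (an illustration of the principle, not the paper's workload-driven optimization): scaling each stratum's sample sum by the inverse sampling rate keeps the SUM estimate unbiased, which leaves the rates free to be concentrated on high-variance strata.

```python
import random

def stratified_sum_estimate(strata, rates):
    # strata: list of per-stratum value lists; rates: per-stratum sampling
    # fractions.  Scaling each sample sum by the inverse rate keeps the
    # estimate unbiased, so a workload-aware design can spend its budget
    # on the strata with the highest variance.
    total = 0.0
    for values, rate in zip(strata, rates):
        n = max(1, int(rate * len(values)))
        sample = random.sample(values, n)
        total += sum(sample) * (len(values) / n)
    return total
```

A stratum of identical values contributes its exact total from a tiny sample, so nearly all of the estimation error comes from the strata that actually vary.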


Ed George
Department of MSIS
University of Texas - Austin
Fri, Apr 13, 2001

Bayesian Treed Modeling

When simple parametric models such as linear regression fail to adequately approximate a relationship across an entire set of data, an alternative may be to consider a partition of the data, and then use a separate simple model within each subset of the partition. Such an alternative is provided by a treed model, which uses a binary tree to identify such a partition. However, treed models go further than conventional trees (e.g., CART, C4.5) by fitting models rather than a simple mean or proportion within each subset. In this talk, I will discuss a Bayesian approach for finding and fitting parametric treed models, in particular focusing on Bayesian treed regression. The potential of this approach is illustrated by a cross-validation comparison of predictive performance with neural nets, MARS, and conventional trees on simulated and real data sets. This is joint work with Hugh Chipman (University of Waterloo) and Rob McCulloch (University of Chicago). A paper is available at


David Scott
Rice University
Fri, Apr 20, 2001

Nonparametric Clustering for Data Mining

The use of density estimation to find clusters in data is supplementing ad hoc hierarchical methodology. Examples include finding high-density regions, finding modes in a kernel density estimator, and the mode tree. Alternatively, a mixture model may be fit and the mixture components associated with individual clusters. However, a high-dimensional mixture model with many components is difficult to estimate in practice. Here, we survey mode and level set methods for finding clusters. We describe a new algorithm that estimates a subset of a mixture model. In particular, we demonstrate how to fit one component at a time and how the fits may be organized to reveal the complete clustering model.
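
Finding a mode of a kernel density estimator, as mentioned above, can be done with a simple fixed-point iteration (a mean-shift-style hill climb; a sketch under a Gaussian kernel, not the talk's algorithm): repeatedly move the query point to the kernel-weighted mean of the data until it stops moving.

```python
import numpy as np

def mean_shift_mode(X, start, bandwidth=1.0, n_iter=200):
    # hill-climb to a mode of a Gaussian kernel density estimate by
    # repeatedly moving to the kernel-weighted mean of the data
    x = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < 1e-8:
            break
        x = x_new
    return x
```

Starting the climb from many points and grouping the starts that converge to the same mode yields one simple mode-based clustering.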


Daniel Boley
University of Minnesota
Fri, Apr 27, 2001

A Scalable Hierarchical Algorithm for Unsupervised Clustering

Top-down hierarchical clustering can be done in a scalable way. Here we describe a scalable unsupervised clustering algorithm designed for large datasets from a variety of applications. The method constructs a tree of nested clusters top-down, where each cluster in the tree is split according to the leading principal direction. We use a fast principal direction solver to achieve a fast overall method. The algorithm can be applied to any dataset whose entries can be embedded in a high dimensional Euclidean space, and takes full advantage of any sparsity present in the data. We show the performance of the method on text document data, in terms of both scalability and quality of clusters. We demonstrate the versatility of the method in different domains by showing results from text documents, human cancer gene expression data, and astrophysical data. For that last domain, we use an out of core variant of the underlying method which is capable of efficiently clustering large datasets using only a relatively small memory partition.
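
The splitting step can be sketched as follows, with plain power iteration standing in for the fast principal direction solver mentioned in the abstract (function names are ours); recursing on each side of the split builds the cluster tree top-down.

```python
import numpy as np

def leading_direction(X, n_iter=100):
    # power iteration on the centered data's covariance -- a stand-in
    # for the paper's fast principal direction solver
    Xc = X - X.mean(axis=0)
    v = np.random.default_rng(0).normal(size=X.shape[1])
    for _ in range(n_iter):
        v = Xc.T @ (Xc @ v)
        v /= np.linalg.norm(v)
    return v

def pddp_split(X):
    # split a cluster by the sign of each point's projection onto the
    # leading principal direction; recursing on both sides builds the tree
    v = leading_direction(X)
    return (X - X.mean(axis=0)) @ v > 0
```

Because only matrix-vector products with the (possibly sparse) data matrix are needed, the split never densifies the data, which is what makes the method scale.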


Soumen Chakrabarti
Indian Institute of Technology, Bombay

Beyond hubs and authorities: Enhanced topic distillation using text, markup tags, and hyperlinks

Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem is common for narrow topics, because most hubs contain admixtures of other topics. The problem also becomes severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. As a by-product, our algorithm can extract annotated sections of hubs which are best suited to surveying the given topic. Using over 50 topics, 28 from earlier topic distillation work, we have analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
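
For reference, the coarse-grained page-level model that this work refines is the classical hub/authority iteration, sketched below; the talk's algorithm replaces whole-page nodes with subtrees of the markup tag tree.

```python
import numpy as np

def hits(adj, n_iter=50):
    # classical page-level mutual reinforcement: a page's authority score
    # sums the hub scores of pages linking to it, and its hub score sums
    # the authority scores of pages it links to (adj: node -> out-links)
    n = len(adj)
    A = np.zeros((n, n))
    for i, outs in adj.items():
        for j in outs:
            A[i, j] = 1.0
    h = np.ones(n)
    for _ in range(n_iter):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return h, a
```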

This is joint work with Mukul Joshi and Vivek Tawde.


Joseph Sirosh
HNC Software Inc.
Fri, Sept 21, 2001

Unit Vector Representations of Unstructured Information

One of the frequently encountered challenges in data mining is the problem of representing unstructured information such as words in text, action codes, and similar non-numerical information. In this talk, I will introduce a technology developed at HNC called Context Vectors for representing and operating on unstructured data. Using context vectors, many operations on text data, such as search and indexing, categorization, clustering and summarization, can be performed using vector operations. I will also review some of the applications of this technology, such as email response, personalization, search, recommendation generation and segmentation.

Bio: Dr. Joseph Sirosh leads Advanced Technology Solutions (ATS), the R&D group of HNC Software Inc. (NASDAQ: HNCS). HNC is a leader in Customer Insight solutions incorporating decision management and customer analytics for better customer insight in the financial/banking, insurance/healthcare, telecommunications and e-commerce industries. Sirosh oversees and directs research projects on text/multimedia retrieval, wireless web information management, information extraction and personalization, predictive modeling, intelligent agents and genome sequence mining. During his tenure at HNC, he has won highly competitive externally funded projects in information retrieval and bioinformatics, helping HNC create breakthrough technology and establish core competencies in new fields. In his career, Sirosh has published more than 20 technical papers and one electronic book, and has been a lead investigator on various research grants from DARPA and other government agencies. Sirosh holds a PhD and a Master's in Computer Sciences from the University of Texas at Austin and a BTech in Computer Science & Engineering from the Indian Institute of Technology, Madras.


Sridhar Rajagopalan
IBM Almaden Research Center
Fri, Oct 5, 2001

Three applications of data mining principles

In this talk, I will describe three applications of data mining algorithms similar to Agrawal and Srikant's Apriori algorithm. As we go through the sequence, we will see increasing use of simple graph-theoretic methods, and explain why the even simpler base technology fails.

Bio: Dr. Sridhar Rajagopalan received his Ph.D. from the University of California, Berkeley, in 1994. He subsequently spent two years at Princeton University as a DIMACS postdoctoral fellow. He has since been a researcher at the IBM Almaden Research Center. He is interested in algorithms, databases, randomization in computing, and information retrieval on the Web.


Foster Provost
New York University
Fri, Oct 26, 2001

Reviving the Learning Curve: A critical study of machine learning performance

Off-the-shelf machine-learning algorithms now are used widely both for manual data analysis and as important components for building intelligent systems. We revive the "learning curve," which relates learning performance to training-set size, to examine two of the most popular families of off-the-shelf classifier induction algorithms: decision-tree inducers and logistic regression. This new analysis (conducted using 37 large data sets, comparing classification and ranking performance) shows some remarkable things. Most generally, one cannot safely ignore learning curves when comparing algorithms; on the same data sets, conclusions can be strikingly different depending on training-set size. This calls into question the conclusions of innumerable comparison studies that do not factor the size of the data set into the analysis. More specifically, looking carefully at the learning curves of these two popular induction algorithm families, we see clearly that each is preferable for a different type of data.


Oren Etzioni
Department of Computer Science and Engineering
University of Washington
Fri, Nov 2, 2001

Adaptive Web Sites

Today's web sites are intricate but not intelligent: while web navigation is dynamic and idiosyncratic, all too often web sites are fossils cast in HTML. This talk describes our research on Adaptive Web Sites: sites that automatically improve their organization and presentation by learning from visitor access patterns. Adaptive web sites mine the data buried in web server logs to produce more easily navigable web sites.

To demonstrate the feasibility of adaptive web sites, I will describe the problem of index page synthesis and sketch a solution that relies on novel clustering and conceptual clustering techniques. Our preliminary experiments show that high-quality candidate index pages can be generated automatically, and that our techniques outperform existing methods (including the Apriori algorithm, K-means clustering, hierarchical agglomerative clustering, and COBWEB) in this domain.
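A toy illustration of the underlying log-mining idea (the sessions, URLs and threshold here are invented, and this is far simpler than the clustering techniques the talk presents): pages that co-occur in many visitor sessions are natural candidates to group on a common index page.

```python
from itertools import combinations
from collections import Counter

# visitor sessions reconstructed from a (hypothetical) server log
sessions = [
    ["/jazz", "/blues", "/swing"],
    ["/jazz", "/blues"],
    ["/hiking", "/camping"],
    ["/jazz", "/swing"],
    ["/hiking", "/camping", "/gear"],
]

# count how often each pair of pages appears in the same session
pair_counts = Counter()
for s in sessions:
    pair_counts.update(combinations(sorted(s), 2))

# pairs seen in at least 2 sessions suggest a shared index page
clusters = [pair for pair, c in pair_counts.items() if c >= 2]
print(clusters)
```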


Andrew McCallum
WhizBang Labs & Carnegie Mellon University
Fri, Feb 22, 2002

Information Extraction with Finite State Models

Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human-readable text. Finite state machines are the dominant model for information extraction both in research and industry. In this talk I will give several examples of information extraction tasks at WhizBang Labs, and then describe two new finite state models designed to take special advantage of the multi-faceted nature of text on the web. Maximum Entropy Markov Models (MEMMs) are discriminative sequence models that allow each observation to be represented as a collection of arbitrary overlapping features (such as word identity, capitalization, part-of-speech and formatting, plus agglomerative features of the entire sequence and features from the past and future). Conditional Random Fields (CRFs) are a generalization of MEMMs that solve a fundamental limitation of MEMMs and all other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. I will introduce both models, skim over their parameter estimation algorithms, and present experimental results on real and synthetic data.

This is joint work with Fernando Pereira, John Lafferty and Dayne Freitag.
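As a concrete, highly simplified illustration of the finite-state view (not WhizBang's system): an MEMM chains per-position conditional probabilities P(state | previous state, observation), and decoding is ordinary Viterbi. The states, the single capitalization feature, and the probability tables below are invented; in a real MEMM each table row would be a trained maximum-entropy model over rich overlapping features.

```python
import math

STATES = ["O", "NAME"]
# P(next state | previous state, token-is-capitalized) -- fixed toy tables
COND = {
    ("O", True):     {"O": 0.3, "NAME": 0.7},
    ("O", False):    {"O": 0.9, "NAME": 0.1},
    ("NAME", True):  {"O": 0.4, "NAME": 0.6},
    ("NAME", False): {"O": 0.8, "NAME": 0.2},
}

def viterbi(tokens, start="O"):
    """Most likely state sequence under the per-position conditional model."""
    feats = [t[:1].isupper() for t in tokens]
    score = {s: math.log(COND[(start, feats[0])][s]) for s in STATES}
    back = []
    for f in feats[1:]:
        new, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: score[p] + math.log(COND[(p, f)][s]))
            new[s] = score[best] + math.log(COND[(best, f)][s])
            ptr[s] = best
        score, back = new, back + [ptr]
    # trace back from the best final state
    s = max(STATES, key=score.get)
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return list(reversed(path))

print(viterbi(["met", "Dayne", "Freitag", "today"]))
```

Because each position is normalized separately, states with few successors can dominate regardless of the observation; CRFs remove this label bias by normalizing globally over whole sequences.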

Bio:Dr. Andrew McCallum received his BA from Dartmouth College and his PhD in Computer Science from the University of Rochester, followed by a post-doctoral fellowship at Carnegie Mellon University. He is currently VP of Research and Development at WhizBang! Labs, a company that does machine-learning-based information extraction from the Web. His interests include machine learning and text mining, especially information extraction, document classification and techniques for learning from limited labeled training data.


Pedro Domingos
Department of Computer Science and Engineering
University of Washington
Fri, Oct 25, 2002

Learning From Networks of Examples

Most machine learning algorithms assume that examples are independent of each other, but many (or most) domains violate this assumption. For example, in real markets customers' buying decisions are influenced by their friends and acquaintances, but data mining for marketing ignores this (as does traditional economics). In this talk I will describe how we can learn models that account for dependencies among examples, and use them to make better decisions. For example, in the marketing domain we are able to pinpoint the most influential customers, "seed" the network by marketing to them, and unleash a wave of word of mouth. We mine these models from collaborative filtering systems and knowledge-sharing Web sites, and show that they are surprisingly robust to imperfect knowledge of the network. I will also survey other applications of learning from networks of examples that we are working on, including: combining link and content information in Google-style Web search; automatically translating between ontologies on the Semantic Web; and predicting the evolution of scientific communities.

This is joint work with Matt Richardson.
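The "seeding" intuition can be illustrated with a toy cascade simulation. The graph, the fixed propagation probability, and the spread model are all invented for illustration; they are not the paper's probabilistic model, which is mined from real data.

```python
import random

random.seed(1)

# customers as graph nodes; a purchase may propagate to each neighbor with probability p
GRAPH = {
    "hub": ["a", "b", "c", "d", "e"],
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"], "e": ["hub"],
    "x": ["y"], "y": ["x"],
}

def spread(seed, p=0.5, trials=2000):
    """Expected number of buyers when we market only to `seed`."""
    total = 0
    for _ in range(trials):
        bought, frontier = {seed}, [seed]
        while frontier:
            node = frontier.pop()
            for nbr in GRAPH[node]:
                if nbr not in bought and random.random() < p:
                    bought.add(nbr)
                    frontier.append(nbr)
        total += len(bought)
    return total / trials

# the well-connected customer is worth far more as a seed than a peripheral one
print("seed hub:", spread("hub"))
print("seed x:  ", spread("x"))
```

A customer's "network value" is exactly this gap between their direct value and the cascade they can trigger.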

Bio:Dr. Domingos received his B.A. (1988) and M.S. (1992) in EECS from IST, in Lisbon, and his M.S. (1994) and Ph.D. (1997) in Information and Computer Science from the University of California at Irvine. After spending two years as an assistant professor at IST, he joined the faculty of the University of Washington in 1999. He is the author or co-author of over 80 technical publications in machine learning, data mining, and other areas.


David Page, Jr.
Department of Biostatistics and Bioinformatics, Department of Computer Sciences
University of Wisconsin - Madison
Tue, Aug 19, 2003

Machine Learning in Cancer Genetics

In recent years several novel types of genetic data have become available, and a primary application area for these types of data is the study of cancer. Two of the most widely-publicized sources of such data are gene expression microarrays, or "gene chips," and single-nucleotide polymorphisms, or "SNPs." This talk will review these types of data and prior applications of machine learning to gene chip data. It will then present novel results using machine learning to analyze gene chip and SNP data for multiple myeloma, an incurable form of blood cancer. To our knowledge, the application to SNP data is the first such machine learning application. The talk will close with a brief discussion of other novel biological data types to which machine learning should also be applicable.

This is joint work with Michael Waddell (also of UW) and with John Shaughnessy, Fenghuang Zhan and Bart Barlogie of the Lambert Laboratory for Myeloma Genetics and the Myeloma Institute at the University of Arkansas for Medical Sciences.


Jiawei Han
Department of Computer Science
University of Illinois - Urbana-Champaign
Fri, Oct 24, 2003

Mining Dynamics of Data Streams in Multidimensional Space

Traditional data mining systems and methods assume that data resides on disk or in main memory, and that a mining process can scan the data sets multiple times to uncover interesting patterns. However, real-time production systems and other dynamic environments often generate tremendous (potentially infinite) volumes of stream data at high speed---the volume is too huge to store on disk, and even if it could be stored, it would be too expensive to fetch and scan multiple times. Moreover, many applications require real-time mining of unusual patterns in data streams, including finding unusual network or telecommunication traffic, real-time pattern mining in video surveillance, detecting suspicious on-line transactions or terrorist activities, and so on. Recently there has been substantial and growing research interest in developing methods for querying and mining stream data. In this talk, I will first present an overview of recent developments in stream data querying and mining, and outline the major open research problems in mining the dynamics of data streams in multi-dimensional space. In particular, I will address the following issues in detail: (1) multi-dimensional on-line analysis methods for discovering unusual patterns in stream data; (2) mining time-sensitive frequent patterns in stream data; (3) mining clusters and outliers in stream data; and (4) dynamic stream classification methods.
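One classic building block behind issues (1)-(2) is the Lossy Counting algorithm of Manku and Motwani, which finds approximately frequent items in a single pass over a stream using bounded memory. A minimal single-item version (the stream and thresholds are illustrative; extending it to general patterns is where the research challenges lie):

```python
import math

def lossy_count(stream, support, error):
    """One-pass approximate frequent items with memory bounded by O(1/error)."""
    counts, deltas = {}, {}
    width = math.ceil(1 / error)          # bucket width
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item], deltas[item] = 1, bucket - 1
        if n % width == 0:                # periodic pruning keeps memory bounded
            for it in list(counts):
                if counts[it] + deltas[it] <= bucket:
                    del counts[it], deltas[it]
    # report items guaranteed to exceed (support - error) fraction of the stream
    return {it for it, c in counts.items() if c >= (support - error) * n}

stream = ["a"] * 60 + ["b"] * 30 + list("cdefghij")
print(lossy_count(stream, support=0.2, error=0.05))
```

No item is ever undercounted by more than error * n, so every truly frequent item is reported; the data is scanned exactly once.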

Bio:Dr. Jiawei Han is a professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been conducting research in data mining, data warehousing, database systems, spatial and multimedia databases, deductive and object-oriented databases, Web databases, bio-medical databases, etc., with over 250 journal and conference publications. He has chaired or served on many program committees of international conferences and workshops, including the 2001 and 2002 SIAM Data Mining Conferences (PC co-chair), the 2004 and 2002 International Conferences on Data Engineering (PC vice-chair), ACM SIGKDD conferences (2001 best paper award chair, 2002 student award chair), and the 2002 and 2003 ACM SIGMOD conferences (2000 demo/exhibit program chair). He has also served or is serving on the editorial boards of Data Mining and Knowledge Discovery: An International Journal, IEEE Transactions on Knowledge and Data Engineering, and the Journal of Intelligent Information Systems. His textbook "Data Mining: Concepts and Techniques" (Morgan Kaufmann, 2001) has been widely adopted for data mining courses in universities. His research is currently supported by NSF, NSF/ITR, ONR, the University of Illinois, an IBM Faculty Award, and a gift from Microsoft Research.


Last updated 10 Sept. 2001, 6:35 PM