Data Mining Seminar Abstracts

Mailing List


Joseph Sirosh
Wed, March 19, 2008

How Analytics Is Revolutionizing E-Commerce

Machine learning and data mining applications are central to eCommerce. Recommendation systems, online risk management, behavioral ad targeting, targeted web and email marketing, online self-service systems, and search relevance ranking are just a few of the applications that make extensive use of data mining techniques to enable a unique customer experience on the web. This talk will explore the enormous wealth of opportunities for leveraging data to enhance the eCommerce experience. Using examples drawn from leading web sites, I will provide an overview of how some of these systems work and how they generate great value for customers. We'll then look at some emerging trends and what the future holds for data mining and machine learning on the web.

Bio: Joseph Sirosh is currently Vice President of Business Intelligence and Transaction Risk Management. Prior to that, he worked at Fair Isaac Corporation as VP of the Advanced Technology R&D group, exploring advanced analytic and data mining applications. At Fair Isaac, and at HNC Software before that, he led several significant R&D projects on security and fraud detection, predictive modeling, information retrieval, content management, intelligent agents and bioinformatics. He has made important contributions in the field of neural network algorithms and in understanding the fundamental principles by which information is organized and processed in the brain. Dr. Sirosh has published over 20 technical papers and one book, and has been a lead investigator on various research grants from DARPA and other government agencies. He holds a PhD and Masters in Computer Science from the University of Texas at Austin, and a B.Tech. in Computer Science & Engineering from the Indian Institute of Technology, Chennai.


Katya Scheinberg
IBM T. J. Watson Research Center
Thurs, April 05, 2007

Active set methods for large-scale support vector machines

Over the past decade, one of the most popular approaches to solving classification problems in machine learning has been the support vector machine (SVM). At the core of the SVM approach lies a convex quadratic programming (QP) problem. The approach has been shown to be very efficient on certain applications in terms of the resulting generalization error, and there have been numerous implementations targeting large-scale data sets. Most of these implementations, however, use inferior optimization methods to solve the convex QP, due to the inherent computational cost of higher-order methods. We will discuss how exploiting the special structure of SVM QPs leads to a competitive implementation of a classical active set method for convex QPs. We will also discuss the advantages of active set methods over the first-order methods used previously. One advantage is that the method readily adapts to a parametric mode which computes the entire regularization path for the SVM; this mode is similar to the method described in the paper by Hastie, Rosset, Tibshirani and Zhu. We will compare this method to a dual active set method which computes just one solution on the regularization path. In theory, this is as hard as computing the entire regularization path, but in practice it is not. We will describe the challenges of the parametric active set method, present a computational comparison of the two methods on large-scale classification problems, and discuss a possible approach to reducing computational time by computing an approximate regularization path.
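
To make the QP at the core of SVM training concrete, here is a toy sketch (illustrative only, not the implementation discussed in the talk). The dual problem maximizes sum(alpha) - (1/2) alpha^T Q alpha subject to 0 <= alpha_i <= C and y^T alpha = 0, where Q_ij = y_i y_j k(x_i, x_j). With one positive and one negative example, the equality constraint collapses the QP to a single variable that can be solved in closed form; all function names below are hypothetical:

```python
# Toy illustration of the convex QP at the core of SVM training:
#   maximize  sum(alpha) - 0.5 * alpha^T Q alpha
#   subject to 0 <= alpha_i <= C  and  sum(y_i * alpha_i) = 0,
# where Q[i][j] = y[i] * y[j] * dot(x[i], x[j]).
# With one +1 and one -1 example, the equality constraint forces
# alpha_1 == alpha_2 == a, so the QP collapses to one variable.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def two_point_svm(x1, x2, C=1.0):
    """Solve the SVM dual exactly for one +1 example (x1) and one -1 example (x2)."""
    # 1-D objective: f(a) = 2a - 0.5 * q * a^2, where q = Q11 + 2*Q12 + Q22
    # and, with y1 = +1 and y2 = -1, q equals ||x1 - x2||^2.
    q = dot(x1, x1) - 2 * dot(x1, x2) + dot(x2, x2)
    a = 2.0 / q if q > 0 else C          # unconstrained maximizer: f'(a) = 0
    a = max(0.0, min(C, a))              # project onto the box [0, C]
    # Recover the primal weight vector w = sum_i alpha_i y_i x_i = a * (x1 - x2).
    w = [a * (x1i - x2i) for x1i, x2i in zip(x1, x2)]
    return a, w

alpha, w = two_point_svm([1.0, 0.0], [-1.0, 0.0], C=1.0)
```

Real active set methods handle thousands of variables by tracking which box constraints are tight; the closed-form step above is only meant to show what the QP is optimizing.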

Bio: Dr. Katya Scheinberg was born in Moscow, Russia. After four years of undergraduate study in applied mathematics at Moscow State University, she joined the Ph.D. program in the Department of Industrial Engineering and Operations Research at Columbia University, receiving her Ph.D. in 1997. Her thesis work was dedicated to various aspects of semidefinite programming and interior point methods. During the last two years of her Ph.D. program she was awarded an IBM Fellowship for Graduate Students and began collaborating on derivative-free methods for general nonlinear optimization. She has been a Research Staff Member in the Mathematical Sciences Department at IBM's T.J. Watson Research Center since 1997. Her research interests include theoretical and implementation issues of derivative-free optimization (she released an open source package called DFO and is currently working on a book on the subject), numerical stability and efficient algorithms for linear algebra in interior point methods, parametric optimization, conic linear optimization, and large-scale support vector machines (she is the author of an open source package called SVM-QP). She has recently been involved in an effort to bring together the optimization and machine learning communities, and has organized several workshops dedicated to the subject.


Renata Vieira
Universidade do Vale do Rio dos Sinos (UNISINOS)
Fri, March 30, 2007

Have I mentioned it?

Coreference resolution is a well-known NLP task, relevant to the more general task of information extraction, among others. A related problem is identifying whether a referring expression introduces a new entity into the text or follows up previously mentioned ones. Definite descriptions are a type of referring expression that is highly ambiguous between these two roles; in fact, they sometimes lie in between, mentioning a new entity whose interpretation is anchored in given ones. When developing a classifier that distinguishes between these roles of definite descriptions, we face not only the old AI problem of grasping common sense and world knowledge but also the problem of unbalanced data. In this talk I will present some experiments dealing with these problems and mention some NLP tasks for which the classifier is useful.

Bio: Dr. Renata Vieira is a Professor in the Computer Science Department at Universidade do Vale do Rio dos Sinos, Sao Leopoldo, Brazil. She received her PhD from the University of Edinburgh in 1998. Her research interests cover issues in computational linguistics and artificial intelligence, including natural language understanding, discourse processing, agent communication, knowledge representation, ontologies and the semantic web.


Rong Jin
Michigan State University
Wed, March 14, 2007

Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Maximum margin clustering extends the theory of support vector machines to unsupervised learning, and has shown promising performance in recent studies. However, it has three major problems that limit its use in real-world applications: (1) it is computationally expensive and difficult to scale to large datasets; (2) it requires data preprocessing to ensure that the clustering boundary passes through the origin, which makes it unsuitable for clustering unbalanced datasets; and (3) its performance is sensitive to the choice of kernel function. In this talk, we propose the "Generalized Maximum Margin Clustering" framework, which addresses all three problems simultaneously.
The new framework generalizes the maximum margin clustering algorithm in that (1) it allows any clustering boundary, including those not passing through the origin; (2) it significantly improves computational efficiency by reducing the number of parameters; and (3) it automatically determines the appropriate kernel matrix without any labeled data. Our empirical studies demonstrate the efficiency and effectiveness of the generalized maximum margin clustering algorithm. Furthermore, I will show the theoretical connection among spectral clustering, maximum margin clustering, and generalized maximum margin clustering.

Bio: Dr. Rong Jin has been an Assistant Professor in the Computer Science and Engineering Department of Michigan State University since 2003. He works in the areas of statistical machine learning and its application to information retrieval. Dr. Jin holds a B.A. in Engineering from Tianjin University, an M.S. in Physics from Beijing University, and an M.S. and Ph.D. from the School of Computer Science at Carnegie Mellon University.


Andrew Tomkins
Yahoo! Research
Wed, May 3, 2006

Routing and growth models for online social networks

In this talk, I'll discuss some results from a number of large online social networks, including the LiveJournal blog hosting site, the Flickr photo sharing and tagging network, and the Yahoo! MyWeb2.0 social search application. I'll give some intuition about the denizens of these networks, covering demographics, interests, and locations. Then I'll discuss the small-world phenomenon in the context of online networks, and describe some results extending Kleinberg's work on small worlds to more general metric spaces and to non-uniform population densities. Finally, time permitting, I'll close with a discussion of how these networks evolve over time.

This is joint work with Ravi Kumar, David Liben-Nowell, Jasmine Novak, and Prabhakar Raghavan.

Bio: Dr. Andrew Tomkins joined Yahoo! Research in 2005 from IBM. His research over the last eight years has focused on measurement, modeling, and analysis of content, communities, and users on the World Wide Web. Prior to joining Yahoo! Research, he managed the "Information Management Principles" group at IBM's Almaden Research Center, and served as Chief Scientist on the WebFountain project. He received Bachelor's degrees in Mathematics and Computer Science from MIT, and a PhD in CS from Carnegie Mellon University.


Lise Getoor
University of Maryland - College Park
Fri, Feb 10, 2006

Link Mining

A key challenge for data mining is tackling the problem of mining richly structured datasets, where the objects are linked in some way. Links among the objects may demonstrate certain patterns, which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations and epidemiological records.
Link mining includes both descriptive and predictive modeling of link data. Classification and clustering in linked relational domains require new data mining models and algorithms. Furthermore, with the introduction of links, new predictive tasks come to light. Examples include predicting the number of links, predicting the type of link between two objects, inferring the existence of a link, inferring the identity of an object, finding co-references, and discovering subgraph patterns.
In this talk, I will give an overview of this newly emerging research area. I will describe novel aspects of the modeling, learning and inference tasks. Then, as time permits, I will describe some of my group's recent work on link-based classification and entity resolution in linked data.

Bio: Dr. Lise Getoor is an assistant professor in the Computer Science Department at the University of Maryland, College Park. She received her PhD from Stanford University in 2001. Her current work includes research on link mining, statistical relational learning and representing uncertainty in structured and semi-structured data. Her work in these areas has been supported by NSF, NGA, KDD, ARL and DARPA. In July 2004, she co-organized the third in a series of successful workshops on statistical relational learning, http://www.cs.umd/srl2004. She has published numerous articles in machine learning, data mining, database and AI forums. She is a member of the AAAI Executive Council, is on the editorial boards of the Machine Learning Journal and JAIR, and has served on numerous program committees including AAAI, ICML, IJCAI, KDD, SIGMOD, UAI, VLDB, and WWW.


Mehran Sahami
Google Inc.
Tues, May 31, 2005

Using Machine Learning to Improve Information Finding on the Web

Web search is one of the most important applications used on the Internet, and it also poses many interesting opportunities to apply machine learning. In order to better help people find relevant information in a growing sea of data, we discuss various machine learning techniques that can be harnessed to sift, organize, and present relevant information to users. In this talk, we provide a brief background on information retrieval, and then look at some of the challenges faced in searching the Web. We specifically examine applications of machine learning to improve information retrieval, image classification, topical inference of queries, and record linkage. We show how these tasks are directly related to the overarching goal of improving various aspects of search on the web.

Bio: Mehran Sahami is a Senior Research Scientist at Google and is also a Lecturer in the Computer Science Department at Stanford University. At Google, Mehran conducts research in machine learning and information retrieval technologies to help improve information access. Prior to joining Google, Mehran was involved in a number of commercial and research machine learning projects at E.piphany, Xerox PARC, SRI International, and Microsoft Research. He received his BS, MS, and PhD in Computer Science from Stanford, and evidently still has separation anxiety with respect to completely leaving there.


Thomas G. Dietterich
Oregon State University
Fri, Nov 19, 2004

Three Challenges for Machine Learning Research

Over the past 25 years, machine learning research has made huge progress on the problem of supervised learning. This talk will argue that now is the time to consider three new directions. The first direction, which is already being pursued by many groups, is Structural Supervised Learning in which the input instances are no longer independent but instead are related by some kind of sequential, spatial, or graphical structure. A variety of methods are being developed, including hidden Markov support vector machines, conditional random fields, and sliding window techniques. The second new direction is Transfer Learning in which something is learned on one task that can help with a second, separate task. This includes transfer of learned facts, learned features, and learned ontologies. The third new direction is Deployable Learning Systems. Today's learning systems are primarily operated offline by machine learning experts. They provide an excellent way of constructing certain kinds of AI systems (e.g., speech recognizers, handwriting recognizers, data mining systems, etc.). But it is rare to see learning systems that can be deployed in real applications in which learning takes place on-line and without expert intervention. Deployed learning systems must deal with such problems as changes in the number, quality, and semantics of input features, changes in the output classes, and changes in the underlying probability distribution of instances. There are also difficult software engineering issues that must be addressed in order to make learning systems maintainable after they are deployed.

Bio: Dr. Thomas G. Dietterich (AB Oberlin College 1977; MS University of Illinois 1979; PhD Stanford University 1984) joined the Oregon State University faculty in January 1985. In 1987, he was named an NSF Presidential Young Investigator. In 1990, he published, with Dr. Jude Shavlik, the book Readings in Machine Learning, and he also served as Technical Program Co-Chair of the National Conference on Artificial Intelligence (AAAI-90). From 1992 to 1998 he held the position of Executive Editor of the journal Machine Learning. The American Association for Artificial Intelligence named him a Fellow in 1994, and the Association for Computing Machinery did the same in 2003. In 2000, he co-founded a new, free electronic journal, the Journal of Machine Learning Research. He served as Technical Program Chair of the Neural Information Processing Systems (NIPS) conference in 2000 and General Chair in 2001. He is currently President of the International Machine Learning Society and a member of the DARPA Information Science and Technology Study Group, and he also serves on the Board of Trustees of the NIPS Foundation.


Foster Provost
New York University
Thurs, Nov 11, 2004

Classification and Learning with Networked Data: some observations and results

As information systems record and provide access to increasing amounts of data, connections between entities become available for analysis. Customer accounts are linked by communications and other transactions. Organizations are linked by joint activities. Text documents are hyperlinked. Such networked data create opportunities for improving classification. For example, a common and successful strategy for detecting fraud is to use transactions to link a questionable account to previous fraudulent activity. Document classification can be improved by considering hyperlink structure. Marketing can change dramatically when customer communication is taken into account. In this talk I will focus on two unique characteristics of classification with networked data: (1) knowing the classifications of some entities in the network can improve the classification of others, and (2) very-high-cardinality categorical attributes (e.g., identifiers) can be used effectively in learned models. I will discuss methods for taking advantage of these characteristics, and will demonstrate them on various real and synthetic data sets.
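
As an illustrative sketch of characteristic (1), and not the speaker's actual method, the following toy code propagates labels through a network by assigning each unlabeled node the majority label of its labeled neighbors, iterating to a fixed point (in the spirit of relational-neighbor classifiers). The example graph and all names are hypothetical:

```python
# Sketch of a relational-neighbor classifier: each unlabeled node takes the
# majority label among its already-labeled neighbors, iterated to a fixed point.
from collections import Counter

def relational_neighbor(edges, labels, max_iters=10):
    """edges: dict node -> list of neighbors; labels: dict node -> class for
    the seed nodes. Once a node is labeled it is never relabeled."""
    labels = dict(labels)
    for _ in range(max_iters):
        changed = False
        for node, nbrs in edges.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                labels[node] = votes.most_common(1)[0][0]
                changed = True
        if not changed:
            break
    return labels

# Hypothetical fraud-detection-style graph: accounts linked by transactions.
edges = {
    "h": ["f1", "f2", "l1", "g"],   # questionable hub account
    "g": ["h"],                     # account linked only to the hub
    "f1": ["h"], "f2": ["h"], "l1": ["h"],
}
seeds = {"f1": "fraud", "f2": "fraud", "l1": "legit"}
result = relational_neighbor(edges, seeds)
```

Here "h" inherits the majority label of its seeded neighbors, and "g" is labeled in turn from "h", showing how known classifications improve others.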

This is joint work with Claudia Perlich and Sofus Macskassy.

Bio: Dr. Foster Provost is Associate Professor of Information Systems and NEC Faculty Fellow at New York University's Stern School of Business. He is Editor-in-Chief of the journal Machine Learning and a founding board member of the International Machine Learning Society. Professor Provost's recent research focuses on mining networked data, economic machine learning, and applications of machine learning and data mining. Previously, at NYNEX/Bell Atlantic Science and Technology, he studied a variety of applications of machine learning to telecommunications problems, including fraud detection, network diagnosis and monitoring, and customer contact management.


Michael J. Pazzani
Division Director, Information and Intelligent Systems, National Science Foundation
Thurs, Sept 30, 2004

Machine Learning for Personalized Wireless Portals

People have access to vast stores of information on the World Wide Web, ranging from online publications to electronic commerce. All this information, however, used to be accessible only while users were tethered to a computer at home or in an office. Wireless data and voice access to this vast store allows unprecedented access to information from any location at any time, but the presentation of this information must be tailored to the constraints of mobile devices. Although browsing and searching are acceptable methods of locating information on the wired web, those operations soon become cumbersome and inefficient in the wireless setting, and impossible in voice interfaces. Small screens, slower connections, high latency, limited input capabilities, and the serial nature of voice interfaces present new challenges. This talk focuses on personalization techniques that are essential for the usability of handheld wireless devices.

Bio: Dr. Michael J. Pazzani is the Director of the Information and Intelligent Systems Division of the National Science Foundation. He received his Ph.D. in Computer Science from UCLA and is on leave from a full professorship at the University of California, Irvine, where he also served as chair of the Department of Information and Computer Science for five years. Dr. Pazzani serves on the Board of Regents of the National Library of Medicine. He is a Fellow of the American Association for Artificial Intelligence and has published numerous papers in machine learning, personalization, information retrieval, and cognitive science.


Soumen Chakrabarti
Indian Institute of Technology at Bombay
Fri, Aug 20, 2004

Is question answering an acquired skill?

We present a question answering (QA) system which learns how to detect and rank answer passages by analyzing questions and their answers (QA pairs) provided as training data. Our key technique is to recover, from the question, fragments of what might have been posed as a structured query, had a suitable schema been available. One fragment comprises _selectors_: tokens that are likely to appear (almost) unchanged in an answer passage. The other fragment contains question tokens which give clues about the _answer_type_, and are expected to be _replaced_ in the answer passage by tokens which _specialize_ or _instantiate_ the desired answer type. Selectors are like constants in the where-clauses of relational queries, and answer types are like column names. We propose a simple conditional exponential model over a pair of feature vectors, one derived from the question and the other derived from a candidate passage. Features are extracted using a lexical network and surface context, as in named entity extraction, except that there is no direct supervision available in the form of fixed entity types and their examples. We do not need any manually designed question type system or supervised question classification. Using the exponential model, we filter candidate passages and see substantial improvement in the mean rank at which the first answer passage is found. With TREC QA data, our system achieves a mean reciprocal rank (MRR) that compares favorably with the best scores in recent years, and it generalizes from one corpus to another.
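
A minimal sketch of the selector idea combined with a conditional exponential (softmax) model, assuming a single overlap feature. This is a toy illustration, not the system described in the talk; the stopword list, weight, and example texts are all hypothetical:

```python
# Toy sketch: rank candidate answer passages with a conditional exponential
# (softmax) model over one hand-made feature, the number of question
# "selectors" (content tokens) that reappear unchanged in the passage.
import math

STOPWORDS = {"who", "what", "when", "where", "the", "a", "of", "did", "is"}

def selectors(question):
    """Question tokens likely to appear unchanged in an answer passage."""
    return {t for t in question.lower().split() if t not in STOPWORDS}

def rank_passages(question, passages, weight=1.0):
    sel = selectors(question)
    # One feature per passage: selector overlap; P(passage|question) ~ exp(w*f).
    feats = [len(sel & set(p.lower().split())) for p in passages]
    scores = [math.exp(weight * f) for f in feats]
    z = sum(scores)
    probs = [s / z for s in scores]
    order = sorted(range(len(passages)), key=lambda i: -probs[i])
    return order, probs

question = "When did Tolkien publish The Hobbit"
passages = [
    "The Hobbit was published by Tolkien in 1937.",
    "Tolkien taught at Oxford.",
    "The weather was mild in 1937.",
]
order, probs = rank_passages(question, passages)
```

The real model would also include answer-type features learned from QA pairs; the point here is only the shape of the exponential scoring.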

This is joint work with Ganesh Ramakrishnan, Deepa Paranjpe, and Pushpak Bhattacharyya.

Bio: Dr. Soumen Chakrabarti received his B.Tech in Computer Science from the Indian Institute of Technology, Kharagpur, in 1991 and his M.S. and Ph.D. in Computer Science from the University of California, Berkeley in 1992 and 1996. At Berkeley he worked on compilers and runtime systems for running scalable parallel scientific software on message passing multiprocessors. He was a Research Staff Member at IBM Almaden Research Center from 1996 to 1999, where he worked on the Clever Web search project and led the Focused Crawling project. In 1999 he moved to the Department of Computer Science and Engineering at the Indian Institute of Technology, Bombay, as an Assistant Professor, and he has been an Associate Professor there since 2003. In Spring 2004 he was a Visiting Associate Professor at Carnegie Mellon. He has published in the WWW, SIGIR, SIGKDD, SIGMOD, VLDB, ICDE, SODA, STOC, SPAA and other conferences, as well as in Scientific American, IEEE Computer, VLDB and other journals. He holds eight US patents on Web-related inventions. He has served as vice-chair or program committee member for WWW, SIGIR, SIGKDD, VLDB, ICDE, SODA and other conferences, and as guest editor or editorial board member for the DMKD and TKDE journals. He is also the author of a new book on Web mining. His current research interests include question answering; Web analysis, monitoring and search; mining irregular and relational data; and textual data integration.


Arindam Banerjee
University of Texas at Austin
Fri, Apr 16, 2004

Bregman Clustering and Co-clustering

A wide variety of distortion measures, such as squared Euclidean distance and relative entropy, have been used in the literature for parametric clustering. In recent years, several such distortion measures have also been used for co-clustering, i.e., the simultaneous clustering of the rows and columns of two-dimensional data matrices. In this talk, we discuss a unification of a large class of clustering and co-clustering techniques using Bregman divergences.
The talk has two parts. In the first part, we propose and analyze parametric hard clustering with Bregman divergences. The proposed class of algorithms unifies centroid-based parametric clustering approaches, such as classical k-means and information-theoretic clustering, which arise from particular choices of Bregman divergence. Further, we show an explicit bijection between Bregman divergences and exponential families that leads to a simple soft clustering algorithm for all Bregman divergences. In the second part, we discuss a general framework for co-clustering wherein loss functions corresponding to all Bregman divergences can be used and various conditional-expectation-based constraints can be considered, giving rise to different parametric co-clustering models for a wide range of data matrices. The minimum Bregman information solution, a direct generalization of maximum entropy and least squares, plays a critical role in the analysis and leads to an elegant meta-algorithm guaranteed to achieve local optimality. Empirical evidence is provided to establish the generality and efficacy of the proposed framework.
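
A key property behind this unification is that, for every Bregman divergence, the optimal hard-clustering representative is the plain arithmetic mean; only the assignment step changes with the divergence. The following minimal sketch on scalars (hypothetical names, illustrative only) shows the scheme with squared Euclidean distance as the plugged-in divergence:

```python
# Sketch of Bregman hard clustering on scalars: the assignment step uses a
# pluggable Bregman divergence, while the update step is always the plain
# mean, the property that unifies k-means-style methods.
import math

def squared_euclidean(x, mu):
    return (x - mu) ** 2

def kl_divergence(x, mu):
    # Generalized I-divergence for positive scalars (Bregman for x*log(x)).
    return x * math.log(x / mu) - x + mu

def bregman_cluster(points, centroids, divergence, iters=20):
    for _ in range(iters):
        # Assignment: nearest centroid under the chosen divergence.
        clusters = [[] for _ in centroids]
        for x in points:
            j = min(range(len(centroids)),
                    key=lambda k: divergence(x, centroids[k]))
            clusters[j].append(x)
        # Update: the centroid is the arithmetic mean for every Bregman divergence.
        centroids = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, centroids)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 10.0, 11.0]
centroids, clusters = bregman_cluster(points, [1.0, 9.0], squared_euclidean)
```

Passing kl_divergence instead of squared_euclidean yields the same partition on this well-separated toy data, since only the assignment geometry changes.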

This is joint work with Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra Modha.


Inderjit S. Dhillon
University of Texas at Austin
Fri, Mar 26, 2004

Information-Theoretic Clustering, Co-clustering and Matrix Approximations

Given a two-dimensional co-occurrence matrix, we consider the problem of co-clustering or simultaneous clustering of the rows and columns. Viewing the (scaled) co-occurrence matrix as an empirical joint probability distribution of two discrete random variables, we can pose the co-clustering problem as an optimization problem in information theory --- the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. It can be shown that this objective is identical to that of finding a "low-complexity" non-negative matrix approximation that is closest in relative entropy to the given matrix under clustering constraints. Based on this formulation, we present a co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. We apply the algorithm to some problems in text mining: distributional clustering of words applied to text classification, and simultaneous clustering of words and documents. Experimental results indicate that distributional word clustering leads to improved text classification when the training data is small in size, and co-clustering can overcome problems due to high-dimensionality and sparsity. Generalizations to varied distortion measures, such as Bregman divergences, will be discussed.
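
The objective can be made concrete with a small sketch (illustrative only, not the algorithm from the talk): co-clustering aggregates the joint distribution over row and column clusters, and a good co-clustering preserves as much of the original mutual information as possible. All names and the example matrix below are hypothetical:

```python
# Sketch: mutual information of a joint distribution before and after
# co-clustering, illustrating the "preserved mutual information" objective.
import math

def mutual_information(p):
    """I(X;Y) in nats for a joint distribution given as a row-major matrix."""
    rows = [sum(r) for r in p]
    cols = [sum(p[i][j] for i in range(len(p))) for j in range(len(p[0]))]
    mi = 0.0
    for i, r in enumerate(p):
        for j, pij in enumerate(r):
            if pij > 0:
                mi += pij * math.log(pij / (rows[i] * cols[j]))
    return mi

def cocluster(p, row_map, col_map, n_rc, n_cc):
    """Aggregate the joint distribution over row clusters and column clusters."""
    q = [[0.0] * n_cc for _ in range(n_rc)]
    for i, r in enumerate(p):
        for j, pij in enumerate(r):
            q[row_map[i]][col_map[j]] += pij
    return q

# Block-structured joint distribution: rows/cols 0,1 co-occur, as do 2,3.
p = [[0.2, 0.05, 0.0, 0.0],
     [0.05, 0.2, 0.0, 0.0],
     [0.0, 0.0, 0.2, 0.05],
     [0.0, 0.0, 0.05, 0.2]]
q = cocluster(p, row_map=[0, 0, 1, 1], col_map=[0, 0, 1, 1], n_rc=2, n_cc=2)
loss = mutual_information(p) - mutual_information(q)  # information lost, >= 0
```

The optimal co-clustering at a given size is the one minimizing this loss; the talk's algorithm decreases it monotonically by intertwining row and column updates.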

This is joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu and D. Modha.


Daniel Boley
University of Minnesota
Fri, Mar 12, 2004

Training Support Vector Machines via Adaptive Clustering

We discuss a way to speed up the training process for support vector machines by means of clustering the training data and using representative points for the training process. An iterative procedure is proposed to refine the resulting support vector machine. This procedure eventually converges to the true support vector machine that would be obtained using all the original training data, but often the earlier approximate support vector machines will suffice. The approach is modular by allowing the user to plug in arbitrary methods for both the clustering and the support vector machine training. Using off-the-shelf methods for both tasks, we illustrate the approach with an artificial dataset and a "real" dataset.
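
A minimal sketch of the cluster-then-refine loop described above, with a simple perceptron standing in for the pluggable SVM trainer. Everything here is a hypothetical toy illustration, not the paper's implementation:

```python
# Sketch of the cluster-then-train idea: train on cluster representatives,
# then refine by pulling in original points the current model misclassifies.
# A perceptron stands in for the SVM trainer; any trainer could be plugged in.

def perceptron(points, labels, epochs=50):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
    return w, b

def centroid(pts):
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

def cluster_train(clusters, cluster_labels, all_points, all_labels, rounds=3):
    # Start from one representative (the centroid) per cluster.
    reps = [centroid(c) for c in clusters]
    rep_labels = list(cluster_labels)
    for _ in range(rounds):
        w, b = perceptron(reps, rep_labels)
        # Refinement: add misclassified original points as new representatives.
        added = False
        for p, y in zip(all_points, all_labels):
            if y * (w[0] * p[0] + w[1] * p[1] + b) <= 0 and p not in reps:
                reps.append(p)
                rep_labels.append(y)
                added = True
        if not added:
            break   # early stop: the approximate model already suffices
    return w, b

pos = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0)]
neg = [(-2.0, -2.0), (-3.0, -2.5), (-2.5, -3.0)]
w, b = cluster_train([pos, neg], [+1, -1], pos + neg, [1, 1, 1, -1, -1, -1])
```

On this easy toy data the centroids alone already separate the classes, so the loop stops early, mirroring the observation that the approximate machine often suffices.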

Bio: Dr. Daniel Boley received his A.B. degree Summa Cum Laude in Mathematics and with Distinction in All Subjects from Cornell University in 1974, and his M.S. and Ph.D. degrees in Computer Science from Stanford University in 1976 and 1981, respectively. Since 1981, he has been on the faculty of the Department of Computer Science and Engineering at the University of Minnesota. He has had extended visiting positions at the Los Alamos Scientific Laboratory, the IBM Research Center in Zurich (Switzerland), the Australian National University in Canberra, Stanford University, and the University of Salerno (Italy). Dr. Boley is known for his past work on numerical linear algebra methods for control problems, parallel algorithms, iterative methods for matrix eigenproblems, error correction for floating point computations, and inverse problems in linear algebra, as well as his more recent work on numerical methods in robot motion planning and unsupervised document categorization in data mining. He has been an editor for the SIAM Journal of Matrix Analysis and has chaired several technical symposia.


Jiawei Han
Department of Computer Science
University of Illinois - Urbana-Champaign
Fri, Oct 24, 2003

Mining Dynamics of Data Streams in Multidimensional Space

Traditional data mining systems and methods assume that data resides on disks or in main memory, and that a data mining process can scan the data sets multiple times to uncover interesting patterns. However, real-time production systems and other dynamic environments often generate tremendous (potentially infinite) volumes of stream data at high speed---the volume is too large to be stored on disks, or even if it were stored, it would be too expensive to fetch and scan multiple times. Moreover, many applications require real-time mining of unusual patterns in data streams, including finding unusual network or telecommunication traffic, real-time pattern mining in video surveillance, detecting suspicious online transactions or terrorist activities, and so on. Recently there has been substantial and growing research interest in developing methods for querying and mining stream data. In this talk, I will first present an overview of recent developments in stream data querying and mining, and outline the major challenging research problems in mining the dynamics of data streams in multi-dimensional space. In particular, I will address the following issues in detail: (1) multi-dimensional on-line analysis methods for discovering unusual patterns in stream data; (2) mining time-sensitive frequent patterns in stream data; (3) mining clusters and outliers in stream data; and (4) dynamic stream classification methods.
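
As a concrete toy example of the constraints described above (one pass over the stream, bounded memory, time-sensitive counts), the following sketch tracks item frequencies over a sliding window of the most recent events. It is illustrative only, not a method from the talk, and all names are hypothetical:

```python
# Sketch of one-pass, bounded-memory stream mining: item frequencies over a
# sliding window of the most recent W events, without storing the full stream.
from collections import Counter, deque

class SlidingWindowCounter:
    def __init__(self, window):
        self.window = window
        self.buffer = deque()     # only the last `window` events are kept
        self.counts = Counter()

    def observe(self, item):
        self.buffer.append(item)
        self.counts[item] += 1
        if len(self.buffer) > self.window:   # expire the oldest event
            old = self.buffer.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def frequent(self, min_count):
        """Items meeting the support threshold within the current window."""
        return {x for x, c in self.counts.items() if c >= min_count}

stream = ["a", "a", "b", "a", "c", "c", "c", "b", "c", "c"]
counter = SlidingWindowCounter(window=5)
for event in stream:
    counter.observe(event)
```

Memory here is bounded by the window size; truly unbounded windows call for the approximate-summary techniques the talk surveys.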

Bio: Dr. Jiawei Han is a professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been working on research into data mining, data warehousing, database systems, spatial and multimedia databases, deductive and object-oriented databases, Web databases, bio-medical databases, etc., with over 250 journal and conference publications. He has chaired or served on many program committees of international conferences and workshops, including the 2001 and 2002 SIAM Data Mining Conferences (PC co-chair), the 2004 and 2002 International Conferences on Data Engineering (PC vice-chair), ACM SIGKDD conferences (2001 best paper award chair, 2002 student award chair), and the 2002 and 2003 ACM SIGMOD conferences (2000 demo/exhibit program chair). He also has served or is serving on the editorial boards of Data Mining and Knowledge Discovery: An International Journal, IEEE Transactions on Knowledge and Data Engineering, and the Journal of Intelligent Information Systems. His textbook "Data Mining: Concepts and Techniques" (Morgan Kaufmann, 2001) has been widely adopted for data mining courses in universities. His research is currently supported by NSF, NSF/ITR, ONR, the University of Illinois, an IBM Faculty Award, and a gift from Microsoft Research.


David Page, Jr.
Department of Biostatistics and Bioinformatics, Department of Computer Sciences
University of Wisconsin - Madison
Tues, Aug 19, 2003

Machine Learning in Cancer Genetics

In recent years several novel types of genetic data have become available, and a primary application area for these data is the study of cancer. Two of the most widely publicized sources of such data are gene expression microarrays, or "gene chips," and single-nucleotide polymorphisms, or "SNPs." This talk will review these types of data and prior applications of machine learning to gene chip data. It will then present novel results using machine learning to analyze gene chip and SNP data for multiple myeloma, an incurable form of blood cancer. To our knowledge, the application to SNP data is the first such machine learning application. The talk will close with a brief discussion of other novel biological data types to which machine learning should also be applicable.

This is joint work with Michael Waddell (also of UW) and with John Shaughnessy, Fenghuang Zhan and Bart Barlogie of the Lambert Laboratory for Myeloma Genetics and the Myeloma Institute at the University of Arkansas for Medical Sciences.


Pedro Domingos
Department of Computer Science and Engineering
University of Washington
Fri, Oct 25, 2002

Learning From Networks of Examples

Most machine learning algorithms assume that examples are independent of each other, but many (or most) domains violate this assumption. For example, in real markets customers' buying decisions are influenced by their friends and acquaintances, but data mining for marketing ignores this (as does traditional economics). In this talk I will describe how we can learn models that account for example dependencies, and use them to make better decisions. For example, in the marketing domain we are able to pinpoint the most influential customers, "seed" the network by marketing to them, and unleash a wave of word of mouth. We mine these models from collaborative filtering systems and knowledge-sharing Web sites, and show that they are surprisingly robust to imperfect knowledge of the network. I will also survey other applications of learning from networks of examples we are working on, including: combining link and content information in Google-style Web search; automatically translating between ontologies on the Semantic Web; and predicting the evolution of scientific communities.

This is joint work with Matt Richardson.

Bio: Dr. Domingos received his B.A. (1988) and M.S. (1992) in EECS from IST, in Lisbon, and his M.S. (1994) and Ph.D. (1997) in Information and Computer Science from the University of California at Irvine. After spending two years as an assistant professor at IST, he joined the faculty of the University of Washington in 1999. He is the author or co-author of over 80 technical publications in machine learning, data mining, and other areas.


Andrew McCallum
WhizBang Labs & Carnegie Mellon University
Fri, Feb 22, 2002

Information Extraction with Finite State Models

Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human-readable text. Finite state machines are the dominant model for information extraction in both research and industry. In this talk I will give several examples of information extraction tasks at WhizBang Labs, and then describe two new finite state models designed to take special advantage of the multi-faceted nature of text on the web. Maximum Entropy Markov Models (MEMMs) are discriminative sequence models that allow each observation to be represented as a collection of arbitrary overlapping features (such as word identity, capitalization, part-of-speech and formatting, plus agglomerative features of the entire sequence and features from the past and future). Conditional Random Fields (CRFs) are a generalization of MEMMs that solve a fundamental limitation of MEMMs and all other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. I will introduce both models, skim over their parameter estimation algorithms, and present experimental results on real and synthetic data.

This is joint work with Fernando Pereira, John Lafferty, and Dayne Freitag.
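
As a concrete illustration of the "arbitrary overlapping features" mentioned above, here is a minimal sketch of how a single token can be represented for an MEMM- or CRF-style sequence model. The feature names and the example sentence are invented for illustration, not code from the talk:

```python
# Hedged sketch: the kind of overlapping-feature representation MEMMs and
# CRFs allow, where each observation becomes a set of arbitrary binary
# features, including features drawn from the past and future context.

def token_features(tokens, i):
    """Extract overlapping features for token i of a sentence."""
    tok = tokens[i]
    feats = {
        "word=" + tok.lower(): 1,
        "is_capitalized": int(tok[:1].isupper()),
        "is_digit": int(tok.isdigit()),
        "suffix3=" + tok[-3:].lower(): 1,
    }
    if i > 0:                      # feature from the past
        feats["prev_word=" + tokens[i - 1].lower()] = 1
    if i < len(tokens) - 1:        # feature from the future
        feats["next_word=" + tokens[i + 1].lower()] = 1
    return feats

feats = token_features(["Dr.", "McCallum", "visited", "Austin"], 1)
```

Because the model is conditional, these features may overlap and need not be independent of one another, which is exactly the freedom MEMMs and CRFs are designed to exploit.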

Bio: Dr. Andrew McCallum received his B.A. from Dartmouth College and his Ph.D. in Computer Science from the University of Rochester, followed by a postdoctoral fellowship at Carnegie Mellon University. He is currently VP of Research and Development at WhizBang! Labs, a company that performs machine-learning-based information extraction from the Web. His interests include machine learning and text mining, especially information extraction, document classification, and techniques for learning from limited labeled training data.


Oren Etzioni
Department of Computer Science and Engineering
University of Washington
Fri, Nov 2, 2001

Adaptive Web Sites

Today's web sites are intricate but not intelligent: while web navigation is dynamic and idiosyncratic, all too often web sites are fossils cast in HTML. This talk describes our research on Adaptive Web Sites: sites that automatically improve their organization and presentation by learning from visitor access patterns. Adaptive web sites mine the data buried in web server logs to produce more easily navigable web sites.

To demonstrate the feasibility of adaptive web sites, I will describe the problem of index page synthesis and sketch a solution that relies on novel clustering and conceptual clustering techniques. Our preliminary experiments show that high-quality candidate index pages can be generated automatically, and that our techniques outperform existing methods (including the Apriori algorithm, K-means clustering, hierarchical agglomerative clustering, and COBWEB) in this domain.


Foster Provost
New York University
Fri, Oct 26, 2001

Reviving the Learning Curve: A critical study of machine learning performance

Off-the-shelf machine-learning algorithms are now used widely both for manual data analysis and as important components for building intelligent systems. We revive the "learning curve," which relates learning performance to training-set size, to examine two of the most popular families of off-the-shelf classifier induction algorithms: decision-tree inducers and logistic regression. This new analysis (conducted using 37 large data sets, comparing classification and ranking performance) shows some remarkable things. Most generally, one cannot safely ignore learning curves when comparing algorithms; on the same data sets, conclusions can be strikingly different depending on training-set size. This calls into question the conclusions of innumerable comparison studies that do not factor the size of the data set into the analysis. More specifically, looking carefully at the learning curves of these two popular induction algorithm families, we see clearly that each is preferable for a different type of data.
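
The protocol the talk argues for can be sketched as follows. The two toy learners (a majority-class baseline and 1-nearest-neighbor) and the synthetic data are illustrative assumptions, not the decision-tree/logistic-regression comparison from the study; the point is that each learner is evaluated at several training-set sizes rather than one:

```python
import random

# Hedged sketch of the learning-curve protocol: hold out a test set, then
# measure each learner's accuracy at increasing training-set sizes.

random.seed(0)
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(600))]
train, test = data[:500], data[500:]

def majority(train_set):
    """Baseline: always predict the most common training label."""
    lab = round(sum(y for _, y in train_set) / len(train_set))
    return lambda x: lab

def one_nn(train_set):
    """Predict the label of the nearest training point."""
    return lambda x: min(train_set, key=lambda p: abs(p[0] - x))[1]

def accuracy(clf, test_set):
    return sum(clf(x) == y for x, y in test_set) / len(test_set)

curve = {}
for n in (10, 50, 200, 500):          # training-set sizes along the curve
    sub = train[:n]
    curve[n] = (accuracy(majority(sub), test), accuracy(one_nn(sub), test))
```

Plotting `curve` for each learner, rather than comparing at a single size, is what reveals the crossings the abstract warns about.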


Sridhar Rajagopalan
IBM Almaden Research Center
Fri, Oct 5, 2001

Three applications of data mining principles

In this talk, I will describe three applications of data mining algorithms similar to Agrawal and Srikant's Apriori algorithm. As we go through the sequence, we will see increasing use of simple graph-theoretic methods, and I will explain why the even simpler base technology fails.
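
For readers unfamiliar with the starting point, here is a minimal, hedged sketch of the Apriori-style level-wise search: a k-itemset can be frequent only if all of its (k-1)-subsets are, so candidates are grown one level at a time. The tiny basket data are invented for illustration:

```python
from itertools import combinations

# Hedged sketch of Apriori's level-wise frequent-itemset search.

def apriori(baskets, min_support):
    """Return all itemsets appearing in at least min_support baskets."""
    baskets = [frozenset(b) for b in baskets]
    items = {i for b in baskets for i in b}
    frequent = {}
    level = [frozenset([i]) for i in sorted(items)]
    k = 1
    while level:
        counts = {s: sum(s <= b for b in baskets) for s in level}
        survivors = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(survivors)
        k += 1
        # candidate k-itemsets: unions of survivors all of whose
        # (k-1)-subsets also survived (the Apriori pruning step)
        cands = {a | b for a in survivors for b in survivors if len(a | b) == k}
        level = [s for s in cands
                 if all(frozenset(sub) in survivors for sub in combinations(s, k - 1))]
    return frequent

freq = apriori([{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}], 2)
```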

Bio: Dr. Sridhar Rajagopalan received his Ph.D. from the University of California, Berkeley in 1994. He subsequently spent two years at Princeton University as a DIMACS postdoctoral fellow. He has since been a researcher at the IBM Almaden Research Center. He is interested in algorithms, databases, randomization in computing, and information retrieval on the Web.


Joseph Sirosh
HNC Software Inc.
Fri, Sept 21, 2001

Unit Vector Representations of Unstructured Information

One of the frequently encountered challenges in data mining is the problem of representing unstructured information such as words in text, action codes, and similar non-numerical information. In this talk, I will introduce a technology developed at HNC called Context Vectors for representing and operating on unstructured data. Using context vectors, many operations on text data, such as search and indexing, categorization, clustering, and summarization, can be performed using vector operations. I will also review some applications of this technology, such as email response, personalization, search, recommendation generation, and segmentation.

Bio: Dr. Joseph Sirosh leads Advanced Technology Solutions (ATS), the R&D group of HNC Software Inc. (NASDAQ: HNCS). HNC is a leader in Customer Insight solutions incorporating decision management and customer analytics for better customer insight in the financial/banking, insurance/healthcare, telecommunications, and e-commerce industries. Sirosh oversees and directs research projects on text/multimedia retrieval, wireless web information management, information extraction and personalization, predictive modeling, intelligent agents, and genome sequence mining. During his tenure at HNC, he has won highly competitive externally funded projects in information retrieval and bioinformatics, helping HNC create breakthrough technology and establish core competencies in new fields. In his career, Sirosh has published more than 20 technical papers and one electronic book, and has been a lead investigator on various research grants from DARPA and other government agencies. Sirosh holds a PhD and Masters in Computer Sciences from the University of Texas at Austin and a BTech degree in Computer Science & Engineering from the Indian Institute of Technology, Madras.


Soumen Chakrabarti
Indian Institute of Technology - Bombay
Scheduled for Fri, Sept 14, 2001, but cancelled due to Sept 11

Beyond hubs and authorities: Enhanced topic distillation using text, markup tags, and hyperlinks

Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem is common for narrow topics, because most hubs contain admixtures of other topics. The problem also becomes severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. As a by-product, our algorithm can extract annotated sections of hubs which are best suited to surveying the given topic. Using over 50 topics, 28 of them from earlier topic distillation work, we have analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.

This is joint work with Mukul Joshi and Vivek Tawde.
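
The "mutual reinforcement" between hubs and authorities referred to above can be sketched as the classic HITS iteration, which the enhanced algorithm refines at a finer grain than whole pages. The tiny link graph below is invented for illustration:

```python
# Hedged sketch of the hub/authority mutual-reinforcement iteration (HITS).

def hits(links, iters=50):
    """links: dict page -> list of pages it points to. Returns (hub, auth)."""
    pages = set(links) | {q for vs in links.values() for q in vs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority score: sum of hub scores of pages pointing in
        auth = {p: sum(hub[q] for q, vs in links.items() if p in vs) for p in pages}
        # hub score: sum of authority scores of pages pointed to
        hub = {p: sum(auth[v] for v in links.get(p, [])) for p in pages}
        for d in (auth, hub):                     # normalize to avoid blow-up
            norm = sum(v * v for v in d.values()) ** 0.5 or 1.0
            for p in d:
                d[p] /= norm
    return hub, auth

# two "list" pages both point at a and b; a also endorses b
hub, auth = hits({"list1": ["a", "b"], "list2": ["a", "b"], "a": ["b"]})
```

The coarse-grained behavior shown here is what lets a tightly linked irrelevant subgraph dominate; the talk's algorithm instead scores subtrees within pages.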


Daniel Boley
University of Minnesota
Fri, Apr 27, 2001

A Scalable Hierarchical Algorithm for Unsupervised Clustering

Top-down hierarchical clustering can be done in a scalable way. Here we describe a scalable unsupervised clustering algorithm designed for large datasets from a variety of applications. The method constructs a tree of nested clusters top-down, where each cluster in the tree is split along the leading principal direction. We use a fast principal direction solver to achieve a fast overall method. The algorithm can be applied to any dataset whose entries can be embedded in a high-dimensional Euclidean space, and takes full advantage of any sparsity present in the data. We show the performance of the method on text document data, in terms of both scalability and quality of clusters. We demonstrate the versatility of the method in different domains by showing results from text documents, human cancer gene expression data, and astrophysical data. For the last domain, we use an out-of-core variant of the underlying method, which is capable of efficiently clustering large datasets using only a relatively small memory partition.
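
One top-down split in the spirit of the talk can be sketched as follows: center the points, find the leading principal direction (here by power iteration on the 2x2 covariance matrix), and split at the mean. The 2-D data are invented, and a real implementation would use a fast sparse solver as the abstract describes:

```python
# Hedged sketch of a single principal-direction split for divisive clustering.

def principal_split(points, iters=100):
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in (0, 1)]
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    # covariance matrix entries for 2-D data
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.5)                                   # arbitrary start vector
    for _ in range(iters):                           # power iteration
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    # split points by the sign of their projection onto the direction
    left = [p for p, c in zip(points, centered) if c[0] * v[0] + c[1] * v[1] < 0]
    right = [p for p, c in zip(points, centered) if c[0] * v[0] + c[1] * v[1] >= 0]
    return left, right

left, right = principal_split([(0, 0), (1, 0.1), (9, 1), (10, 1.1)])
```

Recursing on `left` and `right` yields the tree of nested clusters the method builds.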

Bio:Dr. Daniel Boley received his A.B. degree Summa Cum Laude in Mathematics and with Distinction in All Subjects from Cornell University in 1974, and his M.S. and Ph.D. degrees in Computer Science from Stanford University in 1976 and 1981, respectively. Since 1981, he has been on the faculty of the Department of Computer Science and Engineering at the University of Minnesota. He has had extended visiting positions at the Los Alamos Scientific Laboratory, the IBM Research Center in Zurich (Switzerland), the Australian National University in Canberra, Stanford University, and the University of Salerno (Italy). Dr. Boley is known for his past work on numerical linear algebra methods for control problems, parallel algorithms, iterative methods for matrix eigenproblems, error correction for floating point computations, inverse problems in linear algebra, as well as his more recent work on numerical methods in robot motion planning and unsupervised document categorization in data mining. He has been an editor for the SIAM Journal of Matrix Analysis and has chaired several technical symposia.


David Scott
Rice University
Fri, Apr 20, 2001

Nonparametric Clustering for Data Mining

The use of density estimation to find clusters in data is supplementing ad hoc hierarchical methodology. Examples include finding high-density regions, finding modes in a kernel density estimator, and the mode tree. Alternatively, a mixture model may be fit and the mixture components associated with individual clusters. However, a high-dimensional mixture model with many components is difficult to fit in practice. Here, we survey mode and level set methods for finding clusters. We describe a new algorithm that estimates a subset of a mixture model. In particular, we demonstrate how to fit one component at a time and how the fits may be organized to reveal the complete clustering model.
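
One way to make "finding modes in a kernel density estimator" concrete is the mean-shift iteration sketched below; this is an assumed illustration, not the new algorithm from the talk. Points whose iterates converge to the same mode can be grouped into one cluster. The 1-D data and bandwidth are invented:

```python
import math

# Hedged sketch: mean-shift under a Gaussian kernel climbs to a density mode.

def mean_shift_mode(x, data, bandwidth=0.5, iters=100):
    """Iterate the kernel-weighted mean until x settles on a mode."""
    for _ in range(iters):
        weights = [math.exp(-((x - d) / bandwidth) ** 2 / 2) for d in data]
        x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
    return round(x, 3)

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]       # two well-separated bumps
modes = sorted({mean_shift_mode(d, data) for d in data})
```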


Ed George
Department of MSIS
University of Texas - Austin
Fri, Apr 13, 2001

Bayesian Treed Modeling

When simple parametric models such as linear regression fail to adequately approximate a relationship across an entire set of data, an alternative may be to consider a partition of the data, and then use a separate simple model within each subset of the partition. Such an alternative is provided by a treed model, which uses a binary tree to identify such a partition. However, treed models go further than conventional trees (e.g., CART, C4.5) by fitting models rather than a simple mean or proportion within each subset. In this talk, I will discuss a Bayesian approach for finding and fitting parametric treed models, in particular focusing on Bayesian treed regression. The potential of this approach is illustrated by a cross-validation comparison of predictive performance with neural nets, MARS, and conventional trees on simulated and real data sets. This is joint work with Hugh Chipman, University of Waterloo, and Rob McCulloch, University of Chicago. A paper is available at


Gautam Das
Microsoft Research
Fri, Mar 30, 2001

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

The popularity of data warehousing applications makes the problem of approximately answering aggregation queries important. Recently, researchers in this area have presented several sampling-based schemes for approximate processing of aggregation queries. While most of these studies have acknowledged the importance of exploiting workload information in solving the problem, they still fall short in several important ways. First, many of them use ad hoc schemes for picking samples of the data, sometimes resulting in unacceptable quality of answers even for the given workload. Second, most previous approaches ignore variance in the data distribution in determining the sample, which can also lead to poor quality. We present a comprehensive and practical solution that addresses all of these problems. We use basic ideas from stratified sampling to formulate the problem as an optimization problem whose goal is to minimize the error for the given workload. We show how our solution can be implemented efficiently in a database system, and present results of extensive experiments on a commercial database management system that show the superior quality of our method compared to relevant previous work.
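
The stratified-sampling idea at the heart of the approach can be sketched as follows: sample within each stratum and scale each stratum's sample mean by its size, so a small but high-variance stratum is not drowned out by uniform sampling. The strata, data, and fixed per-stratum sample size are invented for illustration; the paper instead chooses the allocation by optimizing error for a given workload:

```python
import random

# Hedged sketch of stratified sampling for an approximate SUM aggregate.

random.seed(1)
strata = {
    "small_orders": [random.uniform(1, 10) for _ in range(1000)],
    "large_orders": [random.uniform(500, 1500) for _ in range(50)],
}

def stratified_sum_estimate(strata, per_stratum=20):
    est = 0.0
    for rows in strata.values():
        sample = random.sample(rows, min(per_stratum, len(rows)))
        # scale the sample mean by the stratum's full size
        est += len(rows) * sum(sample) / len(sample)
    return est

true_sum = sum(sum(rows) for rows in strata.values())
approx = stratified_sum_estimate(strata)
```

With uniform sampling, the 50 large orders would rarely be sampled at all, and the SUM estimate would be badly biased toward the small-order stratum.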


Joydeep Ghosh
Electrical and Computer Engineering
University of Texas, Austin
Fri, Mar 23, 2001

Graphical Techniques for Efficient and Balanced Clustering of Very High-Dimensional Data

Segmentation or clustering of large data sets with thousands of dimensions is very challenging because of sparsity problems arising from the "curse of dimensionality". Recent graphical approaches provide novel and efficient solutions by working in similarity space rather than in the original vector space in which the data points were specified.
I shall describe one such approach, based on constrained, weighted graph partitioning, that has yielded remarkable results on real-life market basket analysis and on clustering web pages. Along the way, I will address which similarity space should be used and how to evaluate and visualize the clusters.


Tom Mitchell
Carnegie Mellon University
and Whizbang! Labs
Fri, Feb 09, 2001

Machine Learning and Extracting Information from the Web

Today's search engines can retrieve and display over a billion web pages, but their use is limited by the fact that they don't analyze the content of these pages in any depth.
This talk will describe research that has resulted in systems that answer detailed factual questions by automatically extracting information from millions of web pages. Our approach relies heavily on machine learning algorithms to train the system to find and extract targeted information. For example, in one case we trained our system to find and extract job postings from the web, resulting in the world's largest database of job openings (over 600,000 jobs). This talk will describe machine learning algorithms for classifying and extracting information from web pages, including results of recent research on using unlabeled data and other kinds of information to improve learning accuracy.


Prabhakar Raghavan
Chief Scientist, Verity Inc.
Fri, Nov 17, 2000

Networks and Sub-Networks in the World-Wide Web

We discuss several structural properties of graphs arising from the world-wide web including the graph of hyperlinks, and the graph induced by connections between distributed search servents. We review a number of algorithmic investigations of the structure of these networks, and conclude by proposing stochastic models for the evolution of these networks.

Bio: Dr. Prabhakar Raghavan is Chief Scientist at Verity, and is also a consulting professor at Stanford University. Prior to Verity, he worked at the IBM Almaden Research Center, where his group worked on algorithms, web mining and search, complexity theory, security and cryptography, optimization, and reasoning about knowledge. Prabhakar received his undergraduate degree in electrical engineering from IIT Madras, and his PhD in computer science from UC Berkeley.


Jerome H. Friedman
Stanford University
Fri, Nov 10, 2000

Predictive Data Mining with Multiple Additive Regression Trees

Predicting future outcomes based on past observational data is a common application in data mining. The primary goal is usually predictive accuracy, with secondary goals being speed, ease of use, and interpretability of the resulting predictive model. New automated procedures for predictive data mining, based on "boosting" CART-style regression trees, are described. The goal is a class of fast "off-the-shelf" procedures for regression and classification that are competitive in accuracy with more customized approaches, while being fairly automatic to use (little tuning), and highly robust especially when applied to less than clean data. Tools are presented for interpreting and visualizing these multiple additive regression tree (MART) models.
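
A minimal sketch of the boosting loop behind such multiple additive regression tree models, using one-split regression "stumps" in place of full CART trees: each round fits a stump to the current residuals and adds a shrunken copy of it to the model. The data and settings are illustrative, not from the talk:

```python
# Hedged sketch of gradient-boosted regression stumps (MART-style loop).

def fit_stump(xs, residuals):
    """Best single-threshold split minimizing squared error (1-D inputs)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=50, shrink=0.1):
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, preds)]   # fit the residuals
        s = fit_stump(xs, resid)
        stumps.append(s)
        preds = [p + shrink * s(x) for p, x in zip(preds, xs)]
    return lambda x: sum(shrink * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]          # a step function to recover
model = boost(xs, ys)
```

The shrinkage factor is what makes the additive model robust: each stump corrects only a fraction of the remaining error.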


Nick Street
Department of Management Sciences
The University of Iowa
Thurs, Nov 9, 2000

Feature Selection in Unsupervised Learning via Multi-Objective Evolutionary Search

Feature subset selection is an important knowledge discovery task, because of the insight gained from determining relevant modeling variables and for the improved understandability, scalability, and perhaps accuracy of the resulting models. We explore the feature selection problem in unsupervised learning (clustering) where, unlike supervised learning, there is no single adequate measure of the quality of clusters built using a given feature subset. We frame the problem as a multi-objective optimization task by selecting a number of heuristic criteria to estimate model quality. The combinatorial search for good solutions is performed using ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions approximating the multi-dimensional Pareto front. We evaluate our approach with both synthetic and real-world data sets of moderate to high dimension. Our results show promise in finding Pareto-optimal solutions through which we can identify significant feature subsets and an appropriate number of clusters.
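
The Pareto-front bookkeeping such a multi-objective search maintains can be sketched as follows: a solution survives only if no other solution dominates it on every objective. The candidate feature subsets and their two quality scores (both to be maximized) are invented for illustration:

```python
# Hedged sketch of Pareto-front maintenance for multi-objective selection.

def dominates(a, b):
    """a dominates b if a is >= everywhere and > somewhere (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(cands):
    """Return the names of all non-dominated candidates."""
    return [name for name, sc in cands.items()
            if not any(dominates(sc2, sc) for n2, sc2 in cands.items() if n2 != name)]

candidates = {
    ("f1",): (0.9, 0.2),          # (cluster quality, compactness): assumed scores
    ("f1", "f2"): (0.7, 0.7),
    ("f2",): (0.2, 0.9),
    ("f1", "f3"): (0.6, 0.6),     # dominated by ("f1", "f2")
}
front = pareto_front(candidates)
```

An evolutionary search like ELSA keeps a diverse population approximating this front rather than collapsing to a single "best" subset.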


Ray Mooney
Department of Computer Sciences
University of Texas - Austin
Fri, Oct 27, 2000

Text Mining with Information Extraction

Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents. An IE system is therefore capable of transforming a corpus of unstructured texts or web pages into a structured database. Our previous work has focused on using machine learning methods to automatically construct IE systems from training sets of manually annotated documents. Our current research focuses on a form of text mining that extracts a database from a document corpus using a learned IE system and then mines this database for interesting patterns using traditional rule-induction methods. The mined rules can in turn be used to improve the accuracy of information extraction. We will present encouraging results from applying this approach to a corpus of computer job postings from the Usenet newsgroup


Raghu Ramakrishnan
Professor and Vilas Associate
Department of Computer Sciences, UW-Madison
Fri, Oct 13, 2000

Exploratory Analysis of Large Datasets

Exploratory data analysis has a long and rich history, especially in the AI and Statistics communities. Recently, it has gained a lot of attention under the name "Data Mining", with the new twist being an increased emphasis on analyzing large datasets. In this talk, I will discuss applications that motivate data mining and consider some new challenges that arise as a consequence of considering large datasets.


Jaideep Srivastava
Department of Computer Science and Engineering
University of Minnesota
Fri, Sept 29, 2000

Web Mining: A Challenging Frontier for Data Mining

The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. The Web is revolutionizing the way businesses interact with each other (B2B) and with customers (B2C). It has introduced entirely new ways of doing commerce, including, for example, auctions and reverse auctions. It has also made it imperative for organizations and companies to optimize their electronic business. Knowledge about the Web user is fundamental to the establishment of viable e-business solutions. Web mining is the application of data mining techniques to acquire this knowledge for e-business. Typical concerns in e-business include improved cross-sells, up-sells, personalized ads, targeted assortments, improved conversion rates, and measurement of the effectiveness of actions.

This talk will


Peter Haas
IBM Almaden Research Center
Mon, Sept 25, 2000

Techniques for Online Exploration of Large Object-Relational Databases

We review some recent techniques for exploring large object-relational databases in an interactive online manner. The idea is to provide continuously updated running estimates of the final query results to the user, along with an indication of the precision of the estimates. The user can then halt the query as soon as the answer is sufficiently precise; the time required to obtain an acceptable approximate answer can be faster by orders of magnitude than the time needed to completely process the query. We focus on methods for online processing of complex aggregation queries and also discuss methods for online visualization and online display of a large set of result records. We briefly relate our work to more traditional OLAP (Online Analytical Processing) techniques and to static sampling in databases.
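
The running-estimate idea can be sketched as follows: as rows stream in, maintain an online mean and variance (Welford's update) and report a CLT-based 95% half-width, so the user could stop once the interval is tight enough. The data column and reporting interval are invented for illustration:

```python
import math
import random

# Hedged sketch of online aggregation: running AVG estimate with a
# continuously updated precision indicator.

random.seed(2)
rows = [random.gauss(100.0, 15.0) for _ in range(10000)]   # hypothetical column

def running_estimates(rows, report_every=1000):
    n = mean = m2 = 0.0
    reports = []
    for x in rows:
        n += 1
        delta = x - mean              # Welford's online mean/variance update
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            half = 1.96 * math.sqrt((m2 / (n - 1)) / n)   # ~95% half-width
            reports.append((int(n), mean, half))
    return reports

reports = running_estimates(rows)
```

Each report is (rows seen, current estimate, precision bound); the shrinking bound is what lets a user halt the query early with a quantified error.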


Jude Shavlik
Departments of Computer Sciences and Biostatistics & Medical Informatics
University of Wisconsin - Madison
Fri, Aug 4, 2000

Using Diverse Evidence Sources to Uncover Coordinately Controlled Genes via Machine Learning and Dynamic Programming

Now that the complete genomes of numerous organisms have been determined, a key problem in computational molecular biology is uncovering the relationships that exist among the genes in each organism and the regulatory mechanisms that control their operation. We are developing computational methods for discovering such regulatory mechanisms and relationships. In our first project, we have developed a computational approach for predicting operons in the genomes of prokaryotic organisms. Operons are contiguous genes on a chromosome that are "turned on" or "shut off" as a unit. Our approach uses machine-learning methods to induce predictive models for this task from a rich variety of data types including DNA sequence data, gene-expression laboratory data, and handcrafted functional annotations associated with genes. We use multiple learned models that individually predict promoters, terminators, and operons themselves. A key part of our approach is a dynamic-programming method that uses our individual operon predictions to map every known and putative gene in a given genome into its most probable operon. We evaluate our approach using data from the E. coli K-12 genome.

This is joint work with Joseph Bockhorst, Mark Craven, Jeremy Glasner, and David Page. No prior knowledge of molecular biology will be assumed of the audience.


Haym Hirsh
Computer Science Department
Rutgers University
Thurs, Aug 3, 2000

Integrating Multiple Sources of Information in Text Classification

One popular class of supervised learning problems concerns the task of classifying items that are comprised primarily of text. Such problems can occur when assessing the relevance of a Web page to a given user, assigning topic labels to technical papers, and assessing the importance of a user's email. Most commonly this task is performed by learning, from a corpus of labeled textual training documents, procedures for classifying further unlabeled documents. This talk presents an approach to corpus-based text classification based on WHIRL, a database system that augments traditional relational database technology with textual-similarity operations developed in the information retrieval community. Not only does the approach perform competitively on traditional text classification tasks; we also show that it enables the incorporation of a range of sources of information into the classification process in a fairly robust and general fashion.


Dharmendra S. Modha
IBM Almaden Research Center
Wed, Apr 5, 2000

Clustering Hypertext with Applications to Web Searching

Clustering separates unrelated documents and groups related documents, and is useful for discrimination, disambiguation, summarization, organization, and navigation of unstructured collections of hypertext documents. We propose a novel clustering algorithm that clusters hypertext documents using words (contained in the document), out-links (from the document), and in-links (to the document). The algorithm automatically determines the relative importance of words, out-links, and in-links for a given collection of hypertext documents. We annotate each cluster using six information nuggets: summary, breakthrough, review, keywords, citation, and reference. These nuggets constitute high-quality information resources that are representative of the content of the clusters, and are extremely effective in compactly summarizing and navigating the collection of hypertext documents. We employ web searching as an application to illustrate our results. Anecdotally, when applied to the documents returned by AltaVista in response to the query "abduction", our algorithm separates documents about "alien abduction" from those about "child abduction."

This is joint work with W. Scott Spangler and will appear at the ACM Hypertext 2000 conference.
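
The core ingredient described above, similarity that combines words, out-links, and in-links, can be sketched as a weighted blend of per-facet cosine similarities. The toy documents and the equal weighting are invented (the actual algorithm determines the relative importance of the three facets from the collection):

```python
import math

# Hedged sketch: hypertext similarity as a blend of word, out-link, and
# in-link cosine similarities.

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def blended_similarity(d1, d2, weights=(1 / 3, 1 / 3, 1 / 3)):
    parts = ("words", "out_links", "in_links")
    return sum(w * cosine(d1[p], d2[p]) for w, p in zip(weights, parts))

doc_a = {"words": {"alien": 2, "abduction": 3}, "out_links": {"ufo.org": 1}, "in_links": {"x.com": 1}}
doc_b = {"words": {"alien": 1, "abduction": 1}, "out_links": {"ufo.org": 1}, "in_links": {"y.com": 1}}
doc_c = {"words": {"child": 2, "abduction": 1}, "out_links": {"amber.org": 1}, "in_links": {"z.com": 1}}

sim_ab = blended_similarity(doc_a, doc_b)
sim_ac = blended_similarity(doc_a, doc_c)
```

Here the shared out-link pulls the two "alien abduction" pages together even though all three documents share the word "abduction", which is the intuition behind mixing the three evidence sources.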


Last updated on Spring 2006 by