Large-Scale Data Mining

CS 395T

Unique Number: 49460

Course Announcement

Spring 2000
M-W 4:00-5:30pm
CPE 2.206

Professor: Inderjit Dhillon (send email)
Office: Taylor Hall 5.148
Office Hours: Wed 10:00-11:00am

TA: Shailesh Kumar (send email)
Office: ENS 518
Office Hours: Thurs 10am-1pm

Paper Readings

Class Projects

  • Current Projects.
  • Sample project descriptions and associated resources.

Handouts

  • Course Information (contains grading information) handed out on Jan 19.
  • Class Survey, Jan 19.

Relevant Books (on reserve in PCL)

  • Pattern Classification and Scene Analysis by R. Duda and P. Hart, Wiley-Interscience, 1973. An old classic. The first six chapters are outstanding.
  • Foundations of Statistical Natural Language Processing by C. Manning and H. Schütze, MIT Press, 1999. Recent book with detailed treatment of some aspects of information retrieval.

Lectures

  • Lecture 1 - Introduction, syllabus.
  • Lecture 2 - Finding good "hubs" and "authorities" for broad-topic queries. Material from:
      • Authoritative sources in a hyperlinked environment by Jon Kleinberg.
      • Improved Algorithms for Topic Distillation in a Hyperlinked Environment by Krishna Bharat and Monika Henzinger.
  • Lecture 3 - Review of basic linear algebra (vectors, norms, eigenvalues/eigenvectors).
  • Lecture 4 - Singular Value Decomposition; proof that the hub and authority vectors converge to the dominant singular vectors; vector-space models for text.
  • Lecture 5 - Latent Semantic Indexing for query retrieval.
  • Lecture 6 (in two parts, 1 and 2) - Examples illustrating Latent Semantic Indexing.
  • Lecture 7 - First lecture on Clustering.
  • Lecture 8 - Clustering Algorithms (download the MATLAB code for the clustering demo).
  • Lecture 9 - Clustering (k-means).
  • Lecture 10 - Graph Partitioning. Also see lecture notes 1 & 2 by Jim Demmel.
  • Lecture 11 - Classification (k-nearest neighbor, probabilistic models, naive Bayes).
  • Lecture 12 - Classification (Maximum Likelihood Classifiers).
  • Lecture 13 - EM for Mixture Model Density Estimation.
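
As a rough illustration of the result listed under Lecture 4, the sketch below runs the hub/authority iteration from Lecture 2 on a small link graph and checks that the authority vector ends up as an eigenvector of A^T A, that is, a dominant right singular vector of the link matrix A. The 4-page graph and the function names (normalize, hits, ata_times) are made up for this example and are not taken from the lecture notes or the cited papers.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hits(adj, iterations=100):
    """Hub/authority iteration; adj[i][j] = 1 if page i links to page j."""
    n = len(adj)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority update: a = A^T h (sum hub scores of pages linking in).
        auths = normalize(
            [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        )
        # Hub update: h = A a (sum authority scores of pages linked to).
        hubs = normalize(
            [sum(adj[i][j] * auths[j] for j in range(n)) for i in range(n)]
        )
    return hubs, auths


def ata_times(adj, v):
    """Compute (A^T A) v without forming A^T A explicitly."""
    n = len(adj)
    av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
    return [sum(adj[i][j] * av[i] for i in range(n)) for j in range(n)]


# Hypothetical link graph: page 0 links to 1 and 2, page 1 to 2,
# page 2 to 0 and 3, page 3 to 2.
ADJ = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]

hubs, auths = hits(ADJ)

# If auths is an eigenvector of A^T A, then (A^T A) auths = lam * auths,
# where lam is the Rayleigh quotient (auths has unit length).
w = ata_times(ADJ, auths)
lam = sum(wi * ai for wi, ai in zip(w, auths))
residual = max(abs(wi - lam * ai) for wi, ai in zip(w, auths))

print("authority scores:", [round(a, 4) for a in auths])
print("hub scores:      ", [round(h, 4) for h in hubs])
print("dominant singular value of A (estimate):", math.sqrt(lam))
print("eigenvector residual:", residual)
```

The fixed iteration count is a simplification. The iteration converges quickly here because the two largest eigenvalues of A^T A for this graph are well separated; a graph with nearly equal leading eigenvalues would need more iterations or an explicit convergence test.
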

Material to be covered

  • Mathematical preliminaries - basics of linear algebra.
  • SVD (Singular Value Decomposition) and its use in indexing documents. For example, Latent Semantic Indexing (LSI).
      • LSI page at Bellcore.
      • LSI page at Univ. of Tennessee, Knoxville.
      • Matrices, Vector Spaces and Information Retrieval by Michael W. Berry, Zlatko Drmac, Elizabeth R. Jessup.
  • Clustering algorithms (agglomerative clustering, graph-based algorithms, k-means).
  • Classification algorithms (linear discriminant analysis).
  • Focused Crawling of the WWW.
      • Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery by Soumen Chakrabarti, Martin van den Berg and Byron Dom.
  • Data Visualization (Self-Organizing Maps (SOMs), Class-Preserving Projections).
      • Class Visualization of High-Dimensional Data with Applications by Inderjit Dhillon, Dharmendra Modha, Scott Spangler, 1999. Free software is available.
      • XGobi is a system for multivariate data visualization by Deborah Swayne, Di Cook, Andreas Buja at Bellcore. The same page contains XGvis, which can draw discrete graphs using MDS (Multidimensional Scaling) and was developed by Andreas Buja, Deborah F. Swayne, Michael L. Littman, Nathaniel Dean. Free software is available from the provided link.
      • WEBSOM can plot 2-d maps of text documents using Kohonen's Self-Organizing Maps for Internet Exploration. The above link has a demo for visually browsing newsgroup data.
  • Support Vector Machines (SVMs) and their application to document classification.
  • Graph Partitioning with applications to Image Segmentation.
      • Lecture notes 1 & 2 on graph partitioning by Jim Demmel.
      • Normalized Cuts and Image Segmentation by Jianbo Shi and Jitendra Malik.
      • Motion Segmentation and Tracking Using Normalized Cuts by Jianbo Shi and Jitendra Malik.
      • The METIS Graph Partitioning Package.
  • SVD in face recognition.
      • Papers and Faces Database by Larry Sirovich.
      • Eigenfaces and Face Recognition at the MIT Media Lab.
      • Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection by Peter Belhumeur, João Hespanha, and David Kriegman, July 1997.
  • Analyzing the graph of the WWW (hubs and authorities, the CLEVER project at IBM, PageRank at Google)
      • Authoritative sources in a hyperlinked environment by Jon Kleinberg.
      • The CLEVER project at IBM Almaden.
      • Hypersearching the Web by Members of the CLEVER project.

Related Courses

  • Stanford's CS 349, Data Mining, Search, and the World Wide Web, Fall 1998.
  • UC Berkeley's CS 294-7, Large Datasets, Fall 1999.
  • UT Austin ECE course EE 380L, A Practicum in Data Mining, Fall 1999.
  • Princeton's CIS 700/702, Information Retrieval, ?.