Spring 2000
CS 395T
"Large-Scale Data Mining"
M-W 4-5:30pm
Welch 3.260
Prof. Inderjit Dhillon
Recent years have seen explosive growth in the raw data available
electronically. The data occurs in various forms, e.g., as text, image,
video or numeric data, both in public domains and in private corporations.
Data mining is the automatic discovery of interesting patterns and
relationships in very large data sets.
This graduate course will focus on scalable algorithms for data mining.
Special emphasis will be placed on information retrieval, especially for the
World Wide Web. Topics covered will include (i) link analysis on the
Internet (as in Google), (ii) content analysis of documents using SVD (Singular
Value Decomposition), SVMs (Support Vector Machines), and linear discriminant
analysis, (iii) image segmentation using graph partitioning, (iv) face
detection in images, (v) clustering and classification algorithms,
(vi) visualization of high-dimensional data, etc. We may study
other application areas, such as bioinformatics, if there is sufficient
interest and need.
A substantial portion of this class will consist of paper readings and research
projects, in which students will be free to choose a well-defined
problem. Emphasis will be on implementing algorithms that
could lay the foundations of a usable system, such as a text categorizer.
An elementary knowledge of linear algebra would be helpful but is not essential.
More course information is available at
http://www.cs.utexas.edu/users/inderjit/courses/datamining.html