Spring 2000
CS 395T "Large-Scale Data Mining"
M-W 4-5:30pm, Welch 3.260
Prof. Inderjit Dhillon

Recent years have seen explosive growth in the raw data available electronically. The data occurs in various forms, e.g., text, images, video, or numeric data, both in the public domain and within private corporations. Data mining is the automatic discovery of interesting patterns and relationships in very large data sets.

This graduate course will focus on scalable algorithms for data mining. Special emphasis will be placed on information retrieval, especially for the World Wide Web. Topics covered will include (i) link analysis on the internet (like Google), (ii) content analysis of documents using SVD (Singular Value Decomposition), SVMs (Support Vector Machines), and linear discriminant analysis, (iii) image segmentation using graph partitioning, (iv) face detection in images, (v) clustering and classification algorithms, and (vi) visualization of high-dimensional data. We may study other application areas, such as bioinformatics, if there is sufficient interest and need.

A substantial portion of this class will consist of paper readings and research projects, for which students will be free to choose a well-defined problem. Emphasis will be on implementing algorithms that could lay the foundations of a usable system, such as a text categorizer. An elementary knowledge of linear algebra would be helpful but is not essential.

More course information is available at http://www.cs.utexas.edu/users/inderjit/courses/datamining.html