
Large-Scale Data Mining

CS 395T
Unique Number: 49460

Course Announcement

Spring 2000
M-W 4:00-5:30pm
CPE 2.206

Professor: Inderjit Dhillon


Office: Taylor Hall 5.148
Office Hours: Wed 10:00-11:00am

TA: Shailesh Kumar


Office: ENS 518
Office Hours: Thurs 10am-1pm

Paper Readings

Class Projects

Current Projects.
Sample project descriptions and associated resources.

Handouts

Course Information (contains grading information), handed out on Jan 19.
Class Survey, Jan 19.

Relevant Books (on reserve in PCL)

Pattern Classification and Scene Analysis by R. Duda and P. Hart,
Wiley-Interscience, 1973. An old classic. The first six chapters are
outstanding.
Foundations of Statistical Natural Language Processing by C.
Manning and H. Schütze, MIT Press, 1999. A recent book with detailed
treatment of some aspects of information retrieval.
Lectures

Lecture 1 - Introduction, syllabus.


Lecture 2 - Finding good "hubs" and "authorities" for broad-topic
queries. Material from:
Authoritative sources in a hyperlinked environment by Jon
Kleinberg.
Improved Algorithms for Topic Distillation in a Hyperlinked
Environment by Krishna Bharat and Monika Henzinger.
Lecture 3 - Review of basic linear algebra (vectors, norms,
eigenvalues/eigenvectors).
Lecture 4 - Singular Value Decomposition, proof that the hub and
authority vectors converge to the dominant singular vectors,
Vector-Space Models for text.
Lecture 5 - Latent Semantic Indexing for query retrieval.
Lecture 6 in two parts: 1 and 2 - Examples illustrating Latent
Semantic Indexing.
Lecture 7 - First lecture on Clustering.
Lecture 8 - Clustering Algorithms (download the MATLAB code for
the clustering demo).
Lecture 9 - Clustering (k-means).
Lecture 10 - Graph Partitioning. Also see lecture notes 1 & 2 by Jim
Demmel.
Lecture 11 - Classification (k-nearest neighbor, probabilistic
models, naive Bayes).
Lecture 12 - Classification (Maximum Likelihood Classifiers).
Lecture 13 - EM for Mixture Model Density Estimation.
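Lectures 2 and 4 above cover Kleinberg's hub/authority iteration and the claim that it converges to the dominant singular vectors of the link matrix. A minimal sketch of that iteration (the toy graph and all values below are invented for illustration, not taken from the lectures):

```python
import numpy as np

# Toy link graph: adj[i, j] = 1 if page i links to page j.
# Pages 0 and 1 both point at pages 2 and 3, so they should emerge
# as hubs, and 2 and 3 as authorities.
adj = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
], dtype=float)

h = np.ones(4)          # hub scores
a = np.ones(4)          # authority scores
for _ in range(50):     # power iteration (Kleinberg's I and O operations)
    a = adj.T @ h       # good authorities are pointed at by good hubs
    h = adj @ a         # good hubs point at good authorities
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

print(np.round(a, 3))   # authority scores: page 3 highest, then 2
print(np.round(h, 3))   # hub scores: pages 0 and 1 highest
```

After convergence, `a` and `h` match (up to sign) the dominant right and left singular vectors of the adjacency matrix, which is exactly the connection Lecture 4 proves.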

Material to be covered

Mathematical preliminaries - basics of linear algebra.


SVD (Singular Value Decomposition) and its use in indexing
documents. For example, Latent Semantic Indexing (LSI).
LSI page at Bellcore.
LSI page at Univ. of Tennessee, Knoxville.
Matrices, Vector Spaces and Information Retrieval by Michael
W. Berry, Zlatko Drmac, Elizabeth R. Jessup.
Clustering algorithms (agglomerative clustering, graph-based
algorithms, k-means).
Classification algorithms (linear discriminant analysis).
Focused Crawling of the WWW.
Focused Crawling: A New Approach to Topic-Specific Web
Resource Discovery by Soumen Chakrabarti, Martin van den
Berg and Byron Dom.
Data Visualization (Self-Organizing Maps (SOMs), Class-Preserving
Projections).
Class Visualization of High-Dimensional Data with
Applications by Inderjit Dhillon, Dharmendra Modha, Scott
Spangler, 1999. Free Software is available here.
XGobi is a system for multivariate data visualization by
Deborah Swayne, Di Cook, and Andreas Buja at Bellcore. The same
page contains XGvis, which can draw discrete graphs using
MDS (Multidimensional Scaling) and was developed by Andreas
Buja, Deborah F. Swayne, Michael L. Littman, and Nathaniel Dean.
Free Software is available from the provided link.
WEBSOM can plot 2-d maps of text documents using
Kohonen's Self-Organizing Maps for Internet Exploration. The
above link has a demo for visually browsing newsgroup data.
Support Vector Machines (SVMs) and their application to document
classification.
Graph Partitioning with applications to Image Segmentation.
Lecture notes 1 & 2 on graph partitioning by Jim Demmel.
Normalized Cuts and Image Segmentation by Jianbo
Shi and Jitendra Malik.
Motion Segmentation and Tracking Using Normalized
Cuts by Jianbo Shi and Jitendra Malik.
The METIS Graph Partitioning Package.
SVD in face recognition.
Papers and Faces Database by Larry Sirovich.
Eigenfaces and Face Recognition at the MIT Media Lab.
Eigenfaces vs. Fisherfaces: Recognition Using Class Specific
Linear Projection by Peter Belhumeur, João Hespanha, and
David Kriegman, July 1997.
Analyzing the graph of the WWW (hubs and authorities, the
CLEVER project at IBM, PageRank at Google)
Authoritative sources in a hyperlinked environment by Jon
Kleinberg.
The CLEVER project at IBM Almaden.
Hypersearching the Web by Members of the CLEVER project.
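As a small illustration of the SVD/LSI item in the list above, here is a sketch of latent semantic indexing on a made-up term-document matrix (the terms, counts, query, and choice of `k` are all invented for this example):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Docs 0-1 are about "data mining", docs 2-3 about cars.
A = np.array([
    [2., 1., 0., 0.],   # "data"
    [1., 2., 0., 0.],   # "mining"
    [0., 0., 3., 1.],   # "car"
    [0., 0., 1., 3.],   # "engine"
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # latent dimensions kept
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation; by Eckart-Young its Frobenius error equals
# the root-sum-square of the discarded singular values.
Ak = Uk @ np.diag(sk) @ Vk

# Documents and a folded-in query in the k-dimensional latent space.
docs_k = np.diag(sk) @ Vk               # one column per document
q = np.array([1., 1., 0., 0.])          # query: "data mining"
q_k = Uk.T @ q

# Cosine similarity in latent space; docs 0-1 score highest here.
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0)
                           * np.linalg.norm(q_k))
print(np.round(sims, 3))
```

In the reduced space the two "data mining" documents collapse onto the same latent direction as the query, which is the effect LSI exploits for retrieval.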

Related Courses

Stanford's CS 349, Data Mining, Search, and the World Wide Web,
Fall 1998.
UC Berkeley's CS 294-7, Large Datasets, Fall 1999.
UT Austin ECE course EE 380L, A Practicum in Data
Mining, Fall 1999.
Princeton's CIS 700/702, Information Retrieval, ?.
The Data Mining Lab (DML) is led by Prof. Inderjit Dhillon. It is
closely affiliated with the Machine Learning Research Group
(MLRG) (led by Prof. Mooney) and the Intelligent Data Exploration
and Analysis Laboratory (IDEAL) (led by Prof. Ghosh of ECE). For
applications in bioinformatics, the group closely collaborates with Prof.
Marcotte who is a faculty member in the Chemistry/Biochemistry
department and the Center for Computational Biology and
Bioinformatics (CCBB).
The Data Mining Lab at UT Austin is focused on the analysis of very
large data sets, especially those that arise in the application areas of text
mining and bioinformatics. The emphasis is on finding sound,
theoretically-motivated algorithms for the central tasks in data mining,
such as high-dimensional clustering, classification algorithms and data
visualization.
The current focus of the group is on uncovering the latent
low-dimensional structure that is often inherent in high-dimensional data. In
many important applications, such as text mining and face recognition,
the data matrices that arise are sparse and non-negative. Thus it is
natural to seek low-dimensional approximations that preserve these
properties -- sparsity in approximations implies economy in
representation while non-negativity enhances interpretation (note that
traditional methods such as SVD and PCA do not preserve these
properties).
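The point about the SVD is easy to see on a tiny example: for a nonnegative matrix the leading singular vector can be chosen entrywise nonnegative (Perron-Frobenius), and when it is strictly positive every later singular vector, being orthogonal to it, must mix signs. A sketch with an invented matrix:

```python
import numpy as np

# A small nonnegative matrix (entries invented for illustration).
A = np.array([
    [2., 1., 0.],
    [1., 2., 1.],
    [0., 1., 2.],
])

U, s, Vt = np.linalg.svd(A)

# The leading left singular vector has entries of a single sign...
u1 = U[:, 0]
print(np.round(u1, 3))

# ...but the second must contain both signs, so any rank-2 truncation
# U[:, :2] is no longer a nonnegative (or sparse) factorization.
u2 = U[:, 1]
print(np.round(u2, 3))
```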
With the above goals in mind, the lab has recently been exploring the
application of information theory to data mining tasks. Information
Theory provides a natural way of dealing with non-negative data
vectors by treating them as probability vectors. Problems such as
clustering can then be posed as optimization problems in information
theory, such as maximizing mutual information. As an application to
text mining, such an approach has been shown to reveal the semantic
similarity of words thus leading to substantial reduction in classifier
complexity and increased accuracy in document classification when
training data is sparse. Further directions currently being explored
include: (a) information-theoretic clustering and approximation of
higher order non-negative tensors (that often arise in applications as
multidimensional contingency tables), and (b) new algorithms for
low-rank non-negative matrix factorization.
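One standard algorithm in the spirit of item (b), used here purely as an illustration, is the Lee-Seung multiplicative update for least-squares NMF: because each update multiplies the current factor by a nonnegative ratio, W and H stay nonnegative throughout (the matrix sizes and random data below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

A = rng.random((6, 5))        # synthetic nonnegative data matrix
k = 2                         # target rank

# Strictly positive random initialization of the factors.
W = rng.random((6, k)) + 0.1
H = rng.random((k, 5)) + 0.1

err0 = np.linalg.norm(A - W @ H)   # initial fit

eps = 1e-10                   # guards against division by zero
for _ in range(200):
    # Lee-Seung multiplicative updates for min ||A - W H||_F^2.
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(A - W @ H)
print(round(err0, 3), round(err, 3))   # the fit only improves
```

Unlike a truncated SVD of the same matrix, the resulting W and H are entrywise nonnegative, which is the interpretability property the paragraph above emphasizes.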
The Data Mining Lab has disseminated publications, software and
results for document clustering, clustering of gene expression data in
bioinformatics and multidimensional data visualization.
