CS 391L Machine Learning
1 Page Project Proposal Due: Nov. 3, 2005
Final Project Report Due: 9AM December 17, 2005
General Information and Resources
These are just suggestions, feel free to propose your own idea and I encourage
you to talk to me about your project ideas. The only real restriction is that
projects should involve doing some research in machine learning rather
than a writing a survey or discussion paper.
In addition to Weka Java ML
code, there are Lisp systems available in my ml-code and ml-progs directories, C4.5 (/u/mooney/C4.5/) in C to which there is a Lisp
ml-progs/c4.5.lisp, FOIL (/u/mooney/foil/) and ML systems in C++
available in MLC++.
Also checkout data at Irvine ML
Repository and data and program sources and other ML info at OFAI and NRL
Topics Related to UT ML Group Projects
DECORATE (see Prem Melville, melville@cs)
- Evaluate the difference between using hard vs. soft labeling of
- Extend Decorate to work for text categorization. The challenge here lies in
dealing with high-dimensional data, and generating artificial training
examples for text.
Test Decorate on base learners other than decision trees. Neural networks
and SVMs are good alternatives.
- Analyze the bias-variance decomposition of the error. Compare this
to BV decompositions for Bagging and Boosting (see Webb's MLJ '00
paper for a similar study).
Active Feature-value Acquisition (see Prem Melville, melville@cs orMaytal
Saar-Tsechansky, email@example.com )
Extend the current Expected Utility (EU) framework to deal with the setting
in which you have both missing feature values and missing class labels. One
straightforward approach is to treat a missing class label as just another
missing feature value.
- Expected Utility relies on the estimation of the feature-value
distributions and the model's performance (P(F_i,j=V_k) and A(.)). Explore if
EU can benefit from alternative methods of estimating these values. For
diagnostics analysis, examine the relationship between the quality of the
estimation of the model's performance and of the feature-value distributions
and EU's performance.
- Current experiments on AFA assume that the test instances are completely
specified. Extend the AFA framework to deal with incomplete test instances,
where the missing values can also be acquired at a cost during test time.
- An alternate approach to speeding up EU is to restrict the set of candidate
queries to the most informative features, where a subset of features can be
picked using a feature selection method. Compare this approach to random
sampling (which is currently employed in Sampled Expected Utility).
- Extend the AFA study to include data with numeric attributes. One simple
approach is to convert numeric features to nominal features using a
- Test the EU framework for performance objectives other than simple
accuracy. For example, one can introduce different misclassification costs.
- Applying Bayesian techniques to text categorization requires a
probabilistic generative model of text. The so called "multinomial" and
"multi-variate Bernoulli" are the most popular but there are other
possibilities such as mixtures of Gaussians or Rayleigh distributions.
Implement and compare these different probabilistic models for text
categorization (see Sugato Basu, sugato@cs).
Lodhi et. al. (JMLR '02) recently suggested "SSK kernels" for text
classification with SVM's. However, their experimental results for document
classification were not showing much improvement over standard word/n-gram
tokenization. Could string kernels be more effective for classification
of *short* strings, e.g. text abstracts? Compare SSK kernels with
word/n-gram bag-of-words on a corpus of short strings (possible datasets:
webpage titles, abstracts from DMOZ, news abstracts, ...).
(see Misha Bilenko, mbilenko@cs).
Information Extraction and Record Linkage (see Misha Bilenko, mbilenko@cs)
- Many text collections have hierarchical organization. Methods like
shrinkage that have been previously used for hierarchical classification
(where labels form a hierarchy (mccallum:ml98, james:61) can be employed
to improve the performance of IE methods in hierarchical collections.
Develop and evaluate a method for performing IE on a hierarchical
Several unsupervised information extraction methods have been proposed
recently (grenager:acl05,agichtein:kdd04) that do not use labeled
training data in the form of segmented sequences, but instead rely
either on side information or model constraints.
Many extensions of this work are possible, e.g., unsupervised models
based on discriminative models (CRFs/RMNs). Propose and evaluate an
alternative unsupervised information extraction model.
In record linkage, limiting the number of pairwise comparisons between records
is a crucial efficiency step known as "blocking". Previous work on blocking
has relied on hand-coded similarity functions. Implement a learnable, adaptive
blocking method and evaluate its performance.
Several learnable string edit distances have been proposed recently for textual
data (Bilenko KDD 03, Joachims TR04, McCallum UAI 05). Compare the accuracy of
these methods for record linkage and analyze the relative performance.
Semantic Parsing (see Yuk Wah Wong, ywwong@cs)
Several machine translation systems have been used for semantic parsing in the
ATIS domain (Papineni:Eurospeech97, Macherey:Eryurospeech01). Compare these
methods in other domains (http://www.cs.utexas.edu/~ml/nldata.html) and
analyze their performance.
Occam's Razor (see Ray Mooney)
- Empirically explore the value of Occam's razor for one or more concept
representations (e.g. decision trees, rules) by generating multiple consistent
hypotheses with various levels of complexity across a variety of data sets and
measuring the probability that the simpler hypothesis actually generalizes
better (has less test error). See Pazzani (JAIR, 1994) for a similar limited
Inductive Learning From Examples
Construct a data set for a novel problem of interest to you and experiment with
it with several of the available learning systems. Consider improvements to
some system motivated by results on one of these datasets.
Experiment with various multiple model (voting) methods such as bagging or
boosting applied to different learning methods.
Try applying various methods to the problem of text-categorization or modifying
methods to perform better at this task.
Experiment or develop new methods for learning applied to recommender systems,
such as recommending web-pages (like Syskill & Webert) or books (like
Implement and test recent methods for using unsupervised data with supervised
data in order to perform supervised classification with minumum amount of
sueprvision while exploiting information in unclassified data.
An implementation of the FRINGE algorithm for creating new features and using
decision tree to learn DNF is available for experimentation and extension.
An implementation of ID5 (incremental ID3) is is available for experimentation
Implement and compare various methods for dealing with noisy data in ID3 such
as chi-square, reduced-error pruning, minimum description length.
Implementations of a standard Bayesian probability approach to classification
and an instance-base nearest-neighbor algorithm are available for comparison to
rule-based approaches and extension and experimentation.
Develop or experiment with methods for learning Bayesian networks.
Experiment with and enhance my version of the AQ, PFOIL, or PGOLEM
propositional rule-learning systems (extend language, deal with noise, missing
Experiment with an efficient rule learning system called RIPPER, especially
suited for problems with large number of features like text categorization.
Implement and experiment with a version of the Winnow system, a perceptron
like system with provably faster convergence for domains with lots of
Develop and experiment with various active-learning methods which actively
select examples to use for training in order to learn more from fewer examples.
Implement and test a version of the CN2 system for learning decision lists
(a variant of DNF).
Enhance our implementation of FOIL to handle more features such as
recursion termination, pruning for noise, determinant literals.
Code for ILP systems GOLEM and FOIL is available for running experiments
comparing them to each other and to a locally developed ILP system CHILLIN.
Experiment with partial matching or probabilistic interpretations of inductively
learned rules and compare to a purely logical interpretation.
Add the ability to shift bias to an existing system and experiment. Such a
system would start with a strong bias (e.g. conjunctive only) and then if this
doesn't work shift to a more weakly biased language.
Implement and test a system that learns by questioning an oracle about the
class of an example (like MARVIN and ALVIN).
Do a PAC analysis of an interesting concept language.
Implement, enhance, and/or experiment with some neural-network learning
algorithms. Versions of the perceptron algorithm, backpropagation, competitive
learning, feature-maps are available for extension and experimentation.
We have available a backpropagation that uses certainty factor functions
to combine inputs rather than linear threshold units. Run some experiments
comparing this system to normal backprop.
A couple systems for integrating decision tree and perceptron learning are
available for experimentation and extension.
We have a number of ILP systems (Chillin, IFoil, Foidl) and some data sets
available for running comparative experiments. Systems can also be extended
in various ways.
Natural Language Learning
A local system CHILL that uses ILP to learn natural language parsers is
available for experiments and extension. Developing data sets in foreign
languages and using them to test Chill is a good topic.
A related local system WOLFIE that learns semantic lexicons (word meanings) is
also available for experimentation and extension.
A local system RAPIER that use ILP-like techniques to learn rules for
extracting information from natural language text is available for
experimentation and extension.
Data is avaialable for other natural language learning problems are available
such as learning morphology (past-tense) and word-sense disambiguation
(6 senses of the word "line").
Learning by Observation and Discovery
Experiment with various algorithms for clustering textual documents.
Consider applying them to book descriptions that are available
from the LIBRA project.
Enhance COBWEB to handle numerical data and do some interesting
Change COBWEB to deal with noise as described by Fisher in IJCAI-89.
Experiment with and extend the version of UNIMEM that I have.
Implement, experiment, and/or extend the AUTOCLASS Bayesian system for
Implement a small version of a scientific discovery system such as
BACON, GLAUBER, STAHL, or DALTON.
Develop rules for the EGGS system in some reasonably difficult domain
and conduct experiments on learning macro-rules in this domain.
Build an EBL system for learning plans using STRIPS operators
by explaining and generalizing observed sequences of operators.
Develop and experiment with methods for eliminating little used macro-rules or
selectively learning macro-rules.
Develop and experiment with methods for learning heuristics for when to
apply rules compared to learning macro-rules.
Develop and experiment with methods for limiting the use of learned rules
to insure performance improvement.
Get the distributed versions of the PRODIGY or SOAR systems (available
by FTP from CMU) and experiment with them.
Explanatory/Inductive Hybrids and Theory Refinement
FTP the code for the FOCL system from UC Irvine, which uses a background theory to guide a
FOIL-like learner, and run experiments comparing to other theory refinement systems.
Experiment with and/or extend one of the locally developed systems for revising
an existing theory to fit a set of data: EITHER and NEITHER (revises propositional
theories), FORTE (revises first-order theories, i.e. logic programs), RAPTURE
(revises probabilistic theories with certainty factors), and BANNER (revises
parameters and structure of Bayesian networks).
Experiment and extend the locally developed DOLPHIN system (in Prolog), which
combines FOIL and explanation-based learning to learn search heuristics for
speeding-up logic programs. Useful for learning in planning, problem solving,
and theorem proving.
Experiment and extend a locally developed system SCOPE for learning search
heuristics for partial-order planning.
A local system ASSERT for using theory refinement for modelling student errors
in an intelligent tutoring system is available for experimentation and extension.
Implement and test a version of KBANN for revising knowledge bases using
neural network methods.