CS 391L Machine Learning
Project Suggestions

1 Page Project Proposal Due: Nov. 3, 2005
Final Project Report Due: 9AM December 17, 2005

General Information and Resources

These are just suggestions, feel free to propose your own idea and I encourage you to talk to me about your project ideas. The only real restriction is that projects should involve doing some research in machine learning rather than a writing a survey or discussion paper.

In addition to Weka Java ML code, there are Lisp systems available in my ml-code and ml-progs directories, C4.5 (/u/mooney/C4.5/) in C to which there is a Lisp interface in ml-progs/c4.5.lisp, FOIL (/u/mooney/foil/) and ML systems in C++ available in MLC++.

Also checkout data at Irvine ML Repository and data and program sources and other ML info at OFAI and NRL

Topics Related to UT ML Group Projects

DECORATE (see Prem Melville, melville@cs)

Evaluate the difference between using hard vs. soft labeling of artificial examples.
Extend Decorate to work for text categorization. The challenge here lies in dealing with high-dimensional data, and generating artificial training examples for text.
Test Decorate on base learners other than decision trees. Neural networks and SVMs are good alternatives.
Analyze the bias-variance decomposition of the error. Compare this to BV decompositions for Bagging and Boosting (see Webb's MLJ '00 paper for a similar study).

Active Feature-value Acquisition (see Prem Melville, melville@cs orMaytal Saar-Tsechansky, maytal.saar-tsechansky@mccombs.utexas.edu )

Extend the current Expected Utility (EU) framework to deal with the setting in which you have both missing feature values and missing class labels. One straightforward approach is to treat a missing class label as just another missing feature value.
Expected Utility relies on the estimation of the feature-value distributions and the model's performance (P(F_i,j=V_k) and A(.)). Explore if EU can benefit from alternative methods of estimating these values. For diagnostics analysis, examine the relationship between the quality of the estimation of the model's performance and of the feature-value distributions and EU's performance.
Current experiments on AFA assume that the test instances are completely specified. Extend the AFA framework to deal with incomplete test instances, where the missing values can also be acquired at a cost during test time.
An alternate approach to speeding up EU is to restrict the set of candidate queries to the most informative features, where a subset of features can be picked using a feature selection method. Compare this approach to random sampling (which is currently employed in Sampled Expected Utility).
Extend the AFA study to include data with numeric attributes. One simple approach is to convert numeric features to nominal features using a discretization technique
Test the EU framework for performance objectives other than simple accuracy. For example, one can introduce different misclassification costs.

Text Categorization

Applying Bayesian techniques to text categorization requires a probabilistic generative model of text. The so called "multinomial" and "multi-variate Bernoulli" are the most popular but there are other possibilities such as mixtures of Gaussians or Rayleigh distributions. Implement and compare these different probabilistic models for text categorization (see Sugato Basu, sugato@cs).
Lodhi et. al. (JMLR '02) recently suggested "SSK kernels" for text classification with SVM's. However, their experimental results for document classification were not showing much improvement over standard word/n-gram tokenization. Could string kernels be more effective for classification of *short* strings, e.g. text abstracts? Compare SSK kernels with word/n-gram bag-of-words on a corpus of short strings (possible datasets: webpage titles, abstracts from DMOZ, news abstracts, ...). (see Misha Bilenko, mbilenko@cs).

Information Extraction and Record Linkage (see Misha Bilenko, mbilenko@cs)

Many text collections have hierarchical organization. Methods like shrinkage that have been previously used for hierarchical classification (where labels form a hierarchy (mccallum:ml98, james:61) can be employed to improve the performance of IE methods in hierarchical collections. Develop and evaluate a method for performing IE on a hierarchical document corpus.
Several unsupervised information extraction methods have been proposed recently (grenager:acl05,agichtein:kdd04) that do not use labeled training data in the form of segmented sequences, but instead rely either on side information or model constraints. Many extensions of this work are possible, e.g., unsupervised models based on discriminative models (CRFs/RMNs). Propose and evaluate an alternative unsupervised information extraction model.
In record linkage, limiting the number of pairwise comparisons between records is a crucial efficiency step known as "blocking". Previous work on blocking has relied on hand-coded similarity functions. Implement a learnable, adaptive blocking method and evaluate its performance.
Several learnable string edit distances have been proposed recently for textual data (Bilenko KDD 03, Joachims TR04, McCallum UAI 05). Compare the accuracy of these methods for record linkage and analyze the relative performance.

Semantic Parsing (see Yuk Wah Wong, ywwong@cs)

Several machine translation systems have been used for semantic parsing in the ATIS domain (Papineni:Eurospeech97, Macherey:Eryurospeech01). Compare these methods in other domains (http://www.cs.utexas.edu/~ml/nldata.html) and analyze their performance.

Occam's Razor (see Ray Mooney)

Empirically explore the value of Occam's razor for one or more concept representations (e.g. decision trees, rules) by generating multiple consistent hypotheses with various levels of complexity across a variety of data sets and measuring the probability that the simpler hypothesis actually generalizes better (has less test error). See Pazzani (JAIR, 1994) for a similar limited study.

Inductive Learning From Examples

Construct a data set for a novel problem of interest to you and experiment with it with several of the available learning systems. Consider improvements to some system motivated by results on one of these datasets.
Experiment with various multiple model (voting) methods such as bagging or boosting applied to different learning methods.
Try applying various methods to the problem of text-categorization or modifying methods to perform better at this task.
Experiment or develop new methods for learning applied to recommender systems, such as recommending web-pages (like Syskill & Webert) or books (like Libra).
Implement and test recent methods for using unsupervised data with supervised data in order to perform supervised classification with minumum amount of sueprvision while exploiting information in unclassified data.
An implementation of the FRINGE algorithm for creating new features and using decision tree to learn DNF is available for experimentation and extension.
An implementation of ID5 (incremental ID3) is is available for experimentation and extension.
Implement and compare various methods for dealing with noisy data in ID3 such as chi-square, reduced-error pruning, minimum description length.
Implementations of a standard Bayesian probability approach to classification and an instance-base nearest-neighbor algorithm are available for comparison to rule-based approaches and extension and experimentation.
Develop or experiment with methods for learning Bayesian networks.
Experiment with and enhance my version of the AQ, PFOIL, or PGOLEM propositional rule-learning systems (extend language, deal with noise, missing data etc.).
Experiment with an efficient rule learning system called RIPPER, especially suited for problems with large number of features like text categorization.
Implement and experiment with a version of the Winnow system, a perceptron like system with provably faster convergence for domains with lots of irrelevant features.
Develop and experiment with various active-learning methods which actively select examples to use for training in order to learn more from fewer examples.
Implement and test a version of the CN2 system for learning decision lists (a variant of DNF).
Enhance our implementation of FOIL to handle more features such as recursion termination, pruning for noise, determinant literals.
Code for ILP systems GOLEM and FOIL is available for running experiments comparing them to each other and to a locally developed ILP system CHILLIN.
Experiment with partial matching or probabilistic interpretations of inductively learned rules and compare to a purely logical interpretation.
Add the ability to shift bias to an existing system and experiment. Such a system would start with a strong bias (e.g. conjunctive only) and then if this doesn't work shift to a more weakly biased language.
Implement and test a system that learns by questioning an oracle about the class of an example (like MARVIN and ALVIN).
Do a PAC analysis of an interesting concept language.
Implement, enhance, and/or experiment with some neural-network learning algorithms. Versions of the perceptron algorithm, backpropagation, competitive learning, feature-maps are available for extension and experimentation.
We have available a backpropagation that uses certainty factor functions to combine inputs rather than linear threshold units. Run some experiments comparing this system to normal backprop.
A couple systems for integrating decision tree and perceptron learning are available for experimentation and extension.
We have a number of ILP systems (Chillin, IFoil, Foidl) and some data sets available for running comparative experiments. Systems can also be extended in various ways.

Natural Language Learning

A local system CHILL that uses ILP to learn natural language parsers is available for experiments and extension. Developing data sets in foreign languages and using them to test Chill is a good topic.
A related local system WOLFIE that learns semantic lexicons (word meanings) is also available for experimentation and extension.
A local system RAPIER that use ILP-like techniques to learn rules for extracting information from natural language text is available for experimentation and extension.
Data is avaialable for other natural language learning problems are available such as learning morphology (past-tense) and word-sense disambiguation (6 senses of the word "line").

Learning by Observation and Discovery

Experiment with various algorithms for clustering textual documents. Consider applying them to book descriptions that are available from the LIBRA project.
Enhance COBWEB to handle numerical data and do some interesting experimentation.
Change COBWEB to deal with noise as described by Fisher in IJCAI-89.
Experiment with and extend the version of UNIMEM that I have.
Implement, experiment, and/or extend the AUTOCLASS Bayesian system for clustering.
Implement a small version of a scientific discovery system such as BACON, GLAUBER, STAHL, or DALTON.

Explanation-Based Learning

Develop rules for the EGGS system in some reasonably difficult domain and conduct experiments on learning macro-rules in this domain.
Build an EBL system for learning plans using STRIPS operators by explaining and generalizing observed sequences of operators.
Develop and experiment with methods for eliminating little used macro-rules or selectively learning macro-rules.
Develop and experiment with methods for learning heuristics for when to apply rules compared to learning macro-rules.
Develop and experiment with methods for limiting the use of learned rules to insure performance improvement.
Get the distributed versions of the PRODIGY or SOAR systems (available by FTP from CMU) and experiment with them.

Explanatory/Inductive Hybrids and Theory Refinement

FTP the code for the FOCL system from UC Irvine, which uses a background theory to guide a FOIL-like learner, and run experiments comparing to other theory refinement systems.
Experiment with and/or extend one of the locally developed systems for revising an existing theory to fit a set of data: EITHER and NEITHER (revises propositional theories), FORTE (revises first-order theories, i.e. logic programs), RAPTURE (revises probabilistic theories with certainty factors), and BANNER (revises parameters and structure of Bayesian networks).
Experiment and extend the locally developed DOLPHIN system (in Prolog), which combines FOIL and explanation-based learning to learn search heuristics for speeding-up logic programs. Useful for learning in planning, problem solving, and theorem proving.
Experiment and extend a locally developed system SCOPE for learning search heuristics for partial-order planning.
A local system ASSERT for using theory refinement for modelling student errors in an intelligent tutoring system is available for experimentation and extension.
Implement and test a version of KBANN for revising knowledge bases using neural network methods.

CS 391L Machine Learning Project Suggestions