AI Technical Report Abstracts


03-299

Yeh, Peter, Bruce Porter, and Ken Barker. "Transformation Rules for Knowledge-Based Pattern Matching." The University of Texas at Austin, Department of Computer Sciences. AI Technical Report 03-299. December 2003. 32 pages.

Download


03-302

Stanley, Kenneth. "Learning Concept Drift with a Committee of Decision Trees." The University of Texas at Austin, Department of Computer Sciences. AI Technical Report 03-302. September 2003. 14 pages.

Concept drift occurs when a target concept changes over time. We present a new method for learning shifting target concepts during concept drift. The method, called Concept Drift Committee (CDC),uses a weighted committee of hypotheses that votes on the current classification. When a committee member's voting record drops below a minimum threshold, the member is forced to retire. A new committee member then takes the open place on the committee. The algorithm is compared to a leading algorithm on several benchmarks. The results indicate that using a committee to track drift has several advantages over traditional window-based approaches.

Download


03-303

Gomez, Faustino J. "Robust Non-Linear Control through Neuroevolution." The University of Texas at Austin, Department of Computer Sciences. AI Technical Report 03-303. August 2003. 150 pages.

Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning approaches have made progress in such problems, but have so far not scaled well. Neuroevolution, has improved upon conventional reinforcement learning, but has still not been successful in full-scale, non-linear control problems. This dissertation develops a methodology for solving real world control tasks consisting of three components: (1) an efficient neuroevolution algorithm that solves difficult non-linear control tasks by coevolving neurons, (2) an incremental evolution method to scale the algorithm to the most challenging tasks, and (3) a technique for making controllers robust so that they can transfer from simulation to the real world. The method is faster than other approaches on a set of difficult learning benchmarks, and is used in two full-scale control tasks demonstrating its applicability to real world problems.

Download


03-304

Stone, Peter, Kurt Dresner, Selim T. Erdogan, Peggy Fidelman, Nicholas K. Jong, Nate Kohl, Gregory Kuhlmann, Ellie Lin, Mohan Sridharan, Daniel Stronger, and Gurushyam Hariharan. "UT Austin Villa 2003: A New RoboCup Four-Legged Team." The University of Texas at Austin, Department of Computer Sciences. AI Technical Report 03-304. October 6, 2003. 76 pages.

The UT Austin Villa RoboCup 2003 Four-Legged Team was a new entry in the ongoing series of RoboCup legged league competitions. The team development began in mid-January of 2003, at which time none of the team members had any familiarity with the Aibos. Without using any RoboCup-related code from other teams, we entered a team in the American Open competition at the end of April, and met with some success at the annual RoboCup competition that took place in Padova, Italy at the beginning of July. In this report, we describe both our development process and the technical details of its end result, the UT Austin Villa team. The main contributions of this paper are (i) a roadmap for new teams entering the competition who are starting from scratch, and (ii) full documentation of the algorithms behind our approach with the goal of making them fully replicable.

Download


03-305

Bilenko, Mikhail. "Learnable Similarity Functions and Their Applications to Record Linkage and Clustering." The University of Texas at Austin, Department of Computer Sciences. AI Technical Report 03-305. November 26, 2003. 47 pages.

Many machine learning tasks require similarity functions that estimate likeness between observations. Similarity computations are particularly important for clustering and record linkage algorithms that depend on accurate estimates of the distance between datapoints. However, standard measures such as string edit distance and Euclidean distance often fail to capture an appropriate notion of similarity for a particular domain or dataset. This problem can be alleviated by employing learnable similarity functions that adapt using training data. In this proposal, we introduce two adaptive string similarity measures: (1) Learnable Edit Distance with Affine Gaps, and (2) Learnable Vector-Space Similarity Based on Pairwise Classification. These similarity functions can be trained using a corpus of labeled pairs of equivalent and non-equivalent strings. We illustrate the accuracy improvements obtained with these measures using MARLIN, our system for record linkage in databases that learns to combine adaptive and static string similarity functions in a two-level learning framework.

Obtaining useful training examples for learnable similarity functions can be problematic due to scarcity of informative similar and dissimilar object pairs. We propose two strategies, Static-Active Selection and Weakly-Labeled Selection, that facilitate efficient training data collection for record linkage. Additionally, we describe a method for combining seeding with Euclidean distance learning for semi-supervised k-means clustering. Experimental evaluation demonstrates that our method outperforms unsupervised clustering and semi-supervised clustering that employs seeding or metric learning separately.

In future research, we intend to pursue several directions in developing accurate learnable similarity functions and applying them to record linkage and clustering problems. This work will involve improving the proposed string similarity functions as well as introducing several novel approaches to adaptive string distance computation. We also plan to extend our initial work on learnable similarity functions for clustering, particularly for high-dimensional data. Finally, we will investigate the utility of various active learning strategies for learning similarity functions, as well as extend the preliminary work on static-active selection of training pairs.

Download


03-306

Melville, Prem. "Creating Diverse Ensemble Classifiers." The University of Texas at Austin, Department of Computer Sciences. AI Technical Report 03-306. October 2003. 57 pages.

Ensemble methods like Bagging and Boosting which combine the decisions of multiple hypotheses are some of the strongest existing machine learning methods. The diversity of the members of an ensemble is known to be an important factor in determining its generalization error. We present a new method for generating ensembles, DECORATE (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples), that directly constructs diverse hypotheses using additional artificially-constructed training examples. The technique is a simple, general meta-learner that can use any strong learner as a base classifier to build diverse committees. Experimental results using decision-tree induction as a base learner demonstrate that this approach consistently achieves higher predictive accuracy than both the base classifier and Bagging. DECORATE also obtains higher accuracy than Boosting early in the learning curve when training data is limited. We propose to show that DECORATE can also be effectively used for (1) active learning, to reduce the number of training examples required to achieve high accuracy; (2) exploiting unlabeled data to improve accuracy in a semi-supervised learning setting; (3) combining active learning with semi-supervision for improved results; (4) obtaining better class membership probability estimates; (5) reducing the error of regressors; and (6) improving the accuracy of relational learners.

Download


03-307

Basu, Sugato. "Semi-supervised Clustering: Learning from Limited User Feedback." The University of Texas at Austin, Department of Computer Sciences. AI Technical Report 03-307. November 2003. 41 pages.

In many machine learning domains (e.g. text processing, bioinformatics), there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate. Consequently, semi-supervised learning, learning from a combination of both labeled and unlabeled data, has become a topic of significant recent interest. In the proposed thesis, our research focus is on semi-supervised clustering, which uses a small amount of supervised data in the form of class labels or pairwise constraints on some examples to aid unsupervised clustering. Semi-supervised clustering can be either search-based, i.e., changes are made to the clustering objective to satisfy user-specified labels/constraints, or similarity-based, i.e., the clustering similarity metric is trained to satisfy the given labels/constraints. Our main goal in the proposed thesis is to study search-based semi-supervised clustering algorithms and apply them to different domains.

In our initial work, we have shown how supervision can be provided to clustering in the form of labeled data points or pairwise constraints. We have also developed an active learning framework for selecting informative constraints in the pairwise constrained semi-supervised clustering model, and proposed a method for unifying search-based and similarity-based techniques in semi-supervised clustering.

In this thesis, we want to study other aspects of semi-supervised clustering. Some of the issues we want to investigate include: (1) effect of noisy, probabilistic or incomplete supervision in clustering; (2) model selection techniques for automatic selection of number of clusters in semi-supervised clustering; (3) ensemble semi-supervised clustering. In our work so far, we have mainly focussed on generative clustering models, e.g. KMeans and EM, and ran experiments on clustering low-dimensional UCI datasets or high-dimensional text datasets. In future, we want to study the effect of semi-supervision on other clustering algorithms, especially in the discriminative clustering and online clustering framework. We also want to study the effectiveness of our semi-supervised clustering algorithms on other domains, e.g., web search engines (clustering of search results), astronomy (clustering of Mars spectral images) and bioinformatics (clustering of gene microarray data).

Download


Questions to trcenter@cs.utexas.edu