Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages (2013)
Dan Garrette, Jason Mielens, and Jason Baldridge
Developing natural language processing tools for low-resource languages often requires creating resources from scratch. While a variety of semi-supervised methods exist for training from incomplete data, there are open questions regarding what types of training data should be used and how much is necessary. We discuss a series of experiments designed to shed light on such questions in the context of part-of-speech tagging. We obtain timed annotations from linguists for the low-resource languages Kinyarwanda and Malagasy (as well as English) and evaluate how the amounts of various kinds of data affect performance of a trained POS-tagger. Our results show that annotation of word types is the most important, provided a sufficiently capable semi-supervised learning infrastructure is in place to project type information onto a raw corpus. We also show that finite-state morphological analyzers are effective sources of type information when few labeled examples are available.
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-2013) (2013), pp. 583–592.

Dan Garrette Ph.D. Alumni dhg [at] cs utexas edu