Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

Dan Garrette, Jason Mielens, and Jason Baldridge

Developing natural language processing tools for low-resource languages often requires creating resources from scratch. While a variety of semi-supervised methods exist for training from incomplete data, there are open questions regarding what types of training data should be used and how much is necessary. We discuss a series of experiments designed to shed light on such questions in the context of part-of-speech tagging. We obtain timed annotations from linguists for the low-resource languages Kinyarwanda and Malagasy (as well as English) and evaluate how the amounts of various kinds of data affect performance of a trained POS-tagger. Our results show that annotation of word types is the most important, provided a sufficiently capable semi-supervised learning infrastructure is in place to project type information onto a raw corpus. We also show that finite-state morphological analyzers are effective sources of type information when few labeled examples are available.

Citation:

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-2013), pp. 583–592, 2013.

People

Dan Garrette

Ph.D. Alumni

dhg [at] cs utexas edu

Areas of Interest

Machine Learning, Natural Language Processing, Semi-Supervised Learning

Labs

Machine Learning