Learning a Part-of-Speech Tagger from Two Hours of Annotation (2013)

Dan Garrette, Jason Baldridge

Most work on weakly-supervised learning for part-of-speech taggers has been based on unrealistic assumptions about the amount and quality of training data. For this paper, we attempt to create true low-resource scenarios by allowing a linguist just two hours to annotate data and evaluating on the languages Kinyarwanda and Malagasy. Given these severely limited amounts of either type supervision (tag dictionaries) or token supervision (labeled sentences), we are able to dramatically improve the learning of a hidden Markov model through our method of automatically generalizing the annotations, reducing noise, and inducing word-tag frequency information.
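To make the idea of type supervision concrete, the following is a minimal sketch of how a tag dictionary can constrain decoding in a hidden Markov model. It is an illustration under stated assumptions, not the method of the paper: the function name `viterbi`, the probability tables `trans` and `emit`, the start symbol `"<s>"`, and the `1e-6` floor for unseen events are all hypothetical choices, and the paper's steps of generalizing annotations, reducing noise, and inducing word-tag frequencies are not shown.

```python
import math

def viterbi(words, tags, trans, emit, tag_dict):
    """Most likely tag sequence under an HMM, restricted by a tag dictionary.

    trans[(prev_tag, tag)] and emit[(tag, word)] hold probabilities, and
    "<s>" marks the sentence start. Words missing from tag_dict may take
    any tag. Assumes `words` is non-empty.
    """
    def logp(table, key):
        # Assumed smoothing: a small floor for events not in the table.
        return math.log(table.get(key, 1e-6))

    # Candidate tags per position: dictionary entry if known, else all tags.
    allowed = [sorted(tag_dict.get(w, tags)) for w in words]

    best = {t: logp(trans, ("<s>", t)) + logp(emit, (t, words[0]))
            for t in allowed[0]}
    back = []
    for i in range(1, len(words)):
        new_best, pointers = {}, {}
        for t in allowed[i]:
            # Best previous tag for reaching tag t at position i.
            prev = max(best, key=lambda p: best[p] + logp(trans, (p, t)))
            new_best[t] = (best[prev] + logp(trans, (prev, t))
                           + logp(emit, (t, words[i])))
            pointers[t] = prev
        best = new_best
        back.append(pointers)

    # Follow back-pointers from the best final tag.
    tag = max(best, key=best.get)
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# Toy example with made-up probabilities (not data from the paper).
tags = ["D", "N", "V"]
tag_dict = {"the": {"D"}, "dog": {"N"}, "walks": {"N", "V"}}
trans = {("<s>", "D"): 0.8, ("D", "N"): 0.9, ("N", "V"): 0.7, ("N", "N"): 0.2}
emit = {("D", "the"): 0.5, ("N", "dog"): 0.1,
        ("V", "walks"): 0.1, ("N", "walks"): 0.05}
print(viterbi(["the", "dog", "walks"], tags, trans, emit, tag_dict))
```

In this sketch the dictionary only prunes the search space: "the" and "dog" each have a single permitted tag, so the model's probabilities decide only the ambiguous word "walks".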

Citation:

Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-13) (2013), pp. 138-147.

People

Dan Garrette

Ph.D. Alumni

dhg [at] cs utexas edu

Areas of Interest

Machine Learning, Natural Language Processing, Semi-Supervised Learning

Labs

Machine Learning