Title:     Statistical Models for Text Segmentation
 
Authors:   Doug Beeferman   (dougb@cs.cmu.edu)
           Adam Berger      (aberger@cs.cmu.edu)
           John Lafferty    (lafferty@cs.cmu.edu)
 
Keywords:  Text segmentation, topic segmentation, maximum entropy
           models, feature induction, decision trees
 
Abstract:
 
This paper introduces a new statistical approach to partitioning text
automatically into coherent segments. Our approach enlists both
short-range and long-range language models to help it sniff out likely
sites of topic changes in text.  To aid its search, the system
consults a set of simple lexical hints it has learned to associate
with the presence of boundaries through inspection of a large corpus
of annotated data.  We also propose a new probabilistically motivated
error metric for use by the natural language processing and
information retrieval communities, intended to supersede precision and
recall for appraising segmentation algorithms.  Qualitative assessment
of our algorithm and evaluation using this new metric both
demonstrate the effectiveness of our approach in two very different
domains: Wall Street Journal articles and broadcast news transcripts
from the TDT Corpus.  We show that our algorithm compares favorably
with decision trees built from the same features, and that in
combination the two learning paradigms are even more effective.