Title: Statistical Models for Text Segmentation

Authors: Doug Beeferman (dougb@cs.cmu.edu), Adam Berger (aberger@cs.cmu.edu), John Lafferty (lafferty@cs.cmu.edu)

Keywords: Text segmentation, topic segmentation, maximum entropy models, feature induction, decision trees

Abstract: This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains: Wall Street Journal articles and broadcast news transcripts from the TDT Corpus. We show that our algorithm compares favorably with decision trees built from the same features, and that in combination the two learning paradigms are even more effective.