An Efficient, Probabilistically Sound Segmentation Algorithm

Michael R. Brent

Johns Hopkins University

Address correspondence to:  
    
    Michael R. Brent
    Department of Cognitive Science
    Johns Hopkins University
    Baltimore, MD 21218

    email:  brent@jhu.edu
    phone: 410-516-6844
    fax: 410-516-8020

Abstract

 This paper presents an algorithm for recovering word-boundaries in a
natural-language text from which they have been deleted.  Although a
number of other algorithms have been proposed, this is the first one to
be based on a sound probability model of the source that generated the
text. The model yields a language-independent probability distribution
on all possible sequences of all possible words over a given alphabet,
based on the assumption that the input was generated by concatenating
words from a fixed but unknown lexicon.  The model is unusual in that it
treats the entire observed corpus, regardless of length, as a single
event in the probability space, rather than treating the occurrence of
each word as a separate event.  Accordingly, the algorithm does not
estimate any probability distributions; instead, it attempts to
calculate the exact probability of word sequences that could underlie
the observed text. This approach appears to hold promise for a wide
range of problems in language-learning. Finally, we report experiments
on broad phonetic transcripts of spontaneous speech by parents to young
children.  These experiments suggest that our algorithm is more
effective than other proposed algorithms, at least when utterance
boundaries are known and the text includes a substantial number of short
utterances.