An Efficient, Probabilistically Sound Segmentation Algorithm Michael R. Brent Johns Hopkins University Address correspondence to: Michael R. Brent Department of Cognitive Science Johns Hopkins University Baltimore, MD 21218 email: brent@jhu.edu phone: 410-516-6844 fax: 410-516-8020 Abstract This paper presents an algorithm for recovering word-boundaries in a natural-language text from which they have been deleted. Although a number of other algorithms have been proposed, this is the first one to be based on a sound probability model of the source that generated the text. The model yields a language-independent probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the entire observed corpus, regardless of length, as a single event in the probability space, rather than treating the occurrence of each word as a separate event. Accordingly, the algorithm does not estimate any probability distributions; instead, it attempts to calculate the exact probability of word sequences that could underlie the observed text. This approach appears to hold promise for a wide range of problems in language-learning. Finally, we report experiments on broad phonetic transcripts of spontaneous speech by parents to young children. These experiments suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are known and the text includes a substantial number of short utterances.