Starting from a minimum-description-length modeling framework, I will introduce and motivate a hierarchical coding scheme that represents both signals and model parameters as compositions of model parameters. This coding scheme is extremely well-behaved with respect to search, eliminates competition between interesting and uninteresting patterns in the data, and mirrors linguistic mechanisms. Optimizing model parameters over a sequence of text or speech samples produces a statistical language model, a segmentation of the input, and a set of parameters that have natural linguistic interpretations.
The final models fare well on both linguistic and statistical grounds. I will present record text compression rates, as well as record recall rates for intuitive segmentation boundaries in English and Chinese text and in transcripts of speech to children. I will describe extensions to the basic framework for learning from speech rather than text, and present the first dictionaries learned directly from spoken utterances. Finally, time permitting, I will present extensions for learning aspects of grammar and for learning translation models between two languages, or between one language and representations of meaning.
HOST: Professor Eric Grimson
Modified: Jun 25, 1997