MIT Department of Electrical Engineering & Computer Science

Statistical Models for Parsing Natural Language

Michael Collins
University of Pennsylvania

Thursday, April 30, 1998
3:00 PM (refreshments 2:45)
Room NE43-941
EECS Special Seminar

Abstract

The vast amount of information now available in electronic form has led to increasing demand for applications that process natural language, such as machine translation, summarization and information extraction. Accurate methods for parsing unrestricted text will almost certainly be a key component of these applications.

Unfortunately, the traditional approach to syntactic analysis -- writing a grammar by hand -- has encountered two major problems. First, ambiguity: even moderate-length sentences often receive thousands of analyses, with no indication of which is correct. Second, coverage: constructing an exhaustive grammar of English has proved to be extremely difficult owing to the huge number of rules needed.

In this talk I will describe my work on machine learning methods for parsing. A statistical model is trained on a corpus of sentences that have been annotated for syntactic structure. Competing analyses for a test sentence can then be ranked by their probability under the model; moreover, the most probable analysis can be found efficiently. I will show how careful design of the model can lead to linguistically motivated parameters and, crucially, to parameters that condition heavily on lexical information. The resulting models recover constituents in Wall Street Journal text with 88% accuracy, the best published results on this task. I will discuss information extraction, machine translation and speech recognition as possible applications of the parser.
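The idea of training on annotated sentences and then ranking competing analyses by probability can be sketched roughly as follows. This is only an illustration under simplifying assumptions: it uses a plain, unlexicalized probabilistic context-free grammar estimated by relative frequency, not the lexically conditioned models described in the talk, and it scores a short list of candidates instead of searching for the best analysis. The tiny treebank and the names `rules`, `train`, `log_prob` and `rank` are hypothetical.

```python
import math
from collections import Counter

# A tree is a nested tuple: (label, child, child, ...); leaves are word strings.

def rules(tree):
    """Yield (parent, children) for every internal node of a tree."""
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def train(treebank):
    """Relative-frequency estimate: P(rule) = count(rule) / count(parent)."""
    rule_counts = Counter(r for t in treebank for r in rules(t))
    parent_counts = Counter()
    for (parent, _), n in rule_counts.items():
        parent_counts[parent] += n
    return {r: n / parent_counts[r[0]] for r, n in rule_counts.items()}

def log_prob(tree, model):
    """Log probability of a tree: the sum of the log probabilities of its rules."""
    total = 0.0
    for r in rules(tree):
        p = model.get(r, 0.0)
        if p == 0.0:
            return float("-inf")  # rule never seen in training
        total += math.log(p)
    return total

def rank(candidates, model):
    """Sort candidate analyses from most to least probable."""
    return sorted(candidates, key=lambda t: log_prob(t, model), reverse=True)

# Hypothetical toy data: two analyses of "she saw stars with telescopes".
she, saw, stars = ("NP", "she"), ("V", "saw"), ("NP", "stars")
pp = ("PP", ("P", "with"), ("NP", "telescopes"))
plain = ("S", she, ("VP", saw, stars))
vp_attach = ("S", she, ("VP", ("VP", saw, stars), pp))  # PP modifies the verb
np_attach = ("S", she, ("VP", saw, ("NP", stars, pp)))  # PP modifies the noun

model = train([plain, vp_attach, vp_attach, np_attach])
best = rank([np_attach, vp_attach], model)[0]
```

Because the toy treebank attaches the prepositional phrase to the verb phrase more often than to the noun phrase, the verb-attachment analysis is ranked first. A model of this kind has no access to the words themselves when choosing an attachment, which is one motivation for parameters that condition on lexical information.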


URL of this page: http://www-eecs.mit.edu/AY97-98/events/40.html
Created: Apr 23, 1998  | Modified: Apr 23, 1998
This event is from the MIT EECS 1997-98 archive.