Doctoral Thesis: From Structured Document To Structured Knowledge

Thursday, May 18
3:00 pm - 4:30 pm

32-D463

Yujie Qian

Abstract:

Structured documents, such as scientific literature and medical records, are rich sources of knowledge. However, most natural language processing techniques treat these documents as plain text, ignoring layout structure and visual signals. Modeling such structure is essential for a comprehensive understanding of these documents. This thesis presents novel algorithms for extracting structured knowledge from structured documents.

First, we propose GraphIE, an information extraction framework designed to model the non-local and non-sequential dependencies in structured documents. GraphIE leverages structural information through graph neural networks to enhance word-level tagging predictions. In evaluations across three extraction tasks, GraphIE consistently outperforms a sequential model that operates solely on plain text. 

Next, we delve into information extraction in the chemistry domain. Scientific literature often depicts molecules and reactions in infographic form. To extract these molecules, we develop MolScribe, a tool that translates a molecular image into its graph structure. MolScribe integrates symbolic chemistry constraints within an image-to-graph generation model, demonstrating robust performance in handling diverse drawing styles and conventions. To extract reaction schemes, we propose RxnScribe, which parses reaction diagrams through a sequence generation formulation. Despite being trained on a modest dataset, RxnScribe achieves strong performance across different types of diagrams.

Finally, we introduce TextReact, a novel method that directly augments predictive chemistry with text retrieval, bypassing the intermediate information extraction step. Our experiments on reaction condition recommendation and retrosynthetic prediction demonstrate TextReact’s efficacy in retrieving relevant information from the literature and generalizing to new inputs.

Thesis Committee: Professors Regina Barzilay (supervisor), Tommi Jaakkola, Connor Coley

To attend via Zoom:

https://mit.zoom.us/j/98642251144