Bootstrapping Dependency Grammar Inducers from Incomplete Sentence Fragments via Austere Models

نویسندگان

  • Valentin I. Spitkovsky
  • Hiyan Alshawi
  • Daniel Jurafsky
چکیده

Modern grammar induction systems often employ curriculum learning strategies that begin by training on a subset of all available input that is considered simpler than the full data. Traditionally, filtering has been at granularities of whole input units, e.g., discarding entire sentences with too many words or punctuation marks. We propose instead viewing interpunctuation fragments as atoms, initially, thus making some simple phrases and clauses of complex sentences available to training sooner. Splitting input text at punctuation in this way improved our state-of-the-art grammar induction pipeline. We observe that resulting partial data, i.e., mostly incomplete sentence fragments, can be analyzed using reduced parsing models which, we show, can be easier to bootstrap than more nuanced grammars. Starting with a new, bare dependency-and-boundary model (DBM-0), our grammar inducer attained 61.2% directed dependency accuracy on Section 23 (all sentences) of the Wall Street Journal corpus: more than 2% higher than previous published results for this task.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Three Dependency-and-Boundary Models for Grammar Induction

We present a new family of models for unsupervised parsing, Dependency and Boundary models, that use cues at constituent boundaries to inform head-outward dependency tree generation. We build on three intuitions that are explicit in phrase-structure grammars but only implicit in standard dependency formulations: (i) Distributions of words that occur at sentence boundaries — such as English dete...

متن کامل

Unsupervised Learning of Dependency Structure for Language Modeling

This paper presents a dependency language model (DLM) that captures linguistic constraints via a dependency structure, i.e., a set of probabilistic dependencies that express the relations between headwords of each phrase in a sentence by an acyclic, planar, undirected graph. Our contributions are three-fold. First, we incorporate the dependency structure into an n-gram language model to capture...

متن کامل

Dependency-Based N-Gram Models for General Purpose Sentence Realisation

We present dependency-based n-gram models for general-purpose, widecoverage, probabilistic sentence realisation. Our method linearises unordered dependencies in input representations directly rather than via the application of grammar rules, as in traditional chartbased generators. The method is simple, efficient, and achieves competitive accuracy and complete coverage on standard English (Penn...

متن کامل

Categorial Dependency Grammars: from Theory to Large Scale Grammars

Categorial Dependency Grammars (CDG) generate unlimited projective and non-projective dependency structures, are completely lexicalized and analyzed in polynomial time. We present an extension of the CDG, also analyzed in polynomial time and dedicated for large scale dependency grammars. We define for the extended CDG a specific method of “Structural Bootstrapping” consisting in incremental con...

متن کامل

Disambiguation of Super Parts of Speech ( or Supertags ) : Almost

In a lexicalized grammar formalism such as Lexicalized Tree-Adjoining Grammar (LTAG), each lexical item is associated with at least one elementary structure (supertag) that localizes syntactic and semantic dependencies. Thus a parser for a lexicalized grammar must search a large set of supertags to choose the right ones to combine for the parse of the sentence. We present techniques for disambi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012