Parsimonious Data-Oriented Parsing
Abstract
This paper explores a parsimonious approach to Data-Oriented Parsing. While allowing, in principle, all possible subtrees of trees in the treebank to be productive elements, our approach aims at finding a manageable subset of these trees that can accurately describe empirical distributions over phrase-structure trees. The proposed algorithm leads to computationally much more tractable parsers, as well as linguistically more informative grammars. The parser is evaluated on the OVIS and WSJ corpora, and shows improvements in efficiency, parse accuracy and test-set likelihood.

1 Data-Oriented Parsing

Data-Oriented Parsing (DOP) is a framework for statistical parsing and language modeling originally proposed by Scha (1990). Some of its innovations, although radical at the time, are now widely accepted: the use of fragments from the trees in an annotated corpus as the symbolic grammar (now known as “treebank grammars”, Charniak, 1996) and the inclusion of all statistical dependencies between nodes in the trees for disambiguation (the “all-subtrees approach”, Collins & Duffy, 2002).

The best-known instantiations of the DOP framework are due to Bod (1998; 2001; 2003), using the Probabilistic Tree Substitution Grammar (PTSG) formalism. Bod has advocated a maximalist approach to DOP: inducing grammars that contain all subtrees of all parse trees in the treebank, and using them to parse unknown sentences, so that all of these subtrees can potentially contribute to the most probable parse. Although Bod’s empirical results have been excellent, his maximalism poses important computational challenges that, although not necessarily unsolvable, threaten both the scalability to larger treebanks and the cognitive plausibility of the models.

In this paper I explore a different approach to DOP, which I will call “Parsimonious Data-Oriented Parsing” (P-DOP).
This approach remains true to Scha’s original program by allowing, in principle, all possible subtrees of trees in the treebank to be the productive elements. But unlike Bod’s approach, P-DOP aims at finding a succinct subset of such elementary trees, chosen such that it can still accurately describe observed distributions over phrase-structure trees. I will demonstrate that P-DOP leads to computationally more tractable parsers, as well as linguistically more informative grammars. Moreover, as P-DOP is formulated as an enrichment of the treebank Probabilistic Context-Free Grammar (PCFG), it allows for much easier comparison to alternative approaches to statistical parsing (Collins, 1997; Charniak, 1997; Johnson, 1998; Klein and Manning, 2003; Petrov et al., 2006).

2 Independence Assumptions in PCFGs

Parsing with treebank PCFGs, in its simplest form, involves the following steps: (1) a treebank is created by extracting phrase-structure trees from an annotated corpus, and split into a training set and a test set; (2) a PCFG is read off from all productions in the training-set trees, with weights proportional to their fre-
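
Step (2) above, reading a PCFG off the training trees, can be illustrated with a minimal sketch. It assumes that trees are nested tuples of the form `(label, child, ...)` with plain strings as leaves, and that rule weights are normalized per left-hand side (a relative-frequency estimate). The toy trees and the names `productions` and `read_off_pcfg` are hypothetical and do not come from the paper or from the OVIS or WSJ corpora.

```python
from collections import Counter

# Toy trees as nested tuples: (label, child, ...); leaves are plain strings.
# These trees are made-up illustrations, not treebank data.
TRAIN_TREES = [
    ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "stars"))),
    ("S", ("NP", "stars"), ("VP", ("V", "shine"))),
]


def productions(tree):
    """Yield one (lhs, rhs) pair for every internal node of a tree."""
    label, *children = tree
    # The right-hand side lists child labels, or the word itself for a leaf.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def read_off_pcfg(trees):
    """Assumed weighting: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(p for t in trees for p in productions(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


pcfg = read_off_pcfg(TRAIN_TREES)
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

Because each production is counted in isolation, the resulting weights ignore any dependency between a rule and the context in which it was used, which is the kind of independence assumption this section's title refers to.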