Using the Spoken Dutch Corpus for type-logical grammar induction
Abstract
The dependency-based annotation format used in the Spoken Dutch Corpus (CGN) project (van der Wouden et al., 2002) was designed to allow a transparent mapping to the derivational structures of current ‘lexicalized’ grammar formalisms. Through such translations, the CGN treebank can be used to train and evaluate computational grammars within these frameworks. In this paper we use the computational facilities of the Grail system (see Moot, 2002) to extract type-logical grammars from the CGN annotation graphs. Grail is a general grammar development environment for type-logical categorial grammars (TLG). The Grail parsing engine combines proof net technology with structural rewriting.
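To make the idea of extracting a type-logical lexicon from dependency annotations concrete, here is a minimal sketch in Python. It is not the Grail implementation and does not reproduce the paper's procedure. The `Node` class, the relation label `hd` for the head daughter, and the rule that every non-head daughter is an argument are assumptions made for illustration. Modifiers, headless constructions, and discontinuity are deliberately left out. Types are written in Lambek style, where `A\B` looks for an `A` to its left and `B/A` looks for an `A` to its right.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """Hypothetical dependency-tree node (not the CGN file format)."""
    cat: str                        # category label, e.g. "s" or "np"
    rel: str = "top"                # dependency relation to the parent
    word: Optional[str] = None      # set on leaves only
    children: List["Node"] = field(default_factory=list)


def extract(node: Node, goal: str, lexicon: Dict[str, Set[str]]) -> None:
    """Assign type `goal` to `node` and collect lexical types for its leaves."""
    if node.word is not None:
        lexicon.setdefault(node.word, set()).add(goal)
        return
    heads = [i for i, c in enumerate(node.children) if c.rel == "hd"]
    if len(heads) != 1:
        # Assumption: exactly one head daughter per phrase.
        raise ValueError(f"expected exactly one head daughter under {node.cat!r}")
    h = heads[0]
    left, right = node.children[:h], node.children[h + 1:]

    # The head is a functor over its sisters. Arguments consumed last are
    # wrapped first, so they end up innermost in the type.
    head_type = goal
    for arg in left:                # farthest-left sister first
        head_type = f"({arg.cat}\\{head_type})"
    for arg in reversed(right):     # farthest-right sister first
        head_type = f"({head_type}/{arg.cat})"

    extract(node.children[h], head_type, lexicon)
    for arg in left + right:
        extract(arg, arg.cat, lexicon)


# Toy example: "Jan ziet Marie" ("Jan sees Marie").
tree = Node("s", children=[
    Node("np", rel="su", word="Jan"),
    Node("verb", rel="hd", word="ziet"),
    Node("np", rel="obj1", word="Marie"),
])
lexicon: Dict[str, Set[str]] = {}
extract(tree, "s", lexicon)
print(lexicon)
```

In this toy run the verb receives a transitive functor type over two `np` arguments, while the two names receive the atomic type `np`. A realistic extraction would also need to distinguish modifiers from complements and to handle the non-projective dependencies found in spoken-language annotation.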
Similar resources
Extraction of Type-Logical Supertags from the Spoken Dutch Corpus
The Spoken Dutch Corpus assigns 1 million of its 9 million total words a syntactic annotation in the form of dependency graphs. We will look at strategies for automatically extracting a lexicon of type-logical supertags from these dependency graphs and investigate how different levels of lexical detail affect the size of the resulting lexicon as well as the performance with respect to supertag d...
CLIoS: Cross-Lingual Induction of Speech Recognition Grammars for the Localization of Spoken Dialog Systems
We present an approach for the cross-lingual induction of speech recognition grammars that separates the task of translation from the task of grammar generation. The source speech recognition grammar is used to generate phrases, which are translated by a common translation service. The target recognition grammar is induced by using the production rules of the source language, manually translate...
Clausal Coordinate Ellipsis and its Varieties in Spoken German: A Study with the TüBa-D/S Treebank of the VERBMOBIL Corpus
Grammar rules for Clausal Coordinate Ellipsis (CCE) are based nearly exclusively on linguistic judgments (intuitions). For German, the extent to which grammar rules based on this type of empirical evidence generate all and only the CCE structures that populate text corpora has only been explored with the TIGER treebank of written newspaper text. How well these rules fit spoken German is unknown. I...
Web data harvesting for speech understanding grammar induction
The development of a speech understanding grammar for spoken dialogue systems can be greatly accelerated by using an in-domain corpus. The development of such a corpus, however, is a slow and expensive process. This paper proposes unsupervised, language-agnostic methods for finding relevant corpora in the web and mining the most informative parts. We show that by utilizing perplexity we are abl...
Core Units of Spoken Grammar in Global ELT Textbooks
Materials evaluation studies have constantly demonstrated that there is no one fixed procedure for conducting textbook evaluation studies. Instead, the criteria must be selected according to the needs and objectives of the context in which evaluation takes place. The speaking skill as part of the communicative competence has been emphasized as an important objective in language teaching. The pr...