Smoothing fine-grained PCFG lexicons
Abstract
We present an approach for smoothing treebank-PCFG lexicons by interpolating treebank lexical parameter estimates with estimates obtained from unannotated data via the inside-outside algorithm. The PCFG has complex lexical categories, which makes relative-frequency estimates from a treebank very sparse. Smoothing these complex lexical categories improves parsing performance, with a particular advantage in identifying obligatory arguments subcategorized by verbs unseen in the treebank.
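The interpolation idea in the abstract can be illustrated with a minimal sketch. It assumes a single fixed interpolation weight `lam`. The abstract does not say how the weight is chosen, so this is not necessarily the authors' scheme. The category label `V_NP`, the function names, and all counts are hypothetical values made up for illustration. The fractional "corpus" counts stand in for expected counts that an inside-outside run would produce.

```python
def relative_frequency(counts):
    """Convert {category: {word: count}} into {category: {word: P(word | category)}}."""
    probs = {}
    for cat, word_counts in counts.items():
        total = sum(word_counts.values())
        if total > 0:
            probs[cat] = {w: c / total for w, c in word_counts.items()}
    return probs


def interpolate(treebank_probs, corpus_probs, lam=0.5):
    """Linearly interpolate two lexical distributions per category.

    lam is the weight on the treebank estimate (an assumed, fixed value here).
    If a category appears in only one source, that source's distribution is
    used unchanged so that the result still sums to one.
    """
    smoothed = {}
    for cat in set(treebank_probs) | set(corpus_probs):
        tb = treebank_probs.get(cat)
        io = corpus_probs.get(cat)
        if tb is None:
            smoothed[cat] = dict(io)
        elif io is None:
            smoothed[cat] = dict(tb)
        else:
            words = set(tb) | set(io)
            smoothed[cat] = {
                w: lam * tb.get(w, 0.0) + (1.0 - lam) * io.get(w, 0.0)
                for w in words
            }
    return smoothed


# Hypothetical treebank counts: "devour" is unseen with this category.
treebank_counts = {"V_NP": {"eat": 3, "see": 1}}
# Hypothetical expected counts standing in for inside-outside output.
corpus_counts = {"V_NP": {"eat": 2.0, "see": 1.0, "devour": 1.0}}

smoothed = interpolate(
    relative_frequency(treebank_counts),
    relative_frequency(corpus_counts),
    lam=0.5,
)
print(sorted(smoothed["V_NP"].items()))
```

In this toy example the verb unseen in the treebank receives nonzero probability under the transitive category, which is the effect the abstract attributes to smoothing.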
Related papers
Ckylark: A More Robust PCFG-LA Parser
This paper describes Ckylark, a PCFG-LA style phrase structure parser that is more robust than other parsers in the genre. PCFG-LA parsers are known to achieve highly competitive performance, but sometimes the parsing process fails completely, and no parses can be generated. Ckylark introduces three new techniques that prevent possible causes for parsing failure: outputting intermediate results...
Corpus Induction of Lexicons for Treebank PCFGs by Inside-Outside Estimation and Frequency Transformations
We describe procedures which pool lexical information from a treebank with frequency information estimated from an unannotated corpus with the inside-outside algorithm. PCFG parameters for non-lexical productions are obtained purely from the treebank. The procedures produce substantial improvements (up to 20.34%) on the task of determining valences of tokens of novel verbs, relative to a smoothed...
Building a fine-grained subjectivity lexicon from a web corpus
In this paper we propose a method to build fine-grained subjectivity lexicons including nouns, verbs and adjectives. The method, which is applied for Dutch, is based on the comparison of word frequencies of three corpora: Wikipedia, News and News comments. Comparison of the corpora is carried out with two measures: log-likelihood ratio and a percentage difference calculation. The first step of ...
Appropriately Handled Prosodic Breaks Help PCFG Parsing
This paper investigates using prosodic information in the form of ToBI break indexes for parsing spontaneous speech. We revisit two previously studied approaches, one that hurt parsing performance and one that achieved minor improvements, and propose a new method that aims to better integrate prosodic breaks into parsing. Although these approaches can improve the performance of basic probabilis...
Korean Twitter Emotion Classification Using Automatically Built Emotion Lexicons and Fine-Grained Features
In recent years many people have begun to express their thoughts and opinions on Twitter. Naturally, Twitter has become an effective source to investigate people’s emotions for numerous applications. Classifying only positive and negative tweets has been exploited in depth, whereas analyzing finer emotions is still a difficult task. More elaborate emotion lexicons should be developed to deal wi...