Syntactic Scope Resolution in Uncertainty Analysis
Abstract
We show how the use of syntactic structure enables the resolution of hedge scope in a hybrid, two-stage approach to uncertainty analysis. In the first stage, a Maximum Entropy classifier, combining surface-oriented and syntactic features, identifies cue words. With a small set of hand-crafted rules operating over dependency representations in stage two, we attain the best overall result (in terms of both combined ranks and average F1) in the 2010 CoNLL Shared Task.

1 Background—Motivation

Recent years have witnessed an increased interest in the analysis of various aspects of sentiment in natural language (Pang & Lee, 2008). The subtask of hedge resolution deals with the analysis of uncertainty as expressed in natural language, and with the linguistic means (so-called hedges) by which speculation or uncertainty are expressed. Information of this kind is important for various mining tasks that aim at extracting factual data. Example (1), taken from the BioScope corpus (Vincze, Szarvas, Farkas, Móra, & Csirik, 2008), shows a sentence where uncertainty is signaled by the modal verb may. In examples throughout this paper, angle brackets highlight hedge cues, and curly braces indicate the scope of a given cue, as annotated in BioScope.

(1) {The unknown amino acid 〈may〉 be used by these species}.

The topic of the Shared Task at the 2010 Conference for Natural Language Learning (CoNLL) is hedge detection in biomedical literature, in a sense 'zooming in' on one particular aspect of the broader BioNLP Shared Task of 2009 (Kim, Ohta, Pyysalo, Kano, & Tsujii, 2009). It involves two subtasks: Task 1 is described as learning to detect sentences containing uncertainty; the objective of Task 2 is learning to resolve the in-sentence scope of hedge cues (Farkas, Vincze, Móra, Csirik, & Szarvas, 2010). The organizers further suggest: "This task falls within the scope of semantic analysis of sentences exploiting syntactic patterns [...]."
The utility of syntactic information within various approaches to sentiment analysis in natural language has been an issue of some debate (Wilson, Wiebe, & Hwa, 2006; Ng, Dasgupta, & Arifin, 2006), and the potential contribution of syntax clearly varies with the specifics of the task. Previous work in the hedging realm has largely been concerned with cue detection, i.e. identifying uncertainty cues such as may in (1), which are predominantly individual tokens (Medlock & Briscoe, 2007; Kilicoglu & Bergler, 2008). There has been little previous work aimed at actually resolving the scope of such hedge cues, which presumably constitutes a somewhat different and likely more difficult problem. Morante and Daelemans (2009) present a machine-learning approach to this task, using token-level, lexical information only. In this sense, CoNLL 2010 enters largely uncharted territory, and it remains to be seen (a) whether syntactic analysis indeed is a necessary component in approaching this task and, more generally, (b) to what degree the specific task setup can inform us about the strong and weak points in current approaches and technology.

In this article, we investigate the contribution of syntax to hedge resolution, by reflecting on our experience in the CoNLL 2010 task. Our CoNLL system submission ranked fourth (of 24) on Task 1 and third (of 15) on Task 2, for an overall best average result (there appears to be very limited overlap among top performers for the two subtasks). It turns out, in fact, that all the top-performing systems in Task 2 of the CoNLL Shared Task rely on syntactic information provided by parsers, either in features for machine learning or as input to manually crafted rules (Morante, Asch, & Daelemans, 2010; Rei & Briscoe, 2010).
           Sentences  Hedged  Cues  Multi-Word Cues  Tokens  Cue Tokens
Abstracts      11871    2101  2659              364  309634        3056
Articles        2670     519   668               84   68579         782
Total          14541    2620  3327              448  378213        3838

Table 1: Summary statistics for the Shared Task training data.

This article transcends our CoNLL system description (Velldal, Øvrelid, & Oepen, 2010) in several respects: presenting updated and improved cue detection results (§ 3 and § 4), focusing on the role of syntactic information rather than on machine learning specifics (§ 5 and § 6), providing an analysis and discussion of Task 2 errors (§ 7), and generally aiming to gauge the value of available annotated data and processing tools (§ 8). We present a hybrid, two-level approach for hedge resolution, where a statistical classifier detects cue words, and a small set of manually crafted rules operating over syntactic structures resolves their scope. We show how syntactic information—produced by a data-driven dependency parser complemented with information from a 'deep', hand-crafted grammar—contributes to the resolution of the in-sentence scope of hedge cues, discussing various types of syntactic constructions and associated scope detection rules in considerable detail. We furthermore present a manual error analysis, which reveals remaining challenges in our scope resolution rules as well as several relevant idiosyncrasies of the preexisting BioScope annotation.

2 Task, Data, and System Basics

Task Definition and Evaluation Metrics  Task 1 is a binary sentence classification task: identifying utterances as being certain or uncertain. Following common practice, this subtask is evaluated in terms of precision, recall, and F1 for the 'positive' class, i.e. uncertain. In our work, we approach Task 1 as a byproduct of the full hedge resolution problem, labeling a sentence as uncertain if it contains at least one token classified as a hedge cue.
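This sentence-level decision reduces to a simple aggregation over token-level cue predictions. A minimal sketch, with function and variable names that are illustrative rather than taken from the paper:

```python
def label_sentence(cue_flags):
    """Task 1 label for one sentence: cue_flags holds one boolean per
    token, True where the cue classifier fired on that token."""
    return "uncertain" if any(cue_flags) else "certain"
```

For instance, `label_sentence([False, True, False])` labels the sentence as uncertain on the strength of its single cue token.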
In addition to the sentence-level evaluation for Task 1, we also present precision, recall, and F1 at the cue level. Task 2 comprises two subtasks: cue detection and scope resolution. The official CoNLL evaluation does not tease apart these two aspects of the problem, however: only an exact match of both the cue and scope bracketing (in terms of substring positions) is counted as a success, again quantified in terms of precision, recall, and F1. Discussing our results below, we report cue detection and scope resolution performance separately, and further put scope results into perspective against an upper bound based on the gold-standard cue annotation.

Besides the primary biomedical domain data, some annotated Wikipedia data was provided for Task 1, and participating systems are classified as in-domain (using exclusively the domain-specific data), cross-domain (combining both types of training data), or open (utilizing additional uncertainty-related resources). In our work, we focus on the interplay of syntax and the more challenging Task 2; we ignored the Wikipedia track in Task 1. Despite our use of general NLP tools (see below), our system falls into the most restrictive, in-domain category.

Training and Evaluation Data  The training data for the CoNLL 2010 Shared Task is taken from the BioScope corpus (Vincze et al., 2008) and consists of 14,541 'sentences' (or other root-level utterances) from biomedical abstracts and articles (see Table 1). The BioScope corpus provides annotation for hedge cues as well as their scope. According to the annotation guidelines (Vincze et al., 2008), the annotation adheres to a principle of minimalism when it comes to hedge cues, i.e. the minimal unit expressing hedging is annotated.
The inverse is true of scope annotations, which adhere to a principle of maximal scope, meaning that the scope should be set to the largest syntactic unit possible. As it was known beforehand that evaluation would draw on full articles only, we put more emphasis on the article subset of the training data, for example in cross-validation testing and manual diagnosis of errors.

ID  FORM     LEMMA    POS  FEATS                                             HEAD  DEPREL  XHEAD  XDEP
1   The      the      DT   _                                                 4     NMOD    4      SPECDET
2   unknown  unknown  JJ   degree:attributive                                4     NMOD    4      ADJUNCT
3   amino    amino    JJ   degree:attributive                                4     NMOD    4      ADJUNCT
4   acid     acid     NN   pers:3|case:nom|num:sg|ntype:common               5     SBJ     3      SUBJ
5   may      may      MD   mood:ind|subcat:MODAL|tense:pres|clauseType:decl  0     ROOT    0      ROOT
6   be       be       VB   _                                                 5     VC      7      PHI
7   used     use      VBN  subcat:V-SUBJ-OBJ|vtype:main|passive:+            6     VC      5      XCOMP
8   by       by       IN   _                                                 7     LGS     9      PHI
9   these    these    DT   deixis:proximal                                   10    NMOD    10     SPECDET
10  species  specie   NNS  num:pl|pers:3|case:obl|common:count|ntype:common  8     PMOD    7      OBL-AG
11  .        .        .    _                                                 5     P       0      PUNC

Table 2: Stacked dependency representation of example (1), with MaltParser and XLE annotations.

For evaluation purposes, the task organizers provided newly annotated biomedical articles, following the same general BioScope principles. The CoNLL 2010 evaluation data comprises 5,003 additional utterances (138,276 tokens), of which 790 are annotated as hedged. The data contains a total of 1,033 cues, of which 87 are so-called multi-word cues (i.e. cues spanning multiple tokens), comprising 1,148 cue tokens altogether.

Stacked Dependency Parsing  For syntactic analysis we employ the open-source MaltParser (Nivre, Hall, & Nilsson, 2006), a platform for data-driven dependency parsing. For improved accuracy and portability across domains and genres, we make our parser incorporate the predictions of a large-scale, general-purpose LFG parser, following the work of Øvrelid, Kuhn, and Spreyer (2009).
A technique dubbed parser stacking enables the data-driven parser to learn, not only from gold standard treebank annotations, but from the output of another parser (Nivre & McDonald, 2008). This technique has been shown to provide significant improvements in accuracy for both English and German (Øvrelid et al., 2009), and a similar setup employing an HPSG grammar has been shown to increase domain independence in data-driven dependency parsing (Zhang & Wang, 2009). The stacked parser combines two quite different approaches—data-driven dependency parsing and ‘deep’ parsing with a handcrafted grammar—and thus provides us with a broad range of different types of linguistic information for the hedge resolution task. MaltParser is based on a deterministic parsing strategy in combination with treebank-induced classifiers for predicting parse transitions. It supports a rich feature representation of the parse history in order to guide parsing and may easily be extended to take additional features into account. The procedure to enable the data-driven parser to learn from the grammar-driven parser is quite simple. We parse a treebank with the XLE platform (Crouch et al., 2008) and the English grammar developed within the ParGram project (Butt, Dyvik, King, Masuichi, & Rohrer, 2002). We then convert the LFG output to dependency structures, so that we have two parallel versions of the treebank—one gold standard and one with LFG annotation. We extend the gold standard treebank with additional information from the corresponding LFG analysis and train MaltParser on the enhanced data set. Table 2 shows the enhanced dependency representation of example (1) above, taken from the training data. For each token, the parsed data contains information on the word form, lemma, and part of speech (PoS), as well as on the head and dependency relation in columns 6 and 7. The added XLE information resides in the FEATS column, and in the XLE-specific head and dependency columns 8 and 9. 
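The enhancement step can be pictured as a token-aligned merge of the gold-standard treebank columns with the converted XLE annotations, yielding the nine-column layout of Table 2. The sketch below is illustrative only: the tuple layout and function name are assumptions, and real CoNLL-format file handling and token alignment are elided.

```python
def merge_token(gold, xle):
    """Merge one token's annotations into the enhanced representation.

    gold: (id, form, lemma, pos, head, deprel), from the gold-standard
          dependency treebank.
    xle:  (feats, xhead, xdep), from the converted LFG analysis of the
          same token.
    Returns the nine columns shown in Table 2, in order."""
    tid, form, lemma, pos, head, deprel = gold
    feats, xhead, xdep = xle
    return [tid, form, lemma, pos, feats, head, deprel, xhead, xdep]

# The row for 'may' in Table 2, reconstructed from its two sources:
row = merge_token(
    ("5", "may", "may", "MD", "0", "ROOT"),
    ("mood:ind|subcat:MODAL|tense:pres|clauseType:decl", "0", "ROOT"),
)
```

Training MaltParser on data in this merged format is what lets the data-driven parser condition on the grammar-driven parser's predictions.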
Parser outputs, which in turn form the basis for our scope resolution rules discussed in Section 5, take this same form. The parser employed in this work is trained on the Wall Street Journal sections 2–24 of the Penn Treebank (PTB), converted to dependency format (Johansson & Nugues, 2007) and extended with XLE features, as described above. Parsing uses the arc-eager mode of MaltParser and an SVM with a polynomial kernel. When tested using 10-fold cross-validation on the enhanced PTB, the parser achieves a labeled accuracy score of 89.8.

PoS Tagging and Domain Variation  Our parser is trained on financial news, and although stacking with a general-purpose LFG parser is expected to aid domain portability, substantial differences in domain and genre are bound to negatively affect syntactic analysis (Gildea, 2001). MaltParser presupposes that inputs have been PoS tagged, leaving room for variation in preprocessing. On the one hand, we aim to make parser inputs maximally similar to its training data (i.e. the conventions established in the PTB); on the other hand, we wish to benefit from specialized resources for the biomedical domain. The GENIA tagger (Tsuruoka et al., 2005) is particularly relevant in this respect (as could be the GENIA Treebank proper). However, we found that GENIA tokenization does not match the PTB conventions in about one out of five sentences (for example, wrongly splitting tokens like '390,926' or 'Ca(2+)'); also in tagging proper nouns, GENIA systematically deviates from the PTB. Hence, we adapted an in-house tokenizer (using cascaded finite-state rules) to the CoNLL task, run two PoS taggers in parallel, and eclectically combine annotations across the various preprocessing components, predominantly giving precedence to GENIA lemmatization and PoS hypotheses.
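The combination policy can be sketched as follows. Only the precedence given to GENIA is stated in the text; the token-aligned data structures and the exact fallback behaviour are assumptions made for illustration.

```python
def combine_annotations(genia, fallback):
    """Combine token-aligned (lemma, pos) analyses from two taggers:
    prefer the GENIA analysis where one is available (entries may be
    None), otherwise fall back to the PTB-trained tagger's analysis."""
    return [g if g is not None else f for g, f in zip(genia, fallback)]

merged = combine_annotations(
    genia=[None, ("protein", "NN")],
    fallback=[("the", "DT"), ("protein", "NNP")],
)
```

Here the second token keeps GENIA's `NN` tag, while the first token, for which GENIA offered no analysis, falls back to the other tagger.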
To assess the impact of improved, domain-adapted inputs on our hedge resolution system, we contrast two configurations: first, running the parser in the exact same manner as Øvrelid, Kuhn, and Spreyer (2010), we use TreeTagger (Schmid, 1994) and its standard model for English (trained on the PTB) for preprocessing; second, we give as inputs to the parser our refined tokenization and merged PoS tags, as described above. When evaluating the two modes of preprocessing on the articles subset of the training data, and using gold-standard cues, our system for resolving cue scopes (presented in § 5) achieves an F1 of 66.31 with TreeTagger inputs, and 72.30 using our refined tokenization and tagger combination. These results underline the importance of domain adaptation for accurate syntactic analysis, and in the following we assume our hybrid in-house setup. Although the GENIA Treebank provides syntactic annotation in a form inspired by the PTB, it does not provide function labels; therefore, our procedure for converting from constituency to dependency requires non-trivial adaptation before we can investigate the effects of retraining the parser on GENIA.

3 Stage 1: Identifying Hedge Cues

For the task of identifying hedge cues, we developed a binary maximum entropy (MaxEnt) classifier. The identification of cue words is used for (a) classifying sentences as certain or uncertain (Task 1), and (b) providing input to the syntactic rules that we later apply for resolving the in-sentence scope of the cues (Task 2). We also report evaluation scores for the subtask of cue detection in isolation. As annotated in the training data, it is possible for a hedge cue to span multiple tokens, e.g. as in whether or not. The majority of the multi-word cues in the training data are very infrequent, however, most occurring only once, and the classifier itself is not sensitive to the notion of multi-word cues.
Instead, the task of determining whether a cue word forms part of a larger multi-word cue is performed in a separate post-processing step, applying a heuristic rule targeted at only the most frequently occurring patterns of multi-word cues in the training data. During development, we trained cue classifiers using a wide variety of feature types, both syntactic and surface-oriented. In the end, however, we found n-gram-based lexical features to make the greatest contribution to classifier performance. Our best-performing classifier so far (see 'Final' in Table 3) includes the following feature types: n-grams over forms (up to 2 tokens to the right), n-grams over base forms (up to 3 tokens left and right), PoS (from GENIA), subcategorization frames (from XLE), and phrase-structural coordination level (from XLE). Our CoNLL system description includes more details of the various other feature types that we experimented with (Velldal et al., 2010).

4 Cue Detection Evaluation

Table 3 summarizes the performance of our MaxEnt hedge cue classifier in terms of precision, recall, and F1, computed using the official Shared Task scorer script. The sentence-level scores correspond to Task 1 of the Shared Task, and the cue-level scores are based on the exact-match counts for full hedge cues (possibly spanning multiple tokens).
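The multi-word post-processing step described above might look roughly like the following sketch. The pattern list is illustrative: whether or not is attested in the text, but any further patterns, and the exact matching strategy, are assumptions.

```python
# Frequent multi-word cue patterns; "whether or not" is from the paper,
# the second entry is a hypothetical example.
MULTIWORD_PATTERNS = [("whether", "or", "not"), ("cannot", "be", "excluded")]

def join_multiword_cues(tokens, is_cue):
    """Group token-level cue predictions into cue spans.

    tokens: list of word forms; is_cue: parallel list of booleans.
    Returns cue spans as (start, end) token indices, inclusive; a
    predicted cue token that starts a known pattern absorbs the
    following tokens into one multi-word cue."""
    spans = []
    i = 0
    while i < len(tokens):
        if is_cue[i]:
            for pat in MULTIWORD_PATTERNS:
                if tuple(t.lower() for t in tokens[i:i + len(pat)]) == pat:
                    spans.append((i, i + len(pat) - 1))
                    i += len(pat)
                    break
            else:
                spans.append((i, i))  # ordinary single-token cue
                i += 1
        else:
            i += 1
    return spans
```

On the token sequence "we do not know whether or not this holds", with the classifier firing only on "whether", the heuristic would emit the single three-token span covering "whether or not".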