A Word Clustering Approach to Domain Adaptation: Effective Parsing of Biomedical Texts
Abstract
We present a simple and effective way to perform out-of-domain statistical parsing by drastically reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with biomedical target-domain data. The resulting clusters are effective in bridging the lexical gap between source-domain and target-domain vocabularies. Our experiments combine known self-training techniques with unsupervised word clustering and produce promising results, achieving an error reduction of 21% on a new evaluation set for biomedical text with manual bracketing annotations.
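The terminal-replacement step described in the abstract can be illustrated with a minimal sketch. It assumes a word-to-cluster mapping has already been produced by some unsupervised clustering tool; the mapping contents, the cluster IDs, the `C_UNK` fallback, the lowercasing step, and the function name `clusterize` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical fallback ID for words absent from the cluster mapping.
UNKNOWN_CLUSTER = "C_UNK"


def clusterize(tokens: List[str], word_to_cluster: Dict[str, str]) -> List[str]:
    """Replace each token with its cluster ID so that a parser sees
    cluster symbols instead of raw word forms as terminals."""
    return [word_to_cluster.get(tok.lower(), UNKNOWN_CLUSTER) for tok in tokens]


# Invented toy mapping: a newspaper word and a biomedical word share a cluster,
# which is the kind of lexical bridging the abstract refers to.
toy_clusters = {
    "the": "C03",
    "company": "C17",
    "protein": "C17",
    "buys": "C42",
    "binds": "C42",
}

source_sentence = clusterize(["The", "company", "buys", "shares"], toy_clusters)
target_sentence = clusterize(["The", "protein", "binds", "BRCA1"], toy_clusters)
```

In this toy setting both sentences map to the same terminal sequence, so statistics gathered on the newspaper sentence would carry over to the biomedical one. How the actual system handles unknown words and casing is not stated in the abstract.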
Similar resources
Evaluating Impact of Re-training a Lexical Disambiguation Model on Domain Adaptation of an HPSG Parser
This paper describes an effective approach to adapting an HPSG parser trained on the Penn Treebank to a biomedical domain. In this approach, we train probabilities of lexical entry assignments to words in a target domain and then incorporate them into the original parser. Experimental results show that this method can obtain higher parsing accuracy than previous work on domain adaptation for pa...
A word clustering approach to domain adaptation: Robust parsing of source and target domains
We present a technique to improve out-of-domain statistical parsing by reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with target-domain data. We also investigate the impact of guiding out-of-domain parsing with predicted part-of-speech tags. We provide an evaluation for Fre...
Domain Adaptation for Dependency Parsing via Self-Training
This paper presents a successful approach for domain adaptation of a dependency parser via self-training. We improve parsing accuracy for out-of-domain texts with a self-training approach that uses confidence-based methods to select additional training samples. We compare two confidence-based methods: The first method uses the parse score of the employed parser to measure the confidence into a ...
The Impact of Morphemes on Dependency Parsing of Persian
Data-driven systems can be adapted easily to different languages and domains. Applying this trend to dependency parsing has led to the introduction of data-driven approaches. The existence of appropriate corpora that contain sentences and their associated dependency trees is the only prerequisite for data-driven approaches. Despite obtaining highly accurate results for the dependency parsing task in English langu...
Intellectual structure of knowledge in Nanomedicine field (2009 to 2018): A Co-Word Analysis
Introduction: Co-word analysis can identify the intellectual structure of knowledge in a research domain and reveal its underlying research aspects. Objective: This study examines the intellectual structure of knowledge in the field of nanomedicine from 2009 to 2018 using co-word analysis. Materials and Methods: This paper develops a sciento...