Understanding seed selection in bootstrapping

نویسندگان

  • Yo Ehara
  • Issei Sato
  • Hidekazu Oiwa
  • Hiroshi Nakagawa
چکیده

Bootstrapping has recently become the focus of much attention in natural language processing to reduce labeling cost. In bootstrapping, unlabeled instances can be harvested from the initial labeled “seed” set. The selected seed set affects accuracy, but how to select a good seed set is not yet clear. Thus, an “iterative seeding” framework is proposed for bootstrapping to reduce its labeling cost. Our framework iteratively selects the unlabeled instance that has the best “goodness of seed” and labels the unlabeled instance in the seed set. Our framework deepens understanding of this seeding process in bootstrapping by deriving the dual problem. We propose a method called expected model rotation (EMR) that works well on not well-separated data which frequently occur as realistic data. Experimental results show that EMR can select seed sets that provide significantly higher mean reciprocal rank on realistic data than existing naive selection methods or random seed sets.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

HITS-based Seed Selection and Stop List Construction for Bootstrapping

In bootstrapping (seed set expansion), selecting good seeds and creating stop lists are two effective ways to reduce semantic drift, but these methods generally need human supervision. In this paper, we propose a graphbased approach to helping editors choose effective seeds and stop list instances, applicable to Pantel and Pennacchiotti’s Espresso bootstrapping algorithm. The idea is to select ...

متن کامل

Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages

This paper proposes a new bootstrapping approach to unsupervised part-of-speech induction. In comparison to previous bootstrapping algorithms developed for this problem, our approach aims to improve the quality of the seed clusters by employing seed words that are both distributionally and morphologically reliable. In particular, we present a novel method for combining morphological and distrib...

متن کامل

A Bootstrapping Method for Extracting Paraphrases of Emotion Expressions from Texts

Because paraphrasing is one of the crucial tasks in natural language understanding and generation, this paper introduces a novel technique to extract paraphrases for emotion terms, from nonparallel corpora. We present a bootstrapping technique for identifying paraphrases, starting with a small number of seeds. WordNet Affect emotion words are used as seeds. The bootstrapping approach learns ext...

متن کامل

A Corpus-based Method for Extracting Paraphrases of Emotion Terms

Since paraphrasing is one of the crucial tasks in natural language understanding and generation, this paper introduces a novel technique to extract paraphrases for emotion terms, from non-parallel corpora. We present a bootstrapping technique for identifying paraphrases, starting with a small number of seeds. WordNet Affect emotion words are used as seeds. The bootstrapping approach learns extr...

متن کامل

A Bootstrapping Architecture for Time Expression Recognition in Unlabelled Corpora via Syntactic-Semantic Patterns

In this paper we describe a semi-supervised approach to the extraction of time expression mentions in large unlabelled corpora based on bootstrapping. Bootstrapping techniques rely on a relatively small amount of initial humansupplied observations (termed “seeds”) of the type of entity or concept to be learned, in order to capture an initial set of patterns or rules from the unlabelled text tha...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013