Data collection and annotation for state-of-the-art NER using unmanaged crowds

Authors

  • Spencer Rothwell
  • Steele Carter
  • Ahmad Elshenawy
  • Vladislavs Dovgalecs
  • Safiyyah Saleem
  • Daniela Braga
  • Bob Kennewick
Abstract

This paper presents strategies for generating entity-level annotated text utterances using unmanaged crowds. These utterances are then used to build state-of-the-art Named Entity Recognition (NER) models, a required component of dialogue systems. First, a wide variety of raw utterances is collected through a variant elicitation task. We ensure that these utterances are relevant by feeding them back to the crowd for a domain validation task. We also flag utterances with potential spelling errors and verify these errors with the crowd before discarding them. These strategies, combined with a periodic CAPTCHA to prevent automated responses, allow us to collect high-quality text utterances even though the traditional gold test question approach to spam filtering cannot be used. The utterances are then tagged with appropriate NER labels using unmanaged crowds. The crowd annotation was 23% more accurate and 29% more consistent than in-house annotation.
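The filtering steps described in the abstract (domain validation, then crowd verification of flagged spelling errors) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the vote labels (`in_domain`, `out_of_domain`, `error`, `ok`), and the use of a simple majority vote to aggregate crowd judgments, since the abstract does not say how judgments are combined.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among crowd votes, or None if empty or tied.

    Majority voting is an assumed aggregation rule, not the paper's stated method.
    """
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no clear crowd decision
    return counts[0][0]


def filter_utterances(utterances, domain_votes, spelling_votes):
    """Keep utterances judged in-domain and not confirmed as misspelled.

    domain_votes:   utterance -> list of hypothetical labels "in_domain" / "out_of_domain"
    spelling_votes: utterance -> list of hypothetical labels "error" / "ok",
                    present only for utterances flagged as possibly misspelled
    """
    kept = []
    for utterance in utterances:
        # Domain validation: keep only utterances the crowd judged relevant.
        if majority_label(domain_votes.get(utterance, [])) != "in_domain":
            continue
        # Spelling check: discard only when the crowd confirms the flagged error.
        if utterance in spelling_votes and majority_label(spelling_votes[utterance]) == "error":
            continue
        kept.append(utterance)
    return kept


# Toy data, invented for this example.
utterances = ["play jazz music", "what is the weather", "plya some rock", "play rok music"]
domain_votes = {
    "play jazz music": ["in_domain", "in_domain", "out_of_domain"],
    "what is the weather": ["out_of_domain", "out_of_domain", "in_domain"],
    "plya some rock": ["in_domain", "in_domain", "in_domain"],
    "play rok music": ["in_domain", "in_domain", "in_domain"],
}
spelling_votes = {
    "plya some rock": ["error", "error", "ok"],
    "play rok music": ["ok", "ok", "error"],
}
kept = filter_utterances(utterances, domain_votes, spelling_votes)
```

In this toy run, an utterance is dropped either because the crowd judged it out of domain or because the crowd confirmed a flagged spelling error; a flagged utterance that the crowd judges acceptable is kept, mirroring the "verify before discarding" step.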

Similar resources

PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...

Weakly Supervised Cross-Lingual Named Entity Recognition via Effective Annotation and Representation Projection

The state-of-the-art named entity recognition (NER) systems are supervised machine learning models that require large amounts of manually annotated data to achieve high accuracy. However, annotating NER data by human is expensive and time-consuming, and can be quite difficult for a new language. In this paper, we present two weakly supervised approaches for cross-lingual NER with no human annot...

SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data

We present SWELLSHARK, a framework for building biomedical named entity recognition (NER) systems quickly and without hand-labeled data. Our approach views biomedical resources like lexicons as function primitives for autogenerating weak supervision. We then use a generative model to unify and denoise this supervision and construct large-scale, probabilistically labeled datasets for training hi...

Harnessing Diversity in Crowds and Machines for Better NER Performance

Over the last years, information extraction tools have gained a great popularity and brought significant performance improvement in extracting meaning from structured or unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to th...

State of the art in Turkish Named Entity Recognition

Named entity recognition (NER), which provides useful information for many high level NLP applications and semantic web technologies, is a well-studied topic for most of the languages and especially for English. However the studies for Turkish, which is a morphologically richer and lesser-studied language, have fallen behind these for a long while. In recent years, Turkish NER intrigued researc...

Publication date: 2015