The Mega-Word Tagged-Corpus Project

Authors

  • Hiroshi Maruyama
  • Shiho Ogino
  • Masaru Hidano
Abstract

Large corpora with part-of-speech tagging play a very important role in recent statistics-based and example-based natural language processing systems. However, no such corpora have become widely available for Japanese so far. Because the Japanese language has no explicit word boundaries, it is impossible even to count words without a corpus that has at least word segmentations. This paper describes our attempts to develop, in a semi-mechanized way, a tagged corpus of over one million words taken from Japanese newspaper articles. After dividing the original text into many chunks, we analyze the first chunk by using a Japanese morphological analyzer and correct the output manually; then, using that result, we improve the morphological analyzer and go on to the next chunk. Thus, the quality of the morphological analyzer increases at each iteration, decreasing the effort required for manual editing of the following chunks. Our experience in the first iteration of this 'boot-strapping' process has been encouraging.
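
The chunk-by-chunk loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' system: the abstract does not say how the analyzer works or how it is improved, so the sketch assumes a toy greedy longest-match segmenter (`segment`) driven by a word lexicon, simulates the human editor with a caller-supplied `correct` function, and "improves" the analyzer simply by adding the corrected words to the lexicon. All names are hypothetical, part-of-speech tags are omitted, and the sample text is Latin-script for readability.

```python
from typing import Callable, List, Set, Tuple


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Toy analyzer (assumption): greedy longest-match against the lexicon.

    Any character that starts no known word becomes a one-character word.
    """
    words: List[str] = []
    max_len = max((len(w) for w in lexicon), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


def bootstrap(
    chunks: List[str],
    correct: Callable[[str, List[str]], List[str]],
    lexicon: Set[str],
) -> Tuple[List[List[str]], List[int]]:
    """Analyze each chunk, have it corrected, then improve the analyzer.

    `correct` stands in for the manual editing step. The returned effort list
    is a crude proxy: the number of corrected words the analyzer did not produce.
    """
    lexicon = set(lexicon)  # work on a copy
    corpus: List[List[str]] = []
    effort: List[int] = []
    for text in chunks:
        auto = segment(text, lexicon)      # machine analysis
        fixed = correct(text, auto)        # manual correction (simulated)
        effort.append(sum(1 for w in fixed if w not in auto))
        corpus.append(fixed)
        lexicon.update(fixed)              # hypothetical "improvement" step
    return corpus, effort


if __name__ == "__main__":
    # Simulated editor: looks up a hand-made reference segmentation.
    reference = {
        "thecatsat": ["the", "cat", "sat"],
        "thecatran": ["the", "cat", "ran"],
    }
    _, effort = bootstrap(
        ["thecatsat", "thecatran"],
        lambda text, auto: reference[text],
        set(),
    )
    print(effort)  # prints [3, 1]
```

In this toy run the second chunk needs fewer corrections than the first because the words learned from the first chunk are already in the lexicon, which mirrors the decreasing manual effort the abstract describes.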

Similar resources

PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...

DutchSemCor: in quest of the ideal sense-tagged corpus

The most-frequent-sense and the predominant domain sense play an important role in the debate on word-sense disambiguation. This discussion is, however, biased by the way sense-tagged corpora are built. In this paper, we argue that current sense-tagged corpora neglect rare senses and contexts and, as a result, do not represent a good corpus for training and testing word-sense disambiguation. We d...

DutchSemCor: Targeting the ideal sense-tagged corpus

Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver ...

DutchSemCor: Building a semantically annotated corpus for Dutch

State of the art Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years, while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor pro...

The Syntactically Annotated ICE Corpus and the Automatic Induction of a Formal Grammar

The International Corpus of English is a corpus of national and regional varieties of English. The mega-word British component has been constructed, grammatically tagged, and syntactically parsed. This article is a description of work that aims at the automatic induction of a wide-coverage grammar from this corpus as well as an empirical evaluation of the grammar. It first of all describes the ...

Publication date: 1993