The Mega-Word Tagged-Corpus Project
Authors
Abstract
Large corpora with part-of-speech tagging play a very important role in recent statistics-based and example-based natural language processing systems. However, no such corpora have become widely available for Japanese so far. Because the Japanese language has no explicit word boundaries, it is impossible even to count words without a corpus that has at least word segmentations. This paper describes our attempts to develop, in a semi-mechanized way, a tagged corpus of over one million words taken from Japanese newspaper articles. After dividing the original text into many chunks, we analyze the first chunk with a Japanese morphological analyzer and correct the output manually; then, using that result, we improve the morphological analyzer and go on to the next chunk. Thus, the quality of the morphological analyzer increases at each iteration, decreasing the effort required for manual editing of the following chunks. Our experience in the first iteration of this 'boot-strapping' process has been encouraging.
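The chunk-by-chunk loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' system: `segment` is a toy greedy longest-match segmenter standing in for the morphological analyzer, `correct` is a hypothetical callback standing in for the human editor, and "improving the analyzer" is reduced to adding the corrected words to a lexicon.

```python
from typing import Callable, List, Set, Tuple


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Toy stand-in for a morphological analyzer (an assumption):
    greedy longest match against the lexicon, single characters otherwise."""
    words: List[str] = []
    max_len = max((len(w) for w in lexicon), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def bootstrap(
    chunks: List[str],
    correct: Callable[[str, List[str]], List[str]],
    lexicon: Set[str],
) -> Tuple[List[List[str]], List[int]]:
    """Analyze each chunk, have it corrected, then feed the corrections back."""
    corpus: List[List[str]] = []
    fixes_per_chunk: List[int] = []
    for chunk in chunks:
        draft = segment(chunk, lexicon)   # automatic analysis
        gold = correct(chunk, draft)      # manual correction (simulated by the caller)
        # Rough effort proxy: corrected words the analyzer failed to produce.
        fixes_per_chunk.append(sum(1 for w in gold if w not in draft))
        lexicon.update(gold)              # "improve" the analyzer for the next chunk
        corpus.append(gold)
    return corpus, fixes_per_chunk
```

Called with two identical chunks and a `correct` callback that always returns the same reference segmentation, the second entry of `fixes_per_chunk` is smaller than the first, which mirrors the decreasing editing effort the abstract reports. A real analyzer would also update tags and connection costs, not just a word list.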
Related resources
PAYMA: A Tagged Corpus of Persian Named Entities
The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...
DutchSemCor: in quest of the ideal sense-tagged corpus
The most-frequent-sense and the predominant domain sense play an important role in the debate on word-sense disambiguation. This discussion is, however, biased by the way sense-tagged corpora are built. In this paper, we argue that current sense-tagged corpora neglect rare senses and contexts and, as a result, do not represent a good corpus for training and testing word-sense disambiguation. We d...
DutchSemCor: Targeting the ideal sense-tagged corpus
Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver ...
DutchSemCor: Building a semantically annotated corpus for Dutch
State of the art Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years, while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor pro...
The Syntactically Annotated ICE Corpus and the Automatic Induction of a Formal Grammar
The International Corpus of English is a corpus of national and regional varieties of English. The mega-word British component has been constructed, grammatically tagged, and syntactically parsed. This article is a description of work that aims at the automatic induction of a wide-coverage grammar from this corpus as well as an empirical evaluation of the grammar. It first of all describes the ...