Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese

Authors

  • Kikuo Maekawa
  • Makoto Yamazaki
  • Takehiko Maruyama
  • Masaya Yamaguchi
  • Hideki Ogura
  • Wakako Kashino
  • Toshinobu Ogiso
  • Hanae Koiso
  • Yasuharu Den
Abstract

Compilation of a 100-million-word balanced corpus, the Balanced Corpus of Contemporary Written Japanese (BCCWJ), is underway at the National Institute for Japanese Language and Linguistics. The corpus covers a wide range of text genres, including books, magazines, newspapers, governmental white papers, textbooks, minutes of the National Diet, and internet text (bulletin boards and blogs), among others. Where possible, samples are drawn from rigidly defined statistical populations by random sampling. All texts are dually POS-analyzed based on two different but mutually related definitions of ‘word.’ Currently, more than 90 million words have been sampled and annotated in XML with respect to text structure and lexical and character information. A preliminary linear discriminant analysis of text genres using POS frequencies and sentence length showed that the genres could be classified with a correct identification rate of 88% for samples of books, newspapers, white papers, and internet bulletin boards. When blog samples were added to this data set, however, the identification rate fell to 68%, suggesting considerable variance among blog texts in register and style.
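The genre classification described in the abstract can be illustrated with a minimal sketch of linear discriminant analysis. This is not the authors' implementation: the two features (a noun ratio standing in for POS frequencies, and mean sentence length), the genre centers, and all data below are synthetic and hypothetical, and the rate is computed on the same samples used for fitting, which is an assumption since the abstract does not state the evaluation procedure.

```python
import math
import random


def fit_lda(samples, labels):
    """Fit LDA on 2-feature samples with a pooled (shared) covariance matrix."""
    classes = sorted(set(labels))
    n = len(samples)
    means, priors = {}, {}
    for c in classes:
        rows = [x for x, y in zip(samples, labels) if y == c]
        means[c] = (sum(r[0] for r in rows) / len(rows),
                    sum(r[1] for r in rows) / len(rows))
        priors[c] = len(rows) / n
    # Pooled within-class covariance: LDA assumes all classes share it.
    sxx = sxy = syy = 0.0
    for x, y in zip(samples, labels):
        dx, dy = x[0] - means[y][0], x[1] - means[y][1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    dof = n - len(classes)
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return classes, means, priors, inv


def predict(model, x):
    """Return the class with the highest linear discriminant score."""
    classes, means, priors, inv = model

    def score(c):
        m = means[c]
        # w = inverse covariance times the class mean
        w = (inv[0][0] * m[0] + inv[0][1] * m[1],
             inv[1][0] * m[0] + inv[1][1] * m[1])
        return (x[0] * w[0] + x[1] * w[1]
                - 0.5 * (m[0] * w[0] + m[1] * w[1])
                + math.log(priors[c]))

    return max(classes, key=score)


# Hypothetical genre centers: (noun ratio, mean sentence length in words).
centers = {
    "book": (0.30, 40.0),
    "newspaper": (0.45, 30.0),
    "bulletin_board": (0.20, 15.0),
}

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
samples, labels = [], []
for genre, (ratio, length) in centers.items():
    for _ in range(200):
        samples.append((rng.gauss(ratio, 0.02), rng.gauss(length, 3.0)))
        labels.append(genre)

model = fit_lda(samples, labels)
correct = sum(predict(model, x) == y for x, y in zip(samples, labels))
accuracy = correct / len(samples)
print(f"identification rate on synthetic data: {accuracy:.2%}")
```

The synthetic genres here are deliberately well separated, so the printed rate says nothing about BCCWJ itself. A genre with high internal variance, as the abstract suggests for blogs, would overlap the other classes and lower the rate.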


Similar resources

Balanced Corpus of Contemporary Written Japanese

Construction of a 100-million-word balanced corpus of contemporary written Japanese is underway at the National Institute for Japanese Language. The unique property of the corpus is that the majority of its sample texts are selected randomly from well-defined statistical populations covering a wide range of written texts.


Reading-Time Annotations for "Balanced Corpus of Contemporary Written Japanese"

The Dundee Eyetracking Corpus contains eyetracking data collected while native speakers of English and French read newspaper editorial articles. Similar resources for other languages are still rare, especially for languages in which words are not overtly delimited with spaces. This is a report on a project to build an eyetracking corpus for Japanese. Measurements were collected while 24 native ...


KOTONOHA and BCCWJ: Development of a Balanced Corpus of Contemporary Written Japanese

The National Institute for Japanese Language (NIJL) has launched a long-term language corpus development initiative aimed at developing a super-corpus called KOTONOHA, which consists of a multitude of independent corpora. Among the constituent corpora of KOTONOHA, the one for which the need is most urgent is a large-scale balanced corpus of present-day written Japanese. Construct...


Text Readability and Word Distribution in Japanese

This paper reports the relation between text readability and word distribution in the Japanese language. There was no similar study in the past due to three major obstacles: (1) unclear definition of Japanese “word”, (2) no balanced corpus, and (3) no readability measure. Compilation of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and development of a readability predictor remov...


An Approach toward Register Classification of Book Samples in the Balanced Corpus of Contemporary Written Japanese

Japanese books are usually classified into ten genres by Nippon Decimal Classification (NDC) based on their subject. However, this classification is sometimes insufficient for corpus studies which describe characteristics of the texts in the book. Here, we propose a method of classifying text samples taken from Japanese books into some registers and text types. Firstly, we discuss useful criter...




Publication date: 2010