The Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi

نویسندگان

  • Kam Tang Lau
  • Yan Song
  • Fei Xia
چکیده

In this paper, we present a segmented and part-of-speech (POS) tagged Archaic Chinese corpus along with its construction process, which is performed by automatic segmentation and tagging with manual correction as post-processing. We use both Modern and Archaic Chinese labeled data for training word segmenter and POS tagger, which are further improved by domain adaptation techniques, as well as by adding linguistic and morphological features derived from the characteristics of Archaic Chinese language. The experimental results showed the effectiveness of our approach. In particular, the domain adaptation techniques and the added features significantly improve POS tagging performance. During our manual correction, we categorize the errors resulted from the automatic segmentation and POS tagging process, and investigate the sources of those errors. Finally, we give the statistics of the resulted corpus on the distributions of words and POS tags. Our work is a preliminary study that could be easily extended to annotating other Archaic Chinese text, and the resulted corpus is a valuable resource for research on archaic Chinese language.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Modern Chinese Helps Archaic Chinese Processing: Finding and Exploiting the Shared Properties

Languages change over time and ancient languages have been studied in linguistics and other related fields. A main challenge in this research area is the lack of empirical data; for instance, ancient spoken languages often leave little trace of their linguistic properties. From the perspective of natural language processing (NLP), while the NLP community has created dozens of annotated corpora,...

متن کامل

A Classical Chinese Corpus with Nested Part-of-Speech Tags

We introduce a corpus of classical Chinese poems that has been word segmented and tagged with parts-ofspeech (POS). Due to the ill-defined concept of a ‘word’ in Chinese, previous Chinese corpora suffer from a lack of standardization in word segmentation, resulting in inconsistencies in POS tags, therefore hindering interoperability among corpora. We address this problem with nested POS tags, w...

متن کامل

Hua Yu: A Word-segmented and Part-Of-Speech Tagged Chinese Corpus

As the outcome of a 3-year joint effort of Department of Computer Science, Tsinghua University and Language Information Processing Institute, Beijing Language and Culture University, Beijing, China, a word-segmented and part-of-speech tagged Chinese corpus with size of 2 million Chinese characters, named HuaYu, has been established. This paper firstly introduces some basics about HuaYu in brief...

متن کامل

SINICA CORPUS : Design Methodology for Balanced Corpora

The Academia Sinica Balanced Corpus (Sinica Corpus) is the first balanced Chinese corpus with part-of-speech tagging. The corpus (Sinica 2.0) is open to the research community through the WWW (http://www.sinica.edu.twiftms-binikiwi.sh). Current size of the corpus is 3.5 million words, and the immediate expansion target is five million words. Each text in the corpus is classified and marked acco...

متن کامل

Construction of Chinese Segmented and POS-tagged Conversational Corpora and Their Evaluations on Spontaneous Speech Recognitions

The performance of a corpus-based language and speech processing system depends heavily on the quantity and quality of the training corpora. Although several famous Chinese corpora have been developed, most of them are mainly written text. Even for some existing corpora that contain spoken data, the quantity is insufficient and the domain is limited. In this paper, we describe the development o...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013