Hua Yu: A Word-segmented and Part-Of-Speech Tagged Chinese Corpus
نویسندگان
چکیده
As the outcome of a 3-year joint effort of Department of Computer Science, Tsinghua University and Language Information Processing Institute, Beijing Language and Culture University, Beijing, China, a word-segmented and part-of-speech tagged Chinese corpus with size of 2 million Chinese characters, named HuaYu, has been established. This paper firstly introduces some basics about HuaYu in brief, as its genre distribution, fundamental considerations in designing it, word segmentation and part-of-speech tagging standards. Then the complete list of tag set used in HuaYu is given, along with typical examples for each tag accordingly. Several pieces of annotated texts in each genre are also included at last for reader’s reference. 1 This work is supported by National Natural Science Foundation of China and National Basic Research Development Scheme(973) of China.
منابع مشابه
A Classical Chinese Corpus with Nested Part-of-Speech Tags
We introduce a corpus of classical Chinese poems that has been word segmented and tagged with parts-ofspeech (POS). Due to the ill-defined concept of a ‘word’ in Chinese, previous Chinese corpora suffer from a lack of standardization in word segmentation, resulting in inconsistencies in POS tags, therefore hindering interoperability among corpora. We address this problem with nested POS tags, w...
متن کاملThe Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi
In this paper, we present a segmented and part-of-speech (POS) tagged Archaic Chinese corpus along with its construction process, which is performed by automatic segmentation and tagging with manual correction as post-processing. We use both Modern and Archaic Chinese labeled data for training word segmenter and POS tagger, which are further improved by domain adaptation techniques, as well as ...
متن کاملSINICA CORPUS : Design Methodology for Balanced Corpora
The Academia Sinica Balanced Corpus (Sinica Corpus) is the first balanced Chinese corpus with part-of-speech tagging. The corpus (Sinica 2.0) is open to the research community through the WWW (http://www.sinica.edu.twiftms-binikiwi.sh). Current size of the corpus is 3.5 million words, and the immediate expansion target is five million words. Each text in the corpus is classified and marked acco...
متن کاملChinese Web Scale Linguistic Datasets and Toolkit
The web provides a huge collection of web pages for researchers to study natural languages. However, processing web scale texts is not an easy task and needs many computational and linguistic resources. In this paper, we introduce two Chinese parts-of-speech tagged web-scale datasets and describe tools that make them easy to use for NLP applications. The first is a Chinese segmented and POS-tag...
متن کاملThe Penn Chinese TreeBank: Phrase structure annotation of a large corpus
With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with di erent segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefor...
متن کامل