High OOV-Recall Chinese Word Segmenter
نویسندگان
چکیده
For the competition of Chinese word segmentation held in the first CIPS-SIGHNA joint conference. We applied a subwordbased word segmenter using CRFs and extended the segmenter with OOV words recognized by Accessor Variety. Moreover, we proposed several post-processing rules to improve the performance. Our system achieved promising OOV recall among all the participants.
منابع مشابه
The Character-based CRF Segmenter of MSRA&NEU for the 4th Bakeoff
This paper describes the Chinese Word Segmenter for the fourth International Chinese Language Processing Bakeoff. Base on Conditional Random Field (CRF) model, a basic segmenter is designed as a problem of character-based tagging. To further improve the performance of our segmenter, we employ a word-based approach to increase the in-vocabulary (IV) word recall and a post-processing to increase ...
متن کاملTowards a Hybrid Model for Chinese Word Segmentation
This paper describes a hybrid Chinese word segmenter that is being developed as part of a larger Chinese unknown word resolution system. The segmenter consists of two components: a tagging component that uses the transformation-based learning algorithm to tag each character with its position in a word, and a merging component that transforms a tagged character sequence into a word-segmented sen...
متن کاملDiscriminative Boosting from Dictionary and Raw Text - A Novel Approach to Build A Chinese Word Segmenter
Chinese word segmentation (CWS) is a basic and important task for Chinese information processing. Standard approaches to CWS treat it as a sequence labelling task. Without manually annotated corpora, these approaches are ineffective. When a dictionary is available, dictionary maximum matching (DMM) is a good alternative. However, its performance is far from perfect due to the poor ability on ou...
متن کاملEnhancing Chinese Word Segmentation Using Unlabeled Data
This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In ad...
متن کاملChinese Word Segmentation for Terrorism-Related Contents
In order to analyze security and terrorism related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistic-based and dictionary-based approaches. The pure statistic methods have lower precision, while the pure dictionary-based method cannot deal with new words and ...
متن کامل