Chinese Text Segmentation With MBDP-1: Making the Most of Training Corpora
نویسندگان
چکیده
منابع مشابه
An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using characterlevel features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iter...
متن کاملAdapting Chinese Word Segmentation for Machine Translation Based on Short Units
In Chinese texts, words composed of single or multiple characters are not separated by spaces, unlike most western languages. Therefore Chinese word segmentation is considered an important first step in machine translation (MT) and its performance impacts MT results. Many factors affect Chinese word segmentations, including the segmentation standards and segmentation strategies. The performance...
متن کاملChinese Work Segmentation without Using Lexicon and Hand-crafted Training Data
Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without making use of any lexicon and hand-crafted linguistic resource. The statistical data required by the algorithm, that is, mutual information and the difference of t-score between characters, is derived automatically from raw Chinese corpora. The prelimin...
متن کامل應用平行語料建構中文斷詞組件 (Applications of Parallel Corpora for Chinese Segmentation) [In Chinese]
Instead of directly providing the service of Chinese segmentation, some open-source software allows us to train segmentation models with segmented text. The resulting models can perform quite well, if training data of high quality are available. In reality, it is not easy to obtain sufficient and excellent training data, unfortunately. We report an exploration of using parallel corpora and vari...
متن کاملImproving Word Segmentation by Simultaneously Learning Phonotactics
The most accurate unsupervised word segmentation systems that are currently available (Brent, 1999; Venkataraman, 2001; Goldwater, 2007) use a simple unigram model of phonotactics. While this simplifies some of the calculations, it overlooks cues that infant language acquisition researchers have shown to be useful for segmentation (Mattys et al., 1999; Mattys and Jusczyk, 2001). Here we explore...
متن کامل