Unigram Language Model for Chinese Word Segmentation
Abstract
This paper describes a Chinese word segmentation system that uses a unigram language model to resolve segmentation ambiguities. The system is augmented with a set of pre-processors and post-processors to extract new words in...
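To make the idea of resolving segmentation ambiguities with a unigram model concrete, here is a minimal sketch, not the system described in the paper. It assumes a toy dictionary of word probabilities (the values are invented for illustration), a hypothetical `max_word_len` limit, and a fixed `unknown_logp` penalty for single characters missing from the dictionary. It picks the segmentation with the highest product of word probabilities by dynamic programming.

```python
import math


def segment(text, word_probs, max_word_len=4, unknown_logp=-20.0):
    """Return the most probable segmentation of `text` under a unigram model.

    word_probs   -- dict mapping word -> probability (toy values, assumed)
    max_word_len -- longest candidate word considered (assumed limit)
    unknown_logp -- log-probability given to an unseen single character
    """
    n = len(text)
    # best[i] is the best log-probability of any segmentation of text[:i].
    best = [0.0] + [-math.inf] * n
    # back[i] is the start index of the last word in that best segmentation.
    back = [0] * (n + 1)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if piece in word_probs:
                logp = math.log(word_probs[piece])
            elif len(piece) == 1:
                # Fallback so that every input can be segmented.
                logp = unknown_logp
            else:
                continue
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy probabilities, invented for this example only.
TOY_PROBS = {
    "研究": 0.02,
    "研究生": 0.005,
    "生命": 0.01,
    "命": 0.001,
    "起源": 0.005,
}

if __name__ == "__main__":
    print(segment("研究生命起源", TOY_PROBS))
```

With these toy values the ambiguous prefix can be read as 研究 / 生命 or as 研究生 / 命. The first reading has the higher unigram product, so the sketch returns 研究, 生命, 起源. New-word extraction, which the abstract assigns to the pre-processors and post-processors, is not modeled here.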
Similar resources
Contextual Dependencies in Unsupervised Word Segmentation
Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic...
Chinese Unknown Word Identification Based on Local Bigram Model
This paper presents a Chinese unknown word identification system based on a local bigram model. Generally, our word segmentation system employs a statistical-based unigram model. But to identify those unknown words, we take advantage of their contextual information and apply a bigram model locally. By adjusting the value of interpolation which is derived from a smoothing method, we combine thes...
Chinese Unknown Word Identification Based on Local Bigram Model with Integrally Smoothing Assumption
The paper presents a Chinese unknown word identification system based on a local bigram model. Generally, our word segmentation system employs a statistical-based unigram model. But to identify those unknown words, we take advantage of their contextual information and apply a bigram model locally. To explain this local approximation, we make an “integrally smoothing assumption”. As a simplifica...
A Projection Extension Algorithm for Statistical Machine Translation
In this paper, we describe a phrase-based unigram model for statistical machine translation that uses a much simpler set of model parameters than similar phrase-based models. The units of translation are blocks – pairs of phrases. During decoding, we use a block unigram model and a word-based trigram language model. During training, the blocks are learned from source interval projections using a...
Closed-Set Chinese Word Segmentation Based on Convolutional Neural Network Model
This paper proposes a neural model for closed-set Chinese word segmentation. The model follows the character-based approach, which assigns a class label to each character, indicating its relative position within the word it belongs to. To do so, it first constructs shallow representations of characters by fusing unigram and bigram information in a limited context window via an element-wise maximum...
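The character-based approach mentioned in the last abstract above labels each character with its position inside a word. The abstract does not name a label set, so the sketch below assumes the common B/M/E/S scheme (begin, middle, end, single) and shows only the conversion between a word sequence and per-character labels, not the neural model itself. The function names are hypothetical.

```python
def words_to_tags(words):
    """Convert a list of words into one position tag per character (B/M/E/S)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # First character, zero or more middle characters, last character.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the word sequence from characters and their position tags."""
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:
        # Tolerate a tag sequence that ends in the middle of a word.
        words.append(current)
    return words


if __name__ == "__main__":
    example = ["研究", "生命", "的", "起源"]
    tags = words_to_tags(example)
    print(tags)
    print(tags_to_words("".join(example), tags))
```

Under this assumed scheme, segmentation becomes a sequence-labeling task: a classifier predicts one tag per character, and `tags_to_words` turns the predicted tags back into words.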