Human-Computer Interactive Chinese Word Segmentation: An Adaptive Dirichlet Process Mixture Model Approach
نویسندگان
چکیده
Previous research shows that Kalman filter based human-computer interactive Chinese word segmentation achieves an encouraging effect in reducing user interventions, but suffers from the drawback of incompetence in distinguishing segmentation ambiguities. This paper proposes a novel approach to handle this problem by using an adaptive Dirichlet process mixture model. By adjusting the hyperparameters of the model, ideal classifiers can be generated to conform to the interventions provided by the users. Experiments reveal that our approach achieves a notable improvement in handling segmentation ambiguities. With knowledge learnt from users, our model outperforms the baseline Kalman filter model by about 0.5% in segmenting homogeneous texts.
منابع مشابه
Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach
This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in rea...
متن کاملRegion Segmentation based on Gaussian Dirichlet Process Mixture Model and its Application to 3D Geometric Stricture Detection
In general, image-based 3D scenes can now be found in many popular vision systems, computer games and virtual reality tours. So, It is important to segment ROI (region of interest) from input scenes as a preprocessing step for geometric stricture detection in 3D scene. In this paper, we propose a method for segmenting ROI based on tensor voting and Dirichlet process mixture model. In particular...
متن کاملBayesian Order-Adaptive Clustering for Video Segmentation
Video segmentation requires the partitioning of a series of images into groups that are both spatially coherent and smooth along the time axis. We formulate segmentation as a Bayesian clustering problem. Context information is propagated over time by a conjugate structure. The level of segment resolution is controlled by a Dirichlet process prior. Our contributions include a conjugate nonparame...
متن کاملSemi-automatic Annotation of Chinese Word Structure
Chinese word structure annotation is potentially useful for many NLP tasks, especially for Chinese word segmentation. Li and Zhou (2012) have presented an annotation for word structures in the Penn Chinese Treebank. But they only consider words that have productive affixes, which covers 35% of word types in that corpus. In this paper, we propose a linguistically inspired annotation that covers ...
متن کاملSequence segmentation for statistical machine translation
In the last decade, while statistical machine translation has advanced significantly, there is still much room for further improvements relating to many natural language processing tasks such as word segmentation, word alignment and parsing. Human language is composed of sequences of meaningful units. These sequences can be words, phrases, sentences or even articles serving as basic elements in...
متن کامل