A Chinese Word Segmentation System Based on Structured Support Vector Machine Utilization of Unlabeled Text Corpus
Authors
Abstract
We participated in both the open and closed tracks on four corpora of the Chinese word segmentation task in the CIPS-SIGHAN-2010 Bake-off. Our experiments used Chinese inner phonology information in all tracks. For the open tracks, we proposed a double-hidden-layer HMM (DHHMM) in which Chinese inner phonology information serves as one hidden layer and the BIO tags as the other. N-best results were first generated with the DHHMM, and the best one was then selected using a new lexical statistic measure. For the closed tracks, we used a CRF model with Chinese inner phonology information as features.
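As a rough illustration of the character-level tagging that such systems rely on, the sketch below converts a segmented sentence into per-character tags and back. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `words_to_tags` and `tags_to_words` are hypothetical, only the B (word-initial) and I (word-internal) tags are used because the abstract does not say how the O tag is assigned, and neither the phonology layer nor the DHHMM itself is modelled.

```python
def words_to_tags(words):
    """Map a list of segmented words to one tag per character.

    Assumption: B marks the first character of a word and I marks
    every following character; the O tag is not used in this sketch.
    """
    tags = []
    for word in words:
        if not word:  # skip empty strings so no stray tag is emitted
            continue
        tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def tags_to_words(chars, tags):
    """Rebuild the word list from characters and their B/I tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            # A B tag (or the very first character) starts a new word.
            words.append(ch)
        else:
            # An I tag extends the word currently being built.
            words[-1] += ch
    return words


segmented = ["我们", "参加", "了", "评测"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

A tagger (HMM, CRF, or otherwise) would predict the tag sequence from the raw characters; decoding with `tags_to_words` then yields the segmentation.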
Similar resources
Chinese Word Segmentation with Conditional Support Vector Inspired Markov Models
The character-based tagging method has achieved great success in Chinese Word Segmentation (CWS). This paper proposes a new approach to improve CWS tagging accuracy by using a structured support vector machine (SVM) to exploit an unlabeled text corpus. First, character N-grams in the unlabeled text corpus are mapped into a low-dimensional space by adopting the SOM algorithm. Then new features extracted from the...
Improving Chinese Word Segmentation by Adopting Self-Organized Maps of Character N-gram
The character-based tagging method has achieved great success in Chinese Word Segmentation (CWS). This paper proposes a new approach to improve CWS tagging accuracy by combining a Self-Organizing Map (SOM) with a structured support vector machine (SVM) to exploit an enormous unlabeled text corpus. First, character N-grams are clustered and mapped into a low-dimensional space by adopting SOM al...
Context-Based Chinese Word Segmentation using SVM Machine-Learning Algorithm without Dictionary Support
This paper presents a new machine-learning Chinese word segmentation (CWS) approach, which defines CWS as a break-point classification problem; the break point is the boundary of two subsequent words. Further, this paper exploits a support vector machine (SVM) classifier, which learns the segmentation rules of the Chinese language from a context model of break points in a corpus. Additionally, ...
Term Contributed Boundary Feature using Conditional Random Fields for Chinese Word Segmentation Task
This paper proposes a novel feature for the conditional random field (CRF) model in a Chinese word segmentation system. The system uses a conditional random field as its machine learning model with one simple feature called term contributed boundaries (TCB) in addition to the “BIEO” character-based label scheme. TCB can be extracted from unlabeled corpora automatically, and segmentation variations of dif...
Combination of Machine Learning Methods for Optimum Chinese Word Segmentation
This article presents our recent work for participation in the Second International Chinese Word Segmentation Bakeoff. Our system performs two procedures: out-of-vocabulary extraction and word segmentation. We compose three out-of-vocabulary extraction modules: character-based tagging with different classifiers (maximum entropy, support vector machines, and conditional random fields). We also co...