Term Contributed Boundary Tagging by Conditional Random Fields for SIGHAN 2010 Chinese Word Segmentation Bakeoff
نویسندگان
چکیده
This paper presents a Chinese word segmentation system submitted to the closed training evaluations of CIPSSIGHAN-2010 bakeoff. The system uses a conditional random field model with one simple feature called term contributed boundaries (TCB) in addition to the “BI” character-based tagging approach. TCB can be extracted from unlabeled corpora automatically, and segmentation variations of different domains are expected to be reflected implicitly. The experiment result shows that TCB does improve “BI” tagging domainindependently about 1% of the F1 measure score.
منابع مشابه
Term Contributed Boundary Feature using Conditional Random Fields for Chinese Word Segmentation Task
This paper proposes a novel feature for conditional random field (CRF) model in Chinese word segmentation system. The system uses a conditional random field as machine learning model with one simple feature called term contributed boundaries (TCB) in addition to the “BIEO” character-based label scheme. TCB can be extracted from unlabeled corpora automatically, and segmentation variations of dif...
متن کاملUsing Part-of-Speech Reranking to Improve Chinese Word Segmentation
Chinese word segmentation and Part-ofSpeech (POS) tagging have been commonly considered as two separated tasks. In this paper, we present a system that performs Chinese word segmentation and POS tagging simultaneously. We train a segmenter and a tagger model separately based on linear-chain Conditional Random Fields (CRF), using lexical, morphological and semantic features. We propose an approx...
متن کاملEnhancement of Feature Engineering for Conditional Random Field Learning in Chinese Word Segmentation Using Unlabeled Data
This work proposes a unified view of several features based on frequent strings extracted from unlabeled data that improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based n-gram (CNG), accessor variety based string (AVS) and its variation of left-right co-existed feature (LRAVS), term-contributed frequency (TCF), and term-con...
متن کاملA Study of Chinese Lexical Analysis Based on Discriminative Models
This paper briefly describes our system in The Fourth SIGHAN Bakeoff. Discriminative models including maximum entropy model and conditional random fields are utilized in Chinese word segmentation and named entity recognition with different tag sets and features. Transformation-based learning model is used in part-of-speech tagging. Evaluation shows that our system achieves the F-scores: 92.64% ...
متن کاملCombination of Machine Learning Methods for Optimum Chinese Word Segmentation
This article presents our recent work for participation in the Second International Chinese Word Segmentation Bakeoff. Our system performs two procedures: Out-ofvocabulary extraction and word segmentation. We compose three out-of-vocabulary extraction modules: Character-based tagging with different classifiers – maximum entropy, support vector machines, and conditional random fields. We also co...
متن کامل