Term Contributed Boundary Feature using Conditional Random Fields for Chinese Word Segmentation Task

Authors

  • Mike Tian-Jian Jiang
  • Shih-Hung Liu
  • Cheng-Lung Sung
  • Wen-Lian Hsu
Abstract

This paper proposes a novel feature for the conditional random field (CRF) model in a Chinese word segmentation system. The system uses a CRF as its machine learning model, with one simple feature called term contributed boundaries (TCB) in addition to the “BIEO” character-based label scheme. TCB can be extracted automatically from unlabeled corpora, and segmentation variations across domains are expected to be reflected implicitly. The dataset used in this paper is from the closed training task of the CIPS-SIGHAN-2010 bakeoff, which includes simplified and traditional Chinese texts. The experimental results show that TCB improves “BIEO” tagging by about 1% in F1 measure, independently of domain.
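To make the character-based label scheme concrete, the following is a minimal sketch of how a segmented sentence could be converted to per-character labels and back. The abstract does not define the individual “BIEO” labels, so the reading used here (B = first character of a word, I = inside, E = last character, O = single-character word) is an assumption, and the function names `words_to_labels` and `labels_to_words` are hypothetical. The TCB feature itself is not shown, since the abstract does not describe how it is computed.

```python
def words_to_labels(words):
    """Map a segmented sentence (a list of words) to one label per character.

    Assumed label meanings (not confirmed by the abstract):
    B = first character of a multi-character word, I = inside,
    E = last character, O = single-character word.
    """
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("O")
        else:
            labels.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return labels


def labels_to_words(chars, labels):
    """Rebuild words from characters and their predicted labels."""
    words, current = [], ""
    for ch, label in zip(chars, labels):
        current += ch
        # Under the assumed scheme, a word closes on E or O.
        if label in ("E", "O"):
            words.append(current)
            current = ""
    if current:  # keep any trailing characters from an ill-formed label sequence
        words.append(current)
    return words


segmented = ["我", "喜欢", "自然语言"]
labels = words_to_labels(segmented)
restored = labels_to_words("".join(segmented), labels)
```

In a CRF-based segmenter, labels of this kind would be the prediction targets for each character, and additional features such as TCB would be supplied as extra observation columns alongside the characters.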

Similar resources

Term Contributed Boundary Tagging by Conditional Random Fields for SIGHAN 2010 Chinese Word Segmentation Bakeoff

This paper presents a Chinese word segmentation system submitted to the closed training evaluations of the CIPS-SIGHAN-2010 bakeoff. The system uses a conditional random field model with one simple feature called term contributed boundaries (TCB) in addition to the “BI” character-based tagging approach. TCB can be extracted automatically from unlabeled corpora, and segmentation variations of differe...

Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation

Wen-lian Hsu (Institute of Information Science, Academia Sinica). This work presents several unsupervised feature selections based on frequent strings that help improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based N-gram (CNG), Accessor Variety based string (AVS), and Term Contributed Frequency (TC...

Enhancement of Feature Engineering for Conditional Random Field Learning in Chinese Word Segmentation Using Unlabeled Data

This work proposes a unified view of several features based on frequent strings extracted from unlabeled data that improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based n-gram (CNG), accessor variety based string (AVS) and its variation of left-right co-existed feature (LRAVS), term-contributed frequency (TCF), and term-con...

Rules-based Chinese Word Segmentation on MicroBlog for CIPS-SIGHAN on CLP2012

In this evaluation, we took part in the task of Word Segmentation on Chinese MicroBlog. After analysing the features of MicroBlog text and the results of our original Chinese word segmentation system, we propose four optimization rules to improve the segmentation algorithm for Chinese word segmentation on MicroBlog corpora. The optimized segmentation system is based on c...

Chinese Word Segmentation based on Mixing Multiple Preprocessor and CRF

This paper describes the Chinese word segmenter used for our participation in the CIPS-SIGHAN-2010 bake-off task of Chinese word segmentation. We formalize the tasks as sequence tagging problems and implement them using a conditional random fields (CRFs) model. The system contains two modules: a multiple preprocessor and a basic segmenter. The basic segmenter is designed as a problem of character-based tagg...

Publication date: 2010