POC-NLW Template for Chinese Word Segmentation

Authors

  • Bo Chen
  • Weiran Xu
  • Tao Peng
  • Jun Guo
Abstract

This paper presents a language tagging template named POC-NLW (position of a character within an n-length word). Based on this template, a two-stage statistical model for Chinese word segmentation is constructed. Basic word segmentation uses an n-gram language model, and a hidden Markov tagger based on the POC-NLW template performs out-of-vocabulary (OOV) word identification. The system participated in the MSRA_Close and UPUC_Close word segmentation tracks at SIGHAN Bakeoff 2006, and the results returned by the bakeoff are reported here.
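To make the idea of a POC-NLW tag concrete, the sketch below labels each character of an already segmented sentence with its position in the word and the length of that word, then recovers the words from the tags. The abstract does not give the tag encoding, so the `(position, length)` tuple and the function names `poc_nlw_tags` and `tags_to_words` are illustrative assumptions rather than the authors' implementation, and the n-gram and hidden Markov stages are not modeled here.

```python
from typing import List, Tuple

# Assumed encoding (not from the paper): a tag is (position, length),
# where position is 1-based and length is the word's character count.
Tag = Tuple[int, int]


def poc_nlw_tags(words: List[str]) -> List[Tuple[str, Tag]]:
    """Label every character with its position inside an n-length word."""
    tagged = []
    for word in words:
        n = len(word)
        for i, ch in enumerate(word, start=1):
            tagged.append((ch, (i, n)))
    return tagged


def tags_to_words(tagged: List[Tuple[str, Tag]]) -> List[str]:
    """Rebuild words from tagged characters: position 1 starts a new word."""
    words: List[str] = []
    current = ""
    for ch, (pos, _n) in tagged:
        if pos == 1 and current:
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words


tagged = poc_nlw_tags(["中国", "人"])
# tagged == [("中", (1, 2)), ("国", (2, 2)), ("人", (1, 1))]
assert tags_to_words(tagged) == ["中国", "人"]
```

In a tagger built on such a scheme, a sequence model would predict the tag for each character and a step like `tags_to_words` would turn the predictions back into a segmentation. A practical tag set would probably also cap the word length so that the number of distinct tags stays finite.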

Similar resources

A Hybrid Model for Chinese Word Segmentation

This paper describes a hybrid model that combines machine learning with linguistic and statistical heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two major components: a tagging component that annotates each character in a Chinese sentence with a position-of-character (POC) tag that indicates its position in a word, and a merging com...

Combining Machine Learning with Linguistic Heuristics for Chinese Word Segmentation

This paper describes a hybrid model that combines machine learning with linguistic heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two components: a position-of-character (POC) tagging component that annotates each character in a sentence with a POC tag that indicates its position in a word, and a merging component that transforms a P...

Word Boundary Token Model for the SIGHAN Bakeoff 2007

This paper describes a Chinese word segmentation system based on a word boundary token model and a triple template matching model for extracting unknown words, and a word support model for resolving segmentation ambiguity.

On Closed Task of Chinese Word Segmentation: An Improved CRF Model Coupled with Character Clustering and Automatically Generated Template Matching

This paper addresses two major problems in the closed task of Chinese word segmentation (CWS): tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. To resolve the former, we apply K-means clustering to identify non-Chinese characters, and then adopt a two-tagger architecture: one for Chinese text and the other for non-Chinese text. For the latter problem,...

Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling

This paper treats Chinese word segmentation as a character-based tagging problem under the conditional random field framework. Unlike existing work, which focuses only on feature templates, our method considers both feature template selection and tag set selection. Thus, there comes an empirical comparison study of performance among differe...

Publication date: 2006