Nonparametric Bayesian Semi-supervised Word Segmentation

Authors

Ryo Fujii

Ryo Domoto

Daichi Mochihashi

Abstract

This paper presents a novel hybrid generative/discriminative model of word segmentation based on nonparametric Bayesian methods. Unlike ordinary discriminative word segmentation, which relies only on labeled data, our semi-supervised model also leverages huge amounts of unlabeled text to automatically learn new “words”, and further constrains them using labeled data to segment non-standard texts such as those found in social networking services. Specifically, our hybrid model combines a discriminative classifier (CRF; Lafferty et al. (2001)) and unsupervised word segmentation (NPYLM; Mochihashi et al. (2009)), with a transparent exchange of information between these two model structures within the semi-supervised framework (JESS-CM; Suzuki and Isozaki (2008)). We confirmed that it can appropriately segment non-standard texts like those on Twitter and Weibo and has nearly state-of-the-art accuracy on standard datasets in Japanese, Chinese, and Thai.
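As a rough illustration of the hybrid idea in the abstract, the sketch below decodes a string by mixing a generative word score and a discriminative word score log-linearly. It is not the paper's implementation: the actual model uses a CRF and an NPYLM trained jointly under JESS-CM, whereas here both scorers are hand-set toy functions, and every name (`segment`, `gen_logprob`, `disc_score`, `lam_gen`, `lam_disc`, the lexicon, and the labeled word set) is a hypothetical choice made for this example.

```python
import math

def segment(text, gen_logprob, disc_score, lam_gen=1.0, lam_disc=1.0, max_len=4):
    """Return the best segmentation of `text` under a log-linear mix of
    a generative word score and a discriminative word score (toy sketch)."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best score of a segmentation of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            score = (best[start]
                     + lam_gen * gen_logprob(word)
                     + lam_disc * disc_score(word))
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical stand-in for a language model learned from unlabeled text.
LEXICON = {"the": 0.3, "cats": 0.3, "at": 0.3, "cat": 0.1, "sat": 0.1}

def toy_gen_logprob(word):
    # Unknown strings pay a heavy per-character penalty (assumed value).
    if word in LEXICON:
        return math.log(LEXICON[word])
    return len(word) * math.log(1e-6)

# Hypothetical stand-in for a classifier trained on labeled segmentations.
LABELED_WORDS = {"cat", "sat"}

def toy_disc_score(word):
    return 2.0 if word in LABELED_WORDS else 0.0

generative_only = segment("thecatsat", toy_gen_logprob, toy_disc_score, lam_disc=0.0)
hybrid = segment("thecatsat", toy_gen_logprob, toy_disc_score, lam_disc=1.0)
print(generative_only)
print(hybrid)
```

With the discriminative weight set to zero, the toy generative scores alone prefer "the cats at"; turning the weight on lets the labeled evidence pull the result to "the cat sat". In the paper the weights and both component models are learned rather than fixed by hand.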

Related resources

Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation

Words in Chinese text are not naturally separated by delimiters, which poses a challenge to standard machine translation (MT) systems. In MT, the widely used approach is to apply a Chinese word segmenter trained from manually annotated data, using a fixed lexicon. Such word segmentation is not necessarily optimal for translation. We propose a Bayesian semi-supervised Chinese word segmentation m...

Minimally-Supervised Morphological Segmentation using Adaptor Grammars

This paper explores the use of Adaptor Grammars, a nonparametric Bayesian modelling framework, for minimally supervised morphological segmentation. We compare three training methods: unsupervised training, semi-supervised training, and a novel model selection method. In the model selection method, we train unsupervised Adaptor Grammars using an over-articulated metagrammar, then use a small labe...

Unsupervised and Semi-supervised Myanmar Word Segmentation Approaches for Statistical Machine Translation

In statistical machine translation (SMT), word segmentation is generally a necessary step for languages that do not naturally delimit words. For many low-resource languages there are no word segmentation tools, and research on word segmentation for these languages is often quite scarce. In this paper, we study several plausible methods for Myanmar word segmentation for machine translation in or...

Bayesian Unsupervised Word Segmentation with Hierarchical Language Modeling

This paper proposes a novel unsupervised morphological analyzer for arbitrary languages that needs neither supervised segmentation nor a dictionary. Assuming a string is the output of a nonparametric Bayesian hierarchical n-gram language model of words and characters, “words” are iteratively estimated during inference by a combination of MCMC and efficient dynamic programming. This model c...

Inducing Word and Part-of-Speech with Pitman-Yor Hidden Semi-Markov Models

We propose a nonparametric Bayesian model for joint unsupervised word segmentation and part-of-speech tagging from raw strings. Extending a previous model for word segmentation, our model is called a Pitman-Yor Hidden Semi-Markov Model (PYHSMM) and can be regarded as a method to build a class n-gram language model directly from strings, while integrating character and word level information. Experime...

Journal: TACL

Volume 5 (issue not listed)

Pages: not listed

Publication date: 2017
