Memory-Efficient Katakana Compound Segmentation using Conditional Random Fields

Authors

  • Krauchanka Siarhei
  • Artsimenya Artsiom
Abstract

The absence of explicit word boundary delimiters, such as spaces, in Japanese text causes a range of problems for Japanese morphological analysis systems. In particular, out-of-vocabulary words are a serious problem for systems that rely on dictionary data to establish word boundaries. In this paper we present a solution for decompounding katakana sequences (one of the main sources of out-of-vocabulary words) using a discriminative model based on Conditional Random Fields. Notable features of the proposed approach are its simplicity and memory efficiency.
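
As an illustration only (the abstract does not spell out the labeling scheme or the feature set), decompounding is often framed as per-character labeling: each character is tagged B (begins a word) or I (continues one), and a CRF is trained to predict those tags. The Python sketch below shows that framing. The names `to_labels`, `from_labels`, and `char_features` are hypothetical, and the feature template is an assumption, not the authors' design.

```python
def to_labels(words):
    """Turn a segmented compound into per-character B/I labels."""
    labels = []
    for word in words:
        if not word:
            continue  # ignore empty segments
        labels.append("B")                    # first character opens a word
        labels.extend("I" * (len(word) - 1))  # the rest continue it
    return labels


def from_labels(chars, labels):
    """Rebuild words from a character sequence and its B/I labels."""
    words = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


def char_features(chars, i):
    """Assumed feature template: the character and its two neighbours."""
    return {
        "cur": chars[i],
        "prev": chars[i - 1] if i > 0 else "<s>",
        "next": chars[i + 1] if i + 1 < len(chars) else "</s>",
    }


# Example: the compound "データベース" split as "データ" + "ベース".
gold = ["データ", "ベース"]
chars = list("".join(gold))
labels = to_labels(gold)
features = [char_features(chars, i) for i in range(len(chars))]
restored = from_labels(chars, labels)
```

A trained CRF would take the place of `to_labels` at prediction time, producing the labels from `features`; `from_labels` then turns its output back into words.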

Similar resources

Painless Semi-Supervised Morphological Segmentation using Conditional Random Fields

We discuss data-driven morphological segmentation, in which word forms are segmented into morphs, that is the surface forms of morphemes. We extend a recent segmentation approach based on conditional random fields from purely supervised to semi-supervised learning by exploiting available unsupervised segmentation techniques. We integrate the unsupervised techniques into the conditional random f...

Character Categorization via Latent Dirichlet Allocation for Kana Sequence Segmentation with Conditional Random Fields

We propose an efficient Kana sequence segmentation as a component of faster and easier interfaces for e-learning systems. We assign categories to Kana characters via latent Dirichlet allocation (LDA) and use the categories to compose additional features for conditional random fields (CRF). We compare the categories our method gives and those manually prepared by their efficiency in Kana sequenc...

Studies for Segmentation of Historical Texts: Sentences or Chunks?

We present some experiments on text segmentation for German texts aimed at developing a method of segmenting historical texts. Since such texts have no (consistent) punctuation, we use a machine learning approach to label tokens with their relative positions in text segments using Conditional Random Fields. We compare the performance of this approach on the task of segmenting of text into sente...

Efficient Structured Prediction with Latent Variables for General Graphical Models

In this paper we propose a unified framework for structured prediction with latent variables which includes hidden conditional random fields and latent structured support vector machines as special cases. We describe a local entropy approximation for this general formulation using duality, and derive an efficient message passing algorithm that is guaranteed to converge. We demonstrate its effec...

Broadcast News Story Segmentation Using Conditional Random Fields and Multimodal Features

This paper proposes to integrate multi-modal features using conditional random fields (CRF) for broadcast news story segmentation. We study story boundary cues from lexical, audio and video modalities, where lexical features consist of lexical similarity, chain strength and overall cohesiveness, acoustic features involve pause duration, pitch, speaker change and audio event type, and visual fea...

Publication date: 2012