A Self-Organizing Japanese Word Segmenter using Heuristic Word Identification and Re-estimation
نویسنده
چکیده
We present a self-organized method to build a stochastic Japanese word segmenter from a small number of basic words and a large amount of unsegmented training text. It consists of a word-based statistical language model, an initial estimation procedure, and a re-estimation procedure. Initial word frequencies are estimated by counting all possible longest match strings between the training text and the word list. The initial word list is au~nented by identifying words in the training text using a heuristic rule based on character type. The word-based language model is then re-estimated to filter out inappropriate word hypotheses generated by the initial word identification. When the word segmeuter is trained on 3.9M character texts and 1719 initial words, its word segmentation accuracy is 86.3% recall and 82.5% precision. We find that the combination of heuristic word identi~cation and re-estimation is so effective that the initial word list need not be large. 1 I n t r o d u c t i o n Word segmentation is an important problem for Japanese because word boundaries are not marked in its writing system. Other Asian languages such as Chinese and Thai have the same problem. Any Japanese NLP application requ/res word segmentation as the first stage because there are phonological and semantic units whose pronunciation and meaning is not trivially derivable from that of the individual characters. Once word segmentation is done, all established techniques can be exploited to build practically important applications such as spelling correction [Nagata, 1996] and text retrieval [Nie and Brisebois, 1996] In a sense, Japanese word segmentation is a solved problem if (and only if) we have plenty of segmented training text. Around 95% word segmentation accuracy is reported by using a word-based language model and the Viterbi-like dynamic programi-g procedure [Nagata, 1994, Takeuchi and Matsumoto, 1995, Yamamoto, 1996]. However, manually segmented corpora are not always available in a particular target domain and manual segmentation is very expensive. The goal of our research is unsupervised learning of Japanese word segmentation. That is, to build a Japanese word segmenter from a list of initial words and unsegmented training text. Today, it is easy to obtain a 10K-100K word list from either commercial or public domain on-line Japanese dictionaries. Gigabytes of Japanese text are readily available from newspapers, patents, HTML documents, etc.. Few works have examined unsupervised word segmentation in Japanese. Both [Yamamoto, 1996] and [Takeuchi and Matsumoto, 1995] built a word-based language model from unsegmented text
منابع مشابه
Automatic Extraction of New Words from Japanese Texts using Generalized Forward-Backward Search
We present a novel new word extraction method from Japanese texts based on expected word frequencies. First, we compute expected word frequencies from Japanese texts using a robust stochastic N-best word segmenter. We then extract new words by filtering out erroneous word hypotheses whose expected word frequencies are lower than the predefined threshold. The method is derived from an approximat...
متن کاملHolistic Farsi handwritten word recognition using gradient features
In this paper we address the issue of recognizing Farsi handwritten words. Two types of gradient features are extracted from a sliding vertical stripe which sweeps across a word image. These are directional and intensity gradient features. The feature vector extracted from each stripe is then coded using the Self Organizing Map (SOM). In this method each word is modeled using the discrete Hidde...
متن کاملTowards a Hybrid Model for Chinese Word Segmentation
This paper describes a hybrid Chinese word segmenter that is being developed as part of a larger Chinese unknown word resolution system. The segmenter consists of two components: a tagging component that uses the transformation-based learning algorithm to tag each character with its position in a word, and a merging component that transforms a tagged character sequence into a word-segmented sen...
متن کاملExploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation
Unknown words and word segmentation granularity are two main problems in Chinese word segmentation for ChineseJapanese Machine Translation (MT). In this paper, we propose an approach of exploiting common Chinese characters shared between Chinese and Japanese in Chinese word segmentation optimization for MT aiming to solve these problems. We augment the system dictionary of a Chinese segmenter b...
متن کاملAdapting Conventional Chinese Word Segmenter for Segmenting Micro-blog Text: Combining Rule-based and Statistic-based Approaches
We describe two adaptation strategies which are used in our word segmentation system in participating the Microblog word segmentation bake-off: Domain invariant information is extracted from the in-domain unlabelled corpus, and is incorporated as supplementary features to conventional word segmenter based on Conditional Random Field (CRF), we call it statistic-based adaptation. Some heuristic r...
متن کامل