Experiments in Unsupervised Entropy-Based Corpus Segmentation
Abstract
This paper presents an entropy-based approach to segmenting a corpus into words when no additional information about the corpus or the language is available, and no other resources such as a lexicon or grammar can be used. To segment the corpus, the algorithm searches for separators without knowing a priori which symbols constitute them. Good results can be obtained with corpora containing "clearly perceptible" separators such as blank or new-line.
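The abstract does not spell out how separators are scored, so the following is only a minimal sketch of one plausible reading, not the paper's method. It assumes that a separator symbol is preceded and followed by many different symbols, so the entropy of both its left and right neighbour distributions is high. The function names (`shannon_entropy`, `rank_separator_candidates`, `segment`) and the choice of taking the minimum of the two entropies are hypothetical, chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a Counter of symbol frequencies."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_separator_candidates(corpus):
    """Rank symbols by the entropy of their left and right neighbours.

    Assumption (not from the paper): a separator is preceded and followed
    by many different symbols, so both context entropies are high.
    """
    right = defaultdict(Counter)  # symbol -> counts of the symbol that follows it
    left = defaultdict(Counter)   # symbol -> counts of the symbol that precedes it
    for prev, nxt in zip(corpus, corpus[1:]):
        right[prev][nxt] += 1
        left[nxt][prev] += 1

    scores = {}
    for sym in set(corpus):
        # Skip symbols seen only at the very start or end of the corpus.
        if left[sym] and right[sym]:
            scores[sym] = min(shannon_entropy(left[sym]),
                              shannon_entropy(right[sym]))
    # Highest-scoring symbols are the strongest separator candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def segment(corpus, separators):
    """Split the corpus at any symbol in `separators`, dropping empty tokens."""
    tokens, current = [], []
    for ch in corpus:
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

In use, one would call `rank_separator_candidates(text)`, keep the symbols whose score exceeds some threshold, and pass them to `segment`. How to choose that threshold is left open here; on small or unusual corpora a frequent letter can score as high as a true separator, which is in line with the abstract's remark that good results depend on clearly perceptible separators.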
Similar resources
From “Manbearpig” to “Man bear pig”: An Evaluation of Unsupervised Word Segmentation Algorithms
In this paper, we explore diverse methods of unsupervised morphemic segmentation. We test Successor and Predecessor Count algorithms, Entropy algorithms, and Affix Discovery algorithms. The paper examines word stemming based on these algorithms, and the influence of training corpus size on segmentation accuracy. We propose variations on these algorithms to improve overall efficacy. While these ...
A Maximum Entropy Tagger with Unsupervised Hidden Markov Models
We describe a new tagging model where the states of a hidden Markov model (HMM) estimated by unsupervised learning are incorporated as the features in a maximum entropy model. Our method for exploiting unsupervised learning of a probabilistic model can reduce the cost of building taggers with no dictionary and a small annotated corpus. Experimental results on English POS tagging and Japanese wo...
An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluat...
Unsupervised Texture Image Segmentation Using MRFEM Framework
Texture image analysis is one of the most important areas of image processing in medical sciences and industry. To date, different approaches have been proposed for the segmentation of texture images. In this paper, we propose unsupervised texture image segmentation based on a Markov Random Field (MRF) model. First, we used Gabor filter with different parameters’ (frequency, orientatio...