Unsupervised Learning of Lexical Information for Language Processing Systems
Abstract
Natural language processing systems such as speech recognition and machine translation conventionally treat words as their fundamental unit of processing. However, in many cases the definition of a “word” is not obvious, such as in languages without explicit white space delimiters, in agglutinative languages, or in streams of continuous speech. This thesis attempts to answer the question of which lexical units should be used for these applications by acquiring them through unsupervised learning. This has the potential to improve accuracy, as the learner can choose lexical units flexibly, using longer units when justified by the data and falling back to shorter units when faced with data sparsity. In addition, this approach allows us to re-examine our assumptions about which units we should be using to recognize speech or translate text, which will provide insights to the designers of supervised systems. Furthermore, as the methods require no annotated data, they have the potential to remove the annotation bottleneck, allowing for the processing of under-resourced languages for which no human annotations or analysis tools are available.
Chapter 1 provides an overview of the general topics of word segmentation and morphological analysis, as well as previous research on learning lexical units from raw text. It goes on to discuss the problems with the existing approaches, and lays out the general motivation for and techniques used in the work presented in the following chapters.
Chapter 2 describes the overall learning framework adopted in this thesis, which consists of models created using nonparametric Bayesian statistics and inference procedures for those models based on Gibbs sampling. Nonparametric Bayesian statistics are useful because they allow the appropriate balance between model complexity and expressive power to be discovered automatically.
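To make the trade-off between model complexity and expressive power concrete, the following is a minimal sketch of the predictive word probability under a Dirichlet process (Chinese restaurant process) unigram model, a standard nonparametric Bayesian construction. It is not the model of this thesis: the function names, the character-level base measure, the toy counts, and the values of `alpha`, `n_chars`, and `p_stop` are all assumptions chosen for illustration.

```python
from collections import Counter


def base_prob(word, n_chars=26, p_stop=0.5):
    """Hypothetical base measure: geometric length, uniform characters.

    Longer strings get exponentially smaller prior probability, which
    penalizes growing the lexicon with long, rarely reused units.
    """
    length = len(word)
    length_prob = p_stop * (1.0 - p_stop) ** (length - 1)
    return length_prob * (1.0 / n_chars) ** length


def dp_word_prob(word, counts, total, alpha=1.0):
    """Predictive probability of `word` given the words generated so far.

    Seen words are favored in proportion to their count (reuse), while
    unseen words receive only the alpha-weighted base probability.
    """
    return (counts[word] + alpha * base_prob(word)) / (total + alpha)


# Toy lexicon counts (assumed data, for illustration only).
counts = Counter(["the", "the", "dog", "the", "cat"])
total = sum(counts.values())

p_seen = dp_word_prob("the", counts, total)  # frequently reused unit
p_new = dp_word_prob("cow", counts, total)   # unit not yet in the lexicon
print(p_seen, p_new)
```

In this sketch a new lexical unit is never impossible, but it is only adopted when the data justify paying its base-measure cost, which is one way the balance described above can arise without fixing the lexicon size in advance.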
We adopt Gibbs sampling as an inference procedure because it is a principled yet flexible learning method that can be used with a wide variety of models. Within this framework, this thesis presents models for lexical learning for speech recognition and machine translation. With regard to speech recognition, Chapter 3 presents a method that
Similar resources
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Learning Constructions of Natural Language: Statistical Models and Evaluations
Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi. Author: Sami Virpioja. Name of the doctoral dissertation: Learning Constructions of Natural Language: Statistical Models and Evaluations. Publisher: School of Science. Unit: Department of Information and Computer Science. Series: Aalto University publication series DOCTORAL DISSERTATIONS 158/2012. Field of research: Computer and Information Sci...
Unsupervised Learning of Word Boundary with Description Length Gain
This paper presents an unsupervised approach to lexical acquisition with the goodness measure description length gain (DLG), formulated following classic information theory within the minimum description length (MDL) paradigm. The learning algorithm seeks an optimal segmentation of an utterance that maximises the description length gain from the individual segments. The resultant segments sh...
Using Morphology And Syntax Together In Unsupervised Learning
Unsupervised learning of grammar is a problem that can be important in many areas, ranging from text preprocessing for information retrieval and classification to machine translation. We describe an MDL-based grammar of a language that contains morphology and lexical categories. We use an unsupervised learner of morphology to bootstrap the acquisition of lexical categories and use these two lear...
Iterated learning framework for unsupervised part-of-speech induction
Computational approaches to linguistic analysis have been used for more than half a century. The main tools come from the field of Natural Language Processing (NLP) and are based on rule-based or corpora-based (supervised) methods. Despite the undeniable success of supervised learning methods in NLP, they have two main drawbacks: on the practical side, it is expensive to produce the manual anno...