Joint Bayesian Morphology learning for Dravidian languages
نویسندگان
چکیده
In this paper a methodology for learning the complex agglutinative morphology of some Indian languages using Adaptor Grammars and morphology rules is presented. Adaptor grammars are a compositional Bayesian framework for grammatical inference, where we define a morphological grammar for agglutinative languages and morphological boundaries are inferred from a plain text corpus. Once morphological segmentations are produce, regular expressions for sandhi rules and orthography are applied to achieve the final segmentation. We test our algorithm in the case of two complex languages from the Dravidian family. The same morphological model and results are evaluated comparing to other state-of-the art unsupervised morphology learning systems.
منابع مشابه
Morphology Based POS Tagging on Telugu
In this paper, we present a morphological based automatic tagging for Telugu without requiring any machine learning algorithm or training data. We believe that inflectional and agglutinating languages, the critical information required for tagging comes more from word internal structure than from the context and we show how a well designed morphological analyzer can assign correct tags and disa...
متن کاملUnity in Diversity: A Unified Parsing Strategy for Major Indian Languages
This paper presents our work to apply non linear neural network for parsing five r esource p oor I ndian L anguages belonging to two major language families Indo-Aryan and Dravidian. Bengali and Marathi are Indo-Aryan languages whereas Kannada, Telugu and Malayalam belong to the Dravidian family. While little work has been done previously on Bengali and Telugu linear transition-based parsing, w...
متن کاملMachine Translation-Indian Regional Languages
Natural Language Processing is an emerging field of Machine Learning. NLP systems deal with making use of machines to translate text or speech. MT system can be classified according to approaches being followed for translation. In this paper, existing MT systems according to the regional languages of India are being analyzed. Key-Words: Machine Translation (MT), Natural Language Processing (NLP...
متن کاملSignificance of an Accurate Sandhi-Splitter in Shallow Parsing of Dravidian Languages
This paper evaluates the challenges involved in shallow parsing of Dravidian languages which are highly agglutinative and morphologically rich. Text processing tasks in these languages are not trivial because multiple words concatenate to form a single string with morpho-phonemic changes at the point of concatenation. This phenomenon known as Sandhi, in turn complicates the individual word iden...
متن کاملStatistical Sandhi Splitter for Agglutinative Languages
Sandhi splitting is a primary and an important step for any natural language processing (NLP) application for languages which have agglutinative morphology. This paper presents a statistical approach to build a sandhi splitter for agglutinative languages. The input to the model is a valid string in the language and the output is a split of that string into meaningful word/s. The approach adopte...
متن کامل