Morphological Induction Through Linguistic Productivity

Author

  • Sarah A. Goodman
Abstract

The induction program we have crafted relies primarily on the linguistic notion of productivity to find affixes in unmarked text, without the aid of prior grammatical knowledge. The algorithm unfolds in two stages. It first finds seed affixes, including infixes and circumfixes, by assaying the character of all possible internal partitions of all words in a small corpus no larger than 3,000 tokens. It then selects a small subset of these seed affixes by examining the distribution patterns of the roots they attach to, as demonstrated in a possibly larger second training file. Specifically, it hypothesizes that valid roots take partially overlapping affix sets, and develops this conjecture into agendas for both feature-set generation and binary clustering. It collects a feature set for each candidate by what we term affix-chaining: delineating (and storing) a path of affixes joined, with thresholding caveats, via the roots they share. After clustering the resulting sets, the program yields two affix groups, an ostensibly valid collection and a putatively spurious one. It refines the membership of the former by again examining the quality of shared root distributions across affixes. This second half of the program is, furthermore, iterative. The iteration is again grounded in productivity: we reason that, should a root take one affix, it most likely takes more. The code therefore seeds a subsequent iteration of training with affixes that associate with roots learned during the current pass. If, for example, it recognizes view on the first pass, and viewership occurs in the second training file, the program will evaluate -ership, along with its mate -er, via clustering and root connectivity on the second pass. The results of this method are thus far mixed and vary with training file size. Time constraints imposed by shortcomings in the algorithm's code have thus far prevented us from fully training on a large file.
For Morpho Challenge 2008, we not only trained on just 1-30% of the offered text, thereby saddling the stemmer with a number of out-of-vocabulary items, but also divided that text into smaller parts, thereby, as the results show, omitting valuable information about the true range of affix distributions.
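
The affix-chaining idea described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: it assumes suffixes only (the abstract also covers prefixes, infixes and circumfixes), and the function names, the `min_root` length limit and the `min_shared` threshold are all hypothetical stand-ins for the unspecified "thresholding caveats".

```python
from collections import deque


def suffix_splits(word, min_root=2):
    """Yield every (root, suffix) split whose root has at least min_root characters.

    Assumption: only suffix partitions are enumerated in this sketch.
    """
    for i in range(min_root, len(word)):
        yield word[:i], word[i:]


def roots_by_suffix(words, min_root=2):
    """Map each candidate suffix to the set of roots it is observed with."""
    table = {}
    for word in set(words):
        for root, suffix in suffix_splits(word, min_root):
            table.setdefault(suffix, set()).add(root)
    return table


def affix_chain(start, table, min_shared=2):
    """Collect the suffixes reachable from `start` through shared roots.

    Two suffixes are linked when they share at least min_shared roots
    (a hypothetical threshold). The walk is breadth-first over those links.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        current_roots = table.get(current, set())
        for other, other_roots in table.items():
            if other not in seen and len(current_roots & other_roots) >= min_shared:
                seen.add(other)
                queue.append(other)
    return seen
```

On the toy word list viewer, viewers, viewing, viewed, walker, walkers, walking, walked, chaining from -er reaches -ers, -ed and -ing, because all four attach to both view and walk, while -s is left out because its roots (viewer, walker) are not shared with them. The resulting set is the kind of feature set that a later clustering step could consume.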


Similar resources

Combining Part of Speech Induction and Morphological Induction

Linguistic information is useful in natural language processing, information retrieval and a multitude of sub-tasks involving language analysis. Two types of linguistic information in all languages are part of speech and morphology. Part of speech information reflects syntactic structure and can assist in tasks such as speech recognition, machine translation and word sense disambiguation. Morph...


On Bootstrapping of Linguistic Features for Bootstrapping Grammars

We discuss a cue-based grammar induction approach based on a parallel theory of grammar. Our model is based on the hypotheses of interdependency between linguistic levels (of representation) and inductability of specific structural properties at a particular level, with consequences for the induction of structural properties at other linguistic levels. We present the results of three different ...


Productivity and Reuse in Language

We present a Bayesian model of the mirror image problems of linguistic productivity and reuse. The model, known as Fragment Grammar, is evaluated against several morphological datasets; its performance is compared to competing theoretical accounts including full–parsing, full–listing, and exemplar–based models. The model is able to learn the correct patterns of productivity and reuse for two ve...


Automatic Rule Induction in Arabic to English Machine Translation Framework

This chapter addresses the exploitation of a supervised machine learning technique to automatically induce Arabic-to-English transfer rules from chunks of parallel aligned linguistic resources. The induced structural transfer rules encode the linguistic translation knowledge for converting an Arabic syntactic structure into a target English syntactic structure. These rules are going to be an in...


Data Mining as a Method for Linguistic Analysis: Dutch Diminutives*

We propose to use data mining techniques (inductive techniques for the automatic acquisition of comprehensible knowledge from data) as a method in linguistic analysis. In the past, such techniques have mainly been used in linguistic engineering applications to solve knowledge acquisition bottlenecks. In this paper we show that they can also assist in linguistic theory formation by providing a n...




Publication date: 2008