Evaluation of automatic break insertion for an agglutinative and inflected language
نویسندگان
چکیده
This paper presents the evaluation of automatic break insertion for standard Basque. Basque is an agglutinative and inflected language and POS features, widely used for other languages, are not enough to accurately predict the insertion of breaks in the text. Other morpho-syntactic features, like grammatical case and information about syntagms have also been taken into account. With a textual corpus specially gathered for this study where the sentence internal punctuation marks have been removed, CARTs have been used to predict break locations. After applying parameter selection to the whole morpho-syntactic feature set, the best features were employed to build two CARTs, one that gives the same importance to deletion and insertion errors, T1, and another one, T2, that tries to minimise insertion errors. The objective evaluation of the break insertion algorithms gives a κ statistic of 0.518 and an F of 0.757 for T1 tree. The algorithms have also been subjectively evaluated and although T1 had better objective measures, the number of serious errors made by this tree is larger than the number of serious errors made by T2.
منابع مشابه
مدل دو مرحله ای شکاف- گلچین برای نمایه سازی خودکار متون فارسی
Purpose: Each language has its own problems. This leads to consider appropriate models for automatic indexing of every language. These models should concern the exhaustificity and specificity of indexing. This paper aims at introduction and evaluation of a model which is suited for Persian automatic indexing. This model suggests to break the text into the particles of candidate terms and to c...
متن کاملSpelling Correction: from Two-Level Morphology to Open Source
Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of languages and there are two-level based descriptions for very different languages. After doing the morphological description for a language, it is easy to develop a spelling checker/corrector for this language. However, what happens if we want to use...
متن کاملA Sequence Labeling Approach to Morphological Analyzer for Tamil Language
Morphological analysis is the basic process for any Natural Language Processing task. Morphology is the study of internal structure of the word. Morphological analysis retrieves the grammatical features and properties of a morphologically inflected word. Capturing the agglutinative structure of Tamil words by an automatic system is a challenging job. Generally rule based approaches are used for...
متن کاملMulti-granularity Word Alignment and Decoding for Agglutinative Language Translation
Lexical sparsity problem ismuchmore serious for agglutinative language translation due to the multitude of inflected variants of lexicons. In this paper, we propose a novel optimization strategy to ease spareness bymulti-granularity word alignment and translation for agglutinative language. Multiple alignment results are combined to catch the complementary information for alignments, and rules ...
متن کاملThe Production of Nominal and Verbal Inflection in an Agglutinative Language: Evidence from Hungarian
The contrast between regular and irregular inflectional morphology has been useful in investigating the functional and neural architecture of language. However, most studies have examined the regular/irregular distinction in non-agglutinative Indo-European languages (primarily English) with relatively simple morphology. Additionally, the majority of research has focused on verbal rather than no...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Speech Communication
دوره 50 شماره
صفحات -
تاریخ انتشار 2008