Evaluation of automatic break insertion for an agglutinative and inflected language

Authors

Eva Navas

Inma Hernáez

Iñaki Sainz

Abstract

This paper presents the evaluation of automatic break insertion for standard Basque. Basque is an agglutinative and inflected language, and POS features, although widely used for other languages, are not enough to predict the insertion of breaks in text accurately. Other morpho-syntactic features, such as grammatical case and information about syntagms, have therefore also been taken into account. CARTs were used to predict break locations on a textual corpus specially gathered for this study, from which the sentence-internal punctuation marks were removed. After parameter selection over the whole morpho-syntactic feature set, the best features were used to build two CARTs: T1, which gives the same importance to deletion and insertion errors, and T2, which tries to minimise insertion errors. The objective evaluation of the break insertion algorithms gives a κ statistic of 0.518 and an F of 0.757 for the T1 tree. The algorithms have also been evaluated subjectively: although T1 had better objective measures, it made more serious errors than T2.
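As an illustration of the two objective measures quoted in the abstract, the following minimal sketch computes them from break/no-break counts. It assumes that F is the usual harmonic mean of precision and recall and that κ is Cohen's kappa between predicted and reference breaks; the abstract does not spell out either definition. The function name `break_metrics` and the counts are hypothetical and are not taken from the paper.

```python
def break_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f_measure, kappa) for break / no-break decisions.

    tp: word junctures where a break was predicted and the reference has one
    fp: insertion errors (break predicted, none in the reference)
    fn: deletion errors (reference break that was not predicted)
    tn: junctures correctly left without a break
    Assumes all denominators are non-zero.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Assumed definition: balanced F (harmonic mean of precision and recall).
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / total
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return precision, recall, f_measure, kappa


# Hypothetical counts for illustration only, not results from the paper.
precision, recall, f_measure, kappa = break_metrics(tp=40, fp=10, fn=20, tn=130)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f} kappa={kappa:.3f}")
```

Because most word junctures carry no break, raw accuracy is dominated by the no-break class. A chance-corrected statistic such as κ is therefore often reported alongside F in this kind of task.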


Similar resources

A Two-Stage Split-and-Select Model for Automatic Indexing of Persian Texts

Purpose: Each language has its own problems, which makes it necessary to consider appropriate models for the automatic indexing of each language. These models should address the exhaustivity and specificity of indexing. This paper aims to introduce and evaluate a model suited to automatic indexing of Persian. The model proposes to break the text into the particles of candidate terms and to c...


Spelling Correction: from Two-Level Morphology to Open Source

Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of language, and there are two-level based descriptions for very different languages. Once the morphological description of a language is available, it is easy to develop a spelling checker/corrector for that language. However, what happens if we want to use...


A Sequence Labeling Approach to Morphological Analyzer for Tamil Language

Morphological analysis is the basic process for any Natural Language Processing task. Morphology is the study of internal structure of the word. Morphological analysis retrieves the grammatical features and properties of a morphologically inflected word. Capturing the agglutinative structure of Tamil words by an automatic system is a challenging job. Generally rule based approaches are used for...


Multi-granularity Word Alignment and Decoding for Agglutinative Language Translation

The lexical sparsity problem is much more serious for agglutinative language translation due to the multitude of inflected variants of lexicons. In this paper, we propose a novel optimization strategy to ease sparseness by multi-granularity word alignment and translation for agglutinative language. Multiple alignment results are combined to catch the complementary information for alignments, and rules ...


The Production of Nominal and Verbal Inflection in an Agglutinative Language: Evidence from Hungarian

The contrast between regular and irregular inflectional morphology has been useful in investigating the functional and neural architecture of language. However, most studies have examined the regular/irregular distinction in non-agglutinative Indo-European languages (primarily English) with relatively simple morphology. Additionally, the majority of research has focused on verbal rather than no...



Journal:

Speech Communication

Volume 50

Publication year: 2008
