Discriminating similar languages with token-based backoff

نویسندگان

  • Tommi Jauhiainen
  • Heidi Jauhiainen
  • Krister Lindén
چکیده

In this paper we describe the language identification system built within the Finno-Ugric Languages and the Internet project for the Discriminating between Similar Languages (DSL) shared task in LT4VarDial workshop at RANLP-2015. The system reached fourth place in normal closed submissions (94.7% accuracy) and second place in closed submissions with the named entities blinded (93.0% accuracy).

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Evaluating HeLI with Non-Linear Mappings

In this paper we describe the non-linear mappings we used with the Helsinki language identification method, HeLI, in the 4th edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2017 workshop. Our SUKI team participated in the closed track together with 10 other teams. Our system reached the 7th position in the track. We describe ...

متن کامل

Using Maximum Entropy Models to Discriminate between Similar Languages and Varieties

DSLRAE is a hierarchical classifier for similar written languages and varieties based on maximum-entropy (maxent) classifiers. In the first level, the text is classified into a language group using a simple token-based maxent classifier. At the second level, a group-specific maxent classifier is applied to classify the text as one of the languages or varieties within the previously identified g...

متن کامل

HeLI, a Word-Based Backoff Method for Language Identification

In this paper we describe the Helsinki language identification method, HeLI, and the resources we created for and used in the 3rd edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2016 workshop. The shared task comprised of a total of 8 tracks, of which we participated in 7. The shared task had a record number of participants, ...

متن کامل

An Unsupervised Morphological Criterion for Discriminating Similar Languages

In this study conducted on the occasion of the Discriminating between Similar Languages shared task, I introduce an additional decision factor focusing on the token and subtoken level. The motivation behind this submission is to test whether a morphologically-informed criterion can add linguistically relevant information to global categorization and thus improve performance. The contributions o...

متن کامل

Phrase-Based Backoff Models for Machine Translation of Highly Inflected Languages

We propose a backoff model for phrasebased machine translation that translates unseen word forms in foreign-language text by hierarchical morphological abstractions at the word and the phrase level. The model is evaluated on the Europarl corpus for German-English and FinnishEnglish translation and shows improvements over state-of-the-art phrase-based models.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015