Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System

Authors

  • Gokul Chittaranjan
  • Yogarshi Vyas
  • Kalika Bali
  • Monojit Choudhury
Abstract

We describe a CRF-based system for word-level language identification of code-mixed text. Our method uses lexical, contextual, character n-gram, and special character features, and can therefore be replicated easily across languages. Its performance is benchmarked against the test sets provided by the shared task on code-mixing (Solorio et al., 2014) for four language pairs, namely English-Spanish (En-Es), English-Nepali (En-Ne), English-Mandarin (En-Cn), and Standard Arabic-Arabic Dialects (Ar-Ar). The experimental results show consistent performance across the language pairs.
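The four feature families named in the abstract can be illustrated with a small per-token feature extractor. This is a minimal sketch, not the authors' feature templates, which the abstract does not specify: the function names, the n-gram range, the special-character set, and the optional `lexicon` argument are all assumptions made for illustration. The output (one feature dictionary per token) is the kind of input that a CRF toolkit accepting per-token feature dictionaries could be trained on.

```python
import re


def char_ngrams(word, n_min=1, n_max=3):
    """Return character n-grams of a lowercased word with boundary markers.

    The range 1..3 is an illustrative assumption, not taken from the paper.
    """
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def token_features(tokens, i, lexicon=None):
    """Build a feature dict for tokens[i] (hypothetical feature set)."""
    word = tokens[i]
    feats = {
        # Lexical features: the word itself and membership in a word list.
        "lower": word.lower(),
        "in_lexicon": bool(lexicon) and word.lower() in lexicon,
        # Special-character features (assumed character set).
        "has_digit": any(c.isdigit() for c in word),
        "is_capitalized": word[:1].isupper(),
        "has_special": bool(re.search(r"[@#_/:]", word)),
        # Contextual features: neighbouring tokens, with sentence boundaries.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Character n-gram features, one binary indicator per n-gram.
    for gram in char_ngrams(word):
        feats[f"ng={gram}"] = True
    return feats


def sentence_features(tokens, lexicon=None):
    """Return one feature dict per token, in sentence order."""
    return [token_features(tokens, i, lexicon) for i in range(len(tokens))]
```

For example, `sentence_features(["I", "love", "tacos", "#yum"], lexicon={"love"})` yields four dictionaries, each pairing the token's own cues with those of its neighbours. Because none of these features depend on a particular language, the same extractor could in principle be reused across language pairs, which is the replicability argument the abstract makes.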


Similar resources

Adaptive Voting in Multiple Classifier Systems for Word Level Language Identification

In social media communication, code switching has become quite a common phenomenon, especially for multilingual speakers. Automatic language identification becomes both a necessary and a challenging task in such an environment. In this work, we describe a CRF-based system with a voting approach for code-mixed query word labeling at word level as part of our participation in the shared task on Mixed ...


The George Washington University System for the Code-Switching Workshop Shared Task 2016

We describe our work in the EMNLP 2016 second code-switching shared task: a generic, language-independent framework for linguistic code switch point detection (LCSPD). The system uses character-level 5-grams and word-level unigram language models to train a conditional random fields (CRF) model for classifying input words into various languages. We participated in the Modern Standard Arabic (MS...


Language Identification in Code-Switching Scenario

This paper describes a CRF-based token-level language identification system entry to the Language Identification in Code-Switched (CS) Data task of CodeSwitch 2014. Our system hinges on using conditional posterior probabilities for the individual codes (words) in code-switched data to solve the language identification task. We also experiment with other linguistically motivated language specific as ...


The Tel Aviv University System for the Code-Switching Workshop Shared Task

We describe our entry in the EMNLP 2014 code-switching shared task. Our system is based on a sequential classifier, trained on the shared training set using various character- and word-level features, some calculated using large monolingual corpora. We participated in the Twitter-genre Spanish-English track, obtaining an accuracy of 0.868 when measured on the tweet level and 0.858 on the word l...


The CMU Submission for the Shared Task on Language Identification in Code-Switched Data

We describe the CMU submission for the 2014 shared task on language identification in code-switched data. We participated in all four language pairs: Spanish–English, Mandarin–English, Nepali–English, and Modern Standard Arabic–Arabic dialects. After describing our CRF-based baseline system, we discuss three extensions for learning from unlabeled data: semi-supervised learning, word embeddings,...


Publication date: 2014