Word Level Language Identification in Online Multilingual Communication
Abstract
Multilingual speakers switch between languages in both online and spoken communication. Analyzing large-scale multilingual data therefore requires automatic language identification at the word level. In our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. We then incorporate context to improve performance, achieving an accuracy of 98%. Besides word-level accuracy, we use two new metrics to evaluate this task.
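The two-step idea in the abstract (tag each word in isolation, then revise using context) can be illustrated with a minimal sketch. This is not the authors' system: the character bigram model with add-one smoothing, the neighbour-agreement rule, and all names (`CharNgramModel`, `tag_word`, `tag_with_context`) are assumptions chosen for illustration, and the abstract does not specify which models or context method were used.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Toy character n-gram model with add-one smoothing (illustrative assumption)."""

    def __init__(self, words, n=2):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def score(self, word):
        """Average log-probability per n-gram, so word length does not dominate."""
        grams = char_ngrams(word, self.n)
        logp = sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in grams
        )
        return logp / len(grams)


def tag_word(word, dictionaries, models):
    """Step 1: tag one word in isolation.

    Returns (language, confident). A word found in exactly one dictionary is
    treated as confident; otherwise the best-scoring n-gram model decides.
    """
    hits = [lang for lang, vocab in dictionaries.items() if word.lower() in vocab]
    if len(hits) == 1:
        return hits[0], True
    best = max(models, key=lambda lang: models[lang].score(word))
    return best, False


def tag_with_context(tokens, dictionaries, models):
    """Step 2: relabel uncertain words when their confident neighbours agree."""
    first_pass = [tag_word(t, dictionaries, models) for t in tokens]
    labels = [lang for lang, _ in first_pass]
    for i, (_, confident) in enumerate(first_pass):
        if confident:
            continue
        neighbours = [
            first_pass[j][0]
            for j in (i - 1, i + 1)
            if 0 <= j < len(tokens) and first_pass[j][1]
        ]
        if neighbours and all(lang == neighbours[0] for lang in neighbours):
            labels[i] = neighbours[0]
    return labels
```

With toy English and Dutch word lists that both contain "is", the word is ambiguous in isolation, but `tag_with_context(["het", "huis", "is", "groot"], ...)` labels it Dutch because both confident neighbours are Dutch.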
Similar resources
Text analysis and language identification for polyglot text-to-speech synthesis
In multilingual countries, text-to-speech synthesis systems often have to deal with texts containing inclusions of multiple other languages in form of phrases, words, or even parts of words. In such multilingual cultural settings, listeners expect a high-quality text-to-speech synthesis system to read such texts in a way that the origin of the inclusions is heard, i.e., with correct language-sp...
Leveraging Data-Driven Methods in Word-Level Language Identification for a Multilingual Alpine Heritage Corpus
This paper presents a data-driven, simple cluster-and-label approach using optimized count-based methods for word-level language identification for a large domain-specific multilingual diachronic corpus of periodicals published at least yearly between 1864 and 2014 in Switzerland. Our system requires no annotated data or training, only minimal human effort in evaluating and labeling 50 clusters...
Code Mixing: A Challenge for Language Identification in the Language of Social Media
In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, wh...
Fast and unsupervised methods for multilingual cognate clustering
In this paper we explore the use of unsupervised methods for detecting cognates in multilingual word lists. We use online EM to train sound segment similarity weights for computing similarity between two words. We tested our online systems on sixteen geographically spread language groups of the world and show that the Online PMI (Pointwise Mutual Information) system outperforms an HMM ...
Adaptive Voting in Multiple Classifier Systems for Word Level Language Identification
In social media communication, code switching has become quite a common phenomenon especially for multilingual speakers. Automatic language identification becomes both a necessary and challenging task in such an environment. In this work, we describe a CRF based system with voting approach for code-mixed query word labeling at word-level as part of our participation in the shared task on Mixed ...