Automatic Language Identification from Written Texts – An Overview
Abstract
Language Identification is the task of automatically identifying the language(s) in which the content of a document (web page, text document) is written. Due to the widespread use of the internet, language identification has become an important preprocessing step for a number of applications, such as machine translation, Part-of-Speech tagging, linguistic corpus creation, support for low-density languages, accessibility of social media or user-generated content, search engines, and information extraction, in addition to the processing of multilingual documents. In a multilingual country like India, Language Identification has wide scope to help bridge the digital divide between users of different languages. This paper presents a brief overview of the challenges involved in automatic language identification, existing methodologies, and some of the tools available for language identification.
Similar resources
Automatic identification of language varieties: The case of Portuguese
Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on n-gram character models to identify the correct language of texts, with good results seen for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classifica...
Automatic Identification of Learners' Language Background Based on Their Writing in Czech
The goal of this study is to investigate whether learners’ written data in highly inflectional Czech can suggest a consistent set of clues for automatic identification of the learners’ L1 background. For our experiments, we use texts written by learners of Czech, which have been automatically and manually annotated for errors. We define two classes of learners: speakers of Indo-European languag...
Automatic Detection of Antisocial Behaviour in Texts
A considerable amount of effort has been made to reduce the physical manifestation of antisocial behaviour (ASB) in communities. However, the key to the early detection of ASB is, in many cases, in observing its manifestations in written language, which has not been studied in detail. In this work, we search for linguistic features that pertain to ASB in order to use those features for the auto...
Graph-Based N-gram Language Identification on Short Texts
Language identification (LI) is an important task in natural language processing. Several machine learning approaches have been proposed for addressing this problem, but most of them assume relatively long and well-written texts. We propose a graph-based N-gram approach for LI called LIGA which targets relatively short and ill-written texts. The results of our experimental study show that LIGA ...
Maximizing Classification Accuracy in Native Language Identification
This paper reports our contribution to the 2013 NLI Shared Task. The purpose of the task was to train a machine-learning system to identify the native-language affiliations of 1,100 texts written in English by nonnative speakers as part of a high-stakes test of general academic English proficiency. We trained our system on the new TOEFL11 corpus, which includes 11,000 essays written by nonnativ...