Native Language Identification with PPM
Abstract
This paper reports on our participation in the NLI Shared Task 2013 on Native Language Identification. The task is to automatically detect the native language of the authors of TOEFL essays, written in English, in a given set of test documents. Our system uses the PPM compression algorithm, which is based on an n-gram statistical model. We submitted four runs: the word-based PPMC algorithm with and without normalization, and the character-based PPMC algorithm with and without normalization. The worst result, obtained on the training and testing data during the evaluation procedure, came from the character-based PPM method with normalization (accuracy = 31.9%); the best was a macro-averaged F-measure of 0.708, from the word-based PPMC algorithm without normalization.
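To make the idea of compression-based classification concrete, the following is a minimal sketch and not the system described in the abstract. It assumes a static, character-based model with PPMC-style escape estimates (escape probability d / (n + d), where d is the number of distinct symbols and n the total count in a context), a maximum order of 2, no exclusions, and a uniform fallback over a 256-symbol alphabet. The class name `PPMCModel`, the function `classify`, and the toy training strings are hypothetical. A text is assigned to the class whose model would encode it in the fewest bits per symbol.

```python
import math
from collections import Counter, defaultdict


class PPMCModel:
    """Simplified static PPMC-style character model (illustrative only)."""

    def __init__(self, max_order=2, alphabet_size=256):
        self.max_order = max_order          # assumed maximum context length
        self.alphabet_size = alphabet_size  # assumed size for the uniform fallback
        # counts[context][symbol] = times symbol followed context in training
        self.counts = defaultdict(Counter)

    def train(self, text):
        for i, sym in enumerate(text):
            for k in range(self.max_order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][sym] += 1

    def symbol_bits(self, history, sym):
        """Code length in bits of sym given the preceding characters."""
        bits = 0.0
        for k in range(min(self.max_order, len(history)), -1, -1):
            ctx = history[len(history) - k:]
            seen = self.counts.get(ctx)
            if not seen:
                continue  # context never observed: shorten it at no cost
            total = sum(seen.values())
            distinct = len(seen)
            if sym in seen:
                return bits - math.log2(seen[sym] / (total + distinct))
            # PPMC escape: probability distinct / (total + distinct)
            bits -= math.log2(distinct / (total + distinct))
        # Symbol unseen at every order: uniform over the assumed alphabet
        return bits + math.log2(self.alphabet_size)

    def cross_entropy(self, text):
        """Average bits per symbol needed to encode text with this model."""
        if not text:
            return 0.0
        total = sum(
            self.symbol_bits(text[max(0, i - self.max_order):i], text[i])
            for i in range(len(text))
        )
        return total / len(text)


def classify(models, text):
    """Return the label whose model gives the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


if __name__ == "__main__":
    # Toy training strings standing in for per-language essay collections.
    models = {"A": PPMCModel(), "B": PPMCModel()}
    models["A"].train("ab" * 20)
    models["B"].train("xyz" * 20)
    print(classify(models, "ababab"))
```

A word-based variant could follow the same scheme with tuples of tokens as contexts instead of substrings. The normalization step mentioned in the abstract is not modeled here.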
Similar resources
A Cryptographic Approach to Language Identification: PPM
The problem of language discrimination may arise when many texts from different source languages are at hand but we are not sure to which language each belongs. This is often the case during information retrieval via the Internet. We propose a cryptographic solution to the language identification problem: Employing the Prediction by Partial Matching (PPM) model, we ...
Token Identification Using HMM and PPM Models
Hidden Markov models (HMMs) and prediction by partial matching (PPM) models have been successfully used in language processing tasks including learning-based token identification. Most of the existing systems are domain- and language-dependent. The retargetability and applicability of these systems are limited. This paper investigates the effect of the combination of HMMs and PPM on token...
Native and Non-Native Teachers’ Changing Beliefs about Teaching English as an International Language
In view of the paucity of evidence on teachers’ conceptions of teaching English as an International Language (EIL), the present study used panel discussions to investigate the beliefs of 10 native and 10 non-native English-speaking teachers about their roles in teaching English in EIL contexts and their perceptions of EIL. The findings revealed that some aspects of teachers’ beliefs about their ...
Recognizing English Learners' Native Language from Their Writings
Native Language Identification (NLI), which tries to identify the native language (L1) of a second language learner based on their writings, is helpful for advancing second language learning and authorship profiling in forensic linguistics. With the availability of relevant data resources, much work has been done to explore the native language of a foreign language learner. In this report, we p...
From Language to Family and Back: Native Language and Language Family Identification from English Text
Revealing an anonymous author’s traits from text is a well-researched area. In this paper we aim to identify the native language and language family of a non-native English author, given his/her English writings. We extract features from the text based on prior work, and extend or modify it to construct different feature sets, and use support vector machines for classification. We show that nat...