Using N-gram and Word Network Features for Native Language Identification
نویسندگان
چکیده
We report on the performance of two different feature sets in the Native Language Identification Shared Task (Tetreault et al., 2013). Our feature sets were inspired by existing literature on native language identification and word networks. Experiments show that word networks have competitive performance against the baseline feature set, which is a promising result. We also present a discussion of feature analysis based on information gain, and an overview on the performance of different word network features in the Native Language Identification task.
منابع مشابه
Native Language Identification using large scale lexical features
This paper describes an effort to perform Native Language Identification (NLI) using machine learning on a large amount of lexical features. The features were collected from sequences and collocations of bare word forms, suffixes and character n-grams amounting to a feature set of several hundred thousand features. These features were used to train a linear Support Vector Machine (SVM) classifi...
متن کاملNorwegian Native Language Identification
We present a study of Native Language Identification (NLI) using data from learners of Norwegian, a language not yet used for this task. NLI is the task of predicting a writer’s first language using only their writings in a learned language. We find that three feature types, function words, part-of-speech n-grams and a hybrid part-of-speech/function word mixture n-gram model are useful here. Ou...
متن کاملA study of N-gram and Embedding Representations for Native Language Identification
We report on our experiments with Ngram and embedding based feature representations for Native Language Identification (NLI) as a part of the NLI Shared Task 2017 (team name: NLI-ISU). Our best performing system on the test set for written essays had a macro F1 of 0.8264 and was based on word uni, bi and trigram features. We explored n-grams covering word, character, POS and word-POS mixed repr...
متن کاملFewer features perform well at Native Language Identification task
This paper describes our results at the NLI shared task 2017. We participated in essays, speech, and fusion task that uses text, speech, and i-vectors for the task of identifying the native language of the given input. In the essay track, a linear SVM system using word bigrams and character 7-grams performed the best. In the speech track, an LDA classifier based only on i-vectors performed bett...
متن کاملSimple Yet Powerful Native Language Identification on TOEFL11
Native language identification (NLI) is the task to determine the native language of the author based on an essay written in a second language. NLI is often treated as a classification problem. In this paper, we use the TOEFL11 data set which consists of more data, in terms of the amount of essays and languages, and less biased across prompts, i.e., topics, of essays. We demonstrate that even u...
متن کامل